
HDFS: Hadoop Distributed File System



  1. HDFS: Hadoop Distributed File System. Outline: Motivation, File Management, Streaming Data, Fault Tolerance.

  2. Labs run 2–27 October (four weeks) at these times: Monday 9am, Monday 10am, Tuesday 2pm, Wednesday 10am, Wednesday 2pm, Thursday 9am, Thursday 11am, Friday 11am, Friday 2pm. Lab groups will be chosen online: student.inf.ed.ac.uk.

  3. Distributed Map-Reduce. [Diagram: a cluster of nine nodes (Node 1 to Node 9) running four mappers (Mapper 1 to Mapper 4).]

  4. Large Data Sets: file sizes going up to petabytes.

  5. How to get Data to Mappers? [Diagram: Nodes 1 to 9 and Mappers 1 to 4, marked with a question mark.]

  6. How to get Data to Mappers? [Same diagram, marked with an exclamation mark.]

  7. Bring Mappers to Data! [Same diagram of Nodes 1 to 9 and Mappers 1 to 4.]

  8. But disk access latency is so high!

  9. But disk access latency is so high! Yes, but throughput is acceptable.
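
The trade-off on slides 8 and 9 can be checked with back-of-envelope arithmetic. The sketch below assumes a 10 ms seek time and a 100 MB/s sequential transfer rate; both figures are illustrative assumptions, not values from the slides. It computes the throughput actually achieved when every read pays one seek.

```python
# Illustrative disk parameters; assumptions for this sketch, not from the slides.
SEEK_TIME_S = 0.010          # cost of one seek (latency), in seconds
TRANSFER_RATE_MB_S = 100.0   # sequential transfer rate, in MB/s


def effective_throughput(read_size_mb: float) -> float:
    """Return the MB/s achieved when each read of `read_size_mb` pays one seek."""
    transfer_time = read_size_mb / TRANSFER_RATE_MB_S
    return read_size_mb / (SEEK_TIME_S + transfer_time)


if __name__ == "__main__":
    # Compare a 4 KB read, a 1 MB read, and a 128 MB read.
    for size_mb in (0.004, 1.0, 128.0):
        rate = effective_throughput(size_mb)
        print(f"{size_mb:>8} MB per read: {rate:6.2f} MB/s")
```

Under these assumptions, small reads are dominated by the seek, while a large sequential read runs at close to the raw transfer rate. That is why high latency matters little when data is streamed in large pieces.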

  10. Distributed File System

  11. Distributed File System: HDFS is a GFS (Google File System) clone.

  12. HDFS Design Choices: (1) Support handling of large files across multiple nodes.

  13. HDFS Design Choices: (1) Support handling of large files across multiple nodes. (2) Optimise for streaming access.

  14. HDFS Design Choices: (1) Support handling of large files across multiple nodes. (2) Optimise for streaming access. (3) Run on commodity hardware (which requires high fault tolerance).

  15. Large Files: 128 MB block size.

  16. Why such large blocks?
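
To make the 128 MB block size from slide 15 concrete, here is a minimal sketch of how a file's byte range could be cut into blocks. The function name `split_into_blocks` is hypothetical and is not part of any Hadoop API; the sketch only illustrates the arithmetic, with the last block allowed to be shorter than the block size.

```python
from typing import List, Tuple

BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the block size given on slide 15


def split_into_blocks(file_size: int, block_size: int = BLOCK_SIZE) -> List[Tuple[int, int]]:
    """Return (offset, length) pairs that cover a file of `file_size` bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The final block only holds whatever bytes remain.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


if __name__ == "__main__":
    mb = 1024 * 1024
    for offset, length in split_into_blocks(300 * mb):
        print(f"offset {offset // mb:4d} MB, length {length // mb:4d} MB")
```

A 300 MB file, for example, becomes two full 128 MB blocks followed by one 44 MB block.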

  17. [Diagram: an HDFS datanode on top of the Linux file system.]

  18. [Diagram: an HDFS datanode on top of the Linux file system.] Demo.

  19. [Diagram: the HDFS namenode holds the file namespace (/foo/bar, block 3df2). It sends instructions to the datanodes and receives datanode state. Each HDFS datanode sits on top of a Linux file system.]

  20. Optimised for Streaming: successive reads, append writes; write once, read many.

  21. [Diagram: the Application calls the HDFS Client. Control flow: the client sends (file name, block id) to the HDFS namenode, which holds the file namespace (/foo/bar, block 3df2), and receives (block id, block location). Data flow: the client sends (block id, byte range) to an HDFS datanode and receives block data. The namenode sends instructions to the datanodes and receives datanode state. Each datanode sits on top of a Linux file system.]

  22. [Same diagram as slide 21.] Why such large blocks?

  23. [Same diagram as slide 21.] Why such large blocks? (1) Less communication between master and workers.

  24. [Same diagram as slide 21.] Why such large blocks? (1) Less communication between master and workers. (2) Reduced communication between client and datanodes.

  25. [Same diagram as slide 21.] Why such large blocks? (1) Less communication between master and workers. (2) Reduced communication between client and datanodes. (3) Less metadata to be saved in the namenode.
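
The control flow and data flow in the diagram can be sketched as a toy in-memory model. Everything here is hypothetical: the classes `NameNode`, `DataNode` and `Client` are not the Hadoop API, the block size is shrunk to 8 bytes so the demo stays readable, and the sketch assumes the client identifies a block by its index within the file when it asks the namenode.

```python
BLOCK_SIZE = 8  # tiny block size for the demo only; slide 15 gives 128 MB


class DataNode:
    """Holds block contents; stands in for blocks kept on a Linux file system."""

    def __init__(self):
        self.blocks = {}  # block id -> bytes

    def read(self, block_id, start, end):
        # Data flow: (block id, byte range) in, block data out.
        return self.blocks[block_id][start:end]


class NameNode:
    """Holds metadata only: the file namespace and the block locations."""

    def __init__(self):
        self.namespace = {}  # file name -> ordered list of block ids
        self.locations = {}  # block id -> list of DataNode objects

    def lookup(self, file_name, block_index):
        # Control flow: (file name, block index) in, (block id, locations) out.
        block_id = self.namespace[file_name][block_index]
        return block_id, self.locations[block_id]


class Client:
    def __init__(self, namenode):
        self.namenode = namenode
        self.lookups = 0  # counts round trips to the namenode

    def read(self, file_name, offset, length):
        """Read `length` bytes at `offset`; the range must lie inside the file."""
        data = b""
        while length > 0:
            block_index, start = divmod(offset, BLOCK_SIZE)
            block_id, nodes = self.namenode.lookup(file_name, block_index)
            self.lookups += 1
            end = min(BLOCK_SIZE, start + length)
            chunk = nodes[0].read(block_id, start, end)
            if not chunk:
                break  # ran off the end of a short final block
            data += chunk
            offset += len(chunk)
            length -= len(chunk)
        return data


def build_demo():
    """Store one 23-byte file as three blocks spread over two datanodes."""
    datanodes = (DataNode(), DataNode())
    namenode = NameNode()
    content = b"hello distributed world"
    block_ids = []
    for i in range(0, len(content), BLOCK_SIZE):
        index = i // BLOCK_SIZE
        block_id = f"blk_{index}"
        node = datanodes[index % 2]
        node.blocks[block_id] = content[i:i + BLOCK_SIZE]
        namenode.locations[block_id] = [node]
        block_ids.append(block_id)
    namenode.namespace["/foo/bar"] = block_ids
    return Client(namenode)


if __name__ == "__main__":
    client = build_demo()
    print(client.read("/foo/bar", 6, 11), "namenode lookups:", client.lookups)
```

In this toy model the namenode never touches file contents, and the number of namenode lookups and metadata entries grows with the number of blocks a read or a file spans. Larger blocks reduce both, which matches the three reasons listed on slide 25.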

  26. Which block location is best for the client? [Diagram: Application, HDFS Client, block data.]

  27. Which block location is best for the client? [Diagram: Application, HDFS Client, block data.] The closest one!

  28. The network is represented as a tree. The distance between two nodes is the sum of their distances to their closest common ancestor.
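
The distance rule on slide 28 can be sketched in a few lines. The sketch assumes each node is identified by its path from the root of the network tree, such as `/dc1/rack1/node1`; this path format and all names in the example are illustrative assumptions, not taken from the slides.

```python
from typing import List


def distance(a: str, b: str) -> int:
    """Sum of the hops from each node up to their closest common ancestor."""
    path_a = a.strip("/").split("/")
    path_b = b.strip("/").split("/")
    # Count how many leading path components the two nodes share.
    common = 0
    for x, y in zip(path_a, path_b):
        if x != y:
            break
        common += 1
    return (len(path_a) - common) + (len(path_b) - common)


def closest_replica(client: str, replicas: List[str]) -> str:
    """Pick the block location with the smallest tree distance to the client."""
    return min(replicas, key=lambda node: distance(client, node))


if __name__ == "__main__":
    client = "/dc1/rack1/node1"
    print(distance(client, "/dc1/rack1/node1"))  # same node
    print(distance(client, "/dc1/rack1/node2"))  # same rack
    print(distance(client, "/dc1/rack2/node5"))  # same data centre, other rack
    print(distance(client, "/dc2/rack7/node9"))  # other data centre
```

With a three-level tree like this one, the distances come out as 0, 2, 4 and 6, so a client prefers a replica on its own node, then on its own rack, and so on.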

  29. Fault Tolerance: faults are the norm, not the exception.

  30. Hadoop keeps three copies (replicas) of each block by default.

  31. How to spread the replicas across the cluster?

  32. How to spread the replicas across the cluster? Demo.
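
The slides leave the answer to the demo. As one possible illustration, the sketch below follows the default placement described in the Hadoop documentation for three replicas: the first on the writer's node, the second on a node in a different rack, and the third on another node in that same remote rack. The function `place_replicas` and the topology format are hypothetical, and the sketch assumes at least one remote rack has two or more nodes.

```python
import random


def place_replicas(writer, topology, rng=random):
    """Choose three (rack, node) locations for one block.

    writer:   (rack, node) of the datanode that writes the block
    topology: dict mapping rack name -> list of node names
    """
    writer_rack, _ = writer
    # Replica 1 stays on the writer's own node.
    first = writer
    # Replica 2 goes to a different rack, to survive the loss of a whole rack.
    remote_racks = [
        rack for rack, nodes in topology.items()
        if rack != writer_rack and len(nodes) >= 2
    ]
    remote_rack = rng.choice(remote_racks)
    # Replica 3 goes to another node in that same remote rack.
    second_node, third_node = rng.sample(topology[remote_rack], 2)
    return [first, (remote_rack, second_node), (remote_rack, third_node)]


if __name__ == "__main__":
    topology = {
        "rack1": ["node1", "node2", "node3"],
        "rack2": ["node4", "node5", "node6"],
        "rack3": ["node7", "node8", "node9"],
    }
    print(place_replicas(("rack1", "node2"), topology))
```

The design balances two goals: replicas on two racks tolerate a rack failure, while keeping two of the three copies on one rack limits traffic between racks during a write.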

  33. Anatomy of a Write

  34. Summary: (1) HDFS handles large files across the cluster. (2) HDFS is optimised for streaming access to files. (3) HDFS runs on commodity hardware and needs to be fault tolerant.
