The Hadoop Distributed File System


  1. The Hadoop Distributed File System Konstantin Shvachko, Hairong Kuang, Sanjay Radia, Robert Chansler Yahoo! Sunnyvale, California USA Presented by Haoran Ma, Yifan Qiao

  2. Outline • Introduction • Architecture • File I/O Operations and Replica Management • Practice at Yahoo! • Future Work

  3. Introduction • A single dataset can be too large for one machine —> divide it and store the pieces on a cluster of commodity hardware. - What if one of the physical machines fails? • Applications such as MapReduce need high-throughput data access.

  4. Introduction • HDFS is the file system component of Hadoop. It is designed to store very large data sets reliably, and to stream those data sets at high bandwidth to user applications. • These goals are achieved by replicating file contents on multiple machines (DataNodes).

  5. Introduction • Very Large Distributed File System • Assumes Commodity Hardware - Files are replicated to handle hardware failure • Optimized for Batch Processing - Data locations exposed so that computations can move to where data resides

  6. Introduction [Figure: a file divided into blocks, each usually 128 MB] Source: HDFS Tutorial – A Complete Hadoop HDFS Overview. DataFlair Team.
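
As a non-authoritative illustration of the block abstraction, here is a minimal sketch using the standard org.apache.hadoop.fs.FileSystem client API. The cluster address hdfs://namenode:8020 and the file path are placeholders; the 128 MB value mirrors the usual default block size mentioned above.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class BlockSizeExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // "dfs.blocksize" controls the HDFS block size; 128 MB is the usual default.
        conf.setLong("dfs.blocksize", 128L * 1024 * 1024);

        // "hdfs://namenode:8020" is a placeholder for a real cluster address.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

        // Files larger than one block are transparently split into multiple blocks,
        // each replicated across DataNodes.
        try (FSDataOutputStream out = fs.create(new Path("/tmp/example.bin"))) {
            out.write(new byte[4096]); // the client just writes bytes; HDFS handles the blocking
        }
    }
}
```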

  7. Architecture [Figure: HDFS architecture overview] Source: Hadoop HDFS Architecture Explanation and Assumptions. DataFlair Team.

  8. Architecture • The NameNode stores metadata, such as the number of data blocks, replicas, and other details, in memory • It maintains and manages the DataNodes, and assigns tasks to them

  9. Architecture [Figure: DataNodes store application data] Source: Hadoop HDFS Architecture Explanation and Assumptions. DataFlair Team.

  10. Architecture HDFS Client: a code library that exports the HDFS file system interface
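
A minimal sketch of what "exports the HDFS file system interface" means in practice, using the stock Java client library; the cluster address hdfs://namenode:8020 and the paths are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class HdfsClientExample {
    public static void main(String[] args) throws Exception {
        // The client library hides NameNode/DataNode interactions behind a
        // familiar file-system interface.
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        fs.mkdirs(new Path("/user/demo"));                       // namespace operation -> NameNode
        try (FSDataOutputStream out = fs.create(new Path("/user/demo/a.txt"))) {
            out.writeUTF("hello hdfs");                          // file data flows to DataNodes
        }
        for (FileStatus st : fs.listStatus(new Path("/user/demo"))) {
            System.out.println(st.getPath() + " " + st.getLen() + " bytes");
        }
    }
}
```

Namespace operations (mkdirs, listStatus) talk to the NameNode, while the bytes written through create() flow to DataNodes; that division of labor is what the following slides describe.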

  11. Architecture • How does this architecture achieve high fault-tolerance? • DataNode failure • NameNode failure

  12. Architecture: Failure Recovery for DataNodes Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

  13. Architecture: Failure Recovery for DataNodes • DataNodes send heartbeats and block reports to the NameNode Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.
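
The heartbeat/block-report exchange can be pictured roughly as the loop below. Every class and method name here (NameNodeRpc, DataNodeDaemonSketch, LocalBlockStore) is invented for illustration and does not correspond to Hadoop's real internal interfaces; the intervals are typical defaults, not exact values.

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

// Illustrative only: these types stand in for HDFS's internal RPC interfaces.
interface NameNodeRpc {
    void heartbeat(String dataNodeId, long capacity, long used);      // "I am alive"
    void blockReport(String dataNodeId, List<Long> storedBlockIds);   // "these are my blocks"
}

interface LocalBlockStore {                                           // hypothetical local storage
    long capacity(); long used(); List<Long> blockIds();
}

class DataNodeDaemonSketch {
    private final NameNodeRpc nameNode;
    private final String id;
    private final LocalBlockStore blocks;

    DataNodeDaemonSketch(NameNodeRpc nn, String id, LocalBlockStore blocks) {
        this.nameNode = nn; this.id = id; this.blocks = blocks;
    }

    void run() throws InterruptedException {
        long lastReport = 0;
        while (true) {
            // Frequent heartbeats let the NameNode detect dead DataNodes quickly.
            nameNode.heartbeat(id, blocks.capacity(), blocks.used());

            // Much less frequent full block reports let the NameNode rebuild the
            // block-to-DataNode map, which it never persists itself.
            if (System.currentTimeMillis() - lastReport > TimeUnit.HOURS.toMillis(6)) {
                nameNode.blockReport(id, blocks.blockIds());
                lastReport = System.currentTimeMillis();
            }
            TimeUnit.SECONDS.sleep(3); // heartbeats every few seconds
        }
    }
}
```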

  14. Architecture: Failure Recovery for DataNodes Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

  15. Architecture: Failure Recovery for DataNodes What if NameNode fails?

  16. Architecture: Failure Recovery for NameNode Image = Checkpoint + Journal • Image: The file system metadata that describes the organization of application data as directories and files. • Checkpoint: A persistent record of the image written to disk. • Journal: The modification log of the image. It is also stored in the local host’s native file system.
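
The rule "image = checkpoint + journal" amounts to loading the last checkpoint and replaying the journal on top of it, roughly as sketched below; the types (FsImageSketch, JournalEntry) are invented for illustration and are not the actual Hadoop classes.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustration of "image = checkpoint + journal"; not real Hadoop code.
class FsImageSketch {
    // In-memory namespace: path -> list of block ids (grossly simplified).
    private final Map<String, List<Long>> namespace = new HashMap<>();

    // 1. Load the last persisted checkpoint from disk.
    static FsImageSketch loadCheckpoint(Map<String, List<Long>> checkpoint) {
        FsImageSketch img = new FsImageSketch();
        img.namespace.putAll(checkpoint);
        return img;
    }

    // 2. Replay every journal entry recorded since that checkpoint.
    void replay(List<JournalEntry> journal) {
        for (JournalEntry e : journal) {
            switch (e.op) {
                case CREATE -> namespace.put(e.path, e.blocks);
                case DELETE -> namespace.remove(e.path);
                case RENAME -> namespace.put(e.newPath, namespace.remove(e.path));
            }
        }
        // The result is the up-to-date image the NameNode held before it failed.
    }

    enum Op { CREATE, DELETE, RENAME }
    record JournalEntry(Op op, String path, String newPath, List<Long> blocks) {}
}
```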

  17. Architecture: Failure Recovery for NameNode • CheckpointNode: • Periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal. • BackupNode: • A read-only NameNode. • Maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode.

  18. Architecture: Failure Recovery for NameNode • Snapshots • To minimize potential damage to the data stored in the system during upgrades. • Persistently save the current state of the file system (both data and metadata).

  19. Architecture: Failure Recovery for NameNode • Snapshots (Copy on Write) • To minimize potential damage to the data stored in the system during upgrades. • Persistently save the current state of the file system (both data and metadata).

  20. Architecture: Failure Recovery for NameNode [Diagram: the NameNode and BackupNode each hold the image in memory and stay synchronized; the CheckpointNode combines the on-disk checkpoint and journal and returns a new checkpoint plus an empty journal; a DataNode snapshot on disk consists only of hard links.]
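
The "only hard links" note refers to how a snapshot of the block files can be taken cheaply: instead of copying block data, the snapshot directory hard-links the existing files, and data is duplicated only when a block is later modified (copy on write). A minimal illustration of the hard-link step with plain java.nio.file, using placeholder directory paths:

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class HardLinkSnapshotSketch {
    // Illustration only: real DataNodes do something like this internally during an upgrade snapshot.
    static void snapshot(Path currentDir, Path snapshotDir) throws IOException {
        Files.createDirectories(snapshotDir);
        try (DirectoryStream<Path> blocks = Files.newDirectoryStream(currentDir)) {
            for (Path block : blocks) {
                // A hard link shares the same on-disk data, so the snapshot is
                // nearly free in both time and space.
                Files.createLink(snapshotDir.resolve(block.getFileName()), block);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // "/data/dn/current" and "/data/dn/snapshot" are placeholder paths.
        snapshot(Path.of("/data/dn/current"), Path.of("/data/dn/snapshot"));
    }
}
```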

  21. Usage & Management of HDFS Cluster • Basic File I/O operations • Rack Awareness • Replication Management

  22. File I/O Operations • Write files to HDFS: single writer, multiple readers [Figure: (1) addBlock, (2) unique block IDs, (3) write to block] Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

  23. File I/O Operations • Write files to HDFS: single writer, multiple readers 1. The client consults the NameNode (addBlock) to get a lease and destination DataNodes 2. The client writes a block to the DataNodes in a pipeline 3. The DataNodes replicate the block 4. The client writes a new block after finishing the previous one • The visibility of the modification is not guaranteed! Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.
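
Because visibility of in-flight writes is not guaranteed, the client API offers explicit flush calls. A minimal sketch with the standard FSDataOutputStream API; the cluster address and path are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;
import java.nio.charset.StandardCharsets;

public class WriteVisibilityExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        try (FSDataOutputStream out = fs.create(new Path("/tmp/log.txt"))) {
            out.write("record 1\n".getBytes(StandardCharsets.UTF_8));
            // Data may still sit in client/pipeline buffers: other readers are
            // not guaranteed to see it yet.

            out.hflush();   // push buffered data through the DataNode pipeline;
                            // after this, new readers can see "record 1"
            out.write("record 2\n".getBytes(StandardCharsets.UTF_8));
        }   // close() completes the last block and makes everything visible
    }
}
```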

  24. File I/O Operations • Read files in HDFS 1. The client consults the NameNode to get the list of blocks and their replicas' locations 2. It tries the nearest replica first, then the second nearest, and so on • Identifying corrupted data: CRC32 checksums
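
A minimal read sketch with the standard client API (cluster address and path are placeholders). Checksum verification happens transparently inside the stream, so application code does not handle CRC32 directly.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class ReadExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());

        // open() asks the NameNode for the block list; reads go to the nearest
        // replica of each block, falling back to other replicas on failure.
        try (FSDataInputStream in = fs.open(new Path("/tmp/log.txt"))) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = in.read(buf)) > 0) {
                System.out.write(buf, 0, n);
            }
            // Checksums (CRC32-based) are verified during the read; a mismatch
            // makes the client try another replica and report the corrupt one
            // to the NameNode.
        }
    }
}
```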

  25. In-cluster Client Reads a File Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

  26. Outside Client Reads a File Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

  27. Rack Awareness Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.

  28. Rack Awareness • Benefits: • higher throughput • higher reliability: • an entire rack failure never loses all replicas of a block • better network bandwidth utilization: • reduce inter-rack and inter-node write traffic as much as possible

  29. Rack Awareness • The default HDFS replica placement policy: 1. No DataNode contains more than one replica of any block 2. No rack contains more than two replicas of the same block, provided there are sufficient racks on the cluster
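
A rough sketch of how these two constraints could be checked for a proposed set of replica locations. The DataNodeInfo record and satisfiesPolicy helper are invented for illustration; they are not the real HDFS block-placement code.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustration of the default placement constraints; not the actual HDFS policy class.
public class PlacementCheckSketch {
    record DataNodeInfo(String host, String rack) {}

    static boolean satisfiesPolicy(List<DataNodeInfo> chosen, int totalRacks) {
        Set<String> hosts = new HashSet<>();
        Map<String, Integer> perRack = new HashMap<>();
        for (DataNodeInfo dn : chosen) {
            // Rule 1: no DataNode holds more than one replica of the block.
            if (!hosts.add(dn.host())) return false;
            perRack.merge(dn.rack(), 1, Integer::sum);
        }
        // Rule 2: no rack holds more than two replicas, provided the cluster
        // has enough racks to spread them out.
        if (totalRacks > 1) {
            for (int count : perRack.values()) {
                if (count > 2) return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // A common 3-replica layout: one copy on the writer's rack, two on a remote rack.
        List<DataNodeInfo> replicas = List.of(
                new DataNodeInfo("dn1", "/rack1"),
                new DataNodeInfo("dn7", "/rack2"),
                new DataNodeInfo("dn9", "/rack2"));
        System.out.println(satisfiesPolicy(replicas, 2)); // true
    }
}
```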

  30. Replication Management • to keep blocks from becoming under- or over-replicated Source: Understanding Hadoop Clusters and the Network. Brad Hedlund.
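
The per-file replication target can be inspected and changed through the standard client API, which is one way under- or over-replication is triggered deliberately; the cluster address and path below are placeholders.

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import java.net.URI;

public class ReplicationExample {
    public static void main(String[] args) throws Exception {
        FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), new Configuration());
        Path file = new Path("/tmp/example.bin");

        // Inspect the target replication factor recorded for the file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("target replication: " + status.getReplication());

        // Raise the target; the NameNode then schedules DataNodes to copy the
        // blocks until the actual replica count matches (and conversely removes
        // excess replicas when the target is lowered).
        fs.setReplication(file, (short) 4);
    }
}
```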

  31. Practice at Yahoo! Cluster Basic Information • Clusters at Yahoo! can be as large as ~3500 nodes with a typical configuration of: • 2 quad-core Xeon processors @ 2.5 GHz • 4 directly attached SATA drives (1 TB each, 4 TB total) • 16 GB RAM • 1-Gbit Ethernet • Total 9.8 PB of storage available, 3.3 PB available for user applications when replicating blocks 3 times

  32. Practice at Yahoo! Data Durability • Uncorrelated node failures: • Chance of a node failing during a month: ~0.8% (a naive estimate of the failure probability for a node during a year is ~9.2%) • Chance of losing a block during a year: < 0.5% • Correlated node failures: • HDFS tolerates a rack switch failure • But a core switch failure or cluster-wide power loss can lose some blocks
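  (The yearly estimate follows from compounding the monthly failure probability: 1 − (1 − 0.008)^12 ≈ 9.2%.)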

  33. Practice at Yahoo! • Benchmarks
  Table 1: Contrived benchmark compared with typical HDD performance
    Scenario                  | Read (MB/s per node)   | Write (MB/s per node)
    DFSIO                     | 66                     | 40
    7200 RPM Desktop HDD [6]  | < 130 (typical 50-120) | < 130 (typical 50-120)
  Table 2: HDFS performance in a production cluster
    Scenario                  | Read (MB/s per node)   | Write (MB/s per node)
    Busy Cluster              | 1.02                   | 1.09

  34. Practice at Yahoo! • Benchmarks
  Table 3: Sort benchmark
    Bytes (TB) | Nodes | Maps  | Reduces | Time (s) | HDFS I/O bytes/s per node (MB) | Aggregate (GB)
    1          | 1460  | 8000  | 2700    | 62       | 32                             | 22.1
    1000       | 3658  | 80000 | 20000   | 58500    | 34.2                           | 9.35
  • 1000 TB is too large to fit in node memory: intermediate results spill to disk and occupy disk bandwidth

  35. Practice at Yahoo! • Benchmarks
  Table 4: NameNode throughput benchmark
    Operation                | Throughput (ops/s)
    Open file for read       | 126 100
    Create file              | 5 600
    Rename file              | 8 300
    Delete file              | 20 700
    DataNode heartbeat       | 300 000
    Blocks report (blocks/s) | 639 700
  • Operations that modify the namespace (create, rename, delete) can be the bottleneck at large scale

  36. Summary: HDFS: Two Easy Pieces* • Reliability • Throughput *: The title is from two great books: Six Easy Pieces: Essentials of Physics Explained by Its Most Brilliant Teacher, by Richard P. Feynman, and Operating Systems: Three Easy Pieces, by Remzi H. Arpaci-Dusseau and Andrea C. Arpaci-Dusseau

  37. Summary: HDFS: Reliability • System Design: • Split files into blocks and replicate them (typically 3 copies) • For the NameNode: • Checkpoint + Journal can restore the latest image • BackupNode • Snapshots • The NameNode is the single point of failure of the whole system - NOT GOOD! • For DataNodes: • Rack Awareness + replica placement policy: never lose all replicas of a block if a rack fails • Replica Management, to keep blocks from being under-replicated • Snapshots

  38. Summary: HDFS: Throughput • System Design • Split files into large blocks (128 MB) - good for streaming and parallel access • Provide APIs that expose the locations of blocks - so applications can schedule computation tasks where the data reside • NameNode - not good for high throughput and scalability • A single node handles all client requests and manages all DataNodes • DataNodes • Rack awareness & replica placement policy - better utilization of network bandwidth • Write files in a pipeline fashion • Read files from the nearest DataNode first

  39. Future Work (Out of Date!) • Automated failover solution • ZooKeeper • Scalability of the NameNode • Multiple namespaces sharing the physical storage • Advantages: • isolate namespaces • improve the overall availability • generalize the block storage abstraction • Drawbacks: • management cost

  40. Thank you.
