Pufferbench: Evaluating and Optimizing Malleability of Distributed Storage
Nathanaël Cheriere, Matthieu Dorier, Gabriel Antoniu
PDSW-DISCS 2018, Dallas
Data is everywhere
• High variety of applications
• High variety of needs
Resource requirements vary in time
• Day/night cycles
• Weekly cycles
• Workflows
Dynamically adjust the amount of resources?
Why?
• Satisfy resource requirements (peaks and lows)
• Avoid idle nodes
• Easy scalability
✓ Save money
✓ Save energy
✓ Computing resources malleability
Problem: what about task/data colocation?
• Local data access
? Storage system malleability
Two operations: Commission and Decommission
Constraints:
• No data losses
• Maintain fault tolerance
• Balance data
Problem:
• Long data transfers
What is the duration of storage rescaling on a given platform?
• Previous works: lower bounds
  • Useful but unrealistic: many simplifications
• Need a tool to measure it on real hardware
How Fast Can One Scale Down a Distributed File System? N. Cheriere, G. Antoniu. IEEE BigData 2017.
A Lower Bound for the Commission Times in Replication-Based Distributed Storage Systems. N. Cheriere, M. Dorier, G. Antoniu. Research report, submitted to JPDC, 2018.
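For intuition, a much-simplified version of such a bound (not the exact formulas from the papers above) can be computed as follows: when x nodes leave a cluster of N, the N−x remaining nodes must together ingest the x·D bytes hosted by the leaving nodes, so the transfer time is at least x·D divided by their aggregate ingest throughput.

```cpp
#include <algorithm>
#include <cstdio>

// Much-simplified decommission bound, for intuition only: the cited papers
// derive tighter bounds that also account for replication and for data
// already stored on the remaining nodes.
//
// n_nodes       : cluster size before rescaling
// n_leaving     : number of nodes being decommissioned
// data_per_node : bytes stored on each node (assumed perfectly balanced)
// net_bw, sto_bw: per-node network and storage write throughput (bytes/s)
double decommission_bound(int n_nodes, int n_leaving, double data_per_node,
                          double net_bw, double sto_bw) {
    double bytes_to_move = n_leaving * data_per_node;   // data on the leaving nodes
    int receivers = n_nodes - n_leaving;                // nodes absorbing that data
    double per_node_ingest = std::min(net_bw, sto_bw);  // slowest resource wins
    return bytes_to_move / (receivers * per_node_ingest);
}

int main() {
    // Example: 4 nodes leaving a 20-node cluster, 50 GB per node,
    // 10 Gbps network, 200 MB/s storage.
    std::printf("%.1f s\n", decommission_bound(20, 4, 50e9, 10e9 / 8, 200e6));
}
```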
A benchmark: Pufferbench
Goals:
• Measure the duration of rescaling on a given platform
• Serve as a quick prototyping testbed for rescaling mechanisms
How:
• Perform all the I/Os that a rescaling operation needs
Main steps:
1. Migration Planning
2. Data Generation
3. Execution
4. Statistics Aggregation
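A minimal sketch of how these four phases could be chained; the names, signatures and stub bodies below are assumptions for illustration, not Pufferbench's actual code:

```cpp
#include <chrono>
#include <cstdio>
#include <vector>

// Hypothetical stand-ins for Pufferbench's components.
struct Transfer { int src, dst; long long bytes; };

// 1. Migration planning: decide which data moves where.
std::vector<Transfer> plan_migration(int nodes, int leaving) {
    std::vector<Transfer> plan;
    for (int src = nodes - leaving; src < nodes; ++src)
        plan.push_back({src, src % (nodes - leaving), 50LL * 1000 * 1000 * 1000});
    return plan;
}
// 2. Data generation: fill the storage back-end with synthetic data (stub).
void generate_data(long long /*bytes_per_node*/) {}
// 3. Execution: perform the planned transfers through storage and network (stub).
void execute(const std::vector<Transfer>& /*plan*/) {}

int main() {
    auto plan = plan_migration(20, 4);
    generate_data(50LL * 1000 * 1000 * 1000);
    auto t0 = std::chrono::steady_clock::now();
    execute(plan);                                        // only this phase is timed
    std::chrono::duration<double> dt = std::chrono::steady_clock::now() - t0;
    // 4. Statistics aggregation: here, just report the measured duration.
    std::printf("rescaling took %.3f s (%zu transfers)\n", dt.count(), plan.size());
}
```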
Software Architecture
• MetadataGenerator: generates information about the files on the storage (number, size)
• DataDistributionGenerator: assigns files to storage nodes
• DataTransferScheduler: computes the data transfers needed for the rescaling
• IODispatcher: assigns transfer instructions to the storage and the network
• Storage: interfaces with the storage devices
• Network: exchanges data between nodes
• DataDistributionValidator: computes statistics about data placement (load, replication)
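These components are pluggable, which is what makes Pufferbench usable as a prototyping testbed. A minimal sketch of what such pluggable interfaces could look like (class names follow the component list above, but the signatures are assumptions, not Pufferbench's actual API):

```cpp
#include <cstdint>
#include <vector>

// Hypothetical interfaces mirroring three of the components above; names and
// signatures are illustrative assumptions, not Pufferbench's actual API.
struct FileInfo  { std::uint64_t id; std::uint64_t size; };
struct Placement { std::uint64_t file_id; int node; };
struct Transfer  { std::uint64_t file_id; int src, dst; };

struct MetadataGenerator {
    // Describe the files (number, size) that the benchmark will emulate.
    virtual std::vector<FileInfo> generate(std::uint64_t total_bytes) = 0;
    virtual ~MetadataGenerator() = default;
};

struct DataTransferScheduler {
    // Compute the transfers that turn the initial placement into a valid
    // placement on the post-rescaling set of nodes.
    virtual std::vector<Transfer> schedule(const std::vector<Placement>& initial,
                                           int nodes_before, int nodes_after) = 0;
    virtual ~DataTransferScheduler() = default;
};

struct Storage {
    // Back-end performing the actual I/O (e.g. in-memory or on-drive).
    virtual void write(std::uint64_t file_id, const char* buf, std::uint64_t len) = 0;
    virtual std::uint64_t read(std::uint64_t file_id, char* buf, std::uint64_t len) = 0;
    virtual ~Storage() = default;
};
```

Prototyping a new rescaling mechanism then amounts to swapping in an alternative implementation of one of these modules and re-running the benchmark.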
Validation: comparison to the lower bounds
Hardware:
• Up to 40 nodes
• 16 cores, 2.4 GHz
• 128 GB RAM
• 558 GB disk
• 10 Gbps Ethernet
Matching hypotheses:
• Load balancing (50 GB per node)
• Uniform data distribution
• Data replication
Differences:
• Hardware is not identical
• Storage has latency
• Network has latency and interferences
Pufferbench is close to lower bounds!
[Figures: decommission times (1 to 7 nodes decommissioned out of a cluster of 20) and commission times (5 to 30 nodes commissioned to a cluster of 10), Pufferbench vs. theoretical minimum, for in-memory and on-drive storage]
• Within 16% of the lower bounds
• Lower bounds are realistic
Use case: HDFS
Question: how fast can rescaling in HDFS be?
• No modifications of HDFS
With Pufferbench:
• Reproduce the initial conditions
• Aim for the same final data placement
Pufferbench matching HDFS's rescaling:
• Load balanced
• Mostly random data placement, modeled with random placement
• Replicated 3 times
• Chunks of 128 MiB
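As an illustration, these HDFS-like parameters could be captured in a small scenario description such as the one below; the structure and field names are hypothetical, not Pufferbench's actual configuration format:

```cpp
#include <cstdint>

// Hypothetical scenario description reproducing the HDFS-like initial
// conditions listed above; structure and field names are assumptions.
struct RescalingScenario {
    int           nodes_before     = 20;                    // initial cluster size (example)
    int           nodes_after      = 16;                    // e.g. decommission 4 nodes
    std::uint64_t chunk_size       = 128ULL * 1024 * 1024;  // 128 MiB chunks, as in HDFS
    int           replication      = 3;                     // each chunk replicated 3 times
    bool          random_placement = true;                  // HDFS placement is mostly random
    bool          load_balanced    = true;                  // same amount of data per node
};
```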
HDFS needs better disk I/Os
[Figures: decommission times measured on HDFS vs. Pufferbench vs. theoretical minimum, for 1 to 7 nodes decommissioned out of a cluster of 20, with in-memory and on-drive storage; a roughly 3x gap is highlighted for on-drive storage]
Improvement possible on disk access patterns
HDFS is far from optimal performance!
[Figures: commission times measured on HDFS vs. Pufferbench vs. theoretical minimum, for 5 to 30 nodes commissioned to a cluster of 10, with in-memory and on-drive storage; a roughly 14x gap is highlighted]
Improvement possible on algorithms, disk access patterns, pipelining
Setup duration
Setup overhead for the in-memory commission:
• HDFS: 26 h
• Pufferbench: 53 min
Good for prototyping:
• Fast evaluation
• Light setup
To conclude
Pufferbench:
• Evaluate the viability of storage malleability on a given platform
• Quickly prototype and evaluate rescaling mechanisms
Available at https://gitlab.inria.fr/Puffertools/Pufferbench
Can be installed with Spack
Thank you! Questions?
nathanael.cheriere@irisa.fr