
7 Things To Know When Buying Shoes for an Elephant!



  1. 7 Things To Know When Buying Shoes for an Elephant! Alekh Jindal, Jorge Quiané, Jens Dittrich

  2. 1 What Shoes? Why Shoes?

  3. Analyzing MR Jobs (HadoopToSQL, Manimal); Generating MR Jobs (PigLatin, Hive); Executing MR Jobs (Hadoop++, epiC); Data Layouts & Access Paths!!

  4. 2 Why Elephant Needs Different Shoes?

  5. Very Large Scale Storage & Execution (DBMS vs. MapReduce)

  6. Large Data Block Sizes: DBMS 8 KB, MapReduce 1 GB

  7. Block Level Data Replication (DBMS vs. MapReduce) [Figure: a ten-record example table (001 alex bsc, 002 tim msc, 003 mat bsc, 004 joel bsc, 005 phil msc, 006 ron msc, 007 neo bsc, 008 jack msc, 009 jens bsc, 010 tom msc) shown together with its replicated copies]

  8. 3 What’s Wrong with Old Shoes?

  9. Current Data Layouts in Hadoop: Row (default), Column*, PAX**. Example records: 001 alex bsc, 002 tim msc, 003 mat bsc, 004 joel bsc, 005 phil msc, 006 ron msc, 007 neo bsc, 008 jack msc, 009 jens bsc, 010 tom msc. * A. Floratou et al. Column-Oriented Storage Techniques for MapReduce. PVLDB, April 2011. ** Y. He et al. RCFile: A fast and space-efficient data placement structure in MapReduce-based warehouse systems. ICDE, 2011.
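The three layouts named on this slide can be illustrated with a small sketch. It uses the first four example records from the slide; the function names and the `rows_per_block` parameter are assumptions made for illustration, and in-memory lists stand in for the on-disk formats that Hadoop actually uses.

```python
# First four example records from the slide: (id, name, degree).
records = [
    ("001", "alex", "bsc"),
    ("002", "tim", "msc"),
    ("003", "mat", "bsc"),
    ("004", "joel", "bsc"),
]

def row_layout(rows):
    """Row layout: each record is stored contiguously, one after another."""
    return [value for row in rows for value in row]

def column_layout(rows):
    """Column layout: each attribute is stored contiguously for all records."""
    return [list(column) for column in zip(*rows)]

def pax_layout(rows, rows_per_block):
    """PAX layout: split the records into blocks, then store each block
    column by column, so all attributes of a record stay in one block."""
    blocks = []
    for start in range(0, len(rows), rows_per_block):
        block = rows[start:start + rows_per_block]
        blocks.append([list(column) for column in zip(*block)])
    return blocks

if __name__ == "__main__":
    print(row_layout(records))
    print(column_layout(records))
    print(pax_layout(records, rows_per_block=2))
```

Reading only the name attribute touches every value in the row layout, one list in the column layout, and one list per block in the PAX layout.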

  10. Current Data Layouts in Hadoop: Row, Column, and PAX compared on Non-required Reads, Network Costs, Data Block Placement, and Tuple Reconstruction

  11. Current Data Layouts in Hadoop [Plot: Data Access Cost [sec] vs. Number of Referenced Attributes (Out of 30); series: Row Layout, Column Layout, PAX Layout, Trojan Layout, Optimal Layout]

  12. 4 What Shoes do We Propose?

  13. Trojan Data Layouts [Figure: Replica 1, Replica 2, Replica 3]

  14. Trojan Data Layouts: Row, Column, PAX, and Trojan compared on Non-required Reads, Network Costs, Data Block Placement, and Tuple Reconstruction

  15. Challenges in Trojan Data Layouts: How do we design a shoe for one leg? How do we design shoes for all legs? How do we make the shoes from the design?

  16. 5 How Do We Design the Shoes?

  17. Single Replica: Columns -> Column groups -> Filter (Novel Column Group Interestingness) -> Interesting column groups -> Pack (Column Group Packing as 0-1 Knapsack) -> Complete & disjoint column groups
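The pack step on this slide selects a complete and disjoint set of column groups from the interesting candidates. The sketch below is not the authors' algorithm: it is a brute-force search over the 0-1 choice of each candidate group, and the interestingness scores, the column names, and the function name `pack_column_groups` are all hypothetical. Singleton groups with score 0 are added so that a complete cover always exists.

```python
from itertools import combinations

def pack_column_groups(columns, scored_groups):
    """Choose disjoint column groups that cover every column exactly once
    and maximize the summed (hypothetical) interestingness score.

    Exhaustive search: exponential in the number of candidates, so it is
    only suitable for tiny illustrative inputs.
    """
    candidates = dict(scored_groups)
    for column in columns:
        # Fallback singleton groups guarantee that a complete cover exists.
        candidates.setdefault(frozenset([column]), 0.0)
    groups = list(candidates)

    best_score, best_choice = float("-inf"), None
    for size in range(1, len(groups) + 1):
        for choice in combinations(groups, size):
            covered = [c for group in choice for c in group]
            # Equal length plus equal sets means: no overlap, nothing missing.
            if len(covered) != len(columns) or set(covered) != set(columns):
                continue
            score = sum(candidates[group] for group in choice)
            if score > best_score:
                best_score, best_choice = score, choice
    return sorted(sorted(group) for group in best_choice), best_score

if __name__ == "__main__":
    columns = ["custkey", "name", "address", "phone"]
    scored = {
        frozenset({"custkey", "name"}): 0.8,   # assumed score
        frozenset({"name", "address"}): 0.6,   # assumed score
        frozenset({"address", "phone"}): 0.5,  # assumed score
    }
    print(pack_column_groups(columns, scored))
```

In this toy input the groups {custkey, name} and {name, address} overlap on name, so only one of them can be packed; the search keeps the combination with the higher total score.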

  18. Multiple Replicas: Queries -> Query groups -> Filter -> Interesting query groups -> Pack -> Complete & disjoint query groups

  19. Multiple Replicas: Filter and Pack assign query groups to Replica 1, Replica 2, and Replica 3; then, for each replica: Columns -> Column groups -> Filter -> Interesting column groups -> Pack -> Complete & disjoint column groups

  20. Multiple Replicas, TPC-H Customer example: queries Q1, Q2, Q3, Q4, Q5, Q6, Q7, Q8 -> Filter -> Pack -> query groups {Q2, Q3, Q4} (Replica 1), {Q1, Q6, Q7, Q8} (Replica 2), {Q5} (Replica 3) [Figure: for each replica, Columns -> Column groups -> Filter -> Interesting column groups -> Pack -> Complete & disjoint column groups over the attributes Custkey, Name, Address, Nationkey, Phone, AcctBal, Mktsegment, Comment]

  21. Trojan Layout Advantages • Multiple layouts for a given workload • Default row layout still available • Specialized replicas for different query sub-classes • Divide and conquer layout computation

  22. 6 How do We Ride the Elephant?

  23. Putting It All Together • Load: create trojan layout configuration file in HDFS; dataset -> layout-1, layout-2, layout-3 • Query: supply referenced attributes in JobConf; itemize UDF to transparently read the referenced attributes • Schedule: three optimization options: data locality (default), best layout, best layout & locality
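The "best layout" option can be sketched as picking, for the referenced attributes of a query, the replica layout that is cheapest to read. The cost below (non-required attributes read plus a weighted number of joins for tuple reconstruction) is an assumption made for illustration, as are the function names, the `join_weight` parameter, and the example column groups; only the layout names layout-1, layout-2, and layout-3 come from the slide.

```python
def layout_cost(layout, referenced):
    """Return (non-required attributes read, joins for tuple reconstruction).

    A layout is a list of column groups. Every group holding at least one
    referenced attribute must be read in full, and the groups that are read
    must be joined back together into tuples.
    """
    referenced = set(referenced)
    touched = [set(group) for group in layout if set(group) & referenced]
    non_required = sum(len(group - referenced) for group in touched)
    joins = max(len(touched) - 1, 0)
    return non_required, joins

def best_layout(layouts, referenced, join_weight=1.0):
    """Pick the layout name with the lowest assumed cost (ties: by name)."""
    def score(name):
        non_required, joins = layout_cost(layouts[name], referenced)
        return non_required + join_weight * joins
    return min(sorted(layouts), key=score)

# Hypothetical per-replica layouts over five Customer attributes.
layouts = {
    "layout-1": [["custkey"], ["name", "address"], ["phone", "acctbal"]],
    "layout-2": [["custkey", "name"], ["address", "phone", "acctbal"]],
    "layout-3": [["custkey", "name", "address", "phone", "acctbal"]],
}

if __name__ == "__main__":
    print(best_layout(layouts, ["custkey", "name"]))
    print(best_layout(layouts, ["name", "address"]))
```

A scheduler that also honors data locality would additionally have to weigh where each replica is stored, which this sketch leaves out.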

  24. 7 How were the Field Trials?

  25. Setup • Datasets: TPC-H Lineitem, TPC-H Customer, SSB LineOrder, SDSS PhotoObj • Queries: first 8 queries from the respective benchmark for each table • Methodology: focus on scan and projection operators, i.e. map-phase-only jobs; improvement measured in record reader time (I/O and tuple reconstruction) • Hardware: 50 virtual nodes in a 10-node cluster

  26. Per-replica Trojan Layout Performance, TPC-H Lineitem [Bar chart: Improvement Factor (0 to 5) for TPC-H queries Q1 to Q8; series: over Hadoop-Row, over Hadoop-PAX]

  27. Layout Quality (#Non-required Attributes Read / #Joins in Tuple Reconstruction): Hadoop-Row 525 / 0; Hadoop-PAX 0 / 139; HYRISE* Layout 2 / 64; Trojan Layout 14 / 20. > 14% improvement over HYRISE. * M. Grund et al. HYRISE - A Main Memory Hybrid Storage Engine. PVLDB, November 2010.

  28. Scheduling Decisions, TPC-H Lineitem [Bar chart: Scheduling Penalty (0 to 5) for queries Q1 to Q8; series: Locality (default), Best-Layout, Best-Layout & Locality]

  29. Summary • Data layouts crucial to MR job performance • Exploit default data block replication in MR • Novel algorithm to compute per-replica layouts • Improvement: 4.8x over Row, 3.5x over PAX • Better than HYRISE; 14% improvement
