

  1. Large-Scale Data Analysis: Bridging the Gap Alekh Jindal, Yagiz Kargin, Sarath Kumar, Vinay Setty

  2. Outline
  ● Motivation: Parallel DBMS vs Map/Reduce
  ● Schema & Benchmarks Overview
  ● Original (Pavlo) Map/Reduce Plans
  ● Improved (SAVY) Design & Implementation
  ● Improving Hadoop
    ○ Indexing
    ○ Co-Partitioning
  ● Experiments
  ● Conclusion

  3. Motivation
  ● Ever-growing data
    ○ About 20 TB per Google crawl!
  ● Computing solutions
    ○ High-end server: 1625.60 €/core, 97.66 €/GB
    ○ Shared-nothing nodes: 299.50 €/core, 166.33 €/GB
  ● Two paradigms
    ○ Parallel DBMS
    ○ Map/Reduce

  4. Parallel DBMS
  [Diagram: a query fans out over the data partitions; each node scans and sorts its own partition, and the sorted streams are merged.] [DeWitt, D. and Gray, J. 1992]

  5. Parallel DBMS: Advantages
  ● Can be column-based
    ○ Example: Vertica
  ● Local joins possible
    ○ Partition based on the join key
  ● Can work on compressed data
    ○ Reduced data transfer
  ● Flexible query plans
  ● Supports declarative languages like SQL

  6. Parallel DBMS: Shortcomings
  ● Not free of cost
  ● Not open source
  ● Cannot scale to thousands of nodes. Why?
    ○ Less fault-tolerant
    ○ Assumes homogeneous nodes
  ● High performance is not easy to achieve
    ○ Needs a highly skilled DBA
    ○ Needs high maintenance

  7. Map/Reduce (Hadoop): Advantages
  ● Free of cost
  ● Open source
  ● Fault-tolerant
  ● Scales well to thousands of nodes
  ● Less maintenance
  ● Flexible query framework

  8. Map/Reduce (Hadoop): Shortcomings
  ● Lack of built-in indexing (current focus)
  ● Cannot guarantee local joins (current focus)
  ● Performance degradation for SQL-like queries
    ○ Multiple MR phases (current focus)
    ○ Each MR phase adds extra cost
  ● No flexible query plans
  ● Data transfer not optimized

  9. Benchmarks and Schema

  10. Schema
  CREATE TABLE Documents (
    url      VARCHAR(100) PRIMARY KEY,
    contents TEXT
  );
  CREATE TABLE Rankings (
    pageURL     VARCHAR(100) PRIMARY KEY,
    pageRank    INT,
    avgDuration INT
  );

  11. Schema
  CREATE TABLE UserVisits (
    sourceIP     VARCHAR(16),
    destURL      VARCHAR(100),
    visitDate    DATE,
    adRevenue    FLOAT,
    userAgent    VARCHAR(64),
    countryCode  VARCHAR(3),
    languageCode VARCHAR(6),
    searchWord   VARCHAR(32),
    duration     INT
  );

  12. Benchmarks 1 & 2
  ● Selection task (Benchmark 1)
    ○ SELECT pageURL, pageRank FROM Rankings WHERE pageRank > X;
  ● Aggregation task (Benchmark 2)
    ○ SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;
    ○ SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7);
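The Benchmark 1 selection maps naturally onto a single filtering map function. A minimal pure-Python sketch of that plan, assuming a fixed threshold of 10; the in-memory driver and function names are illustrative, not Hadoop code:

```python
def select_map(record, threshold=10):
    """Map: emit (pageURL, pageRank) only when pageRank exceeds the threshold."""
    page_url, page_rank = record
    if page_rank > threshold:
        yield (page_url, page_rank)

def identity_reduce(pairs):
    """Reduce: merely merge the mappers' outputs into one result."""
    return list(pairs)

# Tiny illustrative Rankings table: (pageURL, pageRank) tuples.
rankings = [("url1", 5), ("url2", 42), ("url3", 17)]
result = identity_reduce(kv for rec in rankings for kv in select_map(rec))
```

The reduce side here is pure overhead, which is exactly the point the later slides make about the extra merge job.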

  13. Benchmark 3: Join Task
  ● SELECT INTO Temp sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue  -- projection & aggregation
    FROM Rankings AS R, UserVisits AS UV
    WHERE R.pageURL = UV.destURL                                                             -- join predicate
      AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22')                     -- selection
    GROUP BY UV.sourceIP;
  ● SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY totalRevenue DESC LIMIT 1;

  14. Original (Pavlo) MR Plans

  15. Benchmark 1
  [Diagram: Phase 1 mappers scan the data and emit rows with pageRank > 10; Phase 2 runs identity mappers and a single reducer, i.e. an extra MR job just to merge the results.]
  SELECT pageURL, pageRank FROM Rankings WHERE pageRank > 10;

  16. Benchmark 2: Phase 1
  [Diagram: each split is read by a mapper that emits (sourceIP, adRevenue); a combiner pre-sums per split; reducers aggregate the partial sums into intermediate results.]

  17. Benchmark 2: Phase 2
  [Diagram: identity mappers feed the phase-1 results to a single reducer, i.e. an extra MR job just to merge the results.]
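The aggregation plan above hinges on the combiner and reducer sharing the same summation logic. A pure-Python sketch of SUM(adRevenue) GROUP BY sourceIP with per-split combining; the two-split in-memory driver is illustrative:

```python
from collections import defaultdict

def agg_map(record):
    """Map: project each UserVisits record down to (sourceIP, adRevenue)."""
    source_ip, ad_revenue = record
    yield (source_ip, ad_revenue)

def combine_or_reduce(pairs):
    """Combiner and reducer share the same logic: sum revenue per key."""
    sums = defaultdict(float)
    for key, value in pairs:
        sums[key] += value
    return dict(sums)

# Two illustrative input splits of (sourceIP, adRevenue) records.
split1 = [("1.1.1.1", 1.0), ("2.2.2.2", 2.0), ("1.1.1.1", 3.0)]
split2 = [("1.1.1.1", 4.0)]
# Combiners pre-aggregate each split; the reducer merges the partial sums.
partials = [combine_or_reduce(kv for r in s for kv in agg_map(r))
            for s in (split1, split2)]
final = combine_or_reduce(kv for p in partials for kv in p.items())
```

Because summation is associative, the combiner shrinks the shuffle without changing the answer.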

  18. Benchmark 3 – Phase 1
  [Diagram: mappers read both Rankings and UserVisits, apply the predicates, and tag each record with its type; reducers join on the URL and emit (sourceIP, URL, pageRank, adRevenue) tuples. The plan also classifies the two types of records.]

  19. Benchmark 3 – Phase 2
  [Diagram: identity mappers pass the joined (sourceIP, URL, pageRank, adRevenue) tuples to reducers that compute AVG(pageRank) and SUM(adRevenue) per sourceIP.]

  20. Benchmark 3 – Phase 3
  [Diagram: identity mappers feed all (sourceIP, AVG(pageRank), SUM(adRevenue)) tuples to a single reducer, which keeps the record with MAX(SUM(adRevenue)) as the final result.]
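Phase 1 of this plan is a classic repartition join: mappers tag each record with its table of origin, the shuffle groups records by join key (the URL), and the reducer pairs them up. A pure-Python sketch, with the in-memory shuffle and all names illustrative:

```python
from collections import defaultdict

def tag_map(record, table):
    """Map: key every record by URL and tag it with its table of origin."""
    if table == "rankings":
        page_url, page_rank = record
        yield (page_url, ("R", page_rank))
    else:  # uservisits
        source_ip, dest_url, ad_revenue = record
        yield (dest_url, ("UV", (source_ip, ad_revenue)))

def join_reduce(url, tagged):
    """Reduce: pair every Rankings record with every UserVisits record for a URL."""
    ranks = [v for t, v in tagged if t == "R"]
    visits = [v for t, v in tagged if t == "UV"]
    for page_rank in ranks:
        for source_ip, ad_revenue in visits:
            yield (source_ip, page_rank, ad_revenue)

rankings = [("url1", 10), ("url2", 20)]
uservisits = [("1.1.1.1", "url1", 5.0), ("2.2.2.2", "url3", 7.0)]
groups = defaultdict(list)  # simulates the shuffle: group tagged records by URL
for rec in rankings:
    for k, v in tag_map(rec, "rankings"):
        groups[k].append(v)
for rec in uservisits:
    for k, v in tag_map(rec, "uservisits"):
        groups[k].append(v)
joined = [row for url, tagged in groups.items()
          for row in join_reduce(url, tagged)]
```

The tagging is necessary because the reducer sees a mixed bag of records per URL and must tell the two tables apart, which is what the improved binary design later simplifies.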

  21. Improved (Savy) MR Plans

  22. Binary Data
  ● Eliminates delimiters
  ● Avoids splitting
  ● Makes tuples fixed-length
  ● Helps in indexing
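With fixed-length binary tuples, record i simply starts at byte i × RECORD_SIZE, so there is no delimiter scanning or string splitting. A sketch with Python's struct module; the concrete layout (100-byte URL, 4-byte pageRank) is an assumption for illustration:

```python
import struct

# Assumed layout: 100-byte URL field, 4-byte int pageRank (padded to alignment).
RECORD_FMT = "100si"
RECORD_SIZE = struct.calcsize(RECORD_FMT)

def pack_record(url, rank):
    """Serialize one tuple; struct null-pads short URLs to the fixed width."""
    return struct.pack(RECORD_FMT, url.encode(), rank)

def unpack_record(buf, i):
    """Read record i by direct offset arithmetic, no delimiter scan needed."""
    url, rank = struct.unpack_from(RECORD_FMT, buf, i * RECORD_SIZE)
    return url.rstrip(b"\0").decode(), rank

data = b"".join(pack_record(u, r) for u, r in [("url1", 5), ("url2", 42)])
```

Fixed record sizes are also what lets the improved join plan classify record types just by looking at their length.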

  23. Benchmark 1
  [Diagram: with binary data, Phase 1 mappers filter pageRank > 10 directly; Phase 2 is still an extra MR job with a single reducer to merge the results.]

  24. Benchmark 2
  [Diagram: with binary data, mappers emit partial sums per split, combiners aggregate, and reducers merge the intermediate results; an extra MR job with a single reducer still merges the final result.]

  25. Benchmark 3 (Design I) – Phase 1
  [Diagram: custom record readers feed Rankings and UserVisits through identity mappers; reducers apply the join predicate and emit (sourceIP, URL, pageRank, adRevenue). With binary data the two record types are easy to classify: just look at the record size.]

  26. Benchmark 3 (Design I) – Phase 2
  [Diagram: identity mappers plus combiners computing AVG(pageRank) and SUM(adRevenue) feed reducers that also take MAX(SUM(adRevenue)), producing the final result directly. No Phase 3!]

  27. Benchmark 3 (Design II) – Phase 1
  [Diagram: only UserVisits is read; mappers apply the predicate, reducers compute SUM(adRevenue) per sourceIP with the destination URLs, keeping MAX(SUM(adRevenue)). Output is very small: (sourceIP, SUM(adRevenue), <destURLs>) for just the top R records.]

  28. Benchmark 3 (Design II) – Phase 2
  [Diagram: a single mapper reads Rankings through record readers and joins it against the tiny phase-1 output (sourceIP, SUM(adRevenue), <destURLs>), computing AVG(pageRank) and MAX(SUM(adRevenue)) to produce the final result.]
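Because phase 1 shrinks UserVisits to a handful of top records, the second phase can do a map-side (broadcast-style) join: one mapper streams Rankings and probes the small phase-1 output held in memory. A sketch under those assumptions, with all names illustrative:

```python
# Tiny phase-1 output, assumed shape: sourceIP -> (SUM(adRevenue), dest URLs).
top_records = {
    "1.1.1.1": (100.0, {"url1", "url3"}),
}

def map_side_join(rankings, top):
    """Single mapper: join the small top-record set against streamed Rankings."""
    for source_ip, (revenue, urls) in top.items():
        ranks = [rank for url, rank in rankings if url in urls]
        avg_rank = sum(ranks) / len(ranks) if ranks else 0.0
        yield (source_ip, avg_rank, revenue)

rankings = [("url1", 10), ("url2", 20), ("url3", 30)]
result = list(map_side_join(rankings, top_records))
```

No shuffle is needed at all in this phase, which is why Design II avoids the repartition join's cost.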

  29. Improving Hadoop

  30. Improving Hadoop
  ● Improve selection (indexing)
  ● Improve joins (co-partitioning)

  31. Indexing
  ● Data loading
    ○ Index and load data into DFS
  ● Query execution
    ○ Index look-up and selection
  ● Implementation on Hadoop

  32. Data Loading
  ● Partitioning
  ● Sorting
  ● Bulk loading
  ● HID splits

  33. Data Loading

  34. Partitioning: Split input data at tuple boundaries


  38. Sorting: Sort each split on the index key


  40. Bulk Loading: Bulk-load a CSS tree index over each sorted split
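The deck bulk-loads a CSS tree (a cache-conscious search tree) over each sorted split. As a stand-in with the same lookup behaviour, a dense sorted key array with binary search can be "bulk loaded" in one pass; this substitute structure and all names are illustrative, not the deck's actual implementation:

```python
import bisect

def build_index(records, key):
    """Sort the split on the index key and bulk-load a dense key array."""
    records.sort(key=key)
    return [key(r) for r in records]

def lookup_ge(keys, records, value):
    """Return all records with key >= value (e.g. a pageRank > X selection)."""
    i = bisect.bisect_left(keys, value)
    return records[i:]

# Illustrative split of (pageURL, pageRank) records.
split = [("url2", 42), ("url1", 5), ("url3", 17)]
keys = build_index(split, key=lambda r: r[1])
hits = lookup_ge(keys, split, 17)
```

Like a CSS tree, this layout has no per-node pointers: position in the array encodes structure, which is what makes one-pass bulk loading over sorted data cheap.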

  41. HID Split: Construct a Header-Index-Data (HID) split


  43. HID Split: Construct a Header-Index-Data (HID) split
  Header fields: index end offset, data end offset, start index key, end index key
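The four header fields are what let a reader skip a whole split without touching its index or data sections. A sketch of packing and reading such a header; the concrete binary layout (four little-endian 64-bit integers) is an assumption for illustration:

```python
import struct

# Assumed layout: index end offset, data end offset, start key, end key.
HEADER_FMT = "<qqqq"

def write_header(index_end, data_end, start_key, end_key):
    """Serialize the HID header that prefixes a split."""
    return struct.pack(HEADER_FMT, index_end, data_end, start_key, end_key)

def read_header(buf):
    """Read just the fixed-size header from the front of a split."""
    return struct.unpack_from(HEADER_FMT, buf, 0)

header = write_header(4096, 1 << 20, 10, 99)
index_end, data_end, start_key, end_key = read_header(header)
```

The offsets tell the reader where the index and data regions end; the key range tells it whether the split can contain matches at all.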


  45. Query Execution
  ● Partitioning
  ● Split selection
  ● Index lookup
  ● Extractor

  46. Query Execution

  47. Partitioning: Read the header to get HID boundaries
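Once the headers are read, split selection becomes a pure metadata check: a split is a candidate only if its [start key, end key] range can contain matches for the predicate. A sketch for a pageRank > X selection; the list-of-ranges representation is illustrative:

```python
def select_splits(headers, x):
    """headers: (start_key, end_key) per split; keep candidates for key > x."""
    return [i for i, (lo, hi) in enumerate(headers) if hi > x]

# Illustrative key ranges for three HID splits.
headers = [(0, 9), (10, 49), (50, 99)]
candidates = select_splits(headers, 25)  # split 0 is pruned without being read
```

Pruned splits are never scanned at all, which is where the indexed plan beats Hadoop's default full-scan behaviour on selective queries.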

