Large-Scale Data Analysis: Bridging the Gap
Alekh Jindal, Yagiz Kargin, Sarath Kumar, Vinay Setty
Outline
●Motivation: Parallel DBMS vs Map/Reduce
●Schema & Benchmarks Overview
●Original (Pavlo) Map/Reduce Plans
●Improved (SAVY) Design & Implementation
●Improving Hadoop
○Indexing
○Co-Partitioning
●Experiments
●Conclusion
Motivation
●Ever-growing data
○About 20TB per Google crawl!
●Computing Solutions
○High-end server: 1625.60€/core, 97.66€/GB
○Shared-nothing nodes: 299.50€/core, 166.33€/GB
●Two Paradigms
○Parallel DBMS
○Map/Reduce
Parallel DBMS
[Diagram: a query fans out to parallel scan and sort operators, one per data partition, whose outputs are merged into the final result.]
[DeWitt, D. and Gray, J. 1992]
Parallel DBMS: Advantages
●Can be column-based
○Example: Vertica
●Local joins possible
○Partition based on join key
●Can work on compressed data
○Reduced data transfer
●Flexible query plans
●Supports declarative languages like SQL
Parallel DBMS: Shortcomings
●Not free of cost
●Not open source
●Cannot scale to thousands of nodes. Why?
○Less fault-tolerant
○Assumes homogeneous nodes
●Not so easy to achieve high performance
○Needs a highly skilled DBA
○Needs high maintenance
Map/Reduce (Hadoop): Advantages
●Free of cost
●Open source
●Fault-tolerant
●Scales well to thousands of nodes
●Less maintenance
●Flexible query framework
Map/Reduce (Hadoop): Shortcomings
●Lack of built-in indexing (current focus)
●Cannot guarantee local joins (current focus)
●Performance degradation for SQL-like queries
○Multiple MR phases (current focus)
○Each MR phase adds extra cost
●No flexible query plans
●Data transfer not optimized
Benchmarks and Schema
Schema
CREATE TABLE Documents (
    url VARCHAR(100) PRIMARY KEY,
    contents TEXT
);
CREATE TABLE Rankings (
    pageURL VARCHAR(100) PRIMARY KEY,
    pageRank INT,
    avgDuration INT
);
Schema
CREATE TABLE UserVisits (
    sourceIP VARCHAR(16),
    destURL VARCHAR(100),
    visitDate DATE,
    adRevenue FLOAT,
    userAgent VARCHAR(64),
    countryCode VARCHAR(3),
    languageCode VARCHAR(6),
    searchWord VARCHAR(32),
    duration INT
);
Benchmarks 1 & 2
●Selection task (Benchmark 1)
○SELECT pageURL, pageRank FROM Rankings WHERE pageRank > X;
●Aggregation task (Benchmark 2)
○SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;
○SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7);
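As a minimal sketch (plain Python rather than the benchmark's Hadoop Java), the selection task is just a per-record filter, which is exactly what a map function does; the threshold value X is set to 10 here to match the plans shown later:

```python
# Hedged sketch of Benchmark 1's selection as a map function over
# (pageURL, pageRank) records. X = 10 follows the later plan slides.
X = 10

def map_select(record):
    """Emit the record iff its pageRank exceeds the threshold."""
    page_url, page_rank = record
    if page_rank > X:
        yield (page_url, page_rank)

rankings = [("a.com", 5), ("b.com", 42), ("c.com", 11)]
result = [kv for rec in rankings for kv in map_select(rec)]
# result == [("b.com", 42), ("c.com", 11)]
```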
Benchmark 3: Join Task
●SELECT INTO Temp sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22') GROUP BY UV.sourceIP;
●SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY totalRevenue DESC LIMIT 1;
(The query combines a selection on visitDate, a join on URL, and projection & aggregation per sourceIP.)
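The join step above can be sketched as a reduce-side (repartition) join, in plain Python rather than the benchmark's actual Hadoop code; the date filter is simplified to string comparison, and the data values are illustrative:

```python
# Hedged sketch of Benchmark 3's join: map tags each record with its
# source table, keyed by URL; reduce joins the two record types.
from collections import defaultdict

def map_phase(rankings, user_visits):
    """Tag each record with its source table, keyed by URL."""
    for page_url, page_rank in rankings:
        yield page_url, ("R", page_rank)
    for source_ip, dest_url, visit_date, ad_revenue in user_visits:
        if "2000-01-15" <= visit_date <= "2000-01-22":  # date selection
            yield dest_url, ("UV", (source_ip, ad_revenue))

def reduce_phase(grouped):
    """Join the two tagged record types on URL; emit joined tuples."""
    for url, values in grouped.items():
        ranks = [v for tag, v in values if tag == "R"]
        visits = [v for tag, v in values if tag == "UV"]
        for page_rank in ranks:
            for source_ip, ad_revenue in visits:
                yield source_ip, (page_rank, ad_revenue)

rankings = [("a.com", 10), ("b.com", 50)]
visits = [("1.2.3.4", "a.com", "2000-01-16", 3.0),
          ("1.2.3.4", "b.com", "2000-01-17", 2.0),
          ("5.6.7.8", "a.com", "2000-02-01", 9.0)]  # dropped by date filter
grouped = defaultdict(list)
for url, tagged in map_phase(rankings, visits):
    grouped[url].append(tagged)
joined = sorted(reduce_phase(grouped))
# joined == [("1.2.3.4", (10, 3.0)), ("1.2.3.4", (50, 2.0))]
```

The per-sourceIP AVG/SUM and the final ORDER BY ... LIMIT 1 would then run over `joined` in further phases, as the plan slides show.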
Original (Pavlo) MR Plans
Benchmark 1
[Diagram: Phase 1 mappers scan the Rankings data and apply the filter pageRank > 10, each writing a partial result; Phase 2 is an extra MR job in which identity mappers feed a single reducer that merges the partial results.]
SELECT pageURL, pageRank FROM Rankings WHERE pageRank > 10;
Benchmark 2: Phase 1
[Diagram: mappers emit (sourceIP, adRevenue) pairs from each split; combiners pre-sum each mapper's output; reducers aggregate the partial sums into intermediate per-key results.]
Benchmark 2: Phase 2
[Diagram: an extra MR job to merge results; identity mappers read the Phase 1 outputs and a single reducer merges them into the final result.]
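The combiner-then-reducer aggregation in Phase 1 can be sketched as follows (a plain-Python stand-in, not the benchmark's Java; data values are illustrative):

```python
# Hedged sketch of Benchmark 2, Phase 1: map emits (sourceIP, adRevenue);
# a combiner pre-sums each mapper's output; reducers sum the partial sums.
from collections import defaultdict

def combine(pairs):
    """Pre-aggregate (key, value) pairs locally, as a Hadoop combiner would."""
    partial = defaultdict(float)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())

def reduce_sum(partials_per_mapper):
    """Merge per-mapper partial sums into the final per-key totals."""
    totals = defaultdict(float)
    for partials in partials_per_mapper:
        for key, value in partials:
            totals[key] += value
    return dict(totals)

split1 = [("1.1.1.1", 2.0), ("2.2.2.2", 1.0), ("1.1.1.1", 3.0)]
split2 = [("1.1.1.1", 4.0)]
totals = reduce_sum([combine(split1), combine(split2)])
# totals == {"1.1.1.1": 9.0, "2.2.2.2": 1.0}
```

The combiner shrinks the data shuffled between map and reduce, which is why it appears in both the original and the improved plans.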
Benchmark 3 – Phase 1
[Diagram: mappers read both Rankings and UserVisits, classifying the two types of records by source table; reducers evaluate the join predicate on URL and emit <sourceIP, URL, pageRank, adRevenue> tuples.]
Benchmark 3 – Phase 2
[Diagram: identity mappers pass the <sourceIP, URL, pageRank, adRevenue> tuples through; reducers compute Avg(PR) and Sum(adRevenue) per sourceIP.]
Benchmark 3 – Phase 3
[Diagram: identity mappers feed a single reducer, which takes Max(Sum(adRevenue)) over the <sourceIP, Avg(PR), Sum(adRevenue)> records to produce the final result.]
Improved (SAVY) MR Plans
Binary Data
●Eliminates delimiters
●Avoids splitting
●Makes tuples fixed-length
●Helps in indexing
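An illustrative sketch of such a fixed-length binary layout (the field widths here are assumptions, not the project's actual format): with a constant record size there are no delimiters to scan for, no field-splitting, and record i sits at offset i * RECORD_SIZE, which is what makes indexing easy.

```python
# Hedged sketch of a fixed-length binary encoding for Rankings tuples.
import struct

# pageURL padded to 100 bytes; pageRank and avgDuration as 4-byte ints.
# "<" forces a standard, unaligned layout.
RECORD_FMT = "<100sii"
RECORD_SIZE = struct.calcsize(RECORD_FMT)  # 108 bytes per tuple, always

def encode(page_url, page_rank, avg_duration):
    # "100s" pads the URL with NUL bytes to the fixed width
    return struct.pack(RECORD_FMT, page_url.encode(), page_rank, avg_duration)

def decode(buf, i):
    """Random access: read the i-th record without scanning for delimiters."""
    url, rank, dur = struct.unpack_from(RECORD_FMT, buf, i * RECORD_SIZE)
    return url.rstrip(b"\0").decode(), rank, dur

data = encode("a.com", 10, 5) + encode("b.com", 42, 7)
# decode(data, 1) == ("b.com", 42, 7)
```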
Benchmark 1
[Diagram: with binary data, the pageRank > 10 filter runs directly in the Phase 1 mappers over each split; Phase 2 remains an extra MR job whose single reducer merges the partial results.]
Benchmark 2
[Diagram: mappers emit (sourceIP, adRevenue) from binary splits; combiners pre-sum; Phase 1 reducers aggregate; Phase 2 is an extra MR job whose reducer merges the per-reducer aggregates into the final result.]
Benchmark 3 (Design I) – Phase 1
[Diagram: record readers scan binary UserVisits and Rankings data; identity mappers pass records through; reducers evaluate the join predicate and emit <sourceIP, URL, pageRank, adRevenue>. With binary data the two record types are easy to classify: just look at the record size.]
Benchmark 3 (Design I) – Phase 2
[Diagram: identity mappers pass the joined tuples through; combiners compute partial Avg(PR) and Sum(adRevenue); a single reducer takes Max(Sum(adRevenue)) to produce the final result. No Phase 3!]
Benchmark 3 (Design II) – Phase 1
[Diagram: only UserVisits is read; mappers apply the date predicate; reducers compute Sum(adRevenue) per sourceIP, keeping the destination URLs, and retain only the top records by Max(Sum(adRevenue)), so the output <sourceIP, Sum(adRevenue), <Dest URLs>> is very small.]
Benchmark 3 (Design II) – Phase 2
[Diagram: because the Phase 1 result is very small, a single mapper reads Rankings via record readers, joins it with the <sourceIP, Sum(adRevenue), <Dest URLs>> records, computes Avg(PR) and Max(Sum(adRevenue)), and writes the final result.]
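Design II's second phase can be sketched as a map-side join: the tiny Phase 1 output is held in memory while Rankings is streamed past it once, with no shuffle. This is an assumed plain-Python rendering, not the project's code, and the data values are illustrative:

```python
# Hedged sketch of Design II, Phase 2: broadcast the small Phase 1 result
# and stream the large Rankings table through a single mapper.
def phase2_map_side_join(phase1_result, rankings):
    """phase1_result: {sourceIP: (sum_ad_revenue, dest_urls)}, in memory.
    Collect the pageRank of every ranking whose URL a sourceIP visited."""
    ranks_by_ip = {ip: [] for ip in phase1_result}
    for page_url, page_rank in rankings:              # one sequential scan
        for ip, (_, dest_urls) in phase1_result.items():
            if page_url in dest_urls:
                ranks_by_ip[ip].append(page_rank)
    # Emit (avg pageRank, total adRevenue) per sourceIP.
    return {ip: (sum(r) / len(r), phase1_result[ip][0])
            for ip, r in ranks_by_ip.items() if r}

small = {"1.2.3.4": (5.0, {"a.com", "b.com"})}        # tiny Phase 1 output
big_rankings = [("a.com", 10), ("b.com", 30), ("c.com", 99)]
out = phase2_map_side_join(small, big_rankings)
# out == {"1.2.3.4": (20.0, 5.0)}
```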
Improving Hadoop
Improving Hadoop
●Improve selection (indexing)
●Improve joins (co-partitioning)
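The co-partitioning idea can be sketched as follows (an assumed implementation, not the project's code): if both tables are hash-partitioned on the join key with the same function and partition count, matching rows land in the same partition, so each join runs locally with no shuffle.

```python
# Hedged sketch of co-partitioning two tables on their join key.
NUM_PARTITIONS = 4

def partition_of(join_key):
    return hash(join_key) % NUM_PARTITIONS

def co_partition(rows, key_index):
    """Route each row to the partition determined by its join key."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for row in rows:
        parts[partition_of(row[key_index])].append(row)
    return parts

rankings = [("a.com", 10), ("b.com", 50)]
visits = [("1.2.3.4", "a.com", 3.0), ("5.6.7.8", "b.com", 2.0)]
r_parts = co_partition(rankings, 0)   # key = pageURL
v_parts = co_partition(visits, 1)     # key = destURL
# Matching rows are co-located, so every join is partition-local:
local_joins = [(r, v) for p in range(NUM_PARTITIONS)
               for r in r_parts[p] for v in v_parts[p] if r[0] == v[1]]
```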
Indexing
●Data Loading
○Index and load data into DFS
●Query Execution
○Index look-up and selection
●Implementation on Hadoop
Data Loading
●Partitioning
●Sorting
●Bulk Loading
●HID Splits
Partitioning
Split input data at tuple boundaries
Sorting
Sort each split on the index key
Bulk Loading
Bulk-load a CSS (Cache-Sensitive Search) tree index over each sorted split
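A CSS tree is bulk-loaded bottom-up over already-sorted keys; as a hedged stand-in, this sketch bulk-loads a plain sorted array of (key, offset) pairs and answers range lookups by binary search, which is the operation the CSS tree's cache-friendly node layout accelerates:

```python
# Hedged stand-in for a bulk-loaded CSS-tree index over a sorted split.
import bisect

def bulk_load(sorted_records):
    """sorted_records: (index_key, record_offset) pairs, already sorted."""
    keys = [k for k, _ in sorted_records]
    offsets = [o for _, o in sorted_records]
    return keys, offsets

def lookup_range(index, low, high):
    """Return offsets of all records with low <= key <= high."""
    keys, offsets = index
    lo = bisect.bisect_left(keys, low)
    hi = bisect.bisect_right(keys, high)
    return offsets[lo:hi]

# Offsets assume the 108-byte fixed-length records of the binary format.
idx = bulk_load([(1, 0), (5, 108), (9, 216), (12, 324)])
# lookup_range(idx, 5, 10) == [108, 216]
```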
HID Split
Construct a Header-Index-Data (HID) split
Header fields:
●Index end offset
●Data end offset
●Start index key
●End index key
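An illustrative layout for the HID split header (the exact field widths are assumptions): four fixed-size fields let a reader decide from the header alone where the index and data regions end and whether the split can contain qualifying keys at all.

```python
# Hedged sketch of an HID (Header-Index-Data) split header layout.
import struct

# index end offset, data end offset, start key, end key: 8 bytes each.
HEADER_FMT = "<qqqq"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 32 bytes

def write_header(index_end, data_end, start_key, end_key):
    return struct.pack(HEADER_FMT, index_end, data_end, start_key, end_key)

def read_header(buf):
    return struct.unpack_from(HEADER_FMT, buf, 0)

hdr = write_header(4096, 1 << 20, 100, 999)
index_end, data_end, start_key, end_key = read_header(hdr)
# A selection on key > 1000 can skip this whole split: end_key is 999.
```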
Query Execution
●Partitioning
●Split selection
●Index lookup
●Extractor
Partitioning
Read the header to get HID boundaries
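Split selection can then be sketched as follows (assumed logic, not the project's code): after reading each HID header, the job only schedules splits whose [start key, end key] range can intersect the selection predicate.

```python
# Hedged sketch of header-based split selection during query execution.
def select_splits(headers, low, high):
    """headers: list of (split_id, start_key, end_key) read from HID headers.
    Keep splits whose key range overlaps the query range [low, high]."""
    return [sid for sid, s, e in headers if s <= high and e >= low]

headers = [(0, 1, 50), (1, 51, 120), (2, 121, 300)]
# A selection WHERE key > 100 only needs to scan splits 1 and 2:
pruned = select_splits(headers, 101, 10**9)
# pruned == [1, 2]
```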