Large-Scale Data Analysis: Bridging the Gap
Alekh Jindal, Yagiz Kargin, Sarath Kumar, Vinay Setty
Outline
●Motivation: Parallel DBMS vs Map/Reduce
●Schema & Benchmarks Overview
●Original (Pavlo) Map/Reduce Plans
●Improved (SAVY) Design & Implementation
●Improving Hadoop
○Indexing
○Co-Partitioning
●Experiments
●Conclusion
Motivation
●Ever-growing data
○About 20TB per Google crawl!
●Computing Solutions
○High-end server: 1625.60€/core, 97.66€/GB
○Shared-nothing nodes: 299.50€/core, 166.33€/GB
●Two Paradigms
○Parallel DBMS
○Map/Reduce
Parallel DBMS
[Diagram: a query fans out to parallel scan and sort operators, one per data partition, whose outputs are merged into the final result.]
[DeWitt, D. and Gray, J. 1992]
Parallel DBMS: Advantages
●Can be column-based
○Example: Vertica
●Local joins possible
○Partition based on join key
●Can work on compressed data
○Reduced data transfer
●Flexible query plans
●Supports declarative languages like SQL
Parallel DBMS: Shortcomings
●Not free of cost
●Not open source
●Cannot scale to thousands of nodes. Why?
○Less fault-tolerant
○Assumes homogeneous nodes
●Not so easy to achieve high performance
○Needs a highly skilled DBA
○Needs high maintenance
Map/Reduce (Hadoop): Advantages
●Free of cost
●Open source
●Fault-tolerant
●Scales well to thousands of nodes
●Less maintenance
●Flexible query framework
Map/Reduce (Hadoop): Shortcomings
●Lack of built-in indexing (current focus)
●Cannot guarantee local joins (current focus)
●Performance degradation for SQL-like queries
○Multiple MR phases (current focus)
○Each MR phase adds extra cost
●No flexible query plans
●Data transfer not optimized
Benchmarks and Schema
Schema
CREATE TABLE Documents (
    url VARCHAR(100) PRIMARY KEY,
    contents TEXT
);
CREATE TABLE Rankings (
    pageURL VARCHAR(100) PRIMARY KEY,
    pageRank INT,
    avgDuration INT
);
Schema
CREATE TABLE UserVisits (
    sourceIP VARCHAR(16),
    destURL VARCHAR(100),
    visitDate DATE,
    adRevenue FLOAT,
    userAgent VARCHAR(64),
    countryCode VARCHAR(3),
    languageCode VARCHAR(6),
    searchWord VARCHAR(32),
    duration INT
);
Benchmarks 1 & 2
●Selection task (Benchmark 1)
○SELECT pageURL, pageRank FROM Rankings WHERE pageRank > X;
●Aggregation task (Benchmark 2)
○SELECT sourceIP, SUM(adRevenue) FROM UserVisits GROUP BY sourceIP;
○SELECT SUBSTR(sourceIP, 1, 7), SUM(adRevenue) FROM UserVisits GROUP BY SUBSTR(sourceIP, 1, 7);
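As a minimal sketch (plain Python rather than the benchmark's Hadoop Java), the selection task is just a per-record filter, which is exactly what a map function does; the threshold value X is set to 10 here to match the plans shown later:

```python
# Hedged sketch of Benchmark 1's selection as a map function over
# (pageURL, pageRank) records. X = 10 follows the later plan slides.
X = 10

def map_select(record):
    """Emit the record iff its pageRank exceeds the threshold."""
    page_url, page_rank = record
    if page_rank > X:
        yield (page_url, page_rank)

rankings = [("a.com", 5), ("b.com", 42), ("c.com", 11)]
result = [kv for rec in rankings for kv in map_select(rec)]
# result == [("b.com", 42), ("c.com", 11)]
```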
Benchmark 3: Join Task
●SELECT INTO Temp sourceIP, AVG(pageRank) AS avgPageRank, SUM(adRevenue) AS totalRevenue FROM Rankings AS R, UserVisits AS UV WHERE R.pageURL = UV.destURL AND UV.visitDate BETWEEN Date('2000-01-15') AND Date('2000-01-22') GROUP BY UV.sourceIP;
●SELECT sourceIP, totalRevenue, avgPageRank FROM Temp ORDER BY totalRevenue DESC LIMIT 1;
(The query combines a selection on visitDate, a join on URL, and projection & aggregation per sourceIP.)
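The join step above can be sketched as a reduce-side (repartition) join, in plain Python rather than the benchmark's actual Hadoop code; the date filter is simplified to string comparison, and the data values are illustrative:

```python
# Hedged sketch of Benchmark 3's join: map tags each record with its
# source table, keyed by URL; reduce joins the two record types.
from collections import defaultdict

def map_phase(rankings, user_visits):
    """Tag each record with its source table, keyed by URL."""
    for page_url, page_rank in rankings:
        yield page_url, ("R", page_rank)
    for source_ip, dest_url, visit_date, ad_revenue in user_visits:
        if "2000-01-15" <= visit_date <= "2000-01-22":  # date selection
            yield dest_url, ("UV", (source_ip, ad_revenue))

def reduce_phase(grouped):
    """Join the two tagged record types on URL; emit joined tuples."""
    for url, values in grouped.items():
        ranks = [v for tag, v in values if tag == "R"]
        visits = [v for tag, v in values if tag == "UV"]
        for page_rank in ranks:
            for source_ip, ad_revenue in visits:
                yield source_ip, (page_rank, ad_revenue)

rankings = [("a.com", 10), ("b.com", 50)]
visits = [("1.2.3.4", "a.com", "2000-01-16", 3.0),
          ("1.2.3.4", "b.com", "2000-01-17", 2.0),
          ("5.6.7.8", "a.com", "2000-02-01", 9.0)]  # dropped by date filter
grouped = defaultdict(list)
for url, tagged in map_phase(rankings, visits):
    grouped[url].append(tagged)
joined = sorted(reduce_phase(grouped))
# joined == [("1.2.3.4", (10, 3.0)), ("1.2.3.4", (50, 2.0))]
```

The per-sourceIP AVG/SUM and the final ORDER BY ... LIMIT 1 would then run over `joined` in further phases, as the plan slides show.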
Original (Pavlo) MR Plans
Benchmark 1
[Diagram: Phase 1 mappers scan the Rankings data and apply the filter pageRank > 10, each writing a partial result; Phase 2 is an extra MR job in which identity mappers feed a single reducer that merges the partial results.]
SELECT pageURL, pageRank FROM Rankings WHERE pageRank > 10;
Benchmark 2: Phase 1
[Diagram: mappers emit (sourceIP, adRevenue) pairs from each split; combiners pre-sum each mapper's output; reducers aggregate the partial sums into intermediate per-key results.]
Benchmark 2: Phase 2
[Diagram: an extra MR job to merge results; identity mappers read the Phase 1 outputs and a single reducer merges them into the final result.]
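The combiner-then-reducer aggregation in Phase 1 can be sketched as follows (a plain-Python stand-in, not the benchmark's Java; data values are illustrative):

```python
# Hedged sketch of Benchmark 2, Phase 1: map emits (sourceIP, adRevenue);
# a combiner pre-sums each mapper's output; reducers sum the partial sums.
from collections import defaultdict

def combine(pairs):
    """Pre-aggregate (key, value) pairs locally, as a Hadoop combiner would."""
    partial = defaultdict(float)
    for key, value in pairs:
        partial[key] += value
    return list(partial.items())

def reduce_sum(partials_per_mapper):
    """Merge per-mapper partial sums into the final per-key totals."""
    totals = defaultdict(float)
    for partials in partials_per_mapper:
        for key, value in partials:
            totals[key] += value
    return dict(totals)

split1 = [("1.1.1.1", 2.0), ("2.2.2.2", 1.0), ("1.1.1.1", 3.0)]
split2 = [("1.1.1.1", 4.0)]
totals = reduce_sum([combine(split1), combine(split2)])
# totals == {"1.1.1.1": 9.0, "2.2.2.2": 1.0}
```

The combiner shrinks the data shuffled between map and reduce, which is why it appears in both the original and the improved plans.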
Benchmark 3 – Phase 1
[Diagram: mappers read both Rankings and UserVisits, classifying the two types of records by source table; reducers evaluate the join predicate on URL and emit <sourceIP, URL, pageRank, adRevenue> tuples.]
Benchmark 3 – Phase 2
[Diagram: identity mappers pass the <sourceIP, URL, pageRank, adRevenue> tuples through; reducers compute Avg(PR) and Sum(adRevenue) per sourceIP.]
Benchmark 3 – Phase 3
[Diagram: identity mappers feed a single reducer, which takes Max(Sum(adRevenue)) over the <sourceIP, Avg(PR), Sum(adRevenue)> records to produce the final result.]
Improved (SAVY) MR Plans
Binary Data
●Eliminates delimiters
●Avoids splitting
●Makes tuples fixed-length
●Helps in indexing
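An illustrative sketch of such a fixed-length binary layout (the field widths here are assumptions, not the project's actual format): with a constant record size there are no delimiters to scan for, no field-splitting, and record i sits at offset i * RECORD_SIZE, which is what makes indexing easy.

```python
# Hedged sketch of a fixed-length binary encoding for Rankings tuples.
import struct

# pageURL padded to 100 bytes; pageRank and avgDuration as 4-byte ints.
# "<" forces a standard, unaligned layout.
RECORD_FMT = "<100sii"
RECORD_SIZE = struct.calcsize(RECORD_FMT)  # 108 bytes per tuple, always

def encode(page_url, page_rank, avg_duration):
    # "100s" pads the URL with NUL bytes to the fixed width
    return struct.pack(RECORD_FMT, page_url.encode(), page_rank, avg_duration)

def decode(buf, i):
    """Random access: read the i-th record without scanning for delimiters."""
    url, rank, dur = struct.unpack_from(RECORD_FMT, buf, i * RECORD_SIZE)
    return url.rstrip(b"\0").decode(), rank, dur

data = encode("a.com", 10, 5) + encode("b.com", 42, 7)
# decode(data, 1) == ("b.com", 42, 7)
```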
Benchmark 1
[Diagram: with binary data, the pageRank > 10 filter runs directly in the Phase 1 mappers over each split; Phase 2 remains an extra MR job whose single reducer merges the partial results.]
Benchmark 2
[Diagram: mappers emit (sourceIP, adRevenue) from binary splits; combiners pre-sum; Phase 1 reducers aggregate; Phase 2 is an extra MR job whose reducer merges the per-reducer aggregates into the final result.]
Benchmark 3 (Design I) – Phase 1
[Diagram: record readers scan binary UserVisits and Rankings data; identity mappers pass records through; reducers evaluate the join predicate and emit <sourceIP, URL, pageRank, adRevenue>. With binary data the two record types are easy to classify: just look at the record size.]
Benchmark 3 (Design I) – Phase 2
[Diagram: identity mappers pass the joined tuples through; combiners compute partial Avg(PR) and Sum(adRevenue); a single reducer takes Max(Sum(adRevenue)) to produce the final result. No Phase 3!]
Benchmark 3 (Design II) – Phase 1
[Diagram: only UserVisits is read; mappers apply the date predicate; reducers compute Sum(adRevenue) per sourceIP, keeping the destination URLs, and retain only the top records by Max(Sum(adRevenue)), so the output <sourceIP, Sum(adRevenue), <Dest URLs>> is very small.]
Benchmark 3 (Design II) – Phase 2
[Diagram: because the Phase 1 result is very small, a single mapper reads Rankings via record readers, joins it with the <sourceIP, Sum(adRevenue), <Dest URLs>> records, computes Avg(PR) and Max(Sum(adRevenue)), and writes the final result.]
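Design II's second phase can be sketched as a map-side join: the tiny Phase 1 output is held in memory while Rankings is streamed past it once, with no shuffle. This is an assumed plain-Python rendering, not the project's code, and the data values are illustrative:

```python
# Hedged sketch of Design II, Phase 2: broadcast the small Phase 1 result
# and stream the large Rankings table through a single mapper.
def phase2_map_side_join(phase1_result, rankings):
    """phase1_result: {sourceIP: (sum_ad_revenue, dest_urls)}, in memory.
    Collect the pageRank of every ranking whose URL a sourceIP visited."""
    ranks_by_ip = {ip: [] for ip in phase1_result}
    for page_url, page_rank in rankings:              # one sequential scan
        for ip, (_, dest_urls) in phase1_result.items():
            if page_url in dest_urls:
                ranks_by_ip[ip].append(page_rank)
    # Emit (avg pageRank, total adRevenue) per sourceIP.
    return {ip: (sum(r) / len(r), phase1_result[ip][0])
            for ip, r in ranks_by_ip.items() if r}

small = {"1.2.3.4": (5.0, {"a.com", "b.com"})}        # tiny Phase 1 output
big_rankings = [("a.com", 10), ("b.com", 30), ("c.com", 99)]
out = phase2_map_side_join(small, big_rankings)
# out == {"1.2.3.4": (20.0, 5.0)}
```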
Improving Hadoop
Improving Hadoop
●Improve selection (indexing)
●Improve joins (co-partitioning)
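The co-partitioning idea can be sketched as follows (an assumed implementation, not the project's code): if both tables are hash-partitioned on the join key with the same function and partition count, matching rows land in the same partition, so each join runs locally with no shuffle.

```python
# Hedged sketch of co-partitioning two tables on their join key.
NUM_PARTITIONS = 4

def partition_of(join_key):
    return hash(join_key) % NUM_PARTITIONS

def co_partition(rows, key_index):
    """Route each row to the partition determined by its join key."""
    parts = [[] for _ in range(NUM_PARTITIONS)]
    for row in rows:
        parts[partition_of(row[key_index])].append(row)
    return parts

rankings = [("a.com", 10), ("b.com", 50)]
visits = [("1.2.3.4", "a.com", 3.0), ("5.6.7.8", "b.com", 2.0)]
r_parts = co_partition(rankings, 0)   # key = pageURL
v_parts = co_partition(visits, 1)     # key = destURL
# Matching rows are co-located, so every join is partition-local:
local_joins = [(r, v) for p in range(NUM_PARTITIONS)
               for r in r_parts[p] for v in v_parts[p] if r[0] == v[1]]
```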
Indexing
●Data Loading
○Index and load data into DFS
●Query Execution
○Index look-up and selection
●Implementation on Hadoop
Data Loading
●Partitioning
●Sorting
●Bulk Loading
●HID Splits
Partitioning
Split input data at tuple boundaries
Sorting
Sort each split on the index key
Bulk Loading
Bulk-load a CSS (Cache-Sensitive Search) tree index over each sorted split
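A CSS tree is bulk-loaded bottom-up over already-sorted keys; as a hedged stand-in, this sketch bulk-loads a plain sorted array of (key, offset) pairs and answers range lookups by binary search, which is the operation the CSS tree's cache-friendly node layout accelerates:

```python
# Hedged stand-in for a bulk-loaded CSS-tree index over a sorted split.
import bisect

def bulk_load(sorted_records):
    """sorted_records: (index_key, record_offset) pairs, already sorted."""
    keys = [k for k, _ in sorted_records]
    offsets = [o for _, o in sorted_records]
    return keys, offsets

def lookup_range(index, low, high):
    """Return offsets of all records with low <= key <= high."""
    keys, offsets = index
    lo = bisect.bisect_left(keys, low)
    hi = bisect.bisect_right(keys, high)
    return offsets[lo:hi]

# Offsets assume the 108-byte fixed-length records of the binary format.
idx = bulk_load([(1, 0), (5, 108), (9, 216), (12, 324)])
# lookup_range(idx, 5, 10) == [108, 216]
```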
HID Split
Construct a Header-Index-Data (HID) split
Header fields:
●Index end offset
●Data end offset
●Start index key
●End index key
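An illustrative layout for the HID split header (the exact field widths are assumptions): four fixed-size fields let a reader decide from the header alone where the index and data regions end and whether the split can contain qualifying keys at all.

```python
# Hedged sketch of an HID (Header-Index-Data) split header layout.
import struct

# index end offset, data end offset, start key, end key: 8 bytes each.
HEADER_FMT = "<qqqq"
HEADER_SIZE = struct.calcsize(HEADER_FMT)  # 32 bytes

def write_header(index_end, data_end, start_key, end_key):
    return struct.pack(HEADER_FMT, index_end, data_end, start_key, end_key)

def read_header(buf):
    return struct.unpack_from(HEADER_FMT, buf, 0)

hdr = write_header(4096, 1 << 20, 100, 999)
index_end, data_end, start_key, end_key = read_header(hdr)
# A selection on key > 1000 can skip this whole split: end_key is 999.
```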
Query Execution
●Partitioning
●Split selection
●Index lookup
●Extractor
Partitioning
Read the header to get HID boundaries
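Split selection can then be sketched as follows (assumed logic, not the project's code): after reading each HID header, the job only schedules splits whose [start key, end key] range can intersect the selection predicate.

```python
# Hedged sketch of header-based split selection during query execution.
def select_splits(headers, low, high):
    """headers: list of (split_id, start_key, end_key) read from HID headers.
    Keep splits whose key range overlaps the query range [low, high]."""
    return [sid for sid, s, e in headers if s <= high and e >= low]

headers = [(0, 1, 50), (1, 51, 120), (2, 121, 300)]
# A selection WHERE key > 100 only needs to scan splits 1 and 2:
pruned = select_splits(headers, 101, 10**9)
# pruned == [1, 2]
```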