

  1. Dremel: Interactive Analysis of Web-Scale Datasets
     Presented by Jian Fang
     Most of these slides are borrowed from: http://bit.ly/HIPzeG

  2. What is Dremel
     - Trillion-record, multi-terabyte datasets at interactive speed
     - Scales to thousands of nodes
     - Fault- and straggler-tolerant execution
     - Nested data model
       - Complex datasets; normalization is prohibitive
     - Columnar storage and processing
     - Tree architecture (as in web search)
     - Interoperates with Google's data management tools
       - In situ data access (e.g., GFS, Bigtable)
       - MapReduce pipelines

  3. Widely used inside Google
     - Analysis of crawled web documents
     - Tracking install data for applications on Android Market
     - Spam analysis
     - Results of tests run on Google's distributed build system
     - Disk I/O statistics for hundreds of thousands of disks
     - ...

  4. Outline
     - Nested columnar storage
     - Query processing
     - Experiments
     - Observations

  5. Common Storage Layer
     - Google File System
     - Fault tolerance
     - Fast response time
     - Data can be manipulated easily

  6. Rows vs Columns
     [Figure: the same records r1, r2 stored record-wise (row by row) vs. column-striped (field by field)]
     - Columnar layout: read less data, cheaper decompression
     - Challenge: preserve the record structure and reconstruct records from a subset of fields
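To make the layout difference concrete, here is a minimal Python sketch (not from the slides; the field names follow the running Document example, nesting is ignored for now) contrasting record-wise and column-striped storage. A query that touches only DocId scans one short list in the columnar layout instead of every record.

```python
# Hypothetical toy illustration of the two layouts; flat records only.
row_store = [
    {"DocId": 10, "Name.Url": "http://A", "Links.Forward": 20},
    {"DocId": 20, "Name.Url": "http://C", "Links.Forward": 80},
]

column_store = {
    "DocId":         [10, 20],
    "Name.Url":      ["http://A", "http://C"],
    "Links.Forward": [20, 80],
}

# Row store: every record is touched even though only DocId is needed.
doc_ids_rows = [rec["DocId"] for rec in row_store]

# Column store: only the DocId column is read (and decompressed).
doc_ids_cols = column_store["DocId"]

assert doc_ids_rows == doc_ids_cols == [10, 20]
```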

  7. Nested Data Model

     Schema (protocol buffer definition):

         message Document {
           required int64 DocId;
           optional group Links {
             repeated int64 Backward;
             repeated int64 Forward;
           }
           repeated group Name {
             repeated group Language {
               required string Code;
               optional string Country;
             }
             optional string Url;
           }
         }

     Sample records:

         r1:
           DocId: 10
           Links
             Forward: 20
             Forward: 40
             Forward: 60
           Name
             Language
               Code: 'en-us'
               Country: 'us'
             Language
               Code: 'en'
             Url: 'http://A'
           Name
             Url: 'http://B'
           Name
             Language
               Code: 'en-gb'
               Country: 'gb'

         r2:
           DocId: 20
           Links
             Backward: 10
             Backward: 30
             Forward: 80
           Name
             Url: 'http://C'

  8. Repetition and Definition Levels
     - Values alone do not convey the structure of a record
     - Repetition level: tells at which repeated field in the field's path the value has repeated
     - Example: Name.Language.Code in r1
       - 'en-us': R = 0 (start of a new record)
       - 'en': R = 2 (repeated at Name.Language)
       - NULL: R = 1 (second Name has no Language)
       - 'en-gb': R = 1 (repeated at Name)
     - Repetition levels for the column: [0, 2, 1, 1]

  9. Repetition and Definition Levels
     - Definition level: how many fields in the path that could be undefined (optional or repeated) are actually present in the record
     - Example: Name.Language.Country in r1
       - Language present but Country missing: d = 2
       - Language missing entirely (second Name): d = 1

  10. Column-striped representation

      DocId           Name.Url              Links.Forward     Links.Backward
      value  r  d     value      r  d       value  r  d       value  r  d
      10     0  0     http://A   0  2       20     0  2       NULL   0  1
      20     0  0     http://B   1  2       40     1  2       10     0  2
                      NULL       1  1       60     1  2       30     1  2
                      http://C   0  2       80     0  2

      Name.Language.Code     Name.Language.Country
      value   r  d           value  r  d
      en-us   0  2           us     0  3
      en      2  2           NULL   2  2
      NULL    1  1           NULL   1  1
      en-gb   1  2           gb     1  3
      NULL    0  1           NULL   0  1
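As a concrete illustration, the sketch below is a minimal Python stand-in for column striping (not Dremel's actual implementation): it walks a nested record, represented as plain dicts and lists, along one field path and emits (value, repetition level, definition level) triples. The record r1 and the field kinds mirror the Document schema from slide 7; the function and variable names are invented for the example.

```python
def strip_column(record, path, kinds):
    """Emit (value, repetition level, definition level) triples for one column.

    `kinds` maps each field on the path to 'required', 'optional' or 'repeated';
    only optional/repeated fields count toward the definition level.
    Records are plain dicts; repeated fields are lists of children.
    """
    out = []

    def walk(node, remaining, r, d, rep_depth):
        if not remaining:                       # reached the leaf value
            out.append((node, r, d))
            return
        field, rest = remaining[0], remaining[1:]
        if kinds[field] == 'repeated':
            children = node.get(field, [])
            if not children:                    # missing repeated group
                out.append((None, r, d))
                return
            for i, child in enumerate(children):
                # later siblings repeat at this field's repetition depth
                walk(child, rest, r if i == 0 else rep_depth + 1,
                     d + 1, rep_depth + 1)
        else:
            child = node.get(field)
            if child is None:                   # missing optional field
                out.append((None, r, d))
                return
            walk(child, rest, r,
                 d + (0 if kinds[field] == 'required' else 1), rep_depth)

    walk(record, list(path), 0, 0, 0)
    return out


# Record r1 from the "Nested Data Model" slide, as nested dicts/lists.
r1 = {
    'DocId': 10,
    'Links': {'Forward': [20, 40, 60]},
    'Name': [
        {'Language': [{'Code': 'en-us', 'Country': 'us'},
                      {'Code': 'en'}],
         'Url': 'http://A'},
        {'Url': 'http://B'},
        {'Language': [{'Code': 'en-gb', 'Country': 'gb'}]},
    ],
}

kinds = {'Name': 'repeated', 'Language': 'repeated',
         'Code': 'required', 'Country': 'optional'}

print(strip_column(r1, ['Name', 'Language', 'Code'], kinds))
# [('en-us', 0, 2), ('en', 2, 2), (None, 1, 1), ('en-gb', 1, 2)]
print(strip_column(r1, ['Name', 'Language', 'Country'], kinds))
# [('us', 0, 3), (None, 2, 2), (None, 1, 1), ('gb', 1, 3)]
```

The printed triples match the r1 portion of the Name.Language.Code and Name.Language.Country columns in the table above.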

  11. Record Assembly FSM
      [Figure: record assembly FSM for the full Document schema of slide 7; one state per field reader, transitions labeled with repetition levels]
      - DocId --0--> Links.Backward
      - Links.Backward --1--> Links.Backward, --0--> Links.Forward
      - Links.Forward --1--> Links.Forward, --0--> Name.Language.Code
      - Name.Language.Code --0,1,2--> Name.Language.Country
      - Name.Language.Country --2--> Name.Language.Code, --0,1--> Name.Url
      - Name.Url --1--> Name.Language.Code, --0--> DocId (end of record)

  12. Reading two fields
      Reduced FSM for reading only DocId and Name.Language.Country:
      - DocId --0--> Name.Language.Country
      - Name.Language.Country --1,2--> Name.Language.Country, --0--> end of record
      Assembled partial records:
      - s1: DocId: 10; Name { Language { Country: 'us' } }; Name; Name { Language { Country: 'gb' } }
      - s2: DocId: 20; Name
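The full assembly algorithm is not on the slides, but the key property the FSM exploits is easy to show: a repetition level of 0 always starts a new record. The sketch below is a deliberately simplified Python stand-in (not the paper's FSM) that uses only that property to regroup striped values by record; the column literals are copied from slide 10, and it recovers which values belong to s1 and s2, though not their full nested shape.

```python
def split_records(column):
    """Group a striped column's (value, r, d) triples back into per-record lists.

    Repetition level 0 marks the start of a new record, so no schema is needed.
    """
    records = []
    for value, r, d in column:
        if r == 0:
            records.append([])
        records[-1].append(value)
    return records


# Columns exactly as on the "Column-striped representation" slide.
docid_col   = [(10, 0, 0), (20, 0, 0)]
country_col = [('us', 0, 3), (None, 2, 2), (None, 1, 1), ('gb', 1, 3),
               (None, 0, 1)]

doc_ids = [value for value, r, d in docid_col]
for doc_id, countries in zip(doc_ids, split_records(country_col)):
    print(doc_id, countries)
# 10 ['us', None, None, 'gb']   -> values behind partial record s1
# 20 [None]                     -> values behind partial record s2
```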

  13. Query Processing
      - Optimized for select-project-aggregate
      - Very common class of interactive queries
      - Single scan
      - Within-record and cross-record aggregation
      - Approximations: count(distinct), top-k
      - Joins, temp tables, UDFs/TVFs, etc.

  14. Serving Tree
      [Figure: client -> root server -> intermediate servers -> leaf servers (with local storage) -> storage layer (e.g., GFS)]

  15. Example: count()
      Client query over T = {/gfs/1, /gfs/2, ..., /gfs/100000}:

          SELECT A, COUNT(B) FROM T GROUP BY A

      Level 0 (root server) rewrites it over the results R1_1 ... R1_10 of its children:

          SELECT A, SUM(c) FROM (R1_1 UNION ALL ... UNION ALL R1_10) GROUP BY A

      Level 1 (intermediate servers) compute the partial aggregates over tablet subsets:

          R1_i = SELECT A, COUNT(B) AS c FROM T1_i GROUP BY A
          T1_1 = {/gfs/1, ..., /gfs/10000}, T1_2 = {/gfs/10001, ..., /gfs/20000}, ...

      Level 3 (leaf servers) run the data access ops on individual tablets:

          SELECT A, COUNT(B) AS c FROM T3_1 GROUP BY A
          T3_1 = {/gfs/1}
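A hypothetical Python sketch of this rewriting, assuming a simple recursive tree with a fixed fan-out (the function names, fan-out, and toy tablets are invented; real Dremel schedules many thousands of tablets across leaf servers): leaves run the per-tablet GROUP BY COUNT, and every inner node merges its children's partial counts with a SUM.

```python
from collections import Counter

def leaf_scan(tablet):
    """Leaf server: SELECT A, COUNT(B) AS c FROM tablet GROUP BY A."""
    counts = Counter()
    for row in tablet:
        if row.get('B') is not None:     # COUNT(B) skips NULL B
            counts[row['A']] += 1
    return counts

def serve(tablets, fanout=2):
    """Root/intermediate server: SELECT A, SUM(c) FROM (children UNION ALL) GROUP BY A."""
    if len(tablets) == 1:
        return leaf_scan(tablets[0])
    merged = Counter()
    step = -(-len(tablets) // fanout)    # ceil division: tablets per child
    for i in range(0, len(tablets), step):
        merged += serve(tablets[i:i + step], fanout)   # children run in parallel in Dremel
    return merged

# Toy tablets standing in for {/gfs/1, ..., /gfs/100000}.
tablets = [
    [{'A': 'x', 'B': 1}, {'A': 'y', 'B': 2}],
    [{'A': 'x', 'B': 3}],
    [{'A': 'y', 'B': None}],
]
print(serve(tablets))                    # Counter({'x': 2, 'y': 1})
```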

  16. Experiments
      - 1 PB of real data (uncompressed, non-replicated)
      - 100K-800K tablets per table
      - Experiments run during business hours

      Table  Records      Size (unrepl., compressed)  Fields  Data center  Repl. factor
      T1     85 billion   87 TB                       270     A            3x
      T2     24 billion   13 TB                       530     A            3x
      T3     4 billion    70 TB                       1200    A            3x
      T4     1+ trillion  105 TB                      50      B            3x
      T5     1+ trillion  20 TB                       30      B            2x

  17. Read from disk
      [Chart: "cold" time on local disk vs. number of fields read, averaged over 30 runs.
       From columns: (a) read + decompress, (b) assemble records, (c) parse as C++ objects.
       From records: (d) read + decompress, (e) parse as C++ objects.]
      - About 10x speedup using columnar storage
      - 2-4x overhead of using records
      - Table partition: 375 MB (compressed), 300K rows, 125 columns

  18. MapReduce and Dremel Execution
      Task: average number of terms in txtField of the 85 billion record table T1.
      [Chart: execution time (sec) on 3000 nodes for MR-records (reads 87 TB), MR-columns (reads 0.5 TB), and Dremel (reads 0.5 TB)]

      Sawzall program run on MapReduce:

          num_recs: table sum of int;
          num_words: table sum of int;
          emit num_recs <- 1;
          emit num_words <- count_words(input.txtField);

      Equivalent Dremel query:

          Q1: SELECT SUM(count_words(txtField)) / COUNT(*) FROM T1

      MR overheads: launch jobs, schedule 0.5M tasks, assemble records.
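For readers unfamiliar with Sawzall, here is a rough Python rendering of what the MapReduce job computes (the helper names and toy records are invented for illustration; this is not Google's code): each mapper emits a record count of 1 and a word count for txtField, and the reducer sums both and divides, which is exactly what Q1 expresses in one line of SQL.

```python
def count_words(text):
    return len(text.split())

def map_phase(records):
    # One (record_count, word_count) pair per input record, mirroring
    # `emit num_recs <- 1; emit num_words <- count_words(input.txtField);`
    return [(1, count_words(rec['txtField'])) for rec in records]

def reduce_phase(pairs):
    num_recs  = sum(r for r, w in pairs)
    num_words = sum(w for r, w in pairs)
    return num_words / num_recs          # SUM(count_words(txtField)) / COUNT(*)

records = [{'txtField': 'hello world'}, {'txtField': 'dremel is fast'}]
print(reduce_phase(map_phase(records)))  # 2.5
```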

  19. Impact of serving tree depth
      [Chart: execution time (sec) for Q2 and Q3 at different serving tree depths]

          Q2: SELECT country, SUM(item.amount) FROM T2 GROUP BY country
              (40 billion nested items; returns 100s of records)

          Q3: SELECT domain, SUM(item.amount) FROM T2 WHERE domain CONTAINS '.net' GROUP BY domain
              (returns 1M records)

  20. Observations
      - Possible to analyze large disk-resident datasets interactively on commodity hardware
        - 1T records, 1000s of nodes
      - MR can benefit from columnar storage just like a parallel DBMS
        - But record assembly is expensive
      - Interactive SQL and MR can be complementary
      - Parallel DBMSes may benefit from serving tree architecture just like search engines

  21. More Information
      - BigQuery: http://code.google.com/apis/bigquery/
      - Apache Drill: http://incubator.apache.org/drill/index.html

      Thank You
