Dremel: Interactive Analysis of Web-Scale Datasets
By Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Theo Vassilakis
Presented by: Alex Zahdeh
1 / 32
Overview ● Scalable, interactive ad-hoc query system for analysis of read-only nested data ● Multi-level execution trees, columnar data layout ● Capable of aggregation queries over trillion-row tables in seconds ● Scales to thousands of CPUs and petabytes of data 2 / 32
Motivation ● Need to deal with vast amounts of data spread out over multiple commodity machines ● Interactive queries require speed ● Response times make a qualitative difference in many analysis tasks 3 / 32
Applications of Dremel
● Analysis of crawled web documents
● Tracking install data for applications on Android Market
● Crash reporting for Google products
● OCR results from Google Books
● Spam analysis
● Debugging of map tiles on Google Maps
● Disk I/O statistics for hundreds of thousands of disks
● Symbols and dependencies in Google's codebase
4 / 32
Data Exploration Example
1. Extract billions of signals from web pages using MapReduce
2. Ad hoc SQL query against Dremel:
DEFINE TABLE t AS /path/to/data/*
SELECT TOP(signal, 100), COUNT(*) FROM t
3. More MR-based processing
5 / 32
Background ● Requires a common storage layer – Google uses GFS ● Requires shared storage format – Protocol Buffers 6 / 32
Data Model (Protocol Buffers) ● Nested layout ● Each record consists of one or many data fields ● Fields have a name, type, and multiplicity ● Can specify optional/required fields ● Platform neutral ● Extensible 7 / 32
Data Model Example 8 / 32
Nested Columnar Storage ● Store all values of a given field consecutively ● Improves retrieval efficiency ● Challenges – Lossless representation of record structure in columnar format – Fast encoding and decoding (assembly) of records 9 / 32
Repetition Levels ● Need to disambiguate field repetition and record repetition ● A repetition level is stored with each value 10 / 32
Definition Levels ● Specifies how many fields that could be undefined are actually present in the record ● Stored with each value 11 / 32
Definition Levels Example 12 / 32
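The rule on the previous slide can be made concrete with a small sketch (my own helper name, not from the paper): a column's maximum definition level counts the optional and repeated fields along its path, since required fields are always present and contribute nothing.

```python
def max_definition_level(multiplicities):
    """Maximum definition level of a column = number of fields along
    its path that may be undefined (optional or repeated)."""
    return sum(1 for m in multiplicities if m in ("optional", "repeated"))

# Name.Language.Country: Name and Language are repeated, Country optional
print(max_definition_level(["repeated", "repeated", "optional"]))  # → 3
# Name.Language.Code: Code is required, so it adds nothing
print(max_definition_level(["repeated", "repeated", "required"]))  # → 2
```

These match the paper's Document schema, where Country values carry definition levels up to 3 and Code values up to 2.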
Encoding ● Each column stored as a set of blocks ● Each block contains: – Repetition levels – Definition levels – Compressed field values ● NULLs not explicitly stored (determined by definition level) 13 / 32
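As a minimal sketch of the last point (assuming a Python representation where a block is a list of (repetition, definition) level pairs plus values for defined entries only), NULLs can be recovered by comparing each definition level against the column's maximum:

```python
def decode_column(levels, values, max_def):
    """Recover a column's value stream from an encoded block.

    levels:  list of (repetition, definition) level pairs, one per entry
    values:  field values for defined entries only (NULLs are not stored)
    max_def: maximum definition level for this column's path
    """
    out, it = [], iter(values)
    for r, d in levels:
        # An entry is NULL exactly when its definition level is
        # below the maximum for the path.
        out.append(next(it) if d == max_def else None)
    return out

# Name.Language.Country column of the paper's sample record r1:
print(decode_column([(0, 3), (2, 2), (1, 1), (1, 3)], ["us", "gb"], 3))
# → ['us', None, None, 'gb']
```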
Splitting Records into Columns ● Create a tree of field writers whose structure matches the field hierarchy ● Update field writers only when they have their own data ● Don't propagate state down the tree unless absolutely necessary 14 / 32
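The splitting step can be sketched in Python for the simplified case of a path of repeated groups ending in an optional scalar leaf (the record follows the paper's Document example; the function name stripe and the dict/list encoding are my own assumptions, not the paper's field-writer implementation):

```python
def stripe(node, path, rep_depth=0, cur_r=0, out=None):
    """Emit (value, repetition, definition) triples for one column.

    node: a record as a dict; repeated groups are lists of dicts
    path: field names from the root down to the leaf
    rep_depth: repeated ancestors entered so far (= current def. level)
    cur_r: repetition level to attach to the next emitted value
    """
    if out is None:
        out = []
    field = path[0]
    if len(path) == 1:  # optional scalar leaf
        val = node.get(field)
        out.append((val, cur_r, rep_depth + (1 if val is not None else 0)))
        return out
    children = node.get(field, [])
    if not children:  # missing repeated group: emit NULL at this level
        out.append((None, cur_r, rep_depth))
        return out
    for i, child in enumerate(children):
        # Only the first child keeps the caller's repetition level;
        # later siblings repeat at this group's own level.
        stripe(child, path[1:], rep_depth + 1,
               cur_r if i == 0 else rep_depth + 1, out)
    return out

# Record r1 from the paper, restricted to the Name subtree:
r1 = {"Name": [
    {"Language": [{"Code": "en-us", "Country": "us"},
                  {"Code": "en"}],
     "Url": "http://A"},
    {"Url": "http://B"},
    {"Language": [{"Code": "en-gb", "Country": "gb"}]},
]}
print(stripe(r1, ["Name", "Language", "Country"]))
# → [('us', 0, 3), (None, 2, 2), (None, 1, 1), ('gb', 1, 3)]
```

The output reproduces the Name.Language.Country column from the paper's worked example, including the implicit NULLs with their lowered definition levels.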
Record Assembly ● Finite State Machine that reads the field values and levels and appends the values sequentially to output record ● States correspond to a field reader ● Transitions labeled with repetition levels 15 / 32
Record Assembly FSM 16 / 32
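For a single column the FSM collapses to a few transitions driven by the repetition level, with the definition level deciding how deep each entry is defined. A hedged sketch, specialized to the paper's Name.Language.Country column (the full FSM interleaves readers for every column in the query; this is only the one-column case):

```python
def assemble_country(column):
    """Reassemble the Name.Language.Country column into nested lists.

    Each record becomes a list of Name entries; each Name entry is a
    list of Country values (None where Language is defined but
    Country is not).
    """
    records = []
    for value, r, d in column:
        if r == 0:
            records.append([])      # transition: start a new record
        if r <= 1:
            records[-1].append([])  # transition: start a new Name entry
        name = records[-1][-1]
        if d >= 2:                  # Language defined at this entry
            name.append(value if d >= 3 else None)
    return records

col = [("us", 0, 3), (None, 2, 2), (None, 1, 1), ("gb", 1, 3)]
print(assemble_country(col))
# → [[['us', None], [], ['gb']]]
```

The empty middle list is the second Name entry of r1, which has a Url but no Language, exactly as the definition level 1 recorded.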
Query Language ● Based on SQL, designed to be efficiently implementable on columnar nested storage ● Each statement takes as input one or more nested tables and their schemas ● Produces a nested table and its output schema 17 / 32
Query Example 18 / 32
Query Execution ● Multi-level serving tree to execute queries ● Partitions of table spread out across leaf servers ● Queries aggregated on the way up ● Designed for "small" results (<1M records) 19 / 32
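The serving-tree idea above can be sketched as partial aggregation at the leaves followed by pairwise merging on the way up. A minimal sketch, assuming Counter stands in for a COUNT(*) ... GROUP BY aggregate and fanout for the tree's branching factor (both my simplifications, not Dremel's actual scheduler):

```python
from collections import Counter

def leaf_scan(tablet):
    """Leaf server: partial COUNT(*) grouped by value over one tablet."""
    return Counter(tablet)

def merge(parts):
    """Intermediate or root server: merge children's partial results."""
    total = Counter()
    for p in parts:
        total.update(p)
    return total

def serving_tree(tablets, fanout=2):
    """Aggregate leaf results up a tree with the given fan-out."""
    level = [leaf_scan(t) for t in tablets]
    while len(level) > 1:
        level = [merge(level[i:i + fanout])
                 for i in range(0, len(level), fanout)]
    return level[0]

tablets = [["a", "b"], ["a", "a"], ["b", "c"]]
print(serving_tree(tablets).most_common())
# → [('a', 3), ('b', 2), ('c', 1)]
```

Because the aggregate is decomposable, each level only ships small partial results upward, which is why the design targets "small" final results rather than large joins.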
Query Dispatcher ● Fault tolerance ● Job scheduling – Slots are available execution threads on leaf servers – Amount of data processed larger than number of slots ● Straggler tolerance – Redispatch work that is taking too long 20 / 32
Experiments ● Several datasets ● All tables three-way replicated ● Contain from 100k to 800k tablets of various sizes ● Goals – Examine access characteristics on a single machine – Show benefits of columnar storage for MR execution – Show Dremel's performance 21 / 32
Datasets 22 / 32
Record vs Column Storage
300k record fragment of Table T1 (1GB) used
23 / 32
MR vs Dremel (for aggregation queries) ● Single field access ● 3000 workers 24 / 32
Serving Tree Level Impact 25 / 32
Execution Time Histogram 26 / 32
Scaling Dremel 27 / 32
Query Response Distribution (1 month) 28 / 32
Observations
● Scan-based queries can be executed at interactive speeds on disk-resident datasets of up to 1 trillion records
● Near-linear scalability in the number of columns and servers is achievable for systems containing thousands of nodes
● MR benefits from columnar storage
● Record assembly and parsing are expensive
– Software layers need to be optimized to directly consume column-oriented data
● In a multi-user environment a larger system can benefit from economies of scale while offering a better user experience
● Can terminate queries much earlier and return most of the data to trade off speed and accuracy
● Getting to the last few percent within tight time bounds is hard
29 / 32
Related Work
● Large Scale Computing
– MapReduce, Hadoop
● Hybrid database / computation
– HadoopDB
● Parallel Data Processing
– Scope
– DryadLINQ
● Query Language
– Recursive Algebra and Query Optimizations for Nested Relations
– Pig
● Columnar Representation of Nested Data
– XMill
● Data Model
– Complex value models
– Nested relational models
30 / 32
Discussion Topics ● Assumes read-only queries; could this be extended to data cleaning systems that we have seen previously? – Replica consistency issues, etc. ● Protocol Buffers was changed to no longer support optional/required fields. Why might that be? ● How common are queries with "small" result sets? 31 / 32
Thanks for watching! 32 / 32