Google Bigtable
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber
Google, Inc.
OSDI 2006
Adapted by S. Sudarshan from a talk by Erik Paulson, UW Madison

Google Scale
• Lots of data
  • Copies of the web, satellite data, user data, email and USENET, Subversion backing store
• Many incoming requests
• No commercial system big enough
  • Couldn’t afford it if there was one
  • Might not have made appropriate design choices
• Firm believers in the End-to-End argument
• 450,000 machines (NYTimes estimate, June 14th, 2006)

Building Blocks
• Scheduler (Google WorkQueue)
• Google Filesystem
• Chubby Lock service
• Two other pieces helpful but not required
  • Sawzall
  • MapReduce (despite what the Internet says)
• BigTable: build a more application-friendly storage service using these parts

Google File System
• Large-scale distributed “filesystem”
• Master: responsible for metadata
• Chunk servers: responsible for reading and writing large chunks of data
• Chunks replicated on 3 machines, master responsible for ensuring replicas exist
• OSDI ’04 Paper

Chubby
• {lock/file/name} service
• Coarse-grained locks, can store small amount of data in a lock
• 5 replicas, need a majority vote to be active
• Also an OSDI ’06 Paper

Data model: a big map
• <Row, Column, Timestamp> triple for key: lookup, insert, and delete API
• Arbitrary “columns” on a row-by-row basis
  • Column family:qualifier. Family is heavyweight, qualifier lightweight
  • Column-oriented physical store: rows are sparse!
• Does not support a relational model
  • No table-wide integrity constraints
  • No multirow transactions
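The "big map" idea above can be sketched in a few lines of Python. This is only an illustration of the data model, not Bigtable's API: the class name `SparseTable` and its method names are hypothetical, and everything is held in one process's memory.

```python
from collections import defaultdict


class SparseTable:
    """Toy model of a (row, column, timestamp) -> value map (hypothetical API)."""

    def __init__(self):
        # row -> "family:qualifier" -> {timestamp: value}.
        # Cells that were never written take no space, so rows are sparse.
        self._rows = defaultdict(lambda: defaultdict(dict))

    def insert(self, row, column, timestamp, value):
        self._rows[row][column][timestamp] = value

    def lookup(self, row, column, timestamp=None):
        """Return the newest value at or before `timestamp` (newest overall if None)."""
        versions = self._rows.get(row, {}).get(column, {})
        candidates = [t for t in versions if timestamp is None or t <= timestamp]
        return versions[max(candidates)] if candidates else None

    def delete(self, row, column):
        """Drop every version of one cell; other cells in the row are untouched."""
        self._rows.get(row, {}).pop(column, None)


table = SparseTable()
table.insert("com.example.www", "anchor:home", 1, "Example")
table.insert("com.example.www", "anchor:home", 5, "Example Inc.")
print(table.lookup("com.example.www", "anchor:home"))     # newest version
print(table.lookup("com.example.www", "anchor:home", 3))  # version as of t=3
```

Each operation touches a single row, which matches the "no multirow transactions" restriction on the slide.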

SSTable
• Immutable, sorted file of key-value pairs
• Chunks of data plus an index
  • Index is of block ranges, not values
[Diagram: an SSTable made of 64K blocks plus an index]
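A minimal sketch of how an index over block ranges can serve lookups, assuming an in-memory stand-in for the file. The class name `SSTable`, the `block_size` parameter (counted in entries here rather than 64K bytes), and the choice to index each block by its last key are assumptions made for illustration.

```python
import bisect


class SSTable:
    """Immutable sorted key-value store split into blocks, with a block index."""

    def __init__(self, mapping, block_size=2):
        pairs = sorted(mapping.items())  # sorted once, never modified afterwards
        self._blocks = [
            pairs[i:i + block_size] for i in range(0, len(pairs), block_size)
        ]
        # The index records only the last key of each block (a block range),
        # not individual values.
        self._index = [block[-1][0] for block in self._blocks]

    def get(self, key):
        # Find the first block whose last key is >= key.
        i = bisect.bisect_left(self._index, key)
        if i == len(self._index):
            return None  # key is larger than every key in the table
        # Only this one block has to be read and scanned.
        for k, v in self._blocks[i]:
            if k == key:
                return v
        return None


sst = SSTable({"aardvark": 1, "apple": 2, "boat": 3, "cat": 4, "dog": 5})
print(sst.get("boat"))
```

Because the index is small, it can stay in memory, so a lookup needs at most one block read.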

Tablet
• Contains some range of rows of the table
• Built out of multiple SSTables
[Diagram: a tablet with start row aardvark and end row apple, built from two SSTables, each made of 64K blocks plus an index]

Table
• Multiple tablets make up the table
• SSTables can be shared
• Tablets do not overlap, SSTables can overlap
[Diagram: two tablets, with row boundaries labelled aardvark, apple, apple_two_E, and boat, built over four SSTables]

Finding a tablet
• Stores: key = table id + end row, data = location
• Cached at clients, which may detect data to be incorrect
  • in which case, lookup on hierarchy performed
• Also prefetched (for range queries)
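Keying each entry by table id plus end row means a row's tablet is the first entry whose end row is at or after that row, which is a binary search. The sketch below assumes a single flat, in-memory list instead of the real lookup hierarchy; `TabletLocator`, the server names, and the `"\uffff"` sentinel for the last tablet's end row are all hypothetical.

```python
import bisect


class TabletLocator:
    """Maps (table id, row) to a tablet location via sorted (table id, end row) keys."""

    def __init__(self, entries):
        # entries: iterable of ((table_id, end_row), location)
        self._entries = sorted(entries)
        self._keys = [key for key, _ in self._entries]

    def locate(self, table_id, row):
        # First entry with (table_id, end_row) >= (table_id, row): end rows are inclusive.
        i = bisect.bisect_left(self._keys, (table_id, row))
        if i < len(self._keys) and self._keys[i][0] == table_id:
            return self._entries[i][1]
        return None  # unknown table, or row past the last tablet


locator = TabletLocator([
    (("webtable", "apple"), "tabletserver-1"),
    (("webtable", "boat"), "tabletserver-2"),
    (("webtable", "\uffff"), "tabletserver-3"),  # sentinel end row (assumption)
])
print(locator.locate("webtable", "apple_two_E"))
```

A client cache would hold results like these and fall back to a fresh lookup when a tablet server reports that it no longer serves the row.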

Servers
• Tablet servers manage tablets, multiple tablets per server. Each tablet is 100-200 MB
  • Each tablet lives at only one server
  • Tablet server splits tablets that get too big
• Master responsible for load balancing and fault tolerance

Master’s Tasks
• Use Chubby to monitor health of tablet servers, restart failed servers
  • Tablet server registers itself by getting a lock in a specific Chubby directory
  • Chubby gives “lease” on lock, must be renewed periodically
  • Server loses lock if it gets disconnected
  • Master monitors this directory to find which servers exist/are alive
  • If server not contactable/has lost lock, master grabs lock and reassigns tablets
• GFS replicates data. Prefer to start tablet server on same machine that the data is already at

Master’s Tasks (Cont)
• When (new) master starts
  • grabs master lock on Chubby
    • Ensures only one master at a time
  • Finds live servers (scan Chubby directory)
  • Communicates with servers to find assigned tablets
  • Scans metadata table to find all tablets
    • Keeps track of unassigned tablets, assigns them
    • Metadata root from Chubby, other metadata tablets assigned before scanning

Metadata Management
• Master handles table creation and merging of tablets
• Tablet servers directly update metadata on tablet split, then notify master
  • lost notification may be detected lazily by master

Editing a table
• Mutations are logged, then applied to an in-memory memtable
  • May contain “deletion” entries to handle updates
  • Group commit on log: collect multiple updates before log flush
[Diagram: a tablet covering apple_two_E to boat; the memtable is in memory, the tablet log and SSTables are in GFS, and the log holds a sequence of insert and delete entries]
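The write path above (log first, then memtable, with deletions recorded as entries and several updates sharing one log flush) can be sketched as follows. A Python list stands in for the commit log in GFS, and `Tablet`, `group_size`, and `TOMBSTONE` are illustrative names, not Bigtable's.

```python
TOMBSTONE = object()  # marker stored in the memtable for a deleted key


class Tablet:
    def __init__(self, group_size=2):
        self.log = []        # stand-in for the durable commit log
        self.memtable = {}   # in-memory view of recent mutations
        self.flushes = 0     # number of log writes performed
        self._pending = []   # mutations waiting for the next group commit
        self._group_size = group_size

    def mutate(self, op, key, value=None):
        """Queue an 'insert' or 'delete'; flush once a full group has accumulated."""
        self._pending.append((op, key, value))
        if len(self._pending) >= self._group_size:
            self.flush()

    def flush(self):
        if not self._pending:
            return
        # 1. One log write covers the whole group (group commit).
        self.log.extend(self._pending)
        self.flushes += 1
        # 2. Only after logging are the mutations applied to the memtable.
        for op, key, value in self._pending:
            self.memtable[key] = value if op == "insert" else TOMBSTONE
        self._pending.clear()

    def get(self, key):
        value = self.memtable.get(key)
        return None if value is TOMBSTONE else value


tablet = Tablet(group_size=2)
tablet.mutate("insert", "apple_two_E", "v1")
tablet.mutate("delete", "boat")
print(tablet.flushes, tablet.get("apple_two_E"), tablet.get("boat"))
```

In this sketch a mutation becomes visible only once its group is flushed; a real server would also flush on a timer so a lone update is not delayed indefinitely.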

Compactions
• Minor compaction: convert the memtable into an SSTable
  • Reduce memory usage
  • Reduce log traffic on restart
• Merging compaction
  • Reduce number of SSTables
  • Good place to apply policy “keep only N versions”
• Major compaction
  • Merging compaction that results in only one SSTable
  • No deletion records, only live data
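The difference between a merging and a major compaction can be shown with dictionaries standing in for SSTables. This is a sketch under simplifying assumptions: one version per key, tables passed oldest first, and a hypothetical `TOMBSTONE` marker for deletion records.

```python
TOMBSTONE = object()  # deletion record


def merging_compaction(sstables, major=False):
    """Merge SSTables (oldest first) into one sorted table.

    A non-major merge keeps deletion records, because an older SSTable that is
    not part of this merge may still hold the deleted key. A major compaction
    covers every SSTable, so deletion records can be dropped.
    """
    merged = {}
    for table in sstables:
        merged.update(table)  # newer entries overwrite older ones
    if major:
        merged = {k: v for k, v in merged.items() if v is not TOMBSTONE}
    return dict(sorted(merged.items()))


older = {"apple": "v1", "boat": "v1"}
newer = {"apple": TOMBSTONE, "cat": "v1"}
print(list(merging_compaction([older, newer])))              # apple kept as a tombstone
print(list(merging_compaction([older, newer], major=True)))  # only live data
```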

Locality Groups
• Group column families together into an SSTable
  • Avoid mingling data, e.g. page contents and page metadata
  • Can keep some groups all in memory
• Can compress locality groups
• Bloom Filters on SSTables in a locality group
  • bitmap on key-value hash, used to overestimate which records exist
  • avoid searching SSTable if bit not set
• Tablet movement
  • Major compaction (with concurrent updates)
  • Minor compaction (to catch up with updates) without any concurrent updates
  • Load on new server without requiring any recovery action
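The Bloom filter bullets above describe a bitmap that may report a key as present when it is not, but never the reverse. A minimal sketch, assuming SHA-256 with a seed prefix as the hash family and one byte per bit for readability; the sizes and the class name are arbitrary choices, not values from the talk.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means "definitely absent": the SSTable need not be searched.
        # True means "possibly present": false positives are allowed.
        return all(self.bits[pos] for pos in self._positions(key))


bloom = BloomFilter()
bloom.add("com.example.www/contents:")
print(bloom.might_contain("com.example.www/contents:"))
```

A read would consult the filter for each SSTable first and skip the disk access whenever `might_contain` returns `False`.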

Log Handling
• Commit log is per server, not per tablet (why?)
  • complicates tablet movement
  • when server fails, tablets divided among multiple servers
    • can cause heavy scan load by each such server
    • optimization to avoid multiple separate scans: sort log by (table, rowname, LSN), so logs for a tablet are clustered, then distribute
• GFS delay spikes can mess up log write (time critical)
  • solution: two separate logs, one active at a time
  • can have duplicates between these two
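Sorting the shared log by (table, rowname, LSN) makes each tablet's entries one contiguous run, so a recovering server can read just its slice. The sketch below assumes an in-memory log; `LogEntry`, `cluster_log`, and `entries_for_tablet` are hypothetical names, and a tablet is taken to cover rows after `start_row` up to and including `end_row`.

```python
import bisect
from collections import namedtuple

LogEntry = namedtuple("LogEntry", ["table", "row", "lsn", "mutation"])


def cluster_log(entries):
    """Sort a per-server commit log so that each tablet's entries are adjacent."""
    return sorted(entries, key=lambda e: (e.table, e.row, e.lsn))


def entries_for_tablet(clustered, table, start_row, end_row):
    """Return the contiguous slice for rows in (start_row, end_row] of `table`."""
    keys = [(e.table, e.row) for e in clustered]
    lo = bisect.bisect_right(keys, (table, start_row))
    hi = bisect.bisect_right(keys, (table, end_row))
    return clustered[lo:hi]


log = [
    LogEntry("webtable", "boat", 1, "insert"),
    LogEntry("webtable", "apple", 2, "insert"),
    LogEntry("users", "apple", 3, "insert"),
    LogEntry("webtable", "apple", 4, "delete"),
]
clustered = cluster_log(log)
for entry in entries_for_tablet(clustered, "webtable", "", "apple"):
    print(entry.lsn, entry.mutation)
```

Within a tablet's slice, entries for the same row stay in LSN order, so replaying them in sequence reproduces the original update order for that row.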

Immutability
• SSTables are immutable
  • simplifies caching, sharing across GFS etc
  • no need for concurrency control
  • SSTables of a tablet recorded in METADATA table
  • Garbage collection of SSTables done by master
  • On tablet split, split tablets can start off quickly on shared SSTables, splitting them lazily
• Only memtable has reads and updates concurrent
  • copy on write rows, allow concurrent read/write

Microbenchmarks

Application at Google

Lessons learned
• Interesting point: only implement some of the requirements, since the last is probably not needed
• Many types of failure possible
• Big systems need proper systems-level monitoring
• Value simple designs
