Spanner A distributed database system Presented by Yue Xia

Background - Developed by Google initially as a key-value storage system - Developers want traditional database features like query language - Evolved to a full featured SQL system - Now used by teams across many parts of Google and Alphabet

Overview - Distributed computing - Architecture - Replication - Partitioning - Interleaving - Range Extraction - Distributed Union - Distributed Join - Data storage - PAX - LSM Tree - Conclusion and Discussion

Replication - Multiple datacenters in different geographic locations - Data replicated in each datacenter - Run query on nearest datacenter

Partitioning Id Name Department - Each datacenter has multiple servers 3 Alice ‘A’ Shard 1: Department 2 Eve ‘A’ = ‘A’ - Data row-range sharded (partitioned) Shard 2: 1 Carol ‘B’ Department - Shards distributed across servers in each 4 Bob ‘C’ = ‘B’ to ‘C‘ datacenter Shard 3: 0 George ‘D’ Department 5 Fred ‘D’ = ‘D’

Interleaving Parent & child tables interleaved Customer Id Customer Name and co-located Order # Price customer 1 Alice order 1 $2 Customer Join Order only needs one scan of the interleaved table order 2 $4 customer 2 Bob order 1 $6

Query Execution 1. Go to the nearest datacenter 2. Extract key range 3. Run query only on shards covering the key range

Range Extraction - Goal Given a query, we want to know: - What shards to access - What fragments of shards to access (seek into smaller key ranges instead of scanning the full shard)

Range Extraction - Filter Tree AND Example: scan Table filter A=1 && ((B=’a’ && C=1) A=1 OR || (B>’a’ && C=2) AND AND Construct a tree according to the filter condition. B=’a’ C=1 B>’a’ C=2

Range Extraction - Filter Tree AND Example: scan Table filter A=1 && ((B=’a’ && C=1) A=1 OR || (B>’a’ && C=2) A:[1,1] AND AND First find the range for A. Assign leaf node an initial B=’a’ C=1 B>’a’ C=2 interval A:[- ∞ , ∞ ] A:[- ∞ , ∞ ] A:[- ∞ , ∞ ] A:[- ∞ , ∞ ]

Range Extraction - Filter Tree A:[1,1] ∩ [- ∞ , ∞ ] = [1,1] AND Find the interval for each node from bottom to top. A:[- ∞ , ∞ ] AND is intersection and OR A=1 OR is union. A:[1,1] A:[- ∞ , ∞ ] A:[- ∞ , ∞ ] AND AND The range for A is [1,1] B=’a’ C=1 B>’a’ C=2 A:[- ∞ , ∞ ] A:[- ∞ , ∞ ] A:[- ∞ , ∞ ] A:[- ∞ , ∞ ]

Range Extraction - Filter Tree A:[1,1] AND Then find the range for B B:[- ∞ , ∞ ] ∩ [‘a’, ∞ ] = [‘a’, ∞ ] A:[- ∞ , ∞ ] A=1 OR B:[‘a’,’a’]U(‘a’, ∞ ]=[‘a’, ∞ ] A:[1,1] B:[- ∞ , ∞ ] A:[- ∞ , ∞ ] AND A:[- ∞ , ∞ ] AND B:[‘a’,’a’] ∩[ - ∞ , ∞ ] = [‘a’,’a’] B:(’a’, ∞ ] ∩[ - ∞ , ∞ ] = (’a’, ∞ ] B=’a’ C=1 B>’a’ C=2 A:[- ∞ , ∞ ] A:[- ∞ , ∞ ] A:[- ∞ , ∞ ] A:[- ∞ , ∞ ] B:[‘a’,’a’] B:[- ∞ , ∞ ] B:(‘a’, ∞ ] B:[- ∞ , ∞ ]

Range Extraction - Filter Tree A:[1,1] AND Then C B:[‘a’, ∞ ] C:[1,1] if B=’a’, [2,2] if B>’a’ Note that C’s range A:[- ∞ , ∞ ] B:[‘a’, ∞ ] depends on B A=1 OR C:[1,1] if B=’a’, [2,2] if B>’a’ A:[1,1] B:[- ∞ , ∞ ] A:[- ∞ , ∞ ] C:[- ∞ , ∞ ] AND A:[- ∞ , ∞ ] AND B:[‘a’,’a’] B:(’a’, ∞ ] C:[1,1] C:[2,2] B=’a’ C=1 B>’a’ C=2 A:[- ∞ , ∞ ] A:[- ∞ , ∞ ] A:[- ∞ , ∞ ] A:[- ∞ , ∞ ] B:[‘a’,’a’] B:[- ∞ , ∞ ] B:(‘a’, ∞ ] B:[- ∞ , ∞ ] C:[- ∞ , ∞ ] C:[1,1] C:[- ∞ , ∞ ] C:[2, 2 ]

Range Extraction - Query Rewrite Rewrite filtered scan to self-join JOIN scan Table filter A=1 && ((B=’a’ && C=1) || JOIN (B>’a’ && C=2) becomes: Scan3(Table) Filter A=@A And B=@B Where the conditions Scan2(Table) And C = 1 if B=‘a’, C=2 if B>’a’ Scan1(Table) Filter A=@A output columns of interest in red are from the filter Filter A in [1,1] AND B in [‘a’,∞] output @A tree output @A,@B

Range Extraction - Execution Scan1(Table) Filter A in [1,1] record id A B C output @A 1 1 ‘9’ 0 2 1 ‘a’ 0 Output: 1 3 1 ‘a’ 1 4 1 ‘ab’ 0 5 1 ‘b’ 1 6 2 ‘c’ 2 …….

Range Extraction - Execution Scan2(Table) JOIN Filter A=@A AND B in [‘a’,∞] output @A, @B A record id A B C Output: 1 1 1 ‘9’ 0 1, ‘a’ 2 1 ‘a’ 0 1, ’ab’, 3 1 ‘a’ 1 1, ‘b’, 4 1 ‘ab’ 0 5 1 ‘b’ 1 6 2 ‘c’ 2 …….

Range Extraction - Execution Scan3(Table) JOIN Filter A=@A And B=@B record id A B C A B And C = 1 if B=‘a’, C=2 if B>’a’ output columns of interest 1 ‘a’ 1 1 ‘9’ 0 (output record_id if the table is sharded 1 ‘ab’ 2 1 ‘a’ 0 according to record_id) 1 ‘b’ 3 1 ‘a’ 1 Output: 4 1 ‘ab’ 0 3 5 1 ‘b’ 1 6 2 ‘c’ 2 …….

Range Extraction - Query Rewrite Join 3 tables. JOIN Too slow? JOIN Scan3(Table) Filter A=@A And B=@B Scan2(Table) And C = 1 if B=‘a’, C=2 if B>’a’ Scan1(Table) Filter A=@A output columns of interest Filter A in [1,1] AND B in [‘a’,∞] output @A output @A, @B

Range Extraction - Actual Execution Scan1(Table) Filter A in [1,1] record id A B C output @A 1 1 ‘9’ 0 2 1 ‘a’ 0 Output 1 without accessing the data as A is fixed to 1. 3 1 ‘a’ 1 4 1 ‘ab’ 0 5 1 ‘b’ 1 6 2 ‘c’ 2 …….

Range Extraction - Execution Scan2(Table) JOIN Filter A=@A AND B in [‘a’,∞] output @B A record id A B C Seek the first record with A = 1 and B = ‘a’ 1 1 1 ‘9’ 0 instead of scanning the whole table. 2 1 ‘a’ 0 3 1 ‘a’ 1 Record 2 is found. Output 1, ‘a’ 4 1 ‘ab’ 0 Then seek the next record with A = 1 and B 5 1 ‘b’ 1 != ‘a’, which is record 4. Output 1, ‘ab’ (skip records with B=’a’) 6 2 ‘c’ 2 …….

Range Extraction - Execution Scan3(Table) JOIN Filter A=@A And B=@B record id A B C A B And C = 1 if B=‘a’, C=2 if B>’a’ output columns of interest 1 ‘a’ 1 1 ‘9’ 0 1 ‘ab’ 2 1 ‘a’ 0 Similar to Scan2, seek instead of scan. 1 ‘b’ 3 1 ‘a’ 1 Can finger 4 1 ‘ab’ 0 5 1 ‘b’ 1 6 2 ‘c’ 2 …….

Range Extraction - Conclusion - Filter Tree: find range for each column values. - Rewrite filter to multiple self joins. - Execute with seek instead of scan.

Distributed Union - A new relational algebra operator - Send subquery to each shard and concatenate results

Distributed Union - Replace scan with distributed union of scan Scan(Table) -> DistributedUnion[shard ⊆ T](Scan(shard)) - Pull distributed union above as many operations as possible. (push computation to each server)

Distributed Union Some operations can be directly rewritten by: Op(DistributedUnion[shard ⊆ T](Scan(shard))) -> DistributedUnion[shard ⊆ T](Op(Scan(shard))) Example: - Basic operations like projection, filtering… - Group by K or Ordered by K if sharded according to K - Join of interleaved table

Distributed Union Some need extra processing: Op(DistributedUnion[shard ⊆ T](Scan(shard))) -> Op_Final(DistributedUnion[shard ⊆ T](Op_Local(Scan(shard)))) Example: - Top(5) can be done by finding the top 5 in each shard and then finding the top 5 among the results from all shards

Distributed Union - Optimization - Multiple levels of distributed union Distributed Union - On large shards, further parallelize between subshards Distributed Distributed Union Union - Detect locally hosted shards and avoid remote call Shard Shard Shard Shard 1 2 3 4

Distributed Union - Optimization - Range extraction: Extract key range Map key range to shards Send to min # of servers such that they contain all the required shards Only run query on required shards

Batched Distributed Join - Join can also be distributed - Send batches of left table to each shards - Join batch with local shards - Union

Batched Distributed Join - Optimization - Select left table to fit in a batch - Range extraction for each batch - Construct the minimum batch to be sent to each shard

Data Storage - PAX - Data stored in Partition Attributes Across (PAX) layout - Records are horizontally partitioned in pages - In each page all values of each attribute are grouped together - Greatly improves cache performance

Data Storage - PAX Table PAX Page Id Name Age 0,1,2,... 0 Alice 15 Alice,Bob,Carol,... 1 Bob 20 2 Carol 25 15,20,25,... ... ... ...

Data Storage - PAX Not Used N-ary Storage Model (NSM) Page Cache Table 0,Alice,15|1,Bob,20| Bob,20,2 2,Carol,25|... Id Name Age 0 Alice 15 1 Bob 20 PAX Page Cache 2 Carol 25 0,1,2,... 15,20,25 ... ... ... Alice,Bob,Carol,... 15,20,25,... - SELECT age WHERE age> 20 - Cache miss will cache the asked value and the values next to it

Data Storage - LSM Tree - Insert, update or delete would require rewriting the whole file. Id Age Id Age 0 5 0 5 3 10 insert 1 15 3 10 2 20 1 15 2 20

Spanner A distributed database system Presented by Yue Xia - PowerPoint PPT Presentation

Spanner A distributed database system Presented by Yue Xia Background - Developed by Google initially as a key-value storage system - Developers want traditional database features like query language - Evolved to a full featured SQL

Cloud Spanner Rohit Gupta, Solutions Engineer @rohitforcloud Todays goals Provide a brief

Spanner : Google's Globally-Distributed Database James Sedgwick and Kayhan Dursun Spanner - A

Geometric Spanner Networks Spanner Networks M. Farshi Course Outline Mohammad Farshi Textbook

Geometric Spanner Networks Course Outline Textbook Introduction Algorithms Review Greedy

2PC, Linearizability, Spanner 2020-04-17 Nikita Borisov - UIUC 12 Topics for Today

Spanner: Googles Globally-Distributed Database Wilson Hsieh representing a host of authors

Spanner: Googles Globally-Distributed Database Corbett, Dean, et al. Jinliang Wei CMU CSD

The 5 -graph is a spanner Prosenjit Bose, Pat Morin, Andr e van Renssen and Sander

Bigtable, Spanner and Flat Datacenter Storage by Onur Karaman and Karan Parikh Introducing

Competitive Routing on a Bounded-Degree Plane Spanner Prosenjit Bose, Rolf Fagerberg, Andr e

External Consistency and Spanner CS425/ECE428 SPRING 2020 NIKITA BORISOV, UIUC Transactions

Spanner Doug Woos (based on slides by Dan Ports) Bigtable in retrospect Definitely a useful,

Spanner Stephanie New Overview Scalable, multi-version, globally distributed, and synchronously

Building Spanner Better clocks stronger semantics Alex Lloyd Senior Staff Software Engineer

A Spanner for the Day After Kevin Buchin 1 Sariel Har-Peled 2 ah 1 D aniel Ol 1 Eindhoven

t-spanners for Transmission Graphs Using the Path-Greedy Algorithm Stav Ashur and Paz Carmi

Scaling Machine Learning Rahul Dave, for cs109b github https://github.com/rahuldave/dasktut

Reviewing PubChem laboratory chemical safety summaries for different user types Brian Murphy

Algebraic and Logical Query Languages Thomas Schwarz, SJ Bags, Lists, Sets Bags are

Homing and Synchronizing Sequences Sven Sandberg Information Technology Department Uppsala

Numeral systems & data structures Jyrki Katajainen 1) Amr Elmasry 2) and Claus Jensen 3) 1)

Towards Principled Methodologies and Efficient Algorithms for Minimax Machine Learning Tuo Zhao

Array New Syllabus 2019-20 Visit : python.mykvs.in for regular updates NUMPY - ARRAY NumPy

Neural Classification of Linguistic Coherence using Long Short-Term Memories Pashutan Modaresi,