cockroachdb
play

CockroachDB Architecture of a Geo-Distributed SQL Database Nathan - PowerPoint PPT Presentation

CockroachDB Architecture of a Geo-Distributed SQL Database Nathan VanBenschoten (@natevanben), Staff Software Engineer CockroachDB: Geo-distributed SQL Database Make Data Easy Distributed Horizontally scalable to grow with your


  1. Transaction Atomicity lady carl pinetop lula peetey sooshi node 2 node 3 carl pinetop INSERT INTO dogs muddy dagne stella VALUES (sunny, ozzie) dagne sooshi ozzie* lady sunny* node 1 node 4 figment stella peetey zee jack sunny* transactions carl zee TXN1 : PENDING dagne transactions lady carl figment lula peetey TXN1 : PENDING jack lady carl muddy dagne lula peetey ozzie* lady A C carl muddy dagne peetey K dagne ozzie* lady pinetop figment peetey sooshi jack BEGIN TXN1 stella WRITE[sunny] sunny* WRITE[ozzie] zee transactions TXN1 : PENDING GATEWAY

  2. Transaction Atomicity lady carl pinetop lula peetey sooshi node 2 node 3 carl pinetop INSERT INTO dogs muddy dagne stella VALUES (sunny, ozzie) dagne sooshi ozzie* lady sunny* node 1 node 4 figment stella peetey zee jack sunny* transactions carl zee TXN1 : PENDING dagne transactions lady carl figment lula peetey TXN1 : PENDING jack lady carl muddy dagne lula peetey ozzie* lady carl muddy dagne peetey dagne ozzie* lady pinetop figment peetey sooshi jack BEGIN TXN1 stella WRITE[sunny] sunny* WRITE[ozzie] zee COMMIT transactions TXN1 : COMMITTED GATEWAY

  3. Transaction Atomicity lady carl pinetop lula peetey sooshi node 2 node 3 carl pinetop INSERT INTO dogs muddy dagne stella VALUES (sunny, ozzie) dagne sooshi ozzie* lady sunny* node 1 node 4 figment stella peetey zee jack sunny* transactions carl zee TXN1 : COMMITTED dagne transactions lady carl figment lula peetey TXN1 : COMMITTED jack lady carl muddy dagne lula peetey ozzie* lady carl muddy dagne peetey dagne ozzie* lady pinetop figment peetey sooshi jack BEGIN TXN1 stella WRITE[sunny] sunny* WRITE[ozzie] zee COMMIT transactions TXN1 : COMMITTED GATEWAY

  4. Transaction Atomicity lady carl pinetop lula peetey sooshi node 2 node 3 carl pinetop INSERT INTO dogs muddy dagne stella VALUES (sunny, ozzie) dagne sooshi ozzie* lady sunny* node 1 node 4 figment stella peetey zee K jack sunny* C transactions A carl zee TXN1 : COMMITTED dagne transactions lady carl figment lula peetey TXN1 : COMMITTED jack lady carl muddy dagne lula peetey ozzie* lady carl muddy dagne peetey dagne ozzie* lady pinetop figment peetey sooshi jack BEGIN TXN1 stella WRITE[sunny] sunny* WRITE[ozzie] zee COMMIT transactions TXN1 : COMMITTED GATEWAY

  5. Transaction Atomicity lady carl pinetop lula peetey sooshi node 2 node 3 carl pinetop INSERT INTO dogs muddy dagne stella VALUES (sunny, ozzie) dagne sooshi ozzie* lady sunny* node 1 node 4 figment stella peetey zee jack sunny* transactions carl zee TXN1 : COMMITTED dagne transactions lady carl figment lula peetey TXN1 : COMMITTED jack lady carl muddy dagne ACK lula peetey ozzie* lady carl muddy dagne peetey dagne ozzie* lady pinetop figment peetey sooshi jack BEGIN TXN1 stella WRITE[sunny] sunny* WRITE[ozzie] zee COMMIT transactions TXN1 : COMMITTED GATEWAY

  6. Transaction Atomicity lady carl pinetop lula peetey sooshi node 2 node 3 carl pinetop INSERT INTO dogs muddy dagne stella VALUES (sunny, ozzie) dagne sooshi ozzie lady sunny node 1 node 4 figment stella peetey zee jack sunny transactions carl zee TXN1 : COMMITTED dagne transactions lady carl figment lula peetey TXN1 : COMMITTED jack lady carl muddy dagne ACK lula peetey ozzie lady carl muddy dagne peetey dagne ozzie lady pinetop figment peetey sooshi jack BEGIN TXN1 stella WRITE[sunny] sunny WRITE[ozzie] zee COMMIT transactions TXN1 : COMMITTED GATEWAY

  7. AGENDA Introduction ● Ranges and Replicas ● Transactions ● SQL Data in a KV World ● SQL Execution ● SQL Optimization ●

  8. SQL S tructured Q uery L anguage Declarative, not imperative • These are the results I want vs perform these operations in this sequence Relational data model • Typed: INT, FLOAT, STRING, ... • Schemas: tables, rows, columns, foreign keys

  9. SQL: Tabular Data in a KV World SQL data has columns and types?!? How do we store typed and columnar data in a key-value store? • The SQL data model needs to be mapped to KV data

  10. SQL Data Mapping: Inventory Table CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT ) ID Name Price Key Value 1 Bat 1.11 /1 “Bat”,1.11 2 Ball 2.22 /2 “Ball”,2.22 3 Glove 3.33 /3 “Glove”,3.33

  11. SQL Data Mapping: Inventory Table CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT ) ID Name Price Key Value 1 Bat 1.11 /<Table>/<Index>/1 “Bat”,1.11 2 Ball 2.22 /<Table>/<Index>/2 “Ball”,2.22 3 Glove 3.33 /<Table>/<Index>/3 “Glove”,3.33

  12. SQL Data Mapping: Inventory Table CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT ) ID Name Price Key Value 1 Bat 1.11 /inventory/primary/1 “Bat”,1.11 2 Ball 2.22 /inventory/primary/2 “Ball”,2.22 3 Glove 3.33 /inventory/primary/3 “Glove”,3.33

  13. AGENDA Introduction ● Ranges and Replicas ● Transactions ● SQL Data in a KV World ● SQL Execution ● SQL Optimization ●

  14. SQL Execution Relational operators • Projection ( SELECT <columns> ) • Selection ( WHERE <filter> ) • Aggregation ( GROUP BY <columns> ) • Join ( JOIN ), Union ( UNION ), Intersect ( INTERSECT ) • Scan ( FROM <table> ) • Sort ( ORDER BY ) Technically, not a relational operator ○

  15. SQL Execution • Relational expressions have input expressions and scalar expressions The filter expression has 1 input expression and a predicate scalar expression ○ The scan expression has zero inputs ○ • Query plan is a tree of relational expressions • SQL execution takes a query plan and runs the operations to completion

  16. SQL Execution: Example SELECT name FROM inventory WHERE name >= “b” AND name < “c”

  17. SQL Execution: Scan SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan inventory

  18. SQL Execution: Filter SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan Filter inventory name >= “b” AND name < “c”

  19. SQL Execution: Project SELECT name FROM inventory WHERE name >= “b” AND name < “c” Project Scan Filter name inventory name >= “b” AND name < “c”

  20. SQL Execution: Project SELECT name FROM inventory WHERE name >= “b” AND name < “c” Project Scan Filter Results name inventory name >= “b” AND name < “c”

  21. SQL Execution: Index Scans SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan inventory@name [“b” - “c”) The filter gets pushed into the scan

  22. SQL Execution: Index Scans SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan Project Results inventory@name [“b” - “c”) name

  23. SQL Execution: Correctness Correct SQL execution involves lots of bookkeeping • User defined tables, and indexes • Queries refer to table and column names • Execution uses table and column IDs • NULL handling

  24. SQL Execution: Performance Performant SQL execution • Tight, well written code • Operator specialization hash group by, stream group by ○ hash join, merge join, lookup join, zig-zag join ○ • Distributed execution

  25. Distributed SQL Execution Network latencies and throughput are important Push fragments of computation as close to the data as possible Leverage aggregate compute resources

  26. Distributed SQL Execution: Streaming Group By SELECT COUNT(*), country FROM customers GROUP BY country Scan Scan Scan customers customers customers scan scan scan

  27. Distributed SQL Execution: Streaming Group By SELECT COUNT(*), country FROM customers GROUP BY country Scan Scan Scan customers customers customers Group-By Group-By Group-By “country” “country” “country” group-by group-by group-by

  28. Distributed SQL Execution: Streaming Group By SELECT COUNT(*), country FROM customers GROUP BY country Scan Scan Scan customers customers customers Group-By Group-By Group-By “country” “country” “country” group-by Group-By “country”

  29. AGENDA Introduction ● Ranges and Replicas ● Transactions ● SQL Data in a KV World ● SQL Execution ● SQL Optimization ●

  30. SQL Optimization An optimizer explores many plans that are logically equivalent to a given query and chooses the best one Parse Prep Search Execute AST Memo Plan Parse SQL Fold Constants Cost-based transformations Check Types Resolve Names Report Semantic Errors Compute properties Retrieve and attach stats Cost-independent transformations

  31. SQL Optimization: Cost-Independent Transformations • Some transformations always make sense Constant folding ○ Filter push-down ○ ... ○ • These transformations are cost-independent If the transformation can be applied to the query, it is applied ○ • D omain S pecific L anguage for transformations Compiled down to code which efficiently matches query fragments in the memo ○ ~200 transformations currently defined ○

  32. SQL Optimization: Cost-Based Transformations • Some transformations are not universally good Index selection ○ Join reordering ○ ... ○ • These transformations are cost-based When should the transformation be applied? ○ Need to try both paths and maintain both the original and transformed query ○ State explosion: thousands of possible query plans ○ Memo data structure maintains a forest of query plans ■ Estimate cost of each query, select query with lowest cost ○ • Costing Based on table statistics and estimating cardinality of inputs to relational expressions ○

  33. SQL Optimization: Cost-based Index Selection The index to use for a query is affected by multiple factors • Filters and join conditions • Required ordering ( ORDER BY ) • Implicit ordering ( GROUP BY ) • Covering vs non-covering (i.e. is an index-join required) • Locality

  34. SQL Optimization: Cost-based Index Selection Required orderings affect index selection SELECT * Sorting is expensive if there are a lot of rows FROM a Sorting can be the better option if there are few rows WHERE x > 10 ORDER BY y

  35. SQL Optimization: Cost-based Index Selection Required orderings affect index selection SELECT * Sorting is expensive if there are a lot of rows FROM a Sorting can be the better option if there are few rows WHERE x > 10 ORDER BY y Scan Filter Sort a@primary x > 10 y

  36. SQL Optimization: Cost-based Index Selection Required orderings affect index selection SELECT * Sorting is expensive if there are a lot of rows FROM a Sorting can be the better option if there are few rows WHERE x > 10 ORDER BY y Scan Filter Sort a@primary x > 10 y Scan Sort a@x [10 - ) y

  37. SQL Optimization: Cost-based Index Selection Required orderings affect index selection SELECT * Sorting is expensive if there are a lot of rows FROM a Sorting can be the better option if there are few rows WHERE x > 10 ORDER BY y Scan Filter Sort a@primary x > 10 y Scan Sort a@x [10 - ) y Scan Filter a@y x > 10

  38. SQL Optimization: Cost-based Index Selection Required orderings affect index selection SELECT * Sorting is expensive if there are a lot of rows FROM a Sorting can be the better option if there are few rows WHERE x > 10 ORDER BY y Scan Filter Sort a@primary x > 10 y Lowest Scan Sort Cost a@x [10 - ) y 10 10 Scan Filter a@y x > 10 100,000 10

  39. SQL Optimization: Cost-based Index Selection Required orderings affect index selection SELECT * Sorting is expensive if there are a lot of rows FROM a Sorting can be the better option if there are few rows WHERE x > 10 ORDER BY y Scan Filter Sort a@primary x > 10 y Scan Sort a@x [10 - ) y 50,000 50,000 Lowest Scan Filter Cost a@y x > 10 100,000 50,000

  40. Locality-Aware SQL Optimization Network latencies and throughput are important Duplicate read-mostly data in each locality Plan queries to use data from the same locality

  41. Locality-Aware SQL Optimization Three copies of the CREATE TABLE postal_codes ( id INT PRIMARY KEY, postal_codes table data code STRING, INDEX idx_eu (id) STORING (code), INDEX idx_usw (id) STORING (code) Use replication constraints to ) pin the copies to different geographic regions (US-East, US-West, EU)

  42. Locality-Aware SQL Optimization Optimizer includes locality in CREATE TABLE postal_codes ( id INT PRIMARY KEY, cost model code STRING, INDEX idx_eu (id) STORING (code), INDEX idx_usw (id) STORING (code) Automatically selects index ) from same locality: primary , idx_eu , or idx_usw SELECT * FROM postal_codes

  43. Conclusion Distributed, replicated, transactional key-value store ● Monolithic key space ● Raft replication of ranges (~64MB) ● Replica placement signals: space, diversity, load, latency ● Transactions coordinate changes across ranges ● Mapping SQL data to KV storage ● Distributed SQL execution ● Distributed SQL optimization ●

  44. Thank You www.cockroachlabs.com github.com/cockroachdb/cockroach

  45. Transactional Updates INSERT[sunny] DOGS Transactions used to insert 1 carl - jack carl records into ranges 2 lady - peetey dagne figment 3 pinetop - zee jack lady lady lula lula muddy muddy carl lady pinetop peetey peetey dagne lula sooshi pinetop pinetop sooshi sooshi figment muddy stella ? stella stella ✓ INSERT[sunny] jack peetey zee zee zee Space available in range? - YES

  46. Transactional Updates INSERT[sunny] DOGS Transactions used to insert 1 carl - jack carl records into ranges 2 lady - peetey dagne figment 3 pinetop - zee jack lady lady lula lula muddy muddy carl lady pinetop peetey peetey dagne lula sooshi pinetop pinetop sooshi sooshi figment muddy stella stella stella ✓ INSERT[sunny] jack peetey sunny zee zee zee

  47. Range Splits INSERT[rudy] DOGS BUT… what happens when 1 carl - jack carl a range is full? 2 lady - peetey dagne figment 3 pinetop - zee jack lady lady lula lula muddy muddy carl lady pinetop peetey peetey ? INSERT[rudy] ✓ dagne lula sooshi pinetop pinetop Space available in range? - NO sooshi sooshi figment muddy stella stella stella jack peetey sunny zee zee zee

  48. Range Splits INSERT[rudy] DOGS Ranges are automatically 1 carl - jack carl split, a new range index is 2 lady - peetey dagne created & order maintained figment 3 pinetop - sooshi jack 4 stella - zee lady lady lula lula muddy muddy carl lady pinetop stella peetey peetey INSERT[rudy] ✓ sunny dagne lula rudy pinetop pinetop split range and insert zee sooshi sooshi figment muddy sooshi stella stella jack peetey zee zee

  49. Rebalancing Replicas Scale: Add a node NEW node 2 If we add a node to the cluster, node 5 node 3 CockroachDB automatically node 1 node 4 redistributed replicas to even load across the cluster Uses the replica placement heuristics from previous slides

  50. Rebalancing Replicas Scale: Add a node NEW node 2 If we add a node to the cluster, node 5 node 3 CockroachDB automatically node 1 node 4 redistributed replicas to even load across the cluster Uses the replica placement heuristics from previous slides Movement is decomposed into adding a replica followed by removing a replica

  51. Rebalancing Replicas Scale: Add a node NEW node 2 If we add a node to the cluster, node 5 node 3 CockroachDB automatically node 1 node 4 redistributed replicas to even load across the cluster Uses the replica placement heuristics from previous slides Movement is decomposed into adding a replica followed by removing a replica

  52. Rebalancing Replicas Loss of a node Permanent Failure node 2 node 5 node 3 node 1 node 4 If a node goes down, the Raft group realizes a replica is missing and replaces it with a new replica on an active node Uses the replica placement heuristics from previous slides

  53. Rebalancing Replicas Loss of a node Permanent Failure node 2 node 5 node 3 node 1 node 4 If a node goes down, the Raft group realizes a replica is missing and replaces it with a new replica on an active node Uses the replica placement heuristics from previous slides

  54. Rebalancing Replicas Loss of a node Temporary Failure node 2 node 5 node 3 node 1 node 4 If a node goes down for a moment, the leaseholder can “catch up” any replica that is behind

  55. Transactions: Pipelining Pipelined Serial

  56. Transactions: Pipelining Pipelined Serial txn:sunny (pending) s u n n y s u n n y BEGIN WRITE[sunny]

  57. Transactions: Pipelining Pipelined Serial txn:sunny (pending) s u n n y o z z i e s u n n y BEGIN WRITE[sunny] o z WRITE[ozzie] z i e

  58. Transactions: Pipelining Pipelined Serial txn:sunny (pending) s u n n y o z z i e txn:sunny (staged) s u n n y [ k BEGIN e y s : s u n n y , o z z i e ] WRITE[sunny] o z WRITE[ozzie] z i e COMMIT t x n : s u n n y ( c o m m i t ) [ k e y s : s u n n y , o z z i e ]

  59. Transactions: Pipelining Pipelined Serial txn:sunny (pending) s u n n y o z z i e txn:sunny (staged) s u n n y [ k BEGIN e y s : s u n n y , o z z i e ] WRITE[sunny] o z WRITE[ozzie] z i e COMMIT Committed once all t t x n : operations complete s u n n y ( c o m m i t ) [ k e y s : s u n n y , o z z i e ] We replaced the centralized commit marker with a distributed one * “Proved” with TLA+

  60. A Simple Transaction lady carl pinetop lula peetey sooshi node 1 node 1 carl pinetop INSERT INTO DOGS (sunny); muddy dagne stella dagne sooshi peetey lady zee node 1 node 1 figment stella jack zee carl lady carl dagne lula peetey lady carl carl figment muddy dagne lula peetey dagne jack peetey lady muddy dagne figment peetey lady jack pinetop sooshi stella zee

  61. A Simple Transaction: One Range lady carl pinetop lula peetey sooshi node 1 node 1 carl pinetop INSERT INTO DOGS (sunny); muddy dagne stella dagne sooshi peetey lady zee node 1 node 1 figment stella jack zee carl lady carl dagne lula peetey lady carl carl figment muddy dagne lula peetey dagne jack peetey lady muddy dagne figment peetey lady jack pinetop sooshi BEGIN stella WRITE[sunny] sunny COMMIT zee GATEWAY NOTE: a gateway can be ANY CockroachDB instance. It can find the leaseholder for any range and execute a transaction

  62. A Simple Transaction: One Range lady carl pinetop lula peetey sooshi node 1 node 1 carl pinetop INSERT INTO DOGS (sunny); muddy dagne stella dagne sooshi peetey lady sunny node 1 node 1 figment stella zee jack sunny carl lady carl zee dagne lula peetey lady carl carl figment muddy dagne lula peetey dagne jack peetey lady muddy dagne figment peetey lady jack ACK pinetop sooshi BEGIN stella WRITE[sunny] sunny COMMIT zee GATEWAY

  63. A Simple Transaction: One Range lady carl pinetop lula peetey sooshi node 1 node 1 carl pinetop INSERT INTO DOGS (sunny); muddy dagne stella dagne sooshi peetey lady sunny node 1 node 1 figment stella zee jack sunny carl lady carl zee dagne lula peetey lady carl carl figment muddy dagne lula peetey dagne jack peetey lady muddy dagne figment ACK peetey lady jack pinetop sooshi BEGIN stella WRITE[sunny] sunny COMMIT zee GATEWAY

  64. Ranges CockroachDB implements order-preserving data distribution • Automates sharding of key/value data into “ranges” • Supports efficient range scans • Requires an indexing structure Foundational capability that enables efficient distribution of data across nodes within a CockroachDB cluster * This approach is also used by Bigtable (tablets), HBase (regions) & Spanner (ranges)

  65. SQL Data Mapping: Inventory Table CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT, INDEX name_idx (name) ) ID Name Price Key Value ∅ 1 Bat 1.11 /inventory/name_idx/”Bat”/1 ∅ 2 Ball 2.22 /inventory/name_idx/”Ball”/2 ∅ 3 Glove 3.33 /inventory/name_idx/”Glove”/3

  66. SQL Data Mapping: Inventory Table CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT, INDEX name_idx (name) ) ID Name Price Key Value ∅ 1 Bat 1.11 /inventory/name_idx/”Bat”/1 ∅ 2 Ball 2.22 /inventory/name_idx/”Ball”/2 ∅ 3 Glove 3.33 /inventory/name_idx/”Glove”/3 4 Bat 4.44

  67. SQL Data Mapping: Inventory Table CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT, INDEX name_idx (name) ) ID Name Price Key Value ∅ 1 Bat 1.11 /inventory/name_idx/”Bat”/1 ∅ 2 Ball 2.22 /inventory/name_idx/”Ball”/2 ∅ 3 Glove 3.33 /inventory/name_idx/”Glove”/3 ∅ 4 Bat 4.44 /inventory/name_idx/”Bat”/4

  68. SQL Execution: Group By SELECT COUNT(*), country FROM customers GROUP BY country Name Country Bob United States Hans Germany Jacques France Marie France Susan United States

  69. SQL Execution: Hash Group By SELECT COUNT(*), country FROM customers GROUP BY country Name Country Bob United States Hans Germany Jacques France Marie France Susan United States

  70. SQL Execution: Hash Group By SELECT COUNT(*), country FROM customers GROUP BY country Name Country Bob United States United States 1 Hans Germany Jacques France Marie France Susan United States

  71. SQL Execution: Hash Group By SELECT COUNT(*), country FROM customers GROUP BY country Name Country Bob United States United States 1 Hans Germany Germany 1 Jacques France Marie France Susan United States

  72. SQL Execution: Hash Group By SELECT COUNT(*), country FROM customers GROUP BY country Name Country Bob United States United States 1 Hans Germany Germany 1 Jacques France France 1 Marie France Susan United States

  73. SQL Execution: Hash Group By SELECT COUNT(*), country FROM customers GROUP BY country Name Country Bob United States United States 1 Hans Germany Germany 1 Jacques France France 2 Marie France Susan United States

More recommend