cockroachdb
play

CockroachDB Architecture of a Geo-Distributed SQL Database Peter - PowerPoint PPT Presentation

CockroachDB Architecture of a Geo-Distributed SQL Database Peter Mattis (@petermattis), Co-founder & CTO CockroachDB: Geo-distributed SQL Database Make Data Easy Distributed Horizontally scalable to grow with your application


  1. Distributed Transactions lady carl pinetop lula peetey sooshi node 2 node 3 carl pinetop INSERT INTO dogs muddy dagne stella VALUES (sunny, ozzie) dagne sooshi peetey lady zee node 1 node 4 figment stella jack zee carl lady carl dagne lula peetey lady carl carl figment muddy dagne lula peetey dagne jack peetey lady muddy dagne figment peetey lady jack pinetop sooshi stella zee

  2. Distributed Transactions lady carl pinetop lula peetey sooshi node 2 node 3 carl pinetop INSERT INTO dogs muddy dagne stella VALUES (sunny, ozzie) dagne sooshi peetey lady zee node 1 node 4 figment stella jack zee carl lady carl dagne lula peetey lady carl carl figment muddy dagne lula peetey dagne jack peetey lady muddy dagne figment peetey lady jack pinetop sooshi BEGIN TXN1 stella WRITE[sunny] zee transactions TXN1 : PENDING GATEWAY

  3. Distributed Transactions lady carl pinetop lula peetey sooshi node 2 node 3 carl pinetop INSERT INTO dogs muddy dagne stella VALUES (sunny, ozzie) dagne sooshi peetey lady sunny node 1 node 4 figment stella zee jack sunny carl lady carl zee dagne lula peetey lady carl carl figment muddy dagne lula peetey dagne jack peetey lady muddy dagne figment peetey lady jack pinetop sooshi BEGIN TXN1 stella WRITE[sunny] sunny zee transactions TXN1 : PENDING GATEWAY

  4. Distributed Transactions lady carl pinetop pinetop lula peetey sooshi sooshi node 2 node 3 carl pinetop pinetop INSERT INTO dogs muddy dagne stella stella VALUES (sunny, ozzie) dagne sooshi sooshi peetey lady sunny sunny node 1 node 4 ACK figment stella stella zee zee jack sunny sunny carl lady carl zee zee dagne lula peetey lady carl carl figment muddy dagne lula peetey dagne jack peetey lady muddy dagne figment peetey lady jack pinetop pinetop sooshi sooshi BEGIN TXN1 stella stella WRITE[sunny] sunny sunny zee zee transactions TXN1 : PENDING GATEWAY

  5. Distributed Transactions lady carl pinetop pinetop lula peetey sooshi sooshi node 2 node 3 carl pinetop pinetop INSERT INTO dogs muddy dagne stella stella VALUES (sunny, ozzie) dagne sooshi sooshi peetey lady sunny sunny node 1 node 4 figment stella stella zee zee jack sunny sunny carl lady carl zee zee dagne lula peetey lady carl carl figment muddy dagne lula peetey dagne jack peetey lady muddy dagne figment peetey lady jack pinetop pinetop sooshi sooshi BEGIN TXN1 stella stella WRITE[sunny] sunny sunny zee zee transactions TXN1 : PENDING GATEWAY

  6. Distributed Transactions lady carl pinetop pinetop lula peetey sooshi sooshi node 2 node 3 carl pinetop pinetop INSERT INTO dogs muddy dagne stella stella VALUES (sunny, ozzie) dagne sooshi sooshi ozzie lady sunny sunny node 1 node 4 figment stella stella peetey zee zee jack sunny sunny carl lady carl zee zee dagne lula peetey lady carl carl figment muddy dagne lula peetey dagne jack peetey lady muddy dagne figment peetey lady jack ACK pinetop pinetop sooshi sooshi BEGIN TXN1 stella stella WRITE[sunny] sunny sunny WRITE[ozzie] zee zee transactions TXN1 : PENDING GATEWAY

  7. Distributed Transactions lady carl pinetop lula peetey sooshi node 2 node 3 carl pinetop INSERT INTO dogs muddy dagne stella VALUES (sunny, ozzie) dagne sooshi ozzie lady sunny node 1 node 4 figment stella peetey zee jack sunny carl lady carl zee dagne lula peetey lady carl carl figment muddy dagne lula peetey dagne jack ozzie lady muddy dagne figment peetey ozzie lady jack peetey pinetop sooshi BEGIN TXN1 stella WRITE[sunny] sunny WRITE[ozzie] zee transactions TXN1 : PENDING GATEWAY

  8. Distributed Transactions lady carl pinetop lula peetey sooshi node 2 node 3 carl pinetop INSERT INTO dogs muddy dagne stella VALUES (sunny, ozzie) dagne sooshi ozzie lady sunny node 1 node 4 figment stella peetey zee jack sunny carl lady carl zee dagne lula peetey lady carl carl figment muddy dagne lula peetey dagne jack ACK ozzie lady muddy dagne figment peetey ozzie lady jack peetey pinetop sooshi BEGIN TXN1 stella WRITE[sunny] sunny WRITE[ozzie] zee transactions TXN1 : PENDING GATEWAY

  9. Distributed Transactions lady carl pinetop lula peetey sooshi node 2 node 3 carl pinetop INSERT INTO dogs muddy dagne stella VALUES (sunny, ozzie) dagne sooshi ozzie lady sunny node 1 node 4 figment stella peetey zee jack sunny carl lady carl zee dagne lula peetey lady carl carl figment muddy dagne ACK lula peetey dagne jack ozzie lady muddy dagne figment peetey ozzie lady jack peetey pinetop sooshi BEGIN TXN1 stella WRITE[sunny] sunny WRITE[ozzie] zee transactions TXN1 : PENDING GATEWAY

  10. Distributed Transactions lady carl pinetop lula peetey sooshi node 2 node 3 carl pinetop INSERT INTO dogs muddy dagne stella VALUES (sunny, ozzie) dagne sooshi ozzie lady sunny node 1 node 4 figment stella peetey zee jack sunny carl lady carl zee dagne lula peetey lady carl carl figment muddy dagne lula peetey dagne jack ozzie lady muddy dagne figment peetey ozzie lady jack peetey pinetop sooshi BEGIN TXN1 stella WRITE[sunny] sunny WRITE[ozzie] zee COMMIT transactions TXN1 : COMMIT GATEWAY

  11. Distributed Transactions lady carl pinetop lula peetey sooshi node 2 node 3 carl pinetop INSERT INTO dogs muddy dagne stella VALUES (sunny, ozzie) dagne sooshi ozzie lady sunny node 1 node 4 figment stella peetey zee jack sunny carl lady carl zee dagne lula peetey lady carl carl figment muddy dagne lula peetey dagne jack ozzie lady muddy dagne figment ACK peetey ozzie lady jack peetey pinetop sooshi BEGIN TXN1 stella WRITE[sunny] sunny WRITE[ozzie] zee COMMIT GATEWAY

  12. Transactions: Pipelining Pipelined Serial

  13. Transactions: Pipelining Pipelined Serial txn:sunny (pending) s u n n y s u n n y BEGIN WRITE[sunny]

  14. Transactions: Pipelining Pipelined Serial txn:sunny (pending) s u n n y o z z i e s u n n y BEGIN WRITE[sunny] o z WRITE[ozzie] z i e

  15. Transactions: Pipelining Pipelined Serial txn:sunny (pending) s u n n y o z z i e txn:sunny (staged) s u n n y [ k BEGIN e y s : s u n n y , o z z i e ] WRITE[sunny] o z WRITE[ozzie] z i e COMMIT t x n : s u n n y ( c o m m i t ) [ k e y s : s u n n y , o z z i e ]

  16. Transactions: Pipelining Pipelined Serial txn:sunny (pending) s u n n y o z z i e txn:sunny (staged) s u n n y [ k BEGIN e y s : s u n n y , o z z i e ] WRITE[sunny] o z WRITE[ozzie] z i e COMMIT Committed once all t t x n : operations complete s u n n y ( c o m m i t ) [ k e y s : s u n n y , o z z i e ] We replaced the centralized commit marker with a distributed one * “Proved” with TLA+

  17. AGENDA Introduction ● Ranges and Replicas ● Transactions ● SQL Data in a KV World ● SQL Execution ● SQL Optimization ●

  18. SQL S tructured Q uery L anguage Declarative, not imperative • These are the results I want vs perform these operations in this sequence Relational data model • Typed: INT, FLOAT, STRING, ... • Schemas: tables, rows, columns, foreign keys

  19. SQL: Tabular Data in a KV World SQL data has columns and types?!? How do we store typed and columnar data in a distributed, replicated, transactional key-value store? • The SQL data model needs to be mapped to KV data • Reminder: keys and values are lexicographically sorted

  20. SQL Data Mapping: Inventory Table CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT ) ID Name Price Key Value 1 Bat 1.11 /1 “Bat”,1.11 2 Ball 2.22 /2 “Ball”,2.22 3 Glove 3.33 /3 “Glove”,3.33

  21. SQL Data Mapping: Inventory Table CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT ) ID Name Price Key Value 1 Bat 1.11 /<Table>/<Index>/1 “Bat”,1.11 2 Ball 2.22 /<Table>/<Index>/2 “Ball”,2.22 3 Glove 3.33 /<Table>/<Index>/3 “Glove”,3.33

  22. SQL Data Mapping: Inventory Table CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT ) ID Name Price Key Value 1 Bat 1.11 /inventory/primary/1 “Bat”,1.11 2 Ball 2.22 /inventory/primary/2 “Ball”,2.22 3 Glove 3.33 /inventory/primary/3 “Glove”,3.33

  23. SQL Data Mapping: Inventory Table CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT, INDEX name_idx (name) ) ID Name Price Key Value ∅ 1 Bat 1.11 /inventory/name_idx/”Bat”/1 ∅ 2 Ball 2.22 /inventory/name_idx/”Ball”/2 ∅ 3 Glove 3.33 /inventory/name_idx/”Glove”/3

  24. SQL Data Mapping: Inventory Table CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT, INDEX name_idx (name) ) ID Name Price Key Value ∅ 1 Bat 1.11 /inventory/name_idx/”Bat”/1 ∅ 2 Ball 2.22 /inventory/name_idx/”Ball”/2 ∅ 3 Glove 3.33 /inventory/name_idx/”Glove”/3 4 Bat 4.44

  25. SQL Data Mapping: Inventory Table CREATE TABLE inventory ( id INT PRIMARY KEY, name STRING, price FLOAT, INDEX name_idx (name) ) ID Name Price Key Value ∅ 1 Bat 1.11 /inventory/name_idx/”Bat”/1 ∅ 2 Ball 2.22 /inventory/name_idx/”Ball”/2 ∅ 3 Glove 3.33 /inventory/name_idx/”Glove”/3 ∅ 4 Bat 4.44 /inventory/name_idx/”Bat”/4

  26. AGENDA Introduction ● Ranges and Replicas ● Transactions ● SQL Data in a KV World ● SQL Execution ● SQL Optimization ●

  27. SQL Execution Relational operators • Projection ( SELECT <columns> ) • Selection ( WHERE <filter> ) • Aggregation ( GROUP BY <columns> ) • Join ( JOIN ), union ( UNION ), intersect ( INTERSECT ) • Scan ( FROM <table> ) • Sort ( ORDER BY ) Technically, not a relational operator ○

  28. SQL Execution • Relational expressions have input expressions and scalar expressions For example, a “filter” expression has 1 input expression and a scalar expression that ○ filters the rows from the child The scan expression has zero inputs ○ • Query plan is a tree of relational expressions • SQL execution takes a query plan and runs the operations to completion

  29. SQL Execution: Example SELECT name FROM inventory WHERE name >= “b” AND name < “c”

  30. SQL Execution: Scan SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan inventory

  31. SQL Execution: Filter SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan Filter inventory name >= “b” AND name < “c”

  32. SQL Execution: Project SELECT name FROM inventory WHERE name >= “b” AND name < “c” Project Scan Filter name inventory name >= “b” AND name < “c”

  33. SQL Execution: Project SELECT name FROM inventory WHERE name >= “b” AND name < “c” Project Scan Filter Results name inventory name >= “b” AND name < “c”

  34. SQL Execution: Index Scans SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan inventory@name [“b” - “c”) The filter gets pushed into the scan

  35. SQL Execution: Index Scans SELECT name FROM inventory WHERE name >= “b” AND name < “c” Scan Project Results inventory@name [“b” - “c”) name

  36. SQL Execution: Correctness Correct SQL execution involves lots of bookkeeping • User defined tables, and indexes • Queries refer to table and column names • Execution uses table and column IDs • NULL handling

  37. SQL Execution: Performance Performant SQL execution • Tight, well written code • Operator specialization hash group by, stream group by ○ hash join, merge join, lookup join, zig-zag join ○ • Distributed execution

  38. SQL Execution: Group By SELECT COUNT(*), country FROM customers GROUP BY country Name Country Bob United States Hans Germany Jacques France Marie France Susan United States

  39. SQL Execution: Hash Group By SELECT COUNT(*), country FROM customers GROUP BY country Name Country Bob United States Hans Germany Jacques France Marie France Susan United States

  40. SQL Execution: Hash Group By SELECT COUNT(*), country FROM customers GROUP BY country Name Country Bob United States United States 1 Hans Germany Jacques France Marie France Susan United States

  41. SQL Execution: Hash Group By SELECT COUNT(*), country FROM customers GROUP BY country Name Country Bob United States United States 1 Hans Germany Germany 1 Jacques France Marie France Susan United States

  42. SQL Execution: Hash Group By SELECT COUNT(*), country FROM customers GROUP BY country Name Country Bob United States United States 1 Hans Germany Germany 1 Jacques France France 1 Marie France Susan United States

  43. SQL Execution: Hash Group By SELECT COUNT(*), country FROM customers GROUP BY country Name Country Bob United States United States 1 Hans Germany Germany 1 Jacques France France 2 Marie France Susan United States

  44. SQL Execution: Hash Group By SELECT COUNT(*), country FROM customers GROUP BY country Name Country Bob United States United States 2 Hans Germany Germany 1 Jacques France France 2 Marie France Susan United States

  45. SQL Execution: Group By Revisited SELECT COUNT(*), country FROM customers GROUP BY country Name Country Bob United States Hans Germany Jacques France Marie France Susan United States

  46. SQL Execution: Sort on Grouping Column(s) SELECT COUNT(*), country FROM customers GROUP BY country Name Country Jacques France Marie France Hans Germany Bob United States Susan United States

  47. SQL Execution: Streaming Group By SELECT COUNT(*), country FROM customers GROUP BY country Name Country Jacques France France 1 Marie France Hans Germany Bob United States Susan United States

  48. SQL Execution: Streaming Group By SELECT COUNT(*), country FROM customers GROUP BY country Name Country Jacques France France 2 Marie France Hans Germany Bob United States Susan United States

  49. SQL Execution: Streaming Group By SELECT COUNT(*), country FROM customers GROUP BY country Name Country Jacques France France 2 Marie France Germany 1 Hans Germany Bob United States Susan United States

  50. SQL Execution: Streaming Group By SELECT COUNT(*), country FROM customers GROUP BY country Name Country Jacques France France 2 Marie France Germany 1 Hans Germany United States 1 Bob United States Susan United States

  51. SQL Execution: Streaming Group By SELECT COUNT(*), country FROM customers GROUP BY country Name Country Jacques France France 2 Marie France Germany 1 Hans Germany United States 2 Bob United States Susan United States

  52. Distributed SQL Execution Network latencies and throughput are important considerations in geo-distributed setups Push fragments of computation as close to the data as possible

  53. Distributed SQL Execution: Streaming Group By SELECT COUNT(*), country FROM customers GROUP BY country Scan Scan Scan customers customers customers scan scan scan

  54. Distributed SQL Execution: Streaming Group By SELECT COUNT(*), country FROM customers GROUP BY country Scan Scan Scan customers customers customers Group-By Group-By Group-By “country” “country” “country” group-by group-by group-by

  55. Distributed SQL Execution: Streaming Group By SELECT COUNT(*), country FROM customers GROUP BY country Scan Scan Scan customers customers customers Group-By Group-By Group-By “country” “country” “country” group-by Group-By “country”

  56. AGENDA Introduction ● Ranges and Replicas ● Transactions ● SQL Data in a KV World ● SQL Execution ● SQL Optimization ●

  57. SQL Optimization An optimizer explores many plans that are logically equivalent to a given query and chooses the best one Parse Prep Search Execute AST Memo Plan Parse SQL Fold Constants Cost-based transformations Check Types Resolve Names Report Semantic Errors Compute properties Retrieve and attach stats Cost-independent transformations

  58. SQL Optimization: Cost-Independent Transformations • Some transformations always make sense Constant folding ○ Filter push-down ○ Decorrelating subqueries * ○ ... ○ • These transformations are cost-independent If the transformation can be applied to the query, it is applied ○ • D omain S pecific L anguage for transformations Compiled down to code which efficiently matches query fragments in the memo ○ ~200 transformations currently defined ○ * Actually cost-based, but we’re treating it as cost-independent right now

  59. SQL Optimization: Filter Push-Down SELECT * FROM a JOIN b WHERE x > 10 Initial plan Scan a@primary Filter Join Results x > 10 Scan b@primary

  60. SQL Optimization: Filter Push-Down SELECT * FROM a JOIN b WHERE x > 10 After filter push-down Scan Filter a@primary x > 10 Join Results Scan Filter b@primary x > 10

  61. SQL Optimization: Cost-Based Transformations • Some transformations are not universally good Index selection ○ Join reordering ○ ... ○ • These transformations are cost-based When should the transformation be applied? ○ Need to try both paths and maintain both the original and transformed query ○ State explosion: thousands of possible query plans ○ Memo data structure maintains a forest of query plans ■ Estimate cost of each query, select query with lowest cost ○ • Costing Based on table statistics and estimating cardinality of inputs to relational expressions ○

  62. SQL Optimization: Cost-based Index Selection The index to use for a query is affected by multiple factors • Filters and join conditions • Required ordering ( ORDER BY ) • Implicit ordering ( GROUP BY ) • Covering vs non-covering (i.e. is an index-join required) • Locality

  63. SQL Optimization: Cost-based Index Selection Required orderings affect index selection SELECT * Sorting is expensive if there are a lot of rows FROM a Sorting can be the better option if there are few rows WHERE x > 10 ORDER BY y

  64. SQL Optimization: Cost-based Index Selection Required orderings affect index selection SELECT * Sorting is expensive if there are a lot of rows FROM a Sorting can be the better option if there are few rows WHERE x > 10 ORDER BY y Scan Filter Sort a@primary x > 10 y

  65. SQL Optimization: Cost-based Index Selection Required orderings affect index selection SELECT * Sorting is expensive if there are a lot of rows FROM a Sorting can be the better option if there are few rows WHERE x > 10 ORDER BY y Scan Filter Sort a@primary x > 10 y Scan Sort a@x [10 - ) y

  66. SQL Optimization: Cost-based Index Selection Required orderings affect index selection SELECT * Sorting is expensive if there are a lot of rows FROM a Sorting can be the better option if there are few rows WHERE x > 10 ORDER BY y Scan Filter Sort a@primary x > 10 y Scan Sort a@x [10 - ) y Scan Filter a@y x > 10

  67. SQL Optimization: Cost-based Index Selection Required orderings affect index selection SELECT * Sorting is expensive if there are a lot of rows FROM a Sorting can be the better option if there are few rows WHERE x > 10 ORDER BY y Scan Filter Sort a@primary x > 10 y Lowest Scan Sort Cost a@x [10 - ) y 10 10 Scan Filter a@y x > 10 100,000 10

  68. SQL Optimization: Cost-based Index Selection Required orderings affect index selection SELECT * Sorting is expensive if there are a lot of rows FROM a Sorting can be the better option if there are few rows WHERE x > 10 ORDER BY y Scan Filter Sort a@primary x > 10 y Scan Sort a@x [10 - ) y 50,000 50,000 Lowest Scan Filter Cost a@y x > 10 100,000 50,000

  69. Locality-Aware SQL Optimization Network latencies and throughput are important considerations in geo-distributed setups Duplicate read-mostly data in each locality Plan queries to use data from the same locality

  70. Locality-Aware SQL Optimization Three copies of the CREATE TABLE postal_codes ( id INT PRIMARY KEY, postal_codes table data code STRING, INDEX idx_eu (id) STORING (code), INDEX idx_usw (id) STORING (code) Use replication constraints to ) pin the copies to different geographic regions (US-East, US-West, EU)

More recommend