  1. Building a Transactional Key-Value Store That Scales to 100+ Nodes Siddon Tang at PingCAP (Twitter: @siddontang; @pingcap) 1

  2. About Me ● Chief Engineer at PingCAP ● Leader of the TiKV project ● My other open-source projects: ○ go-mysql ○ go-mysql-elasticsearch ○ LedisDB ○ raft-rs ○ etc. 2

  3. Agenda ● Why did we build TiKV? ● How do we build TiKV? ● Going beyond TiKV 3

  4. Why? Is it worthwhile to build another Key-Value store? 4

  5. We want to build a distributed relational database to solve the scaling problem of MySQL!!! 5

  6. Inspired by Google F1 + Spanner (diagram: MySQL Client ↔ Client, F1 ↔ TiDB, Spanner ↔ TiKV) 6

  7. How? 7

  8. A High Building, A Low Foundation 8

  9. What we need to build... 1. A high-performance Key-Value engine to store data 2. A consensus model to ensure data consistency across machines 3. A transaction model that provides ACID guarantees across machines 4. A network framework for communication 5. A scheduler to manage the whole cluster 9

  10. Choose a Language! 10

  11. Hello Rust 11

  12. Rust...? 12

  13. Rust - Cons (2 years ago): ● Makes you think differently ● Long compile time ● Lack of libraries and tools ● Few Rust programmers ● Uncertain future (chart: Rust learning curve over time) 13

  14. Rust - Pros: ● Blazing Fast ● Memory safety ● Thread safety ● No GC ● Fast FFI ● Vibrant package ecosystem 14

  15. Let’s start from the beginning! 15

  16. Key-Value engine 16

  17. Why RocksDB? ● High Write/Read Performance ● Stability ● Easy to embed in Rust ● Rich functionality ● Continuous development ● Active community 17
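
As a hedged illustration of how easily RocksDB embeds in Rust, here is a minimal sketch using the `rocksdb` crate (the community rust-rocksdb binding, added as a Cargo dependency); method names may differ slightly between crate versions, and this is not TiKV's actual engine code:

```rust
// Minimal sketch of embedding RocksDB from Rust via the `rocksdb` crate.
// Assumes `rocksdb` is listed in Cargo.toml; API follows recent crate
// versions and may differ slightly in older releases.
use rocksdb::{Options, DB};

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let mut opts = Options::default();
    opts.create_if_missing(true);

    // Open (or create) a database in a local directory.
    let db = DB::open(&opts, "/tmp/tikv-demo")?;

    // Write and read back a key-value pair.
    db.put(b"a", b"1")?;
    match db.get(b"a")? {
        Some(v) => println!("a = {}", String::from_utf8_lossy(&v)),
        None => println!("a not found"),
    }
    Ok(())
}
```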

  18. RocksDB alone keeps the data on a single machine. We need fault tolerance. 18

  19. Consensus Algorithm 19

  20. Raft - Roles ● Leader ● Follower ● Candidate 20

  21. Raft - Election (state diagram): Start → Follower; Follower → Candidate on election timeout (start new election); Candidate → Candidate on election timeout (re-campaign); Candidate → Leader on receiving a majority of votes; Candidate → Follower on finding a leader or receiving a higher-term msg; Leader → Follower on receiving a higher-term msg 21
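
A minimal Rust sketch of the role transitions on this slide (purely illustrative; TiKV's real implementation lives in the raft-rs crate):

```rust
// Illustrative election state machine matching the slide above.
#[derive(Debug, Clone, Copy)]
enum Role {
    Follower,
    Candidate,
    Leader,
}

#[derive(Debug, Clone, Copy)]
enum Event {
    ElectionTimeout,         // no leader heartbeat arrived in time
    ReceivedMajorityVotes,   // won the election
    FoundLeaderOrHigherTerm, // saw a valid leader or a higher-term message
}

fn step(role: Role, event: Event) -> Role {
    match (role, event) {
        // A follower starts a new election when its election timer fires.
        (Role::Follower, Event::ElectionTimeout) => Role::Candidate,
        // A candidate re-campaigns on timeout, becomes leader with a
        // majority of votes, and steps down if it finds a leader or a
        // higher term.
        (Role::Candidate, Event::ElectionTimeout) => Role::Candidate,
        (Role::Candidate, Event::ReceivedMajorityVotes) => Role::Leader,
        (Role::Candidate, Event::FoundLeaderOrHigherTerm) => Role::Follower,
        // A leader steps down when it sees a higher term.
        (Role::Leader, Event::FoundLeaderOrHigherTerm) => Role::Follower,
        (r, _) => r,
    }
}

fn main() {
    let mut role = Role::Follower;
    for ev in [Event::ElectionTimeout, Event::ReceivedMajorityVotes] {
        role = step(role, ev);
        println!("{:?} -> {:?}", ev, role);
    }
}
```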

  22. Raft - Log Replicated State Machine (diagram): the client talks to the Raft module on the leader; the log entries (a <- 1, b <- 2) are replicated to every node's log and applied in order to each node's state machine, so all state machines hold the same data (a = 1, b = 2) 22
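
A small sketch of the replicated-state-machine idea on this slide: applying the same committed log (a <- 1, b <- 2) in order yields the same key-value state on every node. The types are illustrative, not TiKV's:

```rust
use std::collections::HashMap;

// A committed Raft log entry, e.g. "a <- 1".
#[derive(Debug)]
enum LogEntry {
    Put { key: String, value: u64 },
}

// Apply the committed log in order to a key-value state machine.
fn apply(log: &[LogEntry], state: &mut HashMap<String, u64>) {
    for entry in log {
        match entry {
            LogEntry::Put { key, value } => {
                state.insert(key.clone(), *value);
            }
        }
    }
}

fn main() {
    // The same log is replicated to every node...
    let log = vec![
        LogEntry::Put { key: "a".into(), value: 1 },
        LogEntry::Put { key: "b".into(), value: 2 },
    ];
    // ...so every node ends up with the same state: {a: 1, b: 2}.
    let mut state = HashMap::new();
    apply(&log, &mut state);
    println!("{:?}", state);
}
```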

  23. Raft - Optimization ● Leader appends logs and sends msgs in parallel ● Prevote ● Pipeline ● Batch ● Learner ● Lease-based Read ● Follower Read 23
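
Among these, lease-based read lets the leader serve reads locally while its leadership lease is still valid, skipping a full Raft round trip per read. A simplified sketch of the lease check (illustrative only; the `Lease` type here is not TiKV's):

```rust
use std::time::{Duration, Instant};

// Illustrative lease-based read: after a successful heartbeat round, the
// leader may serve reads locally until the lease expires.
struct Lease {
    renewed_at: Instant,
    // The lease must be shorter than the election timeout, so a new leader
    // cannot be elected while the old one still believes it holds the lease.
    duration: Duration,
}

impl Lease {
    fn is_valid(&self, now: Instant) -> bool {
        now.duration_since(self.renewed_at) < self.duration
    }
}

fn main() {
    let lease = Lease {
        renewed_at: Instant::now(),
        duration: Duration::from_millis(900),
    };
    if lease.is_valid(Instant::now()) {
        println!("serve the read locally on the leader");
    } else {
        println!("fall back to ReadIndex / quorum check");
    }
}
```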

  24. A single Raft group can't manage a huge dataset, so we need Multi-Raft!!! 24

  25. Multi-Raft: Data sharding (diagram): Hash Sharding maps hash(key) to Chunk 1/2/3; Range Sharding (used by TiKV) splits the dataset into key ranges (-∞, a), [a, b), [b, +∞) 25
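
A short sketch of the range-sharding lookup (the scheme TiKV uses): with Regions keyed by their start key in an ordered map, the owner of a key is the Region with the greatest start key not exceeding it. Hash sharding, by contrast, would just be `hash(key) % num_chunks`. The types here are illustrative:

```rust
use std::collections::BTreeMap;

// Find the Region owning `key`: the entry with the greatest start key <= key.
fn region_for(regions_by_start_key: &BTreeMap<String, u64>, key: &str) -> Option<u64> {
    regions_by_start_key
        .range::<str, _>(..=key)
        .next_back()
        .map(|(_, &region_id)| region_id)
}

fn main() {
    // Three Regions covering (-inf, "a"), ["a", "b"), ["b", +inf).
    let mut regions = BTreeMap::new();
    regions.insert(String::new(), 1); // "" acts as -inf
    regions.insert("a".to_string(), 2);
    regions.insert("b".to_string(), 3);

    assert_eq!(region_for(&regions, "apple"), Some(2));
    assert_eq!(region_for(&regions, "zebra"), Some(3));
    println!("key `apple` -> Region {:?}", region_for(&regions, "apple"));
}
```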

  26. Multi-Raft in TiKV (diagram): Range Sharding assigns range A - B to Region 1, B - C to Region 2, and C - D to Region 3; each Region is a Raft Group with replicas on Node 1, Node 2, and Node 3 26

  27. Multi-Raft: Split and Merge (diagram): Split turns Region A into Region A + Region B; Merge combines the adjacent Region A and Region B back into one Region 27
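
Logically, a split turns one key range into two at a split key; in TiKV the split itself is proposed through Raft so that all replicas split consistently. A simplified sketch (the `Region` struct here is illustrative, not TiKV's):

```rust
// One key range [start, end) becomes [start, split_key) and [split_key, end).
#[derive(Debug, Clone)]
struct Region {
    id: u64,
    start_key: Vec<u8>,
    end_key: Vec<u8>, // empty = +inf
}

fn split(region: &Region, split_key: Vec<u8>, new_id: u64) -> (Region, Region) {
    let left = Region {
        id: region.id,
        start_key: region.start_key.clone(),
        end_key: split_key.clone(),
    };
    let right = Region {
        id: new_id,
        start_key: split_key,
        end_key: region.end_key.clone(),
    };
    (left, right)
}

fn main() {
    let a = Region { id: 1, start_key: b"a".to_vec(), end_key: b"c".to_vec() };
    let (left, right) = split(&a, b"b".to_vec(), 2);
    println!("{:?} | {:?}", left, right);
}
```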

  28. Multi-Raft: Scalability (diagram): how to move Region A from Node 1 to Node 2? 28

  29. Multi-Raft: Scalability (diagram) - Step 1: Add Replica (add a new replica of Region A on Node 2) 29

  30. Multi-Raft: Scalability (diagram) - Step 2: Transfer Leader (transfer Region A's Raft leadership to the new replica on Node 2) 30

  31. Multi-Raft: Scalability (diagram) - Step 3: Remove Replica (remove the old replica of Region A from Node 1) 31
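
Putting slides 28-31 together, moving a Region is a three-step schedule: add a replica on the target node, transfer the Raft leadership, then remove the old replica. A hedged sketch with hypothetical placeholder methods (not TiKV/PD APIs):

```rust
// The `Cluster` methods below are placeholders that just name the
// conf-change and leadership steps; they are not real TiKV/PD calls.
struct Cluster;

impl Cluster {
    // Raft conf change: add a new replica on the target node.
    fn add_replica(&mut self, region: u64, to_node: u64) {
        println!("region {region}: add replica on node {to_node}");
    }
    // Move leadership onto the new replica.
    fn transfer_leader(&mut self, region: u64, to_node: u64) {
        println!("region {region}: transfer leader to node {to_node}");
    }
    // Raft conf change: drop the old replica.
    fn remove_replica(&mut self, region: u64, from_node: u64) {
        println!("region {region}: remove replica from node {from_node}");
    }

    fn move_region(&mut self, region: u64, from_node: u64, to_node: u64) {
        self.add_replica(region, to_node);
        self.transfer_leader(region, to_node);
        self.remove_replica(region, from_node);
    }
}

fn main() {
    Cluster.move_region(1, 1, 2);
}
```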

  32. How to ensure cross-region data consistency? 32

  33. Distributed Transaction (diagram): a transaction (Begin; Set a = 1; Set b = 2; Commit) touches keys that live in different Raft Groups, e.g. a in Region 1 and b in Region 2 33

  34. Transaction in TiKV ● Optimized two-phase commit, inspired by Google Percolator ● Multi-version concurrency control ● Optimistic Commit ● Snapshot Isolation ● Use the Timestamp Oracle to allocate unique timestamps for transactions 34
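
A heavily simplified, in-memory sketch of the Percolator-style commit described here: prewrite locks every key of the transaction at start_ts, then commit writes commit records at commit_ts and releases the locks. Real TiKV keeps data/lock/write columns in RocksDB column families and gets timestamps from PD's Timestamp Oracle; the types below are illustrative only:

```rust
use std::collections::HashMap;

#[derive(Default)]
struct Store {
    data: HashMap<(String, u64), String>, // (key, start_ts) -> value
    lock: HashMap<String, (u64, String)>, // key -> (start_ts, primary key)
    write: HashMap<(String, u64), u64>,   // (key, commit_ts) -> start_ts
}

impl Store {
    // Phase 1: lock every key of the transaction at start_ts.
    // Fails on a write-write conflict (some key is already locked).
    fn prewrite(&mut self, start_ts: u64, primary: &str,
                mutations: &[(String, String)]) -> bool {
        if mutations.iter().any(|(k, _)| self.lock.contains_key(k)) {
            return false; // conflict: the client retries or resolves locks
        }
        for (k, v) in mutations {
            self.data.insert((k.clone(), start_ts), v.clone());
            self.lock.insert(k.clone(), (start_ts, primary.to_string()));
        }
        true
    }

    // Phase 2: write a commit record and remove the lock for each key.
    // Committing the primary key is the atomic commit point.
    fn commit(&mut self, start_ts: u64, commit_ts: u64, keys: &[String]) {
        for k in keys {
            let locked_by_this_txn =
                matches!(self.lock.get(k), Some((ts, _)) if *ts == start_ts);
            if locked_by_this_txn {
                self.write.insert((k.clone(), commit_ts), start_ts);
                self.lock.remove(k);
            }
        }
    }
}

fn main() {
    let mut store = Store::default();
    let muts: Vec<(String, String)> =
        vec![("a".into(), "1".into()), ("b".into(), "2".into())];
    assert!(store.prewrite(10, "a", &muts));
    store.commit(10, 11, &["a".to_string(), "b".to_string()]);
    println!("committed: {:?}", store.write);
}
```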

  35. Percolator Optimization ● Use a latch on TiDB to support pessimistic commit ● Concurrent Prewrite ○ We are formally proving it with TLA+ 35

  36. How to communicate with each other? RPC Framework! 36

  37. Hello gRPC 37

  38. Why gRPC? ● Widely used ● Supported by many languages ● Works with Protocol Buffers and FlatBuffers ● Rich interface ● Benefits from HTTP/2 38

  39. TiKV Stack (diagram): a Client talks to each TiKV Instance over gRPC; inside every instance the layers are gRPC → Txn KV API → Transaction → Raft → RocksDB, and the Raft layers of the instances form Raft Groups across the cluster 39

  40. How to manage 100+ nodes? 40

  41. Scheduler in TiKV - "We are Gods!!!" (diagram): a cluster of Placement Drivers (PD) oversees all the TiKV nodes 41

  42. Scheduler - How (diagram): each TiKV node sends Store Heartbeats and Region Heartbeats to the PD leader; PD replies with Schedule Operators such as Add Replica, Remove Replica, Transfer Leader, ... 42
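
A hedged sketch of this heartbeat/operator loop: TiKV reports heartbeats, and PD may answer a Region heartbeat with a schedule operator. The types and the decision rule below are hypothetical placeholders, not PD's real API:

```rust
#[derive(Debug)]
struct RegionHeartbeat {
    region_id: u64,
    leader_store: u64,
    approximate_size_mb: u64,
}

// The kinds of operators shown on the slide.
#[derive(Debug)]
enum Operator {
    AddReplica { store: u64 },
    RemoveReplica { store: u64 },
    TransferLeader { to_store: u64 },
}

// PD-side handler: look at cluster state and decide whether this Region
// should be rebalanced right now (toy rule for illustration).
fn handle_heartbeat(hb: &RegionHeartbeat, most_loaded_store: u64,
                    least_loaded_store: u64) -> Option<Operator> {
    if hb.leader_store == most_loaded_store && hb.approximate_size_mb > 0 {
        // Move load away from the busiest store.
        return Some(Operator::TransferLeader { to_store: least_loaded_store });
    }
    None
}

fn main() {
    let hb = RegionHeartbeat { region_id: 1, leader_store: 3, approximate_size_mb: 64 };
    println!("{:?}", handle_heartbeat(&hb, 3, 1));
}
```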

  43. Scheduler - Goal ● Make the load and data size balanced ● Avoid hotspot performance issues 43

  44. Scheduler - Region Count Balance (diagram): assuming the Regions are about the same size, PD spreads R1-R6 so that each node holds roughly the same number of Regions 44

  45. Scheduler - Region Count Balance (diagram): but Regions’ sizes are not the same, e.g. R1, R2, R3 at 0 MB, R4 and R5 at 64 MB, R6 at 96 MB, so an equal count can still mean an unequal amount of data 45

  46. Scheduler - Region Size Balance (diagram): use the Regions’ sizes for the calculation instead, so the empty Regions (R1, R2, R3 at 0 MB) share one node, R4 and R5 (64 MB each) share another, and R6 (96 MB) gets its own node 46
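
A toy sketch of size-based balance: pick the store with the most data and the one with the least, and only move a Region between them when the gap exceeds a threshold. The selection rule and threshold are simplified stand-ins for PD's real scoring:

```rust
// Returns Some((source_store, target_store)) if a move is worthwhile.
fn pick_balance_move(store_sizes_mb: &[(u64, u64)]) -> Option<(u64, u64)> {
    let &(src, max) = store_sizes_mb.iter().max_by_key(|(_, size)| *size)?;
    let &(dst, min) = store_sizes_mb.iter().min_by_key(|(_, size)| *size)?;
    // Only move data when the imbalance is big enough (here roughly one
    // default-sized Region of ~96 MB) to avoid thrashing.
    if max - min > 96 {
        Some((src, dst))
    } else {
        None
    }
}

fn main() {
    // (store_id, total_region_size_mb), loosely matching the slide where one
    // node ends up with the empty Regions and another with 64 + 96 MB.
    let sizes: [(u64, u64); 3] = [(1, 0), (2, 64), (3, 160)];
    println!("{:?}", pick_balance_move(&sizes)); // Some((3, 1))
}
```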

  47. Scheduler - Region Size Balance (diagram): size balance is still not enough when some Regions are very hot for Read/Write while others are normal or cold 47

  48. Scheduler - Hot Balance (diagram): TiKV reports each Region's Read/Write traffic to PD, and PD spreads the hot Regions across different nodes 48

  49. Scheduler - More ● More balance policies… ○ Weight Balance - a high-weight TiKV node stores more data ○ Evict Leader Balance - some TiKV nodes must not hold any Raft leaders ● OpInfluence - avoid overly frequent rebalancing 49

  50. Geo-Replication 50

  51. Scheduler - Cross DC (diagram): placing two replicas of a Region in the same DC (top) risks losing Raft quorum when that DC goes down; instead the scheduler places at most one replica of each Region per DC, spread across racks (bottom) 51
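
A small sketch of the placement rule behind this slide: with three replicas, no single DC may hold a Raft majority, so losing one DC never loses quorum. The check below is illustrative, not PD's actual placement-rule engine:

```rust
use std::collections::HashMap;

// Returns true if no DC holds a majority of the Region's replicas.
fn placement_ok(replica_dcs: &[&str], replica_count: usize) -> bool {
    let mut per_dc: HashMap<&str, usize> = HashMap::new();
    for &dc in replica_dcs {
        *per_dc.entry(dc).or_insert(0) += 1;
    }
    let majority = replica_count / 2 + 1;
    per_dc.values().all(|&n| n < majority)
}

fn main() {
    // Bad: two of the three replicas sit in the same DC.
    assert!(!placement_ok(&["dc1", "dc1", "dc2"], 3));
    // Good: one replica per DC, the layout the scheduler aims for.
    assert!(placement_ok(&["dc1", "dc2", "dc3"], 3));
    println!("placement checks passed");
}
```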

  52. Scheduler - three DCs in two cities (diagram): DC - Seattle 1, DC - Seattle 2, and DC - Santa Clara each hold replicas of R1 and R2 across their racks, so every Region keeps one replica per DC 52

  53. Going beyond TiKV 53

  54. TiDB HTAP Solution (architecture diagram, with a PD cluster coordinating the stack) 54

  55. Cloud-Native KV (architecture diagram) 55

  56. Who’s Using TiKV Now? 56

  57. To sum up, TiKV is ... ● An open-source, unifying distributed storage layer that supports: ○ Strong consistency ○ ACID compliance ○ Horizontal scalability ○ Cloud-native architecture ● A building block that simplifies building other systems ○ So far: TiDB (MySQL), TiSpark (SparkSQL), Toutiao.com (metadata service for their own S3), Ele.me (Redis Protocol Layer) ○ The sky is the limit! 57

  58. Thank you! Email: tl@pingcap.com Github: siddontang Twitter: @siddontang; @pingcap 58
