Building a Transactional Key-Value Store That Scales to 100+ Nodes
Siddon Tang at PingCAP (Twitter: @siddontang; @pingcap)
About Me
● Chief Engineer at PingCAP
● Leader of the TiKV project
● My other open-source projects:
○ go-mysql
○ go-mysql-elasticsearch
○ LedisDB
○ raft-rs
○ etc.
Agenda
● Why did we build TiKV?
● How do we build TiKV?
● Going beyond TiKV
Why? Is it worthwhile to build another Key-Value store?
We want to build a distributed relational database to solve the scaling problem of MySQL!!!
Inspired by Google F1 + Spanner
[Diagram: MySQL clients talk to TiDB just as clients talk to F1; TiDB runs on TiKV just as F1 runs on Spanner.]
How?
A High Building, A Low Foundation
What we need to build...
1. A high-performance Key-Value engine to store data
2. A consensus model to ensure data consistency across machines
3. A transaction model to meet ACID compliance across machines
4. A network framework for communication
5. A scheduler to manage the whole cluster
Choose a Language!
Hello Rust
Rust...?
Rust - Cons (2 years ago):
● Makes you think differently
● Long compile time
● Lack of libraries and tools
● Few Rust programmers
● Uncertain future
[Chart: the Rust learning curve over time]
Rust - Pros:
● Blazing fast
● Memory safety
● Thread safety
● No GC
● Fast FFI
● Vibrant package ecosystem
Let’s start from the beginning!
Key-Value Engine
Why RocksDB?
● High write/read performance
● Stability
● Easy to embed in Rust
● Rich functionality
● Continuous development
● Active community
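To make "easy to embed" concrete, here is a minimal sketch of using RocksDB from Rust via the `rocksdb` crate. TiKV wraps RocksDB behind its own engine layer with column families and tuned options, so this shows only the basic open/put/get cycle, not TiKV's actual code.

```rust
// Minimal sketch: embedding RocksDB in a Rust program with the
// `rocksdb` crate. Illustrative only.
use rocksdb::{Options, DB};

fn main() {
    let mut opts = Options::default();
    opts.create_if_missing(true);

    // Open (or create) a database in the ./kv-data directory.
    let db = DB::open(&opts, "kv-data").expect("open rocksdb");

    db.put(b"a", b"1").expect("put");
    match db.get(b"a").expect("get") {
        Some(v) => println!("a = {}", String::from_utf8_lossy(&v)),
        None => println!("a not found"),
    }
}
```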
RocksDB stores the data on a single machine. We need fault tolerance across machines.
Consensus Algorithm
Raft - Roles
● Leader
● Follower
● Candidate
Raft - Election
[State diagram]
● Start as Follower.
● Follower → Candidate: election timeout; start a new election.
● Candidate → Candidate: election timeout; re-campaign.
● Candidate → Leader: receive votes from the majority.
● Candidate → Follower: find a leader or receive a higher-term message.
● Leader → Follower: receive a higher-term message.
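As a rough illustration of these transitions, here is a toy Rust state machine for the three roles. All names are mine for illustration; TiKV actually uses the raft-rs crate, which is far more involved than this.

```rust
// Toy sketch of the Raft role transitions from the slide above.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Role {
    Follower,
    Candidate,
    Leader,
}

#[derive(Debug)]
enum Event {
    ElectionTimeout,         // no heartbeat from a leader in time
    ReceivedMajorityVotes,   // won the election
    FoundLeaderOrHigherTerm, // saw a valid leader or a higher-term message
}

fn step(role: Role, event: Event) -> Role {
    match (role, event) {
        // Follower times out and campaigns.
        (Role::Follower, Event::ElectionTimeout) => Role::Candidate,
        // Candidate re-campaigns on a split vote.
        (Role::Candidate, Event::ElectionTimeout) => Role::Candidate,
        // Candidate wins a majority and becomes leader.
        (Role::Candidate, Event::ReceivedMajorityVotes) => Role::Leader,
        // Any role steps down on a higher term / discovered leader.
        (_, Event::FoundLeaderOrHigherTerm) => Role::Follower,
        // Everything else leaves the role unchanged.
        (role, _) => role,
    }
}
```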
Raft - Log Replicated State Machine
[Diagram: a client sends commands to the leader's Raft module; the log entries (a ← 1, b ← 2) are replicated to every node's log and applied in order to each state machine, so all replicas converge to a = 1, b = 2.]
Raft - Optimization
● Leader appends logs and sends msgs in parallel
● Prevote
● Pipeline
● Batch
● Learner
● Lease-based Read
● Follower Read
A single Raft group can’t manage a huge dataset. So we need Multi-Raft!!!
Multi-Raft: Data Sharding
● Hash sharding: a key hash maps the dataset onto chunks (Chunk 1, Chunk 2, Chunk 3).
● Range sharding (TiKV): the key space is split into contiguous ranges: (-∞, a), [a, b), [b, +∞).
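A sketch of what range-sharding lookup can look like: with ranges keyed by start key in an ordered map, the Region owning a key is the last Region whose start key is not greater than it. This is illustrative only; in TiKV the routing table lives in PD and clients cache it.

```rust
use std::collections::BTreeMap;

// Illustrative range-sharding lookup. Each Region owns the range
// [start_key, end_key), so the owner of `key` is the last Region
// whose start key is <= key.
fn region_for(regions: &BTreeMap<Vec<u8>, u64>, key: &[u8]) -> Option<u64> {
    regions.range(..=key.to_vec()).next_back().map(|(_, &id)| id)
}

fn main() {
    let mut regions = BTreeMap::new(); // start_key -> region id
    regions.insert(b"".to_vec(), 1);   // Region 1: (-inf, "a")
    regions.insert(b"a".to_vec(), 2);  // Region 2: ["a", "b")
    regions.insert(b"b".to_vec(), 3);  // Region 3: ["b", +inf)

    assert_eq!(region_for(&regions, b"apple"), Some(2));
    assert_eq!(region_for(&regions, b"zoo"), Some(3));
}
```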
Multi-Raft in TiKV
[Diagram: range sharding. Range A - B is Region 1, B - C is Region 2, C - D is Region 3. Each Region is replicated on Node 1, Node 2, and Node 3, and each Region's replicas form their own Raft group.]
Multi-Raft: Split and Merge
[Diagram: Split divides Region A into Region A and Region B; Merge combines the adjacent Regions A and B back into one Region.]
Multi-Raft: Scalability - How to Move Region A from Node 1 to Node 2?
1. Add Replica: create a new replica of Region A on Node 2.
2. Transfer Leader: move Region A's Raft leadership to the new replica.
3. Remove Replica: delete the old replica of Region A from Node 1.
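Putting the three steps together, here is a hedged sketch of the move as a sequence of operators. The ordering is the point: add the new replica first and remove the old one last, so the Raft group never loses its quorum. The types are illustrative, not PD's actual API.

```rust
// Illustrative operator sequence for moving a Region between stores.
#[derive(Debug)]
enum Step {
    AddReplica { region: u64, to_store: u64 },
    TransferLeader { region: u64, to_store: u64 },
    RemoveReplica { region: u64, from_store: u64 },
}

fn move_region(region: u64, from_store: u64, to_store: u64, leader_on_from: bool) -> Vec<Step> {
    // 1. Grow the group before shrinking it, to keep quorum.
    let mut steps = vec![Step::AddReplica { region, to_store }];
    if leader_on_from {
        // 2. A leader cannot remove itself; hand off leadership first.
        steps.push(Step::TransferLeader { region, to_store });
    }
    // 3. Only now is it safe to drop the old replica.
    steps.push(Step::RemoveReplica { region, from_store });
    steps
}
```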
How to ensure cross-region data consistency?
Distributed Transaction
[Diagram: a single transaction - Begin; Set a = 1; Set b = 2; Commit - where a lives in Region 1 and b in Region 2, so its writes span two Raft groups.]
Transaction in TiKV
● Optimized two-phase commit, inspired by Google Percolator
● Multi-version concurrency control
● Optimistic commit
● Snapshot isolation
● Uses the Timestamp Oracle to allocate unique timestamps for transactions
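To show the shape of the protocol, here is a heavily simplified Percolator-style prewrite/commit over an in-memory map. `start_ts` and `commit_ts` stand in for timestamps handed out by the Timestamp Oracle; real TiKV adds a write column, conflict checks against newer committed versions, rollback, and crash recovery, none of which this sketch attempts.

```rust
use std::collections::HashMap;

// Toy Percolator-style two-phase commit, for illustration only.
#[derive(Default)]
struct Store {
    // key -> (primary key of the transaction, start_ts) while locked
    locks: HashMap<Vec<u8>, (Vec<u8>, u64)>,
    // (key, ts) -> value: staged at start_ts, committed at commit_ts
    versions: HashMap<(Vec<u8>, u64), Vec<u8>>,
}

impl Store {
    // Phase 1 (prewrite): lock the key at start_ts and stage the value.
    // Fails if another transaction holds the lock. (Real Percolator also
    // rejects the prewrite if a newer version committed after start_ts.)
    fn prewrite(&mut self, primary: &[u8], key: &[u8], value: &[u8], start_ts: u64) -> bool {
        if self.locks.contains_key(key) {
            return false;
        }
        self.locks.insert(key.to_vec(), (primary.to_vec(), start_ts));
        self.versions.insert((key.to_vec(), start_ts), value.to_vec());
        true
    }

    // Phase 2 (commit): re-stamp the staged value at commit_ts and
    // release the lock, making the write visible to later snapshots.
    fn commit(&mut self, key: &[u8], start_ts: u64, commit_ts: u64) {
        if let Some(value) = self.versions.remove(&(key.to_vec(), start_ts)) {
            self.versions.insert((key.to_vec(), commit_ts), value);
        }
        self.locks.remove(key);
    }
}
```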
Percolator Optimization
● Use a latch on TiDB to support pessimistic commit
● Concurrent Prewrite
○ We are formally proving it with TLA+
How to communicate with each other? RPC Framework!
Hello gRPC
Why gRPC?
● Widely used
● Supported by many languages
● Works with Protocol Buffers and FlatBuffers
● Rich interface
● Benefits from HTTP/2
TiKV Stack
[Diagram: clients connect over gRPC. Each TiKV instance is layered, top to bottom: gRPC → Txn KV API → Transaction → Raft → RocksDB. The Raft layers of the instances form Raft groups across machines.]
How to manage 100+ nodes?
Scheduler in TiKV - We are Gods!!!
[Diagram: a cluster of PD (Placement Driver) nodes watches over all the TiKV nodes.]
Scheduler - How
[Diagram: each TiKV node sends Store Heartbeats and Region Heartbeats to the PD cluster; PD replies with schedule operators such as Add Replica, Remove Replica, Transfer Leader, ...]
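A sketch of the decision made on each Region heartbeat, with all names invented for illustration: the heartbeat carries the Region's current replicas, and the scheduler answers with at most one operator per round.

```rust
// Illustrative heartbeat-driven scheduling; PD's real API differs.
struct RegionHeartbeat {
    region_id: u64,
    leader_store: u64,
    replica_stores: Vec<u64>,
}

#[derive(Debug)]
enum Operator {
    AddReplica { region: u64, store: u64 },
    RemoveReplica { region: u64, store: u64 },
    TransferLeader { region: u64, store: u64 },
}

// `spare_store` stands in for PD's store-selection logic.
fn handle_heartbeat(hb: &RegionHeartbeat, desired_replicas: usize, spare_store: u64) -> Option<Operator> {
    if hb.replica_stores.len() < desired_replicas {
        // Under-replicated: add a replica on a suitable store.
        return Some(Operator::AddReplica { region: hb.region_id, store: spare_store });
    }
    if hb.replica_stores.len() > desired_replicas {
        // Over-replicated: drop a replica that is not the leader.
        let victim = hb.replica_stores.iter().copied().find(|&s| s != hb.leader_store)?;
        return Some(Operator::RemoveReplica { region: hb.region_id, store: victim });
    }
    None // healthy: nothing to schedule this round
}
```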
Scheduler - Goal
● Keep the load and data size balanced
● Avoid hotspot performance issues
Scheduler - Region Count Balance
Assume the Regions are about the same size.
[Diagram: Regions R1-R6 start unevenly placed and are rebalanced so each node holds the same number of Regions.]
Scheduler - Region Count Balance
But the Regions' sizes are not the same:
[Diagram: R1 - 0 MB, R2 - 0 MB, R3 - 0 MB, R4 - 64 MB, R5 - 64 MB, R6 - 96 MB; equal counts can still mean very unequal data.]
Scheduler - Region Size Balance
Use size for the calculation instead:
[Diagram: the same Regions redistributed so each node holds roughly the same total size, not the same count.]
Scheduler - Region Size Balance
But some Regions are very hot for reads/writes:
[Diagram: Regions R1-R6 labeled Hot, Normal, or Cold; size balance alone ignores traffic.]
Scheduler - Hot Balance
TiKV reports Region read/write traffic to PD.
[Diagram: hot Regions are spread across nodes so no single node absorbs most of the traffic.]
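A sketch of the size-based balancing these slides build up to: score each store by its total Region size and move a Region from the most-loaded to the least-loaded store when the gap exceeds a tolerance. PD's real scheduler also weighs the reported read/write traffic, store weights, and in-flight operators (OpInfluence); everything below is illustrative.

```rust
// Illustrative size-based balance scoring.
struct StoreLoad {
    store_id: u64,
    region_sizes_mb: Vec<(u64, u64)>, // (region_id, size in MB)
}

fn total_mb(s: &StoreLoad) -> u64 {
    s.region_sizes_mb.iter().map(|&(_, sz)| sz).sum()
}

// Returns (region_id, from_store, to_store), or None if balanced enough.
fn pick_move(stores: &[StoreLoad], tolerance_mb: u64) -> Option<(u64, u64, u64)> {
    let src = stores.iter().max_by_key(|s| total_mb(s))?;
    let dst = stores.iter().min_by_key(|s| total_mb(s))?;
    let gap = total_mb(src) - total_mb(dst);
    if gap <= tolerance_mb {
        return None;
    }
    // Pick the largest Region whose move narrows the gap without
    // overshooting (moving size sz changes the gap by 2 * sz).
    let &(region, _) = src
        .region_sizes_mb
        .iter()
        .filter(|&&(_, sz)| sz * 2 <= gap)
        .max_by_key(|&&(_, sz)| sz)?;
    Some((region, src.store_id, dst.store_id))
}
```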
Scheduler - More
● More balances...
○ Weight Balance - a high-weight TiKV node stores more data
○ Evict Leader Balance - some TiKV nodes must not hold any Raft leader
● OpInfluence - avoids overly frequent rebalancing
Geo-Replication
Scheduler - Cross DC
[Diagram: a bad layout places two replicas of R1 in the same DC; the scheduler moves replicas so each of R1's and R2's three replicas sits in a different DC and rack, and losing one DC or rack never costs a Raft majority.]
Scheduler - Three DCs in Two Cities
[Diagram: replicas of R1 and R2 placed across DC Seattle 1, DC Seattle 2, and DC Santa Clara, one replica per DC; the primed replicas (R1', R2') mark the copies kept in the remote city.]
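The placement constraint behind both slides can be sketched as a simple check, with illustrative types: every replica of a Region should sit in a distinct DC so that losing one failure domain cannot cost a Raft majority. PD expresses this with location labels (dc/rack/host); this sketch only checks the DC level.

```rust
use std::collections::HashSet;

// Illustrative placement check at the DC level.
#[allow(dead_code)] // rack kept to show the label hierarchy
struct Replica<'a> {
    dc: &'a str,
    rack: &'a str,
}

fn spans_distinct_dcs(replicas: &[Replica]) -> bool {
    let dcs: HashSet<&str> = replicas.iter().map(|r| r.dc).collect();
    dcs.len() == replicas.len()
}

fn main() {
    let replicas = [
        Replica { dc: "seattle-1", rack: "r1" },
        Replica { dc: "seattle-2", rack: "r1" },
        Replica { dc: "santa-clara", rack: "r2" },
    ];
    // Three replicas in three DCs: losing any one DC leaves a majority.
    assert!(spans_distinct_dcs(&replicas));
}
```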
Going beyond TiKV
TiDB HTAP Solution
[Architecture diagram: the stack coordinated by a PD cluster.]
Cloud-Native KV
Who’s Using TiKV Now?
To sum up, TiKV is ...
● An open-source, unifying distributed storage layer that supports:
○ Strong consistency
○ ACID compliance
○ Horizontal scalability
○ Cloud-native architecture
● A building block that simplifies building other systems
○ So far: TiDB (MySQL), TiSpark (SparkSQL), Toutiao.com (metadata service for their own S3), Ele.me (Redis Protocol Layer)
○ Sky is the limit!
Thank you!
Email: tl@pingcap.com
Github: siddontang
Twitter: @siddontang; @pingcap