Database Replication in Tashkent
CSEP 545 Transaction Processing
Sameh Elnikety
Replication for Performance
• A single large machine: expensive
• Replication: limited scalability
DB Replication is Challenging
• Single database system
  – Large, persistent state
  – Transactions
  – Complex software
• Replication challenges
  – Maintain consistency
  – Middleware replication
Background
[Diagram: a standalone DBMS (Replica 1)]
Background
[Diagram: a load balancer in front of Replica 1, Replica 2, and Replica 3]
Read Tx
• A read tx does not change DB state
[Diagram: the load balancer sends read tx T to a single replica]
Update Tx 1/2
• An update tx changes DB state
• Apply (or commit) its writeset (ws) everywhere
• Example: T1 : { set x = 1 } (see the writeset sketch below)
[Diagram: T executes at one replica; its writeset ws is propagated to all replicas]
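A writeset can be thought of as the list of row changes the update tx produced, shipped to the other replicas instead of re-executing the SQL there. A minimal sketch of that idea for the slide's example; the class and function names are illustrative, not the Tashkent middleware's API:

```python
from dataclasses import dataclass, field

@dataclass
class WriteSet:
    """Row-level changes produced by one update transaction."""
    tx_id: int
    changes: dict = field(default_factory=dict)   # key -> new value

# The slide's example: T1 : { set x = 1 }
ws_t1 = WriteSet(tx_id=1, changes={"x": 1})

def apply_writeset(db_state: dict, ws: WriteSet) -> None:
    """Apply (or commit) the writeset at a replica: install the new values."""
    db_state.update(ws.changes)

replica_state = {"x": 0}
apply_writeset(replica_state, ws_t1)
assert replica_state == {"x": 1}
```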
Update Tx 2/2
• Concurrent update txs change DB state
• Commit updates in the same order at every replica (ordering done in the middleware)
• Example: T1 : { set x = 1 }, T2 : { set x = 7 } (see the in-order application sketch below)
[Diagram: the load balancer orders T1 and T2; every replica applies ws1 then ws2]
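Since T1 and T2 both write x, every replica must install their writesets in the same order to end with the same value. A small sketch of a replica applying writesets strictly in the global order assigned by the middleware; the names and sequence-number scheme are assumptions for illustration, not taken from the talk:

```python
import heapq

class OrderedApplier:
    """Buffer incoming writesets and apply them in global commit order."""

    def __init__(self, db_state):
        self.db_state = db_state
        self.next_seq = 1        # sequence number the replica may apply next
        self.pending = []        # min-heap of (seq, changes)

    def receive(self, seq, changes):
        heapq.heappush(self.pending, (seq, changes))
        # Apply every buffered writeset whose turn has come.
        while self.pending and self.pending[0][0] == self.next_seq:
            _, ready = heapq.heappop(self.pending)
            self.db_state.update(ready)
            self.next_seq += 1

state = {}
replica = OrderedApplier(state)
replica.receive(2, {"x": 7})     # T2's writeset arrives first but must wait
replica.receive(1, {"x": 1})     # T1 arrives; both are applied, in order
assert state == {"x": 7}         # every replica ends with the same value
```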
Sub-linear Scalability Wall
[Diagram: adding Replica 4 means every replica must also apply Replica 4's writesets; writeset traffic and commit work grow with the number of replicas]
This Talk
• General scaling techniques
  – Address fundamental bottlenecks
  – Synergistic, implemented in middleware
  – Evaluated experimentally
Super-linear Scalability
[Chart: throughput (TPS) by configuration — Single: 1X, Base: 7X, United: 12X, MALB: 25X, UF: 37X]
Big Picture: Let’s Oversimplify
Per-replica cost while the N-replica system serves N.R reading and N.U update work:

                              reading    update    logging
  Standalone DBMS             R          U         (own log)
  Replica 1/N (traditional)   R          U         (N-1).ws
  Replica 1/N (optimized)     R*         U*        (N-1).ws*

The optimized row is where the three techniques act: MALB reduces the reading term, Uniting O & D reduces the logging cost, and Update Filtering reduces the (N-1).ws term (see the note below).
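Reading the traditional row literally, the sub-linear wall from the earlier slide falls out of a back-of-the-envelope calculation (my reading, not a formula from the talk): each replica spends roughly R + U + (N-1).ws per transaction it serves, so

  speedup(N) ≈ N.(R + U) / (R + U + (N-1).ws)

which flattens as N grows unless the reading term, the logging cost, or the (N-1).ws term shrinks — exactly what MALB, Uniting O & D, and Update Filtering target.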
Key Points
1. Commit updates in order
   – Perform serial synchronous disk writes
   – Unite ordering and durability
2. Load balancing
   – Optimize for equal load: memory contention
   – MALB: optimize for in-memory execution
3. Update propagation
   – Propagate updates everywhere
   – Update filtering: propagate to where needed
Roadmap
• Commit updates in order
• Load balancing
• Update propagation
[Diagram: load balancer with ordering, three replicas, Tx A]
Key Idea
• Traditionally:
  – Commit ordering and durability are separated
• Key idea:
  – Unite commit ordering and durability
All Replicas Must Agree
• All replicas agree on
  – which update txs commit
  – their commit order
• Total order
  – Determined by middleware
  – Followed by each replica
[Diagram: Tx A and Tx B made durable at Replica 1, 2, and 3]
Order Outside DBMS
[Diagram: the middleware ordering component assigns the total order A, B; each replica receives A and B and must make them durable in that order]
Enforce External Commit Order
[Diagram: at Replica 3, the proxy receives the order A, B and runs Task A (Tx A) and Task B (Tx B) against the DBMS SQL interface; if both commit concurrently, the DBMS may make B durable before A]
• Cannot commit A & B concurrently!
Enforce Order = Serial Commit
[Diagram: the proxy commits A, waits until A is durable, then commits B — commits are serialized (see the sketch below)]
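One way the proxy could enforce the external order is to serialize commits: hold each task until every earlier transaction in the total order is durably committed. A sketch under that assumption; the helper names are hypothetical, and the real proxy drives the DBMS through its SQL interface:

```python
import threading

class SerialCommitProxy:
    """Enforce the middleware-assigned total order by committing one tx at a time."""

    def __init__(self):
        self.turn = threading.Condition()
        self.committed_up_to = 0       # highest position in the total order that is durable

    def commit_in_order(self, order_pos, do_commit):
        """do_commit() issues COMMIT to the DBMS and returns once it is durable."""
        with self.turn:
            # Wait until every earlier transaction in the total order is durable.
            while self.committed_up_to != order_pos - 1:
                self.turn.wait()
            do_commit()                # synchronous disk write inside the DBMS
            self.committed_up_to = order_pos
            self.turn.notify_all()

# Task B calls commit_in_order(2, commit_b) and blocks until
# Task A's commit_in_order(1, commit_a) has made A durable.
```

The next slide shows why this is slow: each do_commit() is a synchronous disk write, and they now happen strictly one after another.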
Commit Serialization is Slow
• Problem: durability & ordering are separated → serial disk writes
[Diagram: commit order A, B, C; the proxy issues Commit A, Commit B, Commit C one at a time; each commit is a short CPU step followed by a synchronous durability (disk) write, and the next commit cannot start earlier]
Unite D. & O. in Middleware
• Solution: move durability to the middleware
  – Durability & ordering united in the middleware → group commit
  – Durability turned OFF at the database
[Diagram: the middleware logs A, B, C in commit order with one durability step; the DBMS commits A, B, C as CPU-only operations]
Implementation: Uniting D & O in MW
• Middleware logs tx effects
  – Durability of update txs guaranteed in middleware
  – Turn durability off at the database
• Middleware performs durability & ordering
  – United → group commit → fast
• Database commits update txs serially
  – Commit = quick main-memory operation
(a sketch of the group-commit log follows below)
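A minimal sketch of what "durability & ordering united in the middleware" could look like: the middleware appends writesets to its own log in commit order and makes a whole batch durable with one fsync (group commit). The file format and interface here are assumptions for illustration, not the actual implementation:

```python
import json, os, threading

class MiddlewareLog:
    """Durability and ordering in one place: ordered writesets, group-committed."""

    def __init__(self, path):
        self.f = open(path, "ab")
        self.lock = threading.Lock()
        self.batch = []                 # (seq, changes) waiting for the next flush

    def append(self, seq, changes):
        """Record an update tx's effects in commit order; cheap, no disk write yet."""
        with self.lock:
            self.batch.append((seq, changes))

    def flush(self):
        """Group commit: one synchronous disk write makes the whole batch durable."""
        with self.lock:
            for seq, changes in self.batch:
                record = json.dumps({"seq": seq, "ws": changes})
                self.f.write(record.encode() + b"\n")
            self.batch.clear()
            self.f.flush()
            os.fsync(self.f.fileno())   # many txs, one fsync
```

With the database's own durability off, its commit reduces to the quick main-memory operation the slide mentions; recovery would replay this log.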
Uniting Improves Throughput
• Metric: throughput
• Workload: TPC-W Ordering (50% updates)
• System: Linux cluster, PostgreSQL, 16 replicas, serializable execution
[Chart: TPS — Single: 1X, Base: 7X, United: 12X]
Roadmap
• Commit updates in order
• Load balancing
• Update propagation
[Diagram: load balancer with ordering, three replicas, Tx A]
Key Idea
• Traditionally: equal load on replicas
• MALB (Memory-Aware Load Balancing): optimize for in-memory execution
[Diagram: load balancer in front of two replicas, each with memory and disk]
How Does MALB Work?
• Database: tables 1, 2, 3
• Workload: tx A → tables 1, 2; tx B → tables 2, 3
[Diagram: the database is larger than a replica's memory]
Read Data From Disk
• Least Loaded balancing sends the mix A, B, A, B to both replicas
• Each replica then needs tables 1, 2, and 3 together — they do not fit in memory → slow, data is read from disk
[Diagram: both replicas thrash between disk and memory]
Data Fits in Memory
• MALB sends A to Replica 1 (tables 1, 2 stay in memory) and B to Replica 2 (tables 2, 3 stay in memory) → fast, in-memory execution
• Remaining questions: where does the memory information come from, and what about many tx types and replicas?
Estimate Tx Memory Needs
• Exploit tx execution plan
  – Which tables & indices are accessed
  – Their access pattern: linear scan, direct access
• Metadata from database
  – Sizes of tables and indices
(a sketch of the estimate follows below)
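A sketch of the estimation idea: walk the tx's execution plan and, for each table or index, charge its full size for a linear scan or only a few pages for direct access, using size metadata from the database catalog. The structures and the 8 KB page assumption are illustrative, not the actual implementation:

```python
PAGE = 8192   # assumed page size: "direct access touches a few pages"

def estimate_memory_needs(plan_objects, sizes):
    """
    plan_objects: [(name, access_pattern)] read off the tx execution plan,
                  with access_pattern either "scan" or "direct".
    sizes:        {name: size_in_bytes} from the database catalog.
    Returns an estimate of the memory this tx type needs to run without disk reads.
    """
    total = 0
    for name, pattern in plan_objects:
        if pattern == "scan":
            total += sizes[name]               # a linear scan touches the whole object
        else:
            total += min(sizes[name], 4 * PAGE)   # direct access: a handful of pages
    return total

# Example: a tx type that scans table t1 and probes index i2
sizes = {"t1": 50_000_000, "i2": 8_000_000}
print(estimate_memory_needs([("t1", "scan"), ("i2", "direct")], sizes))  # ~50 MB
```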
Grouping Transactions
• Objective
  – Construct tx groups that fit together in memory
• Bin packing (see the sketch below)
  – Item: tx memory needs
  – Bin: memory of replica
  – Heuristic: Best Fit Decreasing
• Allocate replicas to tx groups
  – Adjust for group loads
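The grouping step can be read as classic bin packing with the Best Fit Decreasing heuristic: items are per-tx-type memory estimates, and the bin capacity is one replica's memory. A self-contained sketch (it omits the load-based adjustment mentioned in the last bullet, and the numbers in the example are invented):

```python
def group_transactions(mem_needs, replica_mem):
    """
    Best Fit Decreasing: place each tx type into the group whose remaining
    memory is smallest but still sufficient; open a new group otherwise.
    mem_needs:   {tx_name: estimated working-set size}
    replica_mem: memory available at one replica (the bin capacity)
    """
    groups = []                                   # each group: {"txs": [...], "free": capacity left}
    for tx, need in sorted(mem_needs.items(), key=lambda kv: kv[1], reverse=True):
        best = None
        for g in groups:
            if need <= g["free"] and (best is None or g["free"] < best["free"]):
                best = g
        if best is None:                          # nothing fits: open a new group
            best = {"txs": [], "free": replica_mem}
            groups.append(best)
        best["txs"].append(tx)
        best["free"] -= need
    return [g["txs"] for g in groups]

# Six tx types, one replica's memory = 100 (arbitrary units)
needs = {"A": 90, "B": 40, "C": 50, "D": 30, "E": 30, "F": 30}
print(group_transactions(needs, 100))   # [['A'], ['C', 'B'], ['D', 'E', 'F']]
```

With the memory needs assumed here, the groups come out as {A}, {B, C}, {D, E, F}, matching the shape of the grouping shown on the next slide.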
MALB in Action
• Input: memory needs for tx types A, B, C, D, E, F
• MALB constructs groups that fit in one replica's memory: {A}, {B, C}, {D, E, F}
• Replicas are allocated to the groups: one replica serves A, one serves B and C, one serves D, E, and F — each group's working set now fits in memory instead of spilling to disk
MALB Summary
• Objective
  – Optimize for in-memory execution
• Method
  – Estimate tx memory needs
  – Construct tx groups
  – Allocate replicas to tx groups
Experimental Evaluation
• Implementation
  – No change in consistency
  – Still implemented in middleware
• Compare
  – United: efficient baseline system
  – MALB: exploits working-set information
• Same environment
  – Linux cluster running PostgreSQL
  – Workload: TPC-W Ordering (50% update txs)
MALB Doubles Throughput
• TPC-W Ordering, 16 replicas
• MALB improves throughput by 105% over United (25X vs. 12X relative to a single machine)
[Charts: throughput (TPS) — Single: 1X, Base: 7X, United: 12X, MALB: 25X; normalized read I/O, United vs. MALB]
Big Gains with MALB
MALB throughput gain over United, by database size and replica memory size:

                   Mem Size:  Small    ...     Big
  DB Size: Big                 12%     75%    182%
                               45%    105%     48%
  DB Size: Small               29%      0%      4%

• Callouts on the slide: "run from disk" near the big-DB / small-memory corner; "run from memory" near the small-DB / big-memory corner