IN SUPPORT OF WORKLOAD-AWARE STREAMING STATE MANAGEMENT
Vasiliki Kalavri (vkalavri@bu.edu), John Liagouris (liagos@bu.edu)
HotStorage 2020, 14 July 2020
STREAMING DATAFLOWS
Nexmark Q4: “Rolling average of winning bids”
[Figure: logical dataflow (auctions source and bids source, join, rolling average, sink) and the corresponding physical dataflow on Worker 1 and Worker 2]
Nexmark Streaming Benchmark Suite: https://beam.apache.org/documentation/sdks/java/testing/nexmark/
LARGER-THAN-MEMORY STATE MANAGEMENT
[Figure: Worker 1 and Worker 2, each issuing put/get operations on <k,v> stores]
Large operator state is backed by key-value stores
Store: LSM-based, write-optimized, with efficient range scans
STATE REQUIREMENTS VARY ACROSS OPERATORS
Nexmark Q4: “Rolling average of winning bids”
[Figure: auctions source and bids source, join, rolling average, sink]
- Join: write-heavy and can potentially accumulate large state
- Average: read-modify-write on a single value
Dataflow operators may have different state access patterns and memory requirements
CURRENT PRACTICE: MONOLITHIC STATE MANAGEMENT
[Figure: Worker 1 and Worker 2, each with several <k,v> stores]
- One key-value store (RocksDB) per stateful operator instance
- All key-value stores in the dataflow are globally configured
FLAWS OF MONOLITHIC STATE MANAGEMENT
[Figure: Worker 1 and Worker 2, each with several <k,v> stores]
- Oblivious store configuration
- Unnecessary data marshaling
- Unnecessary key-value store features
UNNECESSARY KEY-VALUE STORE FEATURES
- State partitioning
- State scoping
- Concurrent access to state
- State checkpointing
All these operations are handled by modern stream processors outside the state store
Stream processors guarantee single-thread access to state
WORKLOAD-AWARE STREAMING STATE MANAGEMENT
[Figure: Worker 1 and Worker 2 with stores labeled store:<u64,auction>, store:<u64,bid>, and store:u64, accessed via put/get and rmw_u64]
Multiple state stores of different types and configurations, according to the requirements of the stateful operators
Streaming operators are instantiated once and are long-running: their access patterns and state sizes are largely known in advance
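Because operators are instantiated once, their state can in principle be declared up front. The following is a minimal sketch of such a declaration, not the testbed's API: the `StateSpec` class, the `plan` function, and the assignment of store labels to the join and average operators are assumptions made for illustration. Only the labels themselves (store:<u64,auction>, store:<u64,bid>, store:u64, put/get, rmw_u64) come from the diagram.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class StateSpec:
    """Hypothetical up-front declaration of one piece of operator state."""
    operator: str            # operator that owns the state
    key_type: Optional[str]  # None when the state is a single value
    value_type: str
    access: str              # "put/get" or "rmw_u64", as labeled in the diagram

    def label(self) -> str:
        # Render the spec using the notation from the diagram.
        if self.key_type is None:
            return f"store:{self.value_type}"
        return f"store:<{self.key_type},{self.value_type}>"

# Assumed reading of the diagram for Nexmark Q4.
Q4_STATE = [
    StateSpec("join", "u64", "auction", "put/get"),
    StateSpec("join", "u64", "bid", "put/get"),
    StateSpec("average", None, "u64", "rmw_u64"),
]

def plan(specs):
    """Group the declared stores by owning operator."""
    stores = {}
    for spec in specs:
        stores.setdefault(spec.operator, []).append(spec.label())
    return stores
```

With declarations like these available before the dataflow starts, each store could be given its own type and configuration instead of one global setting.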
A FLEXIBLE TESTBED FOR STREAMING STATE MANAGEMENT
- Implemented in Rust
- Based on the Timely Dataflow stream processor
- Supports two key-value stores
  - RocksDB: LSM-based with efficient range scans
  - FASTER: hybrid log with efficient lookups and in-place updates
- Supports different window evaluation strategies
Testbed: https://github.com/jliagouris/wassm
Timely Dataflow: https://github.com/TimelyDataflow/timely-dataflow
FASTER: https://github.com/microsoft/FASTER
EXPERIMENTAL RESULTS
EVALUATION GOALS
1. Study the effect of the backend’s data layout on the evaluation of streaming windows
2. Study the effect of workload-aware configuration on queries with multiple stateful operators
EFFECT OF DATA LAYOUT ON WINDOW EVALUATION
- Query 1 (COUNT-30s-1s): Count the number of records in a 30s window that slides every 1s
- Input rate: 10K records/s
- Single-thread execution
- Report end-to-end latency (ms) per record
[Figure: COUNT-30s-1s latency plot; legend and axis labels not recoverable]
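To make the COUNT-30s-1s workload concrete: with a 30s window that slides every 1s, each record falls into 30 overlapping windows. The sketch below computes them, assuming millisecond timestamps and windows aligned to multiples of the slide. The function name `window_starts` is hypothetical and is not taken from the testbed.

```python
def window_starts(ts_ms, size_ms=30_000, slide_ms=1_000):
    """Return the start times of all windows [start, start + size_ms) that contain ts_ms."""
    last_start = ts_ms - (ts_ms % slide_ms)        # latest aligned start <= ts_ms
    first_start = last_start - size_ms + slide_ms  # earliest start whose window still covers ts_ms
    # Clamp at 0 so that no window starts before the beginning of time.
    return list(range(max(first_start, 0), last_start + 1, slide_ms))

# A record at t = 45.5s belongs to the windows starting at 16s, 17s, ..., 45s.
starts = window_starts(45_500)
```

Under these assumptions, an implementation that keeps one state entry per window would update 30 entries for every record, which is why the store's update path matters for this query. Setting `size_ms` equal to `slide_ms` gives a tumbling window, where each record belongs to exactly one window.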
EFFECT OF DATA LAYOUT ON WINDOW EVALUATION
[Figure: COUNT-30s-1s latency plot with p90, p99, p99.9, … marked; lower is better]
Complementary CDF: each point (x, y) indicates that y% of the latency measurements are at least x ms
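The complementary CDF definition above can be computed directly from a list of latency measurements. This is a minimal sketch under that definition; the function name `ccdf` and the sample values are made up for illustration and are not measurements from the talk.

```python
from bisect import bisect_left

def ccdf(latencies_ms):
    """Return (x, y) points where y is the percentage of measurements that are at least x ms."""
    ordered = sorted(latencies_ms)
    n = len(ordered)
    points = []
    for x in sorted(set(ordered)):
        at_least = n - bisect_left(ordered, x)  # number of measurements >= x
        points.append((x, 100.0 * at_least / n))
    return points

# Illustrative sample: ten made-up latency values in milliseconds.
sample = [1, 1, 2, 2, 2, 3, 5, 8, 13, 100]
points = ccdf(sample)
# The last point is (100, 10.0): 10% of the measurements are at least 100 ms.
```

Read this way, the p90, p99, and p99.9 marks on the plot correspond to y = 10%, 1%, and 0.1%, so a curve that sits further to the left at those heights has lower tail latency.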
EFFECT OF DATA LAYOUT ON WINDOW EVALUATION
[Figure: COUNT-30s-1s latency plot; lower is better]
RocksDB PUT/GET: on each record, retrieve the window contents, apply the new record, and put the updated contents back to the store
EFFECT OF DATA LAYOUT ON WINDOW EVALUATION
[Figure: COUNT-30s-1s latency plot; lower is better]
RocksDB MERGE: on each record, put the record to the store using MERGE. The record is applied to the window contents lazily on trigger
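The difference between the PUT/GET and MERGE strategies can be illustrated with in-memory stand-ins. This sketch does not use the RocksDB API: `EagerStore` and `LazyStore` are hypothetical classes backed by Python dictionaries, and the read and write counters only count logical store operations.

```python
from collections import defaultdict

class EagerStore:
    """PUT/GET style: every record does a read-modify-write of the window contents."""
    def __init__(self):
        self.kv = {}
        self.reads = 0
        self.writes = 0

    def on_record(self, window, record):
        contents = self.kv.get(window, [])     # GET the current window contents
        self.reads += 1
        self.kv[window] = contents + [record]  # PUT the updated contents back
        self.writes += 1

    def on_trigger(self, window):
        self.reads += 1
        return self.kv.pop(window, [])

class LazyStore:
    """MERGE style: every record is written blindly; records are combined on trigger."""
    def __init__(self):
        self.operands = defaultdict(list)
        self.reads = 0
        self.writes = 0

    def on_record(self, window, record):
        self.operands[window].append(record)   # write only, no read on the record path
        self.writes += 1

    def on_trigger(self, window):
        self.reads += 1
        return self.operands.pop(window, [])   # apply all pending records at once

eager, lazy = EagerStore(), LazyStore()
for record in ["a", "b", "c", "d", "e"]:
    eager.on_record(0, record)
    lazy.on_record(0, record)
eager_result, lazy_result = eager.on_trigger(0), lazy.on_trigger(0)
```

Both strategies produce the same window contents on trigger. In this sketch the eager variant reads the store once per record, while the lazy variant defers all reads to the trigger.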
EFFECT OF DATA LAYOUT ON WINDOW EVALUATION
[Figure: COUNT-30s-1s latency plot, annotated "100X in p99"; lower is better]
FASTER performs better due to in-place updates
EFFECT OF DATA LAYOUT ON WINDOW EVALUATION
- Query 2 (RANK-30s-30s): Rank records in a 30s tumbling window
- Input rate: 1K records/s
- Single-thread execution
- Report end-to-end latency (ms) per record
[Figure: RANK-30s-30s latency plot; lower is better]
EFFECT OF DATA LAYOUT ON WINDOW EVALUATION
[Figure: RANK-30s-30s latency plot, annotated "100X" and "1000X"; lower is better]
RocksDB MERGE performs best due to lazy evaluation
THERE IS NO CLEAR WINNER
[Figure: COUNT-30s-1s latency plot (annotated "100X in p99") next to RANK-30s-30s latency plot (annotated "100X" and "1000X")]
MONOLITHIC VS WORKLOAD-AWARE STATE MANAGEMENT
- Experiments with six Nexmark* queries
- Different stateful operators (joins, window aggregations, custom aggregations)
- Simple workload-aware configuration of data types and available memory size
* Nexmark Streaming Benchmark Suite: https://beam.apache.org/documentation/sdks/java/testing/nexmark/
MONOLITHIC VS WORKLOAD-AWARE STATE MANAGEMENT
- State store used: FASTER
- Input rate: 10K records/s
- Single-thread execution
- Monolithic memory configuration: 8GB
- Workload-aware memory configuration: 6GB (bids), 1.5GB (auctions), 512MB (average)
- Report end-to-end latency (ms) per record
[Figure: Q4 (custom join and rolling aggregate) latency plot; legend and axis labels not recoverable]
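The workload-aware allocation can be checked with a little arithmetic. This sketch assumes binary units (1GB = 1024^3 bytes, 1MB = 1024^2 bytes), which the slide does not state; the dictionary and the helper `shares` are hypothetical names used only for illustration.

```python
MIB = 1024 ** 2
GIB = 1024 ** 3

monolithic_bytes = 8 * GIB

# Per-store memory from the workload-aware configuration (binary units assumed).
workload_aware_bytes = {
    "bids": 6 * GIB,
    "auctions": 3 * GIB // 2,  # 1.5GB
    "average": 512 * MIB,
}

def shares(budgets):
    """Return each store's fraction of the summed budget."""
    total = sum(budgets.values())
    return {name: size / total for name, size in budgets.items()}

total_bytes = sum(workload_aware_bytes.values())
fractions = shares(workload_aware_bytes)
```

Under the binary-unit assumption the three allocations add up to exactly 8GB, the same figure as the monolithic configuration, with 75% going to bids, 18.75% to auctions, and 6.25% to the average.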