PerfDebug: Performance Debugging of Computation Skew in Dataflow Systems
Jason Teoh, Muhammad Ali Gulzar, Harry Xu, Miryung Kim
University of California, Los Angeles

Motivating Example
A daily cron job runs anomaly detection over the server logs produced by a web server.
• Day 1 (20 GB): execution time 28 s
• Day 2 (20 GB): execution time 25 s
• Day 3 (20 GB): execution time 92 s
Why does my job run slowly for day 3's data?

Data Skew in Distributed Processing
[Figure: data unevenly distributed across Worker1, Worker2, and Worker3]
Uneven distribution of data across partitions, tasks, or workers can lead to performance delays.

Computation Skew
User-defined function:

    commonDefs = { "Hello World": ..., "Big Data": ..., "Debugging": ..., ... }
    if (commonDefs.contains(term)) {
      return commonDefs.get(term)
    } else {
      r = new RedisClient(…)
      return r.get(term)
    }

Term          Latency
Hello World   2 ms
Big Data      1 ms
Debugging     3 ms
PerfDebug     442 ms

Uneven distribution of computation due to interactions between data and application code.
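
The effect on this slide can be reproduced with a small sketch. It is not the talk's code: the dictionary contents, the `slow_remote_lookup` helper, and the 50 ms delay are hypothetical stand-ins for `commonDefs` and the Redis call. It only illustrates how a single record that misses the fast path dominates per-record latency.

```python
import time

# Hypothetical stand-in for commonDefs; the definitions are placeholders.
COMMON_DEFS = {
    "Hello World": "placeholder definition",
    "Big Data": "placeholder definition",
    "Debugging": "placeholder definition",
}


def slow_remote_lookup(term, delay_s=0.05):
    """Hypothetical stand-in for the remote lookup; sleeps to simulate its cost."""
    time.sleep(delay_s)
    return f"remote definition of {term}"


def define(term):
    """UDF with a fast in-memory path and a slow fallback path."""
    if term in COMMON_DEFS:
        return COMMON_DEFS[term]
    return slow_remote_lookup(term)


def profile_terms(terms):
    """Return the wall-clock latency of define() for each term, in milliseconds."""
    latencies = {}
    for term in terms:
        start = time.perf_counter()
        define(term)
        latencies[term] = (time.perf_counter() - start) * 1000.0
    return latencies


if __name__ == "__main__":
    for term, ms in profile_terms(
        ["Hello World", "Big Data", "Debugging", "PerfDebug"]
    ).items():
        print(f"{term:12s} {ms:8.2f} ms")
```

All four records have similar size, so partition-level data statistics would not reveal the slow one; only the per-record timing does.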

Computation Skew: Why is it challenging?
• Requires insight on how application code interacts with data.
• Occurs across multiple stages.
• Affected applications are inherently expensive to run.
• Isolating individual records that impact performance is difficult with existing tools.

Performance Debugging of Computation Skew: The PerfDebug Approach
Input: Spark program, input data
Output: Individual records responsible for computation skew
PerfDebug pipeline: Computation Skew Detection → Data Provenance + Record-Level Latency → Expensive Record Identification

Computation Skew Detection
• PerfDebug monitors task-level metrics such as latency, garbage collection, and serialization using the SparkListener API.
• If potential computation skew is found, PerfDebug reruns the user program in debugging mode to collect additional information.
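
The slide does not state the rule PerfDebug uses to decide that skew is present. As a minimal sketch under that caveat, one common heuristic is to flag tasks whose run time far exceeds the median of their peers. The `TaskMetrics` record, its field names, and the `factor` threshold below are all assumptions for illustration; they are not Spark's or PerfDebug's API.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class TaskMetrics:
    """Hypothetical per-task metrics, as a listener might collect them."""
    task_id: int
    run_ms: float
    gc_ms: float
    serialization_ms: float


def find_straggler_tasks(tasks, factor=2.0):
    """Return IDs of tasks whose run time exceeds factor x the median run time.

    This median-based rule is an assumed heuristic, not PerfDebug's actual test.
    """
    if not tasks:
        return []
    med = median(t.run_ms for t in tasks)
    return [t.task_id for t in tasks if t.run_ms > factor * med]


if __name__ == "__main__":
    observed = [
        TaskMetrics(0, 100.0, 5.0, 2.0),
        TaskMetrics(1, 110.0, 6.0, 2.0),
        TaskMetrics(2, 105.0, 4.0, 3.0),
        TaskMetrics(3, 900.0, 7.0, 2.0),
    ]
    print(find_straggler_tasks(observed))
```

A flagged task would then be the trigger for the more expensive debugging-mode rerun described above.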

Capture Data Provenance
Titian [VLDB 2016] provides data provenance using provenance tables at the start/end of stages to track input-output record mappings.

Stage 1: lines → map → reduceByKey (map-side)

Start of stage:
Input ID   Output ID
offset1    id1
offset2    id2
offset3    id3
…          …

End of stage:
Input ID     Output ID
{id1, id3}   (0, 100)
{id2}        (0, 200)
…            …

Stage 2: reduceByKey (reduce-side) → map

Start of stage:
Input ID   Output ID
(0, 100)   100
(1, 100)   100
(0, 200)   200
…          …

End of stage:
Input ID   Output ID
100        output1
200        output2
…          …
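
To make the role of the provenance tables concrete, the sketch below encodes the four tables from this slide as plain Python lists and walks them backward from a final output to the input offsets that produced it. This is an illustrative reimplementation, not Titian's API: the table names, the row layout, and the `step_back` and `trace_to_inputs` helpers are assumptions. Rows elided with "…" on the slide are left out, so an ID such as (1, 100) has no predecessor here.

```python
# Each row is (list of input IDs, output ID), copied from the slide's tables.
STAGE1_START = [(["offset1"], "id1"), (["offset2"], "id2"), (["offset3"], "id3")]
STAGE1_END = [(["id1", "id3"], (0, 100)), (["id2"], (0, 200))]
STAGE2_START = [([(0, 100)], 100), ([(1, 100)], 100), ([(0, 200)], 200)]
STAGE2_END = [([100], "output1"), ([200], "output2")]


def step_back(table, output_ids):
    """Return every input ID whose row maps to one of the given output IDs."""
    wanted = set(output_ids)
    inputs = set()
    for input_ids, output_id in table:
        if output_id in wanted:
            inputs.update(input_ids)
    return inputs


def trace_to_inputs(output_id):
    """Walk the tables from the last stage back to the first."""
    ids = {output_id}
    for table in (STAGE2_END, STAGE2_START, STAGE1_END, STAGE1_START):
        ids = step_back(table, ids)
    return ids


if __name__ == "__main__":
    print(sorted(trace_to_inputs("output1")))
    print(sorted(trace_to_inputs("output2")))
```

Joining the tables in the opposite order gives forward tracing, from an input offset to the outputs it contributed to.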

Measure UDF Latency
PerfDebug extends Titian by capturing summed UDF execution times. Each end-of-stage provenance table gains a UDF Latency column.

Stage 1: lines → map → reduceByKey (map-side)

End of stage:
Input ID     Output ID   UDF Latency
{id1, id3}   (0, 100)    7 + 3 = 10 ms
{id2}        (0, 200)    20 ms
…            …           …

Stage 2: reduceByKey (reduce-side) → map

End of stage:
Input ID   Output ID   UDF Latency
100        output1     30 ms
200        output2     40 ms
…          …           …
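
The Stage 1 arithmetic on this slide (7 ms + 3 ms = 10 ms) can be sketched as a sum over the input IDs of each provenance row. The code below is an assumption-laden illustration, not PerfDebug's implementation: it assumes the 7 ms and 3 ms figures belong to id1 and id3 respectively (the slide does not say which is which), takes 20 ms for id2 from the table, and uses hypothetical helper names.

```python
# Per-record UDF latencies in ms. Assigning 7 to id1 and 3 to id3 is an assumption.
RECORD_LATENCY_MS = {"id1": 7.0, "id3": 3.0, "id2": 20.0}

# End-of-stage table for Stage 1: (list of input IDs, output ID).
STAGE1_END = [(["id1", "id3"], (0, 100)), (["id2"], (0, 200))]


def summed_latency(table, latency_ms):
    """Attach to each output ID the sum of its inputs' UDF latencies."""
    return {
        output_id: sum(latency_ms[i] for i in input_ids)
        for input_ids, output_id in table
    }


def most_expensive_record(latency_ms):
    """Return the record ID with the largest UDF latency."""
    return max(latency_ms, key=latency_ms.get)


if __name__ == "__main__":
    print(summed_latency(STAGE1_END, RECORD_LATENCY_MS))
    print(most_expensive_record(RECORD_LATENCY_MS))
```

Keeping the latency next to the provenance mapping is what lets a slow output be attributed to specific input records rather than to a whole task.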