
PerfDebug: Performance Debugging of Computation Skew in Dataflow Systems



  1. PerfDebug: Performance Debugging of Computation Skew in Dataflow Systems. Jason Teoh, Muhammad Ali Gulzar, Harry Xu, Miryung Kim. University of California, Los Angeles.

  2. Motivating Example. [Diagram labels: Web Server, Server Logs, Cron, Anomaly Detection. Day 1: 20GB.]

  3. Motivating Example. Day 1 (20GB): execution time 28 s.

  4. Motivating Example. Day 1 (20GB): execution time 28 s. Day 2 (20GB): execution time 25 s.

  5-6. Motivating Example. Day 1 (20GB): execution time 28 s. Day 2 (20GB): execution time 25 s. Day 3 (20GB): execution time 92 s.

  7. Motivating Example. Day 1 (20GB): execution time 28 s. Day 2 (20GB): execution time 25 s. Day 3 (20GB): execution time 92 s. Why does my job run slowly for day 3’s data?

  8. Data Skew in Distributed Processing. [Diagram: Worker1, Worker2, Worker3.] Uneven distribution of data across partitions, tasks, or workers can lead to performance delays.
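One rough way to quantify the data skew described on this slide is to compare the largest partition with the average one. The sketch below is illustrative only: the record counts are made up, and the max-over-mean ratio is an assumed metric, not one named in the slides.

```python
from statistics import mean

def skew_ratio(partition_sizes):
    """Return the largest partition size divided by the mean partition size."""
    return max(partition_sizes) / mean(partition_sizes)

# Hypothetical record counts for three workers (illustrative only).
balanced = [100, 100, 100]
skewed = [20, 30, 250]

print(skew_ratio(balanced))  # 1.0
print(skew_ratio(skewed))    # 2.5
```

A ratio near 1 means the workers hold similar amounts of data; a larger ratio means one worker is likely to finish well after the others.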

  9. Computation Skew. Uneven distribution of computation due to interactions between data and application code.

     User-defined function:

         commonDefs = { “Hello World”: ..., “Big Data”: ..., “Debugging”: ..., ... }
         if (commonDefs.contains(term)) {
           return commonDefs.get(term)
         } else {
           r = new RedisClient(…)
           return r.get(term)
         }

     Latency per term: Hello World 2 ms; Big Data 1 ms; Debugging 3 ms; PerfDebug 442 ms.
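The effect on this slide can be reproduced in a small, self-contained sketch. Everything here is a stand-in: `COMMON_DEFS` holds made-up definitions, and `slow_remote_lookup` is a hypothetical replacement for the `RedisClient` call that simply sleeps to mimic a network round trip.

```python
import time

# Stand-in for the in-memory dictionary on the slide (contents are made up).
COMMON_DEFS = {
    "Hello World": "a greeting program",
    "Big Data": "large data sets",
    "Debugging": "finding and fixing defects",
}

def slow_remote_lookup(term):
    """Hypothetical stand-in for the RedisClient call; sleeps to mimic a round trip."""
    time.sleep(0.05)
    return f"definition of {term} (remote)"

def lookup(term):
    """UDF: fast path for common terms, slow path for everything else."""
    if term in COMMON_DEFS:
        return COMMON_DEFS[term]
    return slow_remote_lookup(term)

def time_per_record(terms):
    """Return {term: latency in milliseconds} for each input record."""
    latencies = {}
    for term in terms:
        start = time.perf_counter()
        lookup(term)
        latencies[term] = (time.perf_counter() - start) * 1000.0
    return latencies

latencies = time_per_record(["Hello World", "Big Data", "Debugging", "PerfDebug"])
slowest = max(latencies, key=latencies.get)
print(slowest)
```

All four records are the same size, so data skew metrics would not flag anything; only per-record timing shows that the term missing from the dictionary dominates the run.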

  10. Computation Skew: Why is it challenging?
      • Requires insight into how application code interacts with data.
      • Occurs across multiple stages.
      • Affected applications are inherently expensive to run.
      • Isolating individual records that impact performance is difficult with existing tools.

  11. Performance Debugging of Computation Skew. Input: Spark program, input data. Output: individual records responsible for computation skew. PerfDebug stages: Computation Skew Detection; Data Provenance + Record-Level Latency; Expensive Record Identification.

  12. PerfDebug Approach. Input: Spark program, input data. Output: individual records responsible for computation skew. PerfDebug stages: Computation Skew Detection; Data Provenance + Record-Level Latency; Expensive Record Identification.

  13. Computation Skew Detection.
      • PerfDebug monitors task-level metrics such as latency, garbage collection, and serialization using the SparkListener API.
      • If potential computation skew is found, rerun the user program in debugging mode to collect additional information.
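The slide does not say how PerfDebug decides that skew is present. As a sketch under that caveat, one simple rule is to flag any task whose latency is far above the median; the `factor` threshold, the task IDs, and the latency values below are all assumptions for illustration, and the dictionary stands in for metrics a listener would have recorded.

```python
from statistics import median

def find_skewed_tasks(task_latencies_ms, factor=2.0):
    """Return IDs of tasks whose latency exceeds factor * median latency.

    The threshold rule is an assumption for illustration; the slides do not
    say how PerfDebug decides that skew is present.
    """
    typical = median(task_latencies_ms.values())
    return sorted(
        task_id
        for task_id, latency in task_latencies_ms.items()
        if latency > factor * typical
    )

# Hypothetical per-task latencies (ms), as a listener might have recorded them.
task_latencies_ms = {"task-0": 900, "task-1": 1100, "task-2": 1000, "task-3": 9500}

skewed = find_skewed_tasks(task_latencies_ms)
print(skewed)  # ['task-3']
```

A non-empty result would be the trigger for the second bullet: rerunning the program in debugging mode to collect finer-grained information.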

  14. PerfDebug Approach. Input: Spark program, input data. Output: individual records responsible for computation skew. PerfDebug stages: Computation Skew Detection; Data Provenance + Record-Level Latency; Expensive Record Identification.

  15. Capture Data Provenance. Titian [VLDB 2016] provides data provenance using provenance tables at the start/end of stages to track input-output record mappings.
      Stage 1: lines → map → reduceByKey (map-side)
      Stage 2: reduceByKey (reduce-side) → map

  16-20. Capture Data Provenance. Titian [VLDB 2016] provides data provenance using provenance tables at the start/end of stages to track input-output record mappings.
      Stage 1: lines → map → reduceByKey (map-side)
        Start-of-stage table (Input ID → Output ID): offset1 → id1; offset2 → id2; offset3 → id3; …
        End-of-stage table (Input ID → Output ID): {id1, id3} → (0, 100); {id2} → (0, 200); …
      Stage 2: reduceByKey (reduce-side) → map
        Start-of-stage table (Input ID → Output ID): (0, 100) → 100; (1, 100) → 100; (0, 200) → 200; …
        End-of-stage table (Input ID → Output ID): 100 → output1; 200 → output2; …
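To make the input-output mappings concrete, here is a minimal sketch of tracing an output back to its input offsets by walking the four tables in reverse. The rows are the ones visible on the slides; rows hidden behind "…" are not filled in, so `(1, 100)` has no Stage 1 match here. Joining one table's input IDs to the previous table's output IDs is an assumption about how the tables connect, and the function names are hypothetical rather than part of Titian's API.

```python
# Provenance tables as (input_id, output_id) rows, copied from the slides.
# Rows hidden behind "…" on the slides are not filled in.
STAGE1_START = [("offset1", "id1"), ("offset2", "id2"), ("offset3", "id3")]
STAGE1_END = [(("id1", "id3"), (0, 100)), (("id2",), (0, 200))]
STAGE2_START = [((0, 100), 100), ((1, 100), 100), ((0, 200), 200)]
STAGE2_END = [(100, "output1"), (200, "output2")]

def inputs_of(table, output_ids):
    """Return the input IDs of every row whose output ID is in output_ids."""
    return [inp for inp, out in table if out in output_ids]

def trace_back(output_id):
    """Walk the four tables backwards from a final output to input offsets."""
    keys = inputs_of(STAGE2_END, {output_id})          # e.g. [100]
    shuffle_ids = inputs_of(STAGE2_START, set(keys))   # e.g. [(0, 100), (1, 100)]
    groups = inputs_of(STAGE1_END, set(shuffle_ids))   # e.g. [("id1", "id3")]
    record_ids = {rid for group in groups for rid in group}
    return sorted(inputs_of(STAGE1_START, record_ids))

print(trace_back("output1"))  # ['offset1', 'offset3']
print(trace_back("output2"))  # ['offset2']
```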

  21. Measure UDF Latency. PerfDebug extends Titian by capturing summed UDF execution times. [Diagram: the Stage 1 and Stage 2 provenance tables, without latency annotations.]

  22. Measure UDF Latency. [Diagram: two Stage 1 records annotated with UDF latencies of 7 ms and 3 ms.]

  23. Measure UDF Latency. The Stage 1 end-of-stage table gains a UDF Latency column: {id1, id3} → (0, 100), 7 + 3 = 10 ms.

  24. Measure UDF Latency. Provenance tables with UDF Latency columns:
      Stage 1: lines → map → reduceByKey (map-side)
        Start-of-stage table (Input ID → Output ID): offset1 → id1; offset2 → id2; offset3 → id3; …
        End-of-stage table (Input ID → Output ID, UDF Latency): {id1, id3} → (0, 100), 10 ms; {id2} → (0, 200), 20 ms; …
      Stage 2: reduceByKey (reduce-side) → map
        Start-of-stage table (Input ID → Output ID): (0, 100) → 100; (1, 100) → 100; (0, 200) → 200; …
        End-of-stage table (Input ID → Output ID, UDF Latency): 100 → output1, 30 ms; 200 → output2, 40 ms; …
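The 7 + 3 = 10 ms step can be sketched as a provenance row extended with a latency column. Assigning 7 ms to id1 and 3 ms to id3 is an assumption about the diagram layout (the sum is the same either way), and the dictionary field names are hypothetical.

```python
def summed_latency(group, per_record_ms):
    """Sum the UDF latency (ms) of every record in a provenance group."""
    return sum(per_record_ms[record_id] for record_id in group)

# Per-record UDF latencies shown on the slide for two Stage 1 records.
per_record_ms = {"id1": 7, "id3": 3}

# Provenance row {id1, id3} -> (0, 100), extended with a latency column.
row = {"input_ids": ("id1", "id3"), "output_id": (0, 100)}
row["udf_latency_ms"] = summed_latency(row["input_ids"], per_record_ms)

print(row["udf_latency_ms"])  # 10
```

Storing the sum next to the mapping means the latency travels with the provenance row, so the same backward trace that finds an output's inputs can also report how much UDF time they cost.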
