Cuttlefish: Lightweight Primitives for Online Tuning

Tomer Kaftan (UW), Magdalena Balazinska (UW), Alvin Cheung (UW), Johannes Gehrke (Microsoft)

Data processing workloads today are complicated.


  1. Tuning Convolution with Cuttlefish

         def loopConvolve(image, filters): …
         def fftConvolve(image, filters): …
         def mmConvolve(image, filters): …

         tuner = Tuner([loopConvolve, fftConvolve, mmConvolve])

         for image, filters in convolutions:
             convolve, token = tuner.choose()
             start = now()
             result = convolve(image, filters)
             elapsedTime = now() - start
             reward = computeReward(elapsedTime)
             tuner.observe(token, reward)
             output result

  8. Cuttlefish
      I. Problem & Motivation
      II. The Cuttlefish API
      III. Bandit-based Online Tuning
      IV. Distributed Tuning Approach
      V. Contextual Tuning
      VI. Handling Nonstationary Settings
      VII. Other Operators
      VIII. Conclusion

  14. Approach: Tuning
      Multi-armed Bandit Problem
      • K possible choices (called arms)
      • Arms have unknown reward distributions
      • At each round: select an arm and observe a reward
      Goal: maximize cumulative reward (by balancing exploration & exploitation)
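
The bandit setup on this slide can be sketched as a small simulation. This is an illustration only, not code from the talk: `pull`, `run_policy`, `make_oracle`, `uniform_policy`, and the Gaussian arm means are all assumptions made for the example. It brackets what a tuner can achieve by comparing a policy that only explores against an oracle that only exploits the best arm, which is unknowable in practice.

```python
import random

def pull(arm_means, arm, rng, noise=0.1):
    """Draw one noisy reward from the chosen arm (Gaussian noise is an assumption)."""
    return rng.gauss(arm_means[arm], noise)

def run_policy(policy, arm_means, rounds, seed=0):
    """Play `rounds` rounds and return the cumulative reward of `policy`.

    `policy(rng, num_arms)` returns the index of the arm to play.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(rounds):
        arm = policy(rng, len(arm_means))
        total += pull(arm_means, arm, rng)
    return total

def make_oracle(arm_means):
    """Policy that always plays the arm with the highest true mean."""
    best = max(range(len(arm_means)), key=lambda a: arm_means[a])
    return lambda rng, num_arms: best

def uniform_policy(rng, num_arms):
    """Pure exploration: pick any arm with equal probability."""
    return rng.randrange(num_arms)

# Hypothetical mean rewards for K = 3 arms (for example, negative runtimes).
ARM_MEANS = [-2.0, -0.5, -1.0]
oracle_total = run_policy(make_oracle(ARM_MEANS), ARM_MEANS, rounds=1000)
random_total = run_policy(uniform_policy, ARM_MEANS, rounds=1000)
regret = oracle_total - random_total  # reward lost by never exploiting
```

A useful tuner should end up much closer to `oracle_total` than to `random_total` without being told which arm is best.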

  16. Thompson Sampling: belief distributions about the expected reward
      [Figure: belief distributions over reward for Arm 1 to Arm 4]

  21. Thompson Sampling: better arms are chosen more often
      [Figure: reward for Arm 1 to Arm 4]

  26. Thompson Sampling
      • Gaussian runtimes with initially unknown means and variances
      • Belief distributions form t-distributions
        • Depend only on sample mean, variance, count
      • No meta-parameters, yet works well for diverse operators
      • Constant memory overhead, 0.03 ms per tuning round
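
A minimal sketch of how such a tuner might look, assuming the standard noninformative prior, under which the belief about an arm's mean reward after n observations is a Student-t distribution with n - 1 degrees of freedom, centered on the sample mean with scale s/sqrt(n). Each round it draws one sample per arm from these beliefs and plays the arm with the highest draw. The class name `GaussianThompsonTuner`, the rule of playing each arm twice first, and the use of the arm index as the token are assumptions for illustration, not the Cuttlefish implementation.

```python
import math
import random

class GaussianThompsonTuner:
    """Hypothetical tuner: Thompson sampling with Student-t beliefs per arm."""

    def __init__(self, arms, seed=None):
        self.arms = list(arms)
        self.rng = random.Random(seed)
        # Per-arm sufficient statistics: count, mean, sum of squared deviations.
        self.count = [0] * len(self.arms)
        self.mean = [0.0] * len(self.arms)
        self.m2 = [0.0] * len(self.arms)

    def _student_t(self, df):
        # t = Z / sqrt(chi2 / df), with chi2(df) drawn as Gamma(df / 2, scale 2).
        chi2 = self.rng.gammavariate(df / 2.0, 2.0)
        return self.rng.gauss(0.0, 1.0) / math.sqrt(chi2 / df)

    def _sample_belief(self, i):
        n = self.count[i]
        std_err = math.sqrt(self.m2[i] / (n - 1) / n)
        return self.mean[i] + std_err * self._student_t(n - 1)

    def choose(self):
        # Play every arm twice first so the sample variance is defined.
        for i, n in enumerate(self.count):
            if n < 2:
                return self.arms[i], i
        samples = [self._sample_belief(i) for i in range(len(self.arms))]
        best = max(range(len(self.arms)), key=samples.__getitem__)
        return self.arms[best], best  # the token is just the arm index here

    def observe(self, token, reward):
        # Welford's online update of count, mean, and squared deviations.
        self.count[token] += 1
        delta = reward - self.mean[token]
        self.mean[token] += delta / self.count[token]
        self.m2[token] += delta * (reward - self.mean[token])

# Usage with simulated runtimes in seconds (made-up numbers); reward = -runtime.
MEAN_RUNTIME = {"loop": 1.0, "fft": 0.5, "mm": 2.0}
sim = random.Random(1)
tuner = GaussianThompsonTuner(list(MEAN_RUNTIME), seed=0)
picks = {name: 0 for name in MEAN_RUNTIME}
for _ in range(300):
    name, token = tuner.choose()
    runtime = sim.gauss(MEAN_RUNTIME[name], 0.1)
    tuner.observe(token, -runtime)
    picks[name] += 1
```

The state is three numbers per arm regardless of how many rounds have run, which matches the constant-memory claim on the slide.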

  32. Convolution Evaluation
      • Prototype in Apache Spark
      • Tune between three convolution algorithms (Nested Loops, FFT, or Matrix Multiply)
      • Reward: -1*elapsedTime (maximizes throughput)
      • Convolve 8000 Flickr images with sets of filters (~32 GB)
        • Vary number & size of filters
      • Compute intensive
        • (Some configs up to 45 min on a single node)
      • Run on an 8-node (AWS EC2 4-core r3.xlarge) cluster
        • 32 total cores, ~252 images per core

  33. Convolution Results
      [Figure: relative throughput, normalized against the highest-throughput algorithm]

  40. Challenges in Distributed Tuning
      1. Choosing and observing occur throughout a cluster
         • To maximize learning, workers need to communicate
      2. Synchronization & communication overheads
      3. Feedback delay
         • How many times is 'choose' called before an earlier reward is observed?
         • Fortunately, delayed feedback is theoretically sound

  45. Distributed Tuning Approach
      [Diagram: two designs side by side]
      • Centralized Tuner: Machines 1, 2, and 3 send Choose/Observe calls to one tuner
      • Independent Tuners, Centralized Store: Machines 1, 2, and 3 Push Local / Pull Global state against a Global Model Store
      • Peer-to-Peer is also a possibility, but requires more communication

  50. Distributed Tuning Approach
      [Diagram: Worker 1 and Worker 2 each run Thread 1, Thread 2, and Thread 3 and hold Local State plus Non-local State; a Model Store, on the Master or a Parameter Server, holds the Local State of each worker]
      • When choosing: aggregate local & non-local state
      • When observing: update the local state
      • Model store aggregates non-local state
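
The three rules on this slide can be sketched with mergeable per-arm statistics. The sketch assumes each arm's state is the triple (count, mean, sum of squared deviations), which can be combined exactly with the standard pairwise merge formula for means and variances. The names `ArmStats`, `ModelStore`, `Worker`, `sync`, and `view` are hypothetical; the actual push/pull schedule and wire format are not given in the slides.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ArmStats:
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def add(self, reward):
        """Return new stats with one more observation (Welford's update)."""
        count = self.count + 1
        delta = reward - self.mean
        mean = self.mean + delta / count
        return ArmStats(count, mean, self.m2 + delta * (reward - mean))

    def merge(self, other):
        """Combine two disjoint sets of observations exactly."""
        if other.count == 0:
            return self
        if self.count == 0:
            return other
        count = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / count
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / count
        return ArmStats(count, mean, m2)

class ModelStore:
    """Keeps the last local state pushed by each worker."""

    def __init__(self):
        self.pushed = {}

    def push(self, worker_id, local):
        self.pushed[worker_id] = list(local)

    def pull(self, worker_id, num_arms):
        # Non-local state for a worker = merge of every other worker's state.
        merged = [ArmStats() for _ in range(num_arms)]
        for other_id, states in self.pushed.items():
            if other_id != worker_id:
                merged = [m.merge(s) for m, s in zip(merged, states)]
        return merged

class Worker:
    def __init__(self, worker_id, num_arms, store):
        self.worker_id, self.store = worker_id, store
        self.local = [ArmStats() for _ in range(num_arms)]
        self.non_local = [ArmStats() for _ in range(num_arms)]

    def observe(self, arm, reward):
        self.local[arm] = self.local[arm].add(reward)  # touches local state only

    def sync(self):
        self.store.push(self.worker_id, self.local)
        self.non_local = self.store.pull(self.worker_id, len(self.local))

    def view(self, arm):
        # What a choose() call would see: local and non-local state combined.
        return self.local[arm].merge(self.non_local[arm])

store = ModelStore()
w1, w2 = Worker("w1", 1, store), Worker("w2", 1, store)
for r in (1.0, 2.0):
    w1.observe(0, r)
for r in (3.0, 4.0, 5.0):
    w2.observe(0, r)
w1.sync(); w2.sync(); w1.sync()
combined = w1.view(0)
```

Because observations only touch local state, `observe` needs no network round trip; staleness is limited to how often `sync` runs.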

  51. Results with Distributed Approach
      [Figure: relative throughput, normalized against the highest-throughput algorithm]

  52. Results with Distributed Approach
      [Figure: throughput, normalized against an ideal oracle that always picks the fastest option at each round]

  58. Contextual Tuning
      • Best physical operator for each round may depend on current (easy to compute) context
        • e.g. convolution performance depends on the image & filter dimensions
      • Users may know important context features
        • e.g. from the asymptotic algorithmic complexity
      • Users can specify context in Tuner.choose

  62. Contextual Tuning Algorithm
      • Linear contextual Thompson sampling learns a linear model that maps features to rewards
      • Feature Normalization & Regularization
        • Increased robustness towards feature choices
      • Effectively learns a cost model
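
A minimal sketch of linear contextual Thompson sampling, under stated assumptions: one ridge-regularized linear model per arm, weights sampled from a Gaussian with mean B^-1 f and covariance scale^2 * B^-1 (B is the regularized feature Gram matrix, f the reward-weighted feature sum), and features already scaled to [0, 1], so the feature normalization mentioned above is omitted. `LinearArm`, `ContextualTuner`, the `scale` and `ridge` parameters, and the simulated runtimes are hypothetical and are not the Cuttlefish implementation.

```python
import math
import random

def cholesky(a):
    """Return lower-triangular L with L @ L^T == a (a symmetric positive definite)."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low

def solve_lower(low, b):
    """Forward substitution: solve low @ y = b."""
    y = []
    for i, bi in enumerate(b):
        y.append((bi - sum(low[i][k] * y[k] for k in range(i))) / low[i][i])
    return y

def solve_lower_t(low, y):
    """Back substitution: solve low^T @ x = y."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(low[k][i] * x[k] for k in range(i + 1, n))
        x[i] = (y[i] - s) / low[i][i]
    return x

class LinearArm:
    """Ridge-regularized linear reward model for one arm."""

    def __init__(self, dim, ridge=1.0):
        self.b = [[ridge if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.f = [0.0] * dim

    def update(self, x, reward):
        for i, xi in enumerate(x):
            self.f[i] += reward * xi
            for j, xj in enumerate(x):
                self.b[i][j] += xi * xj

    def mean_weights(self):
        low = cholesky(self.b)
        return solve_lower_t(low, solve_lower(low, self.f))

    def sample_reward(self, x, rng, scale=1.0):
        # Draw weights from N(mean, scale^2 * B^-1) and predict the reward for x.
        low = cholesky(self.b)
        mean = solve_lower_t(low, solve_lower(low, self.f))
        noise = solve_lower_t(low, [rng.gauss(0.0, 1.0) for _ in x])
        return sum((m + scale * z) * xi for m, z, xi in zip(mean, noise, x))

class ContextualTuner:
    def __init__(self, arms, dim, seed=None):
        self.arms = list(arms)
        self.models = [LinearArm(dim) for _ in self.arms]
        self.rng = random.Random(seed)

    def choose(self, context):
        x = list(context)
        draws = [m.sample_reward(x, self.rng) for m in self.models]
        best = max(range(len(draws)), key=draws.__getitem__)
        return self.arms[best], (best, x)  # token remembers arm and context

    def observe(self, token, reward):
        arm, x = token
        self.models[arm].update(x, reward)

def simulated_runtime(name, size, rng):
    # Made-up costs: "loop" grows with input size, "fft" has a flat cost.
    base = 1.0 + 5.0 * size if name == "loop" else 3.0
    return base + rng.gauss(0.0, 0.1)

sim = random.Random(1)
tuner = ContextualTuner(["loop", "fft"], dim=2, seed=0)
late_correct = 0
for step in range(600):
    size = sim.random()
    name, token = tuner.choose([1.0, size])  # bias term plus one feature
    tuner.observe(token, -simulated_runtime(name, size, sim))
    if step >= 400:
        late_correct += name == ("loop" if size < 0.4 else "fft")
```

The learned weights of each arm act as a small cost model: the bias captures fixed overhead and the feature weight captures how runtime grows with input size.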

  64. Tuning Convolution with Cuttlefish (with context)

         def loopConvolve(image, filters): …
         def fftConvolve(image, filters): …
         def mmConvolve(image, filters): …
         def getDimensions(image, filters): …

         tuner = Tuner([loopConvolve, fftConvolve, mmConvolve])

         for image, filters in convolutions:
             context = getDimensions(image, filters)
             convolve, token = tuner.choose(context)
             start = now()
             result = convolve(image, filters)
             elapsedTime = now() - start
             reward = computeReward(elapsedTime)
             tuner.observe(token, reward)
             output result

  65. Contextual Convolution Results
      [Figure: throughput, normalized against an ideal oracle that always picks the fastest algorithm]

  67. Nonstationary Settings
