
Cuttlefish: A Lightweight Primitive for Online Tuning

By Tomer Kaftan (UW), Magdalena Balazinska (UW), Alvin Cheung (UW), Johannes Gehrke (Microsoft)

Logical operators have multiple physical operators. The system should automatically …


  1. Tuning Convolution with Cuttlefish

         def loopConvolve(image, filters): …
         def fftConvolve(image, filters): …
         def mmConvolve(image, filters): …

         tuner = Tuner([loopConvolve, fftConvolve, mmConvolve])
         for image, filters in convolutions:
             convolve, token = tuner.choose()
             start = now()
             result = convolve(image, filters)
             elapsedTime = now() - start
             reward = computeReward(elapsedTime)
             tuner.observe(token, reward)
             output result
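The loop on this slide can be made runnable along the following lines. This is a minimal sketch under stated assumptions: the `Tuner` class is a hypothetical epsilon-greedy stand-in that only mimics the `choose`/`observe` shape shown above (it is not the Cuttlefish implementation or its algorithm), and `loop_convolve` and `sliding_convolve` are toy 1-D routines standing in for the slide's elided convolution functions.

```python
import random
import time


def loop_convolve(signal, kernel):
    """Valid-mode sliding dot product with explicit nested loops (no kernel flip)."""
    out = []
    for i in range(len(signal) - len(kernel) + 1):
        acc = 0.0
        for j in range(len(kernel)):
            acc += signal[i + j] * kernel[j]
        out.append(acc)
    return out


def sliding_convolve(signal, kernel):
    """Same result as loop_convolve, written with slices and sum()."""
    k = len(kernel)
    return [sum(s * w for s, w in zip(signal[i:i + k], kernel))
            for i in range(len(signal) - k + 1)]


class Tuner:
    """Hypothetical stand-in: epsilon-greedy, NOT the algorithm from the talk."""

    def __init__(self, options, epsilon=0.1, seed=0):
        self.options = list(options)
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = [0] * len(self.options)
        self.totals = [0.0] * len(self.options)

    def choose(self):
        untried = [i for i, c in enumerate(self.counts) if c == 0]
        if untried:
            arm = untried[0]  # try every option at least once
        elif self.rng.random() < self.epsilon:
            arm = self.rng.randrange(len(self.options))  # explore
        else:
            arm = max(range(len(self.options)),
                      key=lambda i: self.totals[i] / self.counts[i])  # exploit
        return self.options[arm], arm  # the token records which option was used

    def observe(self, token, reward):
        self.counts[token] += 1
        self.totals[token] += reward


def tune_and_run(workload, tuner):
    """The slide's loop: choose, time the call, report the reward."""
    results = []
    for signal, kernel in workload:
        convolve, token = tuner.choose()
        start = time.perf_counter()
        result = convolve(signal, kernel)
        elapsed = time.perf_counter() - start
        tuner.observe(token, -elapsed)  # faster runs earn higher reward
        results.append(result)
    return results
```

The token lets `observe` attribute a reward to the right choice even when several `choose` calls are outstanding, which is why the sketch returns it alongside the function.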


  8. Cuttlefish outline: I. Problem & Motivation; II. The Cuttlefish API; III. Bandit-based Online Tuning; IV. Distributed Tuning Approach; V. Contextual Tuning; VI. Handling Nonstationary Settings; VII. Other Operators; VIII. Conclusion


  14. Approach: Tuning
      Multi-armed Bandit Problem
      • K possible choices (called arms)
      • Arms have unknown reward distributions
      • At each round: select an arm and observe a reward
      Goal: maximize cumulative reward (by balancing exploration & exploitation)
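The goal on this slide can be written compactly. The notation is assumed here rather than taken from the slides: $a_t$ is the arm chosen in round $t$, $r_t$ the observed reward, $\mu_a$ the unknown mean reward of arm $a$, and $T$ the number of rounds. Maximizing expected cumulative reward is then the same as minimizing the regret $R_T$ against always playing the best arm:

```latex
\max \; \mathbb{E}\!\left[\sum_{t=1}^{T} r_t\right]
\quad\Longleftrightarrow\quad
\min \; R_T = T\,\mu^{*} - \mathbb{E}\!\left[\sum_{t=1}^{T} \mu_{a_t}\right],
\qquad \mu^{*} = \max_{a \in \{1,\dots,K\}} \mu_a
```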


  16. Thompson Sampling
      Belief distributions about expected reward
      [Figure: reward belief distributions for Arm 1 to Arm 4]


  21. Thompson Sampling
      Better arms are chosen more often
      [Figure: reward belief distributions for Arm 1 to Arm 4]


  26. Thompson Sampling
      • Gaussian runtimes with initially unknown means and variances
      • Belief distributions form t-distributions
        • Depend only on sample mean, variance, count
      • No meta-parameters, yet works well for diverse operators
      • Constant memory overhead, 0.03 ms per tuning round
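The bullets above can be sketched in a few lines. This is an illustration under assumptions the slides do not state: a noninformative prior, so that the belief about an arm's mean reward is the sample mean plus a Student-t variate (count - 1 degrees of freedom) scaled by sqrt(sample variance / count), and two forced pulls per arm so the variance is defined. The names `ArmStats` and `ThompsonTuner` are hypothetical.

```python
import math
import random


class ArmStats:
    """Constant-size state per arm: count, running mean, sum of squared deviations."""

    def __init__(self):
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0

    def update(self, reward):
        # Welford's online update of mean and squared deviations.
        self.count += 1
        delta = reward - self.mean
        self.mean += delta / self.count
        self.m2 += delta * (reward - self.mean)

    def sample_mean_belief(self, rng):
        """Draw one plausible mean reward from the t-shaped belief (needs count >= 2)."""
        dof = self.count - 1
        variance = self.m2 / dof
        # Student-t variate built from a normal and a chi-square draw.
        z = rng.gauss(0.0, 1.0)
        chi2 = max(rng.gammavariate(dof / 2.0, 2.0), 1e-300)
        t = z / math.sqrt(chi2 / dof)
        return self.mean + t * math.sqrt(variance / self.count)


class ThompsonTuner:
    MIN_PULLS = 2  # assumption: enough observations to estimate a variance

    def __init__(self, options, seed=None):
        self.options = list(options)
        self.stats = [ArmStats() for _ in self.options]
        self.rng = random.Random(seed)

    def choose(self):
        for arm, stats in enumerate(self.stats):
            if stats.count < self.MIN_PULLS:
                return self.options[arm], arm
        # Sample one belief per arm and play the arm with the best sample.
        draws = [stats.sample_mean_belief(self.rng) for stats in self.stats]
        arm = max(range(len(draws)), key=draws.__getitem__)
        return self.options[arm], arm

    def observe(self, token, reward):
        self.stats[token].update(reward)
```

Arms with few observations have wide beliefs and are still sampled occasionally, while arms with many observations and high mean reward win most draws, which is the exploration and exploitation balance the earlier slide asks for.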


  32. Convolution Evaluation
      • Prototype in Apache Spark
      • Tune between three convolution algorithms (Nested Loops, FFT, or Matrix Multiply)
      • Reward: -1 * elapsedTime (maximizes throughput)
      • Convolve 8000 Flickr images with sets of filters (~32 GB)
        • Vary number & size of filters
      • Run on an 8-node cluster (AWS EC2 r3.xlarge, 4 cores each)
        • 32 total cores, ~252 images per core
      • *Very* compute intensive
        • (Some configs take up to 45 min on a single node)

  33. Convolution Results
      [Figure: relative throughput, normalized against the highest-throughput algorithm]


  36. Cuttlefish outline: I. Problem & Motivation; II. The Cuttlefish API; III. Bandit-based Online Tuning; IV. Distributed Tuning Approach; V. Contextual Tuning; VI. Handling Nonstationary Settings; VII. Other Operators; VIII. Conclusion


  40. Challenges in Distributed Tuning
      1. Choosing and observing occur throughout a cluster
         • To maximize learning, need to communicate
      2. Synchronization & communication overheads
      3. Feedback delay
         • How many times is `choose` called before an earlier reward is observed?
         • Fortunately, delayed feedback is still theoretically sound


  45. Distributed Tuning Approach
      [Diagram: two architectures, each over Machine 1, Machine 2, Machine 3]
      • Centralized Tuner: Choose/Observe
      • Independent Tuners, Centralized Store (Global Model Store): Push Local / Pull Global
      Peer-to-peer is also a possibility, but requires more communication


  50. Distributed Tuning Approach
      [Diagram: Worker 1 and Worker 2 each run Thread 1, Thread 2, Thread 3, … and keep a Local State and a Non-local State. A Model Store (*on Master or a Parameter Server*) keeps a Local State and a Non-local State entry for Worker 1, Worker 2, …]
      • When choosing: aggregate local & non-local state
      • When observing: update the local state
      • Model store aggregates non-local state
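Because the beliefs depend only on count, sample mean, and variance, the three bullets above reduce to merging small summaries. The following is a sketch for a single arm under assumed names (`Stats`, `ModelStore`, `Worker`, `sync` are all hypothetical, and the push/pull schedule is left to the caller); the merge uses the standard pairwise formula for combining means and squared deviations.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stats:
    count: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean

    def add(self, x):
        """Return a new summary that includes one more observation."""
        count = self.count + 1
        delta = x - self.mean
        mean = self.mean + delta / count
        return Stats(count, mean, self.m2 + delta * (x - mean))

    def merge(self, other):
        """Combine two summaries as if their observations were one stream."""
        if self.count == 0:
            return other
        if other.count == 0:
            return self
        count = self.count + other.count
        delta = other.mean - self.mean
        mean = self.mean + delta * other.count / count
        m2 = self.m2 + other.m2 + delta * delta * self.count * other.count / count
        return Stats(count, mean, m2)

    @property
    def variance(self):
        return self.m2 / (self.count - 1) if self.count > 1 else 0.0


class ModelStore:
    """Keeps the latest local state pushed by each worker."""

    def __init__(self):
        self.local_by_worker = {}

    def push(self, worker_id, local):
        self.local_by_worker[worker_id] = local

    def pull(self, worker_id):
        # Non-local state for a worker: everyone else's local state, merged.
        merged = Stats()
        for other_id, stats in self.local_by_worker.items():
            if other_id != worker_id:
                merged = merged.merge(stats)
        return merged


class Worker:
    def __init__(self, worker_id, store):
        self.worker_id = worker_id
        self.store = store
        self.local = Stats()
        self.non_local = Stats()

    def observe(self, reward):
        self.local = self.local.add(reward)  # observing touches local state only

    def sync(self):
        self.store.push(self.worker_id, self.local)
        self.non_local = self.store.pull(self.worker_id)

    def view(self):
        return self.local.merge(self.non_local)  # what choosing would sample from
```

Between syncs a worker chooses from a slightly stale view, which matches the earlier point that delayed feedback is acceptable.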

  51. Results with Distributed Approach
      [Figure: relative throughput, normalized against the highest-throughput algorithm]

  52. Results with Distributed Approach
      [Figure: throughput, normalized against an ideal oracle that always picks the fastest algorithm]


  54. Cuttlefish outline: I. Problem & Motivation; II. The Cuttlefish API; III. Bandit-based Online Tuning; IV. Distributed Tuning Approach; V. Contextual Tuning (by learning cost models); VI. Handling Nonstationary Settings; VII. Other Operators; VIII. Conclusion


  58. Contextual Tuning
      • Best physical operator for each round may depend on current context
        • e.g. convolution performance depends on the image & filter dimensions
      • Users may know important context features
        • e.g. from the asymptotic algorithmic complexity
      • Users can specify context in Tuner.choose


  62. Contextual Tuning Algorithm
      • Linear contextual Thompson sampling learns a linear model that maps features to rewards
      • Feature Normalization & Regularization
        • Increased robustness towards feature choices
      • Effectively learns a cost model
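A textbook form of linear contextual Thompson sampling can be sketched as below. It is not the Cuttlefish algorithm itself: each arm keeps a ridge-regularized least-squares model, and a weight vector is sampled from a Gaussian around the fit. The class names, the `ridge` strength, and the fixed noise `scale` are assumptions for illustration, and feature normalization is omitted (the test contexts are already in [0, 1]).

```python
import math
import random


def cholesky(a):
    """Lower-triangular L with L * L^T == a, for a symmetric positive-definite a."""
    n = len(a)
    low = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            low[i][j] = math.sqrt(s) if i == j else s / low[j][j]
    return low


def solve_lower(low, b):
    """Solve L y = b by forward substitution."""
    y = []
    for i in range(len(b)):
        y.append((b[i] - sum(low[i][k] * y[k] for k in range(i))) / low[i][i])
    return y


def solve_upper_t(low, y):
    """Solve L^T x = y by back substitution."""
    n = len(y)
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (y[i] - sum(low[k][i] * x[k] for k in range(i + 1, n))) / low[i][i]
    return x


class LinearArm:
    def __init__(self, dim, ridge=1.0):
        # b = ridge * I + sum of x x^T ; f = sum of reward * x
        self.b = [[ridge if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.f = [0.0] * dim

    def update(self, x, reward):
        for i, xi in enumerate(x):
            self.f[i] += xi * reward
            for j, xj in enumerate(x):
                self.b[i][j] += xi * xj

    def sample_prediction(self, x, rng, scale=1.0):
        """Predict reward for x using weights drawn from N(b^-1 f, scale^2 * b^-1)."""
        low = cholesky(self.b)
        mean = solve_upper_t(low, solve_lower(low, self.f))
        noise = solve_upper_t(low, [rng.gauss(0.0, 1.0) for _ in x])
        theta = [m + scale * u for m, u in zip(mean, noise)]
        return sum(t * xi for t, xi in zip(theta, x))


class ContextualTuner:
    def __init__(self, options, dim, seed=None):
        self.options = list(options)
        self.arms = [LinearArm(dim) for _ in self.options]
        self.rng = random.Random(seed)

    def choose(self, context):
        scores = [arm.sample_prediction(context, self.rng) for arm in self.arms]
        best = max(range(len(scores)), key=scores.__getitem__)
        return self.options[best], (best, tuple(context))  # token carries the context

    def observe(self, token, reward):
        arm, context = token
        self.arms[arm].update(context, reward)
```

With reward set to negative runtime, each arm's fitted weights act as a per-operator cost model over the context features, which is the sense in which the slide says the tuner learns one.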
