slide-1
SLIDE 1

Cuttlefish: A Lightweight Primitive for Online Tuning

by Tomer Kaftan (UW), Magdalena Balazinska (UW), Alvin Cheung (UW), Johannes Gehrke (Microsoft)

1

slide-2
SLIDE 2

Logical Operators have multiple physical Operators… The system should automatically choose!

slide-4
SLIDE 4

(Some) Prior Work on Query Optimization

3

slide-5
SLIDE 5

(Some) Prior Work on Query Optimization

  • Static Query Optimizers
  • cardinality & selectivity estimation, heuristics, cost models

3

slide-6
SLIDE 6

(Some) Prior Work on Query Optimization

  • Static Query Optimizers
  • cardinality & selectivity estimation, heuristics, cost models
  • Adaptive Query Optimization
  • Query re-optimization (update cardinality & selectivity estimates)
  • Adaptive operators (scans, aggregates, etc.)
  • Eddies & adaptive tuple routing (operator reordering)

3

slide-7
SLIDE 7

These work great, BUT…

4

slide-8
SLIDE 8

These work great, BUT…

  • Designing good query optimizers takes time!

4

slide-9
SLIDE 9

These work great, BUT…

  • Designing good query optimizers takes time!
  • Requires deep knowledge of the operators and significant development effort

4

slide-10
SLIDE 10

These work great, BUT…

  • Designing good query optimizers takes time!
  • Requires deep knowledge of the operators and significant development effort
  • Spark SQL took 2 years to go from heuristics-based optimization to cost-based optimization! [1]

4

[1] http://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html

slide-11
SLIDE 11

These work great, BUT…

  • Designing good query optimizers takes time!
  • Requires deep knowledge of the operators and significant development effort
  • Spark SQL took 2 years to go from heuristics-based optimization to cost-based optimization! [1]
  • Existing adaptive approaches just push the development overhead to physical execution

4

[1] http://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html

slide-12
SLIDE 12

These work great, BUT…

  • Designing good query optimizers takes time!
  • Requires deep knowledge of the operators and significant development effort
  • Spark SQL took 2 years to go from heuristics-based optimization to cost-based optimization! [1]
  • Existing adaptive approaches just push the development overhead to physical execution
  • Modern data processing applications involve diverse, sophisticated operators, not just relational operators!

4

[1] http://databricks.com/blog/2017/08/31/cost-based-optimizer-in-apache-spark-2-2.html

slide-13
SLIDE 13

Motivating Workload

5

slide-14
SLIDE 14

Motivating Workload

“A Cuttlefish pretending to be a rock”

5

*Image Sourced from https://www.flickr.com/photos/silkebaron/32001215104

slide-15
SLIDE 15

Motivating Workload

“A Cuttlefish pretending to be a rock”

5

Generate Training Data from:

etc.

*Image Sourced from https://www.flickr.com/photos/silkebaron/32001215104

slide-16
SLIDE 16

Motivating Workload

6

[Figure: logical plan for the motivating workload. HTML data and images feed Regex, Join, Filter, and Conv operators that generate training labels; a caption-generating model (Conv, RNN, repeated) is then trained to produce the output model.]

*caption-generating model portion of the logical plan inspired by: Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015

slide-17
SLIDE 17

Motivating Workload

6

[Figure: the same logical plan as above.]

*caption-generating model portion of the logical plan inspired by: Xu et al. Show, Attend and Tell: Neural Image Caption Generation with Visual Attention. ICML 2015

Diverse, sophisticated operators, with multiple physical alternatives!

slide-18
SLIDE 18

Example Operator: Convolution

7

slide-19
SLIDE 19

Example Operator: Convolution

7

Tested 3 convolution algorithms on 8000 Flickr images

slide-20
SLIDE 20

Can we optimize without a full-fledged optimizer?

8

slide-21
SLIDE 21

Prior Work: Tuning Black-box Operators

9

slide-22
SLIDE 22

Prior Work: Tuning Black-box Operators

  • Offline Autotuning
  • Searches through arbitrary physical plan spaces
  • Requires representative workloads
  • Offline training time

9

slide-23
SLIDE 23

Prior Work: Tuning Black-box Operators

  • Offline Autotuning
  • Searches through arbitrary physical plan spaces
  • Requires representative workloads
  • Offline training time
  • Micro Adaptivity in Vectorwise
  • Reinforcement learning chooses physical flavors of black-box vectorized operators
  • Limited to vectorized operators, does not explore multi-core settings

9

slide-24
SLIDE 24

Workload developer (or the query optimizer) inserts calls to Cuttlefish’s API to pick physical operators during execution

Cuttlefish: A Lightweight Primitive for Online Tuning

10

[Figure: the motivating workload's logical plan, as shown earlier.]


slide-26
SLIDE 26

Tuner Lifecycle: Choose → Execute → Observe

[Figure: the logical plan with Cuttlefish tuners attached to operators: a Join tuner choosing among Hash, Sort, and Nested Loop; Conv tuners choosing among Nested Loop, Matrix Multiply, and FFT; a Regex tuner choosing among Lib 1 through Lib 4.]

Workload developer (or the query optimizer) inserts calls to Cuttlefish’s API to pick physical operators during execution

11

Cuttlefish: A Lightweight Primitive for Online Tuning

slide-27
SLIDE 27

Tuner Lifecycle: Choose → Execute → Observe

[Figure: as above.]

Workload developer (or the query optimizer) inserts calls to Cuttlefish’s API to pick physical operators during execution

Developer maps tuning rounds to the execution model of each operator:

11

Cuttlefish: A Lightweight Primitive for Online Tuning

slide-28
SLIDE 28

Tuner Lifecycle: Choose → Execute → Observe

[Figure: as above.]

Workload developer (or the query optimizer) inserts calls to Cuttlefish’s API to pick physical operators during execution

Developer maps tuning rounds to the execution model of each operator:

  • Regex: One round per HTML Doc

11

Cuttlefish: A Lightweight Primitive for Online Tuning

slide-29
SLIDE 29

Tuner Lifecycle: Choose → Execute → Observe

[Figure: as above.]

Workload developer (or the query optimizer) inserts calls to Cuttlefish’s API to pick physical operators during execution

Developer maps tuning rounds to the execution model of each operator:

  • Regex: One round per HTML Doc
  • Convolve: One round per image

11

Cuttlefish: A Lightweight Primitive for Online Tuning

slide-30
SLIDE 30

Tuner Lifecycle: Choose → Execute → Observe

[Figure: as above.]

Workload developer (or the query optimizer) inserts calls to Cuttlefish’s API to pick physical operators during execution

Developer maps tuning rounds to the execution model of each operator:

  • Regex: One round per HTML Doc
  • Convolve: One round per image
  • Parallel Distributed Join: One round per partition

11

Cuttlefish: A Lightweight Primitive for Online Tuning

slide-31
SLIDE 31

Cuttlefish

12

I. Problem & Motivation
II. The Cuttlefish API
III. Bandit-based Online Tuning
IV. Distributed Tuning Approach
V. Contextual Tuning
VI. Handling Nonstationary Settings
VII. Other Operators
VIII. Conclusion

slide-32
SLIDE 32

The Cuttlefish Primitive

13

slide-33
SLIDE 33
  • 1. Construct a tuner (from a set of choices)

The Cuttlefish Primitive

13

slide-34
SLIDE 34
  • 1. Construct a tuner (from a set of choices)
  • 2. Tuner.choose (pick one of the choices)

The Cuttlefish Primitive

13

slide-35
SLIDE 35
  • 1. Construct a tuner (from a set of choices)
  • 2. Tuner.choose (pick one of the choices)
  • 3. Tuner.observe (observe a reward for a choice)

The Cuttlefish Primitive

13

slide-36
SLIDE 36
  • 1. Construct a tuner (from a set of choices)
  • 2. Tuner.choose (pick one of the choices)
  • 3. Tuner.observe (observe a reward for a choice)

Cuttlefish tuners maximize the total reward after multiple choose-observe tuning rounds

The Cuttlefish Primitive

13

slide-37
SLIDE 37

Tuning Convolution with Cuttlefish

14

convolve, token = tuner.choose()
tuner.observe(token, reward)

slide-38
SLIDE 38

Tuning Convolution with Cuttlefish

def loopConvolve(image, filters): …
def fftConvolve(image, filters): …
def mmConvolve(image, filters): …

for image, filters in convolutions:
    convolve, token = tuner.choose()
    start = now()
    result = convolve(image, filters)
    elapsedTime = now() - start
    reward = computeReward(elapsedTime)
    tuner.observe(token, reward)
    output result

14

slide-39
SLIDE 39

Tuning Convolution with Cuttlefish

tuner = Tuner([loopConvolve, fftConvolve, mmConvolve])

def loopConvolve(image, filters): …
def fftConvolve(image, filters): …
def mmConvolve(image, filters): …

for image, filters in convolutions:
    convolve, token = tuner.choose()
    start = now()
    result = convolve(image, filters)
    elapsedTime = now() - start
    reward = computeReward(elapsedTime)
    tuner.observe(token, reward)
    output result

14


slide-46
SLIDE 46

Cuttlefish

15

I. Problem & Motivation
II. The Cuttlefish API
III. Bandit-based Online Tuning
IV. Distributed Tuning Approach
V. Contextual Tuning
VI. Handling Nonstationary Settings
VII. Other Operators
VIII. Conclusion

slide-47
SLIDE 47

Approach: Tuning

16

slide-48
SLIDE 48

Multi-armed Bandit Problem

Approach: Tuning

16

slide-49
SLIDE 49
  • K possible choices (called arms)

Multi-armed Bandit Problem

Approach: Tuning

16

slide-50
SLIDE 50
  • K possible choices (called arms)
  • Arms have unknown reward distributions

Multi-armed Bandit Problem

Approach: Tuning

16

slide-51
SLIDE 51
  • K possible choices (called arms)
  • Arms have unknown reward distributions
  • At each round: select an Arm and observe a reward

Multi-armed Bandit Problem

Approach: Tuning

16

slide-52
SLIDE 52
  • K possible choices (called arms)
  • Arms have unknown reward distributions
  • At each round: select an Arm and observe a reward

Multi-armed Bandit Problem

Goal: Maximize Cumulative Reward

(by balancing exploration & exploitation)

Approach: Tuning

16
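
In standard bandit notation (these symbols are the usual formulation, not taken from the slide): with arms k = 1, …, K whose mean rewards $\mu_k$ are unknown, and $a_t$ the arm picked at round $t$, maximizing the expected cumulative reward over $T$ rounds is equivalent to minimizing regret:

$$\max \; \mathbb{E}\Big[\sum_{t=1}^{T} r_{a_t,t}\Big] \quad\Longleftrightarrow\quad \min \; R(T) = T\,\max_k \mu_k - \mathbb{E}\Big[\sum_{t=1}^{T} \mu_{a_t}\Big].$$

Exploration (trying under-sampled arms to sharpen their reward estimates) and exploitation (replaying the arm that currently looks best) both contribute to this objective.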

slide-53
SLIDE 53

Thompson Sampling

17

slide-54
SLIDE 54

Thompson Sampling

17

[Figure: belief distributions about the expected reward of Arms 1 through 4.]

slide-55
SLIDE 55

Thompson Sampling

18

[Figure: belief distributions for Arms 1 through 4.]

slide-56
SLIDE 56

Thompson Sampling

19

[Figure: belief distributions for Arms 1 through 4.]

slide-57
SLIDE 57

Thompson Sampling

19

[Figure: belief distributions for Arms 1 through 4.]

slide-58
SLIDE 58

Thompson Sampling

20

[Figure: belief distributions for Arms 1 through 4.]

slide-59
SLIDE 59

Thompson Sampling

21

[Figure: belief distributions for Arms 1 through 4.]

Better arms are chosen more often.

slide-60
SLIDE 60

Thompson Sampling

22

slide-61
SLIDE 61

Thompson Sampling

  • Gaussian runtimes with initially unknown means and variances

22

slide-62
SLIDE 62

Thompson Sampling

  • Gaussian runtimes with initially unknown means and variances
  • Belief distributions form t-distributions
  • Depend only on sample mean, variance, count

22

slide-63
SLIDE 63

Thompson Sampling

  • Gaussian runtimes with initially unknown means and variances
  • Belief distributions form t-distributions
  • Depend only on sample mean, variance, count
  • No meta-parameters, yet works well for diverse operators

22

slide-64
SLIDE 64

Thompson Sampling

  • Gaussian runtimes with initially unknown means and variances
  • Belief distributions form t-distributions
  • Depend only on sample mean, variance, count
  • No meta-parameters, yet works well for diverse operators
  • Constant memory overhead, 0.03 ms per tuning round

22
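
A minimal sketch (not the authors' implementation) of what this slide describes: Thompson sampling over Gaussian rewards with unknown mean and variance, where each arm's belief depends only on its sample count, mean, and variance, and drawing from the belief amounts to a Student-t sample. The Tuner/choose/observe names mirror the API from the earlier slides.

import math
import random

class ArmState:
    # Per-arm sufficient statistics: count, running mean, and sum of squared deviations.
    def __init__(self):
        self.n, self.mean, self.m2 = 0, 0.0, 0.0

    def update(self, reward):
        self.n += 1
        delta = reward - self.mean
        self.mean += delta / self.n
        self.m2 += delta * (reward - self.mean)

    def sample_belief(self):
        # Force a couple of pulls per arm before trusting the statistics.
        if self.n < 2:
            return float('inf')
        var = self.m2 / (self.n - 1)
        # Student-t draw with n-1 degrees of freedom: Z / sqrt(chi2(n-1) / (n-1)).
        t = random.gauss(0.0, 1.0) / math.sqrt(
            random.gammavariate((self.n - 1) / 2.0, 2.0) / (self.n - 1))
        # Belief about the arm's expected reward under a non-informative prior.
        return self.mean + t * math.sqrt(var / self.n)

class Tuner:
    def __init__(self, choices):
        self.choices = choices
        self.arms = [ArmState() for _ in choices]

    def choose(self):
        # Sample one value from each arm's belief; pick the arm with the largest sample.
        token = max(range(len(self.arms)), key=lambda i: self.arms[i].sample_belief())
        return self.choices[token], token

    def observe(self, token, reward):
        self.arms[token].update(reward)

Each arm keeps only three numbers, so memory is constant and a tuning round costs microseconds, in the same spirit as the 0.03 ms per round figure on the slide.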

slide-65
SLIDE 65

Convolution Evaluation

23

slide-66
SLIDE 66

Convolution Evaluation

  • Prototype in Apache Spark

23

slide-67
SLIDE 67

Convolution Evaluation

  • Prototype in Apache Spark
  • Tune between three convolution algorithms (Nested Loops, FFT, or Matrix Multiply)

  • Reward: -1*elapsedTime (maximizes throughput)

23

slide-68
SLIDE 68

Convolution Evaluation

  • Prototype in Apache Spark
  • Tune between three convolution algorithms (Nested Loops, FFT, or Matrix Multiply)

  • Reward: -1*elapsedTime (maximizes throughput)
  • Convolve 8000 Flickr images with sets of filters (~32gb)
  • Vary number & size of filters

23

slide-69
SLIDE 69

Convolution Evaluation

  • Prototype in Apache Spark
  • Tune between three convolution algorithms (Nested Loops, FFT, or Matrix Multiply)

  • Reward: -1*elapsedTime (maximizes throughput)
  • Convolve 8000 Flickr images with sets of filters (~32gb)
  • Vary number & size of filters
  • Run on an 8-node (AWS EC2 4-core r3.xlarge) cluster.
  • 32 total cores, ~252 images per core

23

slide-70
SLIDE 70

Convolution Evaluation

  • Prototype in Apache Spark
  • Tune between three convolution algorithms (Nested Loops, FFT, or Matrix Multiply)

  • Reward: -1*elapsedTime (maximizes throughput)
  • Convolve 8000 Flickr images with sets of filters (~32gb)
  • Vary number & size of filters
  • Run on an 8-node (AWS EC2 4-core r3.xlarge) cluster.
  • 32 total cores, ~252 images per core
  • *Very* compute intensive (some configs take up to 45 min on a single node)

23

slide-71
SLIDE 71

Convolution Results

24

Relative throughput normalized against the highest-throughput algorithm

slide-72
SLIDE 72

Convolution Results

24

Relative throughput normalized against the highest-throughput algorithm

slide-73
SLIDE 73

Convolution Results

24

Relative throughput normalized against the highest-throughput algorithm

slide-74
SLIDE 74

Cuttlefish

25

I. Problem & Motivation
II. The Cuttlefish API
III. Bandit-based Online Tuning
IV. Distributed Tuning Approach
V. Contextual Tuning
VI. Handling Nonstationary Settings
VII. Other Operators
VIII. Conclusion

slide-75
SLIDE 75

Challenges in Distributed Tuning

26

slide-76
SLIDE 76

Challenges in Distributed Tuning

  • 1. Choosing and observing occur throughout a cluster
  • To maximize learning, need to communicate

26

slide-77
SLIDE 77

Challenges in Distributed Tuning

  • 1. Choosing and observing occur throughout a cluster
  • To maximize learning, need to communicate
  • 2. Synchronization & communication overheads

26

slide-78
SLIDE 78

Challenges in Distributed Tuning

  • 1. Choosing and observing occur throughout a cluster
  • To maximize learning, need to communicate
  • 2. Synchronization & communication overheads
  • 3. Feedback delay
  • How many times is `choose’ called before an earlier reward is observed?

  • Fortunately, theoretically sound to have delays

26

slide-79
SLIDE 79

Distributed Tuning Approach

27

slide-80
SLIDE 80

Distributed Tuning Approach

27

[Figure: a single centralized tuner; Machines 1 through 3 send choose/observe calls to it.]

slide-81
SLIDE 81

Distributed Tuning Approach

27

[Figure: two designs. Left: a centralized tuner that every machine calls for choose/observe. Right: independent per-machine tuners that push local state to, and pull global state from, a global model store.]

Independent Tuners, Centralized Store


slide-83
SLIDE 83

Distributed Tuning Approach

27

[Figure: as above.]

Independent Tuners, Centralized Store. Peer-to-peer is also a possibility, but requires more communication.

slide-84
SLIDE 84

Distributed Tuning Approach

28

[Figure: each worker thread keeps its own local state plus a cached copy of non-local state; a model store, on the master or a parameter server, holds a copy of every worker's local state.]


slide-86
SLIDE 86

Distributed Tuning Approach

28

[Figure: as above.]

  • When choosing: aggregate local & non-local state
slide-87
SLIDE 87

Distributed Tuning Approach

28

[Figure: as above.]

  • When choosing: aggregate local & non-local state
  • When observing: update the local state
slide-88
SLIDE 88

Distributed Tuning Approach

28

[Figure: as above.]

  • When choosing: aggregate local & non-local state
  • When observing: update the local state
  • Model store aggregates non-local state
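
A hypothetical sketch of the "independent tuners, centralized store" pattern described above. Per-arm statistics here are (count, sum, sum of squares), which merge by simple addition; the Thompson-sampling choice itself is omitted and would run over the combined local plus non-local statistics. Class and method names are illustrative, not from the paper.

class ArmStats:
    # Mergeable per-arm statistics: count, sum of rewards, sum of squared rewards.
    def __init__(self, n=0, s=0.0, ss=0.0):
        self.n, self.s, self.ss = n, s, ss

    def add(self, reward):
        self.n += 1
        self.s += reward
        self.ss += reward * reward

    def merged(self, other):
        return ArmStats(self.n + other.n, self.s + other.s, self.ss + other.ss)

class LocalTuner:
    def __init__(self, num_arms):
        self.local = [ArmStats() for _ in range(num_arms)]    # updated by observe()
        self.cached = [ArmStats() for _ in range(num_arms)]   # non-local state pulled from the store

    def observe(self, arm, reward):
        # Observing only touches local state; no coordination needed.
        self.local[arm].add(reward)

    def stats_for_choosing(self, arm):
        # Choosing aggregates local state with the cached non-local state.
        return self.local[arm].merged(self.cached[arm])

    def sync(self, store, worker_id):
        store.push(worker_id, self.local)               # push local state
        self.cached = store.pull_others(worker_id)      # pull everyone else's state

class ModelStore:
    # Lives on the master or a parameter server; keeps each worker's latest local state.
    def __init__(self, num_arms):
        self.num_arms = num_arms
        self.per_worker = {}

    def push(self, worker_id, local_state):
        self.per_worker[worker_id] = [ArmStats(a.n, a.s, a.ss) for a in local_state]

    def pull_others(self, worker_id):
        agg = [ArmStats() for _ in range(self.num_arms)]
        for wid, state in self.per_worker.items():
            if wid != worker_id:
                agg = [a.merged(b) for a, b in zip(agg, state)]
        return agg
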
slide-89
SLIDE 89

Results with Distributed Approach

29

Relative throughput normalized against the highest-throughput algorithm

slide-90
SLIDE 90

Results with Distributed Approach

30

Throughput normalized against an ideal oracle that always picks the fastest algorithm

slide-91
SLIDE 91

Results with Distributed Approach

30

Throughput normalized against an ideal oracle that always picks the fastest algorithm

slide-92
SLIDE 92

Cuttlefish

31

I. Problem & Motivation
II. The Cuttlefish API
III. Bandit-based Online Tuning
IV. Distributed Tuning Approach
V. Contextual Tuning (by learning cost models)
VI. Handling Nonstationary Settings
VII. Other Operators
VIII. Conclusion

slide-93
SLIDE 93

Contextual Tuning

32

slide-94
SLIDE 94

Contextual Tuning

  • Best physical operator for each round may depend on current context
  • e.g. convolution performance depends on the image & filter dimensions

32

slide-95
SLIDE 95

Contextual Tuning

  • Best physical operator for each round may depend on current context
  • e.g. convolution performance depends on the image & filter dimensions

  • Users may know important context features
  • e.g. from the asymptotic algorithmic complexity

32

slide-96
SLIDE 96

Contextual Tuning

  • Best physical operator for each round may depend on current context
  • e.g. convolution performance depends on the image & filter dimensions

  • Users may know important context features
  • e.g. from the asymptotic algorithmic complexity
  • Users can specify context in Tuner.choose

32

slide-97
SLIDE 97

Contextual Tuning Algorithm

33

slide-98
SLIDE 98

Contextual Tuning Algorithm

  • Linear contextual Thompson sampling learns a linear model that maps features to rewards

33

slide-99
SLIDE 99

Contextual Tuning Algorithm

  • Linear contextual Thompson sampling learns a linear model that maps features to rewards

  • Feature Normalization & Regularization
  • Increased robustness towards feature choices

33

slide-100
SLIDE 100

Contextual Tuning Algorithm

  • Linear contextual Thompson sampling learns a linear model that maps features to rewards

  • Feature Normalization & Regularization
  • Increased robustness towards feature choices
  • Effectively learns a cost model

33
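
A minimal sketch of the idea on these slides: linear contextual Thompson sampling maintains, per arm, a Bayesian linear-regression posterior over a weight vector mapping context features to reward, and samples a weight vector per round to score each arm. The ridge prior and the crude normalization below stand in for the regularization and feature normalization the slide mentions; none of this is the authors' code.

import numpy as np

class ContextualArm:
    # Ridge-regularized Bayesian linear regression; posterior over weights is N(mu, v^2 * B^-1).
    def __init__(self, dim, reg=1.0, v=1.0):
        self.B = reg * np.eye(dim)   # precision matrix
        self.f = np.zeros(dim)       # accumulated reward-weighted features
        self.v = v                   # exploration scale

    def update(self, x, reward):
        self.B += np.outer(x, x)
        self.f += reward * x

    def sample_reward(self, x):
        mu = np.linalg.solve(self.B, self.f)
        cov = self.v ** 2 * np.linalg.inv(self.B)
        cov = (cov + cov.T) / 2.0                      # keep the covariance numerically symmetric
        w = np.random.multivariate_normal(mu, cov)     # one sampled reward model for this arm
        return float(w @ x)

class ContextualTuner:
    def __init__(self, choices, num_features):
        dim = num_features + 1                         # + bias term
        self.choices = choices
        self.arms = [ContextualArm(dim) for _ in choices]

    def _featurize(self, context):
        x = np.asarray(context, dtype=float)
        x = x / (np.linalg.norm(x) + 1e-9)             # crude normalization stand-in
        return np.append(x, 1.0)

    def choose(self, context):
        x = self._featurize(context)
        idx = int(np.argmax([arm.sample_reward(x) for arm in self.arms]))
        return self.choices[idx], (idx, x)

    def observe(self, token, reward):
        idx, x = token
        self.arms[idx].update(x, reward)

Because each arm's posterior mean is a linear model from features to reward, the tuner is effectively learning a per-operator cost model, as the slide notes.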

slide-101
SLIDE 101

Tuning Convolution with Cuttlefish

tuner = Tuner([loopConvolve, fftConvolve, mmConvolve])

def loopConvolve(image, filters): …
def fftConvolve(image, filters): …
def mmConvolve(image, filters): …

for image, filters in convolutions:
    convolve, token = tuner.choose()
    start = now()
    result = convolve(image, filters)
    elapsedTime = now() - start
    reward = computeReward(elapsedTime)
    tuner.observe(token, reward)
    output result

34

slide-102
SLIDE 102

Tuning Convolution with Cuttlefish

tuner = Tuner([loopConvolve, fftConvolve, mmConvolve])

def loopConvolve(image, filters): …
def fftConvolve(image, filters): …
def mmConvolve(image, filters): …
def getDimensions(image, filters): …

for image, filters in convolutions:
    context = getDimensions(image, filters)
    convolve, token = tuner.choose(context)
    start = now()
    result = convolve(image, filters)
    elapsedTime = now() - start
    reward = computeReward(elapsedTime)
    tuner.observe(token, reward)
    output result

35

slide-103
SLIDE 103

Contextual Convolution Results

36

Throughput normalized against an ideal oracle that always picks the fastest algorithm

slide-104
SLIDE 104

Cuttlefish

37

I. Problem & Motivation
II. The Cuttlefish API
III. Bandit-based Online Tuning
IV. Distributed Tuning Approach
V. Contextual Tuning
VI. Handling Nonstationary Settings
VII. Other Operators
VIII. Conclusion

slide-105
SLIDE 105

Nonstationary Settings

38

slide-106
SLIDE 106

Nonstationary Settings

  • Runtimes may drift over time, or differ across nodes
  • heterogeneous cluster, changing resource availabilities, data properties varying throughout the workload, etc.
  • E.g. web crawl data and images may be stored sorted by website. This could correlate with performance

38

slide-107
SLIDE 107

Nonstationary Settings

  • Runtimes may drift over time, or differ across nodes
  • heterogeneous cluster, changing resource availabilities, data properties varying throughout the workload, etc.
  • E.g. web crawl data and images may be stored sorted by website. This could correlate with performance

  • We might not be capturing sufficient context!

38

slide-108
SLIDE 108

Nonstationary Settings

  • Runtimes may drift over time, or differ across nodes
  • heterogeneous cluster, changing resource availabilities, data properties varying throughout the workload, etc.
  • E.g. web crawl data and images may be stored sorted by website. This could correlate with performance

  • We might not be capturing sufficient context!
  • Standard multi-armed bandit techniques fail

38

slide-109
SLIDE 109

Nonstationary Settings

39

slide-110
SLIDE 110

Nonstationary Settings

  • Prior work: dynamic bandit approaches
  • Sliding windows, discounting older observations, reset on change detection, etc.

  • Good for dealing with changes over time

39

slide-111
SLIDE 111

Nonstationary Settings

  • Prior work: dynamic bandit approaches
  • Sliding windows, discounting older observations, reset on change detection, etc.
  • Good for dealing with changes over time
  • Prior work: bandit clustering approaches
  • identify & share learning among agents solving similar bandit problems

  • Good for dealing with differences between cores

39

slide-112
SLIDE 112

Nonstationary Settings

  • Prior work: dynamic bandit approaches
  • Sliding windows, discounting older observations, reset on change detection, etc.
  • Good for dealing with changes over time
  • Prior work: bandit clustering approaches
  • identify & share learning among agents solving similar bandit problems
  • Good for dealing with differences between cores
  • Need dynamic bandit clustering where agents’ underlying problems may change over time!

39

slide-113
SLIDE 113

Possible Solution

40

[Figure: a timeline of observations for each agent (core or machine).]

slide-114
SLIDE 114

Possible Solution

41

[Figure: a timeline of observations for each agent (core or machine).]

slide-115
SLIDE 115

Possible Solution

41

[Figure: a timeline of observations for each agent (core or machine).]

Use all epochs that pass a statistical similarity test


slide-117
SLIDE 117

To Lower Overheads

42

[Figure: a timeline of observations for each agent (core or machine).]

Store only one ‘aggregated old state’ per epoch

slide-118
SLIDE 118

To Lower Overheads

42

[Figure: a timeline of observations for each agent (core or machine).]

Store only one ‘aggregated old state’ per epoch.

At epoch end: if similar to old, merge into ‘old state’; otherwise, replace ‘old state’.

slide-119
SLIDE 119

To Lower Overheads

42

[Figure: a timeline of observations for each agent (core or machine).]

Store only one ‘aggregated old state’ per epoch. At epoch end: if similar to old, merge into ‘old state’; otherwise, replace ‘old state’.

Identify (& merge) similar non-local states only at communication rounds, in the centralized model store.
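
A hypothetical sketch of the per-epoch scheme above. Each agent keeps its current epoch's statistics plus one aggregated "old state"; at the end of an epoch a similarity test decides whether to merge or replace. The test used here (a Welch-style t-statistic on the mean reward) is an illustrative stand-in, not the paper's exact test.

import math

class ArmStats:
    def __init__(self):
        self.n, self.s, self.ss = 0, 0.0, 0.0

    def add(self, reward):
        self.n += 1
        self.s += reward
        self.ss += reward * reward

    def mean(self):
        return self.s / self.n if self.n else 0.0

    def var(self):
        if self.n < 2:
            return 0.0
        return max(0.0, (self.ss - self.s * self.s / self.n) / (self.n - 1))

    def merge(self, other):
        self.n += other.n
        self.s += other.s
        self.ss += other.ss

def similar(a, b, threshold=2.0):
    # Illustrative similarity test: Welch t-statistic on the mean reward.
    if a.n < 2 or b.n < 2:
        return True
    se = math.sqrt(a.var() / a.n + b.var() / b.n)
    return se == 0.0 or abs(a.mean() - b.mean()) / se < threshold

class EpochedArm:
    def __init__(self):
        self.current = ArmStats()   # observations from the current epoch
        self.old = ArmStats()       # single aggregated "old state"

    def observe(self, reward):
        self.current.add(reward)

    def end_epoch(self):
        if similar(self.current, self.old):
            self.old.merge(self.current)   # same regime: keep learning from history
        else:
            self.old = self.current        # regime changed: drop the old state
        self.current = ArmStats()

    def stats_for_choosing(self):
        combined = ArmStats()
        combined.merge(self.old)
        combined.merge(self.current)
        return combined

The same merge-or-replace test can be applied between agents' non-local states at communication rounds in the centralized model store, as the last line above describes.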

slide-120
SLIDE 120

Nonstationary Results

43

Throughput normalized against an ideal oracle that always picks the fastest algorithm

slide-121
SLIDE 121

Cuttlefish

44

I. Problem & Motivation
II. The Cuttlefish API
III. Bandit-based Online Tuning
IV. Distributed Tuning Approach
V. Contextual Tuning
VI. Handling Nonstationary Settings
VII. Other Operators
VIII. Conclusion

slide-122
SLIDE 122

Regex Operator

45

slide-123
SLIDE 123

Regex Operator

45

  • Tune between four regular expression searching libraries
  • Built-in Java Regex and 3 third-party libraries
slide-124
SLIDE 124

Regex Operator

45

  • Tune between four regular expression searching libraries
  • Built-in Java Regex and 3 third-party libraries
  • Search through 256k Common Crawl docs (~30gb uncompressed)
  • one tuning round per doc
slide-125
SLIDE 125

Regex Operator

45

  • Tune between four regular expression searching libraries
  • Built-in Java Regex and 3 third-party libraries
  • Search through 256k Common Crawl docs (~30gb uncompressed)
  • one tuning round per doc
  • Test 8 Regexes sourced from regex-sharing website RegExr
  • Match hyperlinks, trigrams, valid emails, color codes, etc.
slide-126
SLIDE 126

Regex Operator

45

  • Tune between four regular expression searching libraries
  • Built-in Java Regex and 3 third-party libraries
  • Search through 256k Common Crawl docs (~30gb uncompressed)
  • one tuning round per doc
  • Test 8 Regexes sourced from regex-sharing website RegExr
  • Match hyperlinks, trigrams, valid emails, color codes, etc.
  • Multiple orders of magnitude variation in performance
  • Email validation regex w/ built-in Java utilities takes 33 μs to process the fastest document, but over 1,000 seconds for the slowest document

slide-127
SLIDE 127

Regex Operator

45

  • Tune between four regular expression searching libraries
  • Built-in Java Regex and 3 third-party libraries
  • Search through 256k Common Crawl docs (~30gb uncompressed)
  • one tuning round per doc
  • Test 8 Regexes sourced from regex-sharing website RegExr
  • Match hyperlinks, trigrams, valid emails, color codes, etc.
  • Multiple orders of magnitude variation in performance
  • Email validation regex w/ built-in Java utilities takes 33 μs to process the fastest document, but over 1,000 seconds for the slowest document

  • 8-node (AWS EC2 4-core r3.xlarge) cluster
slide-128
SLIDE 128

Regex Results

46

Note: Y-axis is Log-scale

slide-129
SLIDE 129

Distributed Parallel Join Operator

47

slide-130
SLIDE 130

Distributed Parallel Join Operator

47

  • Hash-partition relations according to join attributes
slide-131
SLIDE 131

Distributed Parallel Join Operator

47

  • Hash-partition relations according to join attributes
  • On each partition, pick a local hash join or a local sort-merge join
slide-132
SLIDE 132

Distributed Parallel Join Operator

47

  • Hash-partition relations according to join attributes
  • On each partition, pick a local hash join or a local sort-merge join
  • Rewards capture total join time
  • measure from when joins begin until result iterators are fully consumed
slide-133
SLIDE 133

Distributed Parallel Join Operator

47

  • Hash-partition relations according to join attributes
  • On each partition, pick a local hash join or a local sort-merge join
  • Rewards capture total join time
  • measure from when joins begin until result iterators are fully consumed
  • Set as Spark SQL 2.2’s join for all equijoins too large to broadcast
  • No heuristics or cost models in the query optimizer; falls back on explicit configurations (defaults to a global sort-merge join)

slide-134
SLIDE 134

Distributed Parallel Join Operator

47

  • Hash-partition relations according to join attributes
  • On each partition, pick a local hash join or a local sort-merge join
  • Rewards capture total join time
  • measure from when joins begin until result iterators are fully consumed
  • Set as Spark SQL 2.2’s join for all equijoins too large to broadcast
  • No heuristics or cost models in the query optimizer; falls back on explicit configurations (defaults to a global sort-merge join)

  • Test on TPC-DS benchmark (scale factor 200)
slide-135
SLIDE 135

Distributed Parallel Join Operator

47

  • Hash-partition relations according to join attributes
  • On each partition, pick a local hash join or a local sort-merge join
  • Rewards capture total join time
  • measure from when joins begin until result iterators are fully consumed
  • Set as Spark SQL 2.2’s join for all equijoins too large to broadcast
  • No heuristics or cost models in the query optimizer; falls back on explicit configurations (defaults to a global sort-merge join)

  • Test on TPC-DS benchmark (scale factor 200)
  • Configure queries to use 512 shuffle / join partitions
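
A hypothetical sketch of how the per-partition tuning round could look, with the reward measured from when the local join starts until its result iterator is fully consumed. The Tuner is the Cuttlefish primitive from the earlier slides; the two local join implementations are toy stand-ins, not Spark SQL internals.

import time

def localHashJoin(left, right):
    # Toy stand-in: build a hash table on the left input's key, probe with the right.
    table = {}
    for key, value in left:
        table.setdefault(key, []).append(value)
    for key, other in right:
        for value in table.get(key, []):
            yield (key, value, other)

def localSortMergeJoin(left, right):
    # Toy stand-in: sort both inputs by key, then (for brevity) reuse the hash join to match.
    yield from localHashJoin(sorted(left), sorted(right))

def join_partition(tuner, left_rows, right_rows):
    # One tuning round per partition.
    join_impl, token = tuner.choose()
    start = time.perf_counter()
    for row in join_impl(left_rows, right_rows):
        yield row                                  # downstream operators consume the iterator
    elapsed = time.perf_counter() - start          # includes the time to consume the results
    tuner.observe(token, -elapsed)                 # reward = -1 * elapsed time

# e.g. tuner = Tuner([localHashJoin, localSortMergeJoin]); each partition then calls join_partition.
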
slide-136
SLIDE 136

Join Results (Query Throughput)

48

slide-137
SLIDE 137

Join Results (Query Throughput)

48

But, requires exploration & provides no ‘special ordering’ benefits

slide-138
SLIDE 138

Join Results (Query Throughput)

48

Cuttlefish join usually faster (join throughput graphs are even more dramatic)

But requires exploration & provides no ‘special ordering’ benefits

slide-139
SLIDE 139

Cuttlefish

49

I. Problem & Motivation
II. The Cuttlefish API
III. Bandit-based Online Tuning
IV. Distributed Tuning Approach
V. Contextual Tuning
VI. Handling Nonstationary Settings
VII. Other Operators
VIII. Conclusion

slide-140
SLIDE 140

Cuttlefish

50

  • A simple, flexible API for online tuning
  • Thompson-sampling based tuning algorithms
  • Supports contextual tuning (learns cost models)
  • Distributed learning between workers
  • Adapts to nonstationary workloads
  • Prototyped in Apache Spark & successfully tunes convolution, regex, and join operators