Refined bounds for algorithm configuration: The knife-edge of dual class approximability
Nina Balcan, Tuomas Sandholm, Ellen Vitercik
Algorithms typically come with many tunable parameters
- Significant impact on runtime, solution quality, …
- Hand-tuning is time-consuming, tedious, and error-prone
Automated algorithm configuration
Goal: Automate algorithm configuration via machine learning.
Algorithmically find good parameter settings using a set of "typical" inputs from the application at hand.
Automated configuration procedure
- 1. Fix parameterized algorithm (e.g., CPLEX)
- 2. Receive set 𝒯 of "typical" inputs from an unknown distribution 𝒟
- 3. Return the parameter setting with good average performance over 𝒯 (see the sketch below)
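A minimal sketch of this procedure, assuming a finite grid of candidate parameter settings and a hypothetical utility(s, y) callback that runs the parameterized algorithm on one instance and scores the result; neither assumption comes from the talk, which places no restriction on how the search is done:

# Sketch of the automated configuration procedure over a finite candidate grid.
# `utility` is a hypothetical callback: utility(s, y) runs the algorithm with
# parameter setting s on instance y and returns its measured performance.
from typing import Callable, Iterable, Sequence

def configure(candidates: Iterable[float],
              training_set: Sequence[object],
              utility: Callable[[float, object], float]) -> float:
    """Return the candidate with the best average utility over the training set 𝒯."""
    def avg_utility(s: float) -> float:
        return sum(utility(s, y) for y in training_set) / len(training_set)
    return max(candidates, key=avg_utility)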
[Diagram: a training set of problem instances (problem instance 1–4) fed to the configuration procedure, which measures runtime, solution quality, memory usage, etc.]
Automated configuration procedure
[Diagram: the training instances are "seen"; a new problem instance drawn from the same distribution is "unseen"]
Key question (focus of talk): Will those parameters have good expected performance?
Overview of main result
Key question (focus of talk): Will those parameters have good expected performance?
"Yes" when algorithmic performance as a function of the parameters can be approximated by a simple function.
[Plot: algorithmic performance g*(s) as a function of the parameter s, with a simple approximating function h*(s)]
Overview of main result
We observe this structure, e.g., in integer programming algorithm configuration.
[Plot: algorithmic performance g*(s) vs. the parameter s, with a simple approximating function h*(s)]
Overview of main result: a dichotomy
- If the approximation holds under the L∞-norm, i.e., sup_s |g*(s) − h*(s)| is small: we provide strong guarantees.
- If the approximation only holds under the L_q-norm for some q < ∞, i.e., (∫ |g*(s) − h*(s)|^q ds)^{1/q} is small: it is not possible to provide strong guarantees in the worst case.
[Plot: algorithmic performance g*(s) vs. the parameter s, with a simple approximating function h*(s)]
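A quick way to see why the two norms sit on a knife-edge (this worked example is ours, not from the talk): suppose the duals agree everywhere except on a spike of width ε,

\[
g^*(s) - h^*(s) =
\begin{cases}
1 & s \in [0, \varepsilon], \\
0 & \text{otherwise,}
\end{cases}
\qquad\text{so}\qquad
\|g^* - h^*\|_q = \Bigl(\int |g^*(s) - h^*(s)|^q \, ds\Bigr)^{1/q} = \varepsilon^{1/q},
\qquad
\|g^* - h^*\|_\infty = 1.
\]

For any fixed q < ∞ the L_q distance can be made arbitrarily small while the L∞ distance stays 1: the L_q norms ignore spikes of small measure, and that is exactly where worst-case guarantees can break.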
Model
𝒴: the set of all inputs (e.g., integer programs). ℝ^d: the set of all parameter settings (e.g., CPLEX parameters).
Standard assumption: there is an unknown distribution 𝒟 over inputs.
E.g., 𝒟 represents the scheduling problems an airline solves day-to-day.
Model
g_𝒔(y) = utility of the algorithm parameterized by 𝒔 ∈ ℝ^d on input y ("algorithmic performance")
E.g., runtime, solution quality, memory usage, …
Assume g_𝒔(y) ∈ [−1, 1]; this can be generalized to any bounded range [−H, H].
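As a concrete stand-in for g_𝒔(y), here is a toy utility rescaled into [−1, 1]; the formula, the instance fields, and the timeout are purely illustrative assumptions, not anything from the talk:

# Toy utility g_s(y): negative (normalized) runtime of a fake solver, in [-1, 1].
MAX_SECONDS = 60.0  # assumed timeout used for normalization

def fake_runtime(s: float, y: dict) -> float:
    """Stand-in for running a solver with parameter s on instance y."""
    return min(MAX_SECONDS, y["size"] * abs(s - y["best_param"]) + 0.1)

def g(s: float, y: dict) -> float:
    """Utility in [-1, 1]: fast runs score near +1, timeouts near -1."""
    return 1.0 - 2.0 * fake_runtime(s, y) / MAX_SECONDS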
Generalization bounds
Key question: For any parameter setting 𝒔, does good average utility on the training set imply good expected utility?
Formally: given samples y_1, …, y_N ~ 𝒟, how large can
| (1/N) Σ_{i=1}^{N} g_𝒔(y_i) − 𝔼_{y~𝒟}[g_𝒔(y)] |
be for any 𝒔? (The first term is the empirical average utility; the second is the expected utility.)
Typically, one answers this by bounding the intrinsic complexity of the class ℱ = {g_𝒔 : 𝒔 ∈ ℝ^d}.
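To make this quantity concrete, a small Monte-Carlo sketch (the instance sampler and the utility are hypothetical, and the expectation is only approximated with a large held-out sample):

# Estimate |empirical average utility - expected utility| for one parameter s.
import random

def sample_instance(rng: random.Random) -> dict:
    return {"size": rng.uniform(1, 10), "best_param": rng.gauss(0.5, 0.2)}

def g(s: float, y: dict) -> float:
    """Toy utility in [-1, 1], in the same spirit as the earlier sketch."""
    runtime = min(60.0, y["size"] * abs(s - y["best_param"]) + 0.1)
    return 1.0 - 2.0 * runtime / 60.0

def generalization_gap(s: float, n_train: int = 100, n_test: int = 100_000,
                       seed: int = 0) -> float:
    rng = random.Random(seed)
    train = [sample_instance(rng) for _ in range(n_train)]
    test = [sample_instance(rng) for _ in range(n_test)]
    emp = sum(g(s, y) for y in train) / n_train  # average utility over the samples
    exp = sum(g(s, y) for y in test) / n_test    # ~ expected utility under 𝒟
    return abs(emp - exp)

print(generalization_gap(s=0.5))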
Generalization bounds
Challenge: the class ℱ = {g_𝒔 : 𝒴 → ℝ | 𝒔 ∈ ℝ^d} is gnarly.
E.g., in integer programming algorithm configuration:
- Each domain element is an IP
- It is unclear how to plot or visualize the functions g_𝒔
- There are no obvious notions of Lipschitzness or smoothness to rely on
Dual classes
g_𝒔(y) = utility of the algorithm parameterized by 𝒔 ∈ ℝ^d on input y
"Primal" function class: ℱ = {g_𝒔 : 𝒴 → ℝ | 𝒔 ∈ ℝ^d}
g_y*(𝒔) = utility as a function of the parameters: g_y*(𝒔) = g_𝒔(y)
"Dual" function class: ℱ* = {g_y* : ℝ^d → ℝ | y ∈ 𝒴}
- Dual functions have a simple, Euclidean domain
- They often have ample structure we can use to bound the complexity of ℱ
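The primal/dual switch is just a matter of which argument is held fixed; a minimal sketch with the toy utility from above (all names hypothetical):

# Primal view: fix a parameter s, obtain a function of instances y (a member of ℱ).
# Dual view:   fix an instance y,  obtain a function of the parameter s (a member of ℱ*).
from functools import partial

def g(s: float, y: dict) -> float:
    """Toy utility g_s(y), bounded in [-1, 1]."""
    runtime = min(60.0, y["size"] * abs(s - y["best_param"]) + 0.1)
    return 1.0 - 2.0 * runtime / 60.0

def primal(s: float):
    """g_s : 𝒴 -> R."""
    return lambda y: g(s, y)

def dual(y: dict):
    """g_y* : R -> R with g_y*(s) = g_s(y)."""
    return partial(g, y=y)

y0 = {"size": 3.0, "best_param": 0.4}
g_y0_star = dual(y0)              # a one-dimensional function of the parameter
print(g_y0_star(0.4), g_y0_star(0.9))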
Dual function approximability
ℱ = {g_𝒔 : 𝒔 ∈ ℝ^d} and ℋ = {h_𝒔 : 𝒔 ∈ ℝ^d}: sets of functions mapping 𝒴 to ℝ.
The dual class ℋ* (𝛿, q)-approximates ℱ* if for all y ∈ 𝒴,
‖g_y* − h_y*‖_q = (∫_{ℝ^d} |g_y*(𝒔) − h_y*(𝒔)|^q d𝒔)^{1/q} ≤ 𝛿
(for q = ∞, this is sup_𝒔 |g_y*(𝒔) − h_y*(𝒔)| ≤ 𝛿).
[Plot: the dual functions g_y*(s) and h_y*(s) over the parameter s]
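A numerical sketch of checking the (𝛿, q)-approximation for one instance y, discretizing a one-dimensional parameter range; the range, the grid, and the example functions are illustrative assumptions:

# Approximate ||g_y* - h_y*||_q over [lo, hi] by a Riemann sum (max over the grid for q = inf).
from typing import Callable

def lq_distance(g_dual: Callable[[float], float],
                h_dual: Callable[[float], float],
                q: float, lo: float = 0.0, hi: float = 1.0,
                n_grid: int = 10_000) -> float:
    step = (hi - lo) / n_grid
    grid = [lo + (i + 0.5) * step for i in range(n_grid)]
    if q == float("inf"):
        return max(abs(g_dual(s) - h_dual(s)) for s in grid)
    total = sum(abs(g_dual(s) - h_dual(s)) ** q * step for s in grid)
    return total ** (1.0 / q)

# A dual with a narrow spike is close to h in every L_q norm but far in L_inf:
h = lambda s: 0.0
g_spiky = lambda s: 1.0 if 0.50 <= s <= 0.51 else 0.0
print(lq_distance(g_spiky, h, q=1.0))           # ~0.01
print(lq_distance(g_spiky, h, q=float("inf")))  # 1.0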
Main result: Upper bound
Generalization upper bound
ℱ = {g_𝒔 : 𝒔 ∈ ℝ^d} and ℋ = {h_𝒔 : 𝒔 ∈ ℝ^d}: sets of functions mapping 𝒴 to ℝ.
With high probability over the draw of 𝒯 ~ 𝒟^N, for any 𝒔,
| (1/N) Σ_{y∈𝒯} g_𝒔(y) − 𝔼_{y~𝒟}[g_𝒔(y)] | ≤ (2/N) Σ_{y∈𝒯} ‖g_y* − h_y*‖_∞ + 2·R̂_𝒯(ℋ) + Õ(√(1/N)),
where the first term on the left is the average utility over the training set, the second is the expected utility, and R̂_𝒯(ℋ) is the empirical Rademacher complexity of ℋ.
If ℋ is not too complex and ℋ* (𝛿, ∞)-approximates ℱ*, the bound approaches 2𝛿 as N → ∞.
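A sketch of how one might evaluate the right-hand side of this bound on a training set, estimating the empirical Rademacher complexity of ℋ by sampling random signs and maximizing over a finite grid of parameter settings; the grid, the constants, and the concrete form of the Õ term are our assumptions, not details from the talk:

# Assemble an estimate of the bound's right-hand side:
#   (2/N) * sum_y ||g_y* - h_y*||_inf  +  2 * RademacherHat_T(H)  +  sqrt(1/N).
import math
import random
from typing import Sequence

def sup_dual_distance(g, h, y, param_grid: Sequence[float]) -> float:
    """max over the grid of |g_s(y) - h_s(y)|, i.e., ||g_y* - h_y*||_inf (approximately)."""
    return max(abs(g(s, y) - h(s, y)) for s in param_grid)

def empirical_rademacher(h, train: Sequence, param_grid: Sequence[float],
                         n_draws: int = 200, seed: int = 0) -> float:
    """Monte-Carlo estimate of the empirical Rademacher complexity of {h_s} on the training set."""
    rng = random.Random(seed)
    n, total = len(train), 0.0
    for _ in range(n_draws):
        sigma = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += max(sum(sig * h(s, y) for sig, y in zip(sigma, train)) / n
                     for s in param_grid)
    return total / n_draws

def bound_rhs(g, h, train: Sequence, param_grid: Sequence[float]) -> float:
    n = len(train)
    approx = 2.0 / n * sum(sup_dual_distance(g, h, y, param_grid) for y in train)
    return approx + 2.0 * empirical_rademacher(h, train, param_grid) + math.sqrt(1.0 / n)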
Main result: Lower bound
Lower bound
For any 𝛿 and q < ∞, there exist function classes ℱ and ℋ such that:
- The dual class ℋ* (𝛿, q)-approximates ℱ*
- ℋ is very simple: its Rademacher complexity is 0
- ℱ is very complex: its Rademacher complexity is 1 (the maximum possible for [−1, 1]-valued functions)
- So it is not possible to provide generalization bounds in the worst case
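This is not the paper's construction, but a toy version of the same knife-edge intuition: each dual function below differs from the constant −1 only on a set of tiny measure, so it is L_q-close to a trivial function for every q < ∞, yet the induced primal class realizes every ±1 labeling of the instances, so its Rademacher complexity is maximal:

# Reserve a tiny parameter interval for each of the 2^N possible labelings of
# N instances; g_s(y) = +1 iff s falls in an interval whose labeling marks y as +1.
from itertools import product

N = 4                                   # number of training instances
labelings = list(product([-1, 1], repeat=N))
WIDTH = 1e-9                            # width of the interval reserved per labeling

def g(s: float, y: int) -> float:
    idx = int(s // WIDTH)
    if 0 <= idx < len(labelings) and labelings[idx][y] == 1:
        return 1.0
    return -1.0

# Every labeling is realized by some parameter setting, even though each dual
# g_y* deviates from the constant -1 on measure at most 2**(N-1) * WIDTH.
for i, b in enumerate(labelings):
    s = (i + 0.5) * WIDTH
    assert tuple(g(s, y) for y in range(N)) == tuple(float(v) for v in b)
print("all", len(labelings), "labelings realized")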
Experiments
Tune integer programming solver parameters
Also studied by Balcan, Dick, Sandholm, Vitercik [ICML’18]
Distributions over auction IPs
[Leyton-Brown, Pearson, Shoham, EC’00]
Experiments: Integer programming
[Plot: generalization error vs. number of training instances, comparing our bound with the bound of BDSV'18]
Conclusion
- Provided generalization bounds for algorithm configuration
- They apply whenever utility as a function of the parameters is "approximately simple"
- The connection between learnability and approximability is balanced on a knife-edge:
  - If the approximation holds under the L∞-norm, we can provide strong bounds
  - If it only holds under the L_q-norm for q < ∞, it is not possible to provide such bounds in the worst case
- Experiments demonstrate the strength of these bounds