“Optimistic” Rates
Nati Srebro
Based on work with Karthik Sridharan and Ambuj Tewari
Examples based on work with Andy Cotter, Elad Hazan, Tomer Koren, Percy Liang, Shai Shalev-Shwartz, Ohad Shamir, Karthik Sridharan
Outline • What? • When? (How?) • Why?
Estimating the Bias of a Coin
Optimistic VC bound (aka $L^*$-bound, multiplicative bound)
• For a hypothesis class with VC-dim $D$, w.p. $1-\delta$ over $n$ samples:
Optimistic VC bound (aka $L^*$-bound, multiplicative bound)
• For a hypothesis class with VC-dim $D$, w.p. $1-\delta$ over $n$ samples:
• Sample complexity to get $L(h) \le L^* + \epsilon$:
• Extends to bounded real-valued loss, with $D$ = VC-subgraph dimension
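In its standard form (up to constants and logarithmic factors), the multiplicative VC bound reads
\[
L(h) \;\le\; \hat{L}(h) \;+\; O\!\left(\sqrt{\hat{L}(h)\,\frac{D\log\frac{n}{D}+\log\frac{1}{\delta}}{n}} \;+\; \frac{D\log\frac{n}{D}+\log\frac{1}{\delta}}{n}\right),
\]
which inverts to a sample complexity of
\[
n \;=\; \tilde{O}\!\left(\frac{D}{\epsilon}\left(\frac{L^*}{\epsilon}+1\right)\right)
\]
to guarantee $L(h) \le L^* + \epsilon$: roughly $\tilde{O}(D/\epsilon)$ in the realizable case ($L^* = 0$), degrading gracefully to $\tilde{O}(D/\epsilon^2)$ in the agnostic case.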
From Parametric to Scale-Sensitive Classes
• Instead of VC-dim or VC-subgraph-dim (≈ #params), rely on a metric scale to control complexity, e.g. a norm constraint $\|w\|^2 \le R$
• Learning depends on:
• Metric complexity measures: fat-shattering dimension, covering numbers, Rademacher complexity
• Scale sensitivity of the loss $\ell$ (bound on derivatives, or a “margin”)
• For $\mathcal{H}$ with Rademacher complexity $\mathcal{R}_n$, and $|\ell'| \le G$:
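The bound referred to by the last bullet is presumably the standard Lipschitz-contraction Rademacher bound (for a loss bounded by $b$): w.p. $1-\delta$, for all $h \in \mathcal{H}$,
\[
L(h) \;\le\; \hat{L}(h) + O\!\left(G\,\mathcal{R}_n(\mathcal{H}) + b\sqrt{\frac{\log(1/\delta)}{n}}\right).
\]
Note that this bound does not improve when $\hat{L}(h)$ is small, i.e. it is not optimistic; that is the gap the next slide addresses.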
Non-Parametric Optimistic Rate for Smooth Loss
• Theorem: for any $\mathcal{H}$ with (worst-case) Rademacher complexity $\mathcal{R}_n(\mathcal{H})$, and any smooth loss with $|\ell''| \le H$, $\ell \le b$, w.p. $1-\delta$ over $n$ samples: [Srebro Sridharan Tewari 2010]
• Sample complexity:
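Up to constants and polylogarithmic factors, the guarantee is
\[
L(h) \;\le\; \hat{L}(h) + \tilde{O}\!\left(\sqrt{\hat{L}(h)}\left(\sqrt{H}\,\mathcal{R}_n(\mathcal{H}) + \sqrt{\tfrac{b\log(1/\delta)}{n}}\right) + H\,\mathcal{R}_n^2(\mathcal{H}) + \frac{b\log(1/\delta)}{n}\right),
\]
so for $\mathcal{R}_n(\mathcal{H}) \le \sqrt{R/n}$ the sample complexity to reach $L(h) \le L^* + \epsilon$ is (ignoring the $b\log(1/\delta)$ confidence terms)
\[
n \;=\; \tilde{O}\!\left(\frac{HR}{\epsilon}\left(\frac{L^*}{\epsilon}+1\right)\right).
\]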
Proof Ideas
• Smooth functions are self-bounding: for any $H$-smooth non-negative $f$: $f'(t)^2 \le 4 H f(t)$
• 2nd-order version of the Lipschitz composition lemma, restricted to predictors with low loss: Rademacher → fat-shattering → $L_\infty$ covering → (compose with loss and use smoothness) → $L_2$ covering → Rademacher
• Local Rademacher analysis
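A one-line derivation of the self-bounding property (a standard argument, included here for completeness): for a non-negative $H$-smooth $f$, applying smoothness at the point $t - f'(t)/H$ gives
\[
0 \;\le\; f\!\left(t - \tfrac{f'(t)}{H}\right) \;\le\; f(t) - \frac{f'(t)^2}{H} + \frac{H}{2}\cdot\frac{f'(t)^2}{H^2} \;=\; f(t) - \frac{f'(t)^2}{2H},
\]
hence $f'(t)^2 \le 2H f(t) \le 4H f(t)$. This is what lets a small loss value control the local Lipschitz constant of the loss.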
Non-Parametric Optimistic Rate for Smooth Loss
• Theorem: for any $\mathcal{H}$ with (worst-case) Rademacher complexity $\mathcal{R}_n(\mathcal{H})$, and any smooth loss with $|\ell''| \le H$, $\ell \le b$, w.p. $1-\delta$ over $n$ samples: [Srebro Sridharan Tewari 2010]
• Sample complexity:
Parametric vs Non-Parametric
• Parametric: $\dim(\mathcal{H}) \le D$, $|h| \le 1$    Scale-sensitive: $\mathcal{R}_n(\mathcal{H}) \le \sqrt{R/n}$
• Lipschitz, $|\ell'| \le G$ (e.g. hinge, $\ell_1$): parametric $\sqrt{\tfrac{GDL^*}{n}} + \tfrac{GD}{n}$; scale-sensitive $\sqrt{\tfrac{G^2R}{n}}$
• Smooth, $|\ell''| \le H$ (e.g. logistic, Huber, smoothed hinge): parametric $\sqrt{\tfrac{HDL^*}{n}} + \tfrac{HD}{n}$; scale-sensitive $\sqrt{\tfrac{HRL^*}{n}} + \tfrac{HR}{n}$
• Smooth & strongly convex, $\lambda \le \ell'' \le H$ (e.g. square loss): parametric $\tfrac{H}{\lambda}\cdot\tfrac{HD}{n}$; scale-sensitive $\sqrt{\tfrac{HRL^*}{n}} + \tfrac{HR}{n}$
• Min-max tight up to poly-log factors
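Reading the scale-sensitive column in sample-complexity terms (a direct inversion of the rates above, under the same assumptions): a merely Lipschitz loss never benefits from small $L^*$, since $\sqrt{G^2R/n} \le \epsilon$ forces $n \approx G^2R/\epsilon^2$ even when $L^* = 0$, whereas the smooth-loss rate gives
\[
n \;\approx\; \frac{HR}{\epsilon}\left(\frac{L^*}{\epsilon}+1\right),
\]
dropping to $HR/\epsilon$ in the realizable case.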
Optimistic SVM-Type Bounds
• $\ell_{01} \le \ell_{\mathrm{hinge}}$
• Optimize
• Generalize
Optimistic SVM-Type Bounds
• $\ell_{01} \le \ell_{\mathrm{smooth}} \le \ell_{\mathrm{hinge}}$
• Optimize
• Generalize
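A sketch of the kind of bound this sandwich yields, under illustrative assumptions (linear predictors with $\|w\|^2 \le R$ and $\|x\| \le 1$, so $\mathcal{R}_n \le \sqrt{R/n}$, and a smoothed hinge with smoothness $H = O(1)$ satisfying $\ell_{01} \le \ell_{\mathrm{smooth}} \le \ell_{\mathrm{hinge}}$): applying the smooth-loss theorem to $\ell_{\mathrm{smooth}}$ and sandwiching on both sides gives, up to confidence terms,
\[
L_{01}(w) \;\le\; \hat{L}_{\mathrm{hinge}}(w) + \tilde{O}\!\left(\sqrt{\hat{L}_{\mathrm{hinge}}(w)\,\frac{R}{n}} + \frac{R}{n}\right),
\]
so one can optimize the convex (non-smooth) empirical hinge loss while generalizing through the smooth surrogate.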
Optimistic Learning Guarantees
• Parametric classes
• Scale-sensitive classes with smooth loss
• SVM-type bounds
• Margin bounds
• Online learning/optimization with smooth loss
• Stability-based guarantees with smooth loss
× Non-parametric (scale-sensitive) classes with non-smooth loss
× Online learning/optimization with non-smooth loss
Why Optimistic Guarantees?
• Optimistic regime is typically the relevant regime:
• Approximation error $L^*$ ≈ estimation error $\epsilon$
• If $\epsilon \ll L^*$, better to spend energy on lowering the approximation error (use a more complex class)
• Important in understanding statistical learning
Training Kernel SVMs
• # kernel evaluations to get excess error $\epsilon$ (with $R = \|w^*\|^2$):
• Using SGD (a kernelized SGD sketch follows below):
• Using the Stochastic Batch Perceptron [Cotter et al 2012]:
• (is this the best possible?)
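For a concrete sense of what “# kernel evaluations” counts, here is a minimal kernelized SGD sketch in Python (a Pegasos-style update on the regularized hinge objective; an illustration only, not the Stochastic Batch Perceptron). Each iteration reads one row of the Gram matrix, so $T$ iterations cost on the order of $T \cdot n$ kernel evaluations.

import numpy as np

def kernel_sgd_hinge(K, y, lam, T, rng=None):
    # Kernelized Pegasos-style SGD on  lam/2*||w||^2 + average hinge loss.
    # K: precomputed n-by-n Gram matrix, y: labels in {-1, +1}.
    # w_t is kept implicitly as w_t = (1/(lam*t)) * sum_j alpha[j]*y[j]*phi(x_j).
    n = len(y)
    rng = np.random.default_rng(rng)
    alpha = np.zeros(n)  # alpha[j] = number of times example j triggered an update
    for t in range(1, T + 1):
        i = rng.integers(n)
        # reading row K[i] is ~n kernel evaluations
        margin = y[i] * (K[i] @ (alpha * y)) / (lam * t)
        if margin < 1:
            alpha[i] += 1
    return alpha  # predict with sign((1/(lam*T)) * sum_j alpha[j]*y[j]*K(x_j, x))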
Training Linear SVMs
• Runtime (# feature evaluations), with $R = \|w^*\|^2$:
• Using SGD:
• Using SIMBA [Hazan et al 2011]:
• (is this the best possible?)
Mini-Batch SGD
• Stochastic optimization of a smooth loss $L(w)$ using $n$ training points, doing $T = n/b$ iterations of SGD with mini-batches of size $b$ (a minimal sketch follows below)
• Pessimistic analysis (ignoring $L^*$): can use mini-batches of size $b \propto \sqrt{n}$, with $T \propto \sqrt{n}$ iterations, and get the same error (up to constant factors) as sequential SGD [Dekel et al 2010][Agarwal Duchi 2011]
• But taking into account $L^*$: in the optimistic regime, can’t use $b > 1$, so no parallelization speedups!
• Use acceleration to get a speedup in the optimistic regime [Cotter et al 2011]
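A minimal sketch of the plain (non-accelerated) mini-batch SGD described above, in Python; the function names and the logistic-gradient example are illustrative assumptions, not code from the talk:

import numpy as np

def minibatch_sgd(grad, w0, X, y, b, eta):
    # One pass over n examples with mini-batches of size b, i.e. T = n // b steps:
    # average the b per-example gradients, then take one step per mini-batch.
    n = len(y)
    w = w0.copy()
    for start in range(0, n - b + 1, b):
        g = np.mean([grad(w, X[i], y[i]) for i in range(start, start + b)], axis=0)
        w -= eta * g
    return w

def logistic_grad(w, x, y):
    # Gradient of the smooth logistic loss log(1 + exp(-y*<w, x>)), with y in {-1, +1}.
    return -y * x / (1.0 + np.exp(y * np.dot(w, x)))

With $b = 1$ this is sequential SGD; increasing $b$ trades iterations for per-iteration parallelism, which is exactly the trade-off the slide analyzes.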
Multiple Complexity Controls [Liang Srebro 2010]
• $L(w) = \mathbb{E}\big[(\langle w, x\rangle - y)^2\big]$, with $y = \langle w^*, x\rangle + \mathcal{N}(0, \sigma^2)$, $\|w\|^2 \le R$, $w \in \mathbb{R}^d$
• [Figure: excess-risk rate across regimes, of order $\sqrt{\mathbb{E}[y^2]R/n}$, $\sqrt{L^*R/n}$, and $L^*d/n$, with transition points involving $L^*/d$, $R/\mathbb{E}[y^2]$, $R/L^*$, and $L^*d^2/R$]
Be Optimistic
• For scale-sensitive non-parametric classes, with smooth loss: [Srebro Sridharan Tewari 2010]
• Difference vs. parametric: not possible with non-smooth loss!
• Optimistic regime is typically the relevant regime:
• Approximation error $L^*$ ≈ estimation error $\epsilon$
• If $\epsilon \ll L^*$, better to spend energy on lowering the approximation error (use a more complex class)
• Important in understanding statistical learning