Online Learning with Kernel Losses
Aldo Pacchiano, UC Berkeley
Joint work with Niladri Chatterji and Peter Bartlett
Talk Overview
• Intro to Online Learning
• Linear Bandits
• Kernel Bandits
Online Learning

For $t = 1, \ldots, n$:
• The learner chooses an action $a_t \in \mathcal{A}$.
• The adversary reveals a loss (or reward) function $\ell_t \in \mathcal{W}$ (can be i.i.d. or adversarial).

The learner's regret is
$$R(n) = \sum_{t=1}^{n} \ell_t(a_t) - \min_{a^* \in \mathcal{A}} \sum_{t=1}^{n} \ell_t(a^*).$$

The learner's objective is to minimize regret.
Full Information vs. Bandit Feedback

Full information: the learner gets to see all of $\ell_t(\cdot)$.
Bandit feedback: the learner only sees the value $\ell_t(a_t)$.
Multi-Armed Bandits

Arms $1, \ldots, K$ with reward distributions $P_1, \ldots, P_K$ and means $\mu_1, \ldots, \mu_K$.

At each round the learner chooses $a_t \in \{1, \ldots, K\}$ and gets reward $X_{a_t} \sim P_{a_t}$.

$$R(n) = \max_{a^* \in \{1, \ldots, K\}} n\mu_{a^*} - \mathbb{E}\left[\sum_{t=1}^{n} X_{a_t}\right]$$

MAB regret: $R(n) = O\big(\sqrt{Kn\log(n)}\big)$ [Auer et al. 2002].
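As a concrete illustration of this protocol (not part of the talk), here is a minimal simulation sketch; the Bernoulli arm means and the ε-greedy learner are purely illustrative assumptions.

```python
# Minimal sketch: simulate the MAB protocol above and measure regret against
# the best arm in hindsight. Arms are Bernoulli(mu_k) for illustration only.
import numpy as np

rng = np.random.default_rng(0)
mu = np.array([0.3, 0.5, 0.7])        # unknown arm means mu_1, mu_2, mu_3
K, n = len(mu), 10_000

counts, sums = np.zeros(K), np.zeros(K)
rewards = np.zeros(n)
for t in range(n):
    # epsilon-greedy learner, just a placeholder policy
    if rng.random() < 0.1 or counts.min() == 0:
        a = rng.integers(K)
    else:
        a = int(np.argmax(sums / counts))
    x = rng.binomial(1, mu[a])        # reward X_{a_t} ~ P_{a_t}
    counts[a] += 1; sums[a] += x; rewards[t] = x

regret = n * mu.max() - rewards.sum() # R(n) = n * mu_{a*} - total reward
print(f"empirical regret after n={n} rounds: {regret:.1f}")
```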
Structured Losses: Packet Routing

Network $(V, E)$.
Arms = paths $a_t \in \mathcal{A} \subset \{0,1\}^{E}$.
Loss = delay $w_t \in \mathcal{W} = [0,1]^{E}$; the delay $\langle a_t, w_t \rangle$ is linear in the arm.

Treating every path as a separate arm gives an exponentially large MAB regret:
$$R(n) = O\left(\sqrt{|\text{num paths}| \cdot n \log(n)}\right).$$
Linear Bandits

The learner chooses an action $a_t \in \mathcal{A} \subset \mathbb{R}^d$.
The adversary's loss is $\ell_t(a) = \langle w_t, a \rangle$ for $w_t \in \mathcal{W} \subset \mathbb{R}^d$ (can be i.i.d. or adversarial).
The learner only experiences $\langle w_t, a_t \rangle$.

Expected regret:
$$R(n) = \mathbb{E}\left[\sum_{t=1}^{n} \langle w_t, a_t \rangle - \inf_{a \in \mathcal{A}} \sum_{t=1}^{n} \langle w_t, a \rangle\right]$$

MAB reduces to linear bandits with $\mathcal{A} = \{e_1, \ldots, e_d\}$ and $\mathcal{W} = [0,1]^d$.
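A tiny sketch (illustration, not from the talk) of the reduction just mentioned: with the standard basis as the action set, the linear loss of $e_i$ is exactly the $i$-th coordinate of $w_t$, i.e. the loss of "arm $i$" in a $K = d$ armed bandit.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4
A = np.eye(d)            # actions e_1, ..., e_d
w_t = rng.random(d)      # adversary's loss vector in [0, 1]^d
losses = A @ w_t         # <w_t, e_i> = w_t[i]: the loss of "arm i"
assert np.allclose(losses, w_t)
```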
Exponential Weights for Adversarial Linear Bandits

For $t = 1, \ldots, n$:
• Sample from the mixture $a_t \sim p_t = (1-\gamma)\, q_t + \gamma\, \nu$ (exploitation + exploration).
• See $\langle w_t, a_t \rangle$.
• Build a loss estimator $\hat{w}_t$.
• Update with exponential weights: $q_t(a) \propto \exp(-\eta \langle \hat{w}_t, a \rangle)\, q_{t-1}(a)$.
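A minimal sketch (not the talk's implementation) of this loop over a finite action set, with uniform exploration as $\nu$ and the importance-weighted estimator $\hat{w}_t$ defined on the "Unbiased estimator of the loss" slide below; the action set, adversary, and step sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 5000
A = rng.standard_normal((8, d))                 # 8 actions in R^d, for illustration
A /= np.linalg.norm(A, axis=1, keepdims=True)
eta, gamma = 0.01, 0.1

q = np.full(len(A), 1.0 / len(A))               # exponential-weights distribution q_t
for t in range(n):
    p = (1 - gamma) * q + gamma / len(A)        # p_t = (1 - gamma) q_t + gamma * nu
    i = rng.choice(len(A), p=p)
    w_t = rng.uniform(-1, 1, size=d)            # adversary's loss vector (i.i.d. here)
    loss = A[i] @ w_t                           # only <w_t, a_t> is observed

    Sigma = (A.T * p) @ A                       # Sigma_t = E_{a ~ p_t}[a a^T]
    w_hat = np.linalg.solve(Sigma, A[i]) * loss # w_hat_t = Sigma_t^{-1} a_t <w_t, a_t>

    q = q * np.exp(-eta * (A @ w_hat))          # q_{t+1}(a) ∝ q_t(a) exp(-eta <w_hat_t, a>)
    q /= q.sum()
```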
Exponential Weights

$$q_t(a) \propto \exp\left(-\eta \sum_{i=1}^{t} \langle \hat{w}_i, a \rangle\right)$$

[Illustration: the cumulative loss estimate $\sum_{i=1}^{t} \hat{w}_i$ over the action set $\mathcal{A}$ induces the distribution $q_t$ on $\mathcal{A}$.]
Unbiased Estimator of the Loss

Let $\Sigma_t = \mathbb{E}_{a \sim p_t}\!\left[a a^{\top}\right]$ and set $\hat{w}_t = \Sigma_t^{-1} a_t \langle w_t, a_t \rangle$.

$\hat{w}_t$ is an unbiased estimator of $w_t$:
$$\mathbb{E}_{a_t \sim p_t}[\hat{w}_t \mid \mathcal{F}_{t-1}] = \left(\mathbb{E}_{a \sim p_t}\!\left[a a^{\top}\right]\right)^{-1} \mathbb{E}_{a_t \sim p_t}\!\left[a_t \langle w_t, a_t \rangle \mid \mathcal{F}_{t-1}\right] = \left(\mathbb{E}_{a \sim p_t}\!\left[a a^{\top}\right]\right)^{-1} \mathbb{E}_{a_t \sim p_t}\!\left[a_t a_t^{\top} \mid \mathcal{F}_{t-1}\right] w_t = w_t.$$
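A quick numerical check of this identity (an illustration, not from the talk); the finite action set and uniform sampling distribution are assumptions made for the sketch.

```python
# Monte Carlo check that w_hat_t = Sigma_t^{-1} a_t <w_t, a_t> is unbiased
# when a_t ~ p_t.
import numpy as np

rng = np.random.default_rng(1)
d = 3
A = rng.standard_normal((6, d))              # finite action set, for illustration
p = np.full(len(A), 1.0 / len(A))            # p_t: uniform, say
w = rng.standard_normal(d)                   # the true (unknown) loss vector w_t

Sigma = (A.T * p) @ A                        # Sigma_t = E_{a ~ p_t}[a a^T]
Sigma_inv_A = np.linalg.solve(Sigma, A.T).T  # row j = Sigma_t^{-1} a_j
idx = rng.choice(len(A), size=200_000, p=p)
w_hats = Sigma_inv_A[idx] * (A[idx] @ w)[:, None]
print(w_hats.mean(axis=0), "vs", w)          # the two should nearly match
```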
Linear Bandits Regret

Theorem (linear bandits regret) [see for example Bubeck '11]:
$$R(n) \lesssim \gamma n + \frac{\log(|\mathcal{A}|)}{\eta} + \eta \sum_{t=1}^{n} \mathbb{E}\,\mathbb{E}_{a \sim p_t}\!\left[\langle \hat{w}_t, a \rangle^2\right]$$

Choice of exploration distribution $\nu$:
• Barycentric spanner [Dani, Hayes, Kakade '08]: $O\big(d\sqrt{n \log |\mathcal{A}|}\big) = O(d^{3/2}\sqrt{n})$.
• Uniform over $\mathcal{A}$ [Cesa-Bianchi, Lugosi '12]: $O\big(\sqrt{dn \log |\mathcal{A}|}\big) = O(d\sqrt{n})$.
• John's distribution [Bubeck, Cesa-Bianchi, Kakade '12]: $O(d\sqrt{n})$.
Linear Bandits Regret: Dimension Dependence

Variance bound:
$$\mathbb{E}\,\mathbb{E}_{a \sim p_t}\!\left[\langle \hat{w}_t, a \rangle^2\right] \leq d$$

This is where the dimension dependence enters:
$$R(n) \lesssim \gamma n + \frac{\log(|\mathcal{A}|)}{\eta} + \underbrace{\eta \sum_{t=1}^{n} \mathbb{E}\,\mathbb{E}_{a \sim p_t}\!\left[\langle \hat{w}_t, a \rangle^2\right]}_{\leq\, \eta d n}$$
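A sketch of the standard argument behind the $d$ factor (an illustration, assuming bounded losses $|\langle w_t, a \rangle| \leq 1$ and an exploration component that keeps $\Sigma_t$ invertible). For a fixed round $t$,
$$\mathbb{E}_{a \sim p_t}\!\left[\langle \hat{w}_t, a \rangle^2\right] = \hat{w}_t^{\top} \Sigma_t \hat{w}_t = \langle w_t, a_t \rangle^2\, a_t^{\top} \Sigma_t^{-1} a_t \leq a_t^{\top} \Sigma_t^{-1} a_t,$$
and taking the expectation over $a_t \sim p_t$,
$$\mathbb{E}_{a_t \sim p_t}\!\left[a_t^{\top} \Sigma_t^{-1} a_t\right] = \operatorname{tr}\!\left(\Sigma_t^{-1}\, \mathbb{E}_{a_t \sim p_t}\!\left[a_t a_t^{\top}\right]\right) = \operatorname{tr}(I_d) = d.$$
Plugging $\eta d n$ into the regret bound and tuning $\eta = \sqrt{\log|\mathcal{A}| / (dn)}$ (with $\gamma$ of lower order) gives $R(n) = O\big(\sqrt{dn \log |\mathcal{A}|}\big)$, matching the uniform-exploration rate on the previous slide.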
Recap
• Intro to Online Learning
• Linear Bandits
• Kernel Bandits
Online Quadratic Losses

Actions: $a_t \in \mathcal{A} = \{a : \|a\|_2 \leq 1\}$.
Losses: $\ell_t(a) = \langle b_t, a \rangle + a^{\top} B_t a$, with $B_t$ symmetric and possibly indefinite, so the loss can be non-convex.

The offline problem $\min_{a \in \mathcal{A}} \ell_t(a)$ has a polynomial-time solution (strong duality).

[Illustration: the saddle-shaped quadratic $z = x^2 - 0.5y^2 + xy - 0.5x + 0.5y + 1$.]
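One way to make the strong-duality claim concrete (a hedged sketch, not code from the talk): $\min_{\|a\|_2 \leq 1} \langle b_t, a \rangle + a^{\top} B_t a$ is the trust-region subproblem, whose SDP relaxation is exact. The sketch below assumes cvxpy is available; recovering the minimizer itself from $Z$ can need an extra rank-one decomposition step in degenerate cases, so only the optimal value is reported.

```python
import cvxpy as cp
import numpy as np

rng = np.random.default_rng(3)
d = 4
B = rng.standard_normal((d, d)); B = (B + B.T) / 2   # symmetric, possibly indefinite
b = rng.standard_normal(d)

Z = cp.Variable((d + 1, d + 1), PSD=True)            # Z = [[1, a^T], [a, M]] with M >= a a^T
a, M = Z[1:, 0], Z[1:, 1:]
objective = cp.Minimize(cp.trace(B @ M) + b @ a)     # tr(B M) + <b, a> relaxes <b,a> + a^T B a
constraints = [Z[0, 0] == 1, cp.trace(M) <= 1]       # tr(M) <= 1 relaxes ||a||_2 <= 1
value = cp.Problem(objective, constraints).solve()
print("offline optimum of the quadratic loss:", value)
```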
Linearization of Quadratic Losses

Quadratic losses are linear in the space of (matrix, vector) pairs:
$$\ell_t(a) = \langle b_t, a \rangle + a^{\top} B_t a = \left\langle \begin{pmatrix} B_t \\ b_t \end{pmatrix}, \begin{pmatrix} a a^{\top} \\ a \end{pmatrix} \right\rangle$$

So we can use the linear-bandits machinery: exponential weights for quadratic bandits.
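A small numerical check of this lifting (illustration only; the random $B_t$, $b_t$, and $a$ are assumptions of the sketch): the quadratic loss in $a$ equals a linear loss in the lifted pair $(aa^{\top}, a)$ under the Frobenius/Euclidean inner product.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
a = rng.standard_normal(d); a /= np.linalg.norm(a)   # ||a||_2 <= 1
B = rng.standard_normal((d, d)); B = (B + B.T) / 2   # symmetric B_t
b = rng.standard_normal(d)

quadratic = b @ a + a @ B @ a                        # <b_t, a> + a^T B_t a
lifted = np.sum(B * np.outer(a, a)) + b @ a          # <(B_t, b_t), (a a^T, a)>
assert np.isclose(quadratic, lifted)
```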
Exponential Weights for Adversarial Quadratic Bandits

For $t = 1, \ldots, n$:
• Sample from the mixture $a_t \sim p_t = (1-\gamma)\, q_t + \gamma\, \nu$ (exploitation + exploration).
• See $\langle b_t, a_t \rangle + a_t^{\top} B_t a_t$.
• Build a loss estimator $(\hat{B}_t, \hat{b}_t)$.
• Update with exponential weights: $q_t(a) \propto \exp\!\big(-\eta\, (\langle \hat{b}_t, a \rangle + a^{\top} \hat{B}_t a)\big)\, q_{t-1}(a)$.

Sampling is polynomial time.
Beyond "Finite-Dimensional" Losses

Evasion games: obstacle avoidance, with loss $\ell_t(a) = \exp(-\|a - w_t\|^2)$.
Gaussian kernel — infinite dimensional.
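The kernel viewpoint is that such a loss is linear in a (possibly infinite-dimensional) feature space. As a purely illustrative sketch, not necessarily the construction used in the talk, random Fourier features [Rahimi and Recht, 2007] give a finite-dimensional approximation of the Gaussian-kernel feature map, making the "linear in feature space" picture concrete.

```python
# exp(-||a - w_t||^2) = <phi(a), phi(w_t)> for an infinite-dimensional feature
# map phi; random Fourier features approximate phi with D random cosines.
import numpy as np

rng = np.random.default_rng(2)
d, D = 3, 20_000                        # input dim, number of random features
a = rng.standard_normal(d); w = rng.standard_normal(d)

# exp(-||x - y||^2) corresponds to bandwidth sigma^2 = 1/2, i.e. omega ~ N(0, 2 I)
omega = rng.normal(0.0, np.sqrt(2.0), size=(D, d))
bias = rng.uniform(0, 2 * np.pi, size=D)
phi = lambda x: np.sqrt(2.0 / D) * np.cos(omega @ x + bias)

exact = np.exp(-np.linalg.norm(a - w) ** 2)
approx = phi(a) @ phi(w)
print(exact, approx)                    # the two values should be close
```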