Neural Contextual Bandits with UCB-based Exploration


  1. Neural Contextual Bandits with UCB-based Exploration
     Dongruo Zhou (1), Lihong Li (2), Quanquan Gu (1)
     (1) Department of Computer Science, UCLA; (2) Google Research

  2. Outline
     ◮ Background
       ◮ Contextual bandit problem
       ◮ Deep neural networks

  3. Outline
     ◮ Background
       ◮ Contextual bandit problem
       ◮ Deep neural networks
     ◮ Algorithm – NeuralUCB
       ◮ Use a neural network to learn the reward
       ◮ Use the neural network's gradient to explore
       ◮ Upper confidence bound strategy

  4. Outline
     ◮ Background
       ◮ Contextual bandit problem
       ◮ Deep neural networks
     ◮ Algorithm – NeuralUCB
       ◮ Use a neural network to learn the reward
       ◮ Use the neural network's gradient to explore
       ◮ Upper confidence bound strategy
     ◮ Main theory
       ◮ Neural tangent kernel matrix and effective dimension
       ◮ $\tilde{O}(\sqrt{T})$ regret

  5. Background – decision-making problems
     Decision-making problems are everywhere!
     ◮ As a gambler in a casino, you must pick a slot machine to play...
       ◮ Limited budget; maximize the payoff!
       ◮ Which arm to pull?
     ◮ As a movie recommender, you need to...
       ◮ Recommend movies based on users' interests; maximize the users' purchase rate
       ◮ Which movie to recommend?
     (Figures: (a) slot machine, (b) movie recommendation)

  6. Background – contextual bandit
     K-armed contextual bandit problem: movie recommendation

  7. Background – contextual bandit
     K-armed contextual bandit problem: movie recommendation. At round t,
     ◮ The agent observes K d-dimensional contextual vectors (the user's movie purchase history) $\{x_{t,a} \in \mathbb{R}^d \mid a \in [K]\}$

  8. Background – contextual bandit
     K-armed contextual bandit problem: movie recommendation. At round t,
     ◮ The agent observes K d-dimensional contextual vectors (the user's movie purchase history) $\{x_{t,a} \in \mathbb{R}^d \mid a \in [K]\}$
     ◮ The agent selects an action $a_t$ and receives a reward $r_{t,a_t}$ (it recommends a movie and the user chooses whether or not to purchase)

  9. Background – contextual bandit
     K-armed contextual bandit problem: movie recommendation. At round t,
     ◮ The agent observes K d-dimensional contextual vectors (the user's movie purchase history) $\{x_{t,a} \in \mathbb{R}^d \mid a \in [K]\}$
     ◮ The agent selects an action $a_t$ and receives a reward $r_{t,a_t}$ (it recommends a movie and the user chooses whether or not to purchase)
     ◮ The goal is to minimize the pseudo-regret
       $R_T = \mathbb{E}\big[\sum_{t=1}^{T} (r_{t,a_t^*} - r_{t,a_t})\big]$,
       where $a_t^* = \mathrm{argmax}_{a \in [K]} \mathbb{E}[r_{t,a}]$ is the optimal action at round t
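
To make the interaction protocol concrete, here is a minimal Python sketch of the K-armed contextual bandit loop with pseudo-regret accounting. The agent interface (select/update), the context distribution, and the Gaussian noise level are illustrative assumptions, not part of the slides.

    import numpy as np

    def run_bandit(agent, h, T=1000, K=4, d=8, noise=0.1, seed=0):
        """Generic K-armed contextual bandit protocol with pseudo-regret tracking.

        agent: hypothetical object exposing select(contexts) -> arm index and update(x, r)
        h:     unknown mean-reward function, so r_{t,a} = h(x_{t,a}) + xi_t
        """
        rng = np.random.default_rng(seed)
        pseudo_regret = 0.0
        for t in range(T):
            contexts = rng.normal(size=(K, d))
            contexts /= np.linalg.norm(contexts, axis=1, keepdims=True)  # unit-norm contexts
            means = np.array([h(x) for x in contexts])                   # E[r_{t,a}] per arm
            a_t = agent.select(contexts)                                 # agent picks an arm
            reward = means[a_t] + rng.normal(0.0, noise)                 # noisy observed reward
            agent.update(contexts[a_t], reward)
            pseudo_regret += means.max() - means[a_t]                    # E[r_{t,a_t^*}] - E[r_{t,a_t}]
        return pseudo_regret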

  10. Background – contextual linear bandit
      $r_{t,a_t} = \langle \theta^*, x_{t,a_t} \rangle + \xi_t$, where $\xi_t$ is $\nu$-sub-Gaussian

  11. Background – contextual linear bandit
      $r_{t,a_t} = \langle \theta^*, x_{t,a_t} \rangle + \xi_t$, where $\xi_t$ is $\nu$-sub-Gaussian
      ◮ Build a confidence set for $\theta^*$ and use the optimism-in-the-face-of-uncertainty (OFU) principle

  12. Background – contextual linear bandit
      $r_{t,a_t} = \langle \theta^*, x_{t,a_t} \rangle + \xi_t$, where $\xi_t$ is $\nu$-sub-Gaussian
      ◮ Build a confidence set for $\theta^*$ and use the optimism-in-the-face-of-uncertainty (OFU) principle
      ◮ Leads to $\tilde{O}(d\sqrt{T})$ regret (Abbasi-Yadkori et al. 2011)
      ◮ Strongly depends on the linear structure!
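
For reference, the OFU construction in the linear case keeps a ridge estimate of $\theta^*$ and an ellipsoidal confidence set around it. The display below is a schematic sketch in the spirit of Abbasi-Yadkori et al. (2011), with the regularization $\lambda$ and the radius $\beta_t$ left unspecified:

    $V_t = \lambda I + \sum_{s=1}^{t} x_{s,a_s} x_{s,a_s}^\top$, \qquad
    $\hat{\theta}_t = V_t^{-1} \sum_{s=1}^{t} r_{s,a_s}\, x_{s,a_s}$, \qquad
    $a_{t+1} = \mathrm{argmax}_{a \in [K]} \; \max_{\theta:\, \|\theta - \hat{\theta}_t\|_{V_t} \le \beta_t} \langle \theta, x_{t+1,a} \rangle$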

  13. Background – general reward function
      $r_{t,a_t} = h(x_{t,a_t}) + \xi_t$, where $0 \le h(x) \le 1$ and $\xi_t$ is $\nu$-sub-Gaussian

  14. Background – general reward function
      $r_{t,a_t} = h(x_{t,a_t}) + \xi_t$, where $0 \le h(x) \le 1$ and $\xi_t$ is $\nu$-sub-Gaussian
      ◮ This covers many popular contextual bandit problems
        ◮ Linear bandit: $h(x) = \langle \theta, x \rangle$, where $\|\theta\|_2 \le 1$, $\|x\|_2 \le 1$
        ◮ Generalized linear bandit: $h(x) = g(\langle \theta, x \rangle)$, where $\|\theta\|_2 \le 1$, $\|x\|_2 \le 1$, $|\nabla g| \le 1$

  15. Background – general reward function
      $r_{t,a_t} = h(x_{t,a_t}) + \xi_t$, where $0 \le h(x) \le 1$ and $\xi_t$ is $\nu$-sub-Gaussian
      ◮ This covers many popular contextual bandit problems
        ◮ Linear bandit: $h(x) = \langle \theta, x \rangle$, where $\|\theta\|_2 \le 1$, $\|x\|_2 \le 1$
        ◮ Generalized linear bandit: $h(x) = g(\langle \theta, x \rangle)$, where $\|\theta\|_2 \le 1$, $\|x\|_2 \le 1$, $|\nabla g| \le 1$
      We do not know what h is...

  16. Background – general reward function
      $r_{t,a_t} = h(x_{t,a_t}) + \xi_t$, where $0 \le h(x) \le 1$ and $\xi_t$ is $\nu$-sub-Gaussian
      ◮ This covers many popular contextual bandit problems
        ◮ Linear bandit: $h(x) = \langle \theta, x \rangle$, where $\|\theta\|_2 \le 1$, $\|x\|_2 \le 1$
        ◮ Generalized linear bandit: $h(x) = g(\langle \theta, x \rangle)$, where $\|\theta\|_2 \le 1$, $\|x\|_2 \le 1$, $|\nabla g| \le 1$
      We do not know what h is...
      Use a universal function approximator, such as a neural network!
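
As a small illustration, the two special cases above can be written as reward functions that plug directly into the bandit-loop sketch shown earlier; the logistic link used for the generalized linear case is just one admissible choice of g.

    import numpy as np

    def make_linear_reward(theta):
        """Linear bandit: h(x) = <theta, x>, with ||theta||_2 <= 1."""
        return lambda x: float(theta @ x)

    def make_glm_reward(theta):
        """Generalized linear bandit: h(x) = g(<theta, x>) with a 1-Lipschitz link g.
        The logistic link below is an illustrative choice (its derivative is at most 1/4)."""
        g = lambda z: 1.0 / (1.0 + np.exp(-z))
        return lambda x: float(g(theta @ x))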

  17. Background – neural network
      Fully connected neural networks:
      $f(x; \theta) = \sqrt{m}\, W_L\, \sigma(W_{L-1}\, \sigma(\cdots \sigma(W_1 x)))$

  18. Background – neural network
      Fully connected neural networks:
      $f(x; \theta) = \sqrt{m}\, W_L\, \sigma(W_{L-1}\, \sigma(\cdots \sigma(W_1 x)))$
      ◮ $\sigma(x) = \max\{x, 0\}$ is the ReLU activation function

  19. Background – neural network
      Fully connected neural networks:
      $f(x; \theta) = \sqrt{m}\, W_L\, \sigma(W_{L-1}\, \sigma(\cdots \sigma(W_1 x)))$
      ◮ $\sigma(x) = \max\{x, 0\}$ is the ReLU activation function
      ◮ $W_i$ is the weight matrix of layer i
        ◮ $W_1 \in \mathbb{R}^{m \times d}$
        ◮ $W_i \in \mathbb{R}^{m \times m}$ for $2 \le i \le L-1$
        ◮ $W_L \in \mathbb{R}^{1 \times m}$

  20. Background – neural network
      Fully connected neural networks:
      $f(x; \theta) = \sqrt{m}\, W_L\, \sigma(W_{L-1}\, \sigma(\cdots \sigma(W_1 x)))$
      ◮ $\sigma(x) = \max\{x, 0\}$ is the ReLU activation function
      ◮ $\theta = [\mathrm{vec}(W_1)^\top, \ldots, \mathrm{vec}(W_L)^\top]^\top \in \mathbb{R}^p$, where $p = m + md + m^2(L-2)$

  21. Background – neural network
      Fully connected neural networks:
      $f(x; \theta) = \sqrt{m}\, W_L\, \sigma(W_{L-1}\, \sigma(\cdots \sigma(W_1 x)))$
      ◮ $\sigma(x) = \max\{x, 0\}$ is the ReLU activation function
      ◮ $\theta = [\mathrm{vec}(W_1)^\top, \ldots, \mathrm{vec}(W_L)^\top]^\top \in \mathbb{R}^p$, where $p = m + md + m^2(L-2)$
      ◮ Gradient of the neural network: $g(x; \theta) = \nabla_\theta f(x; \theta) \in \mathbb{R}^p$
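
A minimal PyTorch sketch of this network and its flattened parameter gradient, assuming ReLU activations and no bias terms as on the slide; the class and function names are my own, and the width/depth defaults are arbitrary.

    import torch
    import torch.nn as nn

    class FCReLUNet(nn.Module):
        """f(x; theta) = sqrt(m) * W_L sigma(W_{L-1} ... sigma(W_1 x)), no bias terms."""
        def __init__(self, d, m=64, L=3):
            super().__init__()
            dims = [d] + [m] * (L - 1)
            self.hidden = nn.ModuleList(
                nn.Linear(dims[i], dims[i + 1], bias=False) for i in range(L - 1)
            )
            self.out = nn.Linear(m, 1, bias=False)   # W_L in R^{1 x m}
            self.m = m

        def forward(self, x):
            for layer in self.hidden:
                x = torch.relu(layer(x))             # sigma(W_i x)
            return (self.m ** 0.5) * self.out(x)     # sqrt(m) * W_L * (...)

    def flat_gradient(net, x):
        """g(x; theta) = grad_theta f(x; theta), flattened into a single vector in R^p."""
        net.zero_grad()
        net(x).squeeze().backward()
        return torch.cat([p.grad.reshape(-1) for p in net.parameters()])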

  22. Question
      ◮ Neural network-based contextual bandit algorithms exist (Riquelme et al. 2018; Zahavy and Mannor 2019)
        ◮ ...but they come with no theoretical guarantee

  23. Question
      ◮ Neural network-based contextual bandit algorithms exist (Riquelme et al. 2018; Zahavy and Mannor 2019)
        ◮ ...but they come with no theoretical guarantee
      Can we design a provably efficient neural network-based algorithm that learns a general reward function?

  24. Question
      ◮ Neural network-based contextual bandit algorithms exist (Riquelme et al. 2018; Zahavy and Mannor 2019)
        ◮ ...but they come with no theoretical guarantee
      Can we design a provably efficient neural network-based algorithm that learns a general reward function?
      Yes! NeuralUCB
      ◮ A neural network models the reward function; a UCB strategy drives exploration
      ◮ Theoretical guarantee: $\tilde{O}(\sqrt{T})$ regret
      ◮ Matches the regret bound for the linear setting (Abbasi-Yadkori et al. 2011)

  25. NeuralUCB – initialization
      ◮ Special initialization of $\theta_0$
        ◮ For $1 \le l \le L-1$: $W_l = \begin{pmatrix} W & 0 \\ 0 & W \end{pmatrix}$, with entries $W_{i,j} \sim N(0, 4/m)$
        ◮ For $l = L$: $W_L = (w^\top, -w^\top)$, with entries $w_i \sim N(0, 2/m)$

  26. NeuralUCB – initialization
      ◮ Special initialization of $\theta_0$
        ◮ For $1 \le l \le L-1$: $W_l = \begin{pmatrix} W & 0 \\ 0 & W \end{pmatrix}$, with entries $W_{i,j} \sim N(0, 4/m)$
        ◮ For $l = L$: $W_L = (w^\top, -w^\top)$, with entries $w_i \sim N(0, 2/m)$
      ◮ Normalization of the contexts $\{x_i\}$: for every $1 \le i \le TK$, $\|x_i\|_2 = 1$ and $[x_i]_j = [x_i]_{j + d/2}$
        ◮ For any unit vector $x$, construct $x' = (x; x)/\sqrt{2}$

  27. NeuralUCB – initialization
      ◮ Special initialization of $\theta_0$
        ◮ For $1 \le l \le L-1$: $W_l = \begin{pmatrix} W & 0 \\ 0 & W \end{pmatrix}$, with entries $W_{i,j} \sim N(0, 4/m)$
        ◮ For $l = L$: $W_L = (w^\top, -w^\top)$, with entries $w_i \sim N(0, 2/m)$
      ◮ Normalization of the contexts $\{x_i\}$: for every $1 \le i \le TK$, $\|x_i\|_2 = 1$ and $[x_i]_j = [x_i]_{j + d/2}$
        ◮ For any unit vector $x$, construct $x' = (x; x)/\sqrt{2}$
      Together these guarantee that $f(x_i; \theta_0) = 0$!
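
A small NumPy sketch of this symmetric initialization and of the context-doubling trick; it assumes the input dimension d already refers to the doubled contexts (so d and m are even), and the helper names are mine.

    import numpy as np

    def symmetric_init(d, m, L, seed=0):
        """Build W_1, ..., W_L as on the slide: hidden W_l = [[W, 0], [0, W]] with
        W_ij ~ N(0, 4/m), and W_L = (w^T, -w^T) with w_i ~ N(0, 2/m)."""
        rng = np.random.default_rng(seed)
        weights = []
        for l in range(1, L):                        # layers 1, ..., L-1
            cols = d if l == 1 else m
            W = rng.normal(0.0, np.sqrt(4.0 / m), size=(m // 2, cols // 2))
            weights.append(np.block([[W, np.zeros_like(W)], [np.zeros_like(W), W]]))
        w = rng.normal(0.0, np.sqrt(2.0 / m), size=m // 2)
        weights.append(np.concatenate([w, -w])[None, :])   # W_L as a 1 x m row vector
        return weights

    def double_context(x):
        """x' = (x; x) / sqrt(2): keeps ||x'||_2 = ||x||_2 and, with the initialization
        above, yields f(x'; theta_0) = 0 because the two identical halves cancel."""
        return np.concatenate([x, x]) / np.sqrt(2.0)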

  28. NeuralUCB – upper confidence bounds
      At round t, NeuralUCB will...
      ◮ Observe $\{x_{t,a}\}_{a=1}^{K}$

  29. NeuralUCB – upper confidence bounds
      At round t, NeuralUCB will...
      ◮ Observe $\{x_{t,a}\}_{a=1}^{K}$
      ◮ Compute an upper confidence bound for each arm a:
        $U_{t,a} = \underbrace{f(x_{t,a}; \theta_{t-1})}_{\text{mean}} + \gamma_{t-1} \underbrace{\sqrt{g(x_{t,a}; \theta_{t-1})^\top Z_{t-1}^{-1}\, g(x_{t,a}; \theta_{t-1}) / m}}_{\text{variance}}$

  30. NeuralUCB – upper confidence bounds
      At round t, NeuralUCB will...
      ◮ Observe $\{x_{t,a}\}_{a=1}^{K}$
      ◮ Compute an upper confidence bound for each arm a:
        $U_{t,a} = \underbrace{f(x_{t,a}; \theta_{t-1})}_{\text{mean}} + \gamma_{t-1} \underbrace{\sqrt{g(x_{t,a}; \theta_{t-1})^\top Z_{t-1}^{-1}\, g(x_{t,a}; \theta_{t-1}) / m}}_{\text{variance}}$
      Compare with LinUCB (Li et al. 2010):
        $U_{t,a} = \underbrace{\langle x_{t,a}, \theta_{t-1} \rangle}_{\text{mean}} + \gamma_{t-1} \underbrace{\sqrt{x_{t,a}^\top Z_{t-1}^{-1}\, x_{t,a}}}_{\text{variance}}$

  31. NeuralUCB – upper confidence bounds
      At round t, NeuralUCB will...
      ◮ Observe $\{x_{t,a}\}_{a=1}^{K}$
      ◮ Compute an upper confidence bound for each arm a:
        $U_{t,a} = \underbrace{f(x_{t,a}; \theta_{t-1})}_{\text{mean}} + \gamma_{t-1} \underbrace{\sqrt{g(x_{t,a}; \theta_{t-1})^\top Z_{t-1}^{-1}\, g(x_{t,a}; \theta_{t-1}) / m}}_{\text{variance}}$
      Compare with LinUCB (Li et al. 2010):
        $U_{t,a} = \underbrace{\langle x_{t,a}, \theta_{t-1} \rangle}_{\text{mean}} + \gamma_{t-1} \underbrace{\sqrt{x_{t,a}^\top Z_{t-1}^{-1}\, x_{t,a}}}_{\text{variance}}$
      ◮ Select $a_t = \mathrm{argmax}_{a \in [K]} U_{t,a}$, play $a_t$, and observe the reward $r_{t,a_t}$
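
Putting the pieces together, here is a hedged Python sketch of one round of NeuralUCB arm selection, reusing the FCReLUNet and flat_gradient helpers sketched earlier; the function name and the way $Z_{t-1}^{-1}$ is passed in are my own choices, not the paper's interface.

    import torch

    def neural_ucb_select(net, contexts, Z_inv, gamma):
        """One round of arm selection: U_{t,a} = f(x; theta) + gamma * sqrt(g^T Z^{-1} g / m).

        contexts: tensor of shape (K, d) holding x_{t,1}, ..., x_{t,K}
        Z_inv:    inverse design matrix Z_{t-1}^{-1} of shape (p, p)
        gamma:    exploration radius gamma_{t-1}
        """
        ucb = []
        for x in contexts:
            mean = net(x).item()                              # f(x_{t,a}; theta_{t-1})
            g = flat_gradient(net, x)                         # g(x_{t,a}; theta_{t-1}) in R^p
            bonus = torch.sqrt(g @ Z_inv @ g / net.m).item()  # sqrt(g^T Z^{-1} g / m)
            ucb.append(mean + gamma * bonus)
        return int(torch.tensor(ucb).argmax())                # a_t = argmax_a U_{t,a}

The remaining steps of NeuralUCB (updating Z with the chosen arm's scaled gradient outer product and retraining theta by gradient descent on a regularized squared loss) come from the paper and are not covered by the slides above.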
