Neural Contextual Bandits with UCB-based Exploration


  1. Neural Contextual Bandits with UCB-based Exploration
     Dongruo Zhou (1), Lihong Li (2), Quanquan Gu (1)
     (1) Department of Computer Science, UCLA; (2) Google Research

  2. Outline
     ◮ Background
       ◮ Contextual bandit problem
       ◮ Deep neural networks

  3. Outline
     ◮ Background
       ◮ Contextual bandit problem
       ◮ Deep neural networks
     ◮ Algorithm – NeuralUCB
       ◮ Use a neural network to learn the reward
       ◮ Use the neural network's gradient to explore
       ◮ Upper confidence bound strategy

  4. Outline
     ◮ Background
       ◮ Contextual bandit problem
       ◮ Deep neural networks
     ◮ Algorithm – NeuralUCB
       ◮ Use a neural network to learn the reward
       ◮ Use the neural network's gradient to explore
       ◮ Upper confidence bound strategy
     ◮ Main theory
       ◮ Neural tangent kernel matrix and effective dimension
       ◮ $\tilde{O}(\sqrt{T})$ regret

  5. Background – decision-making problems
     Decision-making problems are everywhere!
     ◮ As a gambler in a casino, you must pick a slot machine to play...
       ◮ Limited budget; maximize the payoff!
       ◮ Which arm to pull?
     ◮ As a movie recommender, you need to...
       ◮ Recommend movies based on users' interests; maximize the users' purchase rate
       ◮ Which movie to recommend?
     (Figures: (a) slot machine, (b) movie recommendation)

  6. Background – contextual bandit
     K-armed contextual bandit problem: movie recommendation

  7. Background – contextual bandit
     K-armed contextual bandit problem: movie recommendation. At round t,
     ◮ The agent observes K d-dimensional contextual vectors (the user's movie purchase history) $\{x_{t,a} \in \mathbb{R}^d \mid a \in [K]\}$

  8. Background – contextual bandit
     K-armed contextual bandit problem: movie recommendation. At round t,
     ◮ The agent observes K d-dimensional contextual vectors (the user's movie purchase history) $\{x_{t,a} \in \mathbb{R}^d \mid a \in [K]\}$
     ◮ The agent selects an action $a_t$ and receives a reward $r_{t,a_t}$ (it recommends a movie and the user chooses whether or not to purchase)

  9. Background – contextual bandit
     K-armed contextual bandit problem: movie recommendation. At round t,
     ◮ The agent observes K d-dimensional contextual vectors (the user's movie purchase history) $\{x_{t,a} \in \mathbb{R}^d \mid a \in [K]\}$
     ◮ The agent selects an action $a_t$ and receives a reward $r_{t,a_t}$ (it recommends a movie and the user chooses whether or not to purchase)
     ◮ The goal is to minimize the pseudo-regret
       $R_T = \mathbb{E}\big[\sum_{t=1}^{T} (r_{t,a_t^*} - r_{t,a_t})\big]$,
       where $a_t^* = \mathrm{argmax}_{a \in [K]} \mathbb{E}[r_{t,a}]$ is the optimal action at round t
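
To make the interaction protocol concrete, here is a minimal Python sketch of the K-armed contextual bandit loop with pseudo-regret accounting. The agent interface (select/update), the context distribution, and the Gaussian noise level are illustrative assumptions, not part of the slides.

    import numpy as np

    def run_bandit(agent, h, T=1000, K=4, d=8, noise=0.1, seed=0):
        """Generic K-armed contextual bandit protocol with pseudo-regret tracking.

        agent: hypothetical object exposing select(contexts) -> arm index and update(x, r)
        h:     unknown mean-reward function, so r_{t,a} = h(x_{t,a}) + xi_t
        """
        rng = np.random.default_rng(seed)
        pseudo_regret = 0.0
        for t in range(T):
            contexts = rng.normal(size=(K, d))
            contexts /= np.linalg.norm(contexts, axis=1, keepdims=True)  # unit-norm contexts
            means = np.array([h(x) for x in contexts])                   # E[r_{t,a}] per arm
            a_t = agent.select(contexts)                                 # agent picks an arm
            reward = means[a_t] + rng.normal(0.0, noise)                 # noisy observed reward
            agent.update(contexts[a_t], reward)
            pseudo_regret += means.max() - means[a_t]                    # E[r_{t,a_t^*}] - E[r_{t,a_t}]
        return pseudo_regret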

  10. Background – contextual linear bandit
      $r_{t,a_t} = \langle \theta^*, x_{t,a_t} \rangle + \xi_t$, where $\xi_t$ is $\nu$-sub-Gaussian

  11. Background – contextual linear bandit
      $r_{t,a_t} = \langle \theta^*, x_{t,a_t} \rangle + \xi_t$, where $\xi_t$ is $\nu$-sub-Gaussian
      ◮ Build a confidence set for $\theta^*$ and use the optimism-in-the-face-of-uncertainty (OFU) principle

  12. Background – contextual linear bandit
      $r_{t,a_t} = \langle \theta^*, x_{t,a_t} \rangle + \xi_t$, where $\xi_t$ is $\nu$-sub-Gaussian
      ◮ Build a confidence set for $\theta^*$ and use the optimism-in-the-face-of-uncertainty (OFU) principle
      ◮ Leads to $\tilde{O}(d\sqrt{T})$ regret (Abbasi-Yadkori et al. 2011)
      ◮ Strongly depends on the linear structure!
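
For reference, the OFU construction in the linear case keeps a ridge estimate of $\theta^*$ and an ellipsoidal confidence set around it. The display below is a schematic sketch in the spirit of Abbasi-Yadkori et al. (2011), with the regularization $\lambda$ and the radius $\beta_t$ left unspecified:

    $V_t = \lambda I + \sum_{s=1}^{t} x_{s,a_s} x_{s,a_s}^\top$, \qquad
    $\hat{\theta}_t = V_t^{-1} \sum_{s=1}^{t} r_{s,a_s}\, x_{s,a_s}$, \qquad
    $a_{t+1} = \mathrm{argmax}_{a \in [K]} \; \max_{\theta:\, \|\theta - \hat{\theta}_t\|_{V_t} \le \beta_t} \langle \theta, x_{t+1,a} \rangle$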

  13. Background – general reward function
      $r_{t,a_t} = h(x_{t,a_t}) + \xi_t$, where $0 \le h(x) \le 1$ and $\xi_t$ is $\nu$-sub-Gaussian

  14. Background – general reward function
      $r_{t,a_t} = h(x_{t,a_t}) + \xi_t$, where $0 \le h(x) \le 1$ and $\xi_t$ is $\nu$-sub-Gaussian
      ◮ This covers many popular contextual bandit problems
        ◮ Linear bandit: $h(x) = \langle \theta, x \rangle$, where $\|\theta\|_2 \le 1$, $\|x\|_2 \le 1$
        ◮ Generalized linear bandit: $h(x) = g(\langle \theta, x \rangle)$, where $\|\theta\|_2 \le 1$, $\|x\|_2 \le 1$, $|\nabla g| \le 1$

  15. Background – general reward function
      $r_{t,a_t} = h(x_{t,a_t}) + \xi_t$, where $0 \le h(x) \le 1$ and $\xi_t$ is $\nu$-sub-Gaussian
      ◮ This covers many popular contextual bandit problems
        ◮ Linear bandit: $h(x) = \langle \theta, x \rangle$, where $\|\theta\|_2 \le 1$, $\|x\|_2 \le 1$
        ◮ Generalized linear bandit: $h(x) = g(\langle \theta, x \rangle)$, where $\|\theta\|_2 \le 1$, $\|x\|_2 \le 1$, $|\nabla g| \le 1$
      We do not know what h is...

  16. Background – general reward function
      $r_{t,a_t} = h(x_{t,a_t}) + \xi_t$, where $0 \le h(x) \le 1$ and $\xi_t$ is $\nu$-sub-Gaussian
      ◮ This covers many popular contextual bandit problems
        ◮ Linear bandit: $h(x) = \langle \theta, x \rangle$, where $\|\theta\|_2 \le 1$, $\|x\|_2 \le 1$
        ◮ Generalized linear bandit: $h(x) = g(\langle \theta, x \rangle)$, where $\|\theta\|_2 \le 1$, $\|x\|_2 \le 1$, $|\nabla g| \le 1$
      We do not know what h is...
      Use a universal function approximator, such as a neural network!
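
As a small illustration, the two special cases above can be written as reward functions that plug directly into the bandit-loop sketch shown earlier; the logistic link used for the generalized linear case is just one admissible choice of g.

    import numpy as np

    def make_linear_reward(theta):
        """Linear bandit: h(x) = <theta, x>, with ||theta||_2 <= 1."""
        return lambda x: float(theta @ x)

    def make_glm_reward(theta):
        """Generalized linear bandit: h(x) = g(<theta, x>) with a 1-Lipschitz link g.
        The logistic link below is an illustrative choice (its derivative is at most 1/4)."""
        g = lambda z: 1.0 / (1.0 + np.exp(-z))
        return lambda x: float(g(theta @ x))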

  17. Background – neural network
      Fully connected neural networks:
      $f(x; \theta) = \sqrt{m}\, W_L\, \sigma(W_{L-1}\, \sigma(\cdots \sigma(W_1 x)))$

  18. Background – neural network
      Fully connected neural networks:
      $f(x; \theta) = \sqrt{m}\, W_L\, \sigma(W_{L-1}\, \sigma(\cdots \sigma(W_1 x)))$
      ◮ $\sigma(x) = \max\{x, 0\}$ is the ReLU activation function

  19. Background – neural network
      Fully connected neural networks:
      $f(x; \theta) = \sqrt{m}\, W_L\, \sigma(W_{L-1}\, \sigma(\cdots \sigma(W_1 x)))$
      ◮ $\sigma(x) = \max\{x, 0\}$ is the ReLU activation function
      ◮ $W_i$ is the weight matrix of layer i
        ◮ $W_1 \in \mathbb{R}^{m \times d}$
        ◮ $W_i \in \mathbb{R}^{m \times m}$ for $2 \le i \le L-1$
        ◮ $W_L \in \mathbb{R}^{1 \times m}$

  20. Background – neural network
      Fully connected neural networks:
      $f(x; \theta) = \sqrt{m}\, W_L\, \sigma(W_{L-1}\, \sigma(\cdots \sigma(W_1 x)))$
      ◮ $\sigma(x) = \max\{x, 0\}$ is the ReLU activation function
      ◮ $\theta = [\mathrm{vec}(W_1)^\top, \ldots, \mathrm{vec}(W_L)^\top]^\top \in \mathbb{R}^p$, where $p = m + md + m^2(L-2)$

  21. Background – neural network
      Fully connected neural networks:
      $f(x; \theta) = \sqrt{m}\, W_L\, \sigma(W_{L-1}\, \sigma(\cdots \sigma(W_1 x)))$
      ◮ $\sigma(x) = \max\{x, 0\}$ is the ReLU activation function
      ◮ $\theta = [\mathrm{vec}(W_1)^\top, \ldots, \mathrm{vec}(W_L)^\top]^\top \in \mathbb{R}^p$, where $p = m + md + m^2(L-2)$
      ◮ Gradient of the neural network: $g(x; \theta) = \nabla_\theta f(x; \theta) \in \mathbb{R}^p$
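
A minimal PyTorch sketch of this network and its flattened parameter gradient, assuming ReLU activations and no bias terms as on the slide; the class and function names are my own, and the width/depth defaults are arbitrary.

    import torch
    import torch.nn as nn

    class FCReLUNet(nn.Module):
        """f(x; theta) = sqrt(m) * W_L sigma(W_{L-1} ... sigma(W_1 x)), no bias terms."""
        def __init__(self, d, m=64, L=3):
            super().__init__()
            dims = [d] + [m] * (L - 1)
            self.hidden = nn.ModuleList(
                nn.Linear(dims[i], dims[i + 1], bias=False) for i in range(L - 1)
            )
            self.out = nn.Linear(m, 1, bias=False)   # W_L in R^{1 x m}
            self.m = m

        def forward(self, x):
            for layer in self.hidden:
                x = torch.relu(layer(x))             # sigma(W_i x)
            return (self.m ** 0.5) * self.out(x)     # sqrt(m) * W_L * (...)

    def flat_gradient(net, x):
        """g(x; theta) = grad_theta f(x; theta), flattened into a single vector in R^p."""
        net.zero_grad()
        net(x).squeeze().backward()
        return torch.cat([p.grad.reshape(-1) for p in net.parameters()])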

  22. Question
      ◮ Neural network-based contextual bandit algorithms exist (Riquelme et al. 2018; Zahavy and Mannor 2019)
        ◮ ...but they come with no theoretical guarantee

  23. Question
      ◮ Neural network-based contextual bandit algorithms exist (Riquelme et al. 2018; Zahavy and Mannor 2019)
        ◮ ...but they come with no theoretical guarantee
      Can we design a provably efficient neural network-based algorithm that learns a general reward function?

  24. Question
      ◮ Neural network-based contextual bandit algorithms exist (Riquelme et al. 2018; Zahavy and Mannor 2019)
        ◮ ...but they come with no theoretical guarantee
      Can we design a provably efficient neural network-based algorithm that learns a general reward function?
      Yes! NeuralUCB
      ◮ A neural network models the reward function; a UCB strategy drives exploration
      ◮ Theoretical guarantee: $\tilde{O}(\sqrt{T})$ regret
      ◮ Matches the regret bound for the linear setting (Abbasi-Yadkori et al. 2011)

  25. NeuralUCB – initialization
      ◮ Special initialization of $\theta_0$
        ◮ For $1 \le l \le L-1$: $W_l = \begin{pmatrix} W & 0 \\ 0 & W \end{pmatrix}$, with entries $W_{i,j} \sim N(0, 4/m)$
        ◮ For $l = L$: $W_L = (w^\top, -w^\top)$, with entries $w_i \sim N(0, 2/m)$

  26. NeuralUCB – initialization
      ◮ Special initialization of $\theta_0$
        ◮ For $1 \le l \le L-1$: $W_l = \begin{pmatrix} W & 0 \\ 0 & W \end{pmatrix}$, with entries $W_{i,j} \sim N(0, 4/m)$
        ◮ For $l = L$: $W_L = (w^\top, -w^\top)$, with entries $w_i \sim N(0, 2/m)$
      ◮ Normalization of the contexts $\{x_i\}$: for every $1 \le i \le TK$, $\|x_i\|_2 = 1$ and $[x_i]_j = [x_i]_{j + d/2}$
        ◮ For any unit vector $x$, construct $x' = (x; x)/\sqrt{2}$

  27. NeuralUCB – initialization
      ◮ Special initialization of $\theta_0$
        ◮ For $1 \le l \le L-1$: $W_l = \begin{pmatrix} W & 0 \\ 0 & W \end{pmatrix}$, with entries $W_{i,j} \sim N(0, 4/m)$
        ◮ For $l = L$: $W_L = (w^\top, -w^\top)$, with entries $w_i \sim N(0, 2/m)$
      ◮ Normalization of the contexts $\{x_i\}$: for every $1 \le i \le TK$, $\|x_i\|_2 = 1$ and $[x_i]_j = [x_i]_{j + d/2}$
        ◮ For any unit vector $x$, construct $x' = (x; x)/\sqrt{2}$
      Together these guarantee that $f(x_i; \theta_0) = 0$!
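
A small NumPy sketch of this symmetric initialization and of the context-doubling trick; it assumes the input dimension d already refers to the doubled contexts (so d and m are even), and the helper names are mine.

    import numpy as np

    def symmetric_init(d, m, L, seed=0):
        """Build W_1, ..., W_L as on the slide: hidden W_l = [[W, 0], [0, W]] with
        W_ij ~ N(0, 4/m), and W_L = (w^T, -w^T) with w_i ~ N(0, 2/m)."""
        rng = np.random.default_rng(seed)
        weights = []
        for l in range(1, L):                        # layers 1, ..., L-1
            cols = d if l == 1 else m
            W = rng.normal(0.0, np.sqrt(4.0 / m), size=(m // 2, cols // 2))
            weights.append(np.block([[W, np.zeros_like(W)], [np.zeros_like(W), W]]))
        w = rng.normal(0.0, np.sqrt(2.0 / m), size=m // 2)
        weights.append(np.concatenate([w, -w])[None, :])   # W_L as a 1 x m row vector
        return weights

    def double_context(x):
        """x' = (x; x) / sqrt(2): keeps ||x'||_2 = ||x||_2 and, with the initialization
        above, yields f(x'; theta_0) = 0 because the two identical halves cancel."""
        return np.concatenate([x, x]) / np.sqrt(2.0)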

  28. NeuralUCB – upper confidence bounds
      At round t, NeuralUCB will...
      ◮ Observe $\{x_{t,a}\}_{a=1}^{K}$

  29. NeuralUCB – upper confidence bounds
      At round t, NeuralUCB will...
      ◮ Observe $\{x_{t,a}\}_{a=1}^{K}$
      ◮ Compute an upper confidence bound for each arm a:
        $U_{t,a} = \underbrace{f(x_{t,a}; \theta_{t-1})}_{\text{mean}} + \gamma_{t-1} \underbrace{\sqrt{g(x_{t,a}; \theta_{t-1})^\top Z_{t-1}^{-1}\, g(x_{t,a}; \theta_{t-1}) / m}}_{\text{variance}}$

  30. NeuralUCB – upper confidence bounds
      At round t, NeuralUCB will...
      ◮ Observe $\{x_{t,a}\}_{a=1}^{K}$
      ◮ Compute an upper confidence bound for each arm a:
        $U_{t,a} = \underbrace{f(x_{t,a}; \theta_{t-1})}_{\text{mean}} + \gamma_{t-1} \underbrace{\sqrt{g(x_{t,a}; \theta_{t-1})^\top Z_{t-1}^{-1}\, g(x_{t,a}; \theta_{t-1}) / m}}_{\text{variance}}$
      Compare with LinUCB (Li et al. 2010):
        $U_{t,a} = \underbrace{\langle x_{t,a}, \theta_{t-1} \rangle}_{\text{mean}} + \gamma_{t-1} \underbrace{\sqrt{x_{t,a}^\top Z_{t-1}^{-1}\, x_{t,a}}}_{\text{variance}}$

  31. NeuralUCB – upper confidence bounds
      At round t, NeuralUCB will...
      ◮ Observe $\{x_{t,a}\}_{a=1}^{K}$
      ◮ Compute an upper confidence bound for each arm a:
        $U_{t,a} = \underbrace{f(x_{t,a}; \theta_{t-1})}_{\text{mean}} + \gamma_{t-1} \underbrace{\sqrt{g(x_{t,a}; \theta_{t-1})^\top Z_{t-1}^{-1}\, g(x_{t,a}; \theta_{t-1}) / m}}_{\text{variance}}$
      Compare with LinUCB (Li et al. 2010):
        $U_{t,a} = \underbrace{\langle x_{t,a}, \theta_{t-1} \rangle}_{\text{mean}} + \gamma_{t-1} \underbrace{\sqrt{x_{t,a}^\top Z_{t-1}^{-1}\, x_{t,a}}}_{\text{variance}}$
      ◮ Select $a_t = \mathrm{argmax}_{a \in [K]} U_{t,a}$, play $a_t$, and observe the reward $r_{t,a_t}$
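
Putting the pieces together, here is a hedged Python sketch of one round of NeuralUCB arm selection, reusing the FCReLUNet and flat_gradient helpers sketched earlier; the function name and the way $Z_{t-1}^{-1}$ is passed in are my own choices, not the paper's interface.

    import torch

    def neural_ucb_select(net, contexts, Z_inv, gamma):
        """One round of arm selection: U_{t,a} = f(x; theta) + gamma * sqrt(g^T Z^{-1} g / m).

        contexts: tensor of shape (K, d) holding x_{t,1}, ..., x_{t,K}
        Z_inv:    inverse design matrix Z_{t-1}^{-1} of shape (p, p)
        gamma:    exploration radius gamma_{t-1}
        """
        ucb = []
        for x in contexts:
            mean = net(x).item()                              # f(x_{t,a}; theta_{t-1})
            g = flat_gradient(net, x)                         # g(x_{t,a}; theta_{t-1}) in R^p
            bonus = torch.sqrt(g @ Z_inv @ g / net.m).item()  # sqrt(g^T Z^{-1} g / m)
            ucb.append(mean + gamma * bonus)
        return int(torch.tensor(ucb).argmax())                # a_t = argmax_a U_{t,a}

The remaining steps of NeuralUCB (updating Z with the chosen arm's scaled gradient outer product and retraining theta by gradient descent on a regularized squared loss) come from the paper and are not covered by the slides above.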
