Bypassing the Monster: A Faster and Simpler Optimal Algorithm for Contextual Bandits under Realizability
Yunzong Xu (MIT)
Joint work with David Simchi-Levi (MIT)
July 18, RealML @ ICML 2020
Stochastic Contextual Bandits
• For round t = 1, …, T:
  • Nature generates a random context x_t according to a fixed unknown distribution D_context
  • Learner observes x_t and makes a decision a_t ∈ {1, …, K}
  • Nature generates a random reward r_t(x_t, a_t) ∈ [0, 1] according to an unknown distribution D_{x_t, a_t} with (conditional) mean E[r_t(x_t, a_t) | x_t = x, a_t = a] = f*(x, a)
• We call f* the ground-truth reward function
• In statistical learning, a function class F is used to approximate f*. Some examples of F:
  • Linear classes / high-dimensional linear classes / generalized linear models
  • Reproducing kernel Hilbert spaces
  • Lipschitz and Hölder spaces
  • Neural networks
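The interaction protocol above can be sketched in a few lines of Python. This is a minimal illustrative simulation, not part of the talk: the context distribution, the number of actions K, and the ground-truth reward function f_star below are all hypothetical placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
K = 3   # number of actions (assumed for illustration)
T = 1000  # horizon
d = 2   # context dimension (assumed for illustration)

def f_star(x, a):
    # Hypothetical ground-truth mean reward f*(x, a) in [0, 1].
    w = np.array([[0.9, 0.1], [0.2, 0.8], [0.5, 0.5]])
    return float(np.clip(w[a] @ x, 0.0, 1.0))

def run_round(policy):
    x = rng.uniform(size=d)      # nature draws context x_t ~ D_context
    a = policy(x)                # learner observes x_t and picks a_t
    r = rng.binomial(1, f_star(x, a))  # stochastic reward with mean f*(x_t, a_t)
    return x, a, r

# Example: average reward of a uniformly random policy over T rounds.
total = sum(run_round(lambda x: int(rng.integers(K)))[2] for _ in range(T))
print(total / T)
```

The learner only ever sees the sampled triples (x_t, a_t, r_t); f_star itself stays hidden inside the environment.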
Challenges
• We are interested in contextual bandits with a general function class F
• Realizability assumption: f* ∈ F
• Statistical challenge: how to achieve the minimax optimal regret for a general function class F?
• Computational challenge: how to make the algorithm computationally efficient?
• Existing contextual bandit approaches cannot simultaneously address both challenges in practice, as they typically:
  • Rely on strong parametric/structural assumptions on F (e.g., UCB variants and Thompson Sampling)
  • Become computationally intractable for large F (e.g., EXP4)
  • Assume computationally expensive or statistically restrictive oracles that are only implementable for specific F (a series of work on oracle-based contextual bandits)
Research Question
• Observation: the statistical and computational aspects of "offline regression with a general F" are very well-studied in ML
• Can we reduce general contextual bandits to general offline regression?
• Specifically, for any F, given an offline regression oracle, i.e., a least-squares regression oracle (ERM with square loss):
    min_{f ∈ F} Σ_{t=1}^{n} (f(x_t, a_t) − r_t(x_t, a_t))²
  can we design an algorithm that achieves the optimal regret via a few calls to this oracle?
• An open problem mentioned in Agarwal et al. (2012), Foster et al. (2018), Foster and Rakhlin (2020)
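For a concrete sense of what such an oracle looks like, here is a sketch of the least-squares ERM oracle for one illustrative choice of F, a linear class f_w(x, a) = w[a] @ x. The class choice is an assumption for this example only; for a linear class, ERM with square loss reduces to per-action ordinary least squares.

```python
import numpy as np

def least_squares_oracle(data, K, d):
    """Return argmin over f in F of sum_t (f(x_t, a_t) - r_t)^2,
    where F is the (assumed) linear class f_w(x, a) = w[a] @ x.

    data: list of (x_t, a_t, r_t) triples with x_t an array of shape (d,).
    """
    W = np.zeros((K, d))
    for a in range(K):
        # Square loss decouples across actions for this class:
        # fit w[a] on the rounds where action a was played.
        pts = [(x, r) for x, act, r in data if act == a]
        if pts:
            X = np.array([x for x, _ in pts])
            y = np.array([r for _, r in pts])
            W[a], *_ = np.linalg.lstsq(X, y, rcond=None)
    return lambda x, a: float(W[a] @ x)

# Usage: fit on a tiny synthetic log and query the fitted predictor.
rng = np.random.default_rng(1)
data = [(rng.uniform(size=2), int(rng.integers(2)), float(rng.uniform()))
        for _ in range(50)]
f_hat = least_squares_oracle(data, K=2, d=2)
print(f_hat(np.array([0.5, 0.5]), 0))
```

Any other regression procedure for F (kernel ridge, neural network training, etc.) could stand in for `least_squares_oracle`; the reduction only requires that the oracle approximately minimize the empirical square loss.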
Our Contributions
• We provide the first optimal and efficient offline-regression-oracle-based algorithm for general contextual bandits (under realizability)
  • The algorithm is much simpler and faster than existing approaches to general contextual bandits
• We provide the first universal and optimal black-box reduction from contextual bandits to offline regression
  • Any advances in offline (square loss) regression immediately translate to contextual bandits, statistically and computationally