On conditional versus marginal bias in multi-armed bandits
Jaehyeok Shin (1), Aaditya Ramdas (1, 2), and Alessandro Rinaldo (1)
(1) Dept. of Statistics and Data Science, (2) Machine Learning Dept., CMU
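Stochastic multi-armed bandits (MABs)
[Figure: $K$ arms with means $\mu_1, \mu_2, \dots, \mu_K$; pulling an arm yields a "random reward" $Y$ drawn from that arm's distribution.]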
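Adaptive sampling scheme to maximize rewards / to identify the best arm
[Figure: timeline over the arms $\mu_1, \mu_2, \dots, \mu_K$: reward $Y_1$ is observed at $t = 1$, reward $Y_2$ at $t = 2$, and so on until a stopping time $\mathcal{U}$.]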
Collected data can be used to identify an interesting arm...
e.g., $\kappa = \arg\max_k \hat{\mu}_k(\mathcal{U})$
[Figure: sampling timeline with rewards $Y_1$ at $t = 1$, $Y_2$ at $t = 2$, ..., up to the stopping time $\mathcal{U}$.]
...and the data can be used to conduct statistical inferences.
Sample mean $\hat{\mu}_\kappa(\mathcal{U})$ at a stopping time $\mathcal{U}$
Q. Sign of the bias of the sample mean? $\mathbb{E}[\hat{\mu}_\kappa(\mathcal{U}) - \mu_\kappa] \le 0$ or $\ge 0$?
Xu et al. [2013]: An informal argument for why the sample mean is negatively biased for "optimistic" algorithms.
Villar et al. [2015]: Demonstrate this negative bias in a simulation study motivated by using MABs for clinical trials.
Nie et al. [2018]: The sample mean is negatively biased for MABs designed to maximize cumulative reward.
$\mathbb{E}[\hat{\mu}_k(t) - \mu_k] \le 0$ (fixed arm $k$, fixed time $t$)
Shin et al. [2019]: Introduced a "monotonicity property" characterizing the bias of the sample mean for more general classes of MABs.
$\mathbb{E}[\hat{\mu}_\kappa(\mathcal{U}) - \mu_\kappa]$ (chosen arm $\kappa$, stopping time $\mathcal{U}$)
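
The fixed-arm, fixed-time claim can be checked exactly on a toy problem. The sketch below is illustrative only: it assumes two Bernoulli(1/2) arms, a horizon of three pulls, a plain greedy rule with ties going to the lower index, and a table convention in which `table[t][k]` is the reward arm `k` would return at time `t + 1`. None of these choices come from the slides.

```python
from fractions import Fraction
from itertools import product

def greedy_counts_and_sums(table):
    """Run a greedy bandit on a fixed table of rewards.

    table[t][k] is the reward arm k would give at time t+1 (assumed convention).
    Each arm is pulled once, then the arm with the larger sample mean is pulled
    (ties go to the lower index).
    """
    num_arms = len(table[0])
    counts = [0] * num_arms
    sums = [Fraction(0)] * num_arms
    for t, row in enumerate(table):
        if t < num_arms:
            arm = t  # initial round-robin so every sample mean is defined
        else:
            arm = max(range(num_arms), key=lambda k: (sums[k] / counts[k], -k))
        counts[arm] += 1
        sums[arm] += row[arm]
    return counts, sums

def exact_bias(horizon=3, num_arms=2):
    """Exact bias of each arm's sample mean at the fixed time `horizon`,
    for Bernoulli(1/2) arms, by enumerating all equally likely tables."""
    total = [Fraction(0)] * num_arms
    tables = list(product([0, 1], repeat=horizon * num_arms))
    for flat in tables:
        table = [flat[t * num_arms:(t + 1) * num_arms] for t in range(horizon)]
        counts, sums = greedy_counts_and_sums(table)
        for k in range(num_arms):
            total[k] += sums[k] / counts[k]
    return [total[k] / len(tables) - Fraction(1, 2) for k in range(num_arms)]

print(exact_bias())  # [Fraction(-1, 16), Fraction(-1, 16)]
```

Both arms come out with bias -1/16: an arm that starts with a low reward tends not to be pulled again, so its low sample mean is never corrected.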
However, current understanding of bias is limited in two aspects.
1. Existing results concern the bias of the sample mean only. We study the bias of monotone functions of the rewards.
2. Existing guarantees cover only the marginal bias. We extend previous results to cover the conditional bias.
Marginal vs conditional bias
• $K$ prototype items with means $\mu_1, \mu_2, \dots, \mu_K$.
Want to screen out some items by testing $H_{0k}: \mu_k \ge c$ vs $H_{1k}: \mu_k < c$ for $k = 1, \dots, K$.
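Marginal vs conditional bias
$H_0: \mu \ge c$ vs $H_1: \mu < c$
[Figure: two sample paths of the running sample mean $\hat{\mu}(t)$. One runs until time $T$: "Keep the item" (fail to reject the null). The other stops at $\mathcal{U}$: "Screen out the item at $\mathcal{U}$" (reject the null).]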
Marginally, the sample mean is negatively biased.
[Figure: sample-mean paths $\hat{\mu}(t)$ for Item 1, Item 2, ..., Item $K$, each ending at $T$ or at $\mathcal{U}$.]
$\mathbb{E}[\hat{\mu}_k - \mu_k] \le 0$, $k = 1, \dots, K$
"Underestimating the true mean revenue." (e.g., Starr & Woodroofe [1968], Shin et al. [2019])
...however, we usually do not evaluate the sample mean every time.
[Figure: sample-mean paths $\hat{\mu}(t)$ for Item 1, Item 2, ..., Item $K$, each ending at $T$ or at $\mathcal{U}$.]
Conditioned on the "active" event, the sample mean is positively biased.
$\mathbb{E}[\hat{\mu}_k - \mu_k \mid \text{item } k \text{ is active}] \ge 0$, $k = 1, \dots, K$
"Overestimating the true mean revenue."
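
The sign flip between the marginal and the conditional bias can be seen exactly in a small version of the screening example. The sketch below makes its own assumptions, none of them from the slides: rewards are Bernoulli(p), an item is screened out at the first time its running mean drops below the cutoff, and an item that is never screened out counts as "active" at the horizon.

```python
from fractions import Fraction
from itertools import product

def screening_bias(horizon, cutoff, p):
    """Exact marginal and conditional bias of the sample mean at the stopping time.

    Rewards are Bernoulli(p). The item is screened out at the first time its
    running mean drops below `cutoff`; otherwise it stays active until `horizon`.
    Returns (marginal bias, bias given the item is active).
    """
    marginal = Fraction(0)      # E[sample mean at the stopping time]
    active_mass = Fraction(0)   # P(item is active)
    active_sum = Fraction(0)    # E[sample mean * 1(item is active)]
    for rewards in product([0, 1], repeat=horizon):
        prob = Fraction(1)
        for r in rewards:
            prob *= p if r == 1 else 1 - p
        total, active = 0, True
        for n, r in enumerate(rewards, start=1):
            total += r
            if Fraction(total, n) < cutoff:
                active = False
                break
        mean = Fraction(total, n)  # n is the number of samples actually used
        marginal += prob * mean
        if active:
            active_mass += prob
            active_sum += prob * mean
    return marginal - p, active_sum / active_mass - p

marginal_bias, active_bias = screening_bias(2, Fraction(1, 2), Fraction(1, 2))
print(marginal_bias, active_bias)  # -1/8 1/4
```

With two looks at a fair coin, the sample mean underestimates the true mean by 1/8 on average, yet overestimates it by 1/4 among the items that survive screening.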
Conditional bias of the empirical cumulative distribution function (CDF)
For a fixed $y \in \mathbb{R}$, $\mathbb{E}[\hat{F}_{k,\mathcal{U}}(y) - F_k(y) \mid C] \le 0$ or $\ge 0$?
e.g., $C$ = {Reject the null}, {Chosen as the best arm}, ...
where
$\hat{F}_{k,\mathcal{U}}$: empirical CDF of arm $k$ at time $\mathcal{U}$
$F_k$: true CDF of arm $k$
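
The empirical CDF and the sample mean are tied together by a standard identity, which is why a bias statement for one turns into the opposite-signed statement for the other. A sketch, assuming arm $k$ has a finite mean and that the conditional expectation and the integral may be exchanged (Fubini):

```latex
\begin{align*}
\mu_k &= \int_0^\infty \bigl(1 - F_k(y)\bigr)\,dy - \int_{-\infty}^0 F_k(y)\,dy, \\
\hat{\mu}_k(\mathcal{U}) &= \int_0^\infty \bigl(1 - \hat{F}_{k,\mathcal{U}}(y)\bigr)\,dy
  - \int_{-\infty}^0 \hat{F}_{k,\mathcal{U}}(y)\,dy, \\
\mathbb{E}\bigl[\hat{\mu}_k(\mathcal{U}) - \mu_k \mid C\bigr]
  &= -\int_{-\infty}^{\infty} \mathbb{E}\bigl[\hat{F}_{k,\mathcal{U}}(y) - F_k(y) \mid C\bigr]\,dy.
\end{align*}
```

So if the conditional bias of the empirical CDF is $\le 0$ at every $y$, the conditional bias of the sample mean is $\ge 0$, and the other way around.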
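Tabular model of MABs
$X^*_\infty := \begin{pmatrix} X^*_{1,1} & X^*_{1,2} & \cdots & X^*_{1,K} \\ X^*_{2,1} & X^*_{2,2} & \cdots & X^*_{2,K} \\ \vdots & \vdots & & \vdots \end{pmatrix} \in \mathbb{R}^{\mathbb{N} \times K}$ ("hypothetical table")
Column $k$ consists of i.i.d. draws from the distribution of arm $k$, which has mean $\mu_k$.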
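Tabular model of MABs
[Figure: the hypothetical table $X^*_\infty$ with a time axis. At $t = 1$ one entry of row 1 is revealed as the reward $Y_1$; at $t = 2$ one entry of row 2 is revealed as the reward $Y_2$; and so on.]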
Hypothetical dataset
$\mathcal{D}^*_\infty = X^*_\infty \cup \{W_t\}_{t=1}^\infty$
($X^*_\infty$: hypothetical table; $\{W_t\}_{t=1}^\infty$: random seeds)
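Hypothetical dataset
Given $\mathcal{D}^*_\infty = X^*_\infty \cup \{W_t\}_{t=1}^\infty$, the event $C$, the stopping time $\mathcal{U}$, and $N_k(t)$ (the number of draws from arm $k$ up to time $t$) for each $t$ and $k$ can be expressed as functions of $\mathcal{D}^*_\infty$.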
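
One way to read this slide: once the table and the seeds are fixed, running the bandit is a deterministic replay. The sketch below is a hypothetical illustration, not the authors' construction. It assumes an epsilon-greedy rule, seeds in [0, 1), and the convention that `table[t][k]` is the reward arm `k` would return at time `t + 1`; `replay` and `explore_prob` are names made up for this example.

```python
def replay(table, seeds, explore_prob=0.1):
    """Replay an epsilon-greedy bandit from a fixed table and fixed seeds.

    table[t][k]: reward arm k would return at time t+1 (assumed convention).
    seeds[t]: a number in [0, 1) that drives any randomness at time t+1.
    Returns the pull counts N_k after the last row of the table.
    """
    num_arms = len(table[0])
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    for t, (row, seed) in enumerate(zip(table, seeds)):
        if t < num_arms:
            arm = t  # pull every arm once so each sample mean is defined
        elif seed < explore_prob:
            # explore: the seed alone decides which arm is pulled
            arm = min(int(seed / explore_prob * num_arms), num_arms - 1)
        else:
            # exploit: largest sample mean, ties to the lower index
            arm = max(range(num_arms), key=lambda k: (sums[k] / counts[k], -k))
        counts[arm] += 1
        sums[arm] += row[arm]
    return counts

table = [[0.0, 1.0], [0.5, 0.2], [0.9, 0.1], [0.3, 0.8]]
seeds = [0.9, 0.9, 0.9, 0.9]
print(replay(table, seeds))  # [1, 3]

# Raising a single entry in arm 0's column changes the whole trajectory.
raised = [[1.0, 1.0]] + table[1:]
print(replay(raised, seeds))  # [3, 1]
```

The second call shows why it is useful to view $N_k(t)$ as a function of the table: changing one entry while holding everything else fixed changes how often each arm is sampled.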
Monotone effect of a sample
Theorem 1. Suppose arm $k$ has a finite mean. If $\mathbb{1}(C) / N_k(\mathcal{U})$ is an increasing function of each $X^*_{i,k}$ while keeping all other entries in the hypothetical dataset $\mathcal{D}^*_\infty$ fixed, then we have
$\mathbb{E}[\hat{F}_{k,\mathcal{U}}(y) - F_k(y) \mid C] \le 0$ (negative conditional bias of the empirical CDF)
$\mathbb{E}[\hat{\mu}_k(\mathcal{U}) - \mu_k \mid C] \ge 0$ (positive conditional bias of the sample mean)
Monotone effect of a sample
Theorem 1. Suppose arm $k$ has a finite mean. If $\mathbb{1}(C) / N_k(\mathcal{U})$ is a decreasing function of each $X^*_{i,k}$ while keeping all other entries in the hypothetical dataset $\mathcal{D}^*_\infty$ fixed, then we have
$\mathbb{E}[\hat{F}_{k,\mathcal{U}}(y) - F_k(y) \mid C] \ge 0$ (positive conditional bias of the empirical CDF)
$\mathbb{E}[\hat{\mu}_k(\mathcal{U}) - \mu_k \mid C] \le 0$ (negative conditional bias of the sample mean)
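
For a small problem the monotonicity condition can be checked by brute force. The sketch below assumes that the quantity in Theorem 1 is the ratio $\mathbb{1}(C) / N_k(\mathcal{U})$, reads "increasing" and "decreasing" as non-strict, and uses an illustrative single-arm screening rule with binary rewards: stop at the first time the running mean drops below the cutoff. The function names are made up for this example.

```python
from fractions import Fraction
from itertools import product

def stop_time(rewards, cutoff):
    """Return (n, active): samples used, and whether the item was never screened out."""
    total = 0
    for n, r in enumerate(rewards, start=1):
        total += r
        if Fraction(total, n) < cutoff:
            return n, False
    return len(rewards), True

def ratio(rewards, cutoff, event):
    """The quantity 1(C) / N(U) for C = 'active' or C = 'screened'."""
    n, active = stop_time(rewards, cutoff)
    in_event = active if event == "active" else not active
    return Fraction(int(in_event), n)

def direction(horizon, cutoff, event):
    """Check how 1(C)/N(U) moves when a single reward is raised from 0 to 1."""
    never_down, never_up = True, True
    for rewards in product([0, 1], repeat=horizon):
        for i, r in enumerate(rewards):
            if r == 1:
                continue
            raised = rewards[:i] + (1,) + rewards[i + 1:]
            before = ratio(rewards, cutoff, event)
            after = ratio(raised, cutoff, event)
            never_down = never_down and after >= before
            never_up = never_up and after <= before
    if never_down and never_up:
        return "constant"
    if never_down:
        return "increasing"
    if never_up:
        return "decreasing"
    return "neither"

print(direction(4, Fraction(1, 2), "active"))    # increasing
print(direction(4, Fraction(1, 2), "screened"))  # decreasing
```

Under these assumptions, the "active" event falls in the increasing case (positive conditional bias of the sample mean) and the "screened out" event falls in the decreasing case (negative conditional bias).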
E.g.: Best arm identification
• $K$ prototype items with means $\mu_1, \mu_2, \dots, \mu_K$.
Want to figure out which one has the largest revenue.
E.g.: Best arm identification
lil' UCB algorithm
$A_t = \arg\max_k \{\hat{\mu}_k(t) + u(N_k(t))\}$ (upper confidence bound)
[Figure: sampling timeline with rewards $Y_1$ at $t = 1$ and $Y_2$ at $t = 2$.]
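
The selection rule on this slide is an index rule: add an exploration bonus $u(N_k(t))$ to each sample mean and pull the arm with the largest sum. The sketch below shows only that structure. The bonus `placeholder_bonus` is a simple stand-in chosen for illustration and is not the lil' UCB bonus, whose exact form the slide does not give; `select_arm` is likewise a made-up name.

```python
import math

def placeholder_bonus(num_pulls, delta=0.05):
    """A simple exploration bonus that shrinks like 1/sqrt(n).

    This is a stand-in for u(n); it is NOT the lil' UCB bonus.
    """
    return math.sqrt(2.0 * math.log(1.0 / delta) / num_pulls)

def select_arm(means, counts, bonus=placeholder_bonus):
    """Return argmax_k of means[k] + bonus(counts[k]); ties go to the lower index."""
    scores = [m + bonus(n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=lambda k: (scores[k], -k))

print(select_arm([0.5, 0.4], [10, 1]))   # 1: the rarely pulled arm gets a large bonus
print(select_arm([0.5, 0.4], [10, 10]))  # 0: equal bonuses, so the larger mean wins
```

Because the bonus decreases in the number of pulls, an arm with a lower sample mean can still be selected when it has been sampled much less often.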