Multi-Player Bandits Revisited: Decentralized Multi-Player Multi-Armed Bandits
Lilian Besson, PhD student, Team SCEE, IETR, CentraleSupélec, Rennes & Team SequeL, CRIStAL, Inria, Lille
Joint work with Émilie Kaufmann
ALT Conference – 08-04-2018
1. Introduction and motivation
1.a. Objective

Motivation
We control some communicating devices; they want to use a wireless access point.

Goal
Insert them in a crowded wireless network.
Maintain a good Quality of Service.
With no centralized control, as it costs network overhead.

How?
With a protocol slotted in both time and frequency.
Devices can choose a different radio channel at each time
→ learn the best one with a sequential algorithm!
2. Our model: 3 different feedback levels
2.a. Our communication model

Our communication model
K radio channels (e.g., 10).
Discrete and synchronized time t ≥ 1.
Dynamic device = dynamic radio reconfiguration:
- it decides at each time the channel it uses to send each packet,
- it can implement a simple decision algorithm.
2. Our model: 3 different feedback levels
2.b. With or without sensing

Our model ("easy" case)
M ≤ K devices always communicate and try to access the network, independently, without centralized supervision.
Background traffic is i.i.d.

Two variants: with or without sensing
1. With sensing: the device first senses for presence of Primary Users that have strict priority (background traffic), then uses the Ack to detect collisions.
2. Without sensing: same background traffic, but the device cannot sense, so only the Ack is used.
2. Our model: 3 different feedback levels
2.c. Background traffic, and rewards

i.i.d. background traffic
K channels, modeled as Bernoulli (0/1) distributions of mean µ_k = background traffic from Primary Users, bothering the dynamic devices:
∀k, Y_{k,t} ~ Bern(µ_k) ∈ {0, 1}.

Rewards (with sensing information)
M devices, each uses channel A^j(t) ∈ {1, ..., K} at time t.
Collision for device j: C^j(t) := 1(alone on arm A^j(t)).
Reward: r^j(t) := Y_{A^j(t),t} × 1(C^j(t)) = 1(uplink & Ack)
→ r^j(t) is a combined binary reward, but not from two Bernoulli!
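As an illustration, here is a minimal simulation sketch of this reward model; the variable names and the uniform-random channel choices are our own, purely for illustration (real devices would use a learning algorithm instead):

```python
import numpy as np

rng = np.random.default_rng(42)

K, M, T = 10, 4, 1000          # channels, devices, horizon (illustrative values)
mu = rng.uniform(0.1, 0.9, K)  # unknown means of the K Bernoulli channels

rewards = np.zeros((M, T))
for t in range(T):
    Y = rng.random(K) < mu                # Y_{k,t} ~ Bern(mu_k): channel free of background traffic
    A = rng.integers(0, K, size=M)        # A^j(t): here each device picks a channel uniformly at random
    counts = np.bincount(A, minlength=K)  # how many devices chose each channel
    for j in range(M):
        alone = counts[A[j]] == 1         # C^j(t) = 1(alone on arm A^j(t))
        rewards[j, t] = Y[A[j]] * alone   # r^j(t) = Y_{A^j(t),t} * 1(C^j(t))
```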
2. Our model: 3 different feedback levels
2.d. Different feedback levels

3 feedback levels, all with the same reward r^j(t) := Y_{A^j(t),t} × 1(C^j(t)):
1. "Full feedback": observe both Y_{A^j(t),t} and C^j(t) separately,
→ not realistic enough, we don't focus on it.
2. "Sensing": first observe Y_{A^j(t),t}, then C^j(t) only if Y_{A^j(t),t} ≠ 0,
→ models licensed protocols (e.g., ZigBee), our main focus.
3. "No sensing": observe only the combined Y_{A^j(t),t} × 1(C^j(t)),
→ unlicensed protocols (e.g., LoRaWAN), harder to analyze!

But all consider the same instantaneous reward r^j(t).
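A rough sketch (hypothetical helper, not from the talk) of what a device gets to observe under each feedback level:

```python
def observe(Y_kt, alone, feedback="sensing"):
    """Return (reward, observed Y, observed C) for device j after playing one arm.

    Y_kt  : availability Y_{A^j(t),t} in {0, 1}
    alone : C^j(t) = 1 if no other device chose the same channel
    """
    reward = Y_kt * alone                        # r^j(t), identical in all three models
    if feedback == "full":                       # observe Y and C separately
        return reward, Y_kt, alone
    if feedback == "sensing":                    # observe Y; observe C only if Y != 0
        return reward, Y_kt, (alone if Y_kt else None)
    return reward, None, None                    # "no sensing": only the combined reward
```

In all three cases the first returned value, the reward r^j(t), is the same; only the side information differs.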
2. Our model: 3 different feedback levels
2.e. Goal

Goal
Minimize the packet loss ratio (= maximize the number of received Ack) in a finite-space discrete-time Decision Making Problem.

Solution?
Multi-Armed Bandit algorithms, decentralized and used independently by each dynamic device.
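As a generic illustration of such a decision rule, below is a minimal single-player UCB1 index policy that each device could run on its own observations; this is a standard textbook policy, not the specific algorithm analyzed in this work:

```python
import math

class UCB1:
    """Generic UCB1 index policy, run independently by one device."""

    def __init__(self, K):
        self.K = K
        self.counts = [0] * K      # number of selections of each arm
        self.means = [0.0] * K     # empirical mean of the observed rewards of each arm

    def choose(self, t):
        for k in range(self.K):    # play each arm once first
            if self.counts[k] == 0:
                return k
        ucb = [self.means[k] + math.sqrt(2 * math.log(t) / self.counts[k])
               for k in range(self.K)]
        return max(range(self.K), key=lambda k: ucb[k])

    def update(self, k, reward):
        self.counts[k] += 1
        self.means[k] += (reward - self.means[k]) / self.counts[k]
```

Each device j would call choose(t) to pick its channel A^j(t), transmit, then call update with its observed reward r^j(t).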
2. Our model: 3 different feedback levels
2.f. Centralized regret

A measure of success
Not the network throughput or collision probability: we study the centralized (expected) regret

R_T(µ, M, ρ) := (∑_{k=1}^{M} µ*_k) T − E_µ[ ∑_{t=1}^{T} ∑_{j=1}^{M} r^j(t) ].

Notation: µ*_k is the mean of the k-th best arm (k-th largest in µ): µ*_1 := max µ, µ*_2 := max (µ \ {µ*_1}), etc.

Two directions of analysis
→ How good can a decentralized algorithm be in this setting? Lower Bound on the regret, for any algorithm!
→ How good is my decentralized algorithm in this setting? Upper Bound on the regret, for one algorithm!

Ref: [Lai & Robbins, 1985], [Liu & Zhao, 2009], [Anandkumar et al., 2010], etc.
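A small sketch of how this centralized regret could be estimated from a simulation; it assumes the arrays `mu` and `rewards` from the earlier simulation sketch, and the helper name is our own:

```python
import numpy as np

def centralized_regret(mu, rewards):
    """Empirical centralized regret R_t for t = 1..T.

    mu      : array of shape (K,), true means of the channels
    rewards : array of shape (M, T), r^j(t) for each device j and time t
    """
    M, T = rewards.shape
    best_sum = np.sort(mu)[-M:].sum()          # sum of the M largest means, sum_k mu*_k
    cum_reward = rewards.sum(axis=0).cumsum()  # reward summed over devices, cumulated over time
    t = np.arange(1, T + 1)
    return best_sum * t - cum_reward
```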
3. Lower bound

Lower bound
1. Decomposition of the regret in 3 terms,
2. asymptotic lower bound on one term,
3. and for the regret.
3. Lower bound
3.a. Lower bound on the regret

Decomposition of the regret
For any algorithm, decentralized or not, we have

R_T(µ, M, ρ) = ∑_{k ∈ M-worst} (µ*_M − µ_k) E_µ[T_k(T)]
             + ∑_{k ∈ M-best} (µ_k − µ*_M) (T − E_µ[T_k(T)])
             + ∑_{k=1}^{K} µ_k E_µ[C_k(T)].

Notations for an arm k ∈ {1, ..., K}:
- T^j_k(T) := ∑_{t=1}^{T} 1(A^j(t) = k), counts selections of arm k by player j ∈ {1, ..., M},
- T_k(T) := ∑_{j=1}^{M} T^j_k(T), counts selections by all M players,
- C_k(T) := ∑_{t=1}^{T} 1(∃ j_1 ≠ j_2, A^{j_1}(t) = k = A^{j_2}(t)), counts collisions.

Small regret can be attained if…
1. devices can quickly identify the bad arms M-worst, and not play them too much (number of sub-optimal selections),
2. devices can quickly identify the best arms, and most surely play them (number of optimal non-selections),
3. devices can use orthogonal channels (number of collisions).
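An empirical counterpart of this decomposition, replacing the expectations by realized counts from a simulation, could be computed as follows (helper and variable names are our own):

```python
import numpy as np

def regret_decomposition(mu, M, T, selections, collisions):
    """Empirical counterpart of the three terms of the regret decomposition.

    mu         : array (K,), true means of the arms
    M, T       : number of devices, horizon
    selections : array (K,), T_k(T) = selections of each arm by all players
    collisions : array (K,), C_k(T) = collision counts on each arm
    """
    order = np.argsort(mu)[::-1]           # arms sorted by decreasing mean
    best, worst = order[:M], order[M:]     # M-best and M-worst arms
    mu_star_M = mu[order[M - 1]]           # mu*_M, the M-th largest mean
    term1 = ((mu_star_M - mu[worst]) * selections[worst]).sum()      # sub-optimal selections
    term2 = ((mu[best] - mu_star_M) * (T - selections[best])).sum()  # optimal non-selections
    term3 = (mu * collisions).sum()                                  # collisions
    return term1, term2, term3
```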