Multi-Player Bandits Revisited
Decentralized Multi-Player Multi-Armed Bandits

Lilian Besson, PhD Student
Team SCEE, IETR, CentraleSupélec, Rennes & Team SequeL, CRIStAL, Inria, Lille
Advised by Christophe Moy and Émilie Kaufmann
SequeL Seminar - 22 December 2017
1. Introduction and motivation
1.a. Objective

Motivation
We control some communicating devices; they want to access a single base station.
- Insert them in a crowded wireless network.
- With a protocol slotted in both time and frequency.

Goal
Maintain a good Quality of Service, with no centralized control, as it costs network overhead.

How? Devices can choose a different radio channel at each time
↪ learn the best one with a sequential algorithm!
1.b. Outline and reference

This talk is based on our latest article:
"Multi-Player Bandits Models Revisited", Besson & Kaufmann, arXiv:1711.02317.

Outline
1 Introduction
2 Our model: 3 different feedback levels
3 Decomposition and lower bound on regret
4 Quick reminder on single-player MAB algorithms
5 Two new multi-player decentralized algorithms
6 Upper bounds on regret for MCTopM
7 Experimental results
8 A heuristic (Selfish), and disappointing results
9 Conclusion
2. Our model: 3 different feedback levels
2.a. Our model

Our model (known):
- K radio channels (e.g., K = 10).
- Discrete and synchronized time t ≥ 1.
- Dynamic device = dynamic radio reconfiguration: at each time step, it decides which channel to use to send each packet. It can implement a simple decision algorithm.

Figure 1: Protocol in time and frequency, with an Acknowledgement.
2.b. With or without sensing

Our model: M ≤ K devices always communicate and try to access the network, independently, without centralized supervision.
"Easy" case: the background traffic is i.i.d.

Two variants: with or without sensing
1 With sensing: the device first senses for the presence of Primary Users (background traffic), then uses the Ack to detect collisions. This models the "classical" Opportunistic Spectrum Access problem. Not exactly suited for the Internet of Things, but it can model ZigBee, and it can be analyzed mathematically...
2 Without sensing: same background traffic, but the device cannot sense, so only the Ack is used. More suited for "IoT" networks like LoRa or SigFox. (Harder to analyze mathematically.)
2.c. Background traffic, and rewards

i.i.d. background traffic
- K channels, modeled as Bernoulli (0/1) distributions of mean µ_k = background traffic from Primary Users, bothering the dynamic devices: ∀k, Y_{k,t} ∼ Bern(µ_k) ∈ {0, 1}.
- M devices, each uses channel A^j(t) ∈ {1, ..., K} at time t.

Rewards
r^j(t) := Y_{A^j(t),t} × 1(C^j(t)) = 1(uplink & Ack),
with sensing information Y_{k,t}, and collision indicator for device j: C^j(t) = 1(j alone on arm A^j(t)).
↪ joint binary reward, but not from two Bernoullis!
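The reward model above can be sketched in a few lines. This is a minimal simulation under hypothetical parameters (K = 4 channels with made-up means µ, M = 2 devices); it is not code from the paper, only an illustration of r^j(t) = Y_{A^j(t),t} × 1(C^j(t)).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical small instance (not from the slides): K = 4 channels, M = 2 devices.
mu = np.array([0.1, 0.5, 0.7, 0.9])      # mean availability mu_k of each channel
K, M = len(mu), 2

def one_round(arms):
    """One time step t: arms[j] is the channel A^j(t) chosen by device j.
    Returns the rewards r^j(t) = Y_{A^j(t),t} * 1(j alone on its arm)."""
    Y = (rng.random(K) < mu).astype(int)     # Y_{k,t} ~ Bern(mu_k)
    counts = np.bincount(arms, minlength=K)  # number of devices on each channel
    alone = (counts[arms] == 1).astype(int)  # 1(C^j(t)): no collision for device j
    return Y[arms] * alone                   # joint binary reward

print(one_round(np.array([2, 2])))  # collision: both rewards are 0
print(one_round(np.array([2, 3])))  # orthogonal: each device observes its own channel
```

Note that a collision zeroes the reward even when the channel is free: this is why the reward is binary but not the product of two independent Bernoulli observations.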
2.d. Different feedback levels

3 feedback levels for r^j(t) := Y_{A^j(t),t} × 1(C^j(t)):
1 "Full feedback": observe both Y_{A^j(t),t} and C^j(t) separately,
↪ not realistic enough, we don't focus on it.
2 "Sensing": first observe Y_{A^j(t),t}, then C^j(t) only if Y_{A^j(t),t} ≠ 0,
↪ models licensed protocols (e.g., ZigBee), our main focus.
3 "No sensing": observe only the joint Y_{A^j(t),t} × 1(C^j(t)),
↪ unlicensed protocols (e.g., LoRaWAN), harder to analyze!

But all three consider the same instantaneous reward r^j(t).
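The three feedback levels can be made concrete by spelling out what a device actually observes in each case. The dict encoding below is a hypothetical choice of ours, only meant to contrast the information available to the learner; the reward itself is identical in all three cases.

```python
def observe(Y_k, alone, feedback):
    """What device j sees at time t under each feedback level.
    Y_k = Y_{A^j(t),t} (channel availability), alone = 1(no collision).
    Returns (reward, observation); the dict encoding is illustrative only."""
    reward = Y_k * alone                 # r^j(t): the same in the three cases
    if feedback == "full":
        # Observe both quantities separately (not realistic).
        return reward, {"Y": Y_k, "C": alone}
    if feedback == "sensing":
        # First sense Y_{A^j(t),t}; the collision bit is seen only if Y != 0.
        obs = {"Y": Y_k}
        if Y_k != 0:
            obs["C"] = alone
        return reward, obs
    # "no sensing": only the joint product (i.e., the Ack) is observed.
    return reward, {"r": reward}

# A collision (alone = 0) on a free channel (Y_k = 1):
print(observe(1, 0, "sensing"))      # reward 0, but the collision is detected
print(observe(0, 1, "no sensing"))   # reward 0, and the cause is not identifiable
```

This is exactly why the "no sensing" case is harder: a zero reward may come from background traffic or from a collision, and the device cannot tell which.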
2.e. Goal

Problem
Goal: minimize the packet loss ratio (= maximize the number of received Acks) in a finite-space discrete-time Decision Making Problem.
Max transmission rate ≡ max cumulated rewards: max_A Σ_{t=1}^T Σ_{j=1}^M r^j_A(t).

Solution? Multi-Armed Bandit algorithms, decentralized and used independently by each dynamic device.
Decentralized reinforcement learning optimization!

Each player wants to maximize its cumulated reward, with no central control and no exchange of information.
Only possible if: each player converges to one of the M best arms, orthogonally (without collisions).
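The objective above can be illustrated numerically: the best achievable expected cumulated reward is obtained by playing the M best arms orthogonally, and the gap to that oracle is the (centralized) regret. A minimal sketch, with hypothetical means µ and a deliberately suboptimal static orthogonal assignment for contrast:

```python
import numpy as np

rng = np.random.default_rng(1)
mu = np.array([0.1, 0.5, 0.7, 0.9])   # hypothetical channel means
K, M, T = len(mu), 2, 10_000

# Oracle benchmark: the M best arms, used orthogonally, in expectation.
oracle = T * np.sort(mu)[-M:].sum()

# A naive fixed (but collision-free) assignment on the two worst arms.
arms = np.array([0, 1])
total = 0
for _ in range(T):
    Y = rng.random(K) < mu            # Y_{k,t} ~ Bern(mu_k)
    total += int(Y[arms].sum())       # orthogonal: each device is alone on its arm

print(f"cumulated reward = {total}, oracle = {oracle:.0f}, regret ~ {oracle - total:.0f}")
```

A good decentralized algorithm should make this gap grow only logarithmically in T, which requires the players to converge, without communicating, to the M best arms with no collisions.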