Multi-Player Bandits Revisited
Decentralized Multi-Player Multi-Arm Bandits
Lilian Besson, PhD Student
Joint work with Émilie Kaufmann
Team SCEE, IETR, CentraleSupélec, Rennes & Team SequeL, CRIStAL, Inria, Lille
CMAP Seminar – 31st October 2018

1. Introduction and motivation
1.a. Objective

Motivation
We control some communicating devices that want to use a wireless access point.
We insert them in a crowded wireless network, with a protocol slotted in both time and frequency.

Goal
Maintain a good Quality of Service, with no centralized control, as it costs network overhead.

How?
Devices can choose a different radio channel at each time step
→ learn the best one with a sequential algorithm!

1.b. Outline and references

Outline and reference
1. Introduction
2. Our model: 3 different feedback levels
3. Regret of the system, and our lower bound on regret
4. Quick reminder on single-player MAB algorithms
5. New multi-player non-coordinated decentralized algorithms
6. Our upper bound on regret for MCTopM
7. Experimental results
8. Review of two more recent articles
9. Conclusion

Based on “Multi-Player Bandits Revisited”, by Lilian Besson & Émilie Kaufmann, arXiv:1711.02317, presented at ALT 2018 (Lanzarote, Spain) in April.

2. Our model: 3 different feedback levels

Our model
1. Our communication model
2. With or without sensing
3. Background traffic, and rewards
4. Different feedback levels
5. Goal

2.a. Our communication model

Our communication model
K radio channels (e.g., 10).
Discrete and synchronized time t ≥ 1.
Dynamic device = dynamic radio reconfiguration:
It decides, at each time step, which channel to use to send each packet.
It can implement a simple decision algorithm.

2.b. With or without sensing

Our model
“Easy” case:
M ≤ K devices always communicate and try to access the network, independently and without centralized supervision.
Background traffic is i.i.d.

Two variants: with or without sensing
1. With sensing: the device first senses for the presence of Primary Users, which have strict priority (background traffic), then uses the Ack to detect collisions.
2. Without sensing: same background traffic, but the device cannot sense, so only the Ack is used.
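
To make the difference between the two variants concrete, here is a minimal sketch of what one device could observe in one time slot. The function name `observe`, its arguments, and the returned dictionary keys are hypothetical and chosen only for illustration. It also assumes that, with sensing, a device does not transmit on a busy channel and therefore learns nothing about collisions in that slot.

```python
from typing import Dict, Optional


def observe(sensing: bool, channel_free: bool, collided: bool) -> Dict[str, Optional[bool]]:
    """Return what one device learns in one time slot (illustrative helper).

    None means "not observed by the device".
    """
    # The packet is acknowledged only if the channel has no background
    # traffic and no other device picked the same channel.
    ack = channel_free and not collided

    if sensing:
        # Sensing reveals the background traffic directly. The collision is
        # revealed by the Ack only when the channel was free (assumption:
        # the device stays silent on a busy channel).
        collision_seen = collided if channel_free else None
        return {"channel_free": channel_free, "collision": collision_seen, "ack": ack}

    # Without sensing, only the Ack is available: a missing Ack does not say
    # whether the cause was background traffic or a collision.
    return {"channel_free": None, "collision": None, "ack": ack}
```

Under these assumptions, a device without sensing gets the same observation for a busy channel as for a collision on a free channel, which is what makes that variant harder.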

2.c. Background traffic, and rewards

Background traffic, and rewards
i.i.d. background traffic:
K channels, modeled as Bernoulli (0/1) distributions of mean $\mu_k$ = background traffic from Primary Users, bothering the dynamic devices.
M devices, each using channel $A^j(t) \in \{1, \dots, K\}$ at time t.

Rewards
$r^j(t) := Y_{A^j(t),t} \times \mathbb{1}(\text{device } j \text{ alone on arm } A^j(t)) = \mathbb{1}(\text{uplink \& Ack})$
with sensing information $Y_{k,t} \overset{\text{iid}}{\sim} \mathrm{Bern}(\mu_k) \in \{0, 1\}$ for every channel k,
collision for device j: determined by $\mathbb{1}(\text{alone on arm } A^j(t))$, which is 0 when another device picks the same arm.
→ $r^j(t)$ is a combined binary reward, but not from two Bernoulli distributions!
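
The reward model on this slide can be sketched in a few lines of Python: a device earns 1 only if its channel is free of background traffic and it is alone on that channel. The function names `draw_availability` and `rewards`, the example means, and the 0-based channel indices are assumptions made for this illustration, not part of the original slides.

```python
import random
from collections import Counter
from typing import List


def draw_availability(means: List[float], rng: random.Random) -> List[int]:
    """Draw one i.i.d. Bernoulli sample per channel (1 = channel usable)."""
    return [1 if rng.random() < mu else 0 for mu in means]


def rewards(choices: List[int], availability: List[int]) -> List[int]:
    """Reward of each device for one time slot.

    choices[j] is the channel picked by device j (0-based here, whereas the
    slides index channels from 1 to K).
    """
    counts = Counter(choices)
    # Product of two binary terms: the channel sample, and the indicator
    # that the device is alone on its channel (no collision).
    return [availability[k] * (1 if counts[k] == 1 else 0) for k in choices]


if __name__ == "__main__":
    rng = random.Random(0)           # fixed seed, only for reproducibility
    means = [0.1, 0.5, 0.9]          # hypothetical channel means, K = 3
    choices = [2, 2, 1]              # M = 3 devices, two of them collide
    print(rewards(choices, draw_availability(means, rng)))
```

In this sketch the two colliding devices always get reward 0, whatever the channel sample is. This shows why the second factor is not an independent Bernoulli draw: it depends on the joint choices of all devices.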