Dynamic spectrum access under partial observations: A restless bandit approach
Nima Akbarzadeh, Aditya Mahajan
McGill University, Electrical and Computer Engineering Department
June 3, 2019
Restless Bandits: Example
Channel Scheduling Problem
At each time, which channels and which resources should be used?
Features: time-varying channels; partially observable environment; resource allocation.
Examples: cognitive radio networks; resource-constrained jamming.
Model (Channel)
n finite-state Markov channels, N = {1, ..., n}.
The state space of channel i is a finite ordered set S^i, i ∈ N.
Markov state process {S^i_t}_{t≥0} with transition probability matrix P^i.
Resource (rate, power, bandwidth, etc.): R = {∅, r_1, ..., r_k}.
Payoff: ρ^i(s, r), s ∈ S^i, r ∈ R, with ρ^i(s, r) = 0 if r = ∅.
Example: S^i = {s_bad, s_good}, R = {∅, r_low, r_high},
  ρ^i(s, r) = r_low   if r = r_low,
  ρ^i(s, r) = r_high  if r = r_high and s = s_good,
  ρ^i(s, r) = 0       if r = r_high and s = s_bad.
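As a concrete illustration of the channel model, here is a minimal Python sketch of the two-state example above; it is not the authors' code, and the transition probabilities and rate values are assumptions for illustration only.

```python
# Minimal sketch of the two-state example channel:
# S^i = {s_bad, s_good}, R = {empty, r_low, r_high}. All numbers are assumptions.
import numpy as np

STATES = ["bad", "good"]                 # ordered state space S^i
P = np.array([[0.7, 0.3],                # assumed transition probability matrix P^i
              [0.2, 0.8]])
R_LOW, R_HIGH = 1.0, 4.0                 # assumed values of the two rates

def payoff(s, r):
    """rho^i(s, r): payoff of using resource r while the channel is in state s."""
    if r is None:                        # r = empty resource -> zero payoff
        return 0.0
    if r == "low":                       # the low rate succeeds in any state
        return R_LOW
    # the high rate pays off only when the channel is in the good state
    return R_HIGH if s == "good" else 0.0

print(payoff("good", "high"), payoff("bad", "high"), payoff("bad", "low"))  # 4.0 0.0 1.0
```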
Model (Transmitter)
Two decisions to make at each time t:
  Select L channels, indexed by the set L_t; A^i_t = 1 if i ∈ L_t and 0 otherwise.
  Select resources R^i_t, with R^i_t = ∅ if i ∉ L_t.
Observation process:
  Y^i_t = S^i_t if A^i_t = 1, and Y^i_t = E (erasure) if A^i_t = 0.
Strategies:
  A_t = f_t(Y_{0:t−1}, R_{0:t−1}, A_{0:t−1}),
  R_t = g_t(Y_{0:t−1}, R_{0:t−1}, A_{0:t−1}, A_t).
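The observation model can be sampled in one line per channel; the sketch below is a hypothetical helper (not from the paper) in which `E = None` stands for the erasure symbol.

```python
# One step of the observation model: the transmitter sees the channel state only
# when the channel is selected (A^i_t = 1); otherwise it receives the erasure E.
import numpy as np

E = None                                  # erasure symbol for unselected channels
rng = np.random.default_rng(0)

def observe_and_step(state_idx, active, P, states):
    """Return (Y^i_t, next state index) for a single channel."""
    y = states[state_idx] if active else E
    next_idx = rng.choice(len(states), p=P[state_idx])   # S^i_{t+1} ~ P^i(S^i_t, .)
    return y, next_idx
```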
Model (Optimization Problem)
Problem: Given a discount factor β ∈ (0, 1), a set of resources R, and the state space, transition probability, and reward function (S^i, P^i, ρ^i)_{i∈N} of all channels, choose a communication strategy (f, g) to maximize
  J(f, g) = E[ Σ_{t=0}^∞ β^t Σ_{i∈N} ρ^i(S^i_t, R^i_t) A^i_t ].
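For any fixed strategy, this objective can be estimated by Monte-Carlo simulation. The sketch below is only illustrative: `channels` and the `strategy(t)` callable (returning the selected channels and their resources at each time) are placeholders, not part of the paper.

```python
# Monte-Carlo estimate of the discounted objective J(f, g) for a fixed strategy.
# `channels` is a list of (P, states, payoff) tuples; `strategy(t)` is a placeholder
# returning (list of selected channel indices, dict of resources per channel).
import numpy as np

def estimate_J(strategy, channels, beta=0.9, horizon=200, runs=200, seed=0):
    rng = np.random.default_rng(seed)
    total = 0.0
    for _ in range(runs):
        state = [0 for _ in channels]                  # assumed common initial state
        for t in range(horizon):
            selected, resources = strategy(t)
            for i in selected:                         # discounted payoff of selected channels
                P, states, payoff = channels[i]
                total += beta ** t * payoff(states[state[i]], resources[i])
            for i, (P, states, _) in enumerate(channels):
                state[i] = rng.choice(len(states), p=P[state[i]])   # all channels evolve (restless)
    return total / runs
```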
Literature Review and Approaches
Partially Observable Markov Decision Process (POMDP).
POMDP models suffer from the curse of dimensionality: the size of the state space is exponential in the number of channels.
Simplified modelling assumptions in prior work:
  Two-state Gilbert-Elliott channels
  Multi-state but identical channels
  Fully observable Markov Decision Process (MDP)
Our contributions
  Multi-state, non-identical channels
  Restless bandit approach
  Conversion of the POMDP into a countable-state MDP
  Finite-state approximation of the MDP
POMDP (Belief State)
Belief state: Π^i_t(s) = P(S^i_t = s | Y^i_{0:t−1}, R^i_{0:t−1}, A^i_{0:t−1}).
Proposition: Let Π_t denote (Π^1_t, ..., Π^n_t). Then, without loss of optimality,
  A_t = f_t(Π_t),
  R_t = g_t(Π_t, A_t).
Recall: f is the channel selection policy and g is the resource selection policy.
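To see how the belief of an unselected channel behaves, the sketch below (with an assumed transition matrix) propagates a point-mass belief through P^i for a few passive steps; it drifts toward the channel's stationary distribution.

```python
# Belief drift of a channel that is never selected: Pi_{t+1} = Pi_t * P^i.
# The transition matrix is an assumed example, not taken from the paper.
import numpy as np

P = np.array([[0.7, 0.3],
              [0.2, 0.8]])
pi = np.array([0.0, 1.0])          # belief right after observing the good state
for t in range(5):
    print(t, np.round(pi, 3))
    pi = pi @ P                    # passive update
# the belief approaches the stationary distribution (0.4, 0.6)
```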
Optimal Resource Allocation Strategy
No need for joint optimization of (f, g). Let
  ρ̄^i(π) := max_{r∈R} Σ_{s∈S^i} π(s) ρ^i(s, r),
  r^{i,*}(π) := arg max_{r∈R} Σ_{s∈S^i} π(s) ρ^i(s, r).
Proposition: Define g^{i,*} : Δ(S^i) × {0, 1} → R by
  g^{i,*}(π, 0) = ∅,  g^{i,*}(π, 1) = r^{i,*}(π).
For any channel selection policy, (g^*, g^*, ...) is an optimal resource allocation strategy.
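The resource rule above is a simple maximization of the expected payoff under the current belief; a sketch with an assumed payoff table:

```python
# Greedy resource selection: rho_bar(pi) = max_r sum_s pi(s) rho(s, r), together
# with the maximizer r*(pi). The payoff table RHO is an assumed example.
import numpy as np

RHO = np.array([[1.0, 1.0],        # r_low pays 1 in either state (bad, good)
                [0.0, 4.0]])       # r_high pays 4 only in the good state
RESOURCES = ["low", "high"]

def best_resource(pi):
    expected = RHO @ pi                            # expected payoff of each resource under pi
    k = int(np.argmax(expected))
    return RESOURCES[k], float(expected[k])        # (r*(pi), rho_bar(pi))

print(best_resource(np.array([0.5, 0.5])))         # ('high', 2.0): high rate worth the risk
print(best_resource(np.array([0.9, 0.1])))         # ('low', 1.0): play safe when likely bad
```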
Restless Bandit Model
(1) Each {Π^i_t}_{t≥0}, i ∈ N, is a bandit process.
(2) The transmitter can activate L of these processes.
(3) Belief state evolution:
  Π^i_{t+1} = δ_{S^i_t}    if process i is activated (A^i_t = 1),
  Π^i_{t+1} = Π^i_t · P^i  if process i is passive (A^i_t = 0).
(4) Expected reward:
  ρ̄^i_t = ρ̄^i(Π^i_t)  if process i is activated (A^i_t = 1),
  ρ̄^i_t = 0           if process i is passive (A^i_t = 0).
Dynamics at time t: ... → Π^i_t → (f) → A^i_t → (g^*) → R^i_t → Y^i_t → ρ^i_t → Π^i_{t+1} → ...
(Figure: belief states on the probability simplex Δ(S^i), with vertices (1, 0, 0), (0, 1, 0), (0, 0, 1).)
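A sketch of the per-channel belief dynamics stated above. Note that, starting from a point mass, passive updates only ever produce beliefs of the form δ_s · (P^i)^k, which is what makes the equivalent per-channel MDP countable.

```python
# Belief dynamics of one bandit process (channel), following the update above:
# activated -> point mass at the observed state; passive -> push through P^i.
import numpy as np

def belief_step(pi, P, active, observed_state=None):
    """pi: belief over S^i at time t; returns the belief at time t+1."""
    if active:
        delta = np.zeros_like(pi)
        delta[observed_state] = 1.0    # perfect observation of S^i_t
        return delta
    return pi @ P                      # passive channel: Pi_{t+1} = Pi_t * P^i
```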
Restless Bandit Solution
The main idea is to decompose the coupled n-channel optimization problem into n independent single-channel problems.
When Whittle indexability is satisfied, one may use the Whittle index policy: the L channels with the largest indices are selected.
The index policy performs close to optimal in many applications reported in the literature.
Goal: we provide an efficient algorithm to check indexability and to compute the Whittle index.
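Given per-channel Whittle indices, the scheduling rule itself is immediate: at each time, activate the L channels whose current indices are largest. The sketch below assumes a placeholder `whittle_index` list of callables, one per channel.

```python
# Whittle index policy sketch: activate the L channels with the largest indices.
# `whittle_index[i]` is a placeholder callable returning w^i(pi) for channel i.
import numpy as np

def select_channels(beliefs, whittle_index, L):
    """beliefs: one belief vector per channel; returns the L channels to activate."""
    idx = np.array([whittle_index[i](pi) for i, pi in enumerate(beliefs)])
    return list(np.argsort(-idx)[:L])          # top-L channels by Whittle index
```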
Problem Decomposition
Modified per-step reward: (ρ̄^i(π) − λ) a^i, where λ can be viewed as the cost of transmitting over channel i.
Problem: Given channel i ∈ N, the discount factor β ∈ (0, 1), the cost λ ∈ ℝ, and the belief-state space, transition probability, and reward function tuple (Δ(S^i), P^i, ρ^i), choose a policy f^i : Δ(S^i) → {0, 1} to maximize
  J^i_λ(f^i) := E[ Σ_{t=0}^∞ β^t (ρ̄^i(Π^i_t) − λ) A^i_t ].
Dynamic Programming (Belief State)
Theorem: Let V^i_λ : Δ(S^i) → ℝ be the unique fixed point of V^i_λ(π) = max_{a∈{0,1}} Q^i_λ(π, a), where
  Q^i_λ(π, 0) = β V^i_λ(π · P^i),
  Q^i_λ(π, 1) = ρ̄^i(π) − λ + β Σ_{s∈S^i} π(s) V^i_λ(δ_s).
Let f^i_λ(π) = 1 if Q^i_λ(π, 1) ≥ Q^i_λ(π, 0), and f^i_λ(π) = 0 otherwise. Then f^i_λ is optimal for the decoupled single-channel problem above.
Challenge: continuous state space!
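One way around the continuous state space, consistent with the finite-state approximation listed in the contributions, is to note that the reachable beliefs have the form δ_s · (P^i)^k and to truncate k at some K. The value-iteration sketch below follows this idea; the truncation level, parameters, and code organization are assumptions for illustration, not the authors' implementation.

```python
# Value iteration on a finite approximation of the single-channel problem:
# belief states are delta_s P^k for s in S^i and k = 0..K (truncation K is assumed).
import numpy as np

def solve_channel(P, rho, lam, beta=0.9, K=30, iters=2000, tol=1e-9):
    """P: |S|x|S| transition matrix; rho: |R|x|S| payoff table; lam: activation cost.
    Returns V and (Q0, Q1), indexed by (s, k) <-> belief delta_s P^k."""
    n = P.shape[0]
    beliefs = np.stack([[np.eye(n)[s] @ np.linalg.matrix_power(P, k)
                         for k in range(K + 1)] for s in range(n)])      # shape (n, K+1, n)
    rho_bar = np.max(np.einsum('rs,aks->akr', rho, beliefs), axis=-1)    # best expected payoff
    V = np.zeros((n, K + 1))
    for _ in range(iters):
        # passive: belief moves from delta_s P^k to delta_s P^{k+1} (capped at K)
        Q0 = beta * V[:, np.minimum(np.arange(K + 1) + 1, K)]
        # active: collect rho_bar - lam; the next belief is delta_{s'} with prob pi(s')
        Q1 = rho_bar - lam + beta * beliefs @ V[:, 0]
        V_new = np.maximum(Q0, Q1)
        if np.max(np.abs(V_new - V)) < tol:
            V = V_new
            break
        V = V_new
    return V, (Q0, Q1)
```

With such a solver in hand, the Whittle index of a belief state can be approximated by bisection over λ (finding the λ at which Q1 and Q0 cross at that state), and checking that the set of states where activation is optimal shrinks monotonically as λ grows gives a numerical indexability check.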