ADAPTIVE POWER MANAGEMENT OF ENERGY HARVESTING SENSOR NODES USING REINFORCEMENT LEARNING
A Comparison of Q-Learning and SARSA Algorithms
(Japanese title: "Comparative Evaluation of Reinforcement Learning Strategies for Energy-Harvesting Sensor Nodes with Adaptive Power Control")
SWoPP 2017
SHASWOT SHRESTHAMALI, MASAAKI KONDO, HIROSHI NAKAMURA
THE UNIVERSITY OF TOKYO
INTRODUCTION
• Use Reinforcement Learning (RL) for power management in Energy Harvesting Sensor Nodes (EHSNs)
  ▫ Adaptive control behavior
  ▫ Near-optimal performance
• Comparison between different RL algorithms: Q-Learning and SARSA
ENERGY HARVESTING SENSOR NODE
• CONCEPT: harvested energy feeds a power manager, which supplies the node components (MCU, memory, RF transceiver, mixed-signal circuits, sensor).
• CONSTRAINTS (a battery-model sketch follows below)
  ▫ The sensor node has to be operating at ALL times
  ▫ The battery cannot be completely depleted
  ▫ The battery cannot be overcharged (exceed 100%)
  ▫ The battery size is finite
  ▫ Charging/discharging rates are finite
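The constraints above can be captured by a simple battery model. The following is only a rough sketch with assumed variable names (not from the slides); keeping the level above zero at all times is the controller's objective rather than something this model enforces.

```python
def update_battery(level, harvested, consumed, capacity,
                   max_charge_rate, max_discharge_rate):
    """Advance the battery by one time slot while respecting the listed
    constraints: finite capacity and finite charge/discharge rates."""
    charge = min(harvested, max_charge_rate)        # finite charging rate
    discharge = min(consumed, max_discharge_rate)   # finite discharging rate
    new_level = level + charge - discharge
    return max(0.0, min(capacity, new_level))       # finite battery, never above 100%
```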
OBJECTIVE: NODE-LEVEL ENERGY NEUTRALITY
• We want to use ALL of the energy that is harvested.
• One way of achieving that is by ensuring node-level energy neutrality – the condition in which the amount of energy harvested equals the amount of energy consumed (formalized below).
• Autonomous, perpetual operation can then be achieved.
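As a rough formalization (not stated on the slide in this form), energy neutrality over a horizon of T time slots can be written as:

```latex
% Node-level energy neutrality: over a horizon of T time slots,
% the harvested energy balances the consumed energy.
\[
  \sum_{t=1}^{T} E_{\mathrm{harvested}}(t) \;=\; \sum_{t=1}^{T} E_{\mathrm{consumed}}(t)
\]
```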
CHALLENGES
• DIFFERENT ENVIRONMENTS
• MOVING SENSORS
• DIFFERENT SENSORS
(Image sources: https://sites.google.com/site/sarmavrudhula/home/research/energy-management-of-wireless-sensor-networks; "Environmental Sensor Networks" – P. I. Corke et al.; http://www.mdpi.com/sensors/sensors-12-02175/article_deploy/html/images/sensors-12-02175f5-1024.png)
SOLUTION
Preparing heuristic, user-defined contingency solutions for all possible scenarios is impractical.
We want a one-size-fits-all solution: sensor nodes that are capable of
• autonomously learning optimal strategies, and
• adapting once they have been deployed in the environment.
SOLUTION
➢ Use RL for adaptive control
➢ Use a solar energy harvesting sensor node as a case example
Q-Learning Results (ETNET 2017)
[Bar chart comparing three methods – Naïve, Kansal, and our method using RL – on two metrics: Efficiency (%) and Energy Wasted (%). The RL method achieves higher efficiency and lower waste.]
• Naïve: fix the duty cycle for the present day by predicting the total energy for the next day
• Kansal: the duty cycle is proportional to the battery level
• Efficiency = Actual Duty Cycle / Achievable Maximum Duty Cycle
• Energy Wasted (%) = Total Energy Wasted / Total Energy Harvested
• Energy Waste = Energy Harvested − Node Energy − Charging Energy
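A small helper, offered only as a sketch (the function and variable names are assumptions, not the authors' code), that computes the two reported metrics from the quantities defined above:

```python
def evaluation_metrics(actual_duty_cycle, achievable_max_duty_cycle,
                       energy_harvested, node_energy, charging_energy):
    """Return (efficiency %, energy wasted %) as defined on this slide."""
    efficiency = actual_duty_cycle / achievable_max_duty_cycle
    energy_waste = energy_harvested - node_energy - charging_energy
    wasted_fraction = energy_waste / energy_harvested
    return efficiency * 100.0, wasted_fraction * 100.0

# Hypothetical example values (not measured data):
eff, waste = evaluation_metrics(0.55, 0.60, 1000.0, 550.0, 400.0)
print(f"Efficiency: {eff:.1f}%  Energy wasted: {waste:.1f}%")
```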
Q-Learning (ETNET 2017)
❑ Demonstrated that RL approaches outperform traditional methods.
❑ Limitations
  • State explosion: 200 x 5 x 6 states; the Q-table becomes too large to train using a random policy.
  • Long training times: required 10 years' worth of training data.
  • The reward function did not reflect the true objective of energy neutrality.
REINFORCEMENT LEARNING IN A NUTSHELL
REINFORCEMENT LEARNING
• A type of machine learning based on experience rather than instruction.
• The goal is to map situations (states) into actions so as to accumulate the maximum possible total reward – the agent keeps asking, "What action should I take to receive as much reward as possible?"
• In our case the agent is the power manager:
  ▫ OBSERVATIONS: battery level, energy harvested, weather forecast
  ▫ ACTION: choose the duty cycle
  ▫ The environment returns a REWARD and a new state.
(A minimal interaction-loop sketch follows below.)
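The loop below is only an illustration of this agent–environment interaction; the Environment and PowerManagerAgent interfaces are assumptions, not the authors' implementation.

```python
class Environment:
    def observe(self):
        """Return the current state: (battery level, energy harvested, weather forecast)."""
        ...

    def step(self, duty_cycle):
        """Apply the chosen duty cycle and return (reward, next_state)."""
        ...

class PowerManagerAgent:
    def choose_duty_cycle(self, state):
        """Pick an action (a duty cycle) for the given state."""
        ...

    def learn(self, state, action, reward, next_state):
        """Update the agent's estimates from one experience tuple."""
        ...

def run(env, agent, num_steps):
    """Basic RL interaction loop: observe, act, receive reward and new state, learn."""
    state = env.observe()
    for _ in range(num_steps):
        action = agent.choose_duty_cycle(state)
        reward, next_state = env.step(action)
        agent.learn(state, action, reward, next_state)
        state = next_state
```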
REINFORCEMENT LEARNING
• IMPORTANT CONCEPTS
  ▫ Q-VALUE
  ▫ ELIGIBILITY TRACES
Q-VALUE
• To give a measure of the "goodness" of an action in a particular state, we assign each state-action pair a Q-value: Q(state, action).
• Q-values are learned from past (training) experiences.
• The higher the Q-value, the better the choice of that action for that state.
• Q(s, a) is the expected cumulative reward obtained by starting from state s and taking action a.
[Diagram: from state s_i, actions a_1, a_2, a_3 with values Q(s_i, a_1), Q(s_i, a_2), Q(s_i, a_3) yield rewards r_1, r_2, r_3 and lead to states s_j, s_k, s_l.]
(A tabular sketch follows below.)
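A minimal tabular representation of Q-values, assuming a dictionary keyed by (state, action) pairs; the state and action encodings here are purely illustrative.

```python
from collections import defaultdict

Q = defaultdict(float)   # Q[(state, action)] -> expected cumulative reward; unvisited pairs start at 0.0

def best_action(state, actions):
    """Greedy choice: the action with the highest Q-value in this state."""
    return max(actions, key=lambda a: Q[(state, a)])

# Hypothetical usage: a state could be (battery level, harvested energy, forecast),
# and actions could be a set of discrete duty cycles.
duty_cycles = (0.1, 0.3, 0.5, 0.7, 0.9)
state = (80, 12, "sunny")
print(best_action(state, duty_cycles))
```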
LEARNING Q-VALUES: HOW TO FIND Q(s_k, a_k)
• Start with arbitrary guesses for Q(s_k, a_k).
• Update Q(s_k, a_k) incrementally towards a target value (bootstrapping).
• General update rule (sketched in code below):
  NewEstimate ← OldEstimate + StepSize × [Target − OldEstimate]
  NewEstimate ← (1 − StepSize) × OldEstimate + StepSize × Target
  Q(s_k, a_k) ← (1 − α) Q(s_k, a_k) + α × Target
• What should the Target be?
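The general bootstrapped update, written out as a tiny helper (a sketch; the names are assumptions):

```python
def incremental_update(old_estimate, target, step_size):
    """Move the old estimate toward the target:
       new = old + step_size * (target - old)
           = (1 - step_size) * old + step_size * target
    """
    return old_estimate + step_size * (target - old_estimate)

alpha = 0.1          # step size (learning rate)
q_sa = 0.0           # arbitrary initial guess for Q(s_k, a_k)
q_sa = incremental_update(q_sa, target=1.0, step_size=alpha)   # -> 0.1
```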
SARSA VS Q-LEARNING
• The agent starts at state s_k and takes some action a_k according to its policy.
• It receives a reward r_k and is transported to a new state s_{k+1}.

Q-LEARNING
• The agent assumes the next action will be the action with the highest Q-value.
• The Q-value Q(s_k, a_k) is then updated:
  Q(s_k, a_k) ← (1 − α) Q(s_k, a_k) + α [r_k + γ max_a Q(s_{k+1}, a)]

SARSA
• The agent considers the next action a_{k+1} it will actually take.
• The Q-value Q(s_k, a_k) is then updated:
  Q^π(s_k, a_k) ← (1 − α) Q^π(s_k, a_k) + α [r_k + γ Q^π(s_{k+1}, a_{k+1})]

• An ε-greedy policy is used, i.e. random actions are taken with probability ε to allow exploration.
(Both updates are sketched in code below.)
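The two update rules in code form (a tabular sketch; Q is assumed to be a defaultdict keyed by (state, action) as in the earlier example):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Q-learning: bootstrap on the greedy (max-Q) next action."""
    target = r + gamma * max(Q[(s_next, b)] for b in actions)
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """SARSA: bootstrap on the next action the agent will actually take."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] = (1 - alpha) * Q[(s, a)] + alpha * target
```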
SARSA VS Q-LEARNING
SARSA:
  Q^π(s_k, a_k) ← (1 − α) Q^π(s_k, a_k) + α [r_k + γ Q^π(s_{k+1}, a_{k+1})]
Q-Learning:
  Q(s_k, a_k) ← (1 − α) Q(s_k, a_k) + α [r_k + γ max_a Q(s_{k+1}, a)]
Both follow the general form:
  NewEstimate ← (1 − StepSize) × OldEstimate + StepSize × Target
SARSA VS Q-LEARNING

Q-Learning
• Off-policy learning: the final learned policy is the same regardless of the training method.
• Assumes the best action will always be taken.
• Takes longer to converge.
• Difficult to integrate with linear function approximation.

SARSA
• On-policy learning: updates the policy it is using during training.
• The update is carried out by considering the next action actually to be taken.
• Faster convergence, but requires an initial policy.
• Easier to integrate with function approximation.

                        SARSA             Q-Learning
Choosing Next Action    ε-greedy policy   ε-greedy policy
Updating Q              ε-greedy policy   Greedy policy

(An ε-greedy selection sketch follows below.)
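Both algorithms choose actions with an ε-greedy policy; they differ only in the action used inside the update. A minimal selection helper (a sketch with assumed names):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon):
    """With probability epsilon take a random action (explore);
    otherwise take the highest-valued action (exploit)."""
    if random.random() < epsilon:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```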
ELIGIBILITY TRACES
• In our model, one action is taken every hour, but the reward is awarded only at the end of 24 hours.
• A single action cannot justify the reward at the end; a series of 24 state-action pairs is responsible for it.
• To update the Q-values of the appropriate state-action pairs, we introduce a memory variable, e(s, a), called the eligibility trace.
• e(s, a) for ALL state-action pairs decays by λ at every time step.
• If the state-action pair (s_k, a_k) is visited, e(s_k, a_k) is incremented by one.
[Diagram: Action 1 in State 1 → Action 2 in State 2 → … → Action 24 in State 24 → REWARD; the reward updates Q(State 1, Action 1), Q(State 2, Action 2), …, Q(State 24, Action 24).]
(A trace-update sketch follows below.)
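One way to implement accumulating traces (a sketch under assumed names; the slide describes the decay simply as "by λ", whereas standard formulations decay traces by γλ):

```python
from collections import defaultdict

Q = defaultdict(float)   # Q-values
e = defaultdict(float)   # eligibility traces
alpha, lam = 0.1, 0.8    # assumed hyperparameters

def trace_update(state, action, td_error):
    """Bump the trace of the visited pair, credit every pair in proportion
    to its trace, then decay all traces."""
    e[(state, action)] += 1.0
    for sa in list(e.keys()):
        Q[sa] += alpha * td_error * e[sa]
        e[sa] *= lam     # decay "by lambda", as stated on the slide
```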
SARSA(λ) AND Q(λ)
• SARSA(λ) – eligibility traces integrated with the SARSA algorithm
• Q(λ) – eligibility traces integrated with the Q-Learning algorithm
• λ, 0 < λ < 1, sets the strength with which the Q-values of earlier contributing state-action pairs are updated as a consequence of the final reward.
(A combined SARSA(λ) sketch follows below.)
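Putting the pieces together, a minimal tabular SARSA(λ) episode might look as follows. This is an illustration only: the env interface, the hyperparameters, and the γλ decay are assumptions, not the authors' implementation. Q(λ) would instead bootstrap on max_a Q(s', a) and, in Watkins' variant, cut the traces after non-greedy actions.

```python
from collections import defaultdict
import random

def epsilon_greedy(Q, s, actions, eps):
    """Explore with probability eps, otherwise act greedily."""
    return random.choice(actions) if random.random() < eps else max(actions, key=lambda a: Q[(s, a)])

def sarsa_lambda_episode(env, actions, Q, alpha=0.1, gamma=0.9, lam=0.8, epsilon=0.1):
    """Run one episode of tabular SARSA(lambda) with accumulating traces."""
    e = defaultdict(float)                       # fresh traces each episode
    state = env.reset()
    action = epsilon_greedy(Q, state, actions, epsilon)
    done = False
    while not done:
        reward, next_state, done = env.step(action)
        next_action = epsilon_greedy(Q, next_state, actions, epsilon)
        # On-policy TD error: bootstraps on the action that will actually be taken next.
        bootstrap = 0.0 if done else gamma * Q[(next_state, next_action)]
        td_error = reward + bootstrap - Q[(state, action)]
        e[(state, action)] += 1.0
        for sa in list(e.keys()):
            Q[sa] += alpha * td_error * e[sa]
            e[sa] *= gamma * lam
        state, action = next_state, next_action
    return Q
```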
ADAPTIVE POWER CONTROL USING REINFORCEMENT LEARNING
ALGORITHMS
• SARSA(λ) – SARSA with eligibility traces
• SARSA
• Q(λ) – Q-Learning with eligibility traces
• Q-Learning