ADAPTIVE POWER MANAGEMENT OF ENERGY HARVESTING SENSOR NODES USING REINFORCEMENT LEARNING: A Comparison of Q-Learning and SARSA Algorithms


  1. ADAPTIVE POWER MANAGEMENT OF ENERGY HARVESTING SENSOR NODES USING REINFORCEMENT LEARNING: A Comparison of Q-Learning and SARSA Algorithms (Comparative Evaluation of Reinforcement Learning Strategies for Energy Harvesting Sensor Nodes with Adaptive Power Control). SWoPP 2017. Shaswot Shresthamali, Masaaki Kondo, Hiroshi Nakamura, The University of Tokyo.

  2. INTRODUCTION
• Use Reinforcement Learning (RL) for power management in Energy Harvesting Sensor Nodes (EHSNs): adaptive control behavior and near-optimal performance.
• Compare different RL algorithms: Q-Learning and SARSA.

  3. ENERGY HARVESTING SENSOR NODE CONCEPT
[Block diagram: harvested energy feeds a Power Manager supplying the RF Transceiver, MCU, Memory, Mixed-Signal Circuits, and Sensor.]
CONSTRAINTS
• The sensor node has to be operating at ALL times.
• The battery cannot be completely depleted.
• The battery cannot be overcharged (exceed 100%).
• Battery size is finite.
• Charging/discharging rates are finite.

  4. OBJECTIVE: NODE-LEVEL ENERGY NEUTRALITY
• We want to use ALL the energy that is harvested.
• One way of achieving that is by ensuring node-level energy neutrality: the condition in which the amount of energy harvested equals the amount of energy consumed.
• Under this condition, autonomous perpetual operation can be achieved, as the check below illustrates.
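To make the condition concrete, here is a minimal sketch (my own illustration, not from the slides) that checks energy neutrality over one day of hypothetical hourly measurements:

```python
# Minimal sketch (not from the slides): checking node-level energy
# neutrality over a horizon of time steps.
def is_energy_neutral(harvested, consumed, tolerance=1e-6):
    """Energy neutrality holds when total harvested == total consumed."""
    return abs(sum(harvested) - sum(consumed)) <= tolerance

# Example: 24 hourly measurements in joules (hypothetical values).
harvested = [0, 0, 0, 0, 0, 1, 3, 5, 7, 8, 9, 9, 9, 8, 7, 5, 3, 1, 0, 0, 0, 0, 0, 0]
consumed = [3.125] * 24  # a constant duty cycle consuming the same total
print(is_energy_neutral(harvested, consumed))  # True: 75 J in, 75 J out
```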

  5. CHALLENGES
• Different environments, moving sensors, different sensors.
[Images: energy management of wireless sensor networks, https://sites.google.com/site/sarmavrudhula/home/research/energy-management-of-wireless-sensor-networks; Environmental Sensor Networks – P. I. Corke et al., http://www.mdpi.com/sensors/sensors-12-02175/article_deploy/html/images/sensors-12-02175f5-1024.png]

  6. SOLUTION
Preparing heuristic, user-defined contingency solutions for all possible scenarios is impractical. We want a one-size-fits-all solution: sensor nodes that are capable of
• autonomously learning optimal strategies, and
• adapting once they have been deployed in the environment.

  7. SOLUTION
➢ Use RL for adaptive control.
➢ Use a solar energy harvesting sensor node as a case example.

  8. Q-Learning Results (ETNET 2017)
[Bar chart: Efficiency (%) and Energy Wasted (%) for three methods — Naïve (duty cycle proportional to battery level), Kansal (fix the duty cycle for the present day by predicting the total energy for the next day), and our method using RL. Our method shows higher efficiency and lower waste.]

Efficiency = Actual Duty Cycle / Achievable Maximum Duty Cycle
Energy Wasted = Total Energy Wasted / Total Energy Harvested
Energy Waste = Energy Harvested − Node Energy − Charging Energy
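As an illustration of how these two metrics compute, here is a hedged sketch; the formulas are the slide's, but the function and variable names are mine:

```python
# Sketch of the chart's two metrics, using the slide's definitions.
def efficiency(actual_duty_cycle, achievable_max_duty_cycle):
    """Fraction of the achievable duty cycle actually sustained."""
    return actual_duty_cycle / achievable_max_duty_cycle

def wasted_fraction(energy_harvested, node_energy, charging_energy):
    """Waste = harvested - consumed by the node - stored in the battery."""
    waste = energy_harvested - node_energy - charging_energy
    return waste / energy_harvested

print(efficiency(0.45, 0.50))              # 0.9 -> 90% efficient
print(wasted_fraction(100.0, 70.0, 20.0))  # 0.1 -> 10% wasted
```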

  9. Q-Learning (ETNET 2017)
❑ Demonstrated that RL approaches outperform traditional methods.
❑ Limitations:
• State explosion: 200 × 5 × 6 = 6,000 states, so the Q-table becomes too large to train using a random policy.
• Long training times: required 10 years' worth of training data.
• The reward function did not reflect the true objective of energy neutrality.

  10. REINFORCEMENT LEARNING IN A NUTSHELL

  11. REINFORCEMENT LEARNING
• A type of machine learning based on experience rather than instruction.
• Maps situations (states) into actions so as to accumulate the maximum total reward: the agent continually asks, "What action should I take to receive as much reward as possible?"
[Diagram: the agent (power manager) observes the battery level, energy harvested, and weather forecast; its ACTION is to choose a duty cycle; the environment returns a REWARD and a new state.]
A minimal version of this loop is sketched below.
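The loop below is a self-contained toy version of the diagram; ToyEnvironment, RandomAgent, and the reward shape are illustrative stand-ins, not the authors' simulator:

```python
import random

# Minimal sketch of the observe-act-reward loop: the agent picks a duty
# cycle, the environment responds with a new state and a reward.
class ToyEnvironment:
    def reset(self):
        return (50, 0.0, "sunny")        # (battery %, harvest, forecast)

    def step(self, duty_cycle):
        reward = -abs(duty_cycle - 0.5)  # toy reward, best at 50% duty cycle
        next_state = (50, 0.0, "sunny")  # a real model would evolve here
        return next_state, reward

class RandomAgent:
    def choose_duty_cycle(self, state):
        return random.choice([0.2, 0.5, 0.8])

env, agent = ToyEnvironment(), RandomAgent()
state, total_reward = env.reset(), 0.0
for _ in range(24):                      # one action per hour
    action = agent.choose_duty_cycle(state)
    state, reward = env.step(action)
    total_reward += reward
```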

  12. REINFORCEMENT LEARNING
• IMPORTANT CONCEPTS
▫ Q-VALUE
▫ ELIGIBILITY TRACES

  13. Q-VALUE
• To give a measure of the "goodness" of an action in a particular state, we assign each state-action pair a Q-value: Q(state, action).
[Diagram: from state $s_i$, actions $a_1$, $a_2$, $a_3$ yield rewards $r_1$, $r_2$, $r_3$ and lead to states $s_j$, $s_k$, $s_l$; the edges are labeled $Q(s_i, a_1)$, $Q(s_i, a_2)$, $Q(s_i, a_3)$.]
• Q-values are learned from past (training) experiences.
• A higher Q-value → a better choice of action for that state.
• $Q(s, a)$ is the expected cumulative reward that you can get starting from state $s$ and taking action $a$; see the sketch below.
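In tabular form this is simply an array indexed by state and action. A minimal sketch (the sizes are illustrative, not the paper's state encoding):

```python
import numpy as np

# A tabular Q-value store: one entry per (state, action) pair.
n_states, n_actions = 100, 10
Q = np.zeros((n_states, n_actions))  # arbitrary initial guesses (zeros)

s = 42                               # some discretized state index
best_action = int(np.argmax(Q[s]))   # highest Q-value = best known action
```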


  15. LEARNING Q-VALUES TO FIND $Q(s_k, a_k)$
• Start with arbitrary guesses for $Q(s_k, a_k)$.
• Update $Q(s_k, a_k)$ incrementally towards the target value (bootstrapping).
• General update rule:
  $\text{NewEstimate} \leftarrow \text{OldEstimate} + \text{StepSize} \, [\text{Target} - \text{OldEstimate}]$
  $\text{NewEstimate} \leftarrow (1 - \text{StepSize}) \times \text{OldEstimate} + \text{StepSize} \times \text{Target}$
  $Q(s_k, a_k) \leftarrow (1 - \alpha) \, Q(s_k, a_k) + \alpha \times \text{Target}$
• The open question is what to use as the Target.
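The update rule is one line of code. A sketch, with `alpha` as the step size (names are mine):

```python
# The general bootstrapped update rule from this slide, written out.
def incremental_update(old_estimate, target, alpha):
    # Equivalent form: (1 - alpha) * old_estimate + alpha * target
    return old_estimate + alpha * (target - old_estimate)

q = 0.0
for _ in range(10):
    q = incremental_update(q, target=1.0, alpha=0.5)  # q moves toward 1.0
print(q)
```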

  16. SARSA VS Q-LEARNING
• The agent starts at state $s_k$ and takes some action $a_k$ according to policy $\pi$.
• It receives a reward $r_k$ and is transported to a new state $s_{k+1}$.

Q-LEARNING
• The agent assumes the next action will be the action with the highest Q-value.
• The Q-value $Q(s_k, a_k)$ is then updated:
  $Q(s_k, a_k) \leftarrow (1 - \alpha) \, Q(s_k, a_k) + \alpha \, [r_k + \gamma \max_a Q(s_{k+1}, a)]$

SARSA
• The agent considers the next action $a_{k+1}$ it will actually take.
• The Q-value $Q^\pi(s_k, a_k)$ is then updated:
  $Q^\pi(s_k, a_k) \leftarrow (1 - \alpha) \, Q^\pi(s_k, a_k) + \alpha \, [r_k + \gamma \, Q^\pi(s_{k+1}, a_{k+1})]$

• An ε-greedy policy is used, i.e. random actions are taken with probability ε to allow exploration.

  17. SARSA VS Q-LEARNING
General form: $\text{NewEstimate} \leftarrow (1 - \text{StepSize}) \times \text{OldEstimate} + \text{StepSize} \times \text{Target}$

SARSA:
  $Q^\pi(s_k, a_k) \leftarrow (1 - \alpha) \, Q^\pi(s_k, a_k) + \alpha \, [r_k + \gamma \, Q^\pi(s_{k+1}, a_{k+1})]$
Q-Learning:
  $Q(s_k, a_k) \leftarrow (1 - \alpha) \, Q(s_k, a_k) + \alpha \, [r_k + \gamma \max_a Q(s_{k+1}, a)]$
Both updates are shown in code below.
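The two targets translate directly into code. A sketch on a NumPy Q-table as above; `alpha` (step size) and `gamma` (discount) are the usual hyperparameters, and the names are mine:

```python
import numpy as np

# The two update rules side by side, differing only in the target.
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    target = r + gamma * np.max(Q[s_next])   # assume the best next action
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    target = r + gamma * Q[s_next, a_next]   # use the action actually taken
    Q[s, a] += alpha * (target - Q[s, a])
```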

  18. SARSA VS Q-LEARNING

SARSA
• On-policy learning: updates the policy it is using during training.
• The update is carried out by considering the next action actually to be taken.
• Faster convergence, but requires an initial policy.
• Difficult to integrate with linear function approximation.

Q-Learning
• Off-policy learning: the final learned policy is the same regardless of the training method.
• Assumes the best action will always be taken.
• Takes longer to converge.
• Easier to integrate with function approximation.

                      SARSA            Q-Learning
Choosing next action  ε-greedy policy  ε-greedy policy
Updating Q            ε-greedy policy  Greedy policy

Both algorithms choose the next action ε-greedily, as sketched below; they differ only in the policy assumed when updating Q.
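A sketch of that shared ε-greedy choice (the helper name is mine):

```python
import random
import numpy as np

# epsilon-greedy selection, shared by both algorithms when choosing the
# next action; only the update step differs, as in the table above.
def epsilon_greedy(Q, s, epsilon=0.1):
    if random.random() < epsilon:
        return random.randrange(Q.shape[1])  # explore: random action
    return int(np.argmax(Q[s]))              # exploit: greedy action
```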

  19. ELIGIBILITY TRACES
• In our model, one action is taken every hour, but the reward is awarded at the end of 24 hours. A single action cannot justify the reward at the end; a series of 24 state-action pairs is responsible for it.
• To update the Q-values of the appropriate state-action pairs, we introduce a memory variable, $e(s, a)$, called the eligibility trace.
• $e(s, a)$ for ALL state-action pairs decays by $\lambda$ at every time step.
• If the state-action pair $(s_k, a_k)$ is visited, $e(s_k, a_k)$ is incremented by one.
[Diagram: Action 1 in State 1 → Action 2 in State 2 → … → Action 24 in State 24 → REWARD; the reward updates Q(State 1, Action 1) through Q(State 24, Action 24).]
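A sketch of one trace-weighted update step, following the slide's decay-then-increment description (the names and exact ordering are my reading of the slide, not the authors' code):

```python
import numpy as np

# Accumulating traces: every trace decays by lambda each step, the
# visited pair is bumped by one, and the TD error is then spread
# across all eligible pairs.
def trace_step(Q, E, s, a, td_error, alpha, lam):
    E *= lam                     # ALL traces decay by lambda each step
    E[s, a] += 1.0               # the visited pair becomes more eligible
    Q += alpha * td_error * E    # earlier pairs share credit for the reward
```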

  20. SARSA(λ) AND Q-LEARNING(λ)
• SARSA(λ): eligibility traces integrated with the SARSA algorithm.
• Q(λ): eligibility traces integrated with the Q-Learning algorithm.
• λ, with 0 < λ < 1, sets the strength with which the Q-values of early contributing state-action pairs are updated as a consequence of the final reward.
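Putting the pieces together, a compact SARSA(λ) episode might look like the sketch below. It assumes integer-coded states, the `epsilon_greedy` helper from earlier, and an `env` with the toy `reset()`/`step()` interface; the hyperparameter values are illustrative, not the paper's settings:

```python
import numpy as np

# SARSA(lambda): SARSA's TD error, spread over all eligible pairs.
def sarsa_lambda_episode(env, Q, alpha=0.1, gamma=0.9, lam=0.8,
                         eps=0.1, horizon=24):
    E = np.zeros_like(Q)                   # eligibility traces
    s = env.reset()
    a = epsilon_greedy(Q, s, eps)
    for _ in range(horizon):               # one action per hour
        s_next, r = env.step(a)
        a_next = epsilon_greedy(Q, s_next, eps)
        td_error = r + gamma * Q[s_next, a_next] - Q[s, a]
        E *= lam                           # decay all traces
        E[s, a] += 1.0                     # mark the visited pair
        Q += alpha * td_error * E          # credit earlier pairs too
        s, a = s_next, a_next
```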

  21. ADAPTIVE POWER CONTROL USING REINFORCEMENT LEARNING ALGORITHMS
• SARSA(λ) – SARSA with eligibility traces
• SARSA
• Q(λ) – Q-Learning with eligibility traces
• Q-Learning
