
Adaptive Power Management for Energy Harvesting Sensor Nodes using Reinforcement Learning
Shaswot Shresthamali, Masaaki Kondo, Hiroshi Nakamura, The University of Tokyo (PowerPoint PPT Presentation)


  1. Adaptive Power Management for Energy Harvesting Sensor Nodes using Reinforcement Learning Shaswot Shresthamali, Masaaki Kondo, Hiroshi Nakamura The University of Tokyo

  2. CONTEXT

  3. Energy Harvesting Sensor Nodes: Sensor Node (capable of varying the duty cycle) + Battery + Energy Harvesting Module. Theoretically capable of perpetual operation.

  4. Challenge I: Say your battery is at 75% and there is plenty of sunshine. Do you ◦ use the solar power to charge your battery only, or ◦ use the solar power to both charge your battery and drive the sensor node? If so, in what proportion? [Chart: Energy Harvested and Battery level over time under Policies 1, 2, and 3]

  5. Challenge II: Moving sensors, different environments, different sensors (Environmental Sensor Networks, P. Corke et al.).

  6. Challenge III: Billions and trillions of nodes.

  7. Challenges II and III: When dealing with trillions of sensor nodes, customizing each node is impractical, if not impossible. ◦ Nodes should OPTIMIZE themselves. ◦ Nodes should ADAPT to their changing environments. Energy harvesting nodes need to be adaptable and self-calibrating.

  8. What this presentation is about: To demonstrate how to overcome the challenges by using Reinforcement Learning (RL) ◦ Brief introduction to Reinforcement Learning ◦ Our approach using RL ◦ How this strategy compares to other methods ◦ How this strategy adapts to a changing environment

  9. OBJECTIVES

  10. Objectives: Energy Neutral Operation (ENO) ◦ Energy consumed = Energy harvested. Maximize Performance ◦ Maximize duty cycle. Minimize Battery Downtime ◦ Battery should never drop to zero. Minimize Energy Waste ◦ Battery should not overcharge. Energy Waste = Energy Harvested − Energy consumed by Node − Energy to charge battery.
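The energy-waste bookkeeping above can be written out directly. A minimal sketch in Python; the function names and the assumption that all quantities are measured over the same interval are mine, not from the slides.

```python
def energy_waste(e_harvested: float, e_node: float, e_charge: float) -> float:
    """Energy Waste = Energy Harvested - Energy consumed by Node - Energy to charge battery."""
    return e_harvested - e_node - e_charge


def is_energy_neutral(e_harvested: float, e_consumed: float, tol: float = 1e-6) -> bool:
    """Energy Neutral Operation (ENO): energy consumed equals energy harvested (within a tolerance)."""
    return abs(e_harvested - e_consumed) <= tol
```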

  11. SYSTEM MODEL

  12. System Model: [Diagram] A solar panel (ideal solar energy) charges the battery and powers the sensor node. The Adaptive Power Manager observes the battery reserve level and the energy being harvested, and sets the sensor node's duty cycle.

  13. REINFORCEMENT LEARNING (RL) A BRIEF INTRODUCTION

  14. What is RL: A type of machine learning that learns by interacting with the environment. Suited for sequential decision-making tasks. Maps situations (states) to actions so as to receive as much reward as possible. Based on an iterative process of trial and error, similar to how humans learn (search and memory).

  15. Why Reinforcement Learning: By using RL, it is possible ◦ to optimize nodes with raw high-level data and minimal human input, and ◦ to adapt to changes in environmental parameters.

  16. Reinforcement Learning: The agent (power manager) asks, "What action should I take to accumulate the maximum total reward?" [Diagram] OBSERVATIONS: battery level, energy harvested. ACTION: choose duty cycle. The environment returns a REWARD and the new state.
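The observation/action/reward loop on this slide can be sketched as follows. This is only an interface sketch under assumed names (SensorNodeEnv, choose_duty_cycle, learn); the presentation does not define a concrete API.

```python
class SensorNodeEnv:
    """Placeholder environment: the real dynamics (battery, solar harvest) are not shown here."""

    def observe(self):
        """Return the current state: (battery level, energy being harvested)."""
        raise NotImplementedError

    def step(self, duty_cycle):
        """Apply the chosen duty cycle for one epoch; return (reward, new_state)."""
        raise NotImplementedError


def run(agent, env, n_epochs):
    """Agent-environment loop: observe state, choose duty cycle, receive reward and new state."""
    state = env.observe()
    for _ in range(n_epochs):
        action = agent.choose_duty_cycle(state)    # ACTION: choose duty cycle
        reward, next_state = env.step(action)      # REWARD and new state from the environment
        agent.learn(state, action, reward, next_state)
        state = next_state
```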

  17. Reinforcement Learning: The question is: which action to take when you are in a given state? EXAMPLE: Lots of sunlight | Battery at 60%. Do you ◦ drive the sensor node at full strength without recharging, or ◦ drive the sensor node at half strength with partial charging?

  18. Q-Value: Assign to every state-action pair a Q-value, Q(s,a). Q(s,a) is the total reward the agent can expect, in the best case, if it starts from state s and takes action a. The higher the Q-value, the better the action for that particular state. [Diagram: state X with actions 1, 2, 3 and their values Q(X,1), Q(X,2), Q(X,3)]
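As an illustration of how Q(s,a) can be stored and used, here is a table with one row per state and one column per action; the sizes match the state and action spaces described later in the talk, while the array representation itself is my assumption.

```python
import numpy as np

N_STATES, N_ACTIONS = 1000, 10          # 1000 states, 10 duty-cycle choices (from later slides)
Q = np.zeros((N_STATES, N_ACTIONS))     # Q[s, a] = best-case total reward from state s taking action a

def best_action(state: int) -> int:
    """The higher Q(s, a), the better action a is for state s, so pick the argmax."""
    return int(np.argmax(Q[state]))
```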

  19. Q-Learning: Challenge → determining the Q-values for all state-action pairs. The Q-table contains the Q-values of all possible state-action pairs. Accomplished by the Q-Learning algorithm ◦ Q-values are learned by interacting with the environment. ◦ Iterative process. ◦ Bootstrapping approach.

  20. Q-Learning Algorithm ◦ Start with arbitrary estimates for the Q-values. ◦ Use these estimates to decide on actions. ◦ Update the Q-table using the rewards received. ◦ Repeat until the Q-values sufficiently converge.
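The loop on this slide corresponds to the standard tabular Q-learning update; a minimal sketch, assuming a learning rate alpha and discount factor gamma whose values the presentation does not specify.

```python
ALPHA = 0.1    # learning rate (assumed)
GAMMA = 0.9    # discount factor (assumed)

def q_update(Q, state, action, reward, next_state):
    """Bootstrapping update: move Q(s, a) toward reward + gamma * max_a' Q(s', a')."""
    td_target = reward + GAMMA * Q[next_state].max()
    Q[state, action] += ALPHA * (td_target - Q[state, action])
```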

  21. EXPERIMENTS ON ADAPTIVE POWER MANAGEMENT USING Q-LEARNING

  22. State Space: The state is defined by ◦ the amount of battery remaining (200 possible levels) and ◦ the amount of energy harvested (5 possible levels). Total possible states: 200 x 5 = 1000.
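One possible discretization that yields the 200 x 5 = 1000 states above; the bin boundaries are my assumption, since the slides only give the number of levels.

```python
N_BATTERY_LEVELS = 200
N_HARVEST_LEVELS = 5

def state_index(battery_frac: float, harvest_frac: float) -> int:
    """Map battery and harvested-energy fractions in [0, 1] to one of 1000 discrete states."""
    b = min(int(battery_frac * N_BATTERY_LEVELS), N_BATTERY_LEVELS - 1)
    h = min(int(harvest_frac * N_HARVEST_LEVELS), N_HARVEST_LEVELS - 1)
    return b * N_HARVEST_LEVELS + h
```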

  23. Action Space: Choose the duty cycle of the sensor node, a(t_k) ∈ {10%, 20%, 30%, ..., 100%}. Example node power draw: 10% → 50 mW, 50% → 250 mW, 100% → 500 mW.
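The action set and the quoted power figures are consistent with a linear 5 mW per percent of duty cycle; a small sketch under that assumption:

```python
DUTY_CYCLES = [d / 100 for d in range(10, 101, 10)]   # 10%, 20%, ..., 100%

def node_power_mw(duty_cycle: float) -> float:
    """Node power draw, assuming the linear relation implied by 10% -> 50 mW and 100% -> 500 mW."""
    return 500.0 * duty_cycle
```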

  24. Reward Function: The reward depends on ◦ the distance from energy neutrality at time t_k, Δe_neutral(t_k) = e_harvest(t_k) − e_node(t_k), and ◦ the amount of battery remaining.
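The exact reward expression is not given on the slide; the sketch below is one plausible shaping consistent with it (penalize deviation from energy neutrality, reward remaining battery), with the weights w_neutral and w_battery as assumed parameters.

```python
def reward(e_harvest: float, e_node: float, battery_frac: float,
           w_neutral: float = 1.0, w_battery: float = 1.0) -> float:
    """Illustrative reward: closer to energy neutrality and a fuller battery give a higher reward."""
    delta_e_neutral = e_harvest - e_node          # distance from energy neutrality at t_k
    return -w_neutral * abs(delta_e_neutral) + w_battery * battery_frac
```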

  25. RESULTS

  26. Training and Testing: Training: Tokyo (2000 to 2009). Testing: Tokyo (2010). ◦ Compare with other methods. ◦ Adaptation to diurnal and seasonal variations. ◦ Greedy and ε-greedy implementations.

  27. Comparison with other methods: Efficiency(%) = Actual Duty Cycle / Achievable Maximum Duty Cycle (higher is better). Energy Wasted(%) = Total Energy Wasted / Total Energy Harvested (lower is better), where Energy Waste = Energy Harvested − Node Energy − Charging Energy. [Chart: Efficiency(%) and Energy Wasted(%) for Naïve, Kansal, and our method using RL.] Naïve: fix the duty cycle for the present day by predicting the total energy for the next day. Kansal: duty cycle is proportional to the battery level.
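The two metrics in the chart can be computed as below; the function names are mine, the formulas are those given on the slide.

```python
def efficiency_pct(actual_duty_cycle: float, achievable_max_duty_cycle: float) -> float:
    """Efficiency(%) = Actual Duty Cycle / Achievable Maximum Duty Cycle."""
    return 100.0 * actual_duty_cycle / achievable_max_duty_cycle

def energy_wasted_pct(e_harvested: float, e_node: float, e_charging: float) -> float:
    """Energy Wasted(%) = Total Energy Wasted / Total Energy Harvested."""
    return 100.0 * (e_harvested - e_node - e_charging) / e_harvested
```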

  28. ADAPTATION TO SEASONAL CHANGES

  29. Performance in Summer: High duty cycle even during the night.

  30. Performance in Winter: Lower duty cycle during the night. [Chart: Duty Cycle (%), Harvested Energy (%), and Battery (%) vs. epoch]

  31. ADAPTATION TO CHANGE IN LOCATION

  32. Implementation: ε-greedy approach. Perfect Q-convergence takes too long. Instead, use an ε-greedy approach with a non-converged Q-table. ε-greedy approach: ◦ Take the best action by default. ◦ Take a random action with probability ε. ◦ Increasing ε → exploration. ◦ Decreasing ε → exploitation.
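A minimal sketch of the ε-greedy rule described above; the Q-table layout follows the earlier sketch, and nothing here is specific to the authors' implementation.

```python
import random
import numpy as np

def epsilon_greedy(Q: np.ndarray, state: int, epsilon: float) -> int:
    """Take a random action with probability epsilon (explore), otherwise the best known action."""
    if random.random() < epsilon:
        return random.randrange(Q.shape[1])          # explore: random duty-cycle index
    return int(np.argmax(Q[state]))                  # exploit: greedy action
```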

  33. Adaptation to change in climate • Wakkanai (very little sunshine). • Compare between a greedy approach (Offline) and an ε-greedy approach (Online). • Training: 2000-2009 Tokyo. • Testing: 2010 Wakkanai.

  34. Adaptation to change in location: [Chart: Average Duty Cycle (%) per year, 2010-2015, for Wakkanai Offline (greedy, non-adaptive) vs. Wakkanai Online (ε-greedy, adaptive); the numbers in parentheses (8, 14, 14) give the total number of times the battery was completely exhausted.] With the ε-greedy implementation, the agent adapts to the environment and minimizes instances of battery exhaustion.

  35. With and Without Forecast Information

  36. CONCLUSION

  37. CONCLUSION • The proposed system is able to meet the objectives of energy neutrality and maximized performance. • It exceeds the performance of other schemes. • It is capable of adaptation. • Inclusion of weather forecast information results in smarter operation.

  38. THANK YOU FOR LISTENING ANY COMMENTS OR QUESTIONS ARE WELCOME
