REINFORCEMENT LEARNING AS A PRODUCTION TOOL ON AAA GAMES
Olivier DELALLEAU & Adrien LOGUT
2017-10-18
AGENDA
» Project Overview
» Fighting in For Honor
» Driving in Watch_Dogs 2
Project Overview
Build AIs that can play games like our players would
» FOR HONOR: Olivier Delalleau, Frédéric Doll, Maxim Peter
» WATCH_DOGS 2: Adrien Logut, Olivier Lamothe-Penelle
Motivations
» Automated testing
» Design assistance
» In-game AI
Why Reinforcement Learning?
[Google Trends chart: "reinforcement learning" vs. "genetic algorithm" vs. "imitation learning"]
» Alternatives considered: evolutionary methods, imitation learning
RL & Video Games (recent, incomplete list)
Atari, Doom, Minecraft, Universe, SNES, StarCraft II, Dota 2, Unity
CENTURION GLADIATOR SHINOBI HIGHLANDER
Autonomous Driving
Watch_Dogs 2
» Open world game within a living city
» Takes place in San Francisco
» Living city: cars in the street
▪ Cars need to be controlled by an AI
Objectives
How is it currently done?
» PID controller with custom curves
▪ Hand-tuned curves take a lot of time
▪ Not precise
➢ How about Reinforcement Learning?
Reinforcement Learning
What the agent can see and do
[Diagram: Agent ↔ Environment loop]
» State: distance to the road (0.1), velocity (0.3), desired speed (0.9), ...
» Action: acceleration [0, 1], brake [-1, 1], steering [-1, 1]
» Reward: e.g. +3
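To make the state/action/reward loop concrete, below is a minimal gym-style sketch of the driving environment interface. The class, its method names, and the placeholder dynamics are illustrative assumptions, not the game's actual API; only the state fields and action ranges come from the slide.

```python
import numpy as np

class DrivingEnv:
    """Gym-style view of the driving task (illustrative only, not the game's real API)."""

    def reset(self):
        # State: distance to the road, current velocity, desired speed, ...
        return np.array([0.1, 0.3, 0.9], dtype=np.float32)

    def step(self, action):
        # Action: [acceleration in [0, 1], brake in [-1, 1], steering in [-1, 1]].
        next_state = np.array([0.05, 0.5, 0.9], dtype=np.float32)  # placeholder physics
        reward = 3.0   # e.g. +3 when tracking the path at the desired speed
        done = False
        return next_state, reward, done
```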
Reinforcement Learning
What the agent can see and do
➢ Continuous states
▪ Neural network to approximate $Q(s_t, a_t)$
➢ Continuous actions
▪ Cannot use the greedy policy from DQN (For Honor)
▪ Neural network to approximate a policy $a_t \sim \mu(s_t)$
Reinforcement Learning
Actor-Critic architecture
» Two neural networks approximate functions
▪ Actor: $a_t \sim \mu(s_t)$
▪ Critic: $Q(s_t, a_t)$
» Critic update: Q-learning (same as For Honor), estimating the expected discounted reward
» Actor update: policy gradient
$\nabla_{\theta^{\mu}} J = \nabla_{a} Q(s, a \mid \theta^{Q})\big|_{s=s_t,\, a=\mu(s_t)} \, \nabla_{\theta^{\mu}} \mu(s \mid \theta^{\mu})\big|_{s=s_t}$
[Diagram: $s_t$ feeds the Actor, which outputs $a_t$; the Critic takes $(s_t, a_t)$ and outputs $Q(s_t, a_t)$]
Reinforcement Learning
Actor update – Intuition
» Actor update: policy gradient
» Intuition: the critic gives the direction in which to update the actor
▪ "In which way should I change the actor parameters in order to maximize the critic output, given a state?"
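As a concrete illustration of the actor-critic updates above, here is a minimal sketch of one training step in the DDPG style, assuming PyTorch. Network sizes, the replay buffer, target networks and exploration noise are omitted, and none of the names below come from the talk.

```python
import torch
import torch.nn as nn

state_dim, action_dim = 8, 3
gamma = 0.99

# Critic approximates Q(s, a); actor approximates the deterministic policy mu(s).
critic = nn.Sequential(nn.Linear(state_dim + action_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(state_dim, 64), nn.ReLU(), nn.Linear(64, action_dim), nn.Tanh())
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def update(s, a, r, s_next, done):
    """One actor-critic update from a batch of transitions (r and done shaped [N, 1])."""
    # Critic update: regress Q(s, a) onto the one-step Q-learning target.
    with torch.no_grad():
        q_next = critic(torch.cat([s_next, actor(s_next)], dim=1))
        target = r + gamma * (1.0 - done) * q_next
    q = critic(torch.cat([s, a], dim=1))
    critic_loss = nn.functional.mse_loss(q, target)
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor update: deterministic policy gradient, i.e. follow dQ/da * da/dtheta
    # by maximizing the critic's output for the actor's own actions.
    actor_loss = -critic(torch.cat([s, actor(s)], dim=1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```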
First experiment
Since we have the PID, what about imitating it?
» Supervised learning on the actor
▪ Updated with the mean squared error between the actor output $a_{t,\text{actor}}$ and the PID output $a_{t,\text{PID}}$:
$\delta_t = \left(a_{t,\text{actor}} - a_{t,\text{PID}}\right)^2$
[Diagram: the state $s_t$ feeds both the Actor and the PID; the actor is updated toward the PID output]
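A minimal sketch of this supervised warm-start, assuming PyTorch and a hypothetical pid_controller(state) helper that returns the PID action for a given state: the actor is regressed onto the PID output with a mean squared error loss.

```python
import torch
import torch.nn as nn

def pretrain_actor(actor, states, pid_controller, epochs=10, lr=1e-3):
    """Supervised warm-start: regress the actor onto the PID controller's actions."""
    opt = torch.optim.Adam(actor.parameters(), lr=lr)
    # Targets a_{t,PID}: the PID controller's action for every recorded state.
    pid_actions = torch.stack([pid_controller(s) for s in states])
    for _ in range(epochs):
        pred = actor(states)                              # a_{t,actor}
        loss = nn.functional.mse_loss(pred, pid_actions)  # (a_actor - a_PID)^2, averaged
        opt.zero_grad()
        loss.backward()
        opt.step()
    return actor
```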
First experiment
Supervised vs. original: slight improvement
Reward Shaping
Defining the reward function
» The reward is the only signal received by the agent
▪ "Am I doing well or badly?"
» This is the key part of reinforcement learning
▪ Called reward shaping
▪ Requires a good understanding of the problem
» For driving:
▪ Follow the given path at the right speed
▪ Stop when needed
Reward Shaping
Defining the reward function – Configuration
Three main components are measured:
» Velocity along the path $v_x$
» Velocity perpendicular to the path $v_y$
» Distance from the path $d$
Reward Shaping
Defining the reward function – Desired speed
» Positive reward when driving close to the desired speed
» Negative when far from the desired speed
» Punish more when driving faster than when driving slower
[Plot: reward vs. velocity along the path $v_x$, desired speed marked in red]
Reward Shaping
Defining the reward function – Velocity $v_y$
» Only negative reward
» Want to punish harder for small values (power < 1)
[Plot: reward vs. velocity perpendicular to the path $v_y$]
Reward Shaping
Defining the reward function – Distance
» Only negative reward
» Want to punish less for small values (power > 1)
» (a sketch combining the three components follows below)
[Plot: reward vs. distance from the path $d$]
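Putting the three components together, here is a hedged sketch of such a shaped reward in Python. The coefficients and exponents are illustrative placeholders, not the tuned values from the talk; only the overall shapes (asymmetric speed penalty, power < 1 on $v_y$, power > 1 on $d$) follow the slides.

```python
def shaped_reward(v_x, v_y, d, desired_speed):
    """Illustrative driving reward: track the desired speed and stay on the path."""
    # Desired speed: positive near the target, punished more when too fast than too slow.
    speed_error = v_x - desired_speed
    if speed_error > 0:
        r_speed = 1.0 - 2.0 * speed_error   # driving faster than desired: harsher penalty
    else:
        r_speed = 1.0 + speed_error         # driving slower than desired: milder penalty

    # Perpendicular velocity: only negative; power < 1 punishes small values harder.
    r_vy = -abs(v_y) ** 0.5

    # Distance from the path: only negative; power > 1 is lenient for small deviations.
    r_dist = -abs(d) ** 2

    return r_speed + r_vy + r_dist
```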
Results The learning curve
Results
Good results after 15 minutes of training
Results One model to rule them all?
Results
One model to rule them all?
» Each vehicle has its own physical model
» Accelerate, steer, brake all react differently across vehicles
» We can still group physically similar vehicles
» Bigger vehicles (buses, trucks, ...) need more state information
Results
Need to deal with a lot of variance
» The game is not deterministic
» Even with seeding, results differ between runs
Tools
Multi-dimensional function visualizer
» Developed with PyQt5
» Loads the models and plots their output
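The actual tool is built with PyQt5; as a simplified illustration of the same idea, the sketch below sweeps one state dimension while holding the others fixed and plots a single action output with matplotlib. The actor is assumed to be a plain callable mapping a NumPy state to a NumPy action; all names here are hypothetical.

```python
import numpy as np
import matplotlib.pyplot as plt

def plot_actor_slice(actor, base_state, dim, values, action_index=0):
    """Sweep one state dimension, hold the others fixed, and plot one action output."""
    outputs = []
    for v in values:
        state = base_state.copy()
        state[dim] = v
        outputs.append(actor(state)[action_index])
    plt.plot(values, outputs)
    plt.xlabel("state[{}]".format(dim))
    plt.ylabel("action[{}]".format(action_index))
    plt.show()

# Example: how the steering output reacts as the distance to the path varies.
# plot_actor_slice(trained_actor, base_state=np.zeros(8), dim=0,
#                  values=np.linspace(-1.0, 1.0, 100))
```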
Tools
Archives reader and comparison tool
» Developed with PyQt5
» Loads the metrics and plots them to compare models
What's next?
Awesome stuff!
» Analyze what could be introduced into the game
▪ Level of quality? Robustness? Computation time? Learning time? Size of models in memory?
» Try other learning algorithms
» Optimize the workflow with multiple agents
Conclusion
Reinforcement learning is promising
» Found efficient fighting behavior in For Honor
» Already better driving in Watch_Dogs 2 compared to the PID controller
It is just the beginning...
» Still a lot of work and research to do
» Not ready to use in production... yet
The future? Player-facing AIs
Thank you! Do you have questions? laforge@ubisoft.com PS: we’re hiring (!)