LunarLander-v2 using Deep Reinforcement Learning
A project developed for the Autonomous Agents course PLH513
Portokalakis Petros, February 2020
Simple Game
● 8-dimensional state space
● 4 actions per state
● +100 points for landing
● -100 points when crashed
● Infinite fuel, but -0.3 points per frame when firing the main engine
● +10 points for each leg ground contact (to encourage smooth landing)
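As a quick check of the numbers above, here is a minimal sketch (not part of the original project code) that inspects the LunarLander-v2 environment with OpenAI Gym, using the Gym API as it existed around 2020:

```python
import gym

env = gym.make("LunarLander-v2")
print(env.observation_space.shape)  # (8,): position, velocity, angle, angular velocity, leg contacts
print(env.action_space.n)           # 4: do nothing, fire left, fire main, fire right engine

state = env.reset()                                               # initial 8-dimensional state
next_state, reward, done, info = env.step(env.action_space.sample())  # take a random action
```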
Deep Reinforcement Learning
Objective: approximate the optimal Q-function (which satisfies the Bellman equation)
Neural network:
● 8-node input layer (dimensionality of the state space)
● 150-node fully connected 1st hidden layer
● 128-node fully connected 2nd hidden layer
● 4-node output layer (Q-values for the actions)
The 4-layer architecture works well across a range of hidden-layer sizes; a 5-layer network fails to even train the agent.
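A minimal sketch of the 8-150-128-4 network described above, assuming Keras/TensorFlow (the slides do not name the framework) and the learning rate from the hyperparameter slide:

```python
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
from tensorflow.keras.optimizers import Adam

def build_q_network(state_dim=8, n_actions=4, lr=0.001):
    model = Sequential([
        Dense(150, activation="relu", input_shape=(state_dim,)),  # 1st hidden layer
        Dense(128, activation="relu"),                            # 2nd hidden layer
        Dense(n_actions, activation="linear"),                    # one Q-value per action
    ])
    model.compile(loss="mse", optimizer=Adam(learning_rate=lr))
    return model
```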
Deep Reinforcement Learning: Advancing performance
Experience replay:
● Every tuple (s, a, r, s', done) is stored in a replay buffer (max length = 1M)
● Randomly sample a batch of 64 previous experiences to break the correlation between consecutive samples
● Predict the best action for all items in the batch via the neural network
● Update the neural network weights
● Generate episodes via exploration or exploitation
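A minimal sketch of such a replay buffer; the class and method names are illustrative, not taken from the project code:

```python
import random
from collections import deque

class ReplayBuffer:
    def __init__(self, maxlen=1_000_000):
        self.buffer = deque(maxlen=maxlen)  # oldest experiences are dropped first

    def store(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size=64):
        # Uniform random sampling breaks the correlation between
        # consecutive transitions within an episode.
        return random.sample(self.buffer, batch_size)
```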
Deep Reinforcement Learning: Advancing performance
● Calculating the loss between the output Q-value and the target Q-value requires a second pass through the network for the next state
● s and s' share the same network and are only one step apart
● Optimization becomes unstable
Target network: use a network identical to the policy network, but update the target network's weights only every C iterations (C is a hyperparameter)
● The first pass occurs with the policy network
● The second pass occurs with the target network
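A minimal sketch of the two-pass update, assuming Keras/NumPy and reusing the build_q_network helper from the earlier sketch (C and all names here are illustrative):

```python
import numpy as np

policy_net = build_q_network()                   # updated every training step
target_net = build_q_network()                   # frozen copy, synced every C steps
target_net.set_weights(policy_net.get_weights())

def train_on_batch(batch, gamma=0.99):
    states, actions, rewards, next_states, dones = map(np.array, zip(*batch))
    q_values = policy_net.predict(states)        # first pass: policy network on s
    q_next = target_net.predict(next_states)     # second pass: target network on s'
    targets = rewards + gamma * np.max(q_next, axis=1) * (1 - dones)
    q_values[np.arange(len(batch)), actions] = targets
    policy_net.fit(states, q_values, verbose=0)  # regress Q(s,a) toward the targets
```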
Deep Reinforcement Learning: Advancing performance
Abstract version of the implemented agent algorithm (sketched below)
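The original slide showed the algorithm as a figure; below is a hedged reconstruction of that abstract loop, reusing the env, ReplayBuffer, networks, and train_on_batch from the earlier sketches (all names and the episode count are illustrative, not from the project code):

```python
import random
import numpy as np

num_episodes = 500        # illustrative value, not taken from the slides
C = 1000                  # target-network sync period (hyperparameter)
buffer = ReplayBuffer(1_000_000)
epsilon, step = 1.0, 0

for episode in range(num_episodes):
    state = env.reset()
    done = False
    while not done:
        # Epsilon-greedy action selection: explore with probability epsilon, else exploit.
        if random.random() < epsilon:
            action = env.action_space.sample()
        else:
            action = int(np.argmax(policy_net.predict(state[np.newaxis])[0]))
        next_state, reward, done, _ = env.step(action)
        buffer.store(state, action, reward, next_state, done)

        if len(buffer.buffer) >= 64:
            train_on_batch(buffer.sample(64))    # learn from a random minibatch
        step += 1
        if step % C == 0:
            target_net.set_weights(policy_net.get_weights())
        state = next_state
    epsilon = max(0.01, epsilon * 0.99)          # decay exploration after each episode
```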
Deep Reinforcement Learning: Performance of Lunar Lander
Deep Reinforcement Learning: Performance of Lunar Lander
Adding a third hidden layer
Deep Reinforcement Learning: Hyperparameter Tuning

Hyperparameter             Value
Starting epsilon           1
Minimum epsilon            0.01
Decay factor of epsilon    0.99
Discount factor gamma      0.99
Learning rate              0.001
Batch size                 64
Replay buffer size         1000000
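The same values as a plain config dict, together with the epsilon decay rule they imply (a sketch; the variable names are not from the project code):

```python
config = {
    "epsilon_start": 1.0,
    "epsilon_min": 0.01,
    "epsilon_decay": 0.99,
    "gamma": 0.99,
    "learning_rate": 0.001,
    "batch_size": 64,
    "buffer_size": 1_000_000,
}

# Epsilon is multiplied by the decay factor after each episode,
# but never drops below the minimum.
epsilon = config["epsilon_start"]
for episode in range(3):
    epsilon = max(config["epsilon_min"], epsilon * config["epsilon_decay"])
    print(episode, round(epsilon, 4))   # 0.99, 0.9801, 0.9703
```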
Thank you. Questions?
Contact: pportokalakis@gmail.com