Mastering the game of Go with deep neural networks and tree search
David Silver et al., Google DeepMind
Article overview by Ilya Kuzovkin
Reinforcement Learning Seminar, University of Tartu, 2016
THE GAME OF GO
BOARD · STONES · GROUPS · LIBERTIES · CAPTURE · KO · EXAMPLES · TWO EYES · FINAL COUNT
TRAINING
TRAINING THE BUILDING BLOCKS
• Supervised policy network p_σ(a|s) (supervised learning, classification)
• Reinforcement policy network p_ρ(a|s) (reinforcement learning)
• Value network v_θ(s) (supervised learning, regression)
• Rollout policy network p_π(a|s)
• Tree policy network p_τ(a|s)
SUPERVISED POLICY NETWORK p_σ(a|s)

Architecture (sketched in code below):
• 19 x 19 x 48 input
• 1 convolutional layer 5x5 with k=192 filters, ReLU
• 11 convolutional layers 3x3 with k=192 filters, ReLU
• 1 convolutional layer 1x1, ReLU
• Softmax

Training:
• 29.4M positions from games between 6 and 9 dan players
• Augmented: 8 reflections/rotations
• Stochastic gradient ascent
• Learning rate α = 0.003, halved every 80M steps
• Batch size m = 16
• 3 weeks on 50 GPUs to make 340M steps

Results:
• Test set (1M positions) accuracy: 57.0%
• 3 ms to select an action
19 x 19 x 48 INPUT
[Figures illustrating the 48 input feature planes]
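To make the layer list above concrete, here is a minimal sketch of the supervised policy network. PyTorch, the "same"-style padding, and the class name SLPolicyNet are illustrative assumptions; the slide only specifies the layer sizes.

```python
import torch
import torch.nn as nn

class SLPolicyNet(nn.Module):
    """Minimal sketch of the SL policy network p_sigma(a|s) from the slide."""

    def __init__(self, in_planes=48, k=192):
        super().__init__()
        layers = [nn.Conv2d(in_planes, k, kernel_size=5, padding=2), nn.ReLU()]
        for _ in range(11):
            layers += [nn.Conv2d(k, k, kernel_size=3, padding=1), nn.ReLU()]
        # Final 1x1 convolution produces one plane of per-point move logits.
        # The slide also lists a ReLU here; it is omitted in this sketch so that
        # the logits feed the softmax directly.
        layers += [nn.Conv2d(k, 1, kernel_size=1)]
        self.body = nn.Sequential(*layers)

    def forward(self, s):
        # s: (batch, 48, 19, 19) feature planes -> (batch, 361) move probabilities
        logits = self.body(s).flatten(1)
        return torch.softmax(logits, dim=1)

# Usage sketch with the slide's batch size m = 16:
# probs = SLPolicyNet()(torch.zeros(16, 48, 19, 19))
```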
ROLLOUT POLICY p_π(a|s)
• Supervised, trained on the same data as p_σ(a|s)
• Less accurate: 24.2% (vs. 57.0%)
• Faster: 2 μs per action (about 1500 times faster)
• Just a linear model with softmax (see the sketch below)

TREE POLICY p_τ(a|s)
• “similar to the rollout policy but with more features”
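Since the rollout policy is just a linear model with a softmax, a hedged sketch is short. The class name, the use of PyTorch tensors, and the assumption that per-move pattern features are already extracted are illustrative choices, not taken from the slide.

```python
import torch

class RolloutPolicy:
    """Sketch of the linear-softmax rollout policy p_pi(a|s).

    Assumes pattern features have already been extracted for each legal move;
    the feature set itself is not specified on the slide.
    """

    def __init__(self, n_features):
        self.w = torch.zeros(n_features)  # weights fitted by supervised training

    def action_probs(self, move_features):
        # move_features: (n_legal_moves, n_features) binary feature matrix
        logits = move_features @ self.w   # one linear score per legal move
        return torch.softmax(logits, dim=0)
```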
REINFORCEMENT POLICY NETWORK p_ρ(a|s)
• Same architecture as the supervised policy network; weights are initialized with ρ = σ
• Self-play: current network vs. a randomized pool of previous versions
• Play a game until the end, get the reward z_t = ±r(s_T) = ±1
• Set z_t^i = z_t and play the same game again, this time updating the network parameters at each time step t (see the policy-gradient sketch below)
• Baseline v(s_t^i) = 0 “on the first pass through the training pipeline”, v_θ(s_t^i) “on the second pass”

Training:
• Batch size n = 128 games
• 10,000 batches
• One day on 50 GPUs

Results:
• 80% wins against the supervised network
• 85% wins against Pachi (no search yet!)
• 3 ms to select an action
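The per-time-step update described above is a REINFORCE-style policy gradient. Below is a hedged sketch assuming the policy is a module like the SLPolicyNet sketch earlier and that a standard PyTorch optimizer is used; the function name and tensor shapes are illustrative, not the authors' code.

```python
import torch

def reinforce_update(policy, optimizer, states, actions, z, baseline):
    """One policy-gradient step on a finished self-play game.

    states:   (T, 48, 19, 19) positions from the replayed game
    actions:  (T,) long tensor with the indices of the moves that were played
    z:        game outcome, +1 or -1, from the current player's point of view
    baseline: (T,) values v(s_t): zeros on the first pass, v_theta(s_t) on the second
    """
    optimizer.zero_grad()
    probs = policy(states)                                        # (T, 361)
    logp = torch.log(probs.gather(1, actions[:, None]).squeeze(1) + 1e-12)
    # Gradient ascent on (z - baseline) * log p_rho(a_t | s_t),
    # implemented as gradient descent on the negated objective.
    loss = -((z - baseline) * logp).mean()
    loss.backward()
    optimizer.step()
```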
VALUE NETWORK v_θ(s)

Architecture (sketched in code below):
• 19 x 19 x 49 input
• 1 convolutional layer 5x5 with k=192 filters, ReLU
• 11 convolutional layers 3x3 with k=192 filters, ReLU
• 1 convolutional layer 1x1, ReLU
• Fully connected layer, 256 ReLU units
• Fully connected layer, 1 tanh unit

• Evaluates the value of position s under policy p: v^p(s) = E[z_t | s_t = s, a_{t…T} ~ p]
• Double approximation: v_θ(s) ≈ v^{p_ρ}(s) ≈ v*(s)
• Stochastic gradient descent to minimize MSE
• Train on 30M state-outcome (s, z) pairs, each from a unique game generated by self-play:
  ‣ choose a random time step u
  ‣ sample moves t = 1…u-1 from the SL policy
  ‣ make a random move u
  ‣ sample t = u+1…T from the RL policy and get the game outcome z
  ‣ add the pair (s_u, z_u) to the training set
• One week on 50 GPUs to train on 50M batches of size m = 32
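A minimal sketch of the value network and one MSE training step, following the layer list on the slide; the use of PyTorch, the padding choices, and the function names are assumptions for illustration.

```python
import torch
import torch.nn as nn

class ValueNet(nn.Module):
    """Sketch of the value network v_theta(s) following the slide's layer list."""

    def __init__(self, in_planes=49, k=192):
        super().__init__()
        convs = [nn.Conv2d(in_planes, k, 5, padding=2), nn.ReLU()]
        for _ in range(11):
            convs += [nn.Conv2d(k, k, 3, padding=1), nn.ReLU()]
        convs += [nn.Conv2d(k, 1, 1), nn.ReLU()]      # 1x1 convolution, ReLU
        self.convs = nn.Sequential(*convs)
        self.head = nn.Sequential(
            nn.Flatten(),
            nn.Linear(19 * 19, 256), nn.ReLU(),       # fully connected, 256 ReLU units
            nn.Linear(256, 1), nn.Tanh())             # fully connected, 1 tanh unit

    def forward(self, s):
        # s: (batch, 49, 19, 19) -> scalar value estimate in (-1, 1) per position
        return self.head(self.convs(s)).squeeze(1)

def mse_step(net, optimizer, s, z):
    """One SGD step minimising the MSE between v_theta(s) and the game outcome z."""
    optimizer.zero_grad()
    loss = torch.mean((net(s) - z) ** 2)
    loss.backward()
    optimizer.step()
    return loss.item()
```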