Lecture 33 – Reinforcement Learning for Two-Player Games
Mark Hasegawa-Johnson, 4/2020
CC-BY 4.0: you may remix or redistribute if you cite the source
[Image: snapshot of a gnugo game, http://www.gnu.org/software/gnugo/]
Outline
• Review: minimax and alpha-beta
• Move ordering: policy network
• Evaluation function: value network
• Training the value network
• Exact training: endgames
• Stochastic training: Monte Carlo tree search
• Case study: AlphaGo
Minimax games
Let $t$ be the state of the game: a complete specification of the board, plus a statement of whose turn it is.
• If it is MAX's turn, and $D(t)$ is the set of children of $t$ (the states reachable in one move), then the value of the board is $V(t) = \max_{t' \in D(t)} V(t')$.
• If it is MIN's turn, then $V(t) = \min_{t' \in D(t)} V(t')$.
[Figure: example game tree with numeric leaf values.]
[Figure: the example game tree with minimax values backed up from the leaves, one level at a time; the root value is 6.]
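The max/min recursion above can be sketched in a few lines. This is a minimal illustration, not code from the lecture: the nested-list tree representation, the name `minimax`, and the values in `toy_tree` are all assumptions made for the example.

```python
from typing import Sequence, Union

# Assumed representation: a tree is either a number (a leaf value)
# or a list of child subtrees.
Tree = Union[float, Sequence["Tree"]]


def minimax(t: Tree, is_max: bool) -> float:
    """Return V(t): the max over children at MAX nodes, the min at MIN nodes."""
    if not isinstance(t, (list, tuple)):
        return t  # leaf: its value is given
    # The turn alternates, so every child is evaluated for the other player.
    child_values = [minimax(child, not is_max) for child in t]
    return max(child_values) if is_max else min(child_values)


# Hypothetical two-level tree: MAX moves at the root, MIN moves below it.
toy_tree = [[3, 5], [2, 9]]
root_value = minimax(toy_tree, is_max=True)  # max(min(3, 5), min(2, 9)) = 3
```

Every leaf is visited exactly once, which is where the cost discussed next comes from.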
Minimax complexity
• $c$ = branching factor
• $e$ = search depth
• Complexity = $O(c^e)$
Alpha-Beta Pruning
Each node has two internal meta-parameters, initialized from its parent:
• $\beta$ = highest value that MAX knows how to force MIN to accept
• $\gamma$ = lowest value that MIN knows how to force MAX to accept
• $\beta \le \gamma$
• Initial values: $\beta = -\infty$, $\gamma = \infty$
Alpha-Beta Pruning
If $t$ is a MAX node, then for each child $t' \in D(t)$:
• If you realize that $V(t') > \gamma(t)$, then prune all remaining children of $t$: MIN will never let us reach this node.
• Otherwise, if $V(t') > \beta(t)$, then set $\beta(t) = V(t')$. MIN might still choose $t$ (because $V(t') \le \gamma(t)$), in which case MAX can choose $t'$.
[Figure: in the example tree, the first MAX node examines its leaves and ends with value 6.]
Alpha-Beta Pruning
If $t$ is a MIN node, then for each child $t' \in D(t)$:
• If you realize that $V(t') < \beta(t)$, then prune all remaining children of $t$: MAX will never let us reach this node.
• Otherwise, if $V(t') < \gamma(t)$, then set $\gamma(t) = V(t')$. MAX might still choose $t$ (because $V(t') \ge \beta(t)$), in which case MIN can choose $t'$.
[Figure: in the example tree, the first MIN node sets $\gamma = 6$ after its first child returns 6.]
[Figure: the search continues under the MAX-node rule. With $\gamma = 6$ at the first MIN node, its second and third MAX children exceed that bound and their remaining leaves are pruned (marked XX). The root then sets $\beta = 6$.]
[Figure: final state of the example under the MIN-node rule. The root has $\beta = 6$, $\gamma = \infty$ and value 6; branches that were never evaluated are marked X.]
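The MAX-node and MIN-node rules can be combined into one recursive search. The sketch below is an illustration under stated assumptions, not the lecture's own code: `lo` and `hi` are names chosen here for the two bounds the slides call β and γ, trees are nested lists, and the `visited` list exists only to show which leaves are actually evaluated. The strict inequalities follow the slides.

```python
import math


def alphabeta(t, is_max, lo=-math.inf, hi=math.inf, visited=None):
    """Return the value of t, skipping children that cannot affect the result.

    lo: best value MAX can already force (beta on the slides).
    hi: best value MIN can already force (gamma on the slides).
    """
    if not isinstance(t, (list, tuple)):
        if visited is not None:
            visited.append(t)  # record every leaf that is actually evaluated
        return t
    best = -math.inf if is_max else math.inf
    for child in t:
        v = alphabeta(child, not is_max, lo, hi, visited)
        if is_max:
            best = max(best, v)
            if best > hi:  # MIN will never let play reach t
                break
            lo = max(lo, best)
        else:
            best = min(best, v)
            if best < lo:  # MAX will never let play reach t
                break
            hi = min(hi, best)
    return best


# Hypothetical two-level tree: MAX at the root, MIN below it.
toy_tree = [[3, 5], [2, 9]]
seen = []
root_value = alphabeta(toy_tree, is_max=True, visited=seen)
# root_value is 3 and seen is [3, 5, 2]: the leaf 9 is never evaluated.
```

After the first MIN node returns 3, the root's lower bound is 3. The second MIN node sees a leaf of 2, which is already below that bound, so its remaining leaf is pruned.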
Optimum node ordering
Imagine you had an oracle that could tell you which node to evaluate first. Which one should you evaluate first?
• Children of MAX nodes: evaluate the highest-value child first.
• Children of MIN nodes: evaluate the lowest-value child first.
Complexity of alpha-beta
If nodes are optimally ordered, then for each node $t$, we evaluate
• The $c$ children of its first child.
• The first child of each of its other $c - 1$ children.
Total complexity: $2c - 1 = O(c)$ per two levels.
• With $e$ levels, total complexity = $(2c - 1)^{e/2} = O(c^{e/2})$.
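One way to check the best-case estimate is to count leaf evaluations directly. The sketch below is an assumption-laden experiment, not part of the lecture: it builds a random uniform tree with branching factor `c` and depth `e`, plays the role of the oracle by sorting every node's children on their true minimax values, and then counts the leaves an alpha-beta search touches. The helper names `build_ordered` and `count_leaves` are hypothetical.

```python
import math
import random


def build_ordered(c, e, is_max, rng):
    """Return (tree, value) for a uniform random tree with best-first children."""
    if e == 0:
        v = rng.random()
        return v, v
    kids = [build_ordered(c, e - 1, not is_max, rng) for _ in range(c)]
    # Oracle ordering: highest value first at MAX nodes, lowest first at MIN nodes.
    kids.sort(key=lambda kv: kv[1], reverse=is_max)
    return [k[0] for k in kids], kids[0][1]


def count_leaves(t, is_max, lo=-math.inf, hi=math.inf):
    """Alpha-beta search returning (value, number of leaves evaluated)."""
    if not isinstance(t, list):
        return t, 1
    best = -math.inf if is_max else math.inf
    n = 0
    for child in t:
        v, k = count_leaves(child, not is_max, lo, hi)
        n += k
        if is_max:
            best = max(best, v)
            if best > hi:  # cutoff: MIN avoids this node
                break
            lo = max(lo, best)
        else:
            best = min(best, v)
            if best < lo:  # cutoff: MAX avoids this node
                break
            hi = min(hi, best)
    return best, n


c, e = 3, 4
tree, true_value = build_ordered(c, e, True, random.Random(0))
value, n_ordered = count_leaves(tree, True)
n_full = c ** e                      # leaves visited by plain minimax
n_estimate = (2 * c - 1) ** (e // 2)  # the (2c - 1)^(e/2) estimate
print(n_ordered, n_estimate, n_full)
```

For $c = 3$ and $e = 4$ the full tree has 81 leaves and the estimate is 25, so the ordered search should stay within the estimate and well below the full count.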
Optimal node ordering???!!!
How on Earth can we decide which child to evaluate first?
• “Children of MAX nodes: evaluate the highest-value child first.”
But if we knew which one had the highest value, we wouldn’t need to search the tree! We would already know the optimal move!
Optimal node ordering???!!!
• If we knew which child had the highest value, we wouldn’t need to search the tree! We would already know the optimal move!
• Solution: train a policy network, $\rho(t, b)$.
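A policy network can stand in for the oracle by supplying a search order. The sketch below shows only that ordering step, under assumptions that go beyond the slide: it reads $\rho(t, b)$ as a score for move $b$ in state $t$ from the point of view of the player to move, and it replaces the trained network with a hypothetical lookup table (`toy_scores`). The function name `order_moves` is also an assumption.

```python
from typing import Callable, Dict, Hashable, List


def order_moves(
    t: Hashable,
    moves: List[Hashable],
    policy: Callable[[Hashable, Hashable], float],
) -> List[Hashable]:
    """Sort candidate moves so the ones the policy rates highest are searched first."""
    return sorted(moves, key=lambda b: policy(t, b), reverse=True)


# Hypothetical stand-in for a trained policy network: fixed scores per move.
toy_scores: Dict[str, float] = {"a": 0.1, "b": 0.7, "c": 0.2}


def toy_policy(t: Hashable, b: Hashable) -> float:
    """Return the assumed score of move b; the state t is ignored in this toy."""
    return toy_scores[b]


ordered = order_moves("start", ["a", "b", "c"], toy_policy)
# ordered is ["b", "c", "a"]: the highest-scoring move is searched first.
```

Alpha-beta would then expand children in this order, so a good policy prunes more and a bad one costs only search time, not correctness.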