Bandit-based Search for Constraint Programming

  1. Bandit-based Search for Constraint Programming
     Manuel Loth (1, 2, 4), Michèle Sebag (2, 4, 1), Youssef Hamadi (3, 1), Marc Schoenauer (4, 2, 1), Christian Schulte (5)
     (1) Microsoft-INRIA joint centre; (2) LRI, Univ. Paris-Sud and CNRS; (3) Microsoft Research Cambridge; (4) INRIA Saclay; (5) KTH, Stockholm
     Review AERES, Nov. 2013. Laboratoire de Recherche en Informatique

  2. Search/Optimization and Machine Learning
     Different learning contexts
     ◮ Supervised (from examples) vs. Reinforcement (from reward)
     ◮ Off-line (static) vs. On-line (while searching)
     Here: use on-line Reinforcement Learning (MCTS) to improve CP search

  3. Main idea
     Constraint Programming
     ◮ Explore a search tree
     ◮ Heuristics: (learn to) order variables & values
     Monte-Carlo Tree Search
     ◮ A tree-search method
     ◮ Breakthrough for games and planning
     Hybridizing MCTS and CP: Bandit-based Search for Constraint Programming (BaSCoP)

  4. Overview: MCTS; BaSCoP; Experimental validation; Conclusions and Perspectives

  5. The Multi-Armed Bandit problem (Lai & Robbins, 85)
     In a casino, one wants to maximize one's gains while playing (lifelong learning).
     Exploration vs. Exploitation dilemma
     ◮ Play the best arm so far? Exploitation
     ◮ But there might exist better arms... Exploration

  6. The Multi-Armed Bandit problem (2)
     ◮ $K$ arms; the $i$-th arm gives reward 1 with probability $\mu_i$, 0 otherwise
     ◮ At each time $t$, one selects an arm $i^*_t$ and gets a reward $r_t$
     $n_{i,t}$ = number of times arm $i$ has been selected in $[0,t]$
     $\hat{\mu}_{i,t}$ = average reward of arm $i$ in $[0,t]$
     Upper Confidence Bound (Auer et al. 2002): be optimistic when facing the unknown
     Select $\arg\max_i \left\{ \hat{\mu}_{i,t} + C \sqrt{\frac{\log\left(\sum_j n_{j,t}\right)}{n_{i,t}}} \right\}$
     $\epsilon$-greedy: with probability $1 - \epsilon$, select $\arg\max_i \{ \hat{\mu}_{i,t} \}$ (exploitation); else select an arm uniformly (exploration)
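The two selection rules on this slide can be illustrated with a minimal Python sketch. It assumes Bernoulli arms as described above; the function names (`ucb_select`, `eps_greedy_select`, `play`), the default constants, and the choice to play every arm once before applying the UCB formula are illustrative assumptions, not part of the slides.

```python
import math
import random


def ucb_select(counts, means, c=1.0):
    """Pick the arm maximizing mean + c * sqrt(log(total plays) / plays of the arm)."""
    # Assumption: play every arm once first, so that no count is zero.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    log_total = math.log(sum(counts))
    return max(range(len(counts)),
               key=lambda i: means[i] + c * math.sqrt(log_total / counts[i]))


def eps_greedy_select(counts, means, eps=0.1, rng=random):
    """With probability 1 - eps exploit the best empirical arm, else explore uniformly."""
    if rng.random() < 1.0 - eps:
        return max(range(len(means)), key=lambda i: means[i])
    return rng.randrange(len(means))


def play(mus, steps, select, seed=0):
    """Simulate `steps` pulls on Bernoulli arms with success probabilities `mus`."""
    rng = random.Random(seed)
    counts = [0] * len(mus)   # n_{i,t}
    means = [0.0] * len(mus)  # empirical mean reward of each arm
    for _ in range(steps):
        arm = select(counts, means)
        reward = 1.0 if rng.random() < mus[arm] else 0.0
        counts[arm] += 1
        # Incremental update of the running average.
        means[arm] += (reward - means[arm]) / counts[arm]
    return counts, means
```

For instance, `play([0.2, 0.8], 2000, ucb_select)` returns the per-arm play counts and empirical means; passing `eps_greedy_select` instead runs the second rule on the same arms.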

  7. Monte-Carlo Tree Search (Kocsis & Szepesvári, 06)
     UCT = UCB for Trees: gradually grow the search tree
     ◮ Iterate Tree-Walk
     ◮ Building blocks
       ◮ Select next action: bandit phase
       ◮ Add a node: grow a leaf of the search tree
       ◮ Select next action (bis): random phase, roll-out
       ◮ Compute instant reward: evaluate
       ◮ Update information in visited nodes of the search tree: propagate
     ◮ Returned solution: path visited most often
     [Figure: the explored tree inside the search tree, with the bandit-based phase, the new node, and the random phase marked]
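The building blocks listed on this slide can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a fixed-depth tree with the same action set at every node and a reward computed from the complete path, and all names (`Node`, `tree_walk`, `most_visited_path`, `uct`) and the constant `c=0.7` are hypothetical.

```python
import math
import random


class Node:
    def __init__(self):
        self.children = {}  # action -> Node
        self.visits = 0
        self.mean = 0.0     # average reward of the walks through this node


def tree_walk(root, depth, actions, reward, c, rng):
    node, path, visited = root, [], [root]
    # Bandit phase: descend with UCB while every action already has a child.
    while len(path) < depth and len(node.children) == len(actions):
        parent = node
        a = max(actions, key=lambda x: parent.children[x].mean
                + c * math.sqrt(math.log(parent.visits) / parent.children[x].visits))
        node = parent.children[a]
        path.append(a)
        visited.append(node)
    # Add a node: grow one new leaf of the stored search tree.
    if len(path) < depth:
        a = rng.choice([x for x in actions if x not in node.children])
        node.children[a] = Node()
        node = node.children[a]
        path.append(a)
        visited.append(node)
    # Random phase (roll-out): complete the path uniformly at random.
    while len(path) < depth:
        path.append(rng.choice(actions))
    # Evaluate: compute the instant reward of the complete path.
    r = reward(path)
    # Propagate: update statistics in the visited nodes of the stored tree.
    for n in visited:
        n.visits += 1
        n.mean += (r - n.mean) / n.visits
    return r


def most_visited_path(root):
    """Returned solution: follow the most visited child at each level."""
    path, node = [], root
    while node.children:
        a = max(node.children, key=lambda x: node.children[x].visits)
        path.append(a)
        node = node.children[a]
    return path


def uct(depth, actions, reward, iterations, c=0.7, seed=0):
    rng = random.Random(seed)
    root = Node()
    for _ in range(iterations):
        tree_walk(root, depth, actions, reward, c, rng)
    return root, most_visited_path(root)
```

As a toy usage, `uct(4, [0, 1], lambda p: sum(p) / len(p), 500)` searches binary paths of length 4 whose reward is the fraction of ones; the most visited path would be expected to drift toward the all-ones path as iterations grow.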

  22. Overview: MCTS; BaSCoP; Experimental validation; Conclusions and Perspectives
