Reinforcement Learning: Course Outline

1. Reinforcement Learning: Course Outline
   ◮ Context
   ◮ Algorithms: value functions, optimal policy, temporal differences and eligibility traces, Q-learning
   ◮ Playing Go: MoGo
   ◮ Feature Selection as a Game: problem statement, Monte-Carlo Tree Search, the FUSE algorithm, experimental validation
   ◮ Active Learning as a Game: problem statement, the BAAL algorithm, experimental validation
   ◮ Constructive Induction

2. Go as an AI Challenge
   Features
   ◮ Number of games: 2 · 10^170 (∼ number of atoms in the universe)
   ◮ Branching factor: 200 (∼ 30 for chess)
   ◮ Assessing a game?
   ◮ Local and global features (symmetries, freedom, ...)
   Principles of MoGo (Gelly & Silver, 2007)
   ◮ A weak but unbiased assessment function: Monte-Carlo based
   ◮ Allowing the machine to play against itself and build its own strategy

3. Weak unbiased assessment: Monte-Carlo based (Brügmann, 1993)
   1. While possible, add a stone (white, black alternately)
   2. Compute Win(black)
   3. Average over repetitions of steps 1-2
   Remark: the point is to be unbiased. If there exist situations where you (wrongly) think you are in good shape, then you go there and find yourself in bad shape...
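
As a rough illustration, here is a minimal Python sketch of this Monte-Carlo assessment. The position object and its copy(), legal_moves(), play() and black_wins() methods are hypothetical stand-ins for a real Go engine, not part of the course material.

```python
import random

def monte_carlo_value(position, n_playouts=1000):
    """Weak but unbiased assessment: fraction of random playouts won by Black.

    `position` and its copy(), legal_moves(), play(move) and black_wins()
    methods are hypothetical stand-ins for a real Go engine.
    """
    wins = 0
    for _ in range(n_playouts):
        p = position.copy()
        # 1. While possible, add a stone (the engine is assumed to alternate colours)
        while p.legal_moves():
            p.play(random.choice(p.legal_moves()))
        # 2. Compute Win(black)
        wins += 1 if p.black_wins() else 0
    # 3. Average over the playouts
    return wins / n_playouts
```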

4. Build a strategy: Monte-Carlo Tree Search
   In a given situation: select a move          (Multi-Armed Bandit)
   In the end:
   1. Assess the final move                     (Monte-Carlo)
   2. Update the reward for all moves

5. Select a move: the Exploration vs Exploitation dilemma
   Multi-Armed Bandits (Lai & Robbins, 1985)
   ◮ In a casino, one wants to maximize one's gains while playing
   ◮ Play the best arms so far?            Exploitation
   ◮ But there might exist better arms...  Exploration

6. Multi-Armed Bandits, cont'd (Auer et al. 2001, 2002; Kocsis & Szepesvári 2006)
   For each arm (move):
   ◮ Reward: Bernoulli variable ∼ µ_i, 0 ≤ µ_i ≤ 1
   ◮ Empirical estimate: µ̂_i ± Confidence(n_i), with n_i the number of trials
   Decision: optimism in front of the unknown!
   Select i* = argmax_i [ µ̂_i + C · sqrt( log(Σ_j n_j) / n_i ) ]

7. Multi-Armed Bandits, cont'd (Auer et al. 2001, 2002; Kocsis & Szepesvári 2006)
   For each arm (move):
   ◮ Reward: Bernoulli variable ∼ µ_i, 0 ≤ µ_i ≤ 1
   ◮ Empirical estimate: µ̂_i ± Confidence(n_i), with n_i the number of trials
   Decision: optimism in front of the unknown!
   Select i* = argmax_i [ µ̂_i + C · sqrt( log(Σ_j n_j) / n_i ) ]
   Variants
   ◮ Take into account the standard deviation of µ̂_i
   ◮ Trade-off controlled by C
   ◮ Progressive widening
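
A minimal Python sketch of this UCB1 selection rule; the function name, the default value of C and the handling of never-played arms are my choices, not part of the slide.

```python
import math

def ucb1_select(counts, rewards, C=1.0):
    """Select i* = argmax_i [ mu_hat_i + C * sqrt(log(sum_j n_j) / n_i) ].

    counts[i]  : n_i, number of times arm i was played
    rewards[i] : sum of the (Bernoulli) rewards obtained on arm i
    An arm never played is returned first: optimism in front of the unknown.
    """
    total = sum(counts)
    best_arm, best_score = None, float("-inf")
    for i, n_i in enumerate(counts):
        if n_i == 0:
            return i
        mu_hat = rewards[i] / n_i
        score = mu_hat + C * math.sqrt(math.log(total) / n_i)
        if score > best_score:
            best_arm, best_score = i, score
    return best_arm
```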

8. Monte-Carlo Tree Search
   Comments: MCTS grows an asymmetric tree
   ◮ The most promising branches are explored more,
   ◮ thus their assessment becomes more precise
   ◮ Needs heuristics to deal with many arms...
   ◮ Share information among branches
   MoGo: world champion in 2006, 2007, 2009
   First to win against a 7th Dan player on a 19 × 19 board

9. Reinforcement Learning: Course Outline
   ◮ Context
   ◮ Algorithms: value functions, optimal policy, temporal differences and eligibility traces, Q-learning
   ◮ Playing Go: MoGo
   ◮ Feature Selection as a Game: problem statement, Monte-Carlo Tree Search, the FUSE algorithm, experimental validation
   ◮ Active Learning as a Game: problem statement, the BAAL algorithm, experimental validation
   ◮ Constructive Induction

10. When learning amounts to feature selection
    Bioinformatics
    ◮ 30,000 genes
    ◮ few examples (expensive to obtain)
    ◮ goal: find the relevant genes

11. Problem statement
    Goals
    • Selection: find a subset of features
    • Ranking: order the features
    Formulation
    Let 𝓕 = {f_1, ..., f_d} be the set of features, and let
        G : P(𝓕) → ℝ,   F ⊆ 𝓕 ↦ Err(F) = minimal error of the hypotheses built on F
    Find Argmin(G).
    Difficulties
    • A combinatorial optimization problem (2^d candidate subsets)
    • ... over an unknown function G (the generalization error)

12. Approaches
    Filter (univariate method)
    Define score(f_i); iteratively add the features maximizing the score, or iteratively remove the features minimizing it.
    + simple, cheap
    − very local optima
    Note: one can backtrack: better optima, but more expensive.
    Wrapper (multivariate method)
    Measure the quality of features in relation to other features: estimate G(f_i1, ..., f_ik).
    − expensive: one estimate = one learning problem
    + better optima
    Hybrid methods.

13. Filter approaches
    Notations
    Training set E = {(x_i, y_i), i = 1..n, y_i ∈ {−1, 1}}
    f(x_i) = value of feature f for example x_i
    Information gain (decision trees)
        p([f = v]) = Pr(y = 1 | f(x_i) = v)
        QI([f = v]) = −p log p − (1 − p) log(1 − p)
        QI(f) = Σ_v p(v) · QI([f = v])
    Correlation
        corr(f) = Σ_i f(x_i)·y_i / sqrt( Σ_i f(x_i)² × Σ_i y_i² )  ∝  Σ_i f(x_i)·y_i
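
A literal Python transcription of these two scores for a single feature, assuming a discrete feature for the information gain; the function names and the choice of natural logarithm are mine.

```python
import numpy as np

def information_gain_score(f_values, y):
    """QI(f) = sum_v p(v) * QI([f = v]), with QI([f = v]) the binary entropy
    of p = Pr(y = 1 | f = v), as on the slide."""
    f_values = np.asarray(f_values)
    y = (np.asarray(y) == 1).astype(float)
    score = 0.0
    for v in np.unique(f_values):
        mask = (f_values == v)
        p = y[mask].mean()                    # Pr(y = 1 | f = v)
        if 0.0 < p < 1.0:
            qi = -p * np.log(p) - (1.0 - p) * np.log(1.0 - p)
        else:
            qi = 0.0                          # pure value: zero entropy
        score += mask.mean() * qi             # weight by p(v)
    return score

def correlation_score(f_values, y):
    """corr(f) = sum_i f(x_i) y_i / sqrt(sum_i f(x_i)^2 * sum_i y_i^2)."""
    f_values = np.asarray(f_values, dtype=float)
    y = np.asarray(y, dtype=float)
    return f_values @ y / np.sqrt((f_values ** 2).sum() * (y ** 2).sum())
```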

14. Wrapper approaches
    Generate-and-test principle. Given a list of candidates L = {f_1, ..., f_p}:
    • Generate a candidate F
    • Compute G(F):
      • learn h_F from E restricted to F
      • test h_F on a test set, giving Ĝ(F)
    • Update L.
    Algorithms
    • hill-climbing / multiple restarts
    • genetic algorithms                              (Vafaie & DeJong, IJCAI 95)
    • (*) genetic programming & feature construction  (Krawiec, GPEH 01)
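
As a concrete sketch of the generate-and-test loop, here is a greedy forward wrapper using scikit-learn; the 5-fold cross-validation score as a stand-in for Ĝ(F) and the stopping criterion are assumptions, not the course's prescription.

```python
from sklearn.model_selection import cross_val_score

def forward_wrapper(X, y, learner, n_features):
    """Greedy generate-and-test wrapper: repeatedly add the feature whose
    addition yields the best cross-validated score (a cheap stand-in for
    the estimate G_hat(F) of the slide).
    """
    selected, remaining = [], list(range(X.shape[1]))
    while remaining and len(selected) < n_features:
        # generate the candidates F = selected ∪ {f} and test each of them
        scores = {f: cross_val_score(learner, X[:, selected + [f]], y, cv=5).mean()
                  for f in remaining}
        best = max(scores, key=scores.get)    # keep the best candidate, update L
        selected.append(best)
        remaining.remove(best)
    return selected
```

For example, forward_wrapper(X, y, LogisticRegression(max_iter=1000), 10) would return the indices of ten greedily selected features; each call to cross_val_score is indeed one learning problem, which is exactly the cost pointed out on the previous slide.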

15. A posteriori approaches
    Principle
    • Build hypotheses
    • Deduce the important features from them
    • Eliminate the others
    • Repeat
    Algorithm: SVM Recursive Feature Elimination          (Guyon et al. 03)
    • Linear SVM → h(x) = sign( Σ_i w_i · f_i(x) + b )
    • If |w_i| is small, f_i is not important
    • Eliminate the k features with minimal weight
    • Repeat.
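
A sketch of this idea with scikit-learn's LinearSVC for the binary case; the hyperparameters and the returned ranking convention are illustrative, not the exact setup of Guyon et al.

```python
import numpy as np
from sklearn.svm import LinearSVC

def svm_rfe(X, y, k=1):
    """SVM Recursive Feature Elimination, in the spirit of Guyon et al.

    Fit a linear SVM h(x) = sign(sum_i w_i f_i(x) + b), drop the k features
    with the smallest |w_i|, and repeat.  Returns the feature indices ranked
    from least to most important.
    """
    active = list(range(X.shape[1]))
    ranking = []
    while active:
        w = LinearSVC(dual=False).fit(X[:, active], y).coef_.ravel()
        order = np.argsort(np.abs(w))                 # smallest |w_i| first
        dropped = [active[i] for i in order[:k]]
        ranking.extend(dropped)                       # eliminated earliest = least important
        active = [f for f in active if f not in dropped]
    return ranking
```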

16. Limitations
    Linear hypotheses
    • One weight per feature.
    Quantity of examples
    • The feature weights are coupled.
    • The dimension of the system is tied to the number of examples.
    Yet the FS problem often arises precisely when there are not enough examples.

17. Some references
    ◮ Filter approaches [1]
    ◮ Wrapper approaches
      ◮ Tackling combinatorial optimization [2,3,4]
      ◮ Exploration vs Exploitation dilemma
    ◮ Embedded approaches
      ◮ Using the learned hypothesis [5,6]
      ◮ Using a regularization term [7,8]
      ◮ Restricted to linear models [7] or linear combinations of kernels [8]
    [1] K. Kira and L. A. Rendell, ML'92
    [2] D. Margaritis, NIPS'09
    [3] T. Zhang, NIPS'08
    [4] M. Boullé, J. Mach. Learn. Res. '07
    [5] I. Guyon, J. Weston, S. Barnhill, and V. Vapnik, Mach. Learn. 2002
    [6] J. Rogers and S. R. Gunn, SLSFS'05
    [7] R. Tibshirani, Journal of the Royal Statistical Society, '94
    [8] F. Bach, NIPS'08

18. Feature Selection
    Optimization problem
        Find F* = argmin_{F ⊆ 𝓕} Err(A, F, E)
    where
        𝓕 : set of features        F : feature subset
        E : training data set      A : machine learning algorithm
        Err : generalization error
    Feature Selection goals
    ◮ Reduced generalization error
    ◮ More cost-effective models
    ◮ More understandable models
    Bottlenecks
    ◮ Combinatorial optimization problem: find F ⊆ 𝓕
    ◮ Generalization error unknown

19. FS as a Markov Decision Process
    [Figure: lattice of feature subsets over {f_1, f_2, f_3}]
    Set of features 𝓕
    Set of states S = 2^𝓕
    Initial state ∅
    Set of actions A = { add f, f ∈ 𝓕 }
    Final state: any state
    Reward function V : S → [0, 1]
    Goal: find argmin_{F ⊆ 𝓕} Err(A(F, D))
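
A tiny Python sketch of this MDP structure, with toy feature names and deterministic transitions; everything here is illustrative, not the FUSE implementation.

```python
FEATURES = frozenset({"f1", "f2", "f3"})      # toy feature set 𝓕 (illustrative names)

def actions(state):
    """A = { add f, f ∈ 𝓕 }: from a subset, one may add any feature not yet selected."""
    return sorted(FEATURES - state)

def step(state, feature):
    """Deterministic transition: the next state is the enlarged feature subset."""
    return state | {feature}

initial_state = frozenset()                   # the empty subset ∅
# Any state may be declared final; its reward is then a [0, 1] estimate
# related to Err(A(F, D)), computed by the learning algorithm A on subset F.
```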

20. Optimal Policy
    Policy: π : S → A
    Final state reached by following policy π: F^π
    Optimal policy: π* = argmin_π Err(A(F^π, E))
    Bellman's optimality principle
        π*(F) = argmin_{f ∈ 𝓕} V*(F ∪ {f})
        V*(F) = Err(A(F))                    if final(F)
                min_{f ∈ 𝓕} V*(F ∪ {f})      otherwise
    In practice
    ◮ π* intractable ⇒ approximation using UCT
    ◮ Computing Err(F) using a fast estimate
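
For intuition, a naive Python transcription of this recursion, under the previous slide's convention that any state may be final (so a subset's value is the better of stopping now or adding one more feature); err() is a hypothetical stand-in for the fast error estimate, and the exponential cost is precisely why π* is approximated with UCT.

```python
def v_star(state, features, err):
    """Naive Bellman recursion: V*(F) = min( Err(F), min_f V*(F ∪ {f}) ).

    `state` and `features` are frozensets of feature names; `err(state)` is a
    hypothetical fast estimate of the generalization error.  Cost is O(2^d):
    this only makes the optimality principle concrete, it is not an algorithm.
    """
    value = err(state)                        # stopping here is allowed: any state is final
    for f in features - state:
        value = min(value, v_star(state | {f}, features, err))
    return value
```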

21. FS as a game
    Exploration vs Exploitation trade-off
    ◮ Virtually explore the whole lattice of feature subsets
    ◮ Gradually focus the search on the most promising subsets F
    ◮ Use a frugal, unbiased assessment of F
    How?
    ◮ Upper Confidence Tree (UCT) [1]
    ◮ UCT ⊂ Monte-Carlo Tree Search
    ◮ UCT tackles tree-structured optimization problems
    [1] L. Kocsis and C. Szepesvári, ECML'06

22. Reinforcement Learning: Course Outline
    ◮ Context
    ◮ Algorithms: value functions, optimal policy, temporal differences and eligibility traces, Q-learning
    ◮ Playing Go: MoGo
    ◮ Feature Selection as a Game: problem statement, Monte-Carlo Tree Search, the FUSE algorithm, experimental validation
    ◮ Active Learning as a Game: problem statement, the BAAL algorithm, experimental validation
    ◮ Constructive Induction

23. The UCT scheme
    ◮ Upper Confidence Tree (UCT) [1]
    ◮ Gradually grow the search tree
    ◮ Building blocks
      ◮ Select the next action (bandit-based phase, inside the search tree)
      ◮ Add a node (leaf of the search tree)
      ◮ Select the next actions again (random phase, outside the tree)
      ◮ Compute the instant reward
      ◮ Update the information in the visited nodes
    ◮ Returned solution: the path of the explored tree visited most often
    [1] L. Kocsis and C. Szepesvári, ECML'06
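
A compact Python sketch of one UCT iteration following these building blocks; the dictionary-based tree, the constant C and the children() / rollout() callbacks (legal Go moves and a random game, or add-a-feature and a fast error estimate for FS) are assumptions of this sketch, not the MoGo or FUSE implementation.

```python
import math

def uct_iteration(tree, root, children, rollout, C=1.4):
    """One iteration of the UCT scheme.

    tree     : dict mapping a node to [visit_count, total_reward]; must contain root
    children : callback returning the child nodes of a node (e.g. subsets F ∪ {f})
    rollout  : callback playing the random phase from a node, returning the instant reward
    """
    path, node = [root], root
    # 1. Bandit-based phase: descend the stored tree with the UCB rule
    while node in tree and children(node):
        kids = children(node)
        total = sum(tree[c][0] for c in kids if c in tree) + 1
        def ucb(c):
            if c not in tree or tree[c][0] == 0:
                return float("inf")           # optimism: try unvisited children first
            n, r = tree[c]
            return r / n + C * math.sqrt(math.log(total) / n)
        node = max(kids, key=ucb)
        path.append(node)
    # 2. Add a node: the first node reached outside the stored tree becomes a leaf
    tree.setdefault(node, [0, 0.0])
    # 3-4. Random phase and instant reward
    reward = rollout(node)
    # 5. Update the information in all visited nodes
    for n in path:
        if n in tree:
            tree[n][0] += 1
            tree[n][1] += reward
    return reward
```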
