Scale-free adaptive PLANNING for deterministic dynamics & discounted rewards
Peter Bartlett, Victor Gabillon, Jennifer Healey, Michal Valko
ICML, June 13th, 2019
An MCTS setting

MDP with starting state $x_0 \in X$, action space $A$, and $n$ interactions. At time $t$, playing $a_t$ in $x_t$ leads to
• deterministic dynamics $g$: $x_{t+1} \triangleq g(x_t, a_t)$,
• reward $r_t(x_t, a_t) + \varepsilon_t$, with $\varepsilon_t$ being the noise.

Objective: recommend an action $a(n)$ that minimizes the simple regret
$$r_n \triangleq \max_{a \in A} Q^\star(x_0, a) - Q^\star(x_0, a(n)),
\quad\text{where}\quad
Q^\star(x, a) \triangleq r(x, a) + \sup_\pi \sum_{t \ge 1} \gamma^t r(x_t, \pi(x_t)).$$

Assumption: $r_t \in [0, R_{\max}]$ and $|\varepsilon_t| \le b$.
Approach: explore without knowing the parameters $R_{\max}$ and $b$.
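A minimal sketch of this interaction protocol and of the simple-regret criterion, assuming hypothetical `g`, `r`, and `q_star` callables that stand in for the deterministic dynamics, the mean reward, and the optimal Q-value (the planner only ever sees noisy rewards, so `simple_regret` is an offline evaluation tool, not something the algorithm can compute):

```python
def discounted_return(x0, policy, g, r, gamma, horizon):
    """Truncated discounted sum of mean rewards along the trajectory induced by `policy`."""
    x, total = x0, 0.0
    for t in range(horizon):
        a = policy(x)
        total += (gamma ** t) * r(x, a)   # reward collected at step t, discounted
        x = g(x, a)                       # deterministic dynamics: x_{t+1} = g(x_t, a_t)
    return total

def simple_regret(x0, recommended_action, q_star, actions):
    """r_n = max_a Q*(x0, a) - Q*(x0, a(n)); requires Q*, hence offline only."""
    return max(q_star(x0, a) for a in actions) - q_star(x0, recommended_action)
```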
OLOP (Bubeck and Munos, 2010)

OLOP implements Optimistic Planning using an Upper Confidence Bound (UCB) on the Q-value of a sequence of $q$ actions $a_1, \ldots, a_q$:
$$Q_{\mathrm{UCB}}(a_{1:q}) \triangleq \underbrace{\sum_{h=1}^{q} \gamma^h \left( \hat r_h(t) + b \sqrt{\frac{1}{T_{a_h}(t)}} \right)}_{\text{estimation of observed rewards}} + \underbrace{\frac{R_{\max}\,\gamma^{q+1}}{1-\gamma}}_{\text{unseen rewards}}$$

In optimization under a fixed budget $n$, excellent strategies allocate samples to actions without knowing $R_{\max}$ or $b$.
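A minimal sketch of this UCB score, assuming per-depth arrays `reward_means[h]` and `visit_counts[h]` that hold the empirical mean reward and visit count of the $h$-th action of the sequence (illustrative placeholders, not OLOP's actual data structures):

```python
import math

def q_ucb(reward_means, visit_counts, gamma, b, r_max):
    """Optimistic value of an action sequence: observed part plus a bound on unseen rewards."""
    q = len(reward_means)
    observed = sum(
        gamma ** (h + 1) * (reward_means[h] + b * math.sqrt(1.0 / visit_counts[h]))
        for h in range(q)  # index h = 0, ..., q-1 corresponds to depths 1, ..., q
    )
    unseen = r_max * gamma ** (q + 1) / (1.0 - gamma)  # optimistic bound beyond depth q
    return observed + unseen
```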
Tree Search

[Figure: look-ahead tree rooted at $x_0$ with children $x_2, x_3, x_4$ reached via rewards $r_{02}, r_{03}, r_{04}$; following the path $x_0 \to x_3 \to x_5 \to x_6$ collects $r_{03}, r_{35}, r_{56}$ and gives $Q(x_6) = r_{03} + \gamma r_{35} + \gamma^2 r_{56}$.]

This is a zero-order optimization!
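A tiny sketch of the discounted value of a root-to-leaf path, as in the figure; the numeric rewards are made up:

```python
def path_value(rewards, gamma):
    """Discounted sum of the rewards collected along one path of the tree."""
    return sum(gamma ** depth * r for depth, r in enumerate(rewards))

gamma = 0.9
print(path_value([0.5, 0.2, 0.8], gamma))  # stands in for r03 + gamma*r35 + gamma^2*r56
```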
Black-box optimization: use the partitioning to explore f (uniformly)

[Figure: the domain of f is hierarchically partitioned into cells; uniform exploration refines the partition level by level, h = 0, 1, 2.]
Zipf exploration: open the best $n_h \approx n/h$ cells at depth $h$

[Figure: the number of cells opened per depth decreases with h, Zipf-style.]
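A minimal sketch of this Zipf-style schedule, assuming the depth-$h$ budget is simply $\lfloor n/h \rfloor$ (any exact normalization, e.g. a logarithmic factor that keeps the total within the budget $n$, is omitted):

```python
def zipf_schedule(n, max_depth):
    """Approximate number of cells to open at each depth h = 1, ..., max_depth."""
    return {h: n // h for h in range(1, max_depth + 1)}

print(zipf_schedule(n=100, max_depth=5))  # {1: 100, 2: 50, 3: 33, 4: 25, 5: 20}
```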
Noisy case
• need to pull each x more to limit the uncertainty,
• tradeoff: the more you pull each x, the shallower you can explore.
Noisy case: StroquOOL (Bartlett et al., 2019)
At depth h:
• order the cells by decreasing value, and
• open the i-th best cell with $m \approx n/(h\,i)$ estimations.
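A minimal sketch of this rank-dependent allocation, with the constants and logarithmic factors of the actual StroquOOL algorithm omitted:

```python
def pulls_per_cell(n, depth, num_cells):
    """Approximate number of noisy evaluations for the i-th best cell at a given depth."""
    return {i: n // (depth * i) for i in range(1, num_cells + 1)}

print(pulls_per_cell(n=1000, depth=2, num_cells=4))  # {1: 500, 2: 250, 3: 166, 4: 125}
```

This mirrors the tradeoff of the previous slide: highly ranked, shallow cells get many pulls (low uncertainty), while deep or low-ranked cells get few.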
Black-box optimization vs planning: Reuse of samples and γ

[Figure: the same look-ahead tree viewed as black-box optimization (only the value of a complete root-to-leaf sequence is observed at its leaf) and as planning (the rewards $r_{03}, r_{35}, r_{56}, \ldots$ are observed edge by edge, so samples are reused across sequences that share a prefix).]

How many samples near the root? $K^H$ samples near the root.

Lower regret for planning! (Bubeck & Munos, 2010)
Black-box optimization vs. planning: Reuse samples and take advantage of γ

[Figure: uniform exploration vs. Zipf exploration of the same tree; near the root the reward $r_{04}$ is sampled many times, and the two panels contrast not sharing vs. sharing this information across sequences.]

Bubeck & Munos: only for uniform strategies . . .
We figured out the amount of samples needed!
PlaTγPOOS

The power of PlaTγPOOS:
• implements Zipf exploration (StroquOOL) for MCTS,
• explicitly pulls an action at depth $h+1$ γ times less than an action at depth $h$ (since $Q^\star(x,a) = r(x,a) + \sup_\pi \sum_{t \ge 1} \gamma^t r(x_t, \pi(x_t))$ discounts deeper rewards),
• does not use UCB, and makes no use of $R_{\max}$ and $b$,
• improves over OLOP with adaptation to low noise and additional unknown smoothness,
• gets exponential speedups when no noise is present!
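A minimal sketch of the depth-discounted pull schedule from the second bullet, taken literally: each depth receives γ times fewer pulls per action than the previous one. The base count `m0` and the floor of one pull are illustrative choices, not PlaTγPOOS's actual constants.

```python
import math

def discounted_pulls(m0, gamma, max_depth):
    """Number of pulls per action at each depth h = 0, ..., max_depth."""
    return {h: max(1, math.floor(m0 * gamma ** h)) for h in range(max_depth + 1)}

print(discounted_pulls(m0=64, gamma=0.5, max_depth=6))
# {0: 64, 1: 32, 2: 16, 3: 8, 4: 4, 5: 2, 6: 1}
```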