
Policy Search - Hill Climbing - PowerPoint PPT Presentation

Policy Search - Hill Climbing; Policy Search - Genetic Search


  1. Policy Search - Hill Climbing (see the hill-climbing sketch after this list)

  2. Policy Search - Hill Climbing; Genetic Search

  3. Policy Search - Hill Climbing; Genetic Search (figure)

  4. Policy Search - CMA-ES (see the evolution-strategy sketch after this list)

  5. Gradient Bandits

  6. Gradient Bandits: just a scalar preference per arm - no states! But in full RL there are states, and the policy influences future states (see the gradient-bandit sketch after this list)

  7. Proof of the policy gradient theorem

  8. Proof of the policy gradient theorem: push the gradient inside the expectation; marginalize out the reward R, which (like the dynamics) is a constant w.r.t. the policy parameters, leaving an expectation involving Q

  9. Proof of the policy gradient theorem (cont.): expanding V(s) creates a deeply nested computation - at every step you would have to compute every state you could have been in to get here. Transform it into a simple sum over time steps and states: what is the total probability of being in each state at each time step?

  10. Proof of the policy gradient theorem (cont.): normalized version - turn those total probabilities into a normalized state distribution (the derivation is written out after this list)

  11. REINFORCE: not all actions - approximate Q with a sampled return (see the REINFORCE sketch after this list)

  12. REINFORCE (continued)

  13. REINFORCE (continued)

  14. Gradient Bandits + Baseline: subtract the mean of the samples - the baseline term has zero expectation

  15. REINFORCE with Baseline! Gradient Bandits + Baseline: the mean of the samples as baseline - zero expectation (see the baseline derivation after this list)

  16. Actor only - policy search: directly parameterized policy - no value functions (except the baseline in REINFORCE) - continuous actions are natural to represent - high variance - no bootstrapping - scales with policy complexity, not with the size of the state space

  17. Critic only vs Actor only. Critic only - value function methods: policy is indirect, via the value function - discrete actions only - lower variance - bootstrapping - scales with the size of the state space. Actor only - policy search: directly parameterized policy - no value functions (except the baseline in REINFORCE) - continuous actions are natural to represent - high variance - no bootstrapping - scales with policy complexity, not with the size of the state space

  18. Critic only vs Actor only vs Actor-Critic. Actor-Critic - policy search + value function: benefits of both! - directly parameterized policy - bootstrapping - scales primarily with policy complexity

  19. Critic only vs Actor only vs Actor-Critic (cont.): many of the most popular contemporary methods are actor-critic - Proximal Policy Optimization (PPO) - A3C - Soft Actor-Critic - DDPG (see the actor-critic sketch after this list)
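
The hill climbing of slides 1-3 can be sketched in a few lines. This is a minimal illustration, not the slides' own code: the `evaluate_policy` callback (e.g. the average return over a few rollouts), the Gaussian perturbation, and all constants are assumptions.

```python
import numpy as np

def hill_climb(evaluate_policy, n_params, n_iters=200, noise_scale=0.1, seed=0):
    """Hill climbing in policy-parameter space: perturb the current parameters
    and keep the perturbation only if the estimated return improves."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(n_params)                # current policy parameters
    best_return = evaluate_policy(theta)      # assumed: runs rollouts, returns mean return
    for _ in range(n_iters):
        candidate = theta + noise_scale * rng.standard_normal(n_params)
        candidate_return = evaluate_policy(candidate)
        if candidate_return >= best_return:   # greedy: accept only improvements
            theta, best_return = candidate, candidate_return
    return theta, best_return
```

Because only perturbations of a single current point are ever accepted, this stalls at local optima, which is what motivates the population-based genetic and CMA-ES variants on the next slides.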
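Slide 4 names CMA-ES. Full CMA-ES adapts a covariance matrix and step size; the sketch below is a deliberately simplified (mu, lambda) evolution strategy with a fixed isotropic Gaussian, shown only to illustrate the population-based idea. `evaluate_policy` and all constants are assumptions.

```python
import numpy as np

def simple_es(evaluate_policy, n_params, n_generations=50,
              population=32, elite=8, sigma=0.1, seed=0):
    """Simplified (mu, lambda) evolution strategy. CMA-ES additionally adapts a
    full covariance matrix and the step size; here the search distribution is a
    fixed-scale isotropic Gaussian centered on the mean of the elite samples."""
    rng = np.random.default_rng(seed)
    mean = np.zeros(n_params)
    for _ in range(n_generations):
        samples = mean + sigma * rng.standard_normal((population, n_params))
        returns = np.array([evaluate_policy(s) for s in samples])
        elite_idx = np.argsort(returns)[-elite:]   # indices of the best candidates
        mean = samples[elite_idx].mean(axis=0)     # recombine: average the elites
    return mean
```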
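A sketch of the gradient bandit of slides 5-6: one scalar preference per arm, softmax action selection, and a running average-reward baseline (the standard formulation; the `pull_arm` reward callback is an assumption).

```python
import numpy as np

def gradient_bandit(pull_arm, n_arms, n_steps=1000, alpha=0.1, seed=0):
    """Gradient bandit: one scalar preference per arm (no states), softmax
    action selection, and an average-reward baseline."""
    rng = np.random.default_rng(seed)
    H = np.zeros(n_arms)                 # preferences, one scalar per arm
    avg_reward = 0.0                     # running baseline
    for t in range(1, n_steps + 1):
        pi = np.exp(H - H.max())
        pi /= pi.sum()                   # softmax over preferences
        a = rng.choice(n_arms, p=pi)
        r = pull_arm(a)                  # assumed: returns the sampled reward
        avg_reward += (r - avg_reward) / t
        one_hot = np.zeros(n_arms)
        one_hot[a] = 1.0
        H += alpha * (r - avg_reward) * (one_hot - pi)   # preference update
    return H
```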
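The derivation sketched on slides 7-10, written in standard notation (the slides' own symbols are only partly legible): push the gradient into the expectation via the log-derivative trick, then rewrite the deeply nested expansion as a sum over the state visitation distribution d^{pi_theta}, i.e. the (normalized) total probability of being in each state at each time step.

```latex
\begin{align*}
\nabla_\theta J(\theta)
  &= \nabla_\theta\, \mathbb{E}_{\tau \sim \pi_\theta}\big[R(\tau)\big]
   = \mathbb{E}_{\tau \sim \pi_\theta}\Big[R(\tau)\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\Big] \\
  &\propto \sum_s d^{\pi_\theta}(s) \sum_a \nabla_\theta \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)
   = \mathbb{E}_{s \sim d^{\pi_\theta},\, a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s,a)\big]
\end{align*}
```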
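REINFORCE (slides 11-13) replaces Q(s, a) with the sampled return G from the visited state-action pair, so only the actions actually taken are updated. A minimal tabular-softmax sketch, assuming a hypothetical `env_rollout(policy, rng)` helper that returns a list of (state, action, reward) triples:

```python
import numpy as np

def reinforce(env_rollout, n_states, n_actions,
              n_episodes=500, alpha=0.01, gamma=0.99, seed=0):
    """REINFORCE: sample a whole episode with the current softmax policy, then
    for each visited (s, a) use the sampled return G in place of Q(s, a)."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_states, n_actions))   # tabular softmax preferences

    def policy(s):
        p = np.exp(theta[s] - theta[s].max())
        return p / p.sum()

    for _ in range(n_episodes):
        episode = env_rollout(policy, rng)    # assumed: list of (state, action, reward)
        G = 0.0
        for s, a, r in reversed(episode):     # compute returns backwards
            G = r + gamma * G
            pi = policy(s)
            grad_log = -pi                    # d/d theta[s, b] of log pi(a|s) = 1[a=b] - pi(b|s)
            grad_log[a] += 1.0
            # Monte Carlo policy gradient step (the gamma**t factor of the strict
            # episodic derivation is omitted here, as is common in practice)
            theta[s] += alpha * G * grad_log
    return theta
```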
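The "zero expectation" claim on slides 14-15 is the standard baseline argument: any action-independent baseline b(s) (for example the mean of the sampled returns, or a learned state value) contributes nothing to the expected gradient, so subtracting it changes only the variance.

```latex
\mathbb{E}_{a \sim \pi_\theta(\cdot \mid s)}\big[b(s)\,\nabla_\theta \log \pi_\theta(a \mid s)\big]
  = b(s) \sum_a \pi_\theta(a \mid s)\,\frac{\nabla_\theta \pi_\theta(a \mid s)}{\pi_\theta(a \mid s)}
  = b(s)\,\nabla_\theta \sum_a \pi_\theta(a \mid s)
  = b(s)\,\nabla_\theta 1
  = 0
```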
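Slides 17-19 contrast critic-only and actor-only methods and then combine them. A one-step actor-critic sketch in the same tabular style, where the critic's TD error replaces the Monte Carlo return in the actor update (the `env_reset`/`env_step` interface is an assumption):

```python
import numpy as np

def one_step_actor_critic(env_reset, env_step, n_states, n_actions,
                          n_episodes=500, alpha_actor=0.01, alpha_critic=0.1,
                          gamma=0.99, seed=0):
    """One-step actor-critic: the critic V(s) bootstraps a TD target, and the
    TD error drives both the critic update and the actor's gradient step."""
    rng = np.random.default_rng(seed)
    theta = np.zeros((n_states, n_actions))   # actor: softmax preferences
    V = np.zeros(n_states)                    # critic: state values

    def policy(s):
        p = np.exp(theta[s] - theta[s].max())
        return p / p.sum()

    for _ in range(n_episodes):
        s, done = env_reset(), False          # assumed: returns an initial state index
        while not done:
            a = rng.choice(n_actions, p=policy(s))
            s_next, r, done = env_step(s, a)  # assumed: (next state, reward, done flag)
            td_target = r + (0.0 if done else gamma * V[s_next])
            td_error = td_target - V[s]       # bootstrapped advantage estimate
            V[s] += alpha_critic * td_error   # critic update
            pi = policy(s)
            grad_log = -pi
            grad_log[a] += 1.0
            theta[s] += alpha_actor * td_error * grad_log   # actor update
            s = s_next
    return theta, V
```

Methods such as PPO, A3C, Soft Actor-Critic, and DDPG elaborate this actor-plus-critic split with function approximation and different update rules.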
