
Policy Certificates: Towards Accountable Reinforcement Learning

Policy Certificates: Towards Accountable Reinforcement Learning. Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill (CMU, Google Research, Stanford University)


  1. Policy Certificates: Towards Accountable Reinforcement Learning. Christoph Dann, Lihong Li, Wei Wei, Emma Brunskill (CMU, Google Research, Stanford University)

  2. Minimax-Optimal PAC Bounds. Key contribution: a new algorithm for episodic tabular MDPs with a PAC bound and a high-probability regret bound. Notation: S: #states, A: #actions, H: episode length, T: #episodes, ϵ: accuracy

  3. Minimax-Optimal PAC Bounds. Key contribution: a new algorithm for episodic tabular MDPs with a PAC bound and a high-probability regret bound. PAC bound: first minimax-optimal (for small ϵ)! Notation: S: #states, A: #actions, H: episode length, T: #episodes, ϵ: accuracy. Prior work: [DLB '17], [DB '15]

  4. Minimax-Optimal PAC Bounds. Key contribution: a new algorithm for episodic tabular MDPs with a PAC bound and a high-probability regret bound. PAC bound: first minimax-optimal (for small ϵ)! Regret bound: matches existing bounds and improves on them for large H. Notation: S: #states, A: #actions, H: episode length, T: #episodes, ϵ: accuracy. Prior work: $\sqrt{SAH^2T} + \sqrt{H^3T} + S^2AH^2$ [AOM '17], [DLB '17], [DB '15]

  5. Motivation: Need for Accountability in Online RL. Even with PAC and regret bounds, the expected return in the next episode during learning is unknown.

  6. Motivation: Need for Accountability in Online RL. How good will my treatment be? Is it the best possible?

  7. Our Proposal: Algorithms output policy certificates before each episode
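One way to read the proposal on slide 7 is as a per-episode guarantee on the suboptimality of the policy about to be played. The statement below is a sketch under assumed notation that does not appear on the slides: $\pi_k$ is the policy in episode $k$, $s_{k,1}$ its initial state, $V^{\star}_1$ and $V^{\pi_k}_1$ the optimal and current-policy values at the first time step, $\epsilon_k$ the certificate, and $\delta$ a failure probability.

```latex
% Assumed formalization: the certificate \epsilon_k, announced before
% episode k, bounds how far \pi_k is from optimal in that episode.
\[
  \Pr\Big[\, \forall k:\;
    V^{\star}_1(s_{k,1}) - V^{\pi_k}_1(s_{k,1}) \;\le\; \epsilon_k
  \,\Big] \;\ge\; 1 - \delta
\]
```

Under this reading, the questions on slide 6 have direct answers: a small $\epsilon_k$ says the upcoming episode is close to the best possible.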

  8. Algorithms with Policy Certificates. Natural extension of model-based optimistic algorithms: (1) UCB on optimal value function; (2) greedy policy; (3) LCB on value function of current policy; (4) output certificate.
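The four steps on slide 8 can be sketched for a tabular episodic MDP as follows. This is a minimal illustration, not the algorithm from the talk: the function name `certify_policy`, the array layout, and the Hoeffding-style confidence width are all assumptions made for the example, and the talk's actual exploration bonuses are not recoverable from the slides.

```python
import numpy as np


def certify_policy(counts, reward_sums, H, delta=0.1, s0=0):
    """Return a greedy optimistic policy and a certificate for it.

    counts[s, a, s2]  : observed transition counts (hypothetical layout)
    reward_sums[s, a] : summed observed rewards, each reward in [0, 1]
    H                 : episode length, delta: failure probability
    """
    S, A, _ = counts.shape
    n = counts.sum(axis=2)                   # visits to each (s, a)
    n_safe = np.maximum(n, 1)                # avoid division by zero
    P_hat = counts / n_safe[:, :, None]      # empirical transition model
    r_hat = reward_sums / n_safe             # empirical mean reward
    # Assumed Hoeffding-style width; NOT the bonus used in the talk.
    width = H * np.sqrt(np.log(2 * S * A * H / delta) / (2 * n_safe))

    V_up = np.zeros((H + 1, S))              # upper bounds on optimal value
    V_lo = np.zeros((H + 1, S))              # lower bounds on policy value
    policy = np.zeros((H, S), dtype=int)
    for h in range(H - 1, -1, -1):           # backward induction
        max_return = H - h                   # returns lie in [0, H - h]
        # Step 1: UCB on the optimal value function.
        Q_up = np.clip(r_hat + P_hat @ V_up[h + 1] + width, 0, max_return)
        # Step 2: greedy policy with respect to the upper bound.
        policy[h] = Q_up.argmax(axis=1)
        V_up[h] = Q_up.max(axis=1)
        # Step 3: LCB on the value function of that same policy.
        Q_lo = np.clip(r_hat + P_hat @ V_lo[h + 1] - width, 0, max_return)
        V_lo[h] = Q_lo[np.arange(S), policy[h]]
    # Step 4: the certificate is the gap between the two bounds.
    certificate = V_up[0, s0] - V_lo[0, s0]
    return policy, certificate
```

Calling `certify_policy(np.zeros((2, 2, 2)), np.zeros((2, 2)), H=3)` before any data is collected yields the trivial certificate `H`; as visit counts grow the width shrinks and the certificate tightens toward zero.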


  10. Symbiosis of Optimism and Certificates Certificates: • Challenge: random • Insight from optimism: at known rate

  11. Symbiosis of Optimism and Certificates. Certificates: • Challenge: random • Insight from optimism: at known rate. Optimism: • Challenge: exploration bonus depends on • Insight from certificates: bound by

  12. Symbiosis of Optimism and Certificates. Certificates: • Challenge: random • Insight from optimism: at known rate. Optimism: • Challenge: exploration bonus depends on • Insight from certificates: bound by. Takeaways: more accountable algorithms through accurate policy certificates; better exploration bonuses yield minimax-optimal PAC & regret bounds.
