
DAC: The Double Actor-Critic Architecture for Learning Options



  1. DAC: The Double Actor-Critic Architecture for Learning Options NeurIPS 2019 Shangtong Zhang, Shimon Whiteson Presenter: Ehsan Mehralian March 17, 2020

  2. Outline • Problem statement • Option-Critic • Double Actor-Critic

  3. Problem statement • Temporal abstraction is a key component in RL: • Better exploration • Faster learning • Better generalization • Transfer learning • MDP + temporally abstract actions = SMDP • SMDP algorithms are data inefficient → the option framework (Sutton et al., 1999; see the sketch after this slide) • This raises two problems: • Learning options • Learning a master policy
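To make the option framework mentioned above concrete, here is a minimal sketch of an option as a data structure in the sense of Sutton et al. (1999): a triple of an initiation set, an intra-option policy, and a termination function. The class and field names are illustrative assumptions, not from the paper.

```python
from dataclasses import dataclass
from typing import Callable, Set

# Illustrative types for a tabular setting (assumption, not from the paper).
State = int
Action = int

@dataclass
class Option:
    """An option in the sense of Sutton et al. (1999)."""
    initiation_set: Set[State]             # states in which the option may be invoked
    policy: Callable[[State], Action]      # intra-option policy pi_omega: s -> a
    termination: Callable[[State], float]  # beta_omega(s): probability of terminating in s

    def can_start(self, s: State) -> bool:
        return s in self.initiation_set
```

The master policy then chooses among such options instead of primitive actions, which is exactly where the two learning problems on the slide (learning the options and learning the master policy) come from.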

  4. Previous works • Based on finding subgoals: • Difficult to scale up • Can be as expensive as solving the entire task • Using value-based methods: • Cannot cope with large action spaces • Policy-based methods have better convergence properties with function approximation

  5. The Option-Critic framework (Bacon et al., 2017) • Blurs the line between discovering options and learning options • The first scalable end-to-end approach • No slowdown within a single task • Faster convergence in transfer learning

  6. Background
  • MDP: $M \equiv \{S, A, R(s, a), P(s' \mid s, a), P_0(s), \gamma\}$
  • Goal: $\pi^* = \arg\max_\theta \rho(\pi_\theta) = \arg\max_\theta \, \mathbb{E}_{\tau \sim p_\theta(\tau)}\!\left[ \sum_{t=1}^{\infty} \gamma^{t-1} r_t \,\middle|\, s_0, \pi_\theta \right]$
  • Policy gradient: $\frac{\partial \rho}{\partial \theta} = \sum_s d^\pi(s) \sum_a \frac{\partial \pi(s, a)}{\partial \theta} \, Q^\pi(s, a)$, where $d^\pi(s) = \sum_{t=0}^{\infty} \gamma^t \Pr(s_t = s \mid s_0, \pi)$ (a sampling-based estimate of this gradient is sketched below)
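The policy-gradient expression above can be estimated from sampled trajectories in the usual REINFORCE style. The sketch below is a minimal illustration of that estimator, not the paper's implementation: it assumes a tabular softmax policy and a Gym-like `env` object with `reset()` and `step(a)` returning `(next_state, reward, done)`.

```python
import numpy as np

# Monte-Carlo (REINFORCE-style) estimate of the policy gradient on the slide:
# grad rho(theta) = E[ sum_t gamma^t * grad log pi(a_t | s_t) * G_t ],
# where G_t estimates Q^pi(s_t, a_t) and gamma^t accounts for d^pi(s).

def softmax_policy(theta, s):
    prefs = theta[s]                          # action preferences in state s
    probs = np.exp(prefs - prefs.max())
    return probs / probs.sum()

def reinforce_gradient(env, theta, gamma=0.99, horizon=200):
    # Sample one trajectory under pi_theta.
    trajectory = []
    s = env.reset()
    for _ in range(horizon):
        probs = softmax_policy(theta, s)
        a = np.random.choice(len(probs), p=probs)
        s_next, r, done = env.step(a)         # assumed Gym-like interface
        trajectory.append((s, a, r))
        s = s_next
        if done:
            break

    # Accumulate the gradient backwards, building up the return G_t.
    grad = np.zeros_like(theta)
    G = 0.0
    for t in reversed(range(len(trajectory))):
        s_t, a_t, r_t = trajectory[t]
        G = r_t + gamma * G                    # Monte-Carlo estimate of Q^pi(s_t, a_t)
        probs = softmax_policy(theta, s_t)
        grad_log = -probs                      # d log pi(a_t | s_t) / d theta[s_t]
        grad_log[a_t] += 1.0
        grad[s_t] += (gamma ** t) * grad_log * G
    return grad                                # ascend: theta += alpha * grad
```

DAC and Option-Critic both build on this kind of gradient, but applied to the augmented (option-level and intra-option) policies rather than a single flat policy.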
