Efficient Exploration by Novelty Pursuit Ziniu Li ziniuli@link.cuhk.edu.cn The Chinese University of Hong Kong, Shenzhen & Polixir Joint work with Xiong-Hui Chen, Nanjing University International Conference on Distributed Artificial Intelligence (DAI), 2020 October 12, 2020 Ziniu Li (CUHKSZ & Polixir) Efficient Exploration by Novelty Pursuit October 12, 2020 1 / 30
Overview
Introduction
Proposed Method
Experiment
Conclusion
Reinforcement Learning
◮ RL is a learning paradigm in which an agent interacts with an unknown environment to find optimal decisions.
[Figure: agent-environment interaction loop labeled with action, reward, and observation.]
◮ RL learns directly from stochastic feedback (s, a, r, s′).
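The interaction loop on this slide can be sketched in a few lines of Python. This is only an illustration: `ChainEnv`, its `slip` parameter, and the `collect` helper are hypothetical names invented here, not part of the talk. The point is that the agent only ever sees sampled tuples (s, a, r, s′), never the environment's transition rule.

```python
import random
from typing import Callable, List, NamedTuple


class Transition(NamedTuple):
    s: int        # state before acting
    a: int        # action taken (0 = left, 1 = right)
    r: float      # reward received
    s_next: int   # state after acting


class ChainEnv:
    """Hypothetical toy environment: a chain of states 0..n_states-1.

    With probability `slip` the chosen action is flipped, which makes the
    feedback stochastic. Reward 1 is given on reaching the last state.
    """

    def __init__(self, n_states: int = 5, slip: float = 0.1, seed: int = 0):
        self.n_states = n_states
        self.slip = slip
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self) -> int:
        self.state = 0
        return self.state

    def step(self, action: int):
        if self.rng.random() < self.slip:
            action = 1 - action  # the environment ignores the agent's choice
        move = 1 if action == 1 else -1
        self.state = min(max(self.state + move, 0), self.n_states - 1)
        reward = 1.0 if self.state == self.n_states - 1 else 0.0
        done = reward > 0
        return self.state, reward, done


def collect(env: ChainEnv, policy: Callable[[int], int],
            max_steps: int = 20) -> List[Transition]:
    """Run one episode and record the (s, a, r, s') tuples the agent observes."""
    transitions = []
    s = env.reset()
    for _ in range(max_steps):
        a = policy(s)
        s_next, r, done = env.step(a)
        transitions.append(Transition(s, a, r, s_next))
        s = s_next
        if done:
            break
    return transitions
```

For example, `collect(ChainEnv(slip=0.0), lambda s: 1)` returns four transitions ending with reward 1.0; with `slip > 0` the same policy yields different tuples from run to run, which is the stochastic feedback the agent has to learn from.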
Exploration vs. Exploitation
◮ In an unknown environment, the agent is uncertain about the possible outcomes.
◮ Exploration: investigate unknown actions, which may bring large returns or unexpected losses.
◮ Exploitation: take well-known but possibly sub-optimal actions.
Towards Efficient Exploration
◮ Simple exploration strategies based on “dithering” are inefficient.
• ε-greedy and Boltzmann strategies require almost O(2^N) samples to make progress on deep-sea [Osband et al., 2019].
Figure 2: Deep-sea. Figure from [Osband et al., 2019]. There are only two actions, and the reward is given only at the bottom-right corner.
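A rough sketch of where the exponential cost comes from, under a simplified deep-sea model assumed here (not taken from [Osband et al., 2019] verbatim): the agent descends one row per step for N steps and collects the reward only if it chooses "right" every time. The function names below are illustrative.

```python
import random


def run_episode(n, policy, rng):
    """Simplified deep-sea (an assumption of this sketch).

    The agent starts in column 0 and descends one row per step for n steps.
    Action 1 moves right, action 0 moves left (clipped at column 0).
    The reward is 1 only if the agent ends in column n, i.e. it went right
    at every single step.
    """
    col = 0
    for row in range(n):
        action = policy(row, col, rng)
        col = col + 1 if action == 1 else max(col - 1, 0)
    return 1.0 if col == n else 0.0


def uniform_policy(row, col, rng):
    """Pure dithering: pick left or right with equal probability."""
    return rng.randrange(2)


def success_probability(n, p_right=0.5):
    """Chance of n consecutive 'right' moves when each is taken w.p. p_right."""
    return p_right ** n


def expected_episodes(n, p_right=0.5):
    """Expected number of episodes until the first rewarded one (geometric)."""
    return 1.0 / success_probability(n, p_right)
```

With uniform dithering, `expected_episodes(10)` is 1024 and the count doubles with every extra row. If an ε-greedy agent's greedy action happens to be "left", the per-step chance of going right drops to ε/2 (pass it as `p_right`), so the expected number of episodes grows even faster.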
Towards Efficient Exploration
◮ Theoretically, efficient exploration requires “writing off” actions that are known to be inferior.
◮ Two general principles are borrowed from bandits: optimism in the face of uncertainty (OFU) and Thompson sampling (TS).
• OFU: add a “reward bonus” by constructing upper confidence bounds [Stadie et al., 2015, Pathak et al., 2017, Burda et al., 2019b].
• TS: sample plausible actions from an iteratively updated posterior distribution [Osband et al., 2016a,b, O’Donoghue et al., 2018].
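In the bandit setting these two principles come from, each reduces to a short action-selection rule. The sketch below uses UCB1 for OFU and a Beta-Bernoulli posterior for TS; both are standard textbook instances chosen here for illustration, not the deep-RL methods cited on the slide, and the function names are hypothetical.

```python
import math
import random


def ucb1_select(counts, means, t):
    """OFU: pick the arm with the largest upper confidence bound.

    counts[a] is how often arm a was pulled, means[a] its average reward,
    and t the total number of pulls so far. The sqrt term acts as a bonus
    that shrinks as an arm becomes well known.
    """
    for arm, count in enumerate(counts):
        if count == 0:
            return arm  # try every arm once before trusting the bounds
    return max(
        range(len(counts)),
        key=lambda a: means[a] + math.sqrt(2.0 * math.log(t) / counts[a]),
    )


def thompson_select(successes, failures, rng):
    """TS: sample a plausible mean for each arm from its Beta posterior
    (uniform Beta(1, 1) prior assumed) and act greedily on the samples."""
    samples = [
        rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)
    ]
    return max(range(len(samples)), key=samples.__getitem__)
```

Both rules "write off" an inferior arm once enough evidence accumulates: under UCB1 its bonus shrinks until the bound falls below the best arm's, and under TS its posterior concentrates so that its samples rarely come out on top.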