Bayesian Counterfactual Risk Minimization
Ben London (blondon@), Amazon Music
Ted Sandler (sandler@), Amazon Music
International Conference on Machine Learning
Long Beach, CA, June 11, 2019
Learning from Logged Data
• Pull log data (e.g., user i listened to item j)
• Design/train new rec policy
• A/B test new policy vs old policy
• If successful: launch new policy
• If unsuccessful: design/train a new rec policy
Problem 1: Bandit Feedback
• Only observe outcomes from actions taken
• e.g., only get feedback on recommendations
Example exchange: "Alexa, play music" / "Here’s a station you might like …"
Problem 2: Bias
• Logged data is biased
• Policy typically not uniform distribution
• User typically doesn’t see everything
• Bias affects inferences
• Self-fulfilling prophecies; “rich get richer”
• Miss key insights due to insufficient support
Figure annotations: high support → better estimate; low support → who knows?
IPS Policy Optimization
• Use inverse propensity score (IPS) estimator
$$\arg\min_{\pi} \; -\frac{1}{n} \sum_{i=1}^{n} r_i \, \frac{\pi(a_i \mid x_i)}{p_i}, \qquad \text{logged propensity } p_i = \pi_0(a_i \mid x_i)$$
• IPS is an unbiased estimator of expected reward
$$\mathbb{E}_{(x,\rho) \sim D} \, \mathbb{E}_{a \sim \pi(x)} \left[ \rho(x, a) \right] \approx \frac{1}{n} \sum_{i=1}^{n} r_i \, \frac{\pi(a_i \mid x_i)}{p_i}$$
• Caveat: logging policy must have full support
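As a minimal sketch of the IPS estimate on this slide, the function below averages r_i · π(a_i | x_i) / p_i over a log. The function name `ips_estimate` and the four logged interactions are made up for illustration; they do not come from the talk. The policy objective on the slide is the negative of this quantity, minimized over π.

```python
from typing import Sequence


def ips_estimate(
    rewards: Sequence[float],
    target_probs: Sequence[float],
    logged_propensities: Sequence[float],
) -> float:
    """Return (1/n) * sum_i r_i * pi(a_i | x_i) / p_i."""
    n = len(rewards)
    if n == 0 or not (n == len(target_probs) == len(logged_propensities)):
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for r, pi, p in zip(rewards, target_probs, logged_propensities):
        # Full-support caveat: every logged action needs p_i > 0.
        if p <= 0.0:
            raise ValueError("logged propensity must be positive")
        total += r * pi / p
    return total / n


# Hypothetical log of four interactions (illustrative values only).
rewards = [1.0, 0.0, 1.0, 1.0]              # r_i
target_probs = [0.5, 0.2, 0.9, 0.1]         # pi(a_i | x_i) under the new policy
logged_propensities = [0.25, 0.5, 0.9, 0.2] # p_i = pi_0(a_i | x_i)

estimate = ips_estimate(rewards, target_probs, logged_propensities)
print(estimate)  # estimated expected reward of the new policy
```

Only the probabilities of the logged actions are needed, so the new policy can be evaluated without observing feedback for actions that were never taken.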
IPS Policy Optimization
• Problem: IPS has high variance
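One way to see the variance problem is to compute, for a tiny context-free example, the exact variance of a single IPS term r(a) · π(a) / π_0(a) when a is drawn from the logging policy π_0. This is a sketch under assumptions that are not in the talk: two actions, deterministic rewards, and the helper name `ips_term_variance` are all hypothetical.

```python
from typing import Sequence


def ips_term_variance(
    rewards_by_action: Sequence[float],
    target: Sequence[float],
    logging: Sequence[float],
) -> float:
    """Exact variance of r(a) * target(a) / logging(a) for a ~ logging."""
    if any(l <= 0.0 for l in logging):
        raise ValueError("logging policy must have full support")
    # E[r * w] = sum_a logging(a) * r(a) * target(a) / logging(a)
    mean = sum(r * t for r, t in zip(rewards_by_action, target))
    # E[(r * w)^2] = sum_a r(a)^2 * target(a)^2 / logging(a)
    second_moment = sum(
        (r * t) ** 2 / l for r, t, l in zip(rewards_by_action, target, logging)
    )
    return second_moment - mean ** 2


rewards_by_action = [1.0, 1.0]  # hypothetical: both actions always pay 1
target = [0.5, 0.5]             # new policy to evaluate

matched = ips_term_variance(rewards_by_action, target, [0.5, 0.5])
skewed = ips_term_variance(rewards_by_action, target, [0.99, 0.01])
print(matched, skewed)
```

In this toy setting the variance is zero when the logging policy matches the target policy and grows to roughly 24 when the logging policy almost never takes the second action, because the term π(a)² / π_0(a) blows up as π_0(a) approaches zero.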
CRM Principle
• Counterfactual Risk Minimization (CRM) principle
$$\arg\min_{\pi} \; -\frac{1}{n} \sum_{i=1}^{n} r_i \, \frac{\pi(a_i \mid x_i)}{p_i} + \underbrace{\lambda \sqrt{\widehat{\mathrm{Var}}(\pi, S)}}_{\text{variance regularization}}$$
• Motivated by PAC risk analysis
• Stochastic optimization of variance regularizer is tricky
• Policy optimization for exponential models (POEM) algorithm [Swaminathan & Joachims, ICML 2015]
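The CRM objective on this slide can be evaluated for a fixed policy as in the sketch below. It assumes that the empirical variance term is the unbiased sample variance of the per-example values r_i · π(a_i | x_i) / p_i; the slide does not spell out the definition, so treat that choice, the name `crm_objective`, and the sample data as assumptions. This only scores a policy and is not the POEM algorithm.

```python
import math
import statistics
from typing import Sequence


def crm_objective(
    rewards: Sequence[float],
    target_probs: Sequence[float],
    logged_propensities: Sequence[float],
    lam: float,
) -> float:
    """Return -mean(terms) + lam * sqrt(sample variance of terms)."""
    # Per-example IPS terms r_i * pi(a_i | x_i) / p_i.
    terms = [
        r * pi / p
        for r, pi, p in zip(rewards, target_probs, logged_propensities)
    ]
    if len(terms) < 2:
        raise ValueError("need at least two logged examples")
    # Assumption: unbiased (n - 1) sample variance as the variance estimate.
    variance = statistics.variance(terms)
    return -statistics.mean(terms) + lam * math.sqrt(variance)


# Hypothetical log (illustrative values only).
rewards = [1.0, 0.0, 1.0, 1.0]
target_probs = [0.5, 0.2, 0.9, 0.1]
logged_propensities = [0.25, 0.5, 0.9, 0.2]

plain_ips = crm_objective(rewards, target_probs, logged_propensities, lam=0.0)
regularized = crm_objective(rewards, target_probs, logged_propensities, lam=0.5)
print(plain_ips, regularized)
```

With λ = 0 the objective reduces to the plain IPS objective; a larger λ penalizes policies whose importance-weighted rewards vary widely across the log.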