Improving a Neural Semantic Parser by Counterfactual Learning from Human Bandit Feedback
Carolin Lawrence, Stefan Riezler
Heidelberg University, Institute for Computational Linguistics
July 17, 2018
Situation Overview

◮ Situation: deployed system (e.g. QA, MT, ...)
◮ Goal: improve the system using human feedback
◮ Plan: create a log D_log of user-system interactions & improve the system offline (safety)

Here: improve a neural semantic parser
Contrast to Previous Approaches

[Diagram: previous approaches predict several parses y_1, ..., y_s for a question x, retrieve answers a_1, ..., a_s from the database, and compare them against a gold answer to obtain rewards r_1, ..., r_s for training the parser. Required data: a gold answer for each of the 1...n questions.]
Our Approach

[Diagram: the deployed parser predicts a single parse y for question x, the database returns answer a, the user gives feedback r, and the triple (x, y, r) is logged to train an improved parser offline. Required data: only the 1...n questions.]
Our Approach

[Diagrams repeated: previous approach (gold answers, rewards r_1, ..., r_s) vs. our approach (logged triples (x, y, r) from user feedback).]

◮ No supervision: given an input, the gold output is unknown
◮ Bandit: feedback is given for only one system output
◮ Bias: the log D is biased towards the decisions of the deployed system

Solution: counterfactual / off-policy reinforcement learning
Task
A natural language interface to OpenStreetMap

◮ OpenStreetMap (OSM): geographical database
◮ NLmaps v2: extension of the previous corpus, now totalling 28,609 question-parse pairs
A natural language interface to OpenStreetMap

◮ example question: "How many hotels are there in Paris?" Answer: 951
◮ correctness of answers is difficult to judge → judge parses by making them human-understandable
◮ feedback collection setup:
  1. automatically convert a parse to a set of statements
  2. humans judge the statements
Example: Feedback Formula

query(around(center(area(keyval('name','Paris')),
                     nwr(keyval('name','Place de la République'))),
              search(nwr(keyval('amenity','parking'))),
              maxdist(WALKING_DIST)),
      qtype(findkey('name')))
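A minimal sketch (not the authors' feedback form) of how the keyval pairs of such a parse could be turned into statements a human can judge; the regular expression, the label mapping, and the keyval_statements helper are illustrative assumptions.

```python
import re

# Hypothetical mapping from OSM keys to human-readable labels;
# the labels in the actual feedback form may differ.
LABELS = {"name": "Name", "amenity": "Type of place"}

def keyval_statements(parse: str) -> list[str]:
    """Extract keyval('key','value') pairs from a parse string and
    render each one as a short statement for human judgment."""
    pairs = re.findall(r"keyval\('([^']+)','([^']+)'\)", parse)
    return [f"{LABELS.get(key, key)}: {value}" for key, value in pairs]

parse = ("query(around(center(area(keyval('name','Paris')),"
         "nwr(keyval('name','Place de la République'))),"
         "search(nwr(keyval('amenity','parking'))),"
         "maxdist(WALKING_DIST)),qtype(findkey('name')))")

for statement in keyval_statements(parse):
    print(statement)  # e.g. "Name: Paris", "Type of place: parking"
```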
Objectives
Counterfactual Learning

Resources
◮ collected log D_log = {(x_t, y_t, δ_t)}_{t=1}^{n} with
  ◮ x_t: input
  ◮ y_t: most likely output of the deployed system π_0
  ◮ δ_t ∈ [−1, 0]: loss (i.e. negative reward) received from the user

Deterministic Propensity Matching (DPM)
◮ minimize the expected risk for a target policy π_w:

  $\hat{R}_{\text{DPM}}(\pi_w) = \frac{1}{n} \sum_{t=1}^{n} \delta_t \, \pi_w(y_t \mid x_t)$

◮ improve π_w using (stochastic) gradient descent
◮ high variance → use a multiplicative control variate
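A minimal PyTorch sketch of the DPM objective, under the assumption that the model exposes the sequence probabilities π_w(y_t | x_t) of the logged outputs as log-probabilities; the policy_log_probs call in the usage comment is hypothetical.

```python
import torch

def dpm_loss(log_probs: torch.Tensor, losses: torch.Tensor) -> torch.Tensor:
    """Empirical DPM risk: (1/n) * sum_t delta_t * pi_w(y_t | x_t).
    `log_probs` holds log pi_w(y_t | x_t) for the logged outputs,
    `losses` holds the user losses delta_t in [-1, 0]."""
    return (losses * log_probs.exp()).mean()

# Hypothetical usage with a seq2seq model that scores the logged parses:
#   loss = dpm_loss(policy_log_probs(batch), batch_losses)
#   loss.backward(); optimizer.step()
```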
Multiplicative Control Variate

◮ for random variables X and Y, with Ȳ the expectation of Y:

  $\mathbb{E}[X] = \mathbb{E}\!\left[\frac{X}{Y}\right] \cdot \bar{Y}$

  → the RHS has lower variance if Y positively correlates with X

DPM with Reweighting (DPM+R)

  $\hat{R}_{\text{DPM+R}}(\pi_w) = \frac{\frac{1}{n} \sum_{t=1}^{n} \delta_t \, \pi_w(y_t \mid x_t)}{\frac{1}{n} \sum_{t=1}^{n} \pi_w(y_t \mid x_t)}$

  (the denominator is the reweight sum R)

◮ reduces variance but introduces a bias of order O(1/n) that decreases as n increases → n should be as large as possible
◮ Problem: in stochastic minibatch learning, n is too small
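Continuing the same sketch, the reweighted DPM+R estimator only changes the normalization; this version assumes the whole log fits into one tensor, which is exactly what breaks down with minibatches and motivates the OSL variant on the next slide.

```python
import torch

def dpm_r_loss(log_probs: torch.Tensor, losses: torch.Tensor) -> torch.Tensor:
    """Self-normalized DPM+R risk:
    (1/n sum_t delta_t * pi_w(y_t|x_t)) / (1/n sum_t pi_w(y_t|x_t)).
    The denominator is the reweight sum R over the full log."""
    probs = log_probs.exp()
    return (losses * probs).mean() / probs.mean()
```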
One-Step Late (OSL) Reweighting

Perform gradient descent updates & reweighting asynchronously:
◮ evaluate the reweight sum R on the entire log of size n using parameters w′
◮ update using minibatches of size m, m ≪ n
◮ periodically update R
→ retains all desirable properties

DPM+OSL

  $\hat{R}_{\text{DPM+OSL}}(\pi_w) = \frac{\frac{1}{m} \sum_{t=1}^{m} \delta_t \, \pi_w(y_t \mid x_t)}{\frac{1}{n} \sum_{t=1}^{n} \pi_{w'}(y_t \mid x_t)}$
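A rough sketch of how the one-step-late reweighting could be wired into a training loop; `policy` is a stand-in for a model returning log π_w(y | x) for a batch of logged pairs, and the recomputation schedule is an assumption.

```python
import torch

def compute_reweight_sum(policy, full_log) -> torch.Tensor:
    """Re-evaluate R = (1/n) * sum_t pi_{w'}(y_t | x_t) over the entire
    log with the current (soon one-step-late) parameters, no gradients."""
    xs, ys, _ = full_log
    with torch.no_grad():
        return policy(xs, ys).exp().mean()

def dpm_osl_step(policy, optimizer, minibatch, reweight_sum_R):
    """One DPM+OSL update: the numerator uses a minibatch of size m with
    the current parameters w, the denominator stays fixed to the
    one-step-late reweight sum R."""
    xs, ys, losses = minibatch
    probs = policy(xs, ys).exp()
    loss = (losses * probs).mean() / reweight_sum_R
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Hypothetical schedule: recompute R every k minibatches.
#   for step, minibatch in enumerate(minibatches):
#       if step % k == 0:
#           R = compute_reweight_sum(policy, full_log)
#       dpm_osl_step(policy, optimizer, minibatch, R)
```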
Token-Level Feedback

DPM+T

  $\hat{R}_{\text{DPM+T}}(\pi_w) = \frac{1}{n} \sum_{t=1}^{n} \sum_{j=1}^{|y|} \delta_j \, \pi_w(y_j \mid x_t)$

DPM+T+OSL

  $\hat{R}_{\text{DPM+T+OSL}}(\pi_w) = \frac{\frac{1}{m} \sum_{t=1}^{m} \left( \sum_{j=1}^{|y|} \delta_j \, \pi_w(y_j \mid x_t) \right)}{\frac{1}{n} \sum_{t=1}^{n} \pi_{w'}(y_t \mid x_t)}$
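A sketch of the token-level numerator, assuming the log stores one loss δ_j per output token and that sequences are padded into a batch with a mask; both layout choices are assumptions for illustration.

```python
import torch

def dpm_t_loss(token_log_probs: torch.Tensor,
               token_losses: torch.Tensor,
               mask: torch.Tensor) -> torch.Tensor:
    """Token-level DPM+T risk: each token y_j of a logged parse carries
    its own loss delta_j, which weights the per-token probability
    pi_w(y_j | x_t) before summing over the sequence.
    All tensors are (batch, max_len); `mask` zeroes out padding."""
    token_probs = token_log_probs.exp()
    per_example = (token_losses * token_probs * mask).sum(dim=1)
    return per_example.mean()
```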
Experiments
Experimental Setup

◮ sequence-to-sequence neural network (Nematus)
◮ deployed system: pre-trained on 2k question-parse pairs
◮ feedback collection:
  1. humans judged 1k system outputs
     ◮ average time to judge a parse: 16.4s
     ◮ most parses (> 70%) judged in < 10s
  2. simulated feedback for 23k system outputs
     ◮ token-wise comparison to the gold parse
◮ bandit-to-supervised conversion (B2S): all instances in the log with reward 1 are used as supervised training data
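Two small sketches of the data preparation described above; the log layout as (question, parse, reward) triples and the 0/−1 token loss scheme are assumptions for illustration, not the authors' exact implementation.

```python
def b2s_filter(log):
    """Bandit-to-supervised (B2S) conversion: keep only the logged
    (question, parse) pairs that received reward 1 and reuse them as
    regular supervised training data."""
    return [(x, y) for x, y, r in log if r == 1]

def simulated_token_feedback(pred_tokens, gold_tokens):
    """Simulated token-level feedback via token-wise comparison to the
    gold parse: loss 0 for a matching token, -1 for a mismatch."""
    return [0 if p == g else -1 for p, g in zip(pred_tokens, gold_tokens)]
```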
Experimental Results

[Bar chart: F1 score improvements over the deployed baseline for B2S vs. DPM+T+OSL. Human Feedback (1k): +0.34 (B2S) vs. +0.99 (DPM+T+OSL). Large-Scale Simulated Feedback (23k): +5.77 (B2S) vs. +6.96 (DPM+T+OSL).]
Take Away

Counterfactual Learning
◮ safely improve a system by collecting interaction logs
◮ applicable to any task if the underlying model is differentiable
◮ DPM+OSL: new objective for stochastic minibatch learning

Improving a Semantic Parser
◮ collect feedback by making parses human-understandable
◮ judging a parse is often easier & faster than formulating a parse or answer

NLmaps v2
◮ large question-parse corpus for QA in the geographical domain

Future Work
◮ integrate the feedback form in the online NL interface to OSM