A Coactive Learning View of Online Structured Prediction in SMT
Artem Sokolov∗, Stefan Riezler∗, Shay B. Cohen‡
∗Heidelberg University, ‡University of Edinburgh
Motivation
Online learning protocol
1. observe input structure $x_t$
2. predict output structure $y_t$
3. receive feedback (gold-standard or post-edit)
4. update parameters
A tool of choice in SMT
■ memory & runtime efficiency
■ interactive scenarios with user feedback
Online learning (for SMT)
Usual assumptions
■ convexity (for regret bounds)
■ reachable feedback (for gradients)
Reality
■ SMT has latent variables (non-convex)
■ most references lie outside the search space (non-reachable)
■ references/full post-edits are expensive (= professional translation)
Intuition
■ light post-edits are cheaper
■ and have a better chance of being reachable
Question: should editors put much effort into correcting SMT outputs at all?
Contribution & Goal
Goals
■ demonstrate the feasibility of learning from weak feedback for SMT
■ propose a new perspective on learning from surrogate translations
■ note: the goal is not to improve over any full-information model
Contributions
➡ Theory
  ➡ extension of the coactive learning model to latent structures
  ➡ improvements by a derivation-dependent update scaling
  ➡ straightforward generalization bounds
➡ Practice
  ➡ learning from weak post-edits does translate into improved MT quality
  ➡ surrogate references work better if they admit an underlying linear model
Coactive Learning [Shivaswamy & Joachims, ICML'12]
■ rational user: feedback $\bar{y}_t$ improves some utility over the prediction $y_t$:
  $U(x_t, \bar{y}_t) \ge U(x_t, y_t)$
■ regret: how much the learner is 'sorry' for not predicting the optimal $y^*_t$:
  $\mathrm{REG}_T = \frac{1}{T}\sum_{t=1}^{T} \big( U(x_t, y^*_t) - U(x_t, y_t) \big) \to \min$
■ feedback is α-informative if
  $U(x_t, \bar{y}_t) - U(x_t, y_t) \ge \alpha \big( U(x_t, y^*_t) - U(x_t, y_t) \big)$
■ no latent variables
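A minimal sketch (not from the slides) of the α-informativeness condition under the linear utility $U(x,y) = w^{*\top}\phi(x,y)$; the function names and the numpy feature representation are assumptions for illustration.

```python
import numpy as np

def utility(w_star, phi):
    """Linear utility U(x, y) = w*^T phi(x, y); w* is known only to the (simulated) user."""
    return float(np.dot(w_star, phi))

def is_alpha_informative(w_star, phi_feedback, phi_pred, phi_opt, alpha, xi=0.0):
    """Check U(x, ybar) - U(x, y) >= alpha * (U(x, y*) - U(x, y)) - xi."""
    gain = utility(w_star, phi_feedback) - utility(w_star, phi_pred)
    gap = utility(w_star, phi_opt) - utility(w_star, phi_pred)
    return gain >= alpha * gap - xi
```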
Algorithm: Feedback-based Structured Perceptron
1: Initialize $w \leftarrow 0$
2: for $t = 1, \dots, T$ do
3:   Observe $x_t$
4:   $y_t \leftarrow \arg\max_y w_t^\top \phi(x_t, y)$
5:   Obtain weak feedback $\bar{y}_t$
6:   if $y_t \ne \bar{y}_t$ then
7:     $w_{t+1} \leftarrow w_t + \big( \phi(x_t, \bar{y}_t) - \phi(x_t, y_t) \big)$
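Below is a minimal Python sketch of one round of this algorithm (not the authors' code); `phi`, `get_feedback`, and the candidate set are assumed interfaces for illustration.

```python
import numpy as np

def feedback_perceptron_step(w, phi, x_t, y_candidates, get_feedback):
    """One round of the feedback-based structured perceptron.

    phi(x, y) -> np.ndarray feature vector; get_feedback(x, y) -> weak feedback ybar.
    """
    # predict the highest-scoring structure under the current weights
    y_t = max(y_candidates, key=lambda y: np.dot(w, phi(x_t, y)))
    # obtain (weak) user feedback
    y_bar = get_feedback(x_t, y_t)
    # perceptron-style update towards the feedback
    if y_bar != y_t:
        w = w + (phi(x_t, y_bar) - phi(x_t, y_t))
    return w, y_t, y_bar
```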
Algorithm: Feedback-based Latent Structured Perceptron
1: Initialize $w \leftarrow 0$
2: for $t = 1, \dots, T$ do
3:   Observe $x_t$
4:   $(y_t, h_t) \leftarrow \arg\max_{(y,h)} w_t^\top \phi(x_t, y, h)$
5:   Obtain weak feedback $\bar{y}_t$
6:   if $y_t \ne \bar{y}_t$ then
7:     $\bar{h}_t \leftarrow \arg\max_h w_t^\top \phi(x_t, \bar{y}_t, h)$
8:     $w_{t+1} \leftarrow w_t + \Delta_{\bar{h}_t, h_t} \big( \phi(x_t, \bar{y}_t, \bar{h}_t) - \phi(x_t, y_t, h_t) \big)$
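A sketch of the latent variant (again not the authors' code), extending the step above: the feedback's derivation $\bar{h}_t$ is re-derived under the current model, and the update is scaled by $\Delta_{\bar{h}_t, h_t}$. Using the Euclidean distance between feature vectors as the scaling is one choice mentioned later in the results; `candidates` and `derivations` are assumed interfaces.

```python
import numpy as np

def latent_feedback_perceptron_step(w, phi, x_t, candidates, get_feedback, derivations):
    """One round of the feedback-based latent structured perceptron.

    candidates: iterable of (y, h) pairs; derivations(y) -> possible derivations h of y.
    """
    # joint prediction over outputs and latent derivations
    y_t, h_t = max(candidates, key=lambda yh: np.dot(w, phi(x_t, *yh)))
    y_bar = get_feedback(x_t, y_t)
    if y_bar != y_t:
        # re-derive the feedback under the current model
        h_bar = max(derivations(y_bar), key=lambda h: np.dot(w, phi(x_t, y_bar, h)))
        diff = phi(x_t, y_bar, h_bar) - phi(x_t, y_t, h_t)
        # one possible derivation-dependent scaling: Euclidean distance of feature vectors
        delta = np.linalg.norm(diff)
        w = w + delta * diff
    return w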
Analysis
Under the same assumptions as in [Shivaswamy & Joachims'12]:
■ linear utility: $U(x_t, y_t) = w^{*\top} \phi(x_t, y_t)$
■ $w^*$ is the optimal parameter, known only to the user
■ $\|\phi(x_t, y_t, h_t)\| \le R$
■ some violations of α-informativeness are allowed:
  $U(x_t, \bar{y}_t) - U(x_t, y_t) \ge \alpha \big( U(x_t, y^*_t) - U(x_t, y_t) \big) - \xi_t$
Convergence
Let $D_T = \sum_{t=1}^{T} \Delta^2_{\bar{h}_t, h_t}$. Then
$\mathrm{REG}_T \le \frac{1}{\alpha T}\sum_{t=1}^{T} \xi_t + \frac{2R\|w^*\|}{\alpha} \cdot \frac{\sqrt{D_T}}{T}$
■ standard perceptron proof [Novikoff'62]
■ better than $O(1/\sqrt{T})$ if $D_T$ does not grow too fast
■ [Shivaswamy & Joachims'12] is the special case $\Delta_{\bar{h}_t, h_t} = 1$
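To make the last bullet concrete, a short check (not on the original slide): with the constant scaling $\Delta_{\bar{h}_t, h_t} = 1$ the bound reduces to the standard coactive-learning rate.

```latex
% Special case Delta = 1: D_T = sum_{t=1}^T 1 = T, so sqrt(D_T)/T = 1/sqrt(T)
% and the bound becomes
\mathrm{REG}_T \;\le\; \frac{1}{\alpha T}\sum_{t=1}^{T}\xi_t \;+\; \frac{2R\|w^*\|}{\alpha}\cdot\frac{1}{\sqrt{T}},
% i.e. the O(1/sqrt(T)) regret of Shivaswamy & Joachims (2012);
% any D_T growing slower than T gives a faster rate.
```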
Analysis
Generalization
Let $0 < \delta < 1$, and let $x_1, \dots, x_T$ be a sequence of observed inputs. Then with probability at least $1 - \delta$,
$\mathbb{E}_{x_1,\dots,x_T}[\mathrm{REG}_T] \le \mathrm{REG}_T + 2\|w^*\| R \sqrt{\tfrac{2 \ln \frac{1}{\delta}}{T}}$.
■ bounds how far the expected regret can be from the empirical regret we observe
■ the proof uses the results of [Cesa-Bianchi et al.'04]
■ see the paper for more
Experimental Setup
■ LIG corpus [Potet et al.'10]
  ➡ news domain, FR→EN
  ➡ (FR input, MT output, EN post-edit, EN reference), 11k tuples in total
  ➡ split:
     train 7k – online input data
     dev 2k – to get $w^*$ for simulation / checking convergence
     test 2k – testing
■ Moses, 1000-best lists
■ cyclic order
Simulated Experiments
User simulation:
■ scan the n-best list for derivations that are α-informative
■ return the first $\bar{y}_t \ne y_t$ that satisfies
  $U(x_t, \bar{y}_t) - U(x_t, y_t) \ge \alpha \big( U(x_t, y^*_t) - U(x_t, y_t) \big) - \xi_t$
  (with minimal $\xi_t$, if no $\xi_t = 0$ is found for a given α)
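A minimal sketch of such a user simulator (not the authors' code). Assumed interface: the n-best list is given as precomputed feature vectors in model order, and `w_star` plays the role of the utility known only to the simulated user.

```python
import numpy as np

def simulate_feedback(w_star, phi_list, pred_idx, opt_idx, alpha):
    """Return the index of the first n-best entry that is alpha-informative
    w.r.t. the simulated user's utility w_star; if none has slack xi = 0,
    fall back to the entry with the smallest violation xi.

    phi_list: feature vectors of the n-best entries (model order);
    pred_idx: index of the model's 1-best; opt_idx: index of the optimal entry.
    """
    u = [float(np.dot(w_star, phi)) for phi in phi_list]
    required = alpha * (u[opt_idx] - u[pred_idx])  # alpha-scaled utility gap
    best_idx, best_xi = None, float("inf")
    for i, ui in enumerate(u):
        if i == pred_idx:
            continue
        xi = max(0.0, required - (ui - u[pred_idx]))  # slack of the condition
        if xi == 0.0:
            return i          # first strictly alpha-informative candidate
        if xi < best_xi:
            best_idx, best_xi = i, xi
    return best_idx           # minimal-slack candidate if none is exact
```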
Regret and TER for α-informative feedback
[Figure: regret (left) and test TER (right) over 20,000 iterations for α = 0.1, 0.5, 1.0]
■ convergence in regret when learning from weak feedback of differing strength
■ simultaneous improvement of TER (on test)
■ stronger feedback leads to faster improvement of regret/TER
■ setting $\Delta_{\bar{h}_t, h_t}$ to the Euclidean distance between feature vectors leads to even faster regret/TER improvements
Feedback from Surrogate Translations
■ so far the feedback was simulated
■ what about real post-edits?
■ main question: how well do the common practices for extracting surrogate references from user post-edits in discriminative SMT fit the coactive learning model?
Standard heuristics for surrogates
1. oracle – closest to the post-edit $y$ in the full search graph:
   $\bar{y} = \arg\min_{y' \in \mathcal{Y}(x_t; w_t)} \mathrm{TER}(y', y)$
2. local – closest to the post-edit in the n-best list [Liang et al.'06]:
   $\bar{y} = \arg\min_{y' \in n\text{-best}(x_t; w_t)} \mathrm{TER}(y', y)$
3. filtered – first hypothesis in the n-best list with better TER than the 1-best:
   $\mathrm{TER}(\bar{y}, y) < \mathrm{TER}(y_t, y)$
4. hope – hypothesis that maximizes model score plus negative TER [Chiang'12]:
   $\bar{y} = \arg\max_{y' \in n\text{-best}(x_t; w_t)} \big( -\mathrm{TER}(y', y) + w_t^\top \phi(x_t, y', h) \big)$
Degrees of model-awareness
■ oracle – model-agnostic
■ local – constrained to the n-best list, but ignores its ordering
■ filtered & hope – let the model score/ordering influence the surrogate
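A compact sketch of the three n-best-based heuristics (not the authors' code); the oracle variant additionally needs the full search graph and is omitted. The `ter()` function and the n-best representation are assumptions for illustration.

```python
from collections import namedtuple

# assumed n-best entry: surface hypothesis and its model score, listed in model order
Hyp = namedtuple("Hyp", ["hypothesis", "model_score"])

def local_surrogate(nbest, ter, postedit):
    """local: n-best entry closest to the post-edit under TER (ignores model ordering)."""
    return min(nbest, key=lambda h: ter(h.hypothesis, postedit))

def filtered_surrogate(nbest, ter, postedit):
    """filtered: first entry in model order with better TER than the 1-best (None if no such entry)."""
    best_ter = ter(nbest[0].hypothesis, postedit)
    for h in nbest[1:]:
        if ter(h.hypothesis, postedit) < best_ter:
            return h
    return None

def hope_surrogate(nbest, ter, postedit):
    """hope: maximize model score plus negative TER [Chiang'12]."""
    return max(nbest, key=lambda h: h.model_score - ter(h.hypothesis, postedit))
```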
Results
[Figure: regret (left) and test TER (right) over 20,000 iterations for oracle, local, filtered, and hope surrogates at α = 0.1 and α = 1.0]
■ regret diverges when learning with model-unaware surrogates
■ convergence in regret when learning with model-aware surrogates
% strictly α-informative:
  local    39.46%
  filtered 47.73%
  hope     83.30%
Conclusions
■ regret & generalization bounds
  ➡ latent variables
  ➡ changing feedback
■ concept of weak feedback in online learning for SMT
  ➡ one can still learn without observing references
  ➡ surrogate references should admit an underlying linear model
Thank you!