Performance Prediction and Shrinking Language Models

Stanley F. Chen†
IBM T.J. Watson Research Center
Yorktown Heights, New York, USA

27 June 2011

† Joint work with Stephen Chu, Ahmad Emami, Lidia Mangu, Bhuvana Ramabhadran, Ruhi Sarikaya, and Abhinav Sethy.
What Does a Good Model Look Like?

(test error) ≡ (training error) + (overfit)
Overfitting: Theory

- E.g., Akaike Information Criterion (1973):
    −(test LL) ≈ −(train LL) + (# params)
- E.g., structural risk minimization (Vapnik, 1974):
    (test err) ≤ (train err) + f(VC dimension)

Down with big models!?
The Big Idea

Maybe overfit doesn’t act like we think it does.
Let’s try to fit overfit empirically.
What This Talk Is About

- An empirical estimate of the overfit in log likelihood of . . .
- Exponential language models . . .
- That is really simple and works really well.
- Why it works.
- What you can do with it.
Outline

1. Introduction
2. Finding an Empirical Law for Overfitting
3. Regularization
4. Why Does the Law Hold?
5. Things You Can Do With It
6. Discussion
Exponential N-Gram Language Models

- Language model: predict the next word given the previous, say, two words.
    P(y = ate | x = the cat)
- Log-linear model: features f_i(·); parameters λ_i.
    P(y | x) = exp(Σ_i λ_i f_i(x, y)) / Z_Λ(x)
- A binary feature f_i(·) for each n-gram in the training set.
- An alternative parameterization of back-off n-gram models.
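A minimal sketch of how such a log-linear model assigns probabilities may help make this concrete; the feature templates, vocabulary, and weights below are made up for illustration and are not from the talk:

```python
import math
from collections import defaultdict

def ngram_features(history, word):
    """Binary n-gram features active for (history, word): unigram, bigram, trigram."""
    feats = [("1g", word)]
    if len(history) >= 1:
        feats.append(("2g", history[-1], word))
    if len(history) >= 2:
        feats.append(("3g", history[-2], history[-1], word))
    return feats

def prob(history, word, vocab, lam):
    """P(y | x) = exp(sum_i lambda_i f_i(x, y)) / Z_Lambda(x)."""
    def score(w):
        return sum(lam.get(f, 0.0) for f in ngram_features(history, w))
    z = sum(math.exp(score(w)) for w in vocab)  # normalizer Z_Lambda(x)
    return math.exp(score(word)) / z

# Toy example: P(y = ate | x = the cat) with two illustrative weights.
vocab = ["ate", "sat", "ran"]
lam = defaultdict(float, {("3g", "the", "cat", "ate"): 2.0, ("1g", "sat"): 0.5})
print(prob(["the", "cat"], "ate", vocab, lam))
```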
Details: Regression

- Build hundreds of (regularized!) language models.
- Compute the actual overfit in log likelihood (LL) per event (= log PP).
- Calculate lots of statistics for each model.
    F = # parameters; D = # training events.
    F/D;  (F log D)/D;  (1/D) Σ_i |λ_i|;  (1/D) Σ_i λ_i²;  (1/D) Σ_i |λ_i|³;  . . .
- Do linear regression!
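A rough sketch of the regression setup; the model statistics and measured overfits below are synthetic stand-ins for the hundreds of real models built in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# One row per trained model; columns are candidate statistics such as
# F/D, (F log D)/D, (1/D) sum|lambda_i|, (1/D) sum lambda_i^2, ...
num_models, num_stats = 200, 4
stats = rng.uniform(0.0, 2.0, size=(num_models, num_stats))

# Synthetic "measured" overfit (LL_test - LL_train per event) for illustration.
actual_overfit = stats @ np.array([0.0, 0.0, 0.94, 0.0]) + rng.normal(0.0, 0.05, num_models)

# Ordinary least squares: fit overfit as a linear function of the statistics.
coefs, *_ = np.linalg.lstsq(stats, actual_overfit, rcond=None)
predicted = stats @ coefs

r = np.corrcoef(predicted, actual_overfit)[0, 1]
print("regression coefficients:", coefs)
print("correlation r =", round(r, 4))
```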
What Doesn’t Work? AIC-like Prediction

(overfit) ≡ LL_test − LL_train ≈ γ · (# params) / (# train evs)

[Scatter plot of predicted vs. actual overfit.]
What Doesn’t Work? BIC-like Prediction

LL_test − LL_train ≈ γ · (# params) · log(# train evs) / (# train evs)

[Scatter plot of predicted vs. actual overfit.]
What Does Work? (r = 0.9996)

LL_test − LL_train ≈ (γ / (# train evs)) Σ_{i=1}^{F} |λ_i|

[Scatter plot of predicted vs. actual overfit.]
γ = 0.938

- Holds for many different types of data.
    Different domains (e.g., Wall Street Journal, . . . ).
    Different token types (letters, parts-of-speech, words).
    Different vocabulary sizes (27–84,000 words).
    Different training set sizes (100–100,000 sentences).
    Different n-gram orders (2–7).
- Holds for many different types of exponential models.
    Word n-gram models; class-based n-gram models; minimum discrimination information models.
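A small sketch of the law used as a predictor: given only a trained model and its training log PP, estimate its test log PP (the weights and counts below are placeholders):

```python
GAMMA = 0.938  # empirical constant from the regression above

def predicted_overfit(lambdas, num_train_events):
    """Predicted LL_test - LL_train per event: (gamma / D) * sum_i |lambda_i|."""
    return GAMMA * sum(abs(l) for l in lambdas) / num_train_events

# Placeholder values for illustration only:
lambdas = [0.7, -1.2, 2.0, 0.1]   # trained feature weights
D = 100_000                       # number of training events
train_log_pp = 6.5                # measured training log PP per event
print("predicted test log PP:", train_log_pp + predicted_overfit(lambdas, D))
```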
What About Other Languages?

LL_test − LL_train ≈ (0.938 / (# train evs)) Σ_{i=1}^{F} |λ_i|

[Scatter plot of predicted vs. actual overfit for Iraqi, Spanish, German, and Turkish models.]
What About Genetic Data?

LL_test − LL_train ≈ (0.938 / (# train evs)) Σ_{i=1}^{F} |λ_i|

[Scatter plot of predicted vs. actual overfit for rice, chicken, and human genetic data.]
Outline

1. Introduction
2. Finding an Empirical Law for Overfitting
3. Regularization
4. Why Does the Law Hold?
5. Things You Can Do With It
6. Discussion
Regularization

- Improves test set performance.
- ℓ₁, ℓ₂², and ℓ₁+ℓ₂² regularization: choose the λ_i to minimize

    (obj fn) ≡ LL_train + α Σ_{i=1}^{F} |λ_i| + (1/(2σ²)) Σ_{i=1}^{F} λ_i²

- The problem: γ depends on α, σ!
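A sketch of the ℓ₁+ℓ₂² objective as one might hand it to a numerical optimizer; the training loss is stubbed out, and α = 0.5, σ² = 6 are the single settings quoted on the next slide:

```python
import numpy as np

def l1_l2_objective(lambdas, train_loss, alpha=0.5, sigma2=6.0):
    """Regularized objective: LL_train + alpha * sum|lambda_i| + sum(lambda_i^2) / (2 sigma^2).

    `train_loss(lambdas)` is assumed to return the training log PP for these
    weights (a real implementation would evaluate the language model here).
    """
    lambdas = np.asarray(lambdas, dtype=float)
    l1 = alpha * np.sum(np.abs(lambdas))
    l2 = np.sum(lambdas ** 2) / (2.0 * sigma2)
    return train_loss(lambdas) + l1 + l2

# Toy usage with a stand-in quadratic loss instead of a real LM likelihood:
fake_loss = lambda lam: float(np.sum((lam - 1.0) ** 2))
print(l1_l2_objective([0.5, 1.5, 2.0], fake_loss))
```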
Regularization: Two Criteria

- Here: pick a single α, σ across all models.
- Usual way: pick α, σ per model for good performance.
- Good performance and good overfit prediction?

              performance    overfit prediction
    ℓ₁                              ✓
    ℓ₂²            ✓
    ℓ₁+ℓ₂²         ✓                ✓

- ℓ₁+ℓ₂² (α = 0.5, σ² = 6) is as good as the best n-gram smoothing.
The Law and ℓ₁+ℓ₂² Regularization

LL_test − LL_train ≈ (0.938 / (# train evs)) Σ_{i=1}^{F} |λ_i|

[Scatter plot of predicted vs. actual overfit.]
The Law and ℓ₂² Regularization

LL_test − LL_train ≈ (0.882 / (# train evs)) Σ_{i=1}^{F} |λ_i|

[Scatter plot of predicted vs. actual overfit.]
Outline

1. Introduction
2. Finding an Empirical Law for Overfitting
3. Regularization
4. Why Does the Law Hold?
5. Things You Can Do With It
6. Discussion
Why Exponential Models Are Special

Do some math (and include normalization features):

    LL_test − LL_train = (1 / (# train evs)) Σ_{i=1}^{F′} λ_i × (discount of f_i(·))

Compare this to The Law:

    LL_test − LL_train ≈ (1 / (# train evs)) Σ_{i=1}^{F} |λ_i| × 0.938

If only . . . (discount of f_i(·)) ≈ 0.938 × sgn λ_i
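A sketch of the algebra behind the exact identity, under the slide's conventions: LL denotes log PP per event, the normalization term is absorbed into the F′ features, and the discount of f_i is its training count minus its length-scaled test count:

```latex
% Once normalization is expressed via features, per-event log PP is linear in
% the feature counts c_i:  LL = -(1/D) \sum_{i=1}^{F'} \lambda_i c_i .  Hence
\mathrm{LL}_{\mathrm{test}} - \mathrm{LL}_{\mathrm{train}}
  = \sum_{i=1}^{F'} \lambda_i
    \left( \frac{c_i^{\mathrm{train}}}{D_{\mathrm{train}}}
         - \frac{c_i^{\mathrm{test}}}{D_{\mathrm{test}}} \right)
  = \frac{1}{D_{\mathrm{train}}} \sum_{i=1}^{F'} \lambda_i
    \underbrace{\left( c_i^{\mathrm{train}}
         - \tfrac{D_{\mathrm{train}}}{D_{\mathrm{test}}}\, c_i^{\mathrm{test}} \right)}_{\text{discount of } f_i}
```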
What Are Discounts?

- How many times fewer an n-gram occurs in the test set . . .
- Compared to the training set (of equal length).
- Studied extensively in language model smoothing.
- Let’s look at the data.
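A small counting sketch of how discounts can be measured directly from data; the toy text is a placeholder:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def discounts(train_tokens, test_tokens, n):
    """Discount of each training n-gram: train count minus length-scaled test count."""
    train_c = ngram_counts(train_tokens, n)
    test_c = ngram_counts(test_tokens, n)
    scale = len(train_tokens) / len(test_tokens)  # rescale test set to equal length
    return {g: c - scale * test_c.get(g, 0) for g, c in train_c.items()}

# Toy usage with placeholder text:
train = "the cat ate the mouse and the cat sat".split()
test = "the cat ate the cheese".split()
print(discounts(train, test, n=2))
```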
Smoothed Discount Per Feature

(discount of f_i(·)) ≈ 0.938 × sgn λ_i ?

[Plot of smoothed discount vs. λ for very sparse, sparse, less sparse, and dense models.]
Why The Law Holds More Than It Should

- Sparse models all act alike.
- Dense models don’t overfit much.

    LL_test − LL_train ≈ (0.938 / (# train evs)) Σ_{i=1}^{F} |λ_i|
Outline

1. Introduction
2. Finding an Empirical Law for Overfitting
3. Regularization
4. Why Does the Law Hold?
5. Things You Can Do With It
6. Discussion
Explain Things

- Why backoff features help.
- Why word class features help.
- Why domain adaptation helps.
- Why increasing n doesn’t hurt.
- Why relative performance differences shrink with more data.
Make Models Better

(test error) ≈ (training error) + (overfit)

Decrease overfit ⇒ decrease test error.
Reducing Overfitting

    (overfit) ≈ (0.938 / (# train evs)) Σ_{i=1}^{F} |λ_i|

- In practice, the number of features doesn’t matter!
- More features lead to less overfitting . . .
- If the sum of parameter magnitudes Σ_i |λ_i| decreases!
A Method for Reducing Overfitting

Before: λ_1 = λ_2 = 2.

    P_before(y | x) = exp(2·f_1(x, y) + 2·f_2(x, y)) / Z_Λ(x)

After: λ_1 = λ_2 = 0, λ_3 = 2, f_3(x, y) = f_1(x, y) + f_2(x, y).

    P_after(y | x) = exp(2·f_3(x, y)) / Z_Λ(x) = exp(2·f_1(x, y) + 2·f_2(x, y)) / Z_Λ(x)

Same distribution, but Σ_i |λ_i| drops from 4 to 2, so the predicted overfit is halved.
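A tiny numeric check of the trick; the feature values and weights are illustrative, not from any real model:

```python
import math

def softmax_prob(scores, y):
    """P(y | x) given the per-candidate scores sum_i lambda_i f_i(x, y)."""
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[y]) / z

# Illustrative feature values f_1, f_2 over three candidate words y.
f1 = {"ate": 1, "sat": 0, "ran": 0}
f2 = {"ate": 1, "sat": 1, "ran": 0}

# Before: two features, each with weight 2.
before = {y: 2 * f1[y] + 2 * f2[y] for y in f1}

# After: one summed feature f_3 = f_1 + f_2, with weight 2.
f3 = {y: f1[y] + f2[y] for y in f1}
after = {y: 2 * f3[y] for y in f1}

print(softmax_prob(before, "ate"), softmax_prob(after, "ate"))  # identical distributions
print("sum |lambda| before:", 2 + 2, "after:", 2)               # predicted overfit halves
```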
What’s the Catch? (Part I)

Same test set performance?
Re-regularize the model: it improves performance even more!

    (obj fn) ≡ LL_train + α Σ_{i=1}^{F} |λ_i| + (1/(2σ²)) Σ_{i=1}^{F} λ_i²
What’s the Catch? (Part II)

Select features to sum in hindsight?
When you sum features, you sum their discounts!

    LL_test − LL_train = (1 / (# train evs)) Σ_{i=1}^{F′} λ_i × (discount of f_i(·))

Need to pick features to sum a priori!
Heuristic 1: Improving Model Performance

- Identify, a priori, features expected to have similar λ_i.
- Create a new feature that is the sum of the original features (a sketch follows below).
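One way the heuristic might look in code, assuming a hypothetical word-to-class map as the a-priori grouping criterion (the map and features below are made up; the talk does not specify this particular grouping):

```python
from collections import defaultdict

# Hypothetical word -> class map; in practice such classes might come from
# automatic clustering. These assignments are invented for illustration.
word_class = {"cat": "ANIMAL", "dog": "ANIMAL", "ate": "VERB", "ran": "VERB"}

def group_key(bigram):
    """A-priori grouping: bigram features with the same class pattern are
    expected to have similar lambdas, so they are merged into one feature."""
    w1, w2 = bigram
    return (word_class.get(w1, w1), word_class.get(w2, w2))

def shrink_features(bigram_features):
    """Replace each group of bigram features with a single summed feature;
    the new feature's value is the sum of its members' values."""
    groups = defaultdict(list)
    for bg in bigram_features:
        groups[group_key(bg)].append(bg)
    return dict(groups)

features = [("cat", "ate"), ("dog", "ate"), ("cat", "ran")]
print(shrink_features(features))
```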