Performance Prediction and Shrinking Language Models

Stanley F. Chen†
IBM T.J. Watson Research Center
Yorktown Heights, New York, USA

27 June 2011

† Joint work with Stephen Chu, Ahmad Emami, Lidia Mangu, Bhuvana Ramabhadran, Ruhi Sarikaya, and Abhinav Sethy.
What Does a Good Model Look Like?

(test error) ≡ (training error) + (overfit)
Overfitting: Theory

- E.g., Akaike Information Criterion (1973):
    −(test LL) ≈ −(train LL) + (# params)
- E.g., structural risk minimization (Vapnik, 1974):
    (test err) ≤ (train err) + f(VC dimension)

Down with big models!?
The Big Idea

Maybe overfit doesn’t act like we think it does.
Let’s try to fit overfit empirically.
What This Talk Is About

- An empirical estimate of the overfit in log likelihood of . . .
- Exponential language models . . .
- That is really simple and works really well.
- Why it works.
- What you can do with it.
Outline

1. Introduction
2. Finding an Empirical Law for Overfitting
3. Regularization
4. Why Does the Law Hold?
5. Things You Can Do With It
6. Discussion
Exponential N-Gram Language Models

- Language model: predict the next word given the previous, say, two words.
    P(y = ate | x = the cat)
- Log-linear model: features f_i(·); parameters λ_i.
    P(y | x) = exp(Σ_i λ_i f_i(x, y)) / Z_Λ(x)
- A binary feature f_i(·) for each n-gram in the training set.
- An alternative parameterization of back-off n-gram models.
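A minimal sketch of how such a log-linear model assigns probabilities may help make this concrete; the feature templates, vocabulary, and weights below are made up for illustration and are not from the talk:

```python
import math
from collections import defaultdict

def ngram_features(history, word):
    """Binary n-gram features active for (history, word): unigram, bigram, trigram."""
    feats = [("1g", word)]
    if len(history) >= 1:
        feats.append(("2g", history[-1], word))
    if len(history) >= 2:
        feats.append(("3g", history[-2], history[-1], word))
    return feats

def prob(history, word, vocab, lam):
    """P(y | x) = exp(sum_i lambda_i f_i(x, y)) / Z_Lambda(x)."""
    def score(w):
        return sum(lam.get(f, 0.0) for f in ngram_features(history, w))
    z = sum(math.exp(score(w)) for w in vocab)  # normalizer Z_Lambda(x)
    return math.exp(score(word)) / z

# Toy example: P(y = ate | x = the cat) with two illustrative weights.
vocab = ["ate", "sat", "ran"]
lam = defaultdict(float, {("3g", "the", "cat", "ate"): 2.0, ("1g", "sat"): 0.5})
print(prob(["the", "cat"], "ate", vocab, lam))
```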
Details: Regression

- Build hundreds of (regularized!) language models.
- Compute the actual overfit in log likelihood (LL) per event (= log PP).
- Calculate lots of statistics for each model.
    F = # parameters; D = # training events.
    F/D;  (F log D)/D;  (1/D) Σ_i |λ_i|;  (1/D) Σ_i λ_i²;  (1/D) Σ_i |λ_i|³;  . . .
- Do linear regression!
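A rough sketch of the regression setup; the model statistics and measured overfits below are synthetic stand-ins for the hundreds of real models built in the talk:

```python
import numpy as np

rng = np.random.default_rng(0)

# One row per trained model; columns are candidate statistics such as
# F/D, (F log D)/D, (1/D) sum|lambda_i|, (1/D) sum lambda_i^2, ...
num_models, num_stats = 200, 4
stats = rng.uniform(0.0, 2.0, size=(num_models, num_stats))

# Synthetic "measured" overfit (LL_test - LL_train per event) for illustration.
actual_overfit = stats @ np.array([0.0, 0.0, 0.94, 0.0]) + rng.normal(0.0, 0.05, num_models)

# Ordinary least squares: fit overfit as a linear function of the statistics.
coefs, *_ = np.linalg.lstsq(stats, actual_overfit, rcond=None)
predicted = stats @ coefs

r = np.corrcoef(predicted, actual_overfit)[0, 1]
print("regression coefficients:", coefs)
print("correlation r =", round(r, 4))
```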
What Doesn’t Work? AIC-like Prediction

(overfit) ≡ LL_test − LL_train ≈ γ · (# params) / (# train evs)

[Scatter plot of predicted vs. actual overfit.]
What Doesn’t Work? BIC-like Prediction

LL_test − LL_train ≈ γ · (# params) · log(# train evs) / (# train evs)

[Scatter plot of predicted vs. actual overfit.]
What Does Work? (r = 0.9996)

LL_test − LL_train ≈ (γ / (# train evs)) Σ_{i=1}^{F} |λ_i|

[Scatter plot of predicted vs. actual overfit.]
γ = 0.938

- Holds for many different types of data.
    Different domains (e.g., Wall Street Journal, . . . ).
    Different token types (letters, parts-of-speech, words).
    Different vocabulary sizes (27–84,000 words).
    Different training set sizes (100–100,000 sentences).
    Different n-gram orders (2–7).
- Holds for many different types of exponential models.
    Word n-gram models; class-based n-gram models; minimum discrimination information models.
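A small sketch of the law used as a predictor: given only a trained model and its training log PP, estimate its test log PP (the weights and counts below are placeholders):

```python
GAMMA = 0.938  # empirical constant from the regression above

def predicted_overfit(lambdas, num_train_events):
    """Predicted LL_test - LL_train per event: (gamma / D) * sum_i |lambda_i|."""
    return GAMMA * sum(abs(l) for l in lambdas) / num_train_events

# Placeholder values for illustration only:
lambdas = [0.7, -1.2, 2.0, 0.1]   # trained feature weights
D = 100_000                       # number of training events
train_log_pp = 6.5                # measured training log PP per event
print("predicted test log PP:", train_log_pp + predicted_overfit(lambdas, D))
```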
What About Other Languages?

LL_test − LL_train ≈ (0.938 / (# train evs)) Σ_{i=1}^{F} |λ_i|

[Scatter plot of predicted vs. actual overfit for Iraqi, Spanish, German, and Turkish models.]
What About Genetic Data?

LL_test − LL_train ≈ (0.938 / (# train evs)) Σ_{i=1}^{F} |λ_i|

[Scatter plot of predicted vs. actual overfit for rice, chicken, and human genetic data.]
Outline

1. Introduction
2. Finding an Empirical Law for Overfitting
3. Regularization
4. Why Does the Law Hold?
5. Things You Can Do With It
6. Discussion
Regularization

- Improves test set performance.
- ℓ₁, ℓ₂², and ℓ₁+ℓ₂² regularization: choose the λ_i to minimize

    (obj fn) ≡ LL_train + α Σ_{i=1}^{F} |λ_i| + (1/(2σ²)) Σ_{i=1}^{F} λ_i²

- The problem: γ depends on α, σ!
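A sketch of the ℓ₁+ℓ₂² objective as one might hand it to a numerical optimizer; the training loss is stubbed out, and α = 0.5, σ² = 6 are the single settings quoted on the next slide:

```python
import numpy as np

def l1_l2_objective(lambdas, train_loss, alpha=0.5, sigma2=6.0):
    """Regularized objective: LL_train + alpha * sum|lambda_i| + sum(lambda_i^2) / (2 sigma^2).

    `train_loss(lambdas)` is assumed to return the training log PP for these
    weights (a real implementation would evaluate the language model here).
    """
    lambdas = np.asarray(lambdas, dtype=float)
    l1 = alpha * np.sum(np.abs(lambdas))
    l2 = np.sum(lambdas ** 2) / (2.0 * sigma2)
    return train_loss(lambdas) + l1 + l2

# Toy usage with a stand-in quadratic loss instead of a real LM likelihood:
fake_loss = lambda lam: float(np.sum((lam - 1.0) ** 2))
print(l1_l2_objective([0.5, 1.5, 2.0], fake_loss))
```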
Regularization: Two Criteria

- Here: pick a single α, σ across all models.
- Usual way: pick α, σ per model for good performance.
- Good performance and good overfit prediction?

              performance    overfit prediction
    ℓ₁                              ✓
    ℓ₂²            ✓
    ℓ₁+ℓ₂²         ✓                ✓

- ℓ₁+ℓ₂² (α = 0.5, σ² = 6) is as good as the best n-gram smoothing.
The Law and ℓ₁+ℓ₂² Regularization

LL_test − LL_train ≈ (0.938 / (# train evs)) Σ_{i=1}^{F} |λ_i|

[Scatter plot of predicted vs. actual overfit.]
The Law and ℓ₂² Regularization

LL_test − LL_train ≈ (0.882 / (# train evs)) Σ_{i=1}^{F} |λ_i|

[Scatter plot of predicted vs. actual overfit.]
Outline

1. Introduction
2. Finding an Empirical Law for Overfitting
3. Regularization
4. Why Does the Law Hold?
5. Things You Can Do With It
6. Discussion
Why Exponential Models Are Special

Do some math (and include normalization features):

    LL_test − LL_train = (1 / (# train evs)) Σ_{i=1}^{F′} λ_i × (discount of f_i(·))

Compare this to The Law:

    LL_test − LL_train ≈ (1 / (# train evs)) Σ_{i=1}^{F} |λ_i| × 0.938

If only . . . (discount of f_i(·)) ≈ 0.938 × sgn λ_i
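A sketch of the algebra behind the exact identity, under the slide's conventions: LL denotes log PP per event, the normalization term is absorbed into the F′ features, and the discount of f_i is its training count minus its length-scaled test count:

```latex
% Once normalization is expressed via features, per-event log PP is linear in
% the feature counts c_i:  LL = -(1/D) \sum_{i=1}^{F'} \lambda_i c_i .  Hence
\mathrm{LL}_{\mathrm{test}} - \mathrm{LL}_{\mathrm{train}}
  = \sum_{i=1}^{F'} \lambda_i
    \left( \frac{c_i^{\mathrm{train}}}{D_{\mathrm{train}}}
         - \frac{c_i^{\mathrm{test}}}{D_{\mathrm{test}}} \right)
  = \frac{1}{D_{\mathrm{train}}} \sum_{i=1}^{F'} \lambda_i
    \underbrace{\left( c_i^{\mathrm{train}}
         - \tfrac{D_{\mathrm{train}}}{D_{\mathrm{test}}}\, c_i^{\mathrm{test}} \right)}_{\text{discount of } f_i}
```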
What Are Discounts?

- How many times fewer an n-gram occurs in the test set . . .
- Compared to the training set (of equal length).
- Studied extensively in language model smoothing.
- Let’s look at the data.
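A small counting sketch of how discounts can be measured directly from data; the toy text is a placeholder:

```python
from collections import Counter

def ngram_counts(tokens, n):
    """Count all n-grams of order n in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def discounts(train_tokens, test_tokens, n):
    """Discount of each training n-gram: train count minus length-scaled test count."""
    train_c = ngram_counts(train_tokens, n)
    test_c = ngram_counts(test_tokens, n)
    scale = len(train_tokens) / len(test_tokens)  # rescale test set to equal length
    return {g: c - scale * test_c.get(g, 0) for g, c in train_c.items()}

# Toy usage with placeholder text:
train = "the cat ate the mouse and the cat sat".split()
test = "the cat ate the cheese".split()
print(discounts(train, test, n=2))
```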
Smoothed Discount Per Feature

(discount of f_i(·)) ≈ 0.938 × sgn λ_i ?

[Plot of smoothed discount vs. λ for very sparse, sparse, less sparse, and dense models.]
Why The Law Holds More Than It Should

- Sparse models all act alike.
- Dense models don’t overfit much.

    LL_test − LL_train ≈ (0.938 / (# train evs)) Σ_{i=1}^{F} |λ_i|
Outline

1. Introduction
2. Finding an Empirical Law for Overfitting
3. Regularization
4. Why Does the Law Hold?
5. Things You Can Do With It
6. Discussion
Explain Things

- Why backoff features help.
- Why word class features help.
- Why domain adaptation helps.
- Why increasing n doesn’t hurt.
- Why relative performance differences shrink with more data.
Make Models Better

(test error) ≈ (training error) + (overfit)

Decrease overfit ⇒ decrease test error.
Reducing Overfitting

    (overfit) ≈ (0.938 / (# train evs)) Σ_{i=1}^{F} |λ_i|

- In practice, the number of features doesn’t matter!
- More features lead to less overfitting . . .
- If the sum of parameter magnitudes Σ_i |λ_i| decreases!
A Method for Reducing Overfitting

Before: λ_1 = λ_2 = 2.

    P_before(y | x) = exp(2·f_1(x, y) + 2·f_2(x, y)) / Z_Λ(x)

After: λ_1 = λ_2 = 0, λ_3 = 2, f_3(x, y) = f_1(x, y) + f_2(x, y).

    P_after(y | x) = exp(2·f_3(x, y)) / Z_Λ(x) = exp(2·f_1(x, y) + 2·f_2(x, y)) / Z_Λ(x)

Same distribution, but Σ_i |λ_i| drops from 4 to 2, so the predicted overfit is halved.
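A tiny numeric check of the trick; the feature values and weights are illustrative, not from any real model:

```python
import math

def softmax_prob(scores, y):
    """P(y | x) given the per-candidate scores sum_i lambda_i f_i(x, y)."""
    z = sum(math.exp(s) for s in scores.values())
    return math.exp(scores[y]) / z

# Illustrative feature values f_1, f_2 over three candidate words y.
f1 = {"ate": 1, "sat": 0, "ran": 0}
f2 = {"ate": 1, "sat": 1, "ran": 0}

# Before: two features, each with weight 2.
before = {y: 2 * f1[y] + 2 * f2[y] for y in f1}

# After: one summed feature f_3 = f_1 + f_2, with weight 2.
f3 = {y: f1[y] + f2[y] for y in f1}
after = {y: 2 * f3[y] for y in f1}

print(softmax_prob(before, "ate"), softmax_prob(after, "ate"))  # identical distributions
print("sum |lambda| before:", 2 + 2, "after:", 2)               # predicted overfit halves
```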
What’s the Catch? (Part I)

Same test set performance?
Re-regularize the model: it improves performance even more!

    (obj fn) ≡ LL_train + α Σ_{i=1}^{F} |λ_i| + (1/(2σ²)) Σ_{i=1}^{F} λ_i²
What’s the Catch? (Part II)

Select features to sum in hindsight?
When you sum features, you sum their discounts!

    LL_test − LL_train = (1 / (# train evs)) Σ_{i=1}^{F′} λ_i × (discount of f_i(·))

Need to pick features to sum a priori!
Heuristic 1: Improving Model Performance

- Identify, a priori, features expected to have similar λ_i.
- Create a new feature that is the sum of the original features (a sketch follows below).
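One way the heuristic might look in code, assuming a hypothetical word-to-class map as the a-priori grouping criterion (the map and features below are made up; the talk does not specify this particular grouping):

```python
from collections import defaultdict

# Hypothetical word -> class map; in practice such classes might come from
# automatic clustering. These assignments are invented for illustration.
word_class = {"cat": "ANIMAL", "dog": "ANIMAL", "ate": "VERB", "ran": "VERB"}

def group_key(bigram):
    """A-priori grouping: bigram features with the same class pattern are
    expected to have similar lambdas, so they are merged into one feature."""
    w1, w2 = bigram
    return (word_class.get(w1, w1), word_class.get(w2, w2))

def shrink_features(bigram_features):
    """Replace each group of bigram features with a single summed feature;
    the new feature's value is the sum of its members' values."""
    groups = defaultdict(list)
    for bg in bigram_features:
        groups[group_key(bg)].append(bg)
    return dict(groups)

features = [("cat", "ate"), ("dog", "ate"), ("cat", "ran")]
print(shrink_features(features))
```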