Performance Prediction and Shrinking Language Models


  1. Performance Prediction and Shrinking Language Models. Stanley F. Chen†, IBM T.J. Watson Research Center, Yorktown Heights, New York, USA. 27 June 2011. †Joint work with Stephen Chu, Ahmad Emami, Lidia Mangu, Bhuvana Ramabhadran, Ruhi Sarikaya, and Abhinav Sethy.

  2. What Does a Good Model Look Like?
     (test error) ≡ (training error) + (overfit)

  3. Overfitting: Theory
     e.g., the Akaike Information Criterion (1973): −(test LL) ≈ −(train LL) + (# params)
     e.g., structural risk minimization (Vapnik, 1974): (test err) ≤ (train err) + f(VC dimension)
     Down with big models!?

  4. The Big Idea
     Maybe overfit doesn't act like we think it does.
     Let's try to fit overfit empirically.

  5. What This Talk Is About
     An empirical estimate of the overfit in log likelihood of . . .
     exponential language models . . .
     that is really simple and works really well.
     Why it works.
     What you can do with it.

  6. Outline
     1. Introduction
     2. Finding an Empirical Law for Overfitting
     3. Regularization
     4. Why Does the Law Hold?
     5. Things You Can Do With It
     6. Discussion

  7. Exponential N-Gram Language Models
     Language model: predict the next word given the previous, say, two words: P(y = ate | x = the cat).
     Log-linear model with features f_i(·) and parameters λ_i:
         P(y | x) = exp(Σ_i λ_i f_i(x, y)) / Z_Λ(x)
     A binary feature f_i(·) for each n-gram in the training set.
     An alternative parameterization of back-off n-gram models.
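
A minimal Python sketch of such a log-linear n-gram model (the features, weights, and vocabulary below are toy values, not from the talk):

```python
import math

def log_prob(y, x, features, lam, vocab):
    """Log-linear model: P(y | x) = exp(sum_i lam_i * f_i(x, y)) / Z_Lambda(x).
    `features(x, y)` returns the indices of the binary n-gram features that fire."""
    score = lambda w: sum(lam.get(i, 0.0) for i in features(x, w))
    log_z = math.log(sum(math.exp(score(w)) for w in vocab))  # normalizer Z_Lambda(x)
    return score(y) - log_z

# Toy setup: unigram, bigram, and trigram indicator features for one history.
vocab = ["ate", "sat", "ran"]
feat_index = {("ate",): 0, ("cat", "ate"): 1, ("the", "cat", "ate"): 2}
lam = {0: 0.3, 1: 0.9, 2: 1.4}

def features(x, y):
    """All suffixes of (history + y) that are known n-gram features."""
    ngram = tuple(x) + (y,)
    return [feat_index[ngram[k:]] for k in range(len(ngram)) if ngram[k:] in feat_index]

print(math.exp(log_prob("ate", ("the", "cat"), features, lam, vocab)))  # P(ate | the cat)
```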

  8. Details: Regression
     Build hundreds of (regularized!) language models.
     Compute the actual overfit in log likelihood (LL) per event (LL per event = log PP).
     Calculate lots of statistics for each model (F = # parameters; D = # training events):
         F/D;  (F/D) log(D/F);  (1/D) Σ_i λ_i;  (1/D) Σ_i λ_i²;  (1/D) Σ_i |λ_i|³;  . . .
     Do linear regression!
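
A sketch of the regression step in Python (the per-model statistics and overfit values below are placeholders, not numbers from the talk):

```python
import numpy as np

# One row per trained language model; columns are candidate statistics for that
# model, e.g. F/D, (F/D)*log(D/F), (1/D)*sum_i |lambda_i|, ...  (toy values).
X = np.array([
    [0.12, 0.25, 0.80],
    [0.30, 0.36, 1.90],
    [0.05, 0.15, 0.40],
    [0.22, 0.33, 1.40],
])
# Actual overfit for each model: LL_test - LL_train per event (toy values).
y = np.array([0.78, 1.85, 0.41, 1.32])

# Ordinary least squares: find weights w so that X @ w approximates y.
w, residuals, rank, _ = np.linalg.lstsq(X, y, rcond=None)
print("regression weights:", w)
print("predicted overfit:", X @ w)
```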

  9. What Doesn't Work? AIC-like Prediction
     (overfit) ≡ LL_test − LL_train ≈ γ (# params) / (# train evs)
     [Scatter plot: predicted vs. actual overfit]

  10. What Doesn't Work? BIC-like Prediction
     LL_test − LL_train ≈ γ (# params) log(# train evs) / (# train evs)
     [Scatter plot: predicted vs. actual overfit]

  11. What Does Work? (r = 0.9996)
     LL_test − LL_train ≈ (γ / (# train evs)) Σ_{i=1}^F |λ_i|
     [Scatter plot: predicted vs. actual overfit]
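
A small Python sketch of applying the law as a predictor (the parameter values and training log PP are made up for illustration; γ = 0.938 is the value quoted on the next slide):

```python
import numpy as np

def predicted_overfit(lambdas, num_train_events, gamma=0.938):
    """Empirical law: overfit per event ~ (gamma / D) * sum_i |lambda_i|."""
    return gamma * np.abs(lambdas).sum() / num_train_events

# Hypothetical model: 50k parameters trained on 1M events.
rng = np.random.default_rng(0)
lambdas = rng.laplace(scale=0.5, size=50_000)   # stand-in parameter values
train_log_pp = 4.9                              # stand-in training log PP per event
overfit = predicted_overfit(lambdas, num_train_events=1_000_000)
print("predicted test log PP per event:", train_log_pp + overfit)
```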

  12. γ = 0.938
     Holds for many different types of data:
         Different domains (e.g., Wall Street Journal, . . . ).
         Different token types (letters, parts of speech, words).
         Different vocabulary sizes (27–84,000 words).
         Different training set sizes (100–100,000 sentences).
         Different n-gram orders (2–7).
     Holds for many different types of exponential models:
         Word n-gram models; class-based n-gram models; minimum discrimination information models.

  13. What About Other Languages?
     LL_test − LL_train ≈ (0.938 / (# train evs)) Σ_{i=1}^F |λ_i|
     [Scatter plot: predicted vs. actual overfit for Iraqi, Spanish, German, and Turkish data]

  14. What About Genetic Data?
     LL_test − LL_train ≈ (0.938 / (# train evs)) Σ_{i=1}^F |λ_i|
     [Scatter plot: predicted vs. actual overfit for rice, chicken, and human data]

  15. Outline
     1. Introduction
     2. Finding an Empirical Law for Overfitting
     3. Regularization
     4. Why Does the Law Hold?
     5. Things You Can Do With It
     6. Discussion

  16. Regularization
     Improves test set performance.
     ℓ1, ℓ2², and ℓ1 + ℓ2² regularization: choose the λ_i to minimize
         (obj fn) ≡ LL_train + α Σ_{i=1}^F |λ_i| + (1 / (2σ²)) Σ_{i=1}^F λ_i²
     The problem: γ depends on α, σ!
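
A minimal sketch of this ℓ1 + ℓ2² penalized objective in Python (the parameter values are toy numbers; α = 0.5 and σ² = 6 are the setting quoted on the next slide):

```python
import numpy as np

def penalized_objective(train_ll, lambdas, alpha=0.5, sigma2=6.0):
    """obj = LL_train + alpha * sum_i |lambda_i| + (1 / (2 * sigma^2)) * sum_i lambda_i^2,
    where train_ll is the training log PP term being minimized."""
    lam = np.asarray(lambdas, dtype=float)
    return train_ll + alpha * np.abs(lam).sum() + (lam ** 2).sum() / (2.0 * sigma2)

print(penalized_objective(4.7, [0.3, -1.2, 0.9]))  # toy numbers
```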

  17. Regularization: Two Criteria
     Here: pick a single α, σ across all models. Usual way: pick α, σ per model for good performance.
     Good performance and good overfit prediction?
                       performance    overfit prediction
         ℓ1                                   ✓
         ℓ2²                ✓
         ℓ1 + ℓ2²           ✓                 ✓
     ℓ1 + ℓ2² with (α = 0.5, σ² = 6) is as good as the best n-gram smoothing.

  18. The Law and ℓ1 + ℓ2² Regularization
     LL_test − LL_train ≈ (0.938 / (# train evs)) Σ_{i=1}^F |λ_i|
     [Scatter plot: predicted vs. actual overfit]

  19. The Law and ℓ2² Regularization
     LL_test − LL_train ≈ (0.882 / (# train evs)) Σ_{i=1}^F |λ_i|
     [Scatter plot: predicted vs. actual overfit]

  20. Outline
     1. Introduction
     2. Finding an Empirical Law for Overfitting
     3. Regularization
     4. Why Does the Law Hold?
     5. Things You Can Do With It
     6. Discussion

  21. Why Exponential Models Are Special
     Do some math (and include normalization features):
         LL_test − LL_train = (1 / (# train evs)) Σ_{i=1}^{F′} λ_i × (discount of f_i(·))
     Compare this to The Law:
         LL_test − LL_train ≈ (1 / (# train evs)) Σ_{i=1}^F |λ_i| × 0.938
     If only . . .
         (discount of f_i(·)) ≈ 0.938 × sgn λ_i
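
A sketch of the algebra behind the exact identity, under the assumptions that LL here denotes log perplexity per event and that the normalizer is absorbed into extra "normalization features" (as the slide stipulates), so that −log P(y | x) is linear in the features:

```latex
% With -log P(y|x) = -\sum_i \lambda_i f_i(x, y), averaging over test and
% training events and subtracting gives
\begin{align*}
\mathrm{LL}_{\mathrm{test}} - \mathrm{LL}_{\mathrm{train}}
  &= \sum_i \lambda_i \Bigl( \tfrac{1}{D_{\mathrm{train}}} \textstyle\sum_{\mathrm{train}} f_i
       - \tfrac{1}{D_{\mathrm{test}}} \textstyle\sum_{\mathrm{test}} f_i \Bigr) \\
  &= \frac{1}{D_{\mathrm{train}}} \sum_i \lambda_i
     \underbrace{\Bigl( c_{\mathrm{train}}(f_i)
       - \tfrac{D_{\mathrm{train}}}{D_{\mathrm{test}}}\, c_{\mathrm{test}}(f_i) \Bigr)}_{\text{discount of } f_i(\cdot)}
\end{align*}
% The bracketed term is exactly the next slide's "discount": how many fewer
% times f_i occurs in a test set scaled to the training set's size.
```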

  22. What Are Discounts?
     How many fewer times an n-gram occurs in the test set . . .
     compared to the training set (of equal length).
     Studied extensively in language model smoothing.
     Let's look at the data.
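
A small Python sketch of computing these discounts empirically (the toy corpora are made up; real experiments would use counts from the actual training and test sets):

```python
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of order n in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def empirical_discounts(train_tokens, test_tokens, n=2):
    """Discount of each training n-gram: how many fewer times it occurs in a
    test set scaled to the training set's length."""
    train_counts = Counter(ngrams(train_tokens, n))
    test_counts = Counter(ngrams(test_tokens, n))
    scale = len(train_tokens) / max(len(test_tokens), 1)
    return {g: c - scale * test_counts[g] for g, c in train_counts.items()}

train = "the cat ate the cat sat the dog ran".split()
test = "the cat ate the dog sat".split()
print(empirical_discounts(train, test, n=2))
```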

  23. Smoothed Discount Per Feature
     (discount of f_i(·)) ≈ 0.938 × sgn λ_i ?
     [Plot: smoothed discount vs. λ for very sparse, sparse, less sparse, and dense models]

  24. Why The Law Holds More Than It Should
     Sparse models all act alike.
     Dense models don't overfit much.
     LL_test − LL_train ≈ (0.938 / (# train evs)) Σ_{i=1}^F |λ_i|

  25. Outline
     1. Introduction
     2. Finding an Empirical Law for Overfitting
     3. Regularization
     4. Why Does the Law Hold?
     5. Things You Can Do With It
     6. Discussion

  26. Explain Things
     Why backoff features help.
     Why word class features help.
     Why domain adaptation helps.
     Why increasing n doesn't hurt.
     Why relative performance differences shrink with more data.

  27. Make Models Better
     (test error) ≈ (training error) + (overfit)
     Decrease overfit ⇒ decrease test error.

  28. Reducing Overfitting
     (overfit) ≈ (0.938 / (# train evs)) Σ_{i=1}^F |λ_i|
     In practice, the number of features matters not!
     More features lead to less overfitting . . . if the sum of parameter magnitudes Σ_i |λ_i| decreases!

  29. A Method for Reducing Overfitting
     Before: λ_1 = λ_2 = 2.
         P_before(y | x) = exp(2 · f_1(x, y) + 2 · f_2(x, y)) / Z_Λ(x)
     After: λ_1 = λ_2 = 0, λ_3 = 2, with f_3(x, y) = f_1(x, y) + f_2(x, y).
         P_after(y | x) = exp(2 · f_3(x, y)) / Z_Λ(x) = exp(2 · f_1(x, y) + 2 · f_2(x, y)) / Z_Λ(x)
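
A toy Python check of this trick (hypothetical features, not the ones from the experiments): merging two features with equal weight leaves every probability unchanged while halving Σ_i |λ_i|, which is what shrinks the predicted overfit:

```python
import math

def prob(y, x, feats_and_lams, vocab):
    """P(y | x) for a log-linear model given (feature_fn, lambda) pairs."""
    score = lambda w: sum(lam * f(x, w) for f, lam in feats_and_lams)
    z = sum(math.exp(score(w)) for w in vocab)
    return math.exp(score(y)) / z

f1 = lambda x, y: 1 if (x[-1], y) == ("cat", "ate") else 0  # bigram feature
f2 = lambda x, y: 1 if y == "ate" else 0                    # unigram feature
f3 = lambda x, y: f1(x, y) + f2(x, y)                       # merged feature

vocab = ["ate", "sat", "ran"]
before = [(f1, 2.0), (f2, 2.0)]   # sum of |lambda_i| = 4
after = [(f3, 2.0)]               # sum of |lambda_i| = 2, same distribution
print(prob("ate", ("the", "cat"), before, vocab))
print(prob("ate", ("the", "cat"), after, vocab))  # identical value
```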

  30. What's the Catch? (Part I)
     Same test set performance?
     Re-regularizing the model improves performance further!
         (obj fn) ≡ LL_train + α Σ_{i=1}^F |λ_i| + (1 / (2σ²)) Σ_{i=1}^F λ_i²

  31. What's the Catch? (Part II)
     Select the features to sum in hindsight?
     When you sum features, you also sum their discounts!
         LL_test − LL_train = (1 / (# train evs)) Σ_{i=1}^{F′} λ_i × (discount of f_i(·))
     Need to pick the features to sum a priori!

  32. Heuristic 1: Improving Model Performance
     Identify, a priori, features with similar λ_i.
     Create a new feature that is the sum of the original features.
