An Investigation of Why Overparameterization Exacerbates Spurious Correlations



  1. An Investigation of Why Overparameterization Exacerbates Spurious Correlations. Shiori Sagawa*, Aditi Raghunathan*, Pang Wei Koh*, Percy Liang

  2. Models can latch onto spurious correlations: misleading heuristics that work on most training examples but may not always hold up. Input: bird image → ML model → label: bird type (waterbird vs. landbird). Sagawa et al. (2020), Wah et al. (2011), Zhou et al. (2017)

  3. Models can latch onto spurious correlations. Input: bird image (spurious correlation: water background) → ML model → prediction: waterbird; true label: waterbird ✓. Sagawa et al. (2020), Wah et al. (2011), Zhou et al. (2017)

  4. Models can latch onto spurious correlations. Input: bird image (spurious correlation: land background) → ML model → prediction: landbird; true label: waterbird ✕. Sagawa et al. (2020), Wah et al. (2011), Zhou et al. (2017)

  5. Models can latch onto spurious correlations. Input: face image → ML model → label: hair color (blonde hair vs. dark hair). Sagawa et al. (2020), Liu et al. (2015)

  6. Models can latch onto spurious correlations. Input: face image (spurious correlation: gender) → ML model → prediction: dark hair; true label: blonde hair ✕. Sagawa et al. (2020), Liu et al. (2015)

  7. Models can latch onto spurious correlations. Groups are defined by the label (object: waterbird vs. landbird) and the spurious attribute (background):
                           waterbird    landbird
        water background   majority     minority
        land background    minority     majority
     Sagawa et al. (2020)

  8. Models perform well on average. Test error by group (label: object; spurious attribute: background):
                           waterbird    landbird
        water background   0.05         0.21
        land background    0.40         0.004
     Average error: 0.03. Sagawa et al. (2020)

  9. But models can have high worst-group error. In the same table, the worst group (waterbirds on a land background) has error 0.40, so the worst-group error is 0.40 even though the average error is 0.03. Sagawa et al. (2020)
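Below is a minimal sketch (not the authors' code) of how average vs. worst-group error can be computed from per-example predictions and group ids; the data here are made up purely for illustration.

```python
import numpy as np

def average_and_worst_group_error(y_true, y_pred, groups):
    """Average error plus the maximum error over groups.
    `groups` holds one group id per example, e.g. (label, background)."""
    errors = (np.asarray(y_true) != np.asarray(y_pred)).astype(float)
    groups = np.asarray(groups)
    per_group = {g: errors[groups == g].mean() for g in np.unique(groups)}
    return errors.mean(), max(per_group.values()), per_group

# Made-up illustration: group 3 is a small minority group the model always gets wrong.
rng = np.random.default_rng(0)
groups = rng.choice([0, 1, 2, 3], size=1000, p=[0.45, 0.45, 0.05, 0.05])
y_true = rng.integers(0, 2, size=1000)
y_pred = np.where(groups == 3, 1 - y_true, y_true)
avg, worst, _ = average_and_worst_group_error(y_true, y_pred, groups)
print(avg, worst)  # low average error, but worst-group error = 1.0
```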

  10. Approaches for improving worst-group error fail on high-capacity models. Upweighting the minority groups helps low-capacity models, which become more robust to the spurious correlation and achieve low worst-group error; high-capacity models still rely on the spurious correlation and have high worst-group error. [Figure: per-group correctness (label y vs. attribute a) and average vs. worst-group error for low- and high-capacity models.] Sagawa et al. (2020)
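A minimal sketch of the kind of reweighted objective referred to here: each example is weighted by the inverse frequency of its group, so minority groups contribute as much to the training loss as majority groups. This is an illustrative stand-in, not the authors' training code.

```python
import numpy as np

def group_weights(groups):
    """Per-example weights inversely proportional to group frequency."""
    values, counts = np.unique(groups, return_counts=True)
    inv_freq = dict(zip(values, len(groups) / counts))
    return np.array([inv_freq[g] for g in groups])

def reweighted_logistic_loss(w, X, y, groups):
    """Importance-weighted logistic loss for a linear model; y in {-1, +1}."""
    margins = y * (X @ w)
    return float(np.mean(group_weights(groups) * np.log1p(np.exp(-margins))))
```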

  11. Overparameterization hurts worst-group error for models trained with the reweighted objective. For average error, overparameterized is better than underparameterized; for worst-group error, overparameterized is worse than underparameterized. Our work: why does overparameterization exacerbate worst-group error?

  12. Overview 1. Empirical results 2. Analytical model and theoretical results 3. Subsampling

  13. Overparameterization exacerbates worst-group error. [Plots: worst-group error vs. model size for ResNet10 and for logistic regression on random features.]
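A rough sketch of a random-features experiment in the spirit of this slide: reweighted logistic regression on fixed random ReLU features, sweeping the number of features to move from under- to overparameterized. The feature scaling, regularization strength, and widths are assumptions, not the authors' exact setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def worst_group_error(y, y_hat, groups):
    return max(float((y_hat[groups == g] != y[groups == g]).mean())
               for g in np.unique(groups))

def sweep_random_features(X_tr, y_tr, g_tr, X_te, y_te, g_te,
                          widths=(10, 100, 1000, 10000), seed=0):
    """Reweighted logistic regression on random ReLU features of width m;
    larger m moves the model from under- to overparameterized relative to
    the number of training points."""
    rng = np.random.default_rng(seed)
    # Upweight each training example by the inverse frequency of its group.
    vals, counts = np.unique(g_tr, return_counts=True)
    inv_freq = dict(zip(vals, len(g_tr) / counts))
    sample_weight = np.array([inv_freq[g] for g in g_tr])

    results = {}
    for m in widths:
        W = rng.normal(size=(X_tr.shape[1], m)) / np.sqrt(X_tr.shape[1])
        featurize = lambda X, W=W: np.maximum(X @ W, 0.0)  # fixed random ReLU features
        clf = LogisticRegression(C=1e6, max_iter=5000)      # weak regularization
        clf.fit(featurize(X_tr), y_tr, sample_weight=sample_weight)
        results[m] = worst_group_error(y_te, clf.predict(featurize(X_te)), g_te)
    return results
```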

  14. Intuition: overparameterized models learn the spurious attribute and memorize the minority groups. Learning the spurious attribute generalizes on the majority groups (y = a), while "memorizing" the minority groups (y ≠ a) does not generalize. [Figure: the four groups (y, a) ∈ {(1, 1), (−1, −1)} (majority) and {(1, −1), (−1, 1)} (minority).]

  15. Overview 1. Empirical results 2. Analytical model and theoretical results 3. Subsampling

  16. Toy example: data. Groups are defined by the label y ∈ {1, −1} and the spurious attribute a ∈ {1, −1}; groups with a = y form the majority (their share is the majority fraction), and groups with a ≠ y form the minority.

  17. Toy example: data. Each input has a core feature generated from the label y and a spurious feature generated from the attribute a; the spurious-to-core information ratio (SCR) measures how much less noisy the spurious feature is than the core feature.

  18. Toy example: data. Each input also has N noise features; for large N >> n (the number of training points), individual examples can be "memorized" via their noise coordinates.
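A minimal sketch of a synthetic dataset in the spirit of slides 16–18: one core feature generated from the label y, one spurious feature generated from the attribute a (less noisy than the core when the SCR is above 1), and N noise coordinates that allow individual points to be memorized when N >> n. The dimensions and variances are illustrative assumptions, not the paper's exact parameters.

```python
import numpy as np

def make_toy_data(n=1000, p_maj=0.95, scr=4.0, sigma_core=1.0,
                  noise_dim=3000, seed=0):
    """Toy data with columns [core, spurious, noise_1, ..., noise_N].

    - y in {-1, +1} uniform; a = y with probability p_maj (majority groups).
    - core ~ N(y, sigma_core^2): informative but noisy.
    - spurious ~ N(a, sigma_core^2 / scr): less noisy when scr > 1.
    - noise ~ N(0, I / noise_dim): lets an overparameterized model memorize
      individual examples when noise_dim >> n.
    """
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)
    a = np.where(rng.random(n) < p_maj, y, -y)
    core = y + sigma_core * rng.normal(size=n)
    spurious = a + (sigma_core / np.sqrt(scr)) * rng.normal(size=n)
    noise = rng.normal(size=(n, noise_dim)) / np.sqrt(noise_dim)
    X = np.column_stack([core, spurious, noise])
    groups = 2 * (y == 1).astype(int) + (a == 1).astype(int)  # four (y, a) groups
    return X, y, a, groups
```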

  19. Toy example: linear classifier model. The model is logistic regression on the core, spurious, and noise features; in the overparameterized regime, training it to convergence is equivalent to the max-margin classifier.
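A minimal sketch tying the two views together on the toy data above (column order [core, spurious, noise...] assumed from the previous sketch): fit nearly-unregularized, group-reweighted logistic regression, whose solution approaches the max-margin classifier when trained to convergence in the overparameterized regime; dropping the noise coordinates mimics an underparameterized model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def per_group_errors(y, y_hat, a):
    errs = {}
    for yy in (-1, 1):
        for aa in (-1, 1):
            mask = (y == yy) & (a == aa)
            if mask.any():
                errs[(yy, aa)] = float((y_hat[mask] != y[mask]).mean())
    return errs

def eval_linear_classifier(X_tr, y_tr, g_tr, X_te, y_te, a_te, use_noise=True):
    """Nearly-unregularized (large C) reweighted logistic regression.
    use_noise=False keeps only [core, spurious] to mimic underparameterization."""
    cols = slice(None) if use_noise else slice(0, 2)
    vals, counts = np.unique(g_tr, return_counts=True)
    inv_freq = dict(zip(vals, len(g_tr) / counts))
    sample_weight = np.array([inv_freq[g] for g in g_tr])
    clf = LogisticRegression(C=1e8, max_iter=20000)
    clf.fit(X_tr[:, cols], y_tr, sample_weight=sample_weight)
    errs = per_group_errors(y_te, clf.predict(X_te[:, cols]), a_te)
    return errs, max(errs.values())
```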

  20. Worst-group error is provably higher in the overparameterized regime. Theorem (informal): for any sufficiently high SCR and sufficiently high majority fraction, with high probability the overparameterized (max-margin) classifier has high worst-group error; in contrast, in the asymptotic underparameterized regime, the worst-group error is low.

  21. Underparameterized models need to learn the core feature to achieve low reweighted loss: learning the core feature → low reweighted loss ✓; learning the spurious feature → high reweighted loss ✕.

  22. In the overparameterized regime, the minimum-norm inductive bias favors less memorization. The norm scales with the number of points "memorized": learning the core feature means memorizing the outliers (many examples memorized → high norm ✕), while learning the spurious feature means memorizing only the minority groups (few examples memorized → low norm ✓).

  23. Intuition: memorize as few examples as possible under the min-norm inductive bias. [The following slides show the model's train error by group, for labels y ∈ {1, −1} and attributes a ∈ {1, −1}.]

  24. Learn spurious → memorize minority, low norm. Train error by group for a model that uses the spurious feature:
                  y = 1   y = −1
        a = 1       0       1
        a = −1      1       0

  25. Learn spurious → memorize minority, low norm. The only points to memorize are those in the minority groups (a ≠ y), where the train error is 1.

  26. Learn spurious → memorize minority, low norm. After memorizing the minority points, the train error is 0 in every group, and only a few points were memorized → low norm ✓.

  27. Learn core → memorize more, high norm. Train error by group for a model that uses the (noisier) core feature:
                  y = 1   y = −1
        a = 1      >0      >0
        a = −1     >0      >0

  28. Learn core → memorize more, high norm. The points to memorize now lie in all four groups.

  29. Learn core → memorize more, high norm. After memorizing them, the train error is 0 in every group, but many points were memorized → high norm ✕.
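A small numerical illustration (a generic least-squares sketch, not the paper's construction) of the claim that the norm scales with the number of points memorized: the minimum-norm interpolator of k ±1 residuals over high-dimensional noise coordinates has squared norm growing roughly linearly in k.

```python
import numpy as np

rng = np.random.default_rng(0)
noise_dim = 5000  # many more noise coordinates than points to be memorized

for k in (10, 50, 250):
    Z = rng.normal(size=(k, noise_dim)) / np.sqrt(noise_dim)  # noise features
    r = rng.choice([-1.0, 1.0], size=k)     # residuals the model must "memorize"
    w = np.linalg.pinv(Z) @ r               # minimum-norm interpolating weights
    print(f"{k:4d} points memorized -> squared norm ~ {w @ w:.1f}")
```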

  30. Overview 1. Empirical results 2. Simulations on synthetic data 3. Subsampling

  31. Reweighting vs. subsampling. [Figure: number of examples per group (y, a) under upweighting vs. subsampling.] Subsampling reduces the majority fraction and lowers the memorization cost of learning the core feature. Chawla et al. (2011)

  32. Reweighting vs. subsampling. [Figure: number of examples per group (y, a) under upweighting vs. subsampling.] Chawla et al. (2011)
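A minimal sketch of the subsampling baseline contrasted with reweighting in slides 31–32: every group is drawn down to the size of the smallest group, which shrinks the majority fraction without changing example weights. Function and variable names are placeholders.

```python
import numpy as np

def subsample_to_smallest_group(X, y, groups, seed=0):
    """Keep an equal number of examples from every (y, a) group,
    matching the size of the smallest group."""
    rng = np.random.default_rng(seed)
    values, counts = np.unique(groups, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(groups == g), size=n_min, replace=False)
        for g in values
    ])
    rng.shuffle(keep)
    return X[keep], y[keep], groups[keep]
```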

  33. Subsampling the majority group → overparameterization helps worst-group error. [Plots: worst-group error vs. model size under upweighting and under subsampling.] There is a potential tension between using all of the data and using large overparameterized models: both help average error, but you cannot have both and still get good worst-group error.

  34. Thanks! Shiori Sagawa*, Aditi Raghunathan*, Pang Wei Koh*, Percy Liang. Thank you to Yair Carmon, John Duchi, Tatsunori Hashimoto, Ananya Kumar, Yiping Lu, Tengyu Ma, and Jacob Steinhardt. Funded by an Open Philanthropy Project Award, the Stanford Graduate Fellowship, the Google PhD Fellowship, the Open Philanthropy Project AI Fellowship, and the Facebook Fellowship Program.
