
Recent work in Truncated Statistics
Andrew Ilyas

Motivation: Poincaré and the Baker
• Claimed weight: 1 kg/loaf


  3. Theme: Maximum Likelihood Estimation
  Projected Gradient Descent on the Negative Log-Likelihood (NLL)
  • Step 3: SGD recovers the true parameters!
  • Ingredients:
    • Convexity always holds (not necessarily strong)
    • Guaranteed constant probability α of a sample falling into S
    • Efficient projection algorithm into the set of valid parameters (defined by α)
    • Strong convexity within the projection set: H ⪰ C · α⁴ · λ_m(T⁻¹) · I
    • Good initialization point (i.e., one that assigns constant mass to S)
  • Result: Efficient algorithm for recovering parameters from truncated data!
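For intuition, here is a minimal sketch of this recipe in the simplest setting: estimating the mean of a one-dimensional, unit-variance Gaussian from samples truncated to a set S known only through a membership oracle. The choice of S, the step sizes, and the clip that stands in for the projection step are illustrative assumptions, not the actual algorithm from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: estimate the mean of a unit-variance Gaussian from
# samples truncated to S = [0, inf), with only membership-oracle access to S.
def in_S(z):
    return z >= 0.0

# Generate truncated data from an (unknown) true mean mu_star = 1.0.
mu_star = 1.0
samples = rng.normal(mu_star, 1.0, size=20000)
data = samples[in_S(samples)]          # rejection sampling = truncation

def stochastic_grad(mu, x):
    """Stochastic gradient of the truncated NLL at mu for one sample x.

    grad = E_{z ~ N(mu,1) | S}[z - mu] - (x - mu), estimated with a single
    rejection sample z from N(mu, 1) restricted to S.
    """
    while True:                        # rejection sampling via the S-oracle
        z = rng.normal(mu, 1.0)
        if in_S(z):
            break
    return (z - mu) - (x - mu)

# Projected SGD: here the "projection" is just a clip to a large box,
# standing in for the projection onto parameters that assign at least
# alpha mass to S in the real algorithm.
mu = np.mean(data)                     # initialization that puts mass on S
eta = 0.5
for t, x in enumerate(data, start=1):
    mu -= (eta / np.sqrt(t)) * stochastic_grad(mu, x)
    mu = np.clip(mu, -10.0, 10.0)

print(f"naive mean of truncated data: {np.mean(data):.3f}")
print(f"SGD estimate of true mean:    {mu:.3f}  (true = {mu_star})")
```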

  8. Truncation bias in regression
  • Goal: infer the effect of height xᵢ on basketball ability yᵢ
  • Strategy: linear regression
  • What we expect: [figure]
  18. Bias from truncation: an illustration
  • Goal: infer the effect of height xᵢ on basketball ability yᵢ
  • Strategy: linear regression
  • What we get: [figure: ability (z) vs. height, with noise ε; "good enough for NBA?" Yes ⟹ observe yᵢ, No ⟹ player unobserved]
  • Truncation: we only observe data based on the value of yᵢ
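A quick simulation of the illustration above, with made-up numbers: ability is linear in height plus noise, but we only observe players whose ability clears an "NBA threshold", and ordinary least squares on the observed players underestimates the true slope.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative numbers only: true slope 1.0, truncation threshold 1.0.
n, true_slope, threshold = 100000, 1.0, 1.0

height = rng.normal(size=n)
ability = true_slope * height + rng.normal(size=n)   # z = h(x) + eps
observed = ability > threshold                       # truncation on y

# OLS fit (intercept + slope) using only the observed players.
X = np.column_stack([np.ones(observed.sum()), height[observed]])
slope_hat = np.linalg.lstsq(X, ability[observed], rcond=None)[0][1]
print(f"true slope: {true_slope}, OLS slope on truncated data: {slope_hat:.2f}")
```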

  24. Truncation in practice
  Not a hypothetical problem (or a new one!)
  • Fig 1 [Hausman and Wise 1977]: corrected previous findings about education (x) vs. income (y) that were affected by truncation on income (y)
  • Table 1 [Lin et al. 1999]: found bias in income (x) vs. child support (y) because the response rate differs based on y
  • Has inspired lots of prior work in statistics/econometrics [Galton 1897; Pearson 1902; Lee 1914; Fisher 1931; Hotelling 1948; Tukey 1949; Tobin 1958; Amemiya 1973; Breen 1996; Balakrishnan, Cramer 2014]
  • Our goal: a unified, efficient (polynomial in dimension) algorithm

  35. Truncated regression and classification
  • Sample a covariate x ∼ D
  • Sample noise ε ∼ D_N and compute the latent z = h_θ*(x) + ε
  • With probability φ(z): project z to a label y := π(z) and add (x, y) to the training set, T ∪ {(x, y)}
  • With probability 1 − φ(z): throw away (x, z) and restart
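Read as pseudocode, the generative process above might look like the following sketch; the linear h, Gaussian noise, identity π (regression), and thresholding φ are illustrative choices rather than part of the general model.

```python
import numpy as np

rng = np.random.default_rng(0)

theta_star = np.array([2.0, -1.0])     # unknown true parameters

def h(theta, x):
    return theta @ x                   # h_theta(x): linear model (example)

def pi(z):
    return z                           # regression: label = latent
                                       # (classification would use z >= 0)

def phi(z):
    return float(z > 0)                # example truncation function

def sample_truncated(n):
    T = []
    while len(T) < n:
        x = rng.normal(size=2)         # x ~ D
        eps = rng.normal()             # eps ~ D_N
        z = h(theta_star, x) + eps     # latent z
        if rng.random() < phi(z):      # keep with probability phi(z) ...
            T.append((x, pi(z)))       # ... project to a label and record
        # otherwise: throw away (x, z) and restart
    return T

training_set = sample_truncated(1000)
```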

  46. Parameter estimation
  • We have a model yᵢ ∼ π(h_θ*(xᵢ) + ε) where ε ∼ D_N, and want an estimate θ̂ of θ*
  • Standard (non-truncated) approach: maximize the likelihood
      p(θ; x, y) = ∫_{z ∈ π⁻¹(y)} D_N(z − h_θ(x)) dz
    where π⁻¹(y) is the set of all possible latent variables corresponding to the label, and D_N(z − h_θ(x)) is the likelihood of the latent under the model
  • Example: if h_θ is a linear function, then:
    • If ε ∼ 𝒩(0,1) and π(z) = z, the MLE is ordinary least squares regression
    • If ε ∼ 𝒩(0,1) and π(z) = 1{z ≥ 0}, the MLE is probit regression
    • If ε ∼ Logistic(0,1) and π(z) = 1{z ≥ 0}, the MLE is logistic regression
  • What about the truncated case?
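For concreteness, a short derivation of how the three special cases above fall out of the same likelihood when h_θ(x) = θᵀx: with the identity π the integral degenerates to evaluating the noise density at the residual, and with the threshold π it becomes a tail probability (Φ is the standard normal CDF, σ the logistic sigmoid; the y = 1 case is shown for the classifiers).

```latex
\begin{align*}
\pi(z) = z,\ \varepsilon \sim \mathcal{N}(0,1):\quad
  & p(\theta; x, y) = \mathcal{N}(y - \theta^\top x)
    \;\Rightarrow\; -\log p \propto (y - \theta^\top x)^2
    && \text{(least squares)} \\
\pi(z) = \mathbf{1}_{z \ge 0},\ \varepsilon \sim \mathcal{N}(0,1):\quad
  & p(\theta; x, y{=}1) = \int_0^\infty \mathcal{N}(z - \theta^\top x)\, dz
    = \Phi(\theta^\top x)
    && \text{(probit)} \\
\pi(z) = \mathbf{1}_{z \ge 0},\ \varepsilon \sim \mathrm{Logistic}(0,1):\quad
  & p(\theta; x, y{=}1) = \sigma(\theta^\top x)
    && \text{(logistic regression)}
\end{align*}
```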

  56. Parameter estimation from truncated data
  Main idea: maximization of the truncated log-likelihood
  • Truncated likelihood:
      p(θ; x, y) = [ ∫_{z ∈ π⁻¹(y)} D_N(z − h_θ(x)) φ(z) dz ] / [ ∫_z D_N(z − h_θ(x)) φ(z) dz ]
  • Again, we can compute a stochastic gradient of the log-likelihood with only oracle access to φ ⟹ leads to another SGD-based algorithm
  • However: this time the loss can actually be non-convex
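To make the oracle-based stochastic gradient concrete, here is a hedged sketch for the simplest instantiation, truncated linear regression (π(z) = z, unit-variance Gaussian noise): under these assumptions the per-sample gradient works out to (z̃ − y)·x, where z̃ is drawn from the model at the current θ and accepted with probability φ(z̃). The specific φ, data, initialization, and step sizes below are illustrative, not the slides' exact algorithm.

```python
import numpy as np

rng = np.random.default_rng(1)

d = 3
theta_star = rng.normal(size=d)        # unknown true parameters

def phi(z):
    return float(z > 0)                 # example truncation oracle

def sample_truncated(n):
    """Generate a truncated training set with the process from the slides."""
    X, Y = [], []
    while len(Y) < n:
        x = rng.normal(size=d)
        z = theta_star @ x + rng.normal()
        if rng.random() < phi(z):
            X.append(x); Y.append(z)
    return np.array(X), np.array(Y)

X, Y = sample_truncated(5000)

def stochastic_grad(theta, x, y):
    """One-sample stochastic gradient of the truncated NLL.

    grad = (z_tilde - y) * x, where z_tilde ~ N(theta^T x, 1) is accepted
    with probability phi(z_tilde): rejection sampling through the oracle.
    """
    while True:
        z = theta @ x + rng.normal()
        if rng.random() < phi(z):
            return (z - y) * x

theta = np.linalg.lstsq(X, Y, rcond=None)[0]   # OLS init (biased by truncation)
for t in range(1, 50001):
    i = rng.integers(len(Y))
    theta -= (0.1 / np.sqrt(t)) * stochastic_grad(theta, X[i], Y[i])

print("OLS (biased):      ", np.round(np.linalg.lstsq(X, Y, rcond=None)[0], 2))
print("truncated-MLE SGD: ", np.round(theta, 2))
print("true theta:        ", np.round(theta_star, 2))
```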

  58. Parameter estimation from truncated data
  • However: this time the loss can actually be non-convex
