Theme: Maximum Likelihood Estimation
Projected Gradient Descent on the Negative Log-Likelihood (NLL)
• Step 3: SGD recovers the true parameters!
• Ingredients:
  • Convexity always holds (not necessarily strong)
  • Guaranteed constant probability α of a sample falling into the survival set S
  • Efficient projection algorithm into the set of valid parameters (defined by α)
  • Strong convexity within the projection set: H ⪰ C · α⁴ · λ_m(T⁻¹) · I
  • Good initialization point (i.e., one that assigns constant mass to S)
• Result: Efficient algorithm for recovering parameters from truncated data!
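The ingredients above combine into a projected SGD loop. Below is a minimal, generic sketch of that loop; `sample_gradient` (an unbiased stochastic gradient of the NLL) and `project` (projection onto the valid parameter set defined by α) are hypothetical placeholders standing in for the paper's specific routines:

```python
import numpy as np

def projected_sgd(theta0, sample_gradient, project, steps=10_000, lr=0.1):
    """Projected SGD on the NLL: stochastic gradient step, then project
    back into the parameter set where strong convexity holds."""
    theta = project(theta0)  # good initialization: start inside the valid set
    for t in range(1, steps + 1):
        g = sample_gradient(theta)             # unbiased stochastic gradient of the NLL
        theta = theta - (lr / np.sqrt(t)) * g  # decaying step size
        theta = project(theta)                 # efficient projection keeps θ valid
    return theta
```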
Truncation bias in regression
• Goal: infer the effect of height x_i on basketball ability y_i
• Strategy: linear regression
• What we expect: [figure: a regression line fit through the full population of (height, ability) points]

Bias from truncation: an illustration
• What we get: [figure: height determines a latent ability z up to noise ε; if z is good enough for the NBA, we observe y_i, otherwise the player is unobserved]
• Truncation: we only observe data based on the value of y_i
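A quick simulation shows the resulting bias. This is an illustrative sketch; the slope, noise scale, and NBA threshold are made up for the example:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
height = rng.normal(0.0, 1.0, n)                   # standardized height x_i
ability = 1.0 * height + rng.normal(0.0, 1.0, n)   # true slope = 1.0, noise ε

# Truncation: we only observe players whose ability clears the NBA bar.
keep = ability > 1.5
x_obs, y_obs = height[keep], ability[keep]

full_slope = np.polyfit(height, ability, 1)[0]     # ≈ 1.0 on the full population
trunc_slope = np.polyfit(x_obs, y_obs, 1)[0]       # noticeably smaller on truncated data
print(f"full-population slope: {full_slope:.2f}, truncated slope: {trunc_slope:.2f}")
```

The truncated fit understates the effect of height: conditioning on clearing the bar makes tall players look no better than the shorter players who barely made it.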
Truncation in practice
Not a hypothetical problem (or a new one!)
• Fig 1 [Hausman and Wise 1977]: corrected previous findings about education (x) vs. income (y) that were affected by truncation on income (y)
• Table 1 [Lin et al 1999]: found bias in income (x) vs. child support (y) because the response rate differs based on y
• Has inspired lots of prior work in statistics/econometrics [Galton 1897; Pearson 1902; Lee 1914; Fisher 1931; Hotelling 1948; Tukey 1949; Tobin 1958; Amemiya 1973; Breen 1996; Balakrishnan, Cramer 2014]
• Our goal: a unified, efficient (polynomial in dimension) algorithm
Truncated regression and classification
The generative process, per sample:
1. Sample a covariate x ∼ D
2. Sample noise ε ∼ D_N and compute the latent z = h_θ*(x) + ε
3. With probability 1 − φ(z): throw away (x, z) and restart
4. With probability φ(z): project z to a label y := π(z) and add (x, y) to the training set, T ∪ {(x, y)}
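This process can be simulated directly. The sketch below assumes concrete illustrative choices (a linear h_θ, standard Gaussian noise, and a simple threshold survival rule φ) that are not fixed by the slides:

```python
import numpy as np

rng = np.random.default_rng(1)

def sample_truncated(theta_star, phi, pi, d=5, n=1000):
    """Draw n observed pairs (x, y) from the truncated generative process."""
    X, Y = [], []
    while len(Y) < n:
        x = rng.normal(size=d)              # 1. covariate x ~ D
        z = theta_star @ x + rng.normal()   # 2. latent z = h_θ*(x) + ε, ε ~ N(0,1)
        if rng.random() < phi(z):           # 4. survives truncation w.p. φ(z)
            X.append(x)
            Y.append(pi(z))                 #    observed label y = π(z)
        # 3. otherwise (x, z) is thrown away and we restart
    return np.array(X), np.array(Y)

theta_star = np.ones(5)
phi = lambda z: float(z > 0.0)   # example survival rule: keep only z > 0
pi = lambda z: z                 # regression case: the label is the latent itself
X, y = sample_truncated(theta_star, phi, pi)
```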
Parameter estimation
• We have a model y_i ∼ π(h_θ*(x_i) + ε) where ε ∼ D_N, and want an estimate θ̂ of θ*
• Standard (non-truncated) approach: maximize the likelihood
  p(θ; x, y) = ∫_{z ∈ π⁻¹(y)} D_N(z − h_θ(x)) dz
  where π⁻¹(y) is the set of all possible latent variables corresponding to the label, and D_N(z − h_θ(x)) is the likelihood of the latent under the model
• Example: if h_θ is a linear function, then:
  • If ε ∼ 𝒩(0,1) and π(z) = z, the MLE is ordinary least squares regression
  • If ε ∼ 𝒩(0,1) and π(z) = 1_{z ≥ 0}, the MLE is probit regression
  • If ε ∼ Logistic(0,1) and π(z) = 1_{z ≥ 0}, the MLE is logistic regression
• What about the truncated case?
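To make the correspondence concrete, here are the per-example negative log-likelihoods these three cases reduce to, as a sketch assuming a linear model h_θ(x) = θ·x:

```python
import numpy as np
from scipy.stats import norm

def nll_ols(theta, x, y):
    # ε ~ N(0,1), π(z) = z: π⁻¹(y) = {y}, so the likelihood is the Gaussian
    # density at the residual, i.e., squared loss up to a constant
    return 0.5 * (y - theta @ x) ** 2 + 0.5 * np.log(2 * np.pi)

def nll_probit(theta, x, y):
    # ε ~ N(0,1), π(z) = 1[z ≥ 0]: integrating the Gaussian over π⁻¹(y)
    # gives P(z ≥ 0) = Φ(θ·x)
    p = norm.cdf(theta @ x)
    return -np.log(p if y == 1 else 1 - p)

def nll_logistic(theta, x, y):
    # ε ~ Logistic(0,1), π(z) = 1[z ≥ 0]: P(z ≥ 0) is the sigmoid of θ·x
    p = 1.0 / (1.0 + np.exp(-(theta @ x)))
    return -np.log(p if y == 1 else 1 - p)
```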
Parameter estimation from truncated data
Main idea: maximization of the truncated log-likelihood
• Truncated likelihood:
  p(θ; x, y) = [ ∫_{z ∈ π⁻¹(y)} D_N(z − h_θ(x)) φ(z) dz ] / [ ∫_z D_N(z − h_θ(x)) φ(z) dz ]
  (the numerator weights each latent by its survival probability; the denominator renormalizes over all latents that survive truncation)
• Again, we can compute a stochastic gradient of the log-likelihood with only oracle access to φ ⟹ leads to another SGD-based algorithm (see the sketch below)
• However: this time the loss can actually be non-convex
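As an illustration of why oracle access to φ suffices, here is a sketch for the regression case (π(z) = z, linear h_θ, 𝒩(0,1) noise). Under those assumptions the per-sample gradient of the truncated NLL works out to (z′ − y)·x in expectation, with z′ drawn from 𝒩(θ·x, 1) conditioned on surviving truncation, and a single rejection-sampled z′ gives an unbiased estimate. This is a worked example under those assumptions, not the paper's exact algorithm:

```python
import numpy as np

rng = np.random.default_rng(2)

def stochastic_gradient(theta, x, y, phi):
    """One-sample estimate of the truncated-NLL gradient: (z' - y) * x,
    where z' ~ N(θ·x, 1) conditioned on surviving truncation."""
    while True:                        # rejection sampling needs only oracle access to φ
        z = theta @ x + rng.normal()
        if rng.random() < phi(z):
            return (z - y) * x

def truncated_sgd(X, y, phi, steps=5000, lr=0.01):
    theta = np.zeros(X.shape[1])
    for _ in range(steps):
        i = rng.integers(len(y))       # random observed training example
        theta -= lr * stochastic_gradient(theta, X[i], y[i], phi)
    return theta
```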