STAT 339 Approximate Inference I 15 March 2017 Colin Reimer Dawson
Outline
▸ Approximation Methods
▸ Motivating Model: Logistic Regression
▸ Newton-Raphson Method
Approximation Methods
▸ Thus far we have done a lot of calculus and probability math to find exact optima and exact posterior/predictive distributions for simple models.
▸ We relied heavily on some strong assumptions (e.g., i.i.d. Normal errors, conjugate priors, some parameters held fixed).
▸ In general, the “nice” properties that made exact solutions possible will not be present.
▸ Hence we need to rely on approximations to our optima, distributions, etc.
Two Classes of Approximation
We can either
1. Solve for an approximate solution exactly
   ▸ Settling for local optima
   ▸ Making the “least bad” simplifying assumptions to make analytic solutions possible
2. Solve for an exact solution approximately
   ▸ Numerical/stochastic integration
   ▸ Stochastic search
Logistic Regression
▸ Consider binary classification ($t \in \{0, 1\}$) where we want to model $P(t = 1)$ as an explicit function of the feature vector $x$.
▸ Linear regression? $\hat{P}(t_n = 1 \mid x_n) = x_n w$
▸ This can work if we only care about whether $\hat{P}(t = 1) > 0.5$, but a linear model can return invalid probabilities.
▸ Not great if we want to quantify uncertainty.
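To illustrate the "invalid probabilities" point, here is a minimal sketch that fits an ordinary least-squares line to binary labels. The toy data and the helper name `fit_line` are assumptions made for illustration; they do not come from the slides.

```python
def fit_line(xs, ts):
    """Ordinary least squares for t ~ w0 + w1 * x with a single feature."""
    n = len(xs)
    x_bar = sum(xs) / n
    t_bar = sum(ts) / n
    s_xy = sum((x - x_bar) * (t - t_bar) for x, t in zip(xs, ts))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    w1 = s_xy / s_xx
    w0 = t_bar - w1 * x_bar
    return w0, w1

# Hypothetical toy data: one feature, binary labels.
xs = [0, 1, 2, 3, 4, 5]
ts = [0, 0, 1, 0, 1, 1]
w0, w1 = fit_line(xs, ts)

# Evaluate the fitted "probability" inside and just outside the observed range.
for x_new in (-1, 2.5, 6):
    p_hat = w0 + w1 * x_new
    print(x_new, round(p_hat, 3), "valid" if 0.0 <= p_hat <= 1.0 else "invalid")
```

On this data the fitted line drops below 0 at x = -1 and rises above 1 at x = 6, so it cannot be read as a probability there.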
Modeling a Transformed Probability
▸ Idea: keep the linear dependence, but instead of modeling $P(t = 1 \mid x)$ directly, model a nonlinear function of $P$ that is not bounded to $[0, 1]$.
▸ The odds: $\mathrm{Odds}(t = 1) := \frac{P(t = 1)}{1 - P(t = 1)} \in [0, \infty)$
▸ The log odds or logit: $\mathrm{Logit}(t = 1) := \log\left(\frac{P(t = 1)}{1 - P(t = 1)}\right) \in (-\infty, \infty)$
▸ Nice property: equal probabilities correspond to $\mathrm{Logit} = 0$.
Logit Transformation
[Figure: logit(p) plotted against p for p between 0 and 1; vertical axis from −6 to 6.]
Logistic Transformation
▸ $\eta = \mathrm{Logit}(p) = \log\left(\frac{p}{1 - p}\right)$
▸ Inverse is the logistic function: $p = \mathrm{Logistic}(\eta) = \mathrm{Logit}^{-1}(\eta) = \frac{\exp\{\eta\}}{1 + \exp\{\eta\}}$
[Figure: logistic(η) plotted against η from −5 to 5; vertical axis from 0 to 1.]
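The two transformations can be sketched in a few lines of Python. The function names are illustrative, and the sign branch in `logistic` is an implementation choice (to avoid overflow in `exp`), not something taken from the slides.

```python
import math

def logit(p):
    """Log odds of a probability p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

def logistic(eta):
    """Inverse of logit: maps any real eta to a probability."""
    # Branch on the sign so exp() never receives a large positive argument.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

# Round trip: logistic undoes logit (up to floating-point error).
for p in (0.1, 0.5, 0.9):
    print(p, logit(p), logistic(logit(p)))
```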
A Linear Model of the Logit
▸ Having defined $\eta$ with an unrestricted range, we can now model $\eta_n = x_n w$
▸ Or, equivalently, $P(t_n = 1 \mid x_n) = \frac{\exp\{x_n w\}}{1 + \exp\{x_n w\}}$
▸ With an independence assumption, this yields the likelihood function $L(w) = P(t \mid X) = \prod_{n=1}^{N} \left(\frac{e^{x_n w}}{1 + e^{x_n w}}\right)^{t_n} \left(\frac{1}{1 + e^{x_n w}}\right)^{1 - t_n}$
MLE for w
▸ The likelihood for $w$ is $L(w) = \prod_{n=1}^{N} \left(\frac{e^{x_n w}}{1 + e^{x_n w}}\right)^{t_n} \left(\frac{1}{1 + e^{x_n w}}\right)^{1 - t_n}$
▸ The log likelihood is $\log L(w; X, t) = \sum_{n=1}^{N} \left(t_n x_n w - \log(1 + e^{x_n w})\right)$
▸ The $d$th coordinate of the gradient is $\frac{\partial \log L}{\partial w_d} = \sum_{n=1}^{N} \left(t_n x_{nd} - x_{nd} \frac{e^{x_n w}}{1 + e^{x_n w}}\right)$
▸ Good luck solving for $w$ analytically...
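The log likelihood and its gradient translate fairly directly into code. The following is a minimal pure-Python sketch. The function names, the toy data, and the numerically stable form of log(1 + e^η) are assumptions made for illustration.

```python
import math

def sigmoid(eta):
    """Logistic function e^eta / (1 + e^eta), written to avoid overflow."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def log_likelihood(w, X, t):
    """Sum over n of t_n * (x_n . w) - log(1 + exp(x_n . w))."""
    total = 0.0
    for x_n, t_n in zip(X, t):
        eta = sum(x * w_d for x, w_d in zip(x_n, w))
        # log(1 + e^eta) rewritten as max(eta, 0) + log(1 + e^{-|eta|})
        total += t_n * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return total

def gradient(w, X, t):
    """d-th coordinate: sum over n of x_nd * (t_n - p_n)."""
    grad = [0.0] * len(w)
    for x_n, t_n in zip(X, t):
        p_n = sigmoid(sum(x * w_d for x, w_d in zip(x_n, w)))
        for d, x_nd in enumerate(x_n):
            grad[d] += x_nd * (t_n - p_n)
    return grad

# Hypothetical toy data: a constant column for the intercept plus one feature.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
t = [0, 0, 1, 0, 1, 1]
print(log_likelihood([0.0, 0.0], X, t), gradient([0.0, 0.0], X, t))
```

The gradient here uses the algebraically equivalent form x_nd (t_n − p_n), where p_n is the logistic function of x_n w.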
Gradient Ascent/Descent
Iterative Optimization
▸ We want to find a peak of the log likelihood iteratively: make a guess, improve near the guess, rinse and repeat until you can’t improve further.
▸ Many algorithms exist to do this kind of thing.
▸ One good one when we have a gradient is Newton-Raphson (an old, old method originally used to find roots of polynomials).
Newton-Raphson Optimization
▸ Setting: have a function $f(w)$; want to find $\hat{w}$ s.t. $f(\hat{w}) = 0$.
▸ Algorithm:
  1. Pick an initial guess: $\hat{w}^{(0)}$.
  2. For $n = 0, 1, \ldots$ while $|f(\hat{w}^{(n)})| > \varepsilon$:
     a. Approximate $f$ around $\hat{w}^{(n)}$ with a line, $\tilde{f}^{(n)}(w)$.
     b. Find $\hat{w}^{(n+1)}$ so that $\tilde{f}^{(n)}(\hat{w}^{(n+1)}) = 0$.
▸ How to do 2a and 2b?
  2. a. Use the tangent line: i.e., $\tilde{f}^{(n)}(w) = f(\hat{w}^{(n)}) + f'(\hat{w}^{(n)})(w - \hat{w}^{(n)})$
     b. Set this to zero and solve to find $\hat{w}^{(n+1)}$: $\hat{w}^{(n+1)} = \hat{w}^{(n)} - \frac{f(\hat{w}^{(n)})}{f'(\hat{w}^{(n)})}$
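The steps above can be sketched as a short root finder. The name `newton_root`, the default tolerance, and the iteration cap are assumptions for illustration. The example finds a root of the polynomial w² − 2.

```python
def newton_root(f, f_prime, w0, eps=1e-10, max_iter=50):
    """Find w with |f(w)| <= eps by repeatedly following the tangent line to zero."""
    w = w0
    for _ in range(max_iter):
        value = f(w)
        if abs(value) <= eps:
            return w
        # Zero of the tangent line at the current guess (steps 2a and 2b).
        w = w - value / f_prime(w)
    raise RuntimeError("Newton-Raphson did not converge")

# Example: a root of f(w) = w^2 - 2, starting from w = 1.
root = newton_root(lambda w: w * w - 2.0, lambda w: 2.0 * w, w0=1.0)
print(root)
```

The iteration cap guards against the divergent case: with a poor starting point, or a derivative near zero, the update can jump far from the root.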
OK, but isn’t that just for zero-finding?
▸ Yes, but the stumbling block in our problem (maximum likelihood) was that we couldn’t solve for the point where the gradient is zero!
▸ When optimizing, we want to find zeroes of $f'(w)$. So our update step is $\hat{w}^{(n+1)} = \hat{w}^{(n)} - \frac{f'(\hat{w}^{(n)})}{f''(\hat{w}^{(n)})}$
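As a one-dimensional sketch of this update, consider a logistic model with no features: k successes out of N binary outcomes and a single log-odds parameter w. Its log likelihood is k w − N log(1 + e^w). This example and all names in it are assumptions for illustration. It is chosen because the maximizer is known in closed form, logit(k/N), so the Newton result can be checked.

```python
import math

def sigmoid(eta):
    """Logistic function, written to avoid overflow in exp()."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def newton_maximize(f_prime, f_double_prime, w0, eps=1e-10, max_iter=50):
    """Find a zero of f' with the update w <- w - f'(w) / f''(w)."""
    w = w0
    for _ in range(max_iter):
        slope = f_prime(w)
        if abs(slope) <= eps:
            return w
        w = w - slope / f_double_prime(w)
    raise RuntimeError("Newton-Raphson did not converge")

# Hypothetical data: k successes out of N trials.
k, N = 3, 10

def f_prime(w):
    return k - N * sigmoid(w)                        # derivative of the log likelihood

def f_double_prime(w):
    return -N * sigmoid(w) * (1.0 - sigmoid(w))      # always negative: concave

w_hat = newton_maximize(f_prime, f_double_prime, w0=0.0)
print(w_hat, math.log((k / N) / (1.0 - k / N)))      # Newton result vs. closed form
```

The update only locates a stationary point. Here it is a maximum because the second derivative is negative everywhere.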
Why/when does this work?
Intermediate value theorem: If $f : [a, b] \to \mathbb{R}$ is continuous, $u$ is real and $f(a) > u > f(b)$, then there is some $c \in (a, b)$ so that $f(c) = u$.
But we need to find a reasonable initialization, or the algorithm could diverge. Also, it only finds a local optimum.
Multivariate functions
▸ Recall our log likelihood for logistic regression: $\log L(w; X, t) = \sum_{n=1}^{N} \left(t_n x_n w - \log(1 + e^{x_n w})\right)$
▸ The $d$th coordinate of the gradient is $\frac{\partial \log L}{\partial w_d} = \sum_{n=1}^{N} \left(t_n x_{nd} - x_{nd} \frac{e^{x_n w}}{1 + e^{x_n w}}\right)$
▸ This is a function with a vector input.
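Multivariate Derivatives
▸ The analog of the first derivative is the gradient vector, $\nabla f(w) = \left(\frac{\partial f(w)}{\partial w_1}, \ldots, \frac{\partial f(w)}{\partial w_D}\right)^T$
▸ The analog of the second derivative is the matrix of second partial derivatives, which is called the Hessian matrix.
$$H_f(w) = \begin{pmatrix} \frac{\partial^2 f(w)}{\partial w_1^2} & \frac{\partial^2 f(w)}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 f(w)}{\partial w_1 \partial w_D} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(w)}{\partial w_D \partial w_1} & \frac{\partial^2 f(w)}{\partial w_D \partial w_2} & \cdots & \frac{\partial^2 f(w)}{\partial w_D^2} \end{pmatrix}$$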
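A small worked example may make the two definitions concrete. The quadratic below is chosen purely for illustration and is not from the slides.

```latex
% Illustrative function of two variables (an assumption, not from the slides)
f(w) = w_1^2 + 3 w_1 w_2 + 2 w_2^2

% Gradient: one first partial derivative per coordinate
\nabla f(w) = \begin{pmatrix} 2 w_1 + 3 w_2 \\ 3 w_1 + 4 w_2 \end{pmatrix}

% Hessian: all second partial derivatives; symmetric because the mixed partials agree
H_f(w) = \begin{pmatrix} 2 & 3 \\ 3 & 4 \end{pmatrix}
```

For a quadratic the Hessian is constant. For the logistic log likelihood it depends on w.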
Multivariate Update Equation
▸ The update equation for a function of one variable is $\hat{w}^{(n+1)} = \hat{w}^{(n)} - \frac{f'(\hat{w}^{(n)})}{f''(\hat{w}^{(n)})}$
▸ For more than one variable, this becomes $\hat{w}^{(n+1)} = \hat{w}^{(n)} - H_f^{-1}(\hat{w}^{(n)}) \, \nabla f(\hat{w}^{(n)})$
Example: MLE for Logistic Regression
▸ Recall our log likelihood for logistic regression: $\log L(w; X, t) = \sum_{n=1}^{N} \left(t_n x_n w - \log(1 + e^{x_n w})\right)$
▸ The $d$th coordinate of the gradient is $\frac{\partial \log L}{\partial w_d} = \sum_{n=1}^{N} \left(t_n x_{nd} - x_{nd} \frac{e^{x_n w}}{1 + e^{x_n w}}\right)$
▸ The $d, d'$ entry in the Hessian is $\frac{\partial^2 \log L}{\partial w_d \, \partial w_{d'}} = -\sum_{n=1}^{N} \frac{e^{x_n w}}{(1 + e^{x_n w})^2} x_{nd} x_{nd'}$
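Putting the gradient, the Hessian, and the multivariate update together gives the following pure-Python sketch. The function names, the small Gaussian-elimination helper, the starting point w = 0, and the toy data are all assumptions for illustration. The data are deliberately not linearly separable, since for separable data the maximum likelihood estimate does not exist and the iterates would grow without bound.

```python
import math

def sigmoid(eta):
    """Logistic function, written to avoid overflow in exp()."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]   # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):                     # back substitution
        tail = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - tail) / M[i][i]
    return z

def grad_and_hessian(w, X, t):
    """Gradient and Hessian of the logistic-regression log likelihood at w."""
    D = len(w)
    grad = [0.0] * D
    hess = [[0.0] * D for _ in range(D)]
    for x_n, t_n in zip(X, t):
        p = sigmoid(sum(x * w_d for x, w_d in zip(x_n, w)))
        for d in range(D):
            grad[d] += x_n[d] * (t_n - p)
            for e in range(D):
                # p * (1 - p) equals e^{x_n w} / (1 + e^{x_n w})^2
                hess[d][e] -= p * (1.0 - p) * x_n[d] * x_n[e]
    return grad, hess

def newton_logistic(X, t, eps=1e-8, max_iter=25):
    """Iterate w <- w - H^{-1} grad until every gradient coordinate is within eps of 0."""
    w = [0.0] * len(X[0])
    for _ in range(max_iter):
        grad, hess = grad_and_hessian(w, X, t)
        if max(abs(g) for g in grad) <= eps:
            return w
        step = solve(hess, grad)                     # H^{-1} grad without forming H^{-1}
        w = [w_d - s for w_d, s in zip(w, step)]
    raise RuntimeError("Newton-Raphson did not converge")

# Hypothetical toy data: intercept column plus one feature, labels overlap.
X = [[1.0, 0.0], [1.0, 1.0], [1.0, 2.0], [1.0, 3.0], [1.0, 4.0], [1.0, 5.0]]
t = [0, 0, 1, 0, 1, 1]
w_hat = newton_logistic(X, t)
print(w_hat)
```

Solving the linear system H z = ∇ log L at each step, rather than inverting the Hessian explicitly, is a common design choice. For a handful of parameters either approach would do.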
Solution Path
Classification Result
Nonlinear Classification Result