STAT 339 Approximate Inference I 15 March 2017 Colin Reimer Dawson

Outline Approximation Methods Motivating Model: Logistic Regression Newton-Raphson Method

Approximation Methods ▸ Thus far we have done a lot of calculus and probability math to find exact optima/posterior/predictive distributions for simple models. ▸ We relied heavily on some strong assumptions (e.g., i.i.d. Normal errors, conjugate priors, some parameters fixed, etc.) ▸ In general, the “nice” properties that made exact solutions possible will not be present. ▸ Hence we need to rely on approximations to our optima/distributions/etc.

Two Classes of Approximation We can either 1. Solve for an approximate solution exactly ▸ Settling for local optima ▸ Making the “least bad” simplifying assumptions to make analytic solutions possible 2. Solve for an exact solution approximately ▸ Numerical/stochastic integration ▸ Stochastic search

Logistic Regression ▸ Consider binary classification ($t \in \{0, 1\}$) where we want to model $P(t = 1)$ as an explicit function of feature vector $x$. ▸ Linear regression? $\hat{P}(t_n = 1 \mid x_n) = x_n w$ ▸ This can work if we only care about whether $\hat{P}(t = 1) > 0.5$, but a linear model can return invalid probabilities. ▸ Not great if we want to quantify uncertainty.

Modeling a Transformed Probability ▸ Idea: keep the linear dependence idea, but instead of modeling $P(t = 1 \mid x)$ directly, model a nonlinear function of $P$ that is not bounded to $[0, 1]$. ▸ The odds: $\mathrm{Odds}(t = 1) := \frac{P(t = 1)}{1 - P(t = 1)} \in [0, \infty)$ ▸ The log odds or logit: $\mathrm{Logit}(t = 1) := \log\left(\frac{P(t = 1)}{1 - P(t = 1)}\right) \in (-\infty, \infty)$ ▸ Nice property: equal probabilities correspond to $\mathrm{Logit} = 0$.

Logit Transformation [Figure: logit(p) plotted against p for p between 0 and 1; the vertical axis runs from −6 to 6.]

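Logistic Transformation ▸ $\eta = \mathrm{Logit}(p) = \log\left(\frac{p}{1 - p}\right)$ ▸ Inverse is the logistic function: $p = \mathrm{Logistic}(\eta) = \mathrm{Logit}^{-1}(\eta) = \frac{\exp\{\eta\}}{1 + \exp\{\eta\}}$ [Figure: logistic(η) plotted against η for η between −5 and 5; the vertical axis runs from 0 to 1.]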
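
The logit and logistic maps can be checked numerically. The following is a minimal sketch using only the Python standard library; the function names `logit` and `logistic` are chosen here for illustration and are not from the slides. The sign branch in `logistic` is one common way to avoid overflow in `exp`.

```python
import math

def logit(p: float) -> float:
    """Log odds of a probability p in the open interval (0, 1)."""
    if not 0.0 < p < 1.0:
        raise ValueError("p must lie strictly between 0 and 1")
    return math.log(p / (1.0 - p))

def logistic(eta: float) -> float:
    """Inverse of logit: maps a real number eta to a probability."""
    # Branch on the sign so exp() never receives a large positive argument.
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

# Round trip: logistic(logit(p)) should recover p up to rounding error.
for p in (0.1, 0.5, 0.9):
    eta = logit(p)
    print(f"p={p:.1f}  logit={eta:+.4f}  logistic(logit)={logistic(eta):.4f}")
```

Equal probabilities (p = 0.5) give a logit of zero, matching the "nice property" noted earlier.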

A Linear Model of the Logit ▸ Having defined $\eta$ with an unrestricted range, we can now model $\eta_n = x_n w$ ▸ Or, equivalently, $P(t_n = 1 \mid x_n) = \frac{\exp\{x_n w\}}{1 + \exp\{x_n w\}}$ ▸ With an independence assumption, this yields a likelihood function $L(w) = P(t \mid X) = \prod_{n=1}^{N} \left(\frac{e^{x_n w}}{1 + e^{x_n w}}\right)^{t_n} \left(\frac{1}{1 + e^{x_n w}}\right)^{1 - t_n}$

MLE for w ▸ The likelihood for $w$ is $L(w) = \prod_{n=1}^{N} \left(\frac{e^{x_n w}}{1 + e^{x_n w}}\right)^{t_n} \left(\frac{1}{1 + e^{x_n w}}\right)^{1 - t_n}$ ▸ The log likelihood is $\log L(w; X, t) = \sum_{n=1}^{N} \left(t_n x_n w - \log(1 + e^{x_n w})\right)$ ▸ The $d$th coordinate of the gradient is $\frac{\partial \log L}{\partial w_d} = \sum_{n=1}^{N} \left(t_n x_{nd} - x_{nd} \frac{e^{x_n w}}{1 + e^{x_n w}}\right)$ ▸ Good luck solving for $w$ analytically...
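
Although the gradient equations have no closed-form solution, the log likelihood and its gradient are easy to evaluate at any given $w$. Below is a minimal sketch in plain Python. The toy data (`X` with a leading column of ones for an intercept, labels `t`) and all function names are illustrative assumptions, not taken from the slides.

```python
import math

def logistic(eta):
    """Numerically stable e^eta / (1 + e^eta)."""
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def dot(x_n, w):
    return sum(x * w_d for x, w_d in zip(x_n, w))

def log_likelihood(w, X, t):
    """Sum over n of t_n * x_n w - log(1 + exp(x_n w))."""
    total = 0.0
    for x_n, t_n in zip(X, t):
        eta = dot(x_n, w)
        # log(1 + e^eta) written so that exp() cannot overflow.
        total += t_n * eta - (max(eta, 0.0) + math.log1p(math.exp(-abs(eta))))
    return total

def gradient(w, X, t):
    """d-th coordinate: sum over n of (t_n - p_n) * x_nd, with p_n = logistic(x_n w)."""
    grad = [0.0] * len(w)
    for x_n, t_n in zip(X, t):
        p_n = logistic(dot(x_n, w))
        for d, x_nd in enumerate(x_n):
            grad[d] += (t_n - p_n) * x_nd
    return grad

# Hypothetical toy data: an intercept column plus one feature.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
t = [0, 0, 1, 0, 1]
print(log_likelihood([0.0, 0.0], X, t))
print(gradient([0.0, 0.0], X, t))
```

The gradient is written as $(t_n - p_n) x_{nd}$, which is the same expression as in the slide with $p_n = e^{x_n w} / (1 + e^{x_n w})$ factored out.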

Gradient Ascent/Descent

Iterative Optimization ▸ We want to try to find a peak of the log likelihood iteratively : make a guess, improve near the guess, rinse and repeat until you can’t improve further ▸ Many algorithms exist to do this kind of thing ▸ One good one when we have a gradient is Newton-Raphson (old, old method originally used to find roots of polynomials)

Newton-Raphson Optimization ▸ Setting: have a function $f(w)$; want to find $\hat{w}$ s.t. $f(\hat{w}) = 0$. ▸ Algorithm: 1. Pick an initial guess: $\hat{w}^{(0)}$. 2. For $n = 0, 1, \ldots$ while $|f(\hat{w}^{(n)})| > \varepsilon$: a. Approximate $f$ around $\hat{w}^{(n)}$ with a line, $\tilde{f}^{(n)}(w)$. b. Find $\hat{w}^{(n+1)}$ so that $\tilde{f}^{(n)}(\hat{w}^{(n+1)}) = 0$. ▸ How to do 2a and 2b? 2a. Use the tangent line: i.e., $\tilde{f}^{(n)}(w) = f(\hat{w}^{(n)}) + f'(\hat{w}^{(n)})(w - \hat{w}^{(n)})$. 2b. Set this to zero and solve to find $\hat{w}^{(n+1)}$: $\hat{w}^{(n+1)} = \hat{w}^{(n)} - \frac{f(\hat{w}^{(n)})}{f'(\hat{w}^{(n)})}$
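
The root-finding loop above can be sketched in a few lines of Python. The function name `newton_root`, the `max_iter` safeguard, and the test function $f(w) = w^2 - 2$ are assumptions added for illustration; the slides specify only the tolerance-based stopping rule.

```python
def newton_root(f, f_prime, w0, eps=1e-10, max_iter=50):
    """Find w with |f(w)| <= eps by repeatedly zeroing the tangent line."""
    w = w0
    for _ in range(max_iter):  # max_iter is an added safeguard, not in the slides
        if abs(f(w)) <= eps:
            break
        # Tangent line at w is f(w) + f'(w) * (v - w); its zero is the next guess.
        w = w - f(w) / f_prime(w)
    return w

# Illustrative use: the positive root of w^2 - 2, starting from w = 1.
root = newton_root(lambda w: w * w - 2.0, lambda w: 2.0 * w, w0=1.0)
print(root)
```

The update divides by $f'(\hat{w}^{(n)})$, so the sketch would fail at a point where the derivative is exactly zero.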

OK, but isn’t that just for zero-finding? ▸ Yes, but the stumbling block in our problem (maximum likelihood) was exactly that we could not solve for where the gradient is zero! ▸ When optimizing, we want to find zeroes of $f'(w)$. So our update step is $\hat{w}^{(n+1)} = \hat{w}^{(n)} - \frac{f'(\hat{w}^{(n)})}{f''(\hat{w}^{(n)})}$
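
As a one-variable illustration of this update, consider the logistic log likelihood with a single weight and $x_n = 1$ for every $n$ (an intercept-only model, assumed here purely as an example). With $k$ ones among $N$ labels, $f(w) = k w - N \log(1 + e^{w})$, and the maximizer is known to be $\mathrm{Logit}(k/N)$, which gives something to compare against. The names and the values `k = 3`, `N = 10` are hypothetical.

```python
import math

def newton_maximize(f_prime, f_double_prime, w0, eps=1e-10, max_iter=50):
    """Newton-Raphson applied to f': iterate w <- w - f'(w) / f''(w)."""
    w = w0
    for _ in range(max_iter):  # max_iter is an added safeguard
        slope = f_prime(w)
        if abs(slope) <= eps:
            break
        w = w - slope / f_double_prime(w)
    return w

k, N = 3, 10  # hypothetical counts: 3 ones among 10 binary labels

def p(w):
    return 1.0 / (1.0 + math.exp(-w))

def f_prime(w):
    return k - N * p(w)                   # derivative of k*w - N*log(1 + e^w)

def f_double_prime(w):
    return -N * p(w) * (1.0 - p(w))       # always negative, so f is concave

w_hat = newton_maximize(f_prime, f_double_prime, w0=0.0)
print(w_hat, math.log(k / (N - k)))       # Newton estimate vs. closed form
```

Because $f''$ is negative everywhere in this example, the zero of $f'$ that the iteration finds is a maximum rather than a minimum.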

Why/when does this work? Intermediate value theorem: if $f: [a, b] \to \mathbb{R}$ is continuous, $u$ is real, and $f(a) > u > f(b)$, then there is some $c \in (a, b)$ so that $f(c) = u$. In particular, taking $u = 0$, a continuous $f$ that changes sign on $[a, b]$ has a zero in $(a, b)$. But we need to find a reasonable initialization, or the algorithm could diverge. Also, it only finds a local optimum.
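
The sensitivity to initialization is easy to see on a small example. The sketch below runs plain Newton-Raphson root finding on $f(w) = \arctan(w)$, whose only zero is $w = 0$; the choice of function and the two starting points are illustrative assumptions, not from the slides.

```python
import math

def newton_path(f, f_prime, w0, n_steps):
    """Return the list of iterates w0, w1, ..., w_{n_steps}."""
    path = [w0]
    for _ in range(n_steps):
        w = path[-1]
        path.append(w - f(w) / f_prime(w))
    return path

f = math.atan

def f_prime(w):
    return 1.0 / (1.0 + w * w)

near = newton_path(f, f_prime, w0=0.5, n_steps=5)  # starts close to the root
far = newton_path(f, f_prime, w0=1.5, n_steps=5)   # starts too far away

print("start 0.5:", near)
print("start 1.5:", far)
```

From 0.5 the iterates shrink rapidly toward zero, while from 1.5 each tangent line overshoots the root by more than the previous one and the iterates grow in magnitude.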

Multivariate functions ▸ Recall our log likelihood for logistic regression: $\log L(w; X, t) = \sum_{n=1}^{N} \left(t_n x_n w - \log(1 + e^{x_n w})\right)$ ▸ The $d$th coordinate of the gradient is $\frac{\partial \log L}{\partial w_d} = \sum_{n=1}^{N} \left(t_n x_{nd} - x_{nd} \frac{e^{x_n w}}{1 + e^{x_n w}}\right)$ ▸ This is a function with a vector input.

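Multivariate Derivatives ▸ The analog of the first derivative is the gradient vector, $\nabla f(w) = \left(\frac{\partial f(w)}{\partial w_1}, \ldots, \frac{\partial f(w)}{\partial w_D}\right)^T$ ▸ The analog of the second derivative is the matrix of second partial derivatives, which is called the Hessian matrix. $H_f(w) = \begin{pmatrix} \frac{\partial^2 f(w)}{\partial w_1^2} & \frac{\partial^2 f(w)}{\partial w_1 \partial w_2} & \cdots & \frac{\partial^2 f(w)}{\partial w_1 \partial w_D} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f(w)}{\partial w_D \partial w_1} & \frac{\partial^2 f(w)}{\partial w_D \partial w_2} & \cdots & \frac{\partial^2 f(w)}{\partial w_D^2} \end{pmatrix}$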
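
As a small worked example of these definitions, take a two-variable quadratic. The function below is chosen only for illustration and does not appear in the slides.

```latex
% Illustrative function (an assumption, not from the slides):
f(w) = w_1^2 + 3 w_1 w_2 + 2 w_2^2

% Gradient: one first partial derivative per coordinate.
\nabla f(w) =
\begin{pmatrix}
  \partial f / \partial w_1 \\
  \partial f / \partial w_2
\end{pmatrix}
=
\begin{pmatrix}
  2 w_1 + 3 w_2 \\
  3 w_1 + 4 w_2
\end{pmatrix}

% Hessian: differentiate each gradient coordinate with respect to w_1 and w_2.
H_f(w) =
\begin{pmatrix}
  2 & 3 \\
  3 & 4
\end{pmatrix}
```

The Hessian here is constant because $f$ is quadratic, and it is symmetric because the order of the two partial derivatives does not matter for a smooth function.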

Multivariate Update Equation ▸ The update equation for a function of one variable is $\hat{w}^{(n+1)} = \hat{w}^{(n)} - \frac{f'(\hat{w}^{(n)})}{f''(\hat{w}^{(n)})}$ ▸ For more than one variable, this becomes $\hat{w}^{(n+1)} = \hat{w}^{(n)} - H_f^{-1}(\hat{w}^{(n)}) \nabla f(\hat{w}^{(n)})$
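
A single multivariate update can be sketched for two variables as follows. The concave quadratic used here, with its maximum at $(1, -2)$, and the helper names are illustrative assumptions. Rather than forming $H^{-1}$ explicitly, the sketch solves the $2 \times 2$ system $H \delta = \nabla f$ by Cramer's rule, which gives the same step.

```python
def grad_f(w):
    """Gradient of f(w) = -(w1-1)^2 - (w1-1)(w2+2) - (w2+2)^2."""
    a, b = w[0] - 1.0, w[1] + 2.0
    return [-2.0 * a - b, -a - 2.0 * b]

def hess_f(w):
    """Hessian of the same f; constant because f is quadratic."""
    return [[-2.0, -1.0], [-1.0, -2.0]]

def solve_2x2(H, g):
    """Solve H delta = g for a 2x2 matrix H using Cramer's rule."""
    det = H[0][0] * H[1][1] - H[0][1] * H[1][0]
    d0 = (g[0] * H[1][1] - H[0][1] * g[1]) / det
    d1 = (H[0][0] * g[1] - H[1][0] * g[0]) / det
    return [d0, d1]

def newton_step(w):
    """One update: w_new = w - H^{-1}(w) grad f(w)."""
    delta = solve_2x2(hess_f(w), grad_f(w))
    return [w[0] - delta[0], w[1] - delta[1]]

w_new = newton_step([0.0, 0.0])
print(w_new)
```

For a quadratic function the second-order approximation is exact, so one step from any starting point lands on the stationary point; for the logistic regression log likelihood several steps are needed.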

Example: MLE for Logistic Regression ▸ Recall our log likelihood for logistic regression: $\log L(w; X, t) = \sum_{n=1}^{N} \left(t_n x_n w - \log(1 + e^{x_n w})\right)$ ▸ The $d$th coordinate of the gradient is $\frac{\partial \log L}{\partial w_d} = \sum_{n=1}^{N} \left(t_n x_{nd} - x_{nd} \frac{e^{x_n w}}{1 + e^{x_n w}}\right)$ ▸ The $d, d'$ entry in the Hessian is $\frac{\partial^2 \log L}{\partial w_d \partial w_{d'}} = -\sum_{n=1}^{N} \frac{e^{x_n w}}{(1 + e^{x_n w})^2} x_{nd} x_{nd'}$
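
Putting the gradient, the Hessian, and the multivariate update together gives a complete fitting loop. The following is a minimal sketch in plain Python, not the course's own implementation: the toy data, the starting point $w = 0$, the tolerance, the `max_iter` cap, and all function names are assumptions. It uses the identity $e^{\eta} / (1 + e^{\eta})^2 = p(1 - p)$ with $p = \mathrm{Logistic}(\eta)$ for the Hessian entries.

```python
import math

def logistic(eta):
    if eta >= 0:
        return 1.0 / (1.0 + math.exp(-eta))
    z = math.exp(eta)
    return z / (1.0 + z)

def grad_and_hessian(w, X, t):
    """Gradient and Hessian of the logistic regression log likelihood at w."""
    D = len(w)
    grad = [0.0] * D
    hess = [[0.0] * D for _ in range(D)]
    for x_n, t_n in zip(X, t):
        p = logistic(sum(x * w_d for x, w_d in zip(x_n, w)))
        for d in range(D):
            grad[d] += (t_n - p) * x_n[d]
            for e in range(D):
                hess[d][e] -= p * (1.0 - p) * x_n[d] * x_n[e]
    return grad, hess

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    z = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = sum(M[i][j] * z[j] for j in range(i + 1, n))
        z[i] = (M[i][n] - s) / M[i][i]
    return z

def newton_logistic(X, t, eps=1e-8, max_iter=25):
    """Iterate w <- w - H^{-1} grad until every gradient coordinate is below eps."""
    w = [0.0] * len(X[0])
    for _ in range(max_iter):
        grad, hess = grad_and_hessian(w, X, t)
        if max(abs(g) for g in grad) <= eps:
            break
        step = solve(hess, grad)
        w = [w_d - s_d for w_d, s_d in zip(w, step)]
    return w

# Hypothetical toy data: intercept column plus one feature, classes overlapping.
X = [[1.0, -2.0], [1.0, -1.0], [1.0, 0.0], [1.0, 1.0], [1.0, 2.0]]
t = [0, 0, 1, 0, 1]
w_hat = newton_logistic(X, t)
print(w_hat)
```

The toy labels are deliberately not perfectly separable by the feature. With perfectly separable data the likelihood has no finite maximizer, and a loop like this would keep increasing the weights until the Hessian becomes numerically singular.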

Solution Path

Classification Result

Nonlinear Classification Result
