Generalized Linear Models
DS-GA 1003
David Rosenberg, New York University
April 12, 2015
Gaussian Regression

- Input space $\mathcal{X} = \mathbb{R}^d$, output space $\mathcal{Y} = \mathbb{R}$.
- Hypothesis space consists of functions $f: x \mapsto \mathcal{N}(w^T x, \sigma^2)$. For each $x$, $f(x)$ returns a particular Gaussian density with variance $\sigma^2$.
- The choice of $w$ determines the function. For some parameter $w \in \mathbb{R}^d$, we can write our prediction function as
  $$[f_w(x)](y) = p_w(y \mid x) = \mathcal{N}(y \mid w^T x, \sigma^2), \quad \text{where } \sigma^2 > 0.$$
- Given some i.i.d. data $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, how do we assess the fit?
Gaussian Regression: Likelihood Scoring

- Suppose we have data $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.
- Compute the model likelihood for $\mathcal{D}$:
  $$p_w(\mathcal{D}) = \prod_{i=1}^n p_w(y_i \mid x_i) \quad \text{[by independence]}$$
- Maximum likelihood estimation (MLE) finds the $w$ maximizing $p_w(\mathcal{D})$.
- Equivalently, maximize the data log-likelihood:
  $$w^* = \operatorname*{argmax}_{w \in \mathbb{R}^d} \sum_{i=1}^n \log p_w(y_i \mid x_i)$$
- Let's start solving this!
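To make likelihood scoring concrete, here is a minimal Python sketch, not from the slides: the function name, the toy arrays `X` and `y`, and the candidate `w` are all illustrative. It evaluates the conditional Gaussian log-likelihood for a given $w$:

```python
import numpy as np
from scipy.stats import norm

def gaussian_log_likelihood(w, X, y, sigma=1.0):
    """Sum of log N(y_i | w^T x_i, sigma^2) over the dataset."""
    mu = X @ w  # predicted means w^T x_i
    return norm.logpdf(y, loc=mu, scale=sigma).sum()

# Toy data: n = 3 points in R^2 (illustrative values only).
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([0.9, 2.1, 3.2])
print(gaussian_log_likelihood(np.array([1.0, 2.0]), X, y))
```

Higher values indicate a better fit; MLE searches for the $w$ that maximizes this score.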
Gaussian Regression: MLE

- The conditional log-likelihood is:
  $$\sum_{i=1}^n \log p_w(y_i \mid x_i) = \sum_{i=1}^n \log\left[\frac{1}{\sigma\sqrt{2\pi}} \exp\left(\frac{-(y_i - w^T x_i)^2}{2\sigma^2}\right)\right] = \underbrace{\sum_{i=1}^n \log\frac{1}{\sigma\sqrt{2\pi}}}_{\text{independent of } w} + \sum_{i=1}^n \frac{-(y_i - w^T x_i)^2}{2\sigma^2}$$
- The MLE is the $w$ where this is maximized.
- Note that $\sigma^2$ is irrelevant to finding the maximizing $w$.
- We can drop the negative sign and turn this into a minimization problem.
Gaussian Regression: MLE

- The MLE is
  $$w^* = \operatorname*{argmin}_{w \in \mathbb{R}^d} \sum_{i=1}^n (y_i - w^T x_i)^2$$
- This is exactly the objective function for least squares.
- From here, we can use the usual approaches to solve for $w^*$ (linear algebra, calculus, iterative methods, etc.); a sketch of the linear-algebra route follows below.
- NOTE: the parameter vector $w$ interacts with $x$ only through the inner product $w^T x$.
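Since the Gaussian MLE coincides with least squares, a minimal sketch of the linear-algebra solution (assuming NumPy; the toy data is illustrative):

```python
import numpy as np

# Solve w* = argmin_w ||y - X w||^2; lstsq is a numerically
# stable alternative to forming the normal equations directly.
X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
y = np.array([0.9, 2.1, 3.2])
w_star, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_star)  # the Gaussian MLE; note it does not depend on sigma^2
```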
Poisson Regression: Setup

- Input space $\mathcal{X} = \mathbb{R}^d$, output space $\mathcal{Y} = \{0, 1, 2, 3, 4, \ldots\}$.
- Hypothesis space consists of functions $f: x \mapsto \text{Poisson}(\lambda(x))$. That is, for each $x$, $f(x)$ returns a Poisson distribution with mean $\lambda(x)$.
- Which function $\lambda$? Recall $\lambda > 0$. GLMs (of which Poisson regression is a special case) have a linear dependence on $x$.
- The standard approach is to take
  $$\lambda(x) = \exp(w^T x),$$
  for some parameter vector $w$.
- Note that the range of $\lambda(x)$ is $(0, \infty)$, appropriate for the Poisson parameter.
Poisson Regression: Likelihood Scoring

- Suppose we have data $\mathcal{D} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$.
- Last time we found the log-likelihood for the Poisson was:
  $$\log p(\mathcal{D}; \lambda) = \sum_{i=1}^n [y_i \log\lambda - \lambda - \log(y_i!)]$$
- Plugging in $\lambda(x_i) = \exp(w^T x_i)$, we get
  $$\log p(\mathcal{D}; w) = \sum_{i=1}^n \left[y_i \log\exp(w^T x_i) - \exp(w^T x_i) - \log(y_i!)\right] = \sum_{i=1}^n \left[y_i w^T x_i - \exp(w^T x_i) - \log(y_i!)\right]$$
- Maximize this with respect to $w$ to find the Poisson regression fit.
- There is no closed form for the optimum, but the objective is concave, so it is easy to optimize; a numerical sketch follows below.
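A minimal numerical sketch (assuming NumPy and SciPy; the data arrays are illustrative, and the constant $\log(y_i!)$ terms are dropped since they do not depend on $w$):

```python
import numpy as np
from scipy.optimize import minimize

def neg_log_likelihood(w, X, y):
    """Negative Poisson log-likelihood, omitting the log(y_i!) constants."""
    eta = X @ w  # linear predictor w^T x_i
    return np.exp(eta).sum() - y @ eta

# Toy count data (illustrative values only).
X = np.array([[1.0, 0.2], [1.0, 0.5], [1.0, 0.9], [1.0, 1.3]])
y = np.array([1.0, 2.0, 2.0, 5.0])
result = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1]), args=(X, y))
print(result.x)  # MLE for w; the objective is convex, so this is the global optimum
```

Minimizing the negative log-likelihood is equivalent to maximizing the (concave) log-likelihood, which is why a generic optimizer finds the global solution here.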
Bernoulli Regression: Linear Probabilistic Classifiers

- Setting: $\mathcal{X} = \mathbb{R}^d$, $\mathcal{Y} = \{0, 1\}$.
- For each $x$, $p(Y = 1 \mid x) = \theta$ (i.e., $Y$ has a $\text{Bernoulli}(\theta)$ distribution), where $\theta$ may vary with $x$.
- For each $x \in \mathbb{R}^d$, we just want to predict $\theta \in [0, 1]$.
- Two steps:
  $$\underbrace{x}_{\in \mathbb{R}^d} \mapsto \underbrace{w^T x}_{\in \mathbb{R}} \mapsto \underbrace{f(w^T x)}_{\in [0, 1]},$$
  where $f: \mathbb{R} \to [0, 1]$ is called the transfer function or inverse link function.
- The probability model is then
  $$p(Y = 1 \mid x) = f(w^T x).$$
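The slide leaves $f$ unspecified at this point; one standard choice, the logistic sigmoid $f(z) = 1/(1 + e^{-z})$, yields logistic regression. A minimal sketch of the two-step prediction (the parameter and input values are illustrative):

```python
import numpy as np

def sigmoid(z):
    """Logistic transfer function: maps R into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_prob(w, x):
    """p(Y = 1 | x) = f(w^T x), here with f = sigmoid."""
    return sigmoid(w @ x)

# Illustrative parameter vector and input.
w = np.array([0.5, -1.0])
x = np.array([2.0, 1.0])
print(predict_prob(w, x))  # a probability in (0, 1)
```

Any monotone $f$ mapping $\mathbb{R}$ onto $[0, 1]$ fits the template; the sigmoid is just the most common choice.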