School of Computer Science
Learning Generalized Linear Models and Tabular CPTs of Structured Full BNs
Probabilistic Graphical Models (10-708), Lecture 9, Oct 15, 2007
Eric Xing


  1. Title slide: Learning generalized linear models and tabular CPTs of structured full BNs. Reading: J-Chap. 7, 8. [Slide figure: a signaling-pathway network with nodes Receptor A ($X_1$), Receptor B ($X_2$), Kinase C ($X_3$), Kinase D ($X_4$), Kinase E ($X_5$), TF F ($X_6$), Gene G ($X_7$), and Gene H ($X_8$).]

     Linear Regression. Let us assume that the target variable and the inputs are related by the equation

     $$y_i = \theta^T x_i + \varepsilon_i$$

     where $\varepsilon_i$ is an error term of unmodeled effects or random noise. Now assume that $\varepsilon_i$ follows a Gaussian $N(0, \sigma^2)$; then we have

     $$p(y_i \mid x_i; \theta) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{(y_i - \theta^T x_i)^2}{2\sigma^2} \right)$$
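     Under this Gaussian noise model, maximizing the log-likelihood over $\theta$ reduces to ordinary least squares. A minimal sketch (not from the slides; assumes numpy and synthetic data):

     ```python
     import numpy as np

     # Synthetic data from the assumed model y_i = theta^T x_i + eps_i, eps_i ~ N(0, sigma^2).
     rng = np.random.default_rng(0)
     N, d = 100, 3
     theta_true = np.array([1.0, -2.0, 0.5])
     X = rng.normal(size=(N, d))
     y = X @ theta_true + rng.normal(scale=0.3, size=N)

     # Maximizing sum_i log p(y_i | x_i; theta) is equivalent to minimizing
     # the squared error, so the MLE is the least-squares solution.
     theta_mle = np.linalg.solve(X.T @ X, X.T @ y)
     print(theta_mle)  # close to theta_true
     ```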

  2. Logistic Regression (sigmoid classifier). The conditional distribution is a Bernoulli:

     $$p(y \mid x) = \mu(x)^y \, (1 - \mu(x))^{1-y}$$

     where $\mu$ is a logistic function

     $$\mu(x) = \frac{1}{1 + e^{-\theta^T x}}$$

     We can use the brute-force gradient method as in LR, but we can also apply generic laws by observing that $p(y \mid x)$ is an exponential family function, more specifically, a generalized linear model.

     Exponential family. For a numeric random variable $X$,

     $$p(x \mid \eta) = h(x) \exp\{\eta^T T(x) - A(\eta)\} = \frac{1}{Z(\eta)}\, h(x) \exp\{\eta^T T(x)\}$$

     is an exponential family distribution with natural (canonical) parameter $\eta$. The function $T(x)$ is a sufficient statistic. The function $A(\eta) = \log Z(\eta)$ is the log normalizer. Examples: Bernoulli, multinomial, Gaussian, Poisson, gamma, ...
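     A minimal sketch of the brute-force gradient approach the slide mentions (not from the slides; assumes numpy and synthetic labels). The gradient $X^T(y - \mu)$ follows from differentiating the Bernoulli log-likelihood:

     ```python
     import numpy as np

     def mu(X, theta):
         """Logistic response mu(x) = 1 / (1 + exp(-theta^T x))."""
         return 1.0 / (1.0 + np.exp(-X @ theta))

     def grad_log_lik(theta, X, y):
         """Gradient of sum_i [y_i log mu_i + (1 - y_i) log(1 - mu_i)],
         which simplifies to X^T (y - mu) for the logistic model."""
         return X.T @ (y - mu(X, theta))

     # Brute-force gradient ascent on the log-likelihood.
     rng = np.random.default_rng(1)
     X = rng.normal(size=(200, 2))
     y = (mu(X, np.array([2.0, -1.0])) > rng.uniform(size=200)).astype(float)
     theta = np.zeros(2)
     for _ in range(500):
         theta += 0.01 * grad_log_lik(theta, X, y)
     print(theta)  # roughly recovers [2.0, -1.0]
     ```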

  3. Multivariate Gaussian Distribution. For a continuous vector random variable $X \in \mathbb{R}^k$ (moment parameterization):

     $$p(x \mid \mu, \Sigma) = \frac{1}{(2\pi)^{k/2} |\Sigma|^{1/2}} \exp\left\{ -\frac{1}{2}(x - \mu)^T \Sigma^{-1} (x - \mu) \right\}$$
     $$= \frac{1}{(2\pi)^{k/2}} \exp\left\{ -\frac{1}{2}\,\mathrm{tr}\!\left(\Sigma^{-1} x x^T\right) + \mu^T \Sigma^{-1} x - \frac{1}{2}\mu^T \Sigma^{-1} \mu - \frac{1}{2}\log|\Sigma| \right\}$$

     Exponential family representation (natural parameterization):

     $$\eta = \left[ \Sigma^{-1}\mu \,;\; \mathrm{vec}\!\left(-\tfrac{1}{2}\Sigma^{-1}\right) \right] = [\eta_1 ; \mathrm{vec}(\eta_2)], \quad \eta_1 = \Sigma^{-1}\mu, \quad \eta_2 = -\tfrac{1}{2}\Sigma^{-1}$$
     $$T(x) = [x \,;\; \mathrm{vec}(xx^T)]$$
     $$A(\eta) = \tfrac{1}{2}\mu^T \Sigma^{-1} \mu + \tfrac{1}{2}\log|\Sigma| = -\tfrac{1}{4}\eta_1^T \eta_2^{-1} \eta_1 - \tfrac{1}{2}\log|-2\eta_2|$$
     $$h(x) = (2\pi)^{-k/2}$$

     Note: a $k$-dimensional Gaussian is a $(k + k^2)$-parameter distribution with a $(k + k^2)$-element vector of sufficient statistics, but because of symmetry and positivity the parameters are constrained and have lower degrees of freedom.

     Multinomial distribution. For a binary vector random variable $x \sim \mathrm{multi}(x \mid \pi)$:

     $$p(x \mid \pi) = \pi_1^{x_1} \pi_2^{x_2} \cdots \pi_K^{x_K} = \exp\left\{ \sum_k x_k \ln \pi_k \right\}$$
     $$= \exp\left\{ \sum_{k=1}^{K-1} x_k \ln \pi_k + \left(1 - \sum_{k=1}^{K-1} x_k\right) \ln\left(1 - \sum_{k=1}^{K-1} \pi_k\right) \right\}$$
     $$= \exp\left\{ \sum_{k=1}^{K-1} x_k \ln\left( \frac{\pi_k}{1 - \sum_{k=1}^{K-1} \pi_k} \right) + \ln\left(1 - \sum_{k=1}^{K-1} \pi_k\right) \right\}$$

     Exponential family representation:

     $$\eta = \left[ \ln\left(\frac{\pi_k}{\pi_K}\right) ;\; 0 \right]$$
     $$T(x) = [x]$$
     $$A(\eta) = -\ln\left(1 - \sum_{k=1}^{K-1} \pi_k\right) = \ln\left( \sum_{k=1}^{K} e^{\eta_k} \right)$$
     $$h(x) = 1$$
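     To make the two Gaussian parameterizations concrete, here is a small sketch (not from the slides; assumes numpy) converting between moment and natural parameters; for readability $\eta_2$ is kept as a matrix rather than its vec:

     ```python
     import numpy as np

     def gaussian_natural_params(mu, Sigma):
         """Moment -> natural: eta1 = Sigma^{-1} mu, eta2 = -1/2 Sigma^{-1}."""
         P = np.linalg.inv(Sigma)  # precision matrix
         return P @ mu, -0.5 * P

     def gaussian_moment_params(eta1, eta2):
         """Natural -> moment: Sigma = -1/2 eta2^{-1}, mu = Sigma eta1."""
         Sigma = -0.5 * np.linalg.inv(eta2)
         return Sigma @ eta1, Sigma

     mu = np.array([1.0, -1.0])
     Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
     eta1, eta2 = gaussian_natural_params(mu, Sigma)
     mu_back, Sigma_back = gaussian_moment_params(eta1, eta2)
     print(np.allclose(mu, mu_back), np.allclose(Sigma, Sigma_back))  # True True
     ```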

  4. Why exponential family? Moment generating property:

     $$\frac{dA}{d\eta} = \frac{d}{d\eta} \log Z(\eta) = \frac{1}{Z(\eta)} \frac{d}{d\eta} Z(\eta) = \frac{1}{Z(\eta)} \frac{d}{d\eta} \int h(x) \exp\{\eta^T T(x)\}\, dx$$
     $$= \int T(x)\, \frac{h(x) \exp\{\eta^T T(x)\}}{Z(\eta)}\, dx = E[T(x)]$$

     $$\frac{d^2 A}{d\eta^2} = \int T(x)\, \frac{h(x) \exp\{\eta^T T(x)\}}{Z(\eta)}\, T(x)\, dx \;-\; \int T(x)\, \frac{h(x) \exp\{\eta^T T(x)\}}{Z(\eta)}\, dx \cdot \frac{1}{Z(\eta)} \frac{dZ(\eta)}{d\eta}$$
     $$= E[T^2(x)] - E[T(x)]^2 = \mathrm{Var}[T(x)]$$

     Moment estimation. We can easily compute moments of any exponential family distribution by taking the derivatives of the log normalizer $A(\eta)$. The $q$th derivative gives the $q$th centered moment:

     $$\frac{dA(\eta)}{d\eta} = \text{mean}, \qquad \frac{d^2 A(\eta)}{d\eta^2} = \text{variance}, \qquad \ldots$$

     When the sufficient statistic is a stacked vector, partial derivatives need to be considered.
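     This property is easy to check numerically. A sketch (not from the slides) for the Bernoulli case, where $A(\eta) = \log(1 + e^\eta)$, so $A'(\eta)$ should equal the mean $\mu = \sigma(\eta)$ and $A''(\eta)$ the variance $\mu(1-\mu)$:

     ```python
     import numpy as np

     # Bernoulli in exponential-family form: A(eta) = log(1 + e^eta).
     A = lambda eta: np.log1p(np.exp(eta))
     eta, h = 0.7, 1e-4

     # First derivative of A: the mean mu = sigmoid(eta).
     dA = (A(eta + h) - A(eta - h)) / (2 * h)
     mu = 1.0 / (1.0 + np.exp(-eta))
     print(dA, mu)              # agree

     # Second derivative of A: the variance mu (1 - mu).
     d2A = (A(eta + h) - 2 * A(eta) + A(eta - h)) / h**2
     print(d2A, mu * (1 - mu))  # agree
     ```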

  5. Moment vs. canonical parameters. The moment parameter $\mu$ can be derived from the natural (canonical) parameter:

     $$\frac{dA(\eta)}{d\eta} \overset{\text{def}}{=} E[T(x)] = \mu$$

     $A(\eta)$ is convex since

     $$\frac{d^2 A(\eta)}{d\eta^2} = \mathrm{Var}[T(x)] > 0$$

     [Slide figure: a plot of a convex $A(\eta)$ over $\eta \in [-2, 2]$, with the minimizer marked $\eta^*$.]

     Hence we can invert the relationship and infer the canonical parameter from the moment parameter (1-to-1):

     $$\eta \overset{\text{def}}{=} \psi(\mu)$$

     A distribution in the exponential family can be parameterized not only by $\eta$ (the canonical parameterization) but also by $\mu$ (the moment parameterization).

     MLE for Exponential Family. For iid data, the log-likelihood is

     $$\ell(\eta; D) = \log \prod_n h(x_n) \exp\{\eta^T T(x_n) - A(\eta)\} = \sum_n \log h(x_n) + \left( \eta^T \sum_n T(x_n) \right) - N A(\eta)$$

     Take derivatives and set to zero:

     $$\frac{\partial \ell}{\partial \eta} = \sum_n T(x_n) - N \frac{\partial A(\eta)}{\partial \eta} = 0
       \;\Rightarrow\; \frac{\partial A(\eta)}{\partial \eta} = \frac{1}{N} \sum_n T(x_n)
       \;\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N} \sum_n T(x_n)$$

     This amounts to moment matching. We can then infer the canonical parameters using $\hat{\eta}_{MLE} = \psi(\hat{\mu}_{MLE})$.
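     A sketch of this two-step recipe for a Bernoulli (not from the slides; assumes numpy and synthetic coin flips), where $T(x) = x$ and $\psi$ is the logit function:

     ```python
     import numpy as np

     # MLE by moment matching for a Bernoulli: T(x) = x, so
     # mu_hat = (1/N) sum_n x_n, and eta_hat = psi(mu_hat) = logit(mu_hat).
     rng = np.random.default_rng(2)
     x = (rng.uniform(size=1000) < 0.3).astype(float)

     mu_hat = x.mean()                        # moment parameter
     eta_hat = np.log(mu_hat / (1 - mu_hat))  # canonical parameter via psi = logit
     print(mu_hat, eta_hat)  # mu_hat near 0.3, eta_hat near log(0.3/0.7)
     ```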

  6. Sufficiency. For $p(x \mid \theta)$, $T(x)$ is sufficient for $\theta$ if there is no information in $X$ regarding $\theta$ beyond that in $T(x)$. We can throw away $X$ for the purpose of inference w.r.t. $\theta$. (The slide illustrates each view below with a small graphical model over $\theta$, $T(x)$, and $X$.)

     Bayesian view: $p(\theta \mid T(x), x) = p(\theta \mid T(x))$

     Frequentist view: $p(x \mid T(x), \theta) = p(x \mid T(x))$

     The Neyman factorization theorem: $T(x)$ is sufficient for $\theta$ if

     $$p(x, T(x), \theta) = \psi_1(T(x), \theta)\, \psi_2(x, T(x))$$
     $$\Rightarrow\; p(x \mid \theta) = g(T(x), \theta)\, h(x, T(x))$$

     Examples.

     Gaussian:
     $$\eta = \left[\Sigma^{-1}\mu \,;\; \mathrm{vec}\!\left(-\tfrac{1}{2}\Sigma^{-1}\right)\right], \quad T(x) = [x \,;\; \mathrm{vec}(xx^T)], \quad A(\eta) = \tfrac{1}{2}\mu^T\Sigma^{-1}\mu + \tfrac{1}{2}\log|\Sigma|, \quad h(x) = (2\pi)^{-k/2}$$
     $$\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n T_1(x_n) = \frac{1}{N}\sum_n x_n$$

     Multinomial:
     $$\eta = \left[\ln\left(\frac{\pi_k}{\pi_K}\right) ;\; 0\right], \quad T(x) = [x], \quad A(\eta) = \ln\left(\sum_{k=1}^K e^{\eta_k}\right), \quad h(x) = 1$$
     $$\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n$$

     Poisson:
     $$\eta = \log\lambda, \quad T(x) = x, \quad A(\eta) = \lambda = e^\eta, \quad h(x) = \frac{1}{x!}$$
     $$\Rightarrow\; \hat{\mu}_{MLE} = \frac{1}{N}\sum_n x_n$$
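     For instance, the Poisson case can be sketched in a few lines (not from the slides; synthetic data, and $\lambda = 4.2$ is an arbitrary choice):

     ```python
     import numpy as np

     # Poisson: T(x) = x and A(eta) = e^eta = lambda, so moment matching
     # gives lambda_hat = (1/N) sum_n x_n and eta_hat = log(lambda_hat).
     rng = np.random.default_rng(3)
     x = rng.poisson(lam=4.2, size=5000)

     lambda_hat = x.mean()
     eta_hat = np.log(lambda_hat)
     print(lambda_hat, eta_hat)  # near 4.2 and log(4.2)
     ```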

  7. Generalized Linear Models (GLIMs). The graphical model: $X_n \to Y_n$, with the pair replicated over $n = 1, \ldots, N$ (a plate).

     - Linear regression
     - Discriminative linear classification
     - Commonality: model $E_p(Y) = \mu = f(\theta^T X)$
       - What is $p()$? The conditional distribution of $Y$.
       - What is $f()$? The response function.

     GLIM:
     - The observed input $x$ is assumed to enter into the model via a linear combination of its elements, $\xi = \theta^T x$.
     - The conditional mean $\mu$ is represented as a function $f(\xi)$ of $\xi$, where $f$ is known as the response function.
     - The observed output $y$ is assumed to be characterized by an exponential family distribution with conditional mean $\mu$.

     GLIM, cont. [Slide diagram: $x \xrightarrow{\theta} \xi \xrightarrow{f} \mu \xrightarrow{\psi} \eta$.]

     $$p(y \mid \eta) = h(y) \exp\{\eta(x)^T y - A(\eta)\}$$
     $$\Rightarrow\; p(y \mid \eta) = h(y) \exp\left\{ \frac{1}{\phi}\left( \eta(x)^T y - A(\eta) \right) \right\}$$

     - The choice of exponential family is constrained by the nature of the data $Y$. Example: if $y$ is a continuous vector, use a multivariate Gaussian; if $y$ is a class label, use a Bernoulli or multinomial.
     - The choice of the response function: $f$ must obey some mild constraints, e.g., range $[0,1]$, positivity, ...
     - Canonical response function: $f = \psi^{-1}(\cdot)$. In this case $\theta^T x$ directly corresponds to the canonical parameter $\eta$.
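     A sketch of fitting a GLIM with a canonical response (not from the slides; assumes numpy, synthetic data, and a hand-picked step size). With the canonical link, the log-likelihood gradient takes the same form $X^T(y - \mu)$ for every family; here the output is Poisson, whose canonical response is $f(\xi) = e^\xi$:

     ```python
     import numpy as np

     # Generic GLIM fit with a canonical response: the gradient w.r.t. theta
     # is X^T (y - mu) with mu = f(X theta), here f = exp (canonical for Poisson).
     rng = np.random.default_rng(4)
     X = rng.normal(size=(500, 2))
     theta_true = np.array([0.8, -0.4])
     y = rng.poisson(np.exp(X @ theta_true))

     theta = np.zeros(2)
     for _ in range(2000):
         mu = np.exp(X @ theta)          # response function f(xi) = e^xi
         theta += 1e-3 * X.T @ (y - mu)  # same gradient form as logistic regression
     print(theta)  # close to theta_true
     ```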
