Linear regression. Course of Machine Learning, Master Degree in Computer Science, University of Rome ``Tor Vergata''. Giorgio Gambosi, a.a. 2018-2019.
Linear models
• Linear combination of input features:
$$y(\mathbf{x}, \mathbf{w}) = w_0 + w_1 x_1 + w_2 x_2 + \ldots + w_D x_D \qquad \text{with } \mathbf{x} = (x_1, \ldots, x_D)$$
• Linear function of the parameters $\mathbf{w}$
• Linear function of the features $\mathbf{x}$
• Extension to a linear combination of basis functions:
$$y(\mathbf{x}, \mathbf{w}) = w_0 + \sum_{j=1}^{M-1} w_j \varphi_j(\mathbf{x})$$
(the model remains linear in $\mathbf{w}$, but in general it is no longer linear in $\mathbf{x}$)
• Let $\varphi_0(\mathbf{x}) = 1$; then $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x})$
Basis functions
• Many types:
• Polynomial (global functions): $\varphi_j(x) = x^j$
• Gaussian (local): $\varphi_j(x) = \exp\left(-\frac{(x-\mu_j)^2}{2s^2}\right)$
• Sigmoid (local): $\varphi_j(x) = \sigma\left(\frac{x-\mu_j}{s}\right) = \frac{1}{1 + e^{-\frac{x-\mu_j}{s}}}$
• Hyperbolic tangent (local): $\varphi_j(x) = \tanh\left(\frac{x-\mu_j}{s}\right) = 2\sigma\left(\frac{2(x-\mu_j)}{s}\right) - 1 = \frac{1 - e^{-\frac{2(x-\mu_j)}{s}}}{1 + e^{-\frac{2(x-\mu_j)}{s}}}$
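As an illustration, the following sketch (not from the slides; the function names, the choice of Gaussian basis functions and all numerical values are assumptions) builds a design matrix $\boldsymbol{\Phi}$ from Gaussian basis functions and evaluates the linear model $y(\mathbf{x}, \mathbf{w}) = \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x})$:

```python
import numpy as np

def gaussian_basis(x, centers, s):
    """Gaussian basis functions phi_j(x) = exp(-(x - mu_j)^2 / (2 s^2)),
    with phi_0(x) = 1 prepended as the bias term."""
    phi = np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))
    return np.hstack([np.ones((x.shape[0], 1)), phi])  # column of ones for w_0

# Example: 9 Gaussian basis functions with centers on [0, 1]
x = np.linspace(0, 1, 25)
centers = np.linspace(0, 1, 9)
Phi = gaussian_basis(x, centers, s=0.1)   # design matrix, shape (N, M)
w = np.random.randn(Phi.shape[1])         # some parameter vector
y = Phi @ w                               # y(x_i, w) = w^T phi(x_i) for every i
```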
Maximum likelihood and least squares
• Assume an additive Gaussian noise term:
$$t = y(\mathbf{x}, \mathbf{w}) + \varepsilon \qquad p(\varepsilon) = \mathcal{N}(\varepsilon \mid 0, \beta^{-1})$$
where $\beta = \frac{1}{\sigma^2}$ is the precision.
• Then
$$p(t \mid \mathbf{x}, \mathbf{w}, \beta) = \mathcal{N}(t \mid y(\mathbf{x}, \mathbf{w}), \beta^{-1})$$
and the expectation of the conditional distribution is
$$\mathbb{E}[t \mid \mathbf{x}] = \int t \, p(t \mid \mathbf{x}) \, dt = y(\mathbf{x}, \mathbf{w})$$
Maximum likelihood and least squares
• The likelihood of a given training set $\mathbf{X}, \mathbf{t}$ is
$$p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \prod_{i=1}^{N} \mathcal{N}(t_i \mid \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}_i), \beta^{-1})$$
• The corresponding log-likelihood is then
$$\ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \sum_{i=1}^{N} \ln \mathcal{N}(t_i \mid \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}_i), \beta^{-1}) = \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) - \beta E_D(\mathbf{w})$$
where
$$E_D(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \left(t_i - \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}_i)\right)^2 = \frac{1}{2} (\boldsymbol{\Phi}\mathbf{w} - \mathbf{t})^T (\boldsymbol{\Phi}\mathbf{w} - \mathbf{t})$$
Maximum likelihood and least squares
• Maximizing the log-likelihood w.r.t. $\mathbf{w}$ is equivalent to minimizing the error function $E_D(\mathbf{w})$
• Maximization is performed by setting the gradient to $0$ (the positive factor $\beta$ can be dropped):
$$\frac{\partial}{\partial \mathbf{w}} \ln p(\mathbf{t} \mid \mathbf{X}, \mathbf{w}, \beta) = \sum_{i=1}^{N} \left(t_i - \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}_i)\right) \boldsymbol{\varphi}(\mathbf{x}_i)^T = 0$$
that is,
$$\sum_{i=1}^{N} t_i \boldsymbol{\varphi}(\mathbf{x}_i)^T - \mathbf{w}^T \sum_{i=1}^{N} \boldsymbol{\varphi}(\mathbf{x}_i) \boldsymbol{\varphi}(\mathbf{x}_i)^T = 0$$
• Result: the normal equations for least squares
$$\mathbf{w}_{ML} = (\boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}$$
where
$$\boldsymbol{\Phi} = \begin{pmatrix} \varphi_0(\mathbf{x}_1) & \varphi_1(\mathbf{x}_1) & \cdots & \varphi_{M-1}(\mathbf{x}_1) \\ \varphi_0(\mathbf{x}_2) & \varphi_1(\mathbf{x}_2) & \cdots & \varphi_{M-1}(\mathbf{x}_2) \\ \vdots & \vdots & \ddots & \vdots \\ \varphi_0(\mathbf{x}_N) & \varphi_1(\mathbf{x}_N) & \cdots & \varphi_{M-1}(\mathbf{x}_N) \end{pmatrix}$$
• Maximizing w.r.t. $\beta$ yields
$$\beta_{ML}^{-1} = \frac{1}{N} \sum_{i=1}^{N} \left(t_i - \mathbf{w}_{ML}^T \boldsymbol{\varphi}(\mathbf{x}_i)\right)^2$$
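A minimal numerical sketch of the maximum-likelihood solution (the synthetic data, the noise level and the reuse of the gaussian_basis helper from the previous sketch are assumptions); np.linalg.lstsq is used instead of forming $(\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1}$ explicitly, which is numerically safer:

```python
import numpy as np

# Continuing the previous sketch (gaussian_basis assumed defined there)
rng = np.random.default_rng(0)

# Synthetic data: t_i = sin(2*pi*x_i) + Gaussian noise of precision beta = 1/sigma^2
N, sigma = 25, 0.2
x = rng.uniform(0, 1, N)
t = np.sin(2 * np.pi * x) + rng.normal(0, sigma, N)

centers = np.linspace(0, 1, 9)
Phi = gaussian_basis(x, centers, s=0.1)       # design matrix, shape (N, M)

# w_ML = (Phi^T Phi)^{-1} Phi^T t, solved via least squares
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# beta_ML^{-1} = (1/N) sum_i (t_i - w_ML^T phi(x_i))^2
beta_ml = 1.0 / np.mean((t - Phi @ w_ml) ** 2)
```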
Least squares geometry
• $\mathbf{t} = (t_1, \ldots, t_N)^T$ is a vector in $\mathbb{R}^N$
• Each basis function $\varphi_j$ applied to $\mathbf{x}_1, \ldots, \mathbf{x}_N$ yields a vector $\boldsymbol{\phi}_j = (\varphi_j(\mathbf{x}_1), \ldots, \varphi_j(\mathbf{x}_N))^T \in \mathbb{R}^N$
• If $M < N$, the vectors $\boldsymbol{\phi}_0, \ldots, \boldsymbol{\phi}_{M-1}$ define a subspace $S$ of dimension (at most) $M$
• $\mathbf{y} = (y(\mathbf{x}_1, \mathbf{w}), \ldots, y(\mathbf{x}_N, \mathbf{w}))^T$ is a vector in $\mathbb{R}^N$: it can be represented as the linear combination $\mathbf{y} = \sum_{j=0}^{M-1} w_j \boldsymbol{\phi}_j$. Hence, it belongs to $S$
• Given $\mathbf{t} \in \mathbb{R}^N$, $\mathbf{y} \in \mathbb{R}^N$ is the vector in the subspace $S$ at minimal squared distance from $\mathbf{t}$
• Given $\mathbf{t} \in \mathbb{R}^N$ and the vectors $\boldsymbol{\phi}_0, \ldots, \boldsymbol{\phi}_{M-1}$, $\mathbf{w}_{ML}$ is such that $\mathbf{y}$ is the vector in $S$ nearest to $\mathbf{t}$
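A quick numerical check of this geometric view (a sketch, under the same assumed setup as above): the fitted vector $\mathbf{y} = \boldsymbol{\Phi}\mathbf{w}_{ML}$ should be the orthogonal projection of $\mathbf{t}$ onto the column space of $\boldsymbol{\Phi}$, so the residual $\mathbf{t} - \mathbf{y}$ is orthogonal to every basis vector $\boldsymbol{\phi}_j$.

```python
import numpy as np

# Continuing from the previous sketch: Phi (N x M), t (N,), w_ml (M,)
y = Phi @ w_ml              # fitted vector, lies in the subspace S
residual = t - y            # should be orthogonal to S

# Each column phi_j of Phi should be (numerically) orthogonal to the residual
print(np.allclose(Phi.T @ residual, 0.0, atol=1e-6))   # expected: True, up to rounding
```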
Gradient descent
• The minimum of $E_D(\mathbf{w})$ may also be computed numerically, by means of gradient descent methods
• Initial assignment $\mathbf{w}^{(0)} = (w_0^{(0)}, w_1^{(0)}, \ldots, w_{M-1}^{(0)})$, with a corresponding error value
$$E_D(\mathbf{w}^{(0)}) = \frac{1}{2} \sum_{i=1}^{N} \left(t_i - (\mathbf{w}^{(0)})^T \boldsymbol{\varphi}(\mathbf{x}_i)\right)^2$$
• Iteratively, the current value $\mathbf{w}^{(i-1)}$ is modified in the direction of steepest descent of $E_D(\mathbf{w})$
• At step $i$,
$$w_j^{(i)} := w_j^{(i-1)} - \eta \left.\frac{\partial E_D(\mathbf{w})}{\partial w_j}\right|_{\mathbf{w}^{(i-1)}}$$
Gradient descent
• In matrix notation:
$$\mathbf{w}^{(i)} := \mathbf{w}^{(i-1)} - \eta \left.\frac{\partial E_D(\mathbf{w})}{\partial \mathbf{w}}\right|_{\mathbf{w}^{(i-1)}}$$
• By the definition of $E_D(\mathbf{w})$, $\frac{\partial E_D(\mathbf{w})}{\partial \mathbf{w}} = -\sum_{n=1}^{N} \left(t_n - \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}_n)\right) \boldsymbol{\varphi}(\mathbf{x}_n)$, hence
$$\mathbf{w}^{(i)} := \mathbf{w}^{(i-1)} + \eta \sum_{n=1}^{N} \left(t_n - (\mathbf{w}^{(i-1)})^T \boldsymbol{\varphi}(\mathbf{x}_n)\right) \boldsymbol{\varphi}(\mathbf{x}_n)$$
• In the sequential (stochastic) version, a single item $(\mathbf{x}_n, t_n)$ is considered at each step:
$$\mathbf{w}^{(i)} := \mathbf{w}^{(i-1)} + \eta \left(t_n - (\mathbf{w}^{(i-1)})^T \boldsymbol{\varphi}(\mathbf{x}_n)\right) \boldsymbol{\varphi}(\mathbf{x}_n)$$
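A sketch of the batch update in code, continuing from the previous sketches (Phi, t and w_ml assumed defined there; the learning rate and iteration count are arbitrary assumptions):

```python
import numpy as np

def gradient_descent(Phi, t, eta=0.01, n_iters=5000):
    """Minimize E_D(w) = 0.5 * ||Phi w - t||^2 by batch gradient descent."""
    w = np.zeros(Phi.shape[1])                 # initial assignment w^(0)
    for _ in range(n_iters):
        grad = -Phi.T @ (t - Phi @ w)          # dE_D/dw = -sum_n (t_n - w^T phi_n) phi_n
        w -= eta * grad                        # step in the direction of steepest descent
    return w

w_gd = gradient_descent(Phi, t)
# For a sufficiently small eta and enough iterations, w_gd approaches the
# closed-form solution w_ml (up to the conditioning of Phi^T Phi)
```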
Regularized least squares
• Regularization term in the cost function:
$$E_D(\mathbf{w}) + \lambda E_W(\mathbf{w})$$
where $E_D(\mathbf{w})$ depends on the dataset (and the parameters), while $E_W(\mathbf{w})$ depends on the parameters alone.
• The regularization coefficient $\lambda$ controls the relative importance of the two terms.
• Simple form:
$$E_W(\mathbf{w}) = \frac{1}{2} \mathbf{w}^T \mathbf{w} = \frac{1}{2} \sum_{i=0}^{M-1} w_i^2$$
• Sum-of-squares cost function: weight decay
$$E(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \{t_i - \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}_i)\}^2 + \frac{\lambda}{2} \mathbf{w}^T \mathbf{w} = \frac{1}{2} (\boldsymbol{\Phi}\mathbf{w} - \mathbf{t})^T (\boldsymbol{\Phi}\mathbf{w} - \mathbf{t}) + \frac{\lambda}{2} \mathbf{w}^T \mathbf{w}$$
with solution
$$\mathbf{w} = (\lambda \mathbf{I} + \boldsymbol{\Phi}^T \boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T \mathbf{t}$$
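A sketch of the closed-form regularized solution (Phi and t come from the earlier sketches; the value of $\lambda$ is an arbitrary assumption):

```python
import numpy as np

def ridge_solution(Phi, t, lam):
    """w = (lambda I + Phi^T Phi)^{-1} Phi^T t, computed by solving a linear system."""
    M = Phi.shape[1]
    return np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)

w_ridge = ridge_solution(Phi, t, lam=np.exp(-2.4))   # lambda chosen arbitrarily here
```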
Regularization
• A more general form:
$$E(\mathbf{w}) = \frac{1}{2} \sum_{i=1}^{N} \{t_i - \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}_i)\}^2 + \frac{\lambda}{2} \sum_{j=0}^{M-1} |w_j|^q$$
• The case $q = 1$ is denoted as lasso: sparse models are favored (in the original figure, the level curves of the cost function are shown in blue).
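The $q = 1$ case has no closed-form solution. One standard way to minimize it (not covered in the slides, named here explicitly) is proximal gradient descent, also known as ISTA. The following sketch minimizes $\frac{1}{2}\|\boldsymbol{\Phi}\mathbf{w} - \mathbf{t}\|^2 + \lambda\|\mathbf{w}\|_1$; the value of $\lambda$ and the iteration count are arbitrary assumptions, and Phi, t come from the earlier sketches.

```python
import numpy as np

def soft_threshold(z, theta):
    """Proximal operator of the l1 norm."""
    return np.sign(z) * np.maximum(np.abs(z) - theta, 0.0)

def lasso_ista(Phi, t, lam, n_iters=2000):
    """Minimize 0.5 * ||Phi w - t||^2 + lam * ||w||_1 by proximal gradient descent."""
    # step size 1/L, with L the largest eigenvalue of Phi^T Phi (Lipschitz constant)
    L = np.linalg.eigvalsh(Phi.T @ Phi).max()
    eta = 1.0 / L
    w = np.zeros(Phi.shape[1])
    for _ in range(n_iters):
        grad = Phi.T @ (Phi @ w - t)              # gradient of the quadratic part
        w = soft_threshold(w - eta * grad, eta * lam)
    return w

w_lasso = lasso_ista(Phi, t, lam=0.1)
# With the l1 penalty, some coefficients are typically driven exactly to zero (sparsity)
```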
Bias vs variance: an example
• Consider the case of the function $y = \sin 2\pi x$ and assume $L = 100$ training sets $T_1, \ldots, T_L$ are available, each of size $n = 25$.
• Given $M = 24$ Gaussian basis functions $\varphi_1(x), \ldots, \varphi_M(x)$, from each training set $T_i$ a prediction function $y_i(x)$ is derived by minimizing the regularized cost function
$$E_D(\mathbf{w}) = \frac{1}{2} (\boldsymbol{\Phi}\mathbf{w} - \mathbf{t})^T (\boldsymbol{\Phi}\mathbf{w} - \mathbf{t}) + \frac{\lambda}{2} \mathbf{w}^T \mathbf{w}$$
An example
[Figure: left, the prediction functions $y_i(x)$ for $\ln \lambda = 2.6$; right, their expectation compared with $y = \sin 2\pi x$; axes $x \in [0, 1]$, $t \in [-1, 1]$.]
Left: a possible plot of the prediction functions $y_i(x)$ ($i = 1, \ldots, 100$), as derived from the training sets $T_i$, $i = 1, \ldots, 100$, setting $\ln \lambda = 2.6$. Right: their expectation, together with the unknown function $y = \sin 2\pi x$. The prediction functions $y_i(x)$ do not differ much from one another (small variance), but their expectation is a bad approximation of the unknown function (large bias).
An example
[Figure: plot of the prediction functions obtained with $\ln \lambda = -0.31$, and their expectation compared with $y = \sin 2\pi x$; axes $x \in [0, 1]$, $t \in [-1, 1]$.]
An example
[Figure: plot of the prediction functions obtained with $\ln \lambda = -2.4$, and their expectation compared with $y = \sin 2\pi x$; axes $x \in [0, 1]$, $t \in [-1, 1]$.]
As $\lambda$ decreases, the variance increases (the prediction functions $y_i(x)$ differ more from one another), while the bias decreases (their expectation is a better approximation of $y = \sin 2\pi x$).
An example
• Plot of $(\text{bias})^2$, variance and their sum as functions of $\lambda$: as $\lambda$ increases, the bias increases and the variance decreases. Their sum has a minimum in correspondence to the optimal value of $\lambda$.
• The term $\mathbb{E}_x[\sigma^2_{y|x}]$ shows an inherent limit to the approximability of $y = \sin 2\pi x$.
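A sketch of this experiment (a simplified reproduction, not the original code; the basis-function width, the noise level and the chosen values of $\ln \lambda$ are assumptions):

```python
import numpy as np

rng = np.random.default_rng(1)
L, n, M, s = 100, 25, 24, 0.1
centers = np.linspace(0, 1, M)
x_test = np.linspace(0, 1, 200)
f_true = np.sin(2 * np.pi * x_test)

def design(x):
    """Design matrix of M Gaussian basis functions."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2 * s ** 2))

Phi_test = design(x_test)

for ln_lam in (2.6, -0.31, -2.4):
    lam = np.exp(ln_lam)
    preds = np.empty((L, x_test.size))
    for l in range(L):                      # one regularized fit per training set T_l
        x = rng.uniform(0, 1, n)
        t = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, n)
        Phi = design(x)
        w = np.linalg.solve(lam * np.eye(M) + Phi.T @ Phi, Phi.T @ t)
        preds[l] = Phi_test @ w
    avg = preds.mean(axis=0)
    bias2 = np.mean((avg - f_true) ** 2)    # (bias)^2, averaged over x
    variance = np.mean(preds.var(axis=0))   # variance, averaged over x
    print(f"ln lambda = {ln_lam:5.2f}: bias^2 = {bias2:.3f}, variance = {variance:.3f}")
```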
Bayesian approach to regression
• Applying maximum likelihood to determine the values of the model parameters is prone to overfitting: hence the need for a regularization term $E(\mathbf{w})$.
• In order to control model complexity, a Bayesian approach assumes a prior distribution over the parameter values.
Prior distribution
Posterior proportional to prior times likelihood; the likelihood is Gaussian (Gaussian noise):
$$p(\mathbf{t} \mid \boldsymbol{\Phi}, \mathbf{w}, \beta) = \prod_{i=1}^{N} \mathcal{N}(t_i \mid \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}_i), \beta^{-1})$$
The conjugate of a Gaussian is a Gaussian: choosing a Gaussian prior distribution for $\mathbf{w}$
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_0, \mathbf{S}_0)$$
results in a Gaussian posterior distribution
$$p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\Phi}) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N) \propto p(\mathbf{t} \mid \boldsymbol{\Phi}, \mathbf{w}) \, p(\mathbf{w})$$
where
$$\mathbf{m}_N = \mathbf{S}_N (\mathbf{S}_0^{-1} \mathbf{m}_0 + \beta \boldsymbol{\Phi}^T \mathbf{t}) \qquad \mathbf{S}_N^{-1} = \mathbf{S}_0^{-1} + \beta \boldsymbol{\Phi}^T \boldsymbol{\Phi}$$
Prior distribution
A common approach: zero-mean isotropic Gaussian prior distribution for $\mathbf{w}$:
$$p(\mathbf{w} \mid \alpha) = \prod_{i=0}^{M-1} \left(\frac{\alpha}{2\pi}\right)^{1/2} e^{-\frac{\alpha}{2} w_i^2}$$
• The parameters in $\mathbf{w}$ are assumed independent and identically distributed, according to a Gaussian with mean $0$, uniform variance $\sigma^2 = \alpha^{-1}$ and null covariance.
• The prior distribution is defined through a hyper-parameter $\alpha$, inversely proportional to the variance.
Posterior distribution
Given the likelihood
$$p(\mathbf{t} \mid \boldsymbol{\Phi}, \mathbf{w}, \beta) = \prod_{i=1}^{N} \left(\frac{\beta}{2\pi}\right)^{1/2} e^{-\frac{\beta}{2} (t_i - \mathbf{w}^T \boldsymbol{\varphi}(\mathbf{x}_i))^2}$$
the posterior distribution of $\mathbf{w}$ derives from Bayes' rule:
$$p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\Phi}, \alpha, \beta) = \frac{p(\mathbf{t} \mid \boldsymbol{\Phi}, \mathbf{w}, \beta) \, p(\mathbf{w} \mid \alpha)}{p(\mathbf{t} \mid \boldsymbol{\Phi}, \alpha, \beta)} \propto p(\mathbf{t} \mid \boldsymbol{\Phi}, \mathbf{w}, \beta) \, p(\mathbf{w} \mid \alpha)$$
Posterior distribution
It is possible to show that, assuming
$$p(\mathbf{w}) = \mathcal{N}(\mathbf{w} \mid \mathbf{0}, \alpha^{-1}\mathbf{I}) \qquad p(\mathbf{t} \mid \mathbf{w}, \boldsymbol{\Phi}) = \mathcal{N}(\mathbf{t} \mid \boldsymbol{\Phi}\mathbf{w}, \beta^{-1}\mathbf{I})$$
the posterior distribution is itself a Gaussian
$$p(\mathbf{w} \mid \mathbf{t}, \boldsymbol{\Phi}, \alpha, \beta) = \mathcal{N}(\mathbf{w} \mid \mathbf{m}_N, \mathbf{S}_N)$$
with
$$\mathbf{S}_N = (\alpha\mathbf{I} + \beta\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1} \qquad \mathbf{m}_N = \beta\mathbf{S}_N\boldsymbol{\Phi}^T\mathbf{t}$$
Note that if $\alpha \to 0$ the prior tends to have infinite variance, and we have minimum information on $\mathbf{w}$ before the training set is considered. In this case,
$$\mathbf{m}_N \to (\boldsymbol{\Phi}^T \beta\mathbf{I} \boldsymbol{\Phi})^{-1} (\boldsymbol{\Phi}^T \beta\mathbf{I} \mathbf{t}) = (\boldsymbol{\Phi}^T\boldsymbol{\Phi})^{-1} \boldsymbol{\Phi}^T\mathbf{t}$$
that is, $\mathbf{w}_{ML}$, the ML estimate of $\mathbf{w}$.
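A sketch of the posterior computation under the isotropic prior (the values of $\alpha$ and $\beta$ are arbitrary assumptions; Phi and t are reused from the earlier sketches):

```python
import numpy as np

def posterior(Phi, t, alpha, beta):
    """Posterior N(w | m_N, S_N), with S_N = (alpha I + beta Phi^T Phi)^{-1}
    and m_N = beta S_N Phi^T t."""
    M = Phi.shape[1]
    S_N = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
    m_N = beta * S_N @ Phi.T @ t
    return m_N, S_N

m_N, S_N = posterior(Phi, t, alpha=0.5, beta=25.0)

# As alpha -> 0 the prior becomes uninformative and m_N approaches w_ml
m_flat, _ = posterior(Phi, t, alpha=1e-6, beta=25.0)
```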
Maximum a Posteriori
• Given the posterior distribution $p(\mathbf{w} \mid \boldsymbol{\Phi}, \mathbf{t}, \alpha, \beta)$, we may derive the value $\mathbf{w}_{MAP}$ which maximizes it (the mode of the distribution).
• This is equivalent to maximizing its logarithm
$$\log p(\mathbf{w} \mid \boldsymbol{\Phi}, \mathbf{t}, \alpha, \beta) = \log p(\mathbf{t} \mid \mathbf{w}, \boldsymbol{\Phi}, \beta) + \log p(\mathbf{w} \mid \alpha) - \log p(\mathbf{t} \mid \boldsymbol{\Phi}, \beta)$$
and, since $p(\mathbf{t} \mid \boldsymbol{\Phi}, \beta)$ is a constant w.r.t. $\mathbf{w}$,
$$\mathbf{w}_{MAP} = \operatorname*{argmax}_{\mathbf{w}} \log p(\mathbf{w} \mid \boldsymbol{\Phi}, \mathbf{t}, \alpha, \beta) = \operatorname*{argmax}_{\mathbf{w}} \left(\log p(\mathbf{t} \mid \mathbf{w}, \boldsymbol{\Phi}, \beta) + \log p(\mathbf{w} \mid \alpha)\right)$$
that is,
$$\mathbf{w}_{MAP} = \operatorname*{argmin}_{\mathbf{w}} \left(-\log p(\mathbf{t} \mid \boldsymbol{\Phi}, \mathbf{w}, \beta) - \log p(\mathbf{w} \mid \alpha)\right)$$
Derivation of MAP
By considering the assumptions on prior and likelihood, this is equivalent to
$$\mathbf{w}_{MAP} = \operatorname*{argmin}_{\mathbf{w}} \left(\frac{\beta}{2} \sum_{i=1}^{N} (t_i - \mathbf{w}^T\boldsymbol{\varphi}(\mathbf{x}_i))^2 + \frac{\alpha}{2} \sum_{i=0}^{M-1} w_i^2 + \text{constants}\right) = \operatorname*{argmin}_{\mathbf{w}} \left(\sum_{i=1}^{N} (t_i - \mathbf{w}^T\boldsymbol{\varphi}(\mathbf{x}_i))^2 + \frac{\alpha}{\beta} \sum_{i=0}^{M-1} w_i^2\right)$$
that is, to considering the cost function
$$E_{MAP}(\mathbf{w}) = \sum_{i=1}^{N} (t_i - \mathbf{w}^T\boldsymbol{\varphi}(\mathbf{x}_i))^2 + \frac{\alpha}{\beta} \mathbf{w}^T \mathbf{w}$$
i.e., a regularized least squares cost function with $\lambda = \frac{\alpha}{\beta}$.
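A quick numerical check of this correspondence (a sketch; the $\alpha$ and $\beta$ values are assumptions, and posterior and ridge_solution come from the earlier sketches): the posterior mean $\mathbf{m}_N$, which is also $\mathbf{w}_{MAP}$ since the mode of a Gaussian equals its mean, coincides with the regularized least squares solution with $\lambda = \alpha / \beta$.

```python
import numpy as np

alpha, beta = 0.5, 25.0
lam = alpha / beta

# Posterior mean (equal to w_MAP for a Gaussian posterior)
m_N, _ = posterior(Phi, t, alpha, beta)

# Regularized least squares solution with lambda = alpha / beta
w_reg = ridge_solution(Phi, t, lam)

print(np.allclose(m_N, w_reg))   # expected: True
```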