
Introduction to Machine Learning - CS725. Instructor: Prof. Ganesh Ramakrishnan. Lecture 4: Linear Regression - Probabilistic Interpretation and Regularization


  1. Introduction to Machine Learning - CS725. Instructor: Prof. Ganesh Ramakrishnan. Lecture 4: Linear Regression - Probabilistic Interpretation and Regularization

  2. Recap: Linear Regression is not Naively Linear. We need to determine w for the linear function f(x_j, w) = ∑_{i=1}^n w_i φ_i(x_j), i.e. the vector of predictions Φw, which minimizes our error function E(f(x, w), D). Owing to the basis functions φ_i, "Linear Regression" is linear in w but NOT in x (which could be arbitrarily non-linear)! The design matrix is

      Φ = | φ_1(x_1)  φ_2(x_1)  ...  φ_n(x_1) |
          |   ...       ...     ...     ...   |
          | φ_1(x_m)  φ_2(x_m)  ...  φ_n(x_m) |        (1)

  3. Recap: Linear Regression is not Naively Linear. We need to determine w for the linear function f(x_j, w) = ∑_{i=1}^n w_i φ_i(x_j), i.e. the vector of predictions Φw, which minimizes our error function E(f(x, w), D). Owing to the basis functions φ_i, "Linear Regression" is linear in w but NOT in x (which could be arbitrarily non-linear)! The design matrix is

      Φ = | φ_1(x_1)  φ_2(x_1)  ...  φ_n(x_1) |
          |   ...       ...     ...     ...   |
          | φ_1(x_m)  φ_2(x_m)  ...  φ_n(x_m) |        (1)

Least Squares error and the corresponding estimates:

      E* = min_w E(w, D) = min_w ( w^T Φ^T Φ w − 2 y^T Φ w + y^T y )        (2)

      w* = argmin_w E(w, D) = argmin_w ∑_{j=1}^m ( ∑_{i=1}^n w_i φ_i(x_j) − y_j )^2        (3)
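
Equations (2)-(3) are simply ||Φw − y||^2 written out. A minimal numerical sketch of this objective, assuming a polynomial basis (powers of x) and made-up toy data, neither of which comes from the slides:

    import numpy as np

    # Toy data (illustrative values, not from the lecture)
    x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])          # m = 5 inputs
    y = np.array([0.1, 0.9, 2.1, 2.9, 4.2])          # m = 5 targets

    def design_matrix(x, n):
        # Phi[j, i] = phi_i(x_j); a polynomial basis phi_i(x) = x**i is assumed here
        return np.vstack([x**i for i in range(n)]).T  # shape (m, n)

    Phi = design_matrix(x, n=3)

    def squared_error(w, Phi, y):
        # E(w, D) = sum_j (sum_i w_i phi_i(x_j) - y_j)^2 = ||Phi w - y||^2
        r = Phi @ w - y
        return r @ r

    w_guess = np.zeros(Phi.shape[1])
    print(squared_error(w_guess, Phi, y))   # error of the all-zero weight vector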

  4. Recap: Geometric Interpretation of the Least Squares Solution. Let y* be a solution in the column space of Φ. The least squares solution is such that the distance between y* and y is minimized. Therefore, the line joining y* to y should be orthogonal to the column space of Φ, which gives

      w = (Φ^T Φ)^{-1} Φ^T y        (4)

Here Φ^T Φ is invertible only if Φ has full column rank.
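
A sketch of the closed form in (4), reusing the made-up data from the previous snippet; it falls back to the pseudo-inverse when Φ lacks full column rank and checks the orthogonality property numerically:

    import numpy as np

    # Same toy data and polynomial basis as the previous sketch (made-up values)
    x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
    y = np.array([0.1, 0.9, 2.1, 2.9, 4.2])
    Phi = np.vstack([x**i for i in range(3)]).T

    def least_squares(Phi, y):
        # w = (Phi^T Phi)^{-1} Phi^T y, valid when Phi has full column rank;
        # otherwise fall back to the Moore-Penrose pseudo-inverse
        if np.linalg.matrix_rank(Phi) == Phi.shape[1]:
            return np.linalg.solve(Phi.T @ Phi, Phi.T @ y)
        return np.linalg.pinv(Phi) @ y

    w_star = least_squares(Phi, y)
    # Geometric check: the residual Phi w* - y is orthogonal to every column of Phi
    print(Phi.T @ (Phi @ w_star - y))   # approximately the zero vector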

  5. Building on questions on Least Squares Linear Regression:
      1. Is there a probabilistic interpretation? Gaussian Error, Maximum Likelihood Estimate.
      2. Addressing overfitting: Bayesian and Maximum A Posteriori Estimates, Regularization.
      3. How to minimize the resultant and more complex error functions? Level Curves and Surfaces, Gradient Vector, Directional Derivative, Gradient Descent Algorithm, Convexity, Necessary and Sufficient Conditions for Optimality.

  6. Probabilistic Modeling of Linear Regression. Linear Model: Y is a linear function of φ(x), subject to a random noise variable ε which we believe is 'mostly' bounded by some threshold σ:

      Y = w^T φ(x) + ε,    ε ∼ N(0, σ^2)

Motivation: N(µ, σ^2) has maximum entropy among all real-valued distributions with a specified variance σ^2. 3-σ rule: about 68% of values drawn from N(µ, σ^2) are within one standard deviation σ of the mean µ; about 95% of the values lie within 2σ; and about 99.7% are within 3σ.
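
The 68/95/99.7 percentages can be verified with a quick simulation; the sample size and σ below are arbitrary choices, not from the slides:

    import numpy as np

    rng = np.random.default_rng(0)
    sigma = 1.5                      # any sigma; the fractions do not depend on it
    eps = rng.normal(0.0, sigma, size=1_000_000)

    for k in (1, 2, 3):
        frac = np.mean(np.abs(eps) <= k * sigma)
        print(f"within {k} sigma: {frac:.3f}")   # ~0.683, ~0.954, ~0.997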

  7. Figure 1: 3-σ rule: about 68% of values drawn from N(µ, σ^2) are within one standard deviation σ of the mean µ; about 95% of the values lie within 2σ; and about 99.7% are within 3σ. Source: https://en.wikipedia.org/wiki/Normal_distribution

  8. Probabilistic Modeling of Linear Regression. Linear Model: Y is a linear function of φ(x), subject to a random noise variable ε which we believe is 'mostly' bounded by some threshold σ:

      Y = w^T φ(x) + ε,    ε ∼ N(0, σ^2)

This allows for the probabilistic model

      P(y_j | w, x_j, σ^2) = N(w^T φ(x_j), σ^2)
      P(y | w, x_1, …, x_m, σ^2) = ∏_{j=1}^m P(y_j | w, x_j, σ^2)

Another motivation: E[Y(w, x_j)] = ?

  9. Probabilistic Modeling of Linear Regression. Linear Model: Y is a linear function of φ(x), subject to a random noise variable ε which we believe is 'mostly' bounded by some threshold σ:

      Y = w^T φ(x) + ε,    ε ∼ N(0, σ^2)

This allows for the probabilistic model

      P(y_j | w, x_j, σ^2) = N(w^T φ(x_j), σ^2)
      P(y | w, x_1, …, x_m, σ^2) = ∏_{j=1}^m P(y_j | w, x_j, σ^2)

Another motivation:

      E[Y(w, x_j)] = w^T φ(x_j) = w_0 + w_1 φ_1(x_j) + … + w_n φ_n(x_j)    (taking φ_0(x) ≡ 1)
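
A small sketch of the per-point model P(y_j | w, x_j, σ^2) = N(w^T φ(x_j), σ^2): the conditional mean is just the linear prediction, and the density can be evaluated directly. The weights, basis values and σ below are made-up illustrative numbers:

    import numpy as np
    from scipy.stats import norm

    # Hypothetical weights and one data point (illustrative values)
    w = np.array([0.2, 1.9, 0.1])
    sigma = 0.3
    phi_xj = np.array([1.0, 0.5, 0.25])    # e.g. [1, x_j, x_j^2]
    y_j = 1.2

    mean_j = w @ phi_xj                              # E[Y(w, x_j)] = w^T phi(x_j)
    density = norm.pdf(y_j, loc=mean_j, scale=sigma) # P(y_j | w, x_j, sigma^2)
    print(mean_j, density)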

  10. Estimating w: Maximum Likelihood. If ε ∼ N(0, σ^2) and y = w^T φ(x) + ε, where w, φ(x) ∈ R^n, then, given dataset D, find the most likely ŵ_ML. Recall:

      Pr(y_j | x_j, w) = (1 / √(2πσ^2)) exp( −(y_j − w^T φ(x_j))^2 / (2σ^2) )

From the Probability of the data to the Likelihood of the parameters: Pr(D | w) = Pr(y | x, w) = ?

  11. Estimating w: Maximum Likelihood. If ε ∼ N(0, σ^2) and y = w^T φ(x) + ε, where w, φ(x) ∈ R^n, then, given dataset D, find the most likely ŵ_ML. Recall:

      Pr(y_j | x_j, w) = (1 / √(2πσ^2)) exp( −(y_j − w^T φ(x_j))^2 / (2σ^2) )

From the Probability of the data to the Likelihood of the parameters:

      Pr(D | w) = Pr(y | x, w) = ∏_{j=1}^m Pr(y_j | x_j, w) = ∏_{j=1}^m (1 / √(2πσ^2)) exp( −(y_j − w^T φ(x_j))^2 / (2σ^2) )

Maximum Likelihood Estimate:

      ŵ_ML = argmax_w Pr(D | w), where Pr(D | w) = Pr(y | x, w) = L(w | D)
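
The likelihood is a product of m Gaussian densities, which tends to underflow numerically as m grows; the sketch below (same made-up data and basis as before, arbitrary w and σ) computes both the raw product and its logarithm, anticipating the log trick on the next slide:

    import numpy as np
    from scipy.stats import norm

    def likelihood(w, Phi, y, sigma):
        # Pr(D | w) = prod_j N(y_j ; w^T phi(x_j), sigma^2) -- can underflow for large m
        return np.prod(norm.pdf(y, loc=Phi @ w, scale=sigma))

    def log_likelihood(w, Phi, y, sigma):
        # LL(w | D) = sum_j log N(y_j ; w^T phi(x_j), sigma^2) -- numerically stable
        return np.sum(norm.logpdf(y, loc=Phi @ w, scale=sigma))

    # Toy data (made-up values, same as the earlier sketches)
    x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
    y = np.array([0.1, 0.9, 2.1, 2.9, 4.2])
    Phi = np.vstack([x**i for i in range(3)]).T
    w = np.array([0.1, 2.0, 0.0])

    print(likelihood(w, Phi, y, 0.3), log_likelihood(w, Phi, y, 0.3))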

  12. Optimization Trick. Optimization Trick: the optimal point is invariant under a monotonically increasing transformation (such as log).

  13. Optimization Trick. Optimization Trick: the optimal point is invariant under a monotonically increasing transformation (such as log).

      log L(w | D) = LL(w | D) = −(m/2) ln(2πσ^2) − (1/(2σ^2)) ∑_{j=1}^m (w^T φ(x_j) − y_j)^2

For a fixed σ^2: ŵ_ML = ?

  14. Optimization Trick. Optimization Trick: the optimal point is invariant under a monotonically increasing transformation (such as log).

      log L(w | D) = LL(w | D) = −(m/2) ln(2πσ^2) − (1/(2σ^2)) ∑_{j=1}^m (w^T φ(x_j) − y_j)^2

For a fixed σ^2:

      ŵ_ML = argmax_w LL(y_1 … y_m | x_1 … x_m, w, σ^2) = argmin_w ∑_{j=1}^m (w^T φ(x_j) − y_j)^2

Note that this is the same as the Least Squares solution!
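
As a numerical sanity check of this equivalence, here is a sketch (same made-up data, an arbitrary fixed σ) that minimizes the negative log-likelihood with a generic optimizer and compares the result to the least-squares solution:

    import numpy as np
    from scipy.optimize import minimize

    # Toy data and polynomial basis (made-up values, same as the earlier sketches)
    x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
    y = np.array([0.1, 0.9, 2.1, 2.9, 4.2])
    Phi = np.vstack([x**i for i in range(3)]).T
    sigma = 0.5                                   # fixed, arbitrary noise level

    def neg_log_likelihood(w):
        # -LL(w|D) = (m/2) ln(2*pi*sigma^2) + ||Phi w - y||^2 / (2 sigma^2)
        r = Phi @ w - y
        return 0.5 * len(y) * np.log(2 * np.pi * sigma**2) + (r @ r) / (2 * sigma**2)

    w_ls = np.linalg.pinv(Phi) @ y                # least-squares solution
    w_ml = minimize(neg_log_likelihood, x0=np.zeros(Phi.shape[1])).x
    print(np.allclose(w_ml, w_ls, atol=1e-4))     # expected: True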

  15. Building on questions on Least Squares Linear Regression:
      1. Is there a probabilistic interpretation? Gaussian Error, Maximum Likelihood Estimate.
      2. Addressing overfitting: Bayesian and Maximum A Posteriori Estimates, Regularization.
      3. How to minimize the resultant and more complex error functions? Level Curves and Surfaces, Gradient Vector, Directional Derivative, Gradient Descent Algorithm, Convexity, Necessary and Sufficient Conditions for Optimality.

  16. Redundant Φ and Overfitting. Figure 2: Root Mean Squared (RMS) errors on sample train and test datasets as a function of the degree t of the polynomial being fit. Too many bends in the curve (t = 9 onwards) ≡ high values of some of the w_i's. Try plotting the values of the w_i's using the applet at http://mste.illinois.edu/users/exner/java.f/leastsquares/#simulation. Train and test errors differ significantly.
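
The behaviour in Figure 2 is easy to reproduce with a synthetic sketch; the "true" curve, noise level and sample sizes below are all made-up choices, not the lecture's dataset. Train and test RMS errors are reported for each polynomial degree t:

    import numpy as np

    rng = np.random.default_rng(1)

    def true_curve(x):
        return np.sin(2 * np.pi * x)          # assumed underlying function

    x_tr, x_te = rng.uniform(0, 1, 10), rng.uniform(0, 1, 100)
    y_tr = true_curve(x_tr) + rng.normal(0, 0.2, x_tr.shape)
    y_te = true_curve(x_te) + rng.normal(0, 0.2, x_te.shape)

    def rms(coeffs, x, y):
        return np.sqrt(np.mean((np.polyval(coeffs, x) - y) ** 2))

    for t in range(10):                        # polynomial degree t
        coeffs = np.polyfit(x_tr, y_tr, deg=t) # least-squares polynomial fit
        # (a conditioning warning at high degree is expected with only 10 points)
        print(t, round(rms(coeffs, x_tr, y_tr), 3), round(rms(coeffs, x_te, y_te), 3))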

  17. Coefficients of two example polynomial fits:
      Fit 1: X^0 * 0.13252679175596802, X^1 * 6.836159339696569, X^2 * -10.198794083500966, X^3 * 8.298738913209064, X^4 * -3.766949862252123, X^5 * 1.0274981119277349, X^6 * -0.17218031550131038, X^7 * 0.017340835860554016, X^8 * -9.623065771393043E-4, X^9 * 2.2595409656184083E-5
      Fit 2: X^0 * -1.4218758581602278, X^1 * 14.756472312089675, X^2 * -24.299789484296475, X^3 * 20.63606795357865, X^4 * -9.934453145766518, X^5 * 2.8975181063446613

  18. Bayesian Linear Regression. The Bayesian interpretation of probabilistic estimation is a logical extension that enables reasoning with uncertainty, but in the light of some background belief. Bayesian linear regression: a Bayesian alternative to Maximum Likelihood (least squares) regression. Continue with normally distributed errors. Model w using a prior distribution and use the posterior over w as the result. Intuitive Prior: ...
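
As a sketch of where this leads (not taken from the slides): with a zero-mean isotropic Gaussian prior w ∼ N(0, α^{-1} I) and Gaussian noise of precision β = 1/σ^2, the posterior over w is again Gaussian, and its mean coincides with ridge-regularized least squares with λ = α/β. The values of α, β and the toy data below are made-up choices:

    import numpy as np

    # Toy data and polynomial basis (made-up values, same as the earlier sketches)
    x = np.array([0.0, 0.5, 1.0, 1.5, 2.0])
    y = np.array([0.1, 0.9, 2.1, 2.9, 4.2])
    Phi = np.vstack([x**i for i in range(3)]).T

    alpha, beta = 1.0, 25.0      # assumed prior precision and noise precision (1/sigma^2)

    # Posterior N(w | m_N, S_N) for the Gaussian prior/likelihood pair:
    #   S_N^{-1} = alpha*I + beta*Phi^T Phi,   m_N = beta * S_N Phi^T y
    S_N_inv = alpha * np.eye(Phi.shape[1]) + beta * Phi.T @ Phi
    m_N = beta * np.linalg.solve(S_N_inv, Phi.T @ y)

    # The posterior mean equals ridge regression with lambda = alpha / beta
    lam = alpha / beta
    w_ridge = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ y)
    print(np.allclose(m_N, w_ridge))   # True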
