Data Mining Techniques CS 6220 - Section 3 - Fall 2016 Lecture 2: Regression Jan-Willem van de Meent ( credit : Yijun Zhao, Marc Toussaint, Bishop)
Administrativa Instructor Jan-Willem van de Meent Email : j.vandemeent@northeastern.edu Phone : +1 617 373-7696 Office Hours : 478 WVH, Wed 1.30pm - 2.30pm Teaching Assistants Yuan Zhong E-mail: yzhong@ccs.neu.edu Office Hours: WVH 462, Wed 3pm - 5pm Kamlendra Kumar E-mail: kumark@zimbra.ccs.neu.edu Office Hours: WVH 462, Fri 3pm - 5pm
Administrativa Course Website http://www.ccs.neu.edu/course/cs6220f16/sec3/ Piazza https://piazza.com/northeastern/fall2016/cs622003/home Project Guidelines (Vote next week) http://www.ccs.neu.edu/course/cs6220f16/sec3/project/
Question What would you like to get out of this course?
Linear Regression
Regression Examples Continuous Features x ⇒ Value y • {age, major, gender, race} ⇒ GPA • {income, credit score, profession} ⇒ Loan Amount • {college, major, GPA} ⇒ Future Income
Example: Boston Housing Data UC Irvine Machine Learning Repository ( good source for project datasets ) https://archive.ics.uci.edu/ml/datasets/Housing
Example: Boston Housing Data
1. CRIM : per capita crime rate by town
2. ZN : proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS : proportion of non-retail business acres per town
4. CHAS : Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX : nitric oxides concentration (parts per 10 million)
6. RM : average number of rooms per dwelling
7. AGE : proportion of owner-occupied units built prior to 1940
8. DIS : weighted distances to five Boston employment centres
9. RAD : index of accessibility to radial highways
10. TAX : full-value property-tax rate per $10,000
11. PTRATIO : pupil-teacher ratio by town
12. B : 1000(Bk - 0.63)^2 where Bk is the proportion of African Americans by town
13. LSTAT : % lower status of the population
14. MEDV : median value of owner-occupied homes in $1000's
Example: Boston Housing Data CRIM : per capita crime rate by town
Example: Boston Housing Data CHAS : Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
Example: Boston Housing Data MEDV : Median value of owner-occupied homes in $1000's
Example: Boston Housing Data N data points D features
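A minimal sketch of loading these data into an N × D feature matrix and an N-vector of targets, assuming a local copy of the UCI file housing.data (whitespace-separated, 14 columns in the order listed above; the file name and the use of numpy are assumptions, not part of the slides):

```python
import numpy as np

# Column names from the UCI description above; MEDV is the regression target.
columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

data = np.loadtxt("housing.data")   # assumed local copy of the UCI file

X = data[:, :-1]   # N x D feature matrix (first 13 columns)
y = data[:, -1]    # N targets: MEDV, in $1000's

N, D = X.shape
print(N, "data points,", D, "features")   # expect 506 data points, 13 features
```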
Regression: Problem Setup Given N observations {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, learn a function y_i = f(x_i) for all i = 1, 2, ..., N, and for a new input x* predict y* = f(x*)
Linear Regression Assume f is a linear combination of D features, f(x) = wᵀx = w_0 + w_1 x_1 + ... + w_D x_D, where x = (1, x_1, ..., x_D)ᵀ and w = (w_0, w_1, ..., w_D)ᵀ. For N points we write y ≈ Xw. Learning task: estimate w
Linear Regression
Error Measure
Mean Squared Error (MSE):
E(w) = (1/N) Σ_{n=1}^{N} (wᵀx_n − y_n)² = (1/N) ‖Xw − y‖²
where X is the matrix with rows x_1ᵀ, x_2ᵀ, ..., x_Nᵀ and y = (y_1, y_2, ..., y_N)ᵀ
Minimizing the Error
E(w) = (1/N) ‖Xw − y‖²
∇E(w) = (2/N) Xᵀ(Xw − y) = 0
XᵀXw = Xᵀy
w = X†y, where X† = (XᵀX)⁻¹Xᵀ is the 'pseudo-inverse' of X
Minimizing the Error (derivation as above): for the matrix derivative identities used here, see the Matrix Cookbook (on course website)
Ordinary Least Squares
Construct the matrix X and the vector y from the dataset {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)} (each x includes x_0 = 1):
X has rows x_1ᵀ, x_2ᵀ, ..., x_Nᵀ and y = (y_1, y_2, ..., y_N)ᵀ
Compute X† = (XᵀX)⁻¹Xᵀ
Return w = X†y
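A minimal numpy sketch of this procedure; the bias column and the use of np.linalg.pinv rather than an explicit matrix inverse are implementation choices, not part of the slide:

```python
import numpy as np

def ols(X, y):
    """Ordinary least squares: w = X_dagger y with X_dagger = (X^T X)^{-1} X^T."""
    # Prepend a column of ones so that each x includes x_0 = 1 (the bias term).
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])
    # np.linalg.pinv computes the Moore-Penrose pseudo-inverse; for full-rank X
    # this equals (X^T X)^{-1} X^T but is numerically better behaved.
    return np.linalg.pinv(Xb) @ y

# Usage sketch: w = ols(X, y); then predict with w[0] + X_new @ w[1:].
```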
Gradient Descent (figure: contour plot of the error E(w) over the weights w_0 and w_1)
Least Mean Squares (a.k.a. gradient descent)
Initialize w(0) at time t = 0
for t = 0, 1, 2, ... do
  Compute the gradient g_t = ∇E(w(t))
  Set the direction to move: v_t = −g_t
  Update: w(t+1) = w(t) + η v_t
  Iterate until it is time to stop
Return the final weights w
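A minimal sketch of this loop for the MSE objective, assuming a fixed step size η and a fixed iteration count as the stopping rule (in practice features are often standardized so that a fixed η behaves well):

```python
import numpy as np

def lms(X, y, eta=0.01, n_iters=1000):
    """Gradient descent on E(w) = (1/N) ||Xw - y||^2 with a fixed step size eta."""
    N, D = X.shape
    w = np.zeros(D)                           # w(0)
    for t in range(n_iters):
        g = (2.0 / N) * X.T @ (X @ w - y)     # gradient g_t = grad E(w(t))
        v = -g                                # direction to move
        w = w + eta * v                       # w(t+1) = w(t) + eta * v_t
    return w
```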
Question When would you want to use OLS, when LMS?
Computational Complexity Least Mean Squares (LMS) vs. Ordinary Least Squares (OLS)
Computational Complexity Least Mean Squares (LMS): O(ND) per gradient step. Ordinary Least Squares (OLS): O(ND²) to form XᵀX plus O(D³) to invert it. OLS is expensive when D is large
Effect of step size
Choosing Stepsize to r f ( x ) ?? Set step size proportional to ? small gradient small step? large gradient large step?
Choosing Stepsize Set the step size proportional to ∇f(x)? Small gradient, small step? Large gradient, large step? Two commonly used techniques: 1. Stepsize adaptation 2. Line search
Stepsize Adaptation
Input: initial x ∈ Rⁿ, functions f(x) and ∇f(x), initial stepsize α, tolerance θ
Output: x
1: repeat
2:   y ← x − α ∇f(x) / |∇f(x)|
3:   if f(y) ≤ f(x) then [step is accepted]
4:     x ← y
5:     α ← 1.2 α   // increase stepsize
6:   else [step is rejected]
7:     α ← 0.5 α   // decrease stepsize
8:   end if
9: until |y − x| < θ [perhaps for 10 iterations in sequence]
(1.2 and 0.5 are "magic numbers")
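A minimal Python sketch of this accept/reject rule, assuming the caller supplies the objective f and its gradient as functions (the convergence check on a single accepted step is a simplification of the "10 iterations in sequence" note above):

```python
import numpy as np

def adaptive_gradient_descent(f, grad, x, alpha=1.0, theta=1e-6, max_iters=1000):
    """Gradient descent with the accept/reject stepsize rule above."""
    for _ in range(max_iters):
        g = grad(x)
        y = x - alpha * g / np.linalg.norm(g)   # candidate step along -grad f(x)
        if f(y) <= f(x):                        # step is accepted
            if np.linalg.norm(y - x) < theta:   # converged
                return y
            x = y
            alpha *= 1.2                        # increase stepsize
        else:                                   # step is rejected
            alpha *= 0.5                        # decrease stepsize
    return x
```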
Second Order Methods Compute the Hessian matrix of second derivatives, H_ij = ∂²f / (∂x_i ∂x_j)
Second Order Methods
Broyden-Fletcher-Goldfarb-Shanno (BFGS) method:
Input: initial x ∈ Rⁿ, functions f(x), ∇f(x), tolerance θ
Output: x
1: initialize H⁻¹ = I_n
2: repeat
3:   compute Δ = −H⁻¹ ∇f(x)
4:   perform a line search min_α f(x + αΔ)
5:   Δ ← αΔ
6:   y ← ∇f(x + Δ) − ∇f(x)
7:   x ← x + Δ
8:   update H⁻¹ ← (I − Δyᵀ/(yᵀΔ)) H⁻¹ (I − yΔᵀ/(yᵀΔ)) + ΔΔᵀ/(yᵀΔ)
9: until ‖Δ‖∞ < θ
Memory-limited version: L-BFGS
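In practice one rarely hand-codes BFGS; a sketch of fitting the regression weights with SciPy's L-BFGS implementation (using scipy.optimize.minimize here is an illustrative choice, not something prescribed by the slide):

```python
import numpy as np
from scipy.optimize import minimize

def fit_lbfgs(X, y):
    """Minimize E(w) = (1/N) ||Xw - y||^2 with SciPy's L-BFGS solver."""
    N, D = X.shape

    def error(w):                                 # objective E(w)
        r = X @ w - y
        return (r @ r) / N

    def gradient(w):                              # grad E(w)
        return (2.0 / N) * X.T @ (X @ w - y)

    result = minimize(error, np.zeros(D), jac=gradient, method="L-BFGS-B")
    return result.x
```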
Stochastic Gradient Descent What if N is really large? Batch gradient descent evaluates all data at every step; minibatch gradient descent evaluates a random subset per step. Converges if the step sizes satisfy the Robbins-Monro conditions (Σ_t η_t = ∞, Σ_t η_t² < ∞)
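A minimal minibatch SGD sketch for the MSE objective; the batch size, epoch count, and the 1/t step-size decay (one schedule that meets the Robbins-Monro conditions) are illustrative assumptions:

```python
import numpy as np

def sgd(X, y, batch_size=32, n_epochs=50, eta0=0.1):
    """Minibatch stochastic gradient descent on the MSE,
    with a decaying step size eta_t = eta0 / t."""
    N, D = X.shape
    w = np.zeros(D)
    t = 0
    for epoch in range(n_epochs):
        order = np.random.permutation(N)             # shuffle once per epoch
        for start in range(0, N, batch_size):
            idx = order[start:start + batch_size]    # current minibatch
            g = (2.0 / len(idx)) * X[idx].T @ (X[idx] @ w - y[idx])
            t += 1
            w = w - (eta0 / t) * g                   # decaying step size
    return w
```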
Probabilistic Interpretation
Normal Distribution (figure: histograms of right-skewed, left-skewed, and random data)
Normal Distribution y ∼ N(μ, σ²) ⇒ Density: p(y) = (1 / √(2πσ²)) exp(−(y − μ)² / (2σ²))
Central Limit Theorem (figure: distribution of the mean of N samples for N = 1, 2, 10) If y_1, ..., y_n are 1. independent and identically distributed (i.i.d.) and 2. have finite variance 0 < σ_y² < ∞, then the distribution of their sample mean approaches a normal distribution as n grows
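A small empirical check of this statement, averaging uniform(0, 1) samples as in the panels above (the sample counts and use of numpy are assumptions):

```python
import numpy as np

# The mean of N i.i.d. uniform(0, 1) samples concentrates around 0.5 and its
# distribution looks increasingly normal as N grows (cf. the N = 1, 2, 10 panels).
for N in (1, 2, 10):
    means = np.random.rand(100000, N).mean(axis=1)
    print(N, round(means.mean(), 3), round(means.std(), 3))
# The spread of the sample mean shrinks roughly as 1 / sqrt(N).
```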
Multivariate Normal y ∼ N(μ, Σ) Density: p(y) = (2π)^(−D/2) |Σ|^(−1/2) exp(−½ (y − μ)ᵀ Σ⁻¹ (y − μ))
Regression: Probabilistic Interpretation Assume y_n = wᵀx_n + ε_n with noise ε_n ∼ N(0, σ²), so that p(y_n | x_n, w, σ²) = N(y_n; wᵀx_n, σ²)
Regression: Probabilistic Interpretation
Regression: Probabilistic Interpretation Joint probability of N independent data points: p(y | X, w, σ²) = Π_{n=1}^{N} N(y_n; wᵀx_n, σ²)
Regression: Probabilistic Interpretation Log joint probability of N independent data points: log p(y | X, w, σ²) = −(N/2) log(2πσ²) − (1/(2σ²)) Σ_{n=1}^{N} (y_n − wᵀx_n)²
Regression: Probabilistic Interpretation Log joint probability of N independent data points Maximum Likelihood: maximizing the log likelihood with respect to w is equivalent to minimizing the mean squared error E(w) = (1/N) ‖Xw − y‖²
Basis function regression Linear regression: y = w_0 + w_1 x_1 + ... + w_D x_D = wᵀx. Basis function regression: y = wᵀφ(x) = Σ_m w_m φ_m(x) for basis functions φ_m. Polynomial regression: φ_m(x) = x^m
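A small sketch of basis function regression for a scalar input: expand x into polynomial features and solve ordinary least squares on the expanded design matrix. The toy data (noisy samples of sin(2πx), as in the figures below) and the choice M = 3 are assumptions for illustration:

```python
import numpy as np

def polynomial_features(x, M):
    """Map each scalar input x_n to the basis (1, x_n, x_n^2, ..., x_n^M)."""
    return np.vstack([x ** m for m in range(M + 1)]).T

# Toy data in the style of the figures below: noisy samples of sin(2 pi x).
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.1 * np.random.randn(10)

Phi = polynomial_features(x, M=3)       # N x (M+1) design matrix
w = np.linalg.pinv(Phi) @ t             # ordinary least squares on the basis
predictions = Phi @ w
```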
Polynomial Regression (figure: polynomial fits of order M = 0, 1, 3, 9 to data t vs. x)
Polynomial Regression (figure: the M = 0 and M = 1 fits underfit the data)
Polynomial Regression (figure: the M = 9 fit overfits the data)
Regularization
L2 regularization (ridge regression) minimizes:
E(w) = (1/N) ‖Xw − y‖² + λ ‖w‖², where λ ≥ 0 and ‖w‖² = wᵀw
L1 regularization (LASSO) minimizes:
E(w) = (1/N) ‖Xw − y‖² + λ |w|₁, where λ ≥ 0 and |w|₁ = Σ_{i=1}^{D} |w_i|
Regularization
Regularization
L2: closed-form solution w = (XᵀX + λI)⁻¹ Xᵀy
L1: no closed-form solution; use quadratic programming: minimize ‖Xw − y‖² s.t. ‖w‖₁ ≤ s
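A minimal sketch of the L2 closed-form solution; solving the linear system with np.linalg.solve instead of forming the inverse explicitly is an implementation choice:

```python
import numpy as np

def ridge(X, y, lam):
    """L2-regularized least squares: w = (X^T X + lambda I)^{-1} X^T y."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# For L1 (LASSO) there is no closed form; solvers such as scikit-learn's
# linear_model.Lasso use iterative methods (coordinate descent) instead.
```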
Review: Bias-Variance Trade-off Maximum likelihood estimator ŷ(x). Bias-variance decomposition (expected value over possible data sets): E[(ŷ(x) − y)²] = Bias[ŷ(x)]² + Var[ŷ(x)] + noise
Bias-Variance Trade-off Often: low bias ⇒ high variance, low variance ⇒ high bias. Trade-off: choose the model complexity (or regularization strength λ) that balances the two
K-fold Cross-Validation 1. Divide dataset into K “folds” 2. Train on all except k -th fold 3. Test on k -th fold 4. Minimize test error w.r.t. λ
K-fold Cross-Validation • Choices for K : 5, 10, N (leave-one-out) • Cost of computation: K × (number of λ values) training runs
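A minimal sketch of this procedure for selecting λ, reusing the ridge solver sketched earlier (the random fold assignment and mean-squared test error are the assumed choices):

```python
import numpy as np

def kfold_cv(X, y, lambdas, K=10):
    """Choose lambda by K-fold cross-validation, using the ridge solver above."""
    N = X.shape[0]
    folds = np.array_split(np.random.permutation(N), K)   # K disjoint index sets
    mean_test_error = []
    for lam in lambdas:
        errors = []
        for k in range(K):
            test = folds[k]
            train = np.concatenate([folds[j] for j in range(K) if j != k])
            w = ridge(X[train], y[train], lam)             # train on all but fold k
            r = X[test] @ w - y[test]                      # test on fold k
            errors.append(np.mean(r ** 2))
        mean_test_error.append(np.mean(errors))
    return lambdas[int(np.argmin(mean_test_error))]        # lambda with lowest test error
```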
Learning Curve
Learning Curve
Loss Functions