

SLIDE 1

Probability and Statistics for Computer Science

“All models are wrong, but some models are useful” --- George Box

Hongye Liu, Teaching Assistant Professor, CS361, UIUC, 11.19.2020 (Credit: Wikipedia)

SLIDE 2

Last time

✺ Linear regression

  ✺ The problem

  ✺ The least squares solution

  ✺ The training and prediction

  ✺ The R-squared for the evaluation of the fit

SLIDE 3

Objectives

✺ Linear regression (cont.)

  ✺ Modeling non-linear relationships with linear regression

  ✺ Outliers and over-fitting issues

  ✺ Regularized linear regression / Ridge regression

✺ Nearest neighbor regression

SLIDE 4

What if the relationship between variables is non-linear?

✺ A linear model will not produce a good fit if the dependent variable is not a linear combination of the explanatory variables

[Figure: linear fit to data with a non-linear relationship; R² = 0.1]

SLIDE 5

Transforming variables can allow a linear model to capture a non-linear relationship

✺ In the word-frequency example, log-transforming both variables would allow a linear model to fit the data well.
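As a rough sketch of this idea (my own illustration, not from the slides), here is a NumPy fit of a line to log-transformed, Zipf-like synthetic data; the variable names and the data are assumptions of the example:

```python
import numpy as np

# Synthetic rank/frequency data that roughly follows a power law (Zipf-like),
# purely for illustration -- not the dataset used in the lecture.
rng = np.random.default_rng(0)
rank = np.arange(1, 201)
freq = 1000.0 / rank * np.exp(rng.normal(0.0, 0.1, size=rank.size))

# Log-transform both variables, then fit an ordinary linear model.
x = np.log(rank)
y = np.log(freq)
X = np.column_stack([x, np.ones_like(x)])       # design matrix [log(rank), 1]
beta, *_ = np.linalg.lstsq(X, y, rcond=None)    # least squares coefficients

# R-squared on the log-log scale.
resid = y - X @ beta
r2 = 1.0 - (resid @ resid) / ((y - y.mean()) @ (y - y.mean()))
print(beta, r2)
```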

SLIDE 6

Another example: data on fish in a lake in Finland

[Image: a yellow perch]

✺ Perch (a kind of fish) in a lake in Finland, 56 data observations

✺ Variables include: Weight, Length, Height, Width

✺ To illustrate the point, let’s model Weight as the dependent variable and Length as the explanatory variable.

SLIDE 7

Is the linear model fine for this data?

  • A. YES
  • B. NO
SLIDE 8

Is the linear model fine for this data?

✺ An R-squared of 0.87 may suggest the model is OK

✺ But the trend of the data suggests a non-linear relationship

✺ Intuition tells us that weight is not linear in length, given that a fish is 3-dimensional

✺ We can do better!

SLIDE 9

Transforming the explanatory variables

SLIDE 10
  • Q. What are the matrix X and y?

With the transformed explanatory variable, the model regresses Weight on Length³ with an intercept: each row of X is [Length³, 1] and y is the vector of Weight values.
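A minimal sketch of this setup in NumPy (my own illustration; the synthetic `length` and `weight` arrays stand in for the perch data, which is not reproduced here):

```python
import numpy as np

# Toy stand-in for the perch data (the real set has 56 observations);
# weight grows roughly with the cube of length.
rng = np.random.default_rng(1)
length = rng.uniform(10.0, 45.0, size=56)
weight = 0.01 * length**3 + rng.normal(0.0, 5.0, size=56)

# Transformed explanatory variable: each row of X is [Length^3, 1], y is Weight.
X = np.column_stack([length**3, np.ones_like(length)])
y = weight
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Weight ~ %.4f * Length^3 + %.2f" % (beta[0], beta[1]))
```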

SLIDE 11

Transforming the dependent variable

SLIDE 12

What is the model now?

SLIDE 13

What are the matrix X and y?

With the cube-root transform of the dependent variable, the model regresses Weight^(1/3) on Length with an intercept: each row of X is [Length, 1] and y is the vector of Weight^(1/3) values.
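And the corresponding sketch for the transformed dependent variable (again with synthetic stand-in data, not the actual perch measurements):

```python
import numpy as np

rng = np.random.default_rng(1)
length = rng.uniform(10.0, 45.0, size=56)                  # synthetic stand-in data
weight = 0.01 * length**3 + rng.normal(0.0, 5.0, size=56)

# Transformed dependent variable: each row of X is [Length, 1], y is Weight^(1/3).
X = np.column_stack([length, np.ones_like(length)])
y = np.cbrt(weight)
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
print("Weight^(1/3) ~ %.4f * Length + %.2f" % (beta[0], beta[1]))
```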

SLIDE 14

Effect of outliers on linear regression

✺ Linear regression is sensitive to outliers

SLIDE 15

Effect of outliers: body fat example

✺ Linear regression is sensitive to outliers

SLIDE 16

Over-fitting issue: example of using too many power transformations

SLIDE 17

Avoiding over-fitting

✺ Method 1: validation

  ✺ Use a validation set to choose the transformed explanatory variables

  ✺ The difficulty is that the number of combinations is exponential in the number of variables

✺ Method 2: regularization

  ✺ Impose a penalty on the complexity of the model during training

  ✺ Encourage smaller model coefficients

  ✺ We can use validation to select the regularization parameter λ

SLIDE 18

Regularized linear regression

✺ In ordinary least squares, the cost function is:

      ∥e∥^2 = ∥y − Xβ∥^2 = (y − Xβ)^T (y − Xβ)

✺ In regularized least squares, we add a penalty with a weight parameter λ (λ > 0):

      ∥y − Xβ∥^2 + λ∥β∥_2^2 = (y − Xβ)^T (y − Xβ) + λ β^T β
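As a quick illustration (my own, not from the slides), both cost functions can be written directly in NumPy:

```python
import numpy as np

def ols_cost(X, y, beta):
    """Ordinary least squares cost: ||y - X beta||^2."""
    e = y - X @ beta
    return e @ e

def ridge_cost(X, y, beta, lam):
    """Regularized cost: ||y - X beta||^2 + lam * ||beta||^2."""
    return ols_cost(X, y, beta) + lam * (beta @ beta)
```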

SLIDE 19

Training using regularized least squares

✺ Differentiating the cost function and setting it to zero, one gets:

      (X^T X + λI) β − X^T y = 0

✺ (X^T X + λI) is always invertible, so the regularized least squares estimate of the coefficients is:

      β = (X^T X + λI)^(−1) X^T y
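A minimal NumPy sketch of this estimator (my own illustration; λ and the toy data are arbitrary):

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Regularized least squares: beta = (X^T X + lam*I)^(-1) X^T y."""
    d = X.shape[1]
    # Solve the linear system instead of forming the inverse explicitly.
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Tiny illustration with made-up data.
rng = np.random.default_rng(2)
X = np.column_stack([rng.normal(size=50), np.ones(50)])
y = 3.0 * X[:, 0] + 1.0 + rng.normal(0.0, 0.5, size=50)
print(ridge_fit(X, y, lam=0.1))
```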

SLIDE 20

Why is the regularized version always invertible?

Claim: (X^T X + λI) is invertible for any λ > 0. Proof:

Energy-based definition of positive semi-definite: for a matrix A and any nonzero vector f, we have f^T A f ≥ 0; positive definite means f^T A f > 0.

If A is positive definite, then all eigenvalues of A are positive, so A is invertible.

SLIDE 21

Why is the regularized version always invertible? (cont.)

For any nonzero vector f:

      f^T (X^T X + λI) f = f^T X^T X f + λ f^T f = ∥Xf∥^2 + λ∥f∥^2 > 0   (since λ > 0 and f ≠ 0)

So (X^T X + λI) is positive definite, hence all of its eigenvalues are positive and it is invertible.

SLIDE 22

Over-fitting issue: example of using too many power transformations

SLIDE 23

Choosing lambda using cross-validation
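One plausible way to implement this, as a sketch only: k-fold cross-validation over a small grid of λ values, built on the closed-form ridge fit from the previous slide (the number of folds, the grid, and the toy data are all assumptions of the example):

```python
import numpy as np

def ridge_fit(X, y, lam):
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def cv_mse(X, y, lam, k=5, seed=0):
    """Average held-out MSE of ridge regression over k folds."""
    idx = np.random.default_rng(seed).permutation(len(y))
    errs = []
    for fold in np.array_split(idx, k):
        train = np.setdiff1d(idx, fold)
        beta = ridge_fit(X[train], y[train], lam)
        errs.append(np.mean((y[fold] - X[fold] @ beta) ** 2))
    return float(np.mean(errs))

# Pick lambda from a small grid by minimizing cross-validated MSE (toy data).
rng = np.random.default_rng(3)
X = np.column_stack([rng.normal(size=(60, 5)), np.ones(60)])
y = X[:, 0] - 2.0 * X[:, 1] + rng.normal(0.0, 0.3, size=60)
lams = [0.01, 0.1, 1.0, 10.0]
best = min(lams, key=lambda lam: cv_mse(X, y, lam))
print("chosen lambda:", best)
```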

SLIDE 24
  • Q. Can we use the R-squared to evaluate the regularized model correctly?

  • A. YES
  • B. NO
  • C. YES and NO
SLIDE 25

Nearest neighbor regression

✺ In addition to linear regression and generalized linear regression models, there are methods such as nearest neighbor regression that do not need much training of model parameters.

✺ When there is plenty of data, nearest neighbor regression can be used effectively.

SLIDE 26

K nearest neighbor regression with k=1

The idea is very similar to the k-nearest neighbor classifier, but the regression model predicts numbers. K=1 gives piecewise constant predictions.
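A tiny sketch of the k=1 case (illustrative only): the prediction at a query point is simply the y value of the closest training point, which is what makes the fit piecewise constant.

```python
import numpy as np

def knn1_predict(x0, X_train, y_train):
    """Predict y at x0 as the y value of the single nearest training point."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    return y_train[np.argmin(dists)]

# Toy 1-D example: the prediction is constant between training points.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.5, 1.0, 4.0])
print(knn1_predict(np.array([1.6]), X_train, y_train))   # nearest point is x=2.0, so 1.0
```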

SLIDE 27

K nearest neighbor regression with weights

✺ The goal is to predict y_0^p at x_0 using a training set {(x, y)}.

✺ Let {(x_j, y_j)} be the set of k items in the training data set that are closest to x_0.

✺ The prediction is:

      y_0^p = ( Σ_j w_j y_j ) / ( Σ_j w_j )

  where the w_j are weights that drop off as x_j gets further away from x_0.

SLIDE 28

Choosing different weight functions for KNN regression

Recall the prediction:

      y_0^p = ( Σ_j w_j y_j ) / ( Σ_j w_j )

✺ Inverse distance:        w_j = 1 / ∥x_0 − x_j∥

✺ Exponential function:    w_j = exp( −∥x_0 − x_j∥^2 / (2σ^2) )
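Putting the prediction formula and these two weighting schemes together in a small sketch (k, σ, and the toy data are illustrative choices, not values from the lecture):

```python
import numpy as np

def knn_weighted_predict(x0, X_train, y_train, k=3, scheme="exponential", sigma=1.0):
    """Weighted k-NN regression: y = sum_j w_j*y_j / sum_j w_j over the k nearest points."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    nearest = np.argsort(dists)[:k]
    d = dists[nearest]
    if scheme == "inverse":
        w = 1.0 / np.maximum(d, 1e-12)           # inverse distance, guard against d = 0
    else:
        w = np.exp(-d**2 / (2.0 * sigma**2))     # exponential weights
    return np.sum(w * y_train[nearest]) / np.sum(w)

# Toy 1-D example.
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([0.0, 1.5, 1.0, 4.0])
print(knn_weighted_predict(np.array([1.6]), X_train, y_train, k=3, scheme="inverse"))
```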

SLIDE 29

Evaluation of KNN models

✺ Which methods do you use to choose K and the weight functions?

  • A. Cross validation
  • B. Evaluation of MSE
  • C. Both A and B
SLIDE 30

The Pros and Cons of K nearest neighbor regression

✺ Pros:

  ✺ The method is very intuitive and simple

  ✺ You can predict more than numbers, as long as you can define a similarity measure

✺ Cons:

  ✺ The method doesn’t work well for very high-dimensional data

  ✺ The model depends on the scale of the data (see the sketch below)
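Because distances drive the predictions, a common way to address the scale issue is to standardize each feature before computing distances; a minimal sketch (the standardization step is my illustration, not something prescribed on the slide):

```python
import numpy as np

def standardize(X_train, X_query):
    """Scale each feature to zero mean and unit variance using training statistics."""
    mu = X_train.mean(axis=0)
    sd = X_train.std(axis=0)
    sd[sd == 0.0] = 1.0                    # avoid dividing by zero for constant features
    return (X_train - mu) / sd, (X_query - mu) / sd

# Features on very different scales (e.g. metres vs grams) would otherwise
# dominate the distance calculation.
X_train = np.array([[1.2, 500.0], [1.5, 800.0], [0.9, 300.0]])
X_query = np.array([[1.3, 600.0]])
print(standardize(X_train, X_query))
```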

SLIDE 31

Assignments

✺ Finish Chapter 13 of the textbook

✺ Next time: curse of dimensionality, clustering

SLIDE 32

Additional References

✺ Robert V. Hogg, Elliot A. Tanis and Dale L. Zimmerman, “Probability and Statistical Inference”

✺ Kevin Murphy, “Machine Learning: A Probabilistic Perspective”

SLIDE 33

See you next time

See You!