Machine Learning Basics Lecture 2: Linear Classification Princeton University COS 495 Instructor: Yingyu Liang
Review: machine learning basics
Math formulation
• Given training data $(x_t, y_t), 1 \le t \le n$, i.i.d. from distribution $D$
• Find $y = f(x) \in \mathcal{H}$ that minimizes $\hat{L}(f) = \frac{1}{n} \sum_{t=1}^{n} l(f, x_t, y_t)$
• s.t. the expected loss $L(f) = \mathbb{E}_{(x,y) \sim D}[l(f, x, y)]$ is small
Machine learning 1-2-3
• Collect data and extract features
• Build model: choose hypothesis class $\mathcal{H}$ and loss function $l$
• Optimization: minimize the empirical loss
Machine learning 1-2-3
• Collect data and extract features (Experience)
• Build model: choose hypothesis class $\mathcal{H}$ and loss function $l$ (Prior knowledge)
• Optimization: minimize the empirical loss
Example: Linear regression
• Given training data $(x_t, y_t), 1 \le t \le n$, i.i.d. from distribution $D$
• Find $f_w(x) = w^T x$ (linear model $\mathcal{H}$) that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{t=1}^{n} (w^T x_t - y_t)^2$ ($l_2$ loss)
Why $l_2$ loss
• Why not choose another loss?
• $l_1$ loss, hinge loss, exponential loss, …
• Empirical: easy to optimize
• For the linear case there is a closed form: $w = (X^T X)^{-1} X^T y$
• Theoretical: a way to encode prior knowledge
Questions:
• What kind of prior knowledge?
• Principled way to derive the loss?
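As a quick illustration of the closed-form solution mentioned above, here is a minimal NumPy sketch; the synthetic data and variable names are mine, not from the lecture.

```python
import numpy as np

# Synthetic regression data: rows of X are the x_t, y holds the targets y_t.
rng = np.random.default_rng(0)
n, d = 100, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
y = X @ w_true + 0.1 * rng.normal(size=n)

# Closed-form least-squares solution w = (X^T X)^{-1} X^T y,
# computed by solving the normal equations rather than forming the inverse.
w_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(w_hat)  # close to w_true
```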
Maximum likelihood estimation
Maximum likelihood estimation (MLE)
• Given training data $(x_t, y_t), 1 \le t \le n$, i.i.d. from distribution $D$
• Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
• Would like to pick $\theta$ so that $P_\theta(x, y)$ fits the data well
Maximum likelihood estimation (MLE)
• Given training data $(x_t, y_t), 1 \le t \le n$, i.i.d. from distribution $D$
• Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
• "Fitness" of $\theta$ to one data point $(x_t, y_t)$: $\text{likelihood}(\theta; x_t, y_t) \triangleq P_\theta(x_t, y_t)$
Maximum likelihood estimation (MLE)
• Given training data $(x_t, y_t), 1 \le t \le n$, i.i.d. from distribution $D$
• Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
• "Fitness" of $\theta$ to the i.i.d. data points $\{(x_t, y_t)\}$: $\text{likelihood}(\theta; \{x_t, y_t\}) \triangleq P_\theta(\{x_t, y_t\}) = \prod_t P_\theta(x_t, y_t)$
Maximum likelihood estimation (MLE)
• Given training data $(x_t, y_t), 1 \le t \le n$, i.i.d. from distribution $D$
• Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
• MLE: maximize the "fitness" of $\theta$ to the i.i.d. data points $\{(x_t, y_t)\}$: $\theta_{ML} = \arg\max_{\theta \in \Theta} \prod_t P_\theta(x_t, y_t)$
Maximum likelihood estimation (MLE)
• Given training data $(x_t, y_t), 1 \le t \le n$, i.i.d. from distribution $D$
• Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
• MLE: maximize the "fitness" of $\theta$ to the i.i.d. data points $\{(x_t, y_t)\}$:
$\theta_{ML} = \arg\max_{\theta \in \Theta} \log \left[ \prod_t P_\theta(x_t, y_t) \right] = \arg\max_{\theta \in \Theta} \sum_t \log P_\theta(x_t, y_t)$
Maximum likelihood estimation (MLE)
• Given training data $(x_t, y_t), 1 \le t \le n$, i.i.d. from distribution $D$
• Let $\{P_\theta(x, y) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
• MLE is equivalent to minimizing the negative log-likelihood loss:
$\theta_{ML} = \arg\max_{\theta \in \Theta} \sum_t \log P_\theta(x_t, y_t)$, i.e., $l(\theta, x_t, y_t) = -\log P_\theta(x_t, y_t)$ and $\hat{L}(\theta) = -\sum_t \log P_\theta(x_t, y_t)$
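A small numerical sketch of the negative log-likelihood view of MLE (my own toy example, not from the slides): for i.i.d. Normal samples with known variance, the minimizer of the negative log-likelihood over the mean is approximately the sample average.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=3.0, scale=1.0, size=1000)

def nll(mu, x, sigma=1.0):
    # Negative log-likelihood under Normal(mu, sigma^2), constants dropped.
    return np.sum((x - mu) ** 2 / (2 * sigma ** 2))

# Grid search over candidate means; the minimizer matches the sample mean.
grid = np.linspace(0.0, 6.0, 601)
mu_ml = grid[np.argmin([nll(mu, data) for mu in grid])]
print(mu_ml, data.mean())  # both close to 3.0
```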
MLE: conditional log-likelihood
• Given training data $(x_t, y_t), 1 \le t \le n$, i.i.d. from distribution $D$
• Let $\{P_\theta(y \mid x) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$ (we only care about predicting $y$ from $x$; we do not care about $p(x)$)
• MLE: negative conditional log-likelihood loss:
$\theta_{ML} = \arg\max_{\theta \in \Theta} \sum_t \log P_\theta(y_t \mid x_t)$, i.e., $l(\theta, x_t, y_t) = -\log P_\theta(y_t \mid x_t)$ and $\hat{L}(\theta) = -\sum_t \log P_\theta(y_t \mid x_t)$
MLE: conditional log-likelihood
• Given training data $(x_t, y_t), 1 \le t \le n$, i.i.d. from distribution $D$
• Let $\{P_\theta(y \mid x) : \theta \in \Theta\}$ be a family of distributions indexed by $\theta$
• Modeling $P(y \mid x)$: discriminative; modeling $P(x, y)$: generative
• MLE: negative conditional log-likelihood loss:
$\theta_{ML} = \arg\max_{\theta \in \Theta} \sum_t \log P_\theta(y_t \mid x_t)$, i.e., $l(\theta, x_t, y_t) = -\log P_\theta(y_t \mid x_t)$ and $\hat{L}(\theta) = -\sum_t \log P_\theta(y_t \mid x_t)$
Example: $l_2$ loss
• Given training data $(x_t, y_t), 1 \le t \le n$, i.i.d. from distribution $D$
• Find $f_\theta(x)$ that minimizes $\hat{L}(f_\theta) = \frac{1}{n} \sum_{t=1}^{n} (f_\theta(x_t) - y_t)^2$
Example: $l_2$ loss (Normal + MLE)
• Given training data $(x_t, y_t), 1 \le t \le n$, i.i.d. from distribution $D$
• Find $f_\theta(x)$ that minimizes $\hat{L}(f_\theta) = \frac{1}{n} \sum_{t=1}^{n} (f_\theta(x_t) - y_t)^2$
• Define $P_\theta(y \mid x) = \text{Normal}(y; f_\theta(x), \sigma^2)$
• Then $\log P_\theta(y_t \mid x_t) = -\frac{1}{2\sigma^2} (f_\theta(x_t) - y_t)^2 - \log \sigma - \frac{1}{2} \log(2\pi)$
• So $\theta_{ML} = \arg\min_{\theta \in \Theta} \frac{1}{n} \sum_{t=1}^{n} (f_\theta(x_t) - y_t)^2$
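To make the equivalence concrete, here is a small numerical check (synthetic data, illustrative names, not from the lecture): for a linear $f_w$, the conditional negative log-likelihood under the Normal model and the $l_2$ objective differ only by a known scale and an additive constant, so they order candidate $w$'s identically and share the same minimizer.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d, sigma = 50, 2, 0.5
X = rng.normal(size=(n, d))
y = X @ np.array([2.0, -1.0]) + sigma * rng.normal(size=n)

def neg_log_lik(w):
    # Sum over t of -log Normal(y_t; w^T x_t, sigma^2)
    r = X @ w - y
    return np.sum(r ** 2 / (2 * sigma ** 2) + np.log(sigma) + 0.5 * np.log(2 * np.pi))

def l2_loss(w):
    return np.mean((X @ w - y) ** 2)

w_a, w_b = np.array([2.0, -1.0]), np.array([0.0, 0.0])
# Differences agree after rescaling by n / (2 sigma^2): same ordering of w's.
print(neg_log_lik(w_a) - neg_log_lik(w_b))
print((l2_loss(w_a) - l2_loss(w_b)) * n / (2 * sigma ** 2))
```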
Linear classification
Example 1: image classification (indoor vs. outdoor)
[Figure: example images labeled "indoor" and "outdoor"]
Example 2: Spam detection

            #"$"   #"Mr."   #"sale"   …   Spam?
Email 1       2       1        1      …    Yes
Email 2       0       1        0      …    No
Email 3       1       1        1      …    Yes
…
Email n       0       0        0      …    No
New email     0       0        1      …    ??
Why classification β’ Classification: a kind of summary β’ Easy to interpret β’ Easy for making decisions
Linear classification
• Decision boundary: the hyperplane $w^T x = 0$ with normal vector $w$
• Class 1: $w^T x > 0$
• Class 0: $w^T x < 0$
Linear classification: natural attempt
• Given training data $(x_t, y_t), 1 \le t \le n$, i.i.d. from distribution $D$
• Hypothesis $f_w(x) = w^T x$ (linear model $\mathcal{H}$)
• $y = 1$ if $w^T x > 0$; $y = 0$ if $w^T x < 0$
• Prediction: $y = \text{step}(f_w(x)) = \text{step}(w^T x)$
Linear classification: natural attempt
• Given training data $(x_t, y_t), 1 \le t \le n$, i.i.d. from distribution $D$
• Find $f_w(x) = w^T x$ to minimize the 0-1 loss $\hat{L}(f_w) = \frac{1}{n} \sum_{t=1}^{n} \mathbb{I}[\text{step}(w^T x_t) \ne y_t]$
• Drawback: difficult to optimize; NP-hard in the worst case
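A brief sketch of why the 0-1 objective is awkward to optimize (synthetic data and names of my choosing): the empirical 0-1 loss is piecewise constant in $w$, so small perturbations of $w$ usually do not change it at all and gradients carry no information.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 200, 2
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -1.0])
y = (X @ w_true > 0).astype(int)  # labels in {0, 1}

def zero_one_loss(w):
    preds = (X @ w > 0).astype(int)  # step(w^T x_t)
    return np.mean(preds != y)

# A tiny perturbation leaves the loss unchanged; flipping w misclassifies everything.
print(zero_one_loss(w_true), zero_one_loss(w_true + 1e-6), zero_one_loss(-w_true))
```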
Linear classification: simple approach
• Given training data $(x_t, y_t), 1 \le t \le n$, i.i.d. from distribution $D$
• Find $f_w(x) = w^T x$ that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{t=1}^{n} (w^T x_t - y_t)^2$
• Reduce to linear regression; ignore the fact that $y \in \{0, 1\}$
Linear classification: simple approach
• Drawback: not robust to "outliers"
(Figure borrowed from Pattern Recognition and Machine Learning, Bishop)
Compare the two
• Linear regression output: $y = w^T x$
• Step prediction: $y = \text{step}(w^T x)$
Between the two
• Prediction bounded in $[0, 1]$
• Smooth
• Sigmoid: $\sigma(a) = \frac{1}{1 + \exp(-a)}$
(Figure borrowed from Pattern Recognition and Machine Learning, Bishop)
Linear classification: sigmoid prediction
• Squash the output of the linear function: $\text{sigmoid}(w^T x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$
• Find $w$ that minimizes $\hat{L}(f_w) = \frac{1}{n} \sum_{t=1}^{n} (\sigma(w^T x_t) - y_t)^2$
Linear classification: logistic regression
• Squash the output of the linear function: $\text{sigmoid}(w^T x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$
• A better approach: interpret the output as a probability
• $P_w(y = 1 \mid x) = \sigma(w^T x) = \frac{1}{1 + \exp(-w^T x)}$
• $P_w(y = 0 \mid x) = 1 - P_w(y = 1 \mid x) = 1 - \sigma(w^T x)$
Linear classification: logistic regression
• Given training data $(x_t, y_t), 1 \le t \le n$, i.i.d. from distribution $D$
• Find $w$ that minimizes
$\hat{L}(w) = -\frac{1}{n} \sum_t \log P_w(y_t \mid x_t) = -\frac{1}{n} \sum_{t: y_t = 1} \log \sigma(w^T x_t) - \frac{1}{n} \sum_{t: y_t = 0} \log[1 - \sigma(w^T x_t)]$
• Logistic regression: MLE with sigmoid
Linear classification: logistic regression
• Given training data $(x_t, y_t), 1 \le t \le n$, i.i.d. from distribution $D$
• Find $w$ that minimizes
$\hat{L}(w) = -\frac{1}{n} \sum_{t: y_t = 1} \log \sigma(w^T x_t) - \frac{1}{n} \sum_{t: y_t = 0} \log[1 - \sigma(w^T x_t)]$
• No closed-form solution; need to use gradient descent (see the sketch below)
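Since the slides only note that gradient descent is needed, here is a minimal sketch of logistic regression trained by batch gradient descent on the objective above; the synthetic data, step size, and iteration count are illustrative choices, not from the lecture.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Synthetic data generated from a "true" w through the logistic model.
rng = np.random.default_rng(3)
n, d = 500, 2
X = rng.normal(size=(n, d))
w_true = np.array([2.0, -1.0])
y = (rng.uniform(size=n) < sigmoid(X @ w_true)).astype(float)  # labels in {0, 1}

w = np.zeros(d)
lr = 0.1
for _ in range(2000):
    p = sigmoid(X @ w)           # P_w(y = 1 | x_t) for each t
    grad = X.T @ (p - y) / n     # gradient of the average negative log-likelihood
    w -= lr * grad

print(w)  # roughly recovers w_true
```

The gradient follows from differentiating the negative log-likelihood of the sigmoid model, and since that objective is convex in $w$, plain gradient descent with a small enough step size approaches the global minimizer.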