Machine Learning Basics Lecture 1: Linear Regression Princeton University COS 495 Instructor: Yingyu Liang
Machine learning basics
What is machine learning? • "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E." (Machine Learning, Tom Mitchell, 1997)
Example 1: image classification Task: determine if the image is indoor or outdoor Performance measure: probability of misclassification
Example 1: image classification
Experience/Data: images annotated with indoor/outdoor labels
[Figure: example photos labeled indoor and outdoor]
Example 1: image classification
• A few terms
• Training data: the images given for learning
• Test data: the images to be classified
• Binary classification: classifying into two classes
Example 1: image classification (multi-class)
ImageNet figure borrowed from vision.stanford.edu
Example 2: clustering images Task: partition the images into 2 groups Performance: similarities within groups Data: a set of images
Example 2: clustering images
• A few terms
• Unlabeled data vs. labeled data
• Supervised learning vs. unsupervised learning
Math formulation
• Extract features: feature vector x_i = color histogram of the image (red, green, blue channels)
• Label: y_i = 0 (indoor)
Math formulation
• Extract features: feature vector x_j = color histogram of the image (red, green, blue channels)
• Label: y_j = 1 (outdoor)
Math formulation
• Given training data (x_i, y_i), 1 ≤ i ≤ n
• Find y = f(x) using the training data
• s.t. f is correct on test data
What kind of functions?
Math formulation
• Given training data (x_i, y_i), 1 ≤ i ≤ n
• Find y = f(x) ∈ 𝓗 using the training data
• s.t. f is correct on test data
𝓗 is the hypothesis class
Math formulation
• Given training data (x_i, y_i), 1 ≤ i ≤ n
• Find y = f(x) ∈ 𝓗 using the training data
• s.t. f is correct on test data
Connection between training data and test data?
Math formulation
• Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
• Find y = f(x) ∈ 𝓗 using the training data
• s.t. f is correct on test data, also i.i.d. from distribution D
Training and test data have the same distribution
i.i.d.: independent and identically distributed
Math formulation
• Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
• Find y = f(x) ∈ 𝓗 using the training data
• s.t. f is correct on test data, also i.i.d. from distribution D
What kind of performance measure?
Math formulation
• Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
• Find y = f(x) ∈ 𝓗 using the training data
• s.t. the expected loss is small: L(f) = 𝔼_{(x,y)~D}[l(f, x, y)]
Various loss functions are possible
Math formulation
• Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
• Find y = f(x) ∈ 𝓗 using the training data
• s.t. the expected loss is small: L(f) = 𝔼_{(x,y)~D}[l(f, x, y)]
• Examples of loss functions:
• 0-1 loss: l(f, x, y) = 𝕀[f(x) ≠ y], so L(f) = Pr[f(x) ≠ y]
• ℓ2 loss: l(f, x, y) = (f(x) − y)², so L(f) = 𝔼(f(x) − y)²
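To make the two losses concrete, here is a minimal sketch (the function names and toy values are my own, not from the slides):

```python
def zero_one_loss(f_x, y):
    # 0-1 loss: 1 if the prediction disagrees with the label, else 0.
    return float(f_x != y)

def l2_loss(f_x, y):
    # l2 (squared) loss: squared difference between prediction and label.
    return (f_x - y) ** 2

# Example: predicted label 1 ("outdoor"), true label 0 ("indoor").
print(zero_one_loss(1, 0))  # 1.0
print(l2_loss(1, 0))        # 1
```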
Math formulation
• Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
• Find y = f(x) ∈ 𝓗 using the training data
• s.t. the expected loss is small: L(f) = 𝔼_{(x,y)~D}[l(f, x, y)]
How to use the training data?
Math formulation
• Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
• Find y = f(x) ∈ 𝓗 that minimizes the empirical loss L̂(f) = (1/n) Σ_{i=1}^n l(f, x_i, y_i)
• s.t. the expected loss L(f) = 𝔼_{(x,y)~D}[l(f, x, y)] is small
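As a sketch of how the empirical loss guides learning (an illustration with invented toy data): compute L̂ for two candidate hypotheses and keep the one with the smaller value.

```python
import numpy as np

def empirical_loss(f, xs, ys, loss):
    # L_hat(f) = (1/n) * sum_i loss(f(x_i), y_i)
    return np.mean([loss(f(x), y) for x, y in zip(xs, ys)])

xs = np.array([0.0, 1.0, 2.0, 3.0])
ys = np.array([0.1, 0.9, 2.1, 2.9])

f1 = lambda x: x          # candidate hypothesis 1
f2 = lambda x: 0.5 * x    # candidate hypothesis 2
sq = lambda p, y: (p - y) ** 2

# Empirical risk minimization keeps the candidate with the
# smaller empirical loss (f1 here).
print(empirical_loss(f1, xs, ys, sq), empirical_loss(f2, xs, ys, sq))
```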
Machine learning 1-2-3
• Collect data and extract features
• Build model: choose hypothesis class 𝓗 and loss function l
• Optimization: minimize the empirical loss
Wait…
• Why handcraft the feature vector x?
• We can use prior knowledge to design suitable features
• Can the computer learn features from the raw images?
• Learning features directly from the raw images: Representation Learning
• Deep Learning ⊆ Representation Learning ⊆ Machine Learning ⊆ Artificial Intelligence
Wait…
• Does MachineLearning-1-2-3 cover all approaches?
• It covers many, but not all
• Our current focus will be MachineLearning-1-2-3
Example: Stock Market Prediction
(Disclaimer: synthetic data / in another parallel universe)
[Figure: stock prices of Orange, MacroHard, and Ackermann, 2013-2016]
A sliding window over time serves as the input x; note that such samples are non-i.i.d.
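A minimal sketch of the sliding-window construction, assuming a univariate price series; the window length and names are illustrative choices:

```python
import numpy as np

def sliding_windows(series, window):
    # Each row of X holds `window` consecutive prices; y is the next price.
    X = np.array([series[i:i + window] for i in range(len(series) - window)])
    y = np.array(series[window:])
    return X, y

prices = [10.0, 10.5, 10.2, 10.8, 11.0, 10.9, 11.3]
X, y = sliding_windows(prices, window=3)
print(X.shape, y.shape)  # (4, 3) (4,)
# Note: consecutive windows overlap, so the samples are not i.i.d.
```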
Linear regression
Real data: prostate cancer study by Stamey et al. (1989)
• y: level of prostate-specific antigen
• (x_1, …, x_8): clinical measures
Figure borrowed from The Elements of Statistical Learning
Linear regression
• Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
• Find f_w(x) = w^T x (the hypothesis class 𝓗) that minimizes L̂(f_w) = (1/n) Σ_{i=1}^n (w^T x_i − y_i)²
• This is the ℓ2 loss, also called the mean squared error
Linear regression: optimization
• Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
• Find f_w(x) = w^T x that minimizes L̂(f_w) = (1/n) Σ_{i=1}^n (w^T x_i − y_i)²
• Let X be the matrix whose i-th row is x_i^T, and let y = (y_1, …, y_n)^T. Then
L̂(f_w) = (1/n) Σ_{i=1}^n (w^T x_i − y_i)² = (1/n) ‖Xw − y‖₂²
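A quick numerical check (my own sketch, with random data) that the averaged per-sample loss equals the matrix form (1/n)‖Xw − y‖₂²:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 5, 3
X = rng.normal(size=(n, d))   # i-th row is x_i^T
y = rng.normal(size=n)
w = rng.normal(size=d)

# Per-sample sum vs. matrix norm: the two forms agree.
loss_sum = np.mean([(w @ X[i] - y[i]) ** 2 for i in range(n)])
loss_mat = np.linalg.norm(X @ w - y) ** 2 / n
print(np.isclose(loss_sum, loss_mat))  # True
```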
Linear regression: optimization
• Set the gradient to 0 to get the minimizer:
∇_w L̂(f_w) = ∇_w (1/n) ‖Xw − y‖₂² = 0
⟹ ∇_w [(Xw − y)^T (Xw − y)] = 0
⟹ ∇_w [w^T X^T X w − 2 w^T X^T y + y^T y] = 0
⟹ 2 X^T X w − 2 X^T y = 0
⟹ w = (X^T X)^{−1} X^T y
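To sanity-check the derivation, the closed-form gradient (2/n)(X^T X w − X^T y) can be compared against finite differences (a sketch with random data):

```python
import numpy as np

rng = np.random.default_rng(3)
n, d = 20, 4
X = rng.normal(size=(n, d))
y = rng.normal(size=n)
w = rng.normal(size=d)

# Closed-form gradient of (1/n)||Xw - y||^2: (2/n)(X^T X w - X^T y).
grad = 2.0 / n * (X.T @ X @ w - X.T @ y)

# Central finite-difference approximation, coordinate by coordinate.
eps = 1e-6
L = lambda w: np.linalg.norm(X @ w - y) ** 2 / n
num = np.array([(L(w + eps * e) - L(w - eps * e)) / (2 * eps)
                for e in np.eye(d)])
print(np.allclose(grad, num, atol=1e-5))  # True
```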
Linear regression: optimization
• Algebraic view of the minimizer
• If X were square and invertible, just solve Xw = y to get w = X^{−1} y
• But typically X is a tall matrix, so solve X^T X w = X^T y instead
• Normal equation: w = (X^T X)^{−1} X^T y
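A sketch of computing the minimizer numerically. In practice one solves the linear system X^T X w = X^T y (or uses a least-squares routine) rather than forming the explicit inverse:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 100, 8
X = rng.normal(size=(n, d))                  # tall data matrix
w_true = rng.normal(size=d)
y = X @ w_true + 0.01 * rng.normal(size=n)   # noisy linear targets

# Solve X^T X w = X^T y (preferred over inverting X^T X directly).
w_normal = np.linalg.solve(X.T @ X, X.T @ y)
# Equivalent least-squares solver, more robust if X is rank-deficient.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(w_normal, w_lstsq))  # True
```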
Linear regression with bias
• Given training data (x_i, y_i), 1 ≤ i ≤ n, i.i.d. from distribution D
• Find f_{w,b}(x) = w^T x + b (b is the bias term) to minimize the loss
• Reduce to the case without bias:
• Let w′ = [w; b] and x′ = [x; 1]
• Then f_{w,b}(x) = w^T x + b = (w′)^T x′ (see the sketch below)
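A minimal sketch of this reduction (my own illustration with invented data): append a constant-1 feature to every input, so the bias becomes the last coordinate of the augmented weight vector.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 2
X = rng.normal(size=(n, d))
y = X @ np.array([1.5, -2.0]) + 0.7          # true bias b = 0.7

X1 = np.hstack([X, np.ones((n, 1))])         # x' = [x; 1]
w1 = np.linalg.solve(X1.T @ X1, X1.T @ y)    # w' = [w; b]
print(w1)  # approximately [ 1.5, -2.0,  0.7]
```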