15-780 – Graduate Artificial Intelligence: Machine Learning
J. Zico Kolter (this lecture) and Nihar Shah
Carnegie Mellon University, Spring 2020
Outline
- What is machine learning?
- Linear regression
- Linear classification
- Nonlinear methods
- Overfitting, generalization, and regularization
- Evaluating machine learning algorithms
Outline (current section: What is machine learning?)
Introduction: digit classification
The task: write a program that, given a 28x28 grayscale image of a digit, outputs the number in the image.
[Image: digits from the MNIST data set (http://yann.lecun.com/exdb/mnist/)]
Approaches
Approach 1: try to write a program by hand that uses your a priori knowledge about what images look like to determine what number they are.
Approach 2 (the machine learning approach): collect a large volume of images and their corresponding numbers, and let the computer "write its own program" to map from these images to their corresponding numbers.
(More precisely, this is a subset of machine learning called supervised learning.)
Supervised learning pipeline
Training data: pairs $(x^{(i)}, y^{(i)})$, where $x^{(i)} \in \mathcal{X}$ (e.g., digit images) and $y^{(i)} \in \mathcal{Y}$ (e.g., the labels 2, 0, 5, 8)
Machine learning algorithm: takes the training data and produces a hypothesis function $h : \mathcal{X} \to \mathcal{Y}$ such that $y^{(i)} \approx h(x^{(i)})$ for all $i$
(On new data $x' \in \mathcal{X}$, make the prediction $y' = h(x')$)
Outline (current section: Linear regression)
A simple example: predicting electricity use
What will peak power consumption be in Pittsburgh tomorrow?
Difficult to build an "a priori" model from first principles to answer this question.
But, relatively easy to record past days of consumption, plus additional features that affect consumption (e.g., weather):

Date        High Temperature (F)   Peak Demand (GW)
2011-06-01  84.0                   2.651
2011-06-02  73.0                   2.081
2011-06-03  75.2                   1.844
2011-06-04  84.9                   1.959
...         ...                    ...
Plot of consumption vs. temperature
[Figure: high temperature vs. peak demand for summer months (June–August) for the past six years]
Hypothesis: linear model
Let's suppose that the peak demand approximately fits a linear model:
$$\text{Peak\_Demand} \approx \theta_1 \cdot \text{High\_Temperature} + \theta_2$$
Here $\theta_1$ is the "slope" of the line, and $\theta_2$ is the intercept.
Now, given a forecast of tomorrow's weather (ignoring for a moment that this is also a prediction), we can predict how high the peak demand will be.
Predictions
Predicting in this manner is equivalent to "drawing a line through the data".
[Figure: observed days and the linear prediction; Peak Demand (GW) vs. High Temperature (F)]
Machine learning notation
Input features: $x^{(i)} \in \mathbb{R}^n$, $i = 1, \ldots, m$
  E.g.: $x^{(i)} = [\text{High\_Temperature}^{(i)},\; 1]^T$
Outputs: $y^{(i)} \in \mathbb{R}$ (regression task)
  E.g.: $y^{(i)} = \text{Peak\_Demand}^{(i)}$
Model parameters: $\theta \in \mathbb{R}^k$ (for linear models, $k = n$)
Hypothesis function: $h_\theta : \mathbb{R}^n \to \mathbb{R}$, predicts output given input
  E.g.: $h_\theta(x) = \theta^T x = \sum_{j=1}^{n} \theta_j \cdot x_j$
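As a concrete illustration of this notation, here is a minimal NumPy sketch (not from the lecture; the temperature readings and parameter values are made up) that builds feature vectors, including the constant 1 feature that folds the intercept into $\theta$, and evaluates the linear hypothesis $h_\theta(x) = \theta^T x$:

```python
import numpy as np

# Hypothetical high-temperature readings (F); the constant 1 feature folds the
# intercept into theta, matching x = [High_Temperature, 1]
high_temp = np.array([84.0, 73.0, 75.2, 84.9])
X = np.column_stack([high_temp, np.ones_like(high_temp)])   # shape (m, n) with n = 2

# Hypothetical parameters theta = (slope, intercept) -- illustrative values only
theta = np.array([0.05, -1.5])

# Hypothesis h_theta(x) = theta^T x, evaluated for every training example at once
predictions = X @ theta
print(predictions)   # one predicted peak demand (GW) per training example
```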
Loss functions
How do we measure how "good" a hypothesis function is, i.e., how close our approximation is on the training data, $y^{(i)} \approx h_\theta(x^{(i)})$?
Typically done by introducing a loss function $\ell : \mathbb{R} \times \mathbb{R} \to \mathbb{R}_+$, where $\ell(h_\theta(x), y)$ denotes how far apart the prediction is from the actual output.
E.g., for regression a common loss function is the squared error: $\ell(h_\theta(x), y) = (h_\theta(x) - y)^2$
The canonical machine learning problem
With this notation, we define the canonical machine learning problem: given a set of input features and outputs $(x^{(i)}, y^{(i)})$, $i = 1, \ldots, m$, find the parameters that minimize the sum of losses:
$$\mathop{\mathrm{minimize}}_{\theta} \;\; \sum_{i=1}^{m} \ell\big(h_\theta(x^{(i)}),\, y^{(i)}\big)$$
Virtually all machine learning algorithms have this form; we just need to specify:
1. What is the hypothesis function?
2. What is the loss function?
3. How do we solve the optimization problem?
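The three choices above can be made concrete in code. A minimal sketch (my own illustration, not the course's reference code) that plugs a linear hypothesis and the squared loss into the sum-of-losses objective:

```python
import numpy as np

def hypothesis(theta, x):
    """Choice 1 -- a linear hypothesis h_theta(x) = theta^T x."""
    return theta @ x

def loss(prediction, y):
    """Choice 2 -- the squared-error loss."""
    return (prediction - y) ** 2

def objective(theta, X, y):
    """The canonical objective: sum of losses over all training examples."""
    return sum(loss(hypothesis(theta, x_i), y_i) for x_i, y_i in zip(X, y))
```

Choice 3, how to minimize this objective over $\theta$, is the subject of the next slides.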
Least squares
Let's formulate our linear least squares problem in this notation.
Hypothesis function: $h_\theta(x) = \theta^T x$
Squared loss function: $\ell(h_\theta(x), y) = (h_\theta(x) - y)^2$
Leads to the machine learning optimization problem
$$\mathop{\mathrm{minimize}}_{\theta} \;\; \sum_{i=1}^{m} \ell\big(h_\theta(x^{(i)}),\, y^{(i)}\big) \;\equiv\; \mathop{\mathrm{minimize}}_{\theta} \;\; \sum_{i=1}^{m} \big(\theta^T x^{(i)} - y^{(i)}\big)^2$$
A convex optimization problem in $\theta$, so we expect global solutions. But how do we solve this optimization problem?
Solution via gradient descent
Recall the gradient descent algorithm (written now to optimize over $\theta$):
Repeat: $\theta \leftarrow \theta - \alpha \nabla_\theta f(\theta)$
What is the gradient of our objective function?
$$\nabla_\theta \sum_{i=1}^{m} \big(\theta^T x^{(i)} - y^{(i)}\big)^2 = \sum_{i=1}^{m} \nabla_\theta \big(\theta^T x^{(i)} - y^{(i)}\big)^2 = 2 \sum_{i=1}^{m} x^{(i)} \big(\theta^T x^{(i)} - y^{(i)}\big)$$
(using the chain rule and the fact that $\nabla_\theta \theta^T x^{(i)} = x^{(i)}$), which gives the update (absorbing the constant factor 2 into the step size):
Repeat: $\theta \leftarrow \theta - \alpha \sum_{i=1}^{m} x^{(i)} \big(\theta^T x^{(i)} - y^{(i)}\big)$
Linear algebra notation
Summation notation gets cumbersome, so it is convenient to introduce a more compact notation:
$$X = \begin{bmatrix} x^{(1)T} \\ x^{(2)T} \\ \vdots \\ x^{(m)T} \end{bmatrix} \in \mathbb{R}^{m \times n}, \qquad y = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(m)} \end{bmatrix} \in \mathbb{R}^{m}$$
The least squares objective can now be written
$$\sum_{i=1}^{m} \big(\theta^T x^{(i)} - y^{(i)}\big)^2 = \|X\theta - y\|_2^2$$
and the gradient is given by
$$\nabla_\theta \|X\theta - y\|_2^2 = 2 X^T (X\theta - y)$$
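Putting the vectorized gradient and the update rule together, a gradient descent solver might look like the following. This is a sketch, not the course's reference code; the step size and iteration count are arbitrary placeholders and would need tuning for real data:

```python
import numpy as np

def least_squares_gd(X, y, alpha=1e-4, iters=1000):
    """Gradient descent on the objective ||X theta - y||_2^2.

    X is the (m, n) matrix of input features and y the length-m output vector.
    """
    theta = np.zeros(X.shape[1])
    for _ in range(iters):
        grad = 2 * X.T @ (X @ theta - y)   # gradient of the squared-error objective
        theta = theta - alpha * grad       # gradient descent update
    return theta
```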
An alternative solution method
In order for $\theta^\star$ to minimize an (unconstrained, differentiable, convex) function $f$, it is necessary and sufficient that $\nabla_\theta f(\theta^\star) = 0$.
Previously, we attained this point iteratively via gradient descent, but for the squared error loss we can also find it analytically:
$$\nabla_\theta \|X\theta^\star - y\|_2^2 = 0 \;\Longrightarrow\; 2X^T(X\theta^\star - y) = 0 \;\Longrightarrow\; X^T X \theta^\star = X^T y \;\Longrightarrow\; \theta^\star = (X^T X)^{-1} X^T y$$
These are called the normal equations, a closed-form solution for minimizing the sum of squared losses.
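The same solution can be computed directly from the normal equations. A minimal sketch in NumPy (solving the linear system rather than forming the explicit inverse, which is the usual numerically preferable route):

```python
import numpy as np

def least_squares_normal_eqs(X, y):
    """Solve the normal equations X^T X theta = X^T y for theta.

    Assumes X^T X is invertible, i.e., X has full column rank.
    """
    return np.linalg.solve(X.T @ X, X.T @ y)
```

On the same data, this should agree (up to numerical tolerance and the step-size/iteration choices) with the gradient descent sketch above.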
Least squares solution
Solving the normal equations (or running gradient descent) gives coefficients $\theta_1$ and $\theta_2$ corresponding to the following fit:
[Figure: least squares fit of peak demand vs. high temperature]
Poll: least squares when $m < n$
What happens when you run a least-squares solver, built using the simple normal equations in Python, when $m < n$?
1. Python will return an error, because the true minimum least-squares cost is infinite
2. Python will return an error, even though the true minimum least-squares cost is zero
3. Python will correctly compute the optimal solution, with strictly positive cost
4. Python will correctly compute the optimal solution, with zero cost
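As a rough numerical illustration of this setting (not part of the lecture; the data here are random), one can compare the naive normal-equations route with NumPy's built-in least-squares solver when $m < n$. Whether the naive route raises an error or merely returns a numerically meaningless answer can depend on floating-point details:

```python
import numpy as np

np.random.seed(0)
m, n = 3, 5                      # fewer examples than features
X = np.random.randn(m, n)
y = np.random.randn(m)

# X^T X is n x n but has rank at most m < n, so it is singular
print(np.linalg.matrix_rank(X.T @ X))          # prints 3

try:
    theta = np.linalg.solve(X.T @ X, X.T @ y)  # naive normal equations
except np.linalg.LinAlgError as err:
    print("normal equations failed:", err)

# An exact fit still exists: lstsq returns a minimum-norm solution
theta, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.linalg.norm(X @ theta - y))           # essentially zero
```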
Alternative loss functions
Why did we pick the squared loss $\ell(h_\theta(x), y) = (h_\theta(x) - y)^2$? Why not use an alternative like the absolute loss $\ell(h_\theta(x), y) = |h_\theta(x) - y|$?
We could write this optimization problem as
$$\mathop{\mathrm{minimize}}_{\theta} \;\; \sum_{i=1}^{m} \ell\big(h_\theta(x^{(i)}),\, y^{(i)}\big) \;\equiv\; \mathop{\mathrm{minimize}}_{\theta} \;\; \|X\theta - y\|_1$$
where $\|z\|_1 = \sum_i |z_i|$ is called the $\ell_1$ norm.
No closed-form solution, but a (sub)gradient is given by
$$\nabla_\theta \|X\theta - y\|_1 = X^T \mathrm{sign}(X\theta - y)$$
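Since there is no closed-form solution, one option is subgradient descent using the expression above. A rough sketch (again my own illustration; the step-size schedule and iteration count are arbitrary, and in practice this $\ell_1$ problem is often solved by linear programming instead):

```python
import numpy as np

def absolute_loss_subgrad(X, y, alpha=1e-3, iters=5000):
    """Subgradient descent on ||X theta - y||_1 with a diminishing step size."""
    theta = np.zeros(X.shape[1])
    for t in range(iters):
        g = X.T @ np.sign(X @ theta - y)             # a subgradient of the objective
        theta = theta - (alpha / np.sqrt(t + 1)) * g
    return theta
```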
Poll: alternative loss solutions
[Figure: two fitted lines, one minimizing squared error and one minimizing absolute error]
Poll: which solution is which?
1. Green is squared loss, red is absolute
2. Red is squared loss, green is absolute
3. Those lines look identical to me
Outline (current section: Linear classification)
Classification tasks
Regression tasks: predicting a real-valued quantity $y \in \mathbb{R}$
Classification tasks: predicting a discrete-valued quantity $y$
Binary classification: $y \in \{-1, +1\}$
Multiclass classification: $y \in \{1, 2, \ldots, k\}$
Example: breast cancer classification
Well-known classification example: using machine learning to diagnose whether a breast tumor is benign or malignant [Street et al., 1992]
Setting: a doctor extracts a sample of fluid from the tumor and stains the cells, then outlines several of the cells (image processing refines the outline)
The system computes features for each cell such as area, perimeter, concavity, texture (10 total), then computes the mean/std/max of each feature across the cells
Example: breast cancer classification
[Figure: plot of two features, mean area vs. mean concave points, for the two classes]
Linear classification example
Linear classification ≡ "drawing a line separating the classes"
Formal setting
Input features: $x^{(i)} \in \mathbb{R}^n$, $i = 1, \ldots, m$
  E.g.: $x^{(i)} = [\text{Mean\_Area}^{(i)},\; \text{Mean\_Concave\_Points}^{(i)},\; 1]^T$
Outputs: $y^{(i)} \in \{-1, +1\}$, $i = 1, \ldots, m$
  E.g.: $y^{(i)} \in \{-1\ (\text{benign}),\; +1\ (\text{malignant})\}$
Model parameters: $\theta \in \mathbb{R}^n$
Hypothesis function: $h_\theta : \mathbb{R}^n \to \mathbb{R}$, aims for the same sign as the output (informally, a measure of confidence in our prediction)
  E.g.: $h_\theta(x) = \theta^T x$, $\hat{y} = \mathrm{sign}(h_\theta(x))$
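Under this setting, predicting a class for a new input just means taking the sign of the linear hypothesis. A small sketch (the feature values and parameters are invented purely for illustration, not learned from the actual data set):

```python
import numpy as np

# Hypothetical parameters for features [mean_area, mean_concave_points, 1]
theta = np.array([0.002, 25.0, -3.0])

def predict(theta, x):
    """Return the predicted class sign(h_theta(x)) and the raw score h_theta(x)."""
    score = theta @ x
    return (1 if score >= 0 else -1), score

x_new = np.array([900.0, 0.08, 1.0])   # made-up cell features plus constant 1
label, score = predict(theta, x_new)
print(label, score)                     # +1 (malignant), score ~ 0.8
```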