Class #05: Linear Classification
Machine Learning (COMP 135): M. Allen, 3 Feb. 2020

Review: The General Learning Problem
} We want to learn functions from inputs to outputs, where each input has $n$ features:
  } Inputs: $\langle x_1, x_2, \ldots, x_n \rangle$, with each feature $x_i$ from domain $X_i$
  } Outputs: $y$ from domain $Y$
  } Function to learn: $f : X_1 \times X_2 \times \cdots \times X_n \to Y$
} The type of learning problem we are solving really depends upon the type of the output domain, $Y$:
  1. If the output $y \in \mathbb{R}$ (a real number), this is regression
  2. If the output comes from a finite discrete set, this is classification

Decisions to Make
} When collecting our training example pairs $(x, f(x))$, we still have some decisions to make
} Example: Medical Informatics
  } We have some genetic information about patients
  } Some get sick with a disease and some don't
  } Patients live for a number of years (sick or not)
} Question: what do we want to learn from this data?
} Depending upon what we decide, we may use:
  } Different models of the data
  } Different machine learning approaches
  } Different measurements of successful learning

One Approach: Regression
} We decide that we want to try to learn to predict how long patients will live
} We base this upon information about the degree to which they express a specific gene
} A regression problem: the function we learn is the "best (linear) fit" to the data we have
[Figure: gene expression vs. years lived, with a fitted line. Image source: https://aldro61.github.io/microbiome-summer-school-2017/sections/basics/]
Another Approach: Classification
} We decide instead that we simply want to predict whether a patient will get the disease or not
} We base this upon information about the expression of two genes
} A classification problem: the learned function separates individuals into 2 groups (binary classes)
[Figure: two gene-expression values per patient, with a separating line. Image source: https://aldro61.github.io/microbiome-summer-school-2017/sections/basics/]

Which is the Correct Approach?
} The approach we use depends upon what we want to achieve, and what works best based upon the data we have
} Much machine learning involves investigating different approaches

From Regression to Classification
} Suppose we have two classes of data, defined by a single attribute $x$
} We seek a decision boundary that splits the data in two
} When such a boundary can be defined using a linear function, it is called a linear separator

Threshold Functions
1. We have data-points with $n$ features: $x = (x_1, x_2, \ldots, x_n)$
2. We have a linear function defined by $n+1$ weights: $w = (w_0, w_1, w_2, \ldots, w_n)$
3. We can write this linear function as: $w \cdot x$
4. We can then find the linear boundary, where: $w \cdot x = 0$
5. And use it to define our threshold between classes:
       $h_w(x) = \begin{cases} 1 & w \cdot x \ge 0 \\ 0 & w \cdot x < 0 \end{cases}$
   (Outputs 1 and 0 here are arbitrary labels for one of two possible classes.)
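A minimal sketch of this threshold function in Python (not from the slides; the function name and the use of NumPy are my choices). It prepends the "dummy" feature $x_0 = 1$ so that $w_0$ acts as the bias weight; the example weights are the ones used in the geometry example a few slides later ($w_0 = -1.0$, $w_1 = 0.5$, $w_2 = 1.0$).

```python
import numpy as np

def h(w, x):
    """Threshold classifier: returns 1 if w . x >= 0, else 0.

    w : weight vector (w0, w1, ..., wn), where w0 is the bias weight
    x : feature vector (x1, ..., xn); a dummy feature x0 = 1 is prepended
    """
    x = np.concatenate(([1.0], x))        # dummy attribute multiplied by w0
    return 1 if np.dot(w, x) >= 0 else 0

# Example: boundary is -1.0 + 0.5*x1 + 1.0*x2 = 0
w = np.array([-1.0, 0.5, 1.0])
print(h(w, np.array([2.0, 2.0])))   # w.x =  2.0 -> class 1
print(h(w, np.array([0.0, 0.0])))   # w.x = -1.0 -> class 0
```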
From Regression to Classification
} Data is linearly separable if it can be divided into classes using a linear boundary: $w \cdot x = 0$
} Such a boundary, in 1-dimensional space, is a threshold value
} Such a boundary, in 2-dimensional space, is a line
} Such a boundary, in 3-dimensional space, is a plane
} In higher dimensions, it is a hyper-plane

The Geometry of Linear Boundaries
} Suppose we have 2-dimensional inputs $x = (x_1, x_2)$
} The "real" weights $\mathbf{w} = (w_1, w_2)$ define a vector
} The boundary where our linear function is zero,
      $w \cdot x = w_0 + \mathbf{w} \cdot (x_1, x_2) = 0$
  is a line orthogonal to $\mathbf{w}$, parallel to $\mathbf{w} \cdot (x_1, x_2) = 0$
} Its offset from the origin is determined by $w_0$ (which is called the bias weight); the boundary crosses the axes at $x_1 = -w_0/w_1$ and $x_2 = -w_0/w_2$
[Figure: weight vector and boundary line, with axis crossings at $-w_0/w_1$ and $-w_0/w_2$. Image: R. Urtasun (U. of Toronto)]
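A small sketch of this geometry (my own helper, not from the slides), using the same weights as the worked example on the next slide: it computes where the boundary $w_0 + w_1 x_1 + w_2 x_2 = 0$ crosses the axes and confirms that the boundary's direction is orthogonal to the "real" weight vector $(w_1, w_2)$.

```python
import numpy as np

def boundary_intercepts(w0, w1, w2):
    """Axis crossings of the line w0 + w1*x1 + w2*x2 = 0 (assumes w1, w2 != 0)."""
    return (-w0 / w1, 0.0), (0.0, -w0 / w2)   # x1-axis crossing, x2-axis crossing

w0, w1, w2 = -1.0, 0.5, 1.0
p1, p2 = boundary_intercepts(w0, w1, w2)
print(p1, p2)                                  # (2.0, 0.0) (0.0, 1.0)

# The direction along the boundary is p2 - p1; it is orthogonal to (w1, w2):
direction = np.subtract(p2, p1)
print(np.dot(direction, [w1, w2]))             # 0.0
```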
The Geometry of Linear Boundaries
} For example, with "real" weights $\mathbf{w} = (w_1, w_2) = (0.5, 1.0)$ we get the vector shown as a green arrow
} Then, for a bias weight $w_0 = -1.0$, the boundary where our linear function is zero,
      $w \cdot x = w_0 + \mathbf{w} \cdot (x_1, x_2) = 0$
  is the line shown in red, crossing the axes at $(2, 0)$ and $(0, 1)$
[Figure: weight vector (green) and boundary line (red), axes from 0.0 to 2.0]

The Geometry of Linear Boundaries
} Once we have our linear boundary, data points are classified according to our threshold function:
      $h_w(x) = \begin{cases} 1 & w \cdot x \ge 0 \\ 0 & w \cdot x < 0 \end{cases}$
} Points in the region where $w \cdot x \ge 0$ get label $h_w = 1$; points in the region where $w \cdot x < 0$ get label $h_w = 0$

Zero-One Loss
} For a training set made up of input/output pairs
      $\{ (x_1, y_1), (x_2, y_2), \ldots, (x_k, y_k) \}$
  we could define the zero/one loss:
      $L(h_w(x_i), y_i) = \begin{cases} 0 & \text{if } h_w(x_i) = y_i \\ 1 & \text{if } h_w(x_i) \ne y_i \end{cases}$
} Summed over the entire set, this is simply the count of examples that we get wrong
} In the example plotted, if data-points marked with one symbol should be in class 0 (below the line) and those marked with the other should be in class 1 (above the line), the loss would be equal to 3
[Figure: 2-D data with a candidate boundary; three points fall on the wrong side]

Minimizing Zero/One Loss
} Sadly, it is not easy to compute weights that minimize zero/one loss
} As a function of the weights it is piece-wise constant: it is not continuous, and gradient descent won't work
} E.g., for one-dimensional data like that plotted below, the loss jumps between constant values (0, 1, 2, ...) as the threshold moves
[Figure: one-dimensional data on an $x_1$ axis from 0 to 7, with the corresponding zero/one loss plotted beneath it]
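A sketch of zero/one loss as a count of misclassified points, using the boundary from the example above. The toy data set here is made up for illustration; it is not the data plotted on the slide.

```python
import numpy as np

def h(w, x):
    """Threshold classifier with dummy feature x0 = 1."""
    return 1 if np.dot(w, np.concatenate(([1.0], x))) >= 0 else 0

def zero_one_loss(w, data):
    """Total zero/one loss: the number of training pairs (x, y) that h_w gets wrong."""
    return sum(1 for x, y in data if h(w, x) != y)

# Toy 2-D training set: label 1 should be "above" the boundary, label 0 "below"
data = [(np.array([2.0, 2.0]), 1),
        (np.array([1.5, 1.0]), 1),
        (np.array([0.2, 0.1]), 0),
        (np.array([1.0, 0.2]), 0),
        (np.array([3.0, 0.1]), 0)]   # this point lies above the boundary -> one error

w = np.array([-1.0, 0.5, 1.0])
print(zero_one_loss(w, data))        # 1
```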
Perceptron Loss
} Instead, we define the perceptron loss on a training item
      $x_i = (x_{i,1}, x_{i,2}, \ldots, x_{i,n})$
  with weights $w = (w_0, w_1, w_2, \ldots, w_n)$ as:
      $L_\pi(h_w(x_i), y_i) = (y_i - h_w(x_i)) \times \sum_{j=0}^{n} x_{i,j}$
  (the first factor is the difference between what the output should be and what our weights make it; the sum is over the input attributes, where $x_{i,0} = 1$ is the "dummy" attribute that is multiplied by the bias weight $w_0$)
} For example, suppose we have a 2-dimensional element in our training set for which the correct output is 1, but our threshold function says 0:
      $x_i = (0.5, 0.4)$,  $y_i = 1$,  $h_w(x_i) = 0$
      $L_\pi(h_w(x_i), y_i) = (1 - 0)(1 + 0.5 + 0.4) = 1.9$

Perceptron Learning
} To minimize perceptron loss we can start from initial weights—perhaps chosen uniformly from the interval [-1, 1]—and then:
  1. Choose an input $x_i$ from our data set that is wrongly classified.
  2. Update the vector of weights, $w$, as follows:
         $w_j \leftarrow w_j + \alpha \, (y_i - h_w(x_i)) \times x_{i,j}$
  3. Repeat until no classification errors remain.
} The update equation means that:
  1. If the correct output should be below the boundary ($y_i = 0$) but our threshold has placed it above ($h_w(x_i) = 1$), then we subtract each feature ($x_{i,j}$) from the corresponding weight ($w_j$)
  2. If the correct output should be above the boundary ($y_i = 1$) but our threshold has placed it below ($h_w(x_i) = 0$), then we add each feature ($x_{i,j}$) to the corresponding weight ($w_j$)

Perceptron Updates
} Again, supposing we have an error as before, with weights as given below:
      $x_i = (0.5, 0.4)$,  $w = (0.2, -2.5, 0.6)$,  $y_i = 1$
      $w \cdot x_i = 0.2 + (-2.5 \times 0.5) + (0.6 \times 0.4) = -0.81$,  so $h_w(x_i) = 0$
} This means we add the value of each attribute to its matching weight (assuming again that the "dummy" $x_{i,0} = 1$, and that the learning rate $\alpha = 1$):
      $w_0 \leftarrow (w_0 + x_{i,0}) = (0.2 + 1) = 1.2$
      $w_1 \leftarrow (w_1 + x_{i,1}) = (-2.5 + 0.5) = -2.0$
      $w_2 \leftarrow (w_2 + x_{i,2}) = (0.6 + 0.4) = 1.0$
} After adjusting the weights, our function is now correct on this input:
      $w \cdot x_i = 1.2 + (-2.0 \times 0.5) + (1.0 \times 0.4) = 0.6$,  so $h_w(x_i) = 1$

Progress of Perceptron Learning
} For an example like this, we:
  1. Choose a mis-classified item (marked in green in the plot)
  2. Compute the weight updates,
         $w_j \leftarrow w_j + \alpha \, (y_i - h_w(x_i)) \times x_{i,j}$
     based on the "distance" away from the boundary (so weights shift more when the error in boundary placement is more extreme)
} The perceptron update rule shifts the weight vector positively or negatively, trying to get all data on the right side of the linear decision boundary
} Here, this adds to each weight, changing the decision boundary
} In this example, data-points marked with one symbol should be in class 0 (below the line) and those marked with the other should be in class 1 (above the line)
[Figure: 2-D data and the decision boundary before and after the update]
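A sketch of the perceptron update and training loop described above. The single update at the bottom reproduces the worked example from these slides ($x_i = (0.5, 0.4)$, $y_i = 1$, $w = (0.2, -2.5, 0.6)$, $\alpha = 1$); the function names, the error-picking strategy, and the max_passes cap are my own choices, not part of the slides.

```python
import numpy as np

def h(w, x):
    """Threshold classifier; x already includes the dummy feature x0 = 1."""
    return 1 if np.dot(w, x) >= 0 else 0

def perceptron_update(w, x, y, alpha=1.0):
    """w_j <- w_j + alpha * (y - h_w(x)) * x_j, applied to every weight at once."""
    return w + alpha * (y - h(w, x)) * x

def train_perceptron(data, n_features, alpha=1.0, max_passes=100, rng=None):
    """Repeat updates on misclassified items until no classification errors remain.

    Each x in data must already include the dummy feature x0 = 1.
    Only guaranteed to terminate if the data are linearly separable.
    """
    rng = rng or np.random.default_rng(0)
    w = rng.uniform(-1.0, 1.0, size=n_features + 1)    # initial weights from [-1, 1]
    for _ in range(max_passes):
        errors = [(x, y) for x, y in data if h(w, x) != y]
        if not errors:
            break
        x, y = errors[rng.integers(len(errors))]        # pick a wrongly classified input
        w = perceptron_update(w, x, y, alpha)
    return w

# Reproduce the single update from the slides:
x = np.array([1.0, 0.5, 0.4])          # dummy feature 1, then (x_i1, x_i2) = (0.5, 0.4)
y = 1
w = np.array([0.2, -2.5, 0.6])
print(np.dot(w, x), h(w, x))           # -0.81, 0  (misclassified)
w = perceptron_update(w, x, y)
print(w)                               # [ 1.2 -2.   1. ]
print(np.dot(w, x), h(w, x))           # 0.6, 1   (now correct on this input)
```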