Linear Predictors
COMPSCI 371D — Machine Learning
Outline
1 Definitions and Properties
2 The Least-Squares Linear Regressor
3 The Logistic-Regression Classifier
4 Probabilities and the Geometry of Logistic Regression
5 The Logistic Function
6 The Cross-Entropy Loss
7 Multi-Class Linear Predictors
Definitions and Properties: Definitions
• A linear regressor fits an affine function to the data: y ≈ h(x) = b + w^T x for x ∈ R^d
• A linear, binary classifier separates the two classes with a hyperplane in R^d
• The actual data can be separated only if it is linearly separable (!)
• Multi-class classifiers separate any two classes with a hyperplane
• The resulting decision regions are convex and simply connected
Definitions and Properties: Properties of Linear Predictors
Linear predictors...
• ...have a very small H with d + 1 parameters (resist overfitting)
• ...are trained by solving a convex optimization problem (global optimum)
• ...are fast at inference time (and training is not too slow)
• ...work well if the data is close to linearly separable
The Least-Squares Linear Regressor
• Déjà vu: polynomial regression with k = 1: y ≈ h_v(x) = b + w^T x for x ∈ R^d
• Parameter vector v = (b, w) ∈ R^{d+1}, so H = R^m with m = d + 1
• “Least squares:” ℓ(y, ŷ) = (y − ŷ)²
• v̂ = arg min_{v ∈ R^m} L_T(v)
• Risk: L_T(v) = (1/N) Σ_{n=1}^{N} ℓ(y_n, h_v(x_n))
• We know how to solve this
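As a concrete illustration, here is a minimal sketch (not from the slides) of solving this least-squares problem with NumPy: appending a constant feature lets b and w be estimated together. The data arrays X and y are hypothetical.

import numpy as np

# Hypothetical data: N samples in d dimensions with real-valued targets
rng = np.random.default_rng(0)
N, d = 100, 3
X = rng.normal(size=(N, d))
y = X @ np.array([2.0, -1.0, 0.5]) + 4.0 + 0.1 * rng.normal(size=N)

# Append a column of ones so v = (b, w) is found in one shot
A = np.hstack([np.ones((N, 1)), X])          # N x (d+1) design matrix
v, *_ = np.linalg.lstsq(A, y, rcond=None)    # minimizes ||A v - y||^2
b, w = v[0], v[1:]

# Training risk L_T(v): mean squared residual
y_hat = b + X @ w
L_T = np.mean((y - y_hat) ** 2)
print(b, w, L_T)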
The Least-Squares Linear Regressor: Linear Regression Example
[Figure: two scatter plots of home price versus living area, each with a fitted regression line]
• Left: all of Ames. √(Residual Risk): $55,800
• Right: one neighborhood. √(Residual Risk): $23,600
• Left, yellow: ignore the two largest homes
The Least-Squares Linear Regressor: Binary Classification by Logistic Regression
• Y = {c_0, c_1}; multi-class case later
• The logistic-regression classifier is a classifier!
• A linear classifier implemented through regression
• The logistic is a particular function
The Logistic-Regression Classifier: Score-Based Classifiers
• Y = {c_0, c_1}; think of c_0, c_1 as numbers: Y = {0, 1}
• We saw the idea of level sets: regress a score function s such that s is large where y = 1 and small where y = 0
• Threshold s to obtain a classifier: h(x) = c_0 if s(x) ≤ threshold, c_1 otherwise
• A linear classifier implemented through regression
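A minimal sketch of the thresholding step, assuming a score function is already available; the linear score and the 0.5 threshold below are placeholders, not part of the slides.

import numpy as np

def score(x, b, w):
    # Placeholder score s(x) = b + w^T x (any regressed score would do)
    return b + w @ x

def classify(x, b, w, threshold=0.5, c0=0, c1=1):
    # h(x) = c0 if s(x) <= threshold, c1 otherwise
    return c0 if score(x, b, w) <= threshold else c1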
The Logistic-Regression Classifier: Idea 1
• s(x) = b + w^T x
[Figure: a straight line fit to 0/1 labels]
• Not so good!
• A line does not approximate a step well
• Why not fit a step function?
• NP-hard unless the data is separable
The Logistic-Regression Classifier: Idea 2
• How about a “soft step?”
• The logistic function: f(x) = 1 / (1 + e^{−x})
[Figure: plot of the logistic function, rising from 0 to 1 with value 0.5 at x = 0]
• If a true step moves, the loss does not change until a data point flips label
• If the logistic function moves, the loss changes gradually
• We have a gradient!
• The optimization problem is no longer combinatorial
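A small sketch (mine, not from the slides) of the point being made: the logistic function changes smoothly everywhere, so it has a nonzero gradient that an optimizer can follow, unlike a hard step.

import numpy as np

def logistic(x):
    # f(x) = 1 / (1 + e^{-x})
    return 1.0 / (1.0 + np.exp(-x))

def logistic_grad(x):
    # Handy identity: f'(x) = f(x) * (1 - f(x))
    f = logistic(x)
    return f * (1.0 - f)

x = np.linspace(-6, 6, 5)
print(logistic(x))       # rises smoothly from ~0 to ~1, equals 0.5 at x = 0
print(logistic_grad(x))  # nonzero everywhere, so gradient descent can make progress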
The Logistic-Regression Classifier: What is a Logistic Function in d Dimensions?
• We want a linear classifier
• The level crossing must be a hyperplane
• Level crossing: solution to s(x) = 1/2
• The shape of the crossing depends on s (s: R^d → R)
• Compose an affine a(x) = c + u^T x ...with a monotonic f (f: R → R) that crosses 1/2: s(x) = f(a(x)) = f(c + u^T x)
• Then, if f(α) = 1/2, the equation s(x) = 1/2 is the same as c + u^T x = α
• A hyperplane!
• Let f be the logistic function
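A quick numerical check of this argument, with a hypothetical c and u: since the logistic has f(0) = 1/2, every point satisfying c + u^T x = 0 gets score exactly 1/2, so the level crossing is the hyperplane c + u^T x = 0.

import numpy as np

def logistic(a):
    return 1.0 / (1.0 + np.exp(-a))

c, u = -1.0, np.array([2.0, 1.0])     # hypothetical affine parameters

def s(x):
    # s(x) = f(c + u^T x)
    return logistic(c + u @ x)

# Points on the hyperplane c + u^T x = 0, e.g. x = (t, 1 - 2t) for any t
for t in (-1.0, 0.0, 3.0):
    x = np.array([t, 1.0 - 2.0 * t])
    print(c + u @ x, s(x))            # prints 0.0 and 0.5 each time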
The Logistic-Regression Classifier: Example
[Figure: two scatter plots of homes, panels (a) and (b)]
• Gold line: regression problem R → R
• Black line: classification problem R^2 → R (result of running a logistic-regression classifier)
• Labels: good (red squares, y = 1) or poor quality (blue circles, y = 0) homes
• All that matters is how far a point is from the black line
Probabilities and the Geometry of Logistic Regression: A Probabilistic Interpretation
[Figure: scatter plot of homes with the classifier's black boundary line]
• All that matters is how far a point is from the black line
• s(x) = f(∆(x)) where ∆ is a signed distance
• We could interpret the score s(x) as “the probability that y = 1:” f(∆(x)) = P[y = 1]
• (...or as “1 − the probability that y = 0”)
• lim_{∆ → −∞} P[y = 1] = 0,  lim_{∆ → ∞} P[y = 1] = 1,  ∆ = 0 ⇒ P[y = 1] = 1/2 (just like the logistic function)
Probabilities and the Geometry of Logistic Regression: Ingredients for the Regression Part
• Determine the distance ∆ of a point x ∈ X from a hyperplane χ, and the side of χ on which the point lies (geometry: affine functions as unscaled, signed distances)
• Specify a monotonically increasing function that turns ∆ into a probability (choice based on convenience: the logistic function)
• Define a loss function ℓ(y, ŷ) such that the minimum risk yields the optimal classifier (ditto, matched to the function in the previous bullet to obtain a convex risk: the cross-entropy loss)
Probabilities and the Geometry of Logistic Regression: Normal to a Hyperplane
[Figure: hyperplane χ with unit normal n at distance β from the origin; a point x with ∆(x) > 0 in the positive half-space and a point x′ with ∆(x) < 0 in the negative half-space]
• Hyperplane χ: b + w^T x = 0 (w.l.o.g. b ≤ 0)
• a_1, a_2 ∈ χ ⇒ c = a_1 − a_2 is parallel to χ
• Subtract b + w^T a_2 = 0 from b + w^T a_1 = 0
• Obtain w^T c = 0 for any a_1, a_2 ∈ χ
• w is perpendicular to χ
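A quick numerical check of this argument with hypothetical b and w: the difference of any two points on χ is orthogonal to w.

import numpy as np

b, w = -3.0, np.array([1.0, 2.0])     # hypothetical hyperplane b + w^T x = 0

# Two points on χ, chosen so that b + w^T a = 0
a1 = np.array([3.0, 0.0])             # 1*3 + 2*0 = 3 = -b
a2 = np.array([1.0, 1.0])             # 1*1 + 2*1 = 3 = -b
c = a1 - a2                           # a direction parallel to χ

print(b + w @ a1, b + w @ a2)         # 0.0, 0.0: both points lie on χ
print(w @ c)                          # 0.0: w is perpendicular to χ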
Probabilities and the Geometry of Logistic Regression: Distance of a Hyperplane from the Origin
[Figure: same hyperplane diagram as before]
• Unit-norm version of w: n = w / ‖w‖
• Rewrite χ: b + w^T x = 0 (w.l.o.g. b ≤ 0) as n^T x = β where β = −b / ‖w‖ ≥ 0
• Line along n: x = α n for α ∈ R (parametric form); α is the distance from the origin
• Replace into the equation for χ: α n^T n = β, that is, α = β ≥ 0
• In particular, x_0 = β n
• β is the distance of χ from the origin
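Continuing the same hypothetical example: β = −b/‖w‖, and x_0 = βn lies on χ at distance β from the origin.

import numpy as np

b, w = -3.0, np.array([1.0, 2.0])     # same hypothetical hyperplane as above

n = w / np.linalg.norm(w)             # unit normal
beta = -b / np.linalg.norm(w)         # distance of χ from the origin
x0 = beta * n                         # closest point of χ to the origin

print(beta)                           # 3 / sqrt(5) ≈ 1.342
print(b + w @ x0)                     # 0.0: x0 lies on χ
print(np.linalg.norm(x0))             # equals beta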
Probabilities and the Geometry of Logistic Regression: Signed Distance of a Point from a Hyperplane
[Figure: same hyperplane diagram as before]
• Recall: n^T x = β where β = −b / ‖w‖ ≥ 0, n = w / ‖w‖, and x_0 = β n
• In one half-space, n^T x ≥ β; the distance of x from χ is n^T x − β ≥ 0
• In the other half-space, n^T x ≤ β; the distance of x from χ is β − n^T x ≥ 0
• On the decision boundary, n^T x = β
• n^T x − β is the signed distance of x from the hyperplane
Probabilities and the Geometry of Logistic Regression Summary If w is nonzero (which it has to be), the distance of χ from the origin is = | b | def β � w � (a nonnegative number) and the quantity = b + w T x def ∆( x ) � w � is the signed distance of point x ∈ X from hyperplane χ . Specifically, the distance of x from χ is | ∆( x ) | , and ∆( x ) is nonnegative if and only if x is on the side of χ pointed to by w . Let us call that side the positive half-space of χ . COMPSCI 371D — Machine Learning Linear Predictors 18 / 37
The Logistic Function: Ingredient 2
• Want to make the score of x a function of ∆(x) only
• Given ∆_0, all points such that ∆(x) = ∆_0 have the same score
• Score: s(x) = f(∆(x))
• How to pick f? We want lim_{∆ → −∞} f(∆) = 0, f(0) = 1/2, lim_{∆ → ∞} f(∆) = 1
• Logistic function: f(∆) = 1 / (1 + e^{−∆})
[Figure: plot of the logistic function]
The Logistic Function
• Logistic function: f(∆) = 1 / (1 + e^{−∆})
[Figure: plot of the logistic function]
• Scale-free: why not 1 / (1 + e^{−∆/c})?
• Can use both c and ∆(x) = (b + w^T x) / ‖w‖ ...or more simply use no c but use a(x) = b + w^T x
• The affine function takes care of scale implicitly
• Score: s(x) = f(a(x)) = 1 / (1 + e^{−b − w^T x})
• Write s(x; b, w) to remind us of the dependence
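A minimal sketch of the resulting score, with hypothetical parameters b and w; the magnitude of w plays the role of 1/c, so no separate scale parameter is needed.

import numpy as np

def score(x, b, w):
    # s(x; b, w) = 1 / (1 + exp(-(b + w^T x)))
    return 1.0 / (1.0 + np.exp(-(b + w @ x)))

b, w = -3.0, np.array([1.0, 2.0])        # hypothetical parameters
x = np.array([2.0, 1.0])
print(score(x, b, w))                    # ≈ 0.731: leans toward class c_1
# Doubling (b, w) keeps the same decision boundary but sharpens the soft step
print(score(x, 2 * b, 2 * w))            # ≈ 0.881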