Introduction to Machine Learning
COMPSCI 371D — Machine Learning
Outline
1 Classification, Regression, Unsupervised Learning
2 About Dimensionality
3 Drawings and Intuition in Higher Dimensions
4 Classification through Regression
5 Linear Separability
About Slides
• By popular demand, lecture slides will be made available online
• They will show up just before a lecture starts
• Slides are grouped by topic, not by lecture
• Slides are not for studying
• Class notes and homework assignments are the materials of record
Classification, Regression, Unsupervised Learning

Parenthesis: Supervised vs Unsupervised
• Supervised: Train with (x, y)
  • Classification: Hand-written digit recognition
  • Regression: Median age of YouTube viewers for each video
• Unsupervised: Train with x
  • Clustering: Color compression
  • Distances matter!
• We will not cover unsupervised learning
Machine Learning Terminology
• Predictor h : X → Y (the signature of h)
• X ⊆ R^d is the data space
• Y (categorical) is the label space for a classifier
• Y (⊆ R^e) is the value space for a regressor
• A target is either a label or a value
• H is the hypothesis space (all h we can choose from)
• A training set is a subset T of X × Y
• 𝒯 def= 2^(X × Y) is the class of all possible training sets
• Learner λ : 𝒯 → H, so that λ(T) = h
• ℓ(y, ŷ) is the loss incurred for estimating ŷ when the true prediction is y
• L_T(h) = (1/N) Σ_{n=1}^{N} ℓ(y_n, h(x_n)) is the empirical risk of h on T
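A minimal sketch (Python; the predictor, loss, and two-sample training set are made up for illustration) of the empirical risk L_T(h) as an average loss over T:

    import numpy as np

    def zero_one_loss(y, y_hat):
        # classification loss: 1 for a wrong label, 0 for a correct one
        return float(y != y_hat)

    def empirical_risk(h, T, loss):
        # L_T(h) = (1/N) * sum over (x_n, y_n) in T of loss(y_n, h(x_n))
        return sum(loss(y, h(x)) for x, y in T) / len(T)

    # hypothetical two-sample training set and predictor
    T = [(np.array([0.0, 1.0]), 's'), (np.array([2.0, 3.0]), 'c')]
    h = lambda x: 's' if x.sum() < 2 else 'c'
    print(empirical_risk(h, T, zero_one_loss))   # 0.0 on this toy set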
About Dimensionality

H is Typically Parametric
• For polynomials, h ↔ c
• We write L_T(c) instead of L_T(h)
• “Searching H” means “find the parameters”: ĉ ∈ arg min_{c ∈ R^m} ‖Ac − b‖²
• This is common in machine learning: h(x) = h_θ(x), with θ a vector of parameters
• Abstract view: ĥ ∈ arg min_{h ∈ H} L_T(h)
• Concrete view: θ̂ ∈ arg min_{θ ∈ R^m} L_T(θ)
• Minimize a function of real variables, rather than of “functions”
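A minimal sketch (Python with NumPy; the degree and data are made up) of the concrete view for polynomial fitting, where “searching H” reduces to solving arg min over c in R^m of ‖Ac − b‖²:

    import numpy as np

    # hypothetical 1-D training data (x_n, y_n), N = 20 samples
    x = np.linspace(0.0, 1.0, 20)
    y = 1.0 + 2.0 * x - 3.0 * x**2 + 0.05 * np.random.randn(20)

    k = 2                                        # polynomial degree
    A = np.vander(x, k + 1, increasing=True)     # rows [1, x_n, x_n^2]
    b = y

    # c_hat in arg min_c ||A c - b||^2, with m = k + 1 parameters
    c_hat, *_ = np.linalg.lstsq(A, b, rcond=None)

    h = lambda x_new: np.vander(np.atleast_1d(x_new), k + 1, increasing=True) @ c_hat
    print(c_hat)      # recovered coefficients, close to [1, 2, -3]
    print(h(0.5))     # prediction at a new point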
Curb your Dimensions
• For polynomials, h_c(x) : X → Y with x ∈ X ⊆ R^d and c ∈ R^m
• We saw that d > 1 and degree k > 1 ⇒ m ≫ d
• Specifically, m(d, k) = C(d + k, k), the binomial coefficient “d + k choose k”
• Things blow up when k and d grow
• More generally, h_θ(x) : X → Y with x ∈ X ⊆ R^d and θ ∈ R^m
• Which dimension(s) do we want to curb? m? d?
• Both, for different but related reasons
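A quick check (Python) of how fast m(d, k) = C(d + k, k) grows, using exact binomial coefficients:

    from math import comb

    def m(d, k):
        # number of coefficients of a polynomial of degree <= k in d variables
        return comb(d + k, k)

    for d in (1, 2, 10, 100):
        for k in (1, 2, 5):
            print(f"d={d:3d}  k={k}  m={m(d, k):,}")
    # already at d=100, k=5 this gives m = 96,560,646 parameters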
Problem with m Large
• Even just for data fitting, we generally want N ≫ m, i.e., (possibly many) more samples than parameters to estimate
• For instance, in Ac = b, we want A to have more rows than columns
• Remember that annotating training data is costly
• So we want to curb m: we want a small H
Problems with d Large
• We do machine learning, not just data fitting!
• We want h to generalize to new data
• During training, we would like the learner to see a good sampling of all possible x (“fill X nicely”)
• With large d, this is impossible: the curse of dimensionality
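A small numerical illustration (Python; the grid resolution of 10 points per axis is an arbitrary choice) of why “filling X nicely” fails for large d:

    points_per_axis = 10   # a fairly coarse resolution
    for d in (1, 2, 3, 10, 100):
        # grid points needed to cover [0, 1]^d at this resolution
        print(f"d = {d:3d}   samples needed = {float(points_per_axis ** d):.1e}")
    # d = 10 already needs 1e10 samples; d = 100 needs 1e100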
Drawings and Intuition in Higher Dimensions

Drawings Help Intuition
Intuition Often Fails in Many Dimensions
[Figure: a unit cube of side 1 with an inner cube of side 1 − ε; the shell between the two is shaded gray]
• Gray parts dominate when d → ∞
• Distance from center to corners diverges when d → ∞
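Both claims can be checked numerically; here is a small Python computation for the unit cube [0, 1]^d, with ε = 0.05 chosen arbitrarily:

    import math

    eps = 0.05   # shrink each side of the cube by 5%
    for d in (2, 10, 100, 1000):
        inner_volume = (1 - eps) ** d        # volume of the inner cube of side 1 - eps
        shell_fraction = 1 - inner_volume    # the "gray" shell between the two cubes
        corner_distance = math.sqrt(d) / 2   # from the center of [0, 1]^d to any corner
        print(f"d = {d:5d}   shell fraction = {shell_fraction:.3f}   "
              f"center-to-corner distance = {corner_distance:.1f}")
    # for d = 1000 the shell holds essentially all the volume,
    # and the corners are about 15.8 side lengths away from the center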
Classification through Regression

Classifiers as Partitions of X
• The sets X_y def= h⁻¹(y) partition X (not just T!)
• Classifier = partition
• S = h⁻¹(red square), C = h⁻¹(blue circle)
Classification, Geometry, and Regression
• Classification partitions X ⊂ R^d into sets
• How do we represent sets ⊂ R^d? How do we work with them?
• We’ll see a couple of ways: nearest-neighbor classifiers, decision trees
• These methods have a strong geometric flavor
• Beware of our intuition!
• Another technique: score-based classifiers, i.e., classification through regression
Score-Based Classifiers
[Figure: a score function s, with the boundary s = 0 between the regions s > 0 and s < 0; adapted from Wei et al., Structural and Multidisciplinary Optimization, 58:831–849, 2018]
• s = 0 defines the decision boundaries
• s > 0 and s < 0 define the (two) decision regions
Score-Based Classifiers
• Threshold some score function s(x)
• Example: 's' (red squares) and 'c' (blue circles)
• These correspond to two sets S ⊆ X and C = X \ S
• If we can estimate something like s(x) = P[x ∈ S], then
  h(x) = 's' if s(x) > 1/2, 'c' otherwise
Classification through Regression
• If you prefer 0 as a threshold, let s(x) = 2 P[x ∈ S] − 1 ∈ [−1, 1], and
  h(x) = 's' if s(x) > 0, 'c' otherwise
• Scores are convenient even without probabilities, because they are easy to work with
• We implement a classifier h by building a regressor s
• Example: Logistic-regression classifiers
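A minimal sketch (Python with NumPy; the data, step size, and iteration count are made up) of classification through regression: fit a logistic-regression estimate of P[x ∈ S] by gradient descent, rescale it to the score s(x) = 2 P[x ∈ S] − 1, and threshold at 0:

    import numpy as np

    rng = np.random.default_rng(0)
    # hypothetical 2-D training set: class 's' around (0, 0), class 'c' around (3, 3)
    X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
    y = np.array([1.0] * 50 + [0.0] * 50)        # 1 means x in S ('s'), 0 means 'c'

    w, b = np.zeros(2), 0.0
    for _ in range(2000):                        # plain gradient descent on the logistic loss
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))   # current estimate of P[x in S]
        grad = p - y
        w -= 0.1 * (X.T @ grad) / len(y)
        b -= 0.1 * grad.mean()

    def classify(x):
        p = 1.0 / (1.0 + np.exp(-(x @ w + b)))
        s = 2.0 * p - 1.0                        # score in [-1, 1], threshold at 0
        return 's' if s > 0 else 'c'

    print(classify(np.array([0.2, -0.5])))       # expected: 's'
    print(classify(np.array([3.1, 2.8])))        # expected: 'c'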
Linear Separability

Linearly Separable Training Sets
• Some line (hyperplane in R^d) separates C, S
• Requires much smaller H
• Simplest score: s(x) = b + wᵀx. The line is s(x) = 0, and
  h(x) = 's' if s(x) > 0, 'c' otherwise
Data Representation?
• Linear separability is a property of the data in a given representation
[Figure: class S is a thin ring of radius r and width Δr around the origin]
• Xform 1: z = x₁² + x₂² implies x ∈ S ⇔ a ≤ z ≤ b (an interval, so one threshold is not enough)
• Xform 2: z = |√(x₁² + x₂²) − r| implies linear separability: x ∈ S ⇔ z ≤ Δr
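A small demonstration (Python with NumPy; the radius r, width Δr, and sample counts are made up) that Xform 2 makes ring-shaped data separable by a single threshold on z, i.e., linearly separable in the new representation:

    import numpy as np

    rng = np.random.default_rng(1)
    r, delta_r = 2.0, 0.2

    def sample_ring(n, lo, hi):
        # n points whose distance from the origin is uniform in [lo, hi]
        radius = rng.uniform(lo, hi, n)
        angle = rng.uniform(0, 2 * np.pi, n)
        return np.column_stack([radius * np.cos(angle), radius * np.sin(angle)])

    S = sample_ring(200, r - delta_r, r + delta_r)           # class 's': thin ring of radius r
    C = np.vstack([sample_ring(100, 0.0, r - 2 * delta_r),   # class 'c': inside the ring
                   sample_ring(100, r + 2 * delta_r, 4.0)])  # ... and outside it

    def xform2(X):
        # z = | sqrt(x1^2 + x2^2) - r |
        return np.abs(np.linalg.norm(X, axis=1) - r)

    # in the z representation, one threshold (a linear boundary) separates the classes
    print((xform2(S) <= delta_r).all())   # True: every point of S has z <= delta_r
    print((xform2(C) > delta_r).all())    # True: every point of C has z >  delta_r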