WHY SUPERVISED LEARNING MAY WORK
Matthieu R Bloch
Thursday, January 9, 2020
LOGISTICS
- Registration update: please decide soon if you want to take the class or not. Still many people on the waiting list!
- Lecture videos available on Canvas in “Media Gallery”. Please keep coming to class!
- Self-assessment available online here. Due Friday January 17, 2020 (11:59PM EST); Friday January 24, 2020 for DL. I don’t expect you to do the assignment without refreshing your memory first.
(Comic: http://www.phdcomics.com)
COMPONENTS OF SUPERVISED MACHINE LEARNING
1. An unknown function to learn, $f : \mathcal{X} \to \mathcal{Y} : x \mapsto y = f(x)$ — the formula to distinguish cats from dogs.
2. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$, where $x_i \in \mathcal{X} \triangleq \mathbb{R}^d$ is a picture of a cat/dog and $y_i \in \mathcal{Y} \triangleq \mathbb{R}$ is the corresponding label (cat/dog).
3. A set of hypotheses $\mathcal{H}$ as to what the function could be. Example: deep neural nets with the AlexNet architecture.
4. An algorithm ALG to find the best $h \in \mathcal{H}$ that explains $f$.
Terminology: $\mathcal{Y} = \mathbb{R}$: regression problem; $|\mathcal{Y}| < \infty$: classification problem; $|\mathcal{Y}| = 2$: binary classification problem.
The goal is to generalize, i.e., be able to classify inputs we have not seen.
Figure: Learning model #1.
A LEARNING PUZZLE
Learning seems impossible without additional assumptions!
POSSIBLE VS PROBABLE
- Flip a biased coin that lands on heads with unknown probability $p \in [0, 1]$: $P(\text{head}) = p$ and $P(\text{tail}) = 1 - p$.
- Say we flip the coin $N$ times; can we estimate $p$ with $\hat{p} = \frac{\#\text{heads}}{N}$?
- Can we relate $\hat{p}$ to $p$? The law of large numbers tells us that $\hat{p}$ converges in probability to $p$ as $N$ gets large:
$\forall \epsilon > 0 \quad P(|\hat{p} - p| > \epsilon) \xrightarrow{N \to \infty} 0.$
- It is possible that $\hat{p}$ is completely off, but it is not probable.
(Comic: https://xkcd.com/221/)
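As a quick sanity check of the law of large numbers, here is a minimal simulation sketch (assuming NumPy; the bias $p = 0.3$ and the sample sizes are arbitrary illustrative choices, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.3  # true (in practice unknown) bias of the coin, chosen here for illustration

for N in [10, 100, 1_000, 10_000, 100_000]:
    flips = rng.random(N) < p          # each flip is a head with probability p
    p_hat = flips.mean()               # empirical frequency of heads: #heads / N
    print(f"N = {N:>7d}   p_hat = {p_hat:.4f}   |p_hat - p| = {abs(p_hat - p):.4f}")
```

As N grows, the gap between the estimate and the true bias shrinks, which is exactly the concentration the slide alludes to.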
COMPONENTS OF SUPERVISED MACHINE LEARNING
1. An unknown function to learn, $f : \mathcal{X} \to \mathcal{Y} : x \mapsto y = f(x)$.
2. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$: the $\{x_i\}_{i=1}^N$ are drawn i.i.d. from an unknown distribution $P_x$ on $\mathcal{X}$, and the $\{y_i\}_{i=1}^N$, with $y_i \in \mathcal{Y} \triangleq \mathbb{R}$, are the corresponding targets.
3. A set of hypotheses $\mathcal{H}$ as to what the function could be.
4. An algorithm ALG to find the best $h \in \mathcal{H}$ that explains $f$.
Figure: Learning model #2.
ANOTHER LEARNING PUZZLE
Which color is the dress?
COMPONENTS OF SUPERVISED MACHINE LEARNING
1. An unknown conditional distribution $P_{y|x}$ to learn; $P_{y|x}$ models $f : \mathcal{X} \to \mathcal{Y}$ together with noise.
2. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$: the $\{x_i\}_{i=1}^N$ are drawn i.i.d. from an unknown probability distribution $P_x$ on $\mathcal{X}$, and the $\{y_i\}_{i=1}^N$, with $y_i \in \mathcal{Y} \triangleq \mathbb{R}$, are the corresponding targets.
3. A set of hypotheses $\mathcal{H}$ as to what the function could be.
4. An algorithm ALG to find the best $h \in \mathcal{H}$ that explains $f$.
The roles of $P_{y|x}$ and $P_x$ are different:
- $P_{y|x}$ is what we want to learn; it captures the underlying function and the noise added to it.
- $P_x$ models the sampling of the dataset and need not be learned.
Figure: Learning model #3.
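To make the two roles concrete, here is a small sketch of a data-generating process (the choice of $f$, of a uniform $P_x$, and of Gaussian label noise are my own illustrative assumptions, not from the slides):

```python
import numpy as np

rng = np.random.default_rng(1)

def f(x):
    # Illustrative "unknown" function that the learner is trying to recover.
    return np.sin(2 * np.pi * x)

N = 200
x = rng.uniform(0.0, 1.0, size=N)            # inputs drawn i.i.d. from P_x (here uniform on [0, 1])
y = f(x) + rng.normal(0.0, 0.1, size=N)      # P_{y|x}: the underlying function plus additive noise

dataset = list(zip(x, y))                    # D = {(x_1, y_1), ..., (x_N, y_N)}
```

Changing the distribution of `x` alters only how the dataset is sampled; the relationship between inputs and labels, i.e. what we actually want to learn, lives entirely in the second line.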
YET ANOTHER LEARNING PUZZLE
- Assume that you are designing a fingerprint authentication system, and you trained it with fancy machine learning.
- The probability of wrongly authenticating is 1%; the probability of correctly authenticating is 60%.
- Is this a good system? It depends!
  - If you are GTRI, this might be OK (security matters more).
  - If you are Apple, this is not acceptable (user convenience matters too).
- There is an application-dependent cost that can affect the design.
Figure: Biometric authentication system.
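One way to see the effect of an application-dependent cost is to weight the two error types differently. The sketch below uses the rates from the puzzle, but the cost weights and the prior on impostor attempts are made-up numbers, purely for illustration:

```python
# Rates from the puzzle: P(accept | impostor) = 1%, P(accept | legitimate) = 60%.
p_false_accept = 0.01   # wrongly authenticating an impostor
p_false_reject = 0.40   # rejecting a legitimate user (1 - 0.60)

def expected_cost(cost_false_accept, cost_false_reject, p_impostor=0.05):
    """Expected cost per authentication attempt.

    p_impostor is a hypothetical prior on an attempt being fraudulent;
    the two costs weight the error types according to the application.
    """
    return (p_impostor * p_false_accept * cost_false_accept
            + (1 - p_impostor) * p_false_reject * cost_false_reject)

# Hypothetical weightings: a security-focused design penalizes false accepts heavily,
# a convenience-focused design penalizes false rejects heavily.
print("security-focused   :", expected_cost(cost_false_accept=100.0, cost_false_reject=1.0))
print("convenience-focused:", expected_cost(cost_false_accept=1.0, cost_false_reject=10.0))
```

The same error rates lead to very different expected costs once the application's priorities are encoded in the loss, which is the point of the puzzle.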
COMPONENTS OF SUPERVISED MACHINE LEARNING
1. A dataset $\mathcal{D} \triangleq \{(x_1, y_1), \cdots, (x_N, y_N)\}$: the $\{x_i\}_{i=1}^N$ are drawn i.i.d. from an unknown probability distribution $P_x$ on $\mathcal{X}$, and the $\{y_i\}_{i=1}^N$, with $y_i \in \mathcal{Y} \triangleq \mathbb{R}$, are the corresponding targets.
2. An unknown conditional distribution $P_{y|x}$; $P_{y|x}$ models $f : \mathcal{X} \to \mathcal{Y}$ together with noise.
3. A set of hypotheses $\mathcal{H}$ as to what the function could be.
4. A loss function $\ell : \mathcal{Y} \times \mathcal{Y} \to \mathbb{R}^+$ capturing the “cost” of prediction.
5. An algorithm ALG to find the best $h \in \mathcal{H}$ that explains $f$.
Figure: Final supervised learning model.
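Putting the five components together, here is a compact sketch (the linear hypothesis class, the squared loss, and the brute-force grid search are my own illustrative choices, not the course's algorithm):

```python
import numpy as np

rng = np.random.default_rng(2)

# 1.-2. Dataset: x_i drawn i.i.d. from P_x, y_i from P_{y|x}
#       (here a linear function plus Gaussian noise, for illustration).
N = 100
x = rng.uniform(-1.0, 1.0, size=N)
y = 1.5 * x + rng.normal(0.0, 0.2, size=N)

# 3. Hypothesis set H: linear functions h_w(x) = w * x, for w on a grid.
candidate_w = np.linspace(-3.0, 3.0, 601)

# 4. Loss function: squared loss l(y, y_hat) = (y - y_hat)^2.
def loss(y_true, y_pred):
    return (y_true - y_pred) ** 2

# 5. Algorithm: empirical risk minimization over the grid of hypotheses.
empirical_risks = [loss(y, w * x).mean() for w in candidate_w]
w_star = candidate_w[int(np.argmin(empirical_risks))]
print("selected hypothesis: h(x) =", round(w_star, 3), "* x")
```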
THE SUPERVISED LEARNING PROBLEM
Learning is not memorizing:
- Our goal is not to find $h \in \mathcal{H}$ that accurately assigns values to elements of $\mathcal{D}$.
- Our goal is to find the best $h \in \mathcal{H}$ that accurately predicts values of unseen samples.
Consider a hypothesis $h \in \mathcal{H}$. We can easily compute the empirical risk (a.k.a. in-sample error)
$\hat{R}_N(h) \triangleq \frac{1}{N} \sum_{i=1}^N \ell(y_i, h(x_i)).$
What we really care about is the true risk (a.k.a. out-of-sample error)
$R(h) \triangleq \mathbb{E}_{xy}[\ell(y, h(x))].$
- Question #1: Can we generalize? For a given $h$, is $\hat{R}_N(h)$ close to $R(h)$?
- Question #2: Can we learn well? Given $\mathcal{H}$, the best hypothesis is $h^\sharp \triangleq \operatorname{argmin}_{h \in \mathcal{H}} R(h)$, but our algorithm can only find $h^* \triangleq \operatorname{argmin}_{h \in \mathcal{H}} \hat{R}_N(h)$. Is $R(h^*)$ close to $R(h^\sharp)$? Is $R(h^\sharp) \approx 0$?
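The two risks can be contrasted numerically. The sketch below fixes one hypothesis $h$ and compares its empirical risk on a small sample with a Monte Carlo estimate of its true risk (the data-generating process and the hypothesis are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(3)

def sample(n):
    # Illustrative data-generating process: y = x^2 plus Gaussian noise.
    x = rng.uniform(-1.0, 1.0, size=n)
    y = x ** 2 + rng.normal(0.0, 0.1, size=n)
    return x, y

def h(x):
    # A fixed hypothesis to evaluate (deliberately imperfect: a linear guess).
    return 0.5 * x + 0.3

def squared_loss(y_true, y_pred):
    return (y_true - y_pred) ** 2

# Empirical risk on a small training set of size N = 50.
x_train, y_train = sample(50)
empirical_risk = squared_loss(y_train, h(x_train)).mean()

# True risk approximated by Monte Carlo with a very large fresh sample.
x_big, y_big = sample(1_000_000)
true_risk_estimate = squared_loss(y_big, h(x_big)).mean()

print("empirical risk R_hat_N(h)   :", round(empirical_risk, 4))
print("true risk R(h) (MC estimate):", round(true_risk_estimate, 4))
```

Question #1 asks how far apart these two numbers can be; Question #2 asks whether minimizing the first one gets us close to the hypothesis that minimizes the second.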
WHY THE QUESTIONS MATTER
Quick demo: nearest neighbor classification
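The demo itself is not reproduced in these notes; as a stand-in, here is a minimal 1-nearest-neighbor classifier on toy data (my own example), which drives the in-sample error to zero while saying nothing by itself about the true risk:

```python
import numpy as np

rng = np.random.default_rng(4)

# Toy binary classification data in R^2: two noisy clusters.
X_train = rng.normal(size=(40, 2)) + np.array([[1.0, 1.0]] * 20 + [[-1.0, -1.0]] * 20)
y_train = np.array([0] * 20 + [1] * 20)

def predict_1nn(x_query):
    # Return the label of the single closest training point (Euclidean distance).
    distances = np.linalg.norm(X_train - x_query, axis=1)
    return y_train[np.argmin(distances)]

# In-sample error is zero: each training point is its own nearest neighbor.
train_errors = sum(predict_1nn(x) != y for x, y in zip(X_train, y_train))
print("training errors:", train_errors)
```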
DETOUR: PROBABILITIES
Probabilities are not that old: the axiomatic theory was carried out by Kolmogorov in 1933.

Definition (Axioms for events). Let $\Omega \neq \emptyset$ be a sample space. The class of subsets of $\Omega$ that constitutes events satisfies the following axioms:
1. $\Omega$ is an event;
2. for any countable collection of events $\{A_i\}_{i \geq 1}$ in $\Omega$, $\cup_{i=1}^\infty A_i$ is an event;
3. for every event $A$ in $\Omega$, $A^c$ is an event.

Definition (Axioms for probability). Let $\Omega \neq \emptyset$ be a sample space and $\mathcal{F}$ a class of events satisfying the axioms for events. A probability rule is a function $P : \mathcal{F} \to \mathbb{R}^+$ such that:
1. $P(\Omega) = 1$;
2. for every $A \in \mathcal{F}$, $P(A) \geq 0$;
3. for any disjoint events $\{A_i\}_{i=1}^\infty$ in $\mathcal{F}$, $P(\cup_{i=1}^\infty A_i) = \sum_{i=1}^\infty P(A_i)$.

Proposition (Union bound). Let $(\Omega, \mathcal{F}, P)$ be a probability space. For any events $\{A_i\}_{i \geq 1}$ we have $P(\cup_{i \geq 1} A_i) \leq \sum_{i \geq 1} P(A_i)$.
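A quick Monte Carlo illustration of the union bound (the three events on a uniform sample space are my own example; note how the bound is loose when the events overlap):

```python
import numpy as np

rng = np.random.default_rng(5)

# Monte Carlo check of the union bound on three overlapping events
# defined on a uniform sample space Omega = [0, 1).
omega = rng.random(1_000_000)
events = [omega < 0.3, (omega > 0.2) & (omega < 0.5), omega > 0.85]

p_union = np.logical_or.reduce(events).mean()    # P(A_1 ∪ A_2 ∪ A_3)
sum_of_probs = sum(e.mean() for e in events)     # Σ P(A_i)

print("P(union)      ≈", round(p_union, 4))      # about 0.65
print("sum of P(A_i) ≈", round(sum_of_probs, 4)) # about 0.75, which upper-bounds P(union)
```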
DETOUR: PROBABILITIES
Definition (Conditional probability). Let $(\Omega, \mathcal{F}, P)$ be a probability space. The conditional probability of event $A$ given event $B$ is, if $P(B) > 0$, $P(A|B) = P(A \cap B)/P(B)$.

Definition (Bayes' rule). Let $(\Omega, \mathcal{F}, P)$ be a probability space and $A, B$ events with non-zero probability. Then
$P(A|B) = P(B|A) \frac{P(A)}{P(B)}.$

Definition (Independence). Let $(\Omega, \mathcal{F}, P)$ be a probability space. Event $A$ is independent of event $B$ if $P(A \cap B) = P(A)P(B)$.
For $n > 2$, the events $\{A_i\}_{i=1}^n$ are independent if, for every $S \subset [1; n]$ such that $|S| \geq 2$,
$P(\cap_{i \in S} A_i) = \prod_{i \in S} P(A_i).$
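A small worked example of Bayes' rule (the testing scenario and all of the numbers below are made up solely to exercise the formula):

```python
# Bayes' rule: P(A | B) = P(B | A) * P(A) / P(B).
# Illustrative setting: A = "has condition", B = "test is positive".
p_A = 0.01                      # prior P(A)
p_B_given_A = 0.95              # sensitivity P(B | A)
p_B_given_notA = 0.05           # false-positive rate P(B | not A)

# Total probability: P(B) = P(B|A) P(A) + P(B|not A) P(not A).
p_B = p_B_given_A * p_A + p_B_given_notA * (1 - p_A)

p_A_given_B = p_B_given_A * p_A / p_B
print("P(A | B) =", round(p_A_given_B, 4))   # ≈ 0.161, despite the 95% sensitivity
```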
DETOUR: RANDOM VARIABLES
Definition (Random variable). Let $(\Omega, \mathcal{F}, P)$ be a probability space. A random variable is a function $X : \Omega \to \mathbb{R}$.
1. $X$ might be undefined or infinite on a subset of zero probability.
2. $\{\omega \in \Omega : X(\omega) \leq x\}$ must be an event for all $x \in \mathbb{R}$.
3. For a finite set of $n$ random variables $\{X_i\}_{i=1}^n$, the set $\{\omega : X_1(\omega) \leq x_1, \cdots, X_n(\omega) \leq x_n\}$ must be an event for all $\{x_i\}_{i=1}^n$.

Definition (Cumulative distribution function). Let $(\Omega, \mathcal{F}, P)$ be a probability space and $X$ a random variable. The CDF of $X$ is the function $F_X : \mathbb{R} \to \mathbb{R} : x \mapsto P(\omega \in \Omega : X(\omega) \leq x) \triangleq P(X \leq x)$.
- If $|\mathcal{X}| < \infty$ or $\mathcal{X}$ is countable, the random variable is discrete. We can write $\mathcal{X} = \{x_i\}_{i \geq 1}$, and $P_X(x_i) \triangleq P(X = x_i)$ is called the probability mass function (PMF) of $X$.
- If the CDF of $X$ has a finite derivative at $x$, the derivative is called the probability density function (PDF), denoted by $p_X$. If $F_X$ has a derivative for every $x \in \mathbb{R}$, $X$ is continuous.
- We often don't need to specify $(\Omega, \mathcal{F}, P)$. All we need is a CDF (or PMF or PDF).
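A short sketch (my own examples, assuming NumPy) estimating CDF values empirically, for one continuous and one discrete random variable:

```python
import numpy as np

rng = np.random.default_rng(6)

def empirical_cdf(samples, x):
    # F_X(x) estimated as the fraction of samples with value <= x.
    return np.mean(samples <= x)

# Continuous example: standard normal; F_X(0) should be close to 0.5.
normal_samples = rng.normal(size=100_000)
print("F_X(0) for N(0,1):", round(empirical_cdf(normal_samples, 0.0), 3))

# Discrete example: a fair six-sided die; the CDF is a step function.
die_samples = rng.integers(1, 7, size=100_000)
for x in [0.5, 1.0, 3.7, 6.0]:
    print(f"F_X({x}) for a fair die ≈ {empirical_cdf(die_samples, x):.3f}")
```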
DETOUR: RANDOM VARIABLES
Definition (Expectation/Mean). Let $X$ be a random variable with PMF $P_X$. Then $\mathbb{E}[X] \triangleq \sum_{x \in \mathcal{X}} x P_X(x)$. Let $X$ be a random variable with PDF $p_X$. Then $\mathbb{E}[X] \triangleq \int_{x \in \mathcal{X}} x\, p_X(x)\, dx$.
The expectation of a function $f$ of a discrete $X$ is $\mathbb{E}[f(X)] = \sum_{x \in \mathcal{X}} f(x) P_X(x)$ (and idem for PDFs).

Definition (Moment). Let $X$ be a random variable. The $m$th moment of $X$ is $\mathbb{E}[X^m]$.
The variance is the second centered moment $\operatorname{Var}(X) \triangleq \mathbb{E}[(X - \mathbb{E}[X])^2] = \mathbb{E}[X^2] - \mathbb{E}[X]^2$.

Proposition (Expectation of an indicator function). Let $X$ be a random variable and $E \subset \mathbb{R}$. Then $\mathbb{E}[\mathbb{1}\{X \in E\}] = P(X \in E)$.

11th commandment: thou shall denote random variables by capital letters.
12th commandment: but sometimes not.
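A simulation sketch (the exponential distribution and the event $E = [1, 3]$ are arbitrary choices of mine) checking the two variance expressions and the identity $\mathbb{E}[\mathbb{1}\{X \in E\}] = P(X \in E)$:

```python
import numpy as np

rng = np.random.default_rng(7)
X = rng.exponential(scale=2.0, size=1_000_000)   # arbitrary illustrative distribution

# Variance two ways: E[(X - E[X])^2] and E[X^2] - E[X]^2 should agree.
mean = X.mean()
print("E[(X - E[X])^2]  :", round(np.mean((X - mean) ** 2), 3))
print("E[X^2] - E[X]^2  :", round(np.mean(X ** 2) - mean ** 2, 3))

# Expectation of an indicator equals the probability of the event, here E = [1, 3].
indicator = ((X >= 1.0) & (X <= 3.0)).astype(float)
print("E[1{X in [1,3]}] :", round(indicator.mean(), 3))
# For an Exponential with scale 2: P(1 <= X <= 3) = exp(-1/2) - exp(-3/2) ≈ 0.383.
```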