
INTRODUCTION TO LEARNING
Regularization Methods for High Dimensional Learning
Francesca Odone and Lorenzo Rosasco, odone@disi.unige.it / lrosasco@mit.edu
June 6, 2011


  1. INTRODUCTION TO LEARNING. Title slide: Regularization Methods for High Dimensional Learning. Francesca Odone and Lorenzo Rosasco, odone@disi.unige.it / lrosasco@mit.edu, June 6, 2011.

  2. DIFFERENT PROBLEMS IN SUPERVISED LEARNING. In supervised learning we are given a set of input-output pairs (x_1, y_1), ..., (x_n, y_n) that we call a training set. Classification: a learning problem with output values taken from a finite unordered set C = {C_1, ..., C_k}; a special case is binary classification, where y_i ∈ {−1, 1}. Regression: a learning problem whose output values are real, y_i ∈ ℝ.
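As a concrete picture (a hypothetical toy example, not taken from the slides), a training set is just a finite list of such pairs; all code sketches below use Python:

```python
# Binary classification: outputs from the finite set {-1, +1}
classification_set = [(0.2, -1), (1.5, +1), (0.9, -1), (2.3, +1)]

# Regression: real-valued outputs
regression_set = [(0.2, 0.7), (1.5, 3.1), (0.9, 1.8), (2.3, 4.6)]
```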

  3. LEARNING IS INFERENCE (PREDICTIVITY OR GENERALIZATION). Given the data, the goal is to learn how to make decisions/predictions about future data, i.e. data not belonging to the training set. The central problem: avoid overfitting!

  4. PREDICTIVITY. Among many possible solutions, how can we choose one that correctly applies to previously unseen data? (Slides 5–7 repeat this question over successive illustrations.)


  8. THE ROLE OF PROBABILITY. In supervised learning we consider a probabilistic relationship between inputs and outputs. The relationship can be stochastic, or deterministic with stochastic noise. If it is entirely unpredictable, no learning can take place (we are not about to learn how to predict lotto numbers!).

  9. DATA GENERATED BY A PROBABILITY DISTRIBUTION. We assume that X and Y are two sets of random variables. We consider a set of data S = {(x_1, y_1), ..., (x_n, y_n)} that we call a training set. The training set consists of independent, identically distributed samples drawn from a probability distribution on X × Y. The joint distribution factorizes as p(x, y) = p(y|x) p(x). p(x, y) is fixed but unknown.
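A minimal sketch of this sampling assumption, with an arbitrary illustrative choice of the densities: p(x) uniform on [0, 1] and p(y|x) a sine curve plus Gaussian noise (neither is specified in the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_set(n):
    """Draw n i.i.d. pairs from p(x, y) = p(y|x) p(x)."""
    x = rng.uniform(0.0, 1.0, size=n)                          # x ~ p(x): uniform marginal
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=n)   # y ~ p(y|x): noisy measurements
    return x, y

x_train, y_train = sample_training_set(20)
```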

  10. NOISE. [Figure: the conditional distribution p(y|x) over Y for a given x ∈ X.] The same x can generate different y (according to p(y|x)): the underlying process is deterministic, but there is noise in the measurement of y; or the underlying process is not deterministic; or the underlying process is deterministic, but only incomplete information is available.

  11. ... AND SAMPLING. Even in a noise-free case we have to deal with sampling. [Figure: input points x drawn under the marginal density p(x).] The marginal distribution p(x) might model: errors in the location of the input points; discretization error for a given grid; the presence or absence of certain input instances. (Slides 12–14 repeat this slide with successive illustrations.)


  15. HYPOTHESIS SPACE. Predictivity is a trade-off between the information provided by the training data and the complexity of the solution we are looking for. The hypothesis space H is the space of functions where we look for our solution. Supervised learning uses the training data to learn a function f ∈ H, f : X → Y, that can be applied to previously unseen data: y_pred = f(x_new).

  16. LOSS FUNCTIONS. How do we choose a "good" f ∈ H? In order to measure the goodness of our function f we use a non-negative function V called the loss function. In general, V(f(x), y) denotes the price to pay for predicting f(x) when the true output is y.

  17. LOSS FUNCTIONS FOR REGRESSION. The most common is the square loss, or L2 loss: V(f(x), y) = (f(x) − y)^2. Absolute value, or L1 loss: V(f(x), y) = |f(x) − y|. Vapnik's ε-insensitive loss: V(f(x), y) = (|f(x) − y| − ε)_+.
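These three losses are one-liners in code; a sketch (the value of ε is an arbitrary illustration):

```python
import numpy as np

def square_loss(fx, y):
    return (fx - y) ** 2

def absolute_loss(fx, y):
    return np.abs(fx - y)

def eps_insensitive_loss(fx, y, eps=0.1):
    # (|f(x) - y| - eps)_+ : zero inside the eps-tube, linear outside
    return np.maximum(np.abs(fx - y) - eps, 0.0)
```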

  18. LOSS FUNCTIONS FOR (BINARY) CLASSIFICATION. The most intuitive one is the 0−1 loss: V(f(x), y) = θ(−y f(x)), where θ is the step function. The more tractable hinge loss: V(f(x), y) = (1 − y f(x))_+. And again the square loss, or L2 loss: V(f(x), y) = (f(x) − y)^2.
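And for classification, with labels y ∈ {−1, +1}; a sketch (the convention that a zero margin counts as an error, i.e. θ(0) = 1, is our choice):

```python
import numpy as np

def zero_one_loss(fx, y):
    # theta(-y f(x)): 1 when the sign of f(x) disagrees with y, else 0
    return np.asarray(y * fx <= 0, dtype=float)

def hinge_loss(fx, y):
    # (1 - y f(x))_+
    return np.maximum(1.0 - y * fx, 0.0)
```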

  19. LOSS FUNCTIONS. [Figure: plots of the loss functions listed above.]

  20. LEARNING ALGORITHM. If Z = X × Y, a learning algorithm is a map L : Z^n → H that looks at the training set S and selects from H a function f_S : X → Y such that f_S(x) ∼ y in a generalizing way.
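In code, the map L can be viewed as a higher-order function: it consumes a training set and returns a predictor. A deliberately trivial illustration (this "mean predictor" is our own toy example, not an algorithm from the slides):

```python
def learn_constant_predictor(S):
    """A toy learning algorithm L: Z^n -> H, with H the set of constant functions."""
    mean_y = sum(y for _, y in S) / len(S)
    return lambda x: mean_y          # f_S: X -> Y, here ignoring x entirely

f_S = learn_constant_predictor([(0.2, 0.7), (1.5, 3.1), (0.9, 1.8)])
print(f_S(1.0))   # -> 1.866...  (the training mean, whatever the input)
```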

  21. WHAT WE HAVE SEEN SO FAR. We are considering: an input space X and an output space Y ⊂ ℝ; an unknown probability distribution on the product space Z = X × Y: p(X, Y); a training set of n samples drawn i.i.d. from p: S = {(x_1, y_1), ..., (x_n, y_n)}; a hypothesis space H, that is, a space of functions f : X → Y; a learning algorithm, that is, a map L : Z^n → H selecting from H a function f_S such that f_S(x) ∼ y in a predictive way.

  22. LEARNING AS RISK MINIMIZATION. Learning means producing a hypothesis that makes the expected error (or true error) small. Expected error: I[f] = ∫_{X×Y} V(f(x), y) p(x, y) dx dy. We would like to obtain f_H = argmin_{f∈H} I[f]. If the probability density were known, learning would be easy! Unfortunately, it is fixed but unknown. What we do have is the training set S.
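If p(x, y) were known, as it is for the toy distribution sketched under slide 9, I[f] could simply be estimated by Monte Carlo; the point of the slide is that in practice we cannot do this. A sketch under that assumption:

```python
import numpy as np

rng = np.random.default_rng(1)

def expected_risk_mc(f, n_mc=100_000):
    """Monte Carlo estimate of I[f] = E[V(f(x), y)] with the square loss,
    under the toy p(x, y): x ~ Uniform(0, 1), y = sin(2*pi*x) + Gaussian noise."""
    x = rng.uniform(0.0, 1.0, size=n_mc)
    y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.1, size=n_mc)
    return np.mean((f(x) - y) ** 2)

print(expected_risk_mc(lambda x: np.sin(2 * np.pi * x)))  # ~ 0.01, the noise variance
```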

  23. EMPIRICAL RISK MINIMIZATION (ERM). Given a loss function V = V(f(x), y), we define the empirical risk as I_emp[f, S] = (1/n) Σ_{i=1}^n V(f(x_i), y_i). ERM principle: the Empirical Risk Minimization principle chooses the function f_S ∈ H according to f_S = argmin_{f∈H} I_emp[f, S].
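For the square loss and a hypothesis space of polynomials of fixed degree, the ERM solution even has a closed form (the normal equations). A minimal sketch (the degree and the data are illustrative choices):

```python
import numpy as np

def erm_least_squares(x, y, degree=3):
    """ERM with the square loss over H = {polynomials of the given degree}:
    f_S = argmin_f (1/n) sum_i (f(x_i) - y_i)^2, solved in closed form."""
    Phi = np.vander(x, degree + 1)                # feature matrix, one row per sample
    w, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # least-squares coefficients
    return lambda x_new: np.vander(np.atleast_1d(x_new), degree + 1) @ w

x = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y = np.sin(2 * np.pi * x) + np.array([0.05, -0.1, 0.02, 0.08, -0.03])  # noisy samples
f_S = erm_least_squares(x, y)
print(f_S(0.3))
```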

  24. GOOD QUALITIES OF A SOLUTION. For a solution to be useful in the context of learning it must: generalize; be stable (well posed).

  25. REMINDER: ILL-POSED PROBLEMS. A mathematical problem is well posed in the sense of Hadamard if: the solution exists; the solution is unique; the solution depends continuously on the data. If a problem is not well posed, it is called ill posed.
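A small numerical sketch of how uniqueness can fail for ERM itself (the data and degree are our illustrative choices): interpolating 3 points with a degree-5 polynomial leaves a whole family of hypotheses with zero empirical risk, so the problem is ill posed without further constraints:

```python
import numpy as np

x = np.array([0.0, 0.5, 1.0])
y = np.array([0.0, 1.0, 0.0])
Phi = np.vander(x, 6)            # degree-5 polynomial: 6 unknowns, only 3 equations

w_min, *_ = np.linalg.lstsq(Phi, y, rcond=None)   # one exact fit (minimum norm)
_, _, Vt = np.linalg.svd(Phi)
w_alt = w_min + 5.0 * Vt[-1]                      # add a null-space direction: another exact fit

print(np.allclose(Phi @ w_min, y), np.allclose(Phi @ w_alt, y))   # True True: both interpolate
print(np.polyval(w_min, 0.25), np.polyval(w_alt, 0.25))           # yet they disagree off the data
```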

  26. REMINDER: CONVERGENCE IN PROBABILITY. Let {X_n} be a sequence of bounded random variables. Then lim_{n→∞} X_n = X in probability if, for all ε > 0, lim_{n→∞} P{|X_n − X| ≥ ε} = 0.
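A quick numerical illustration using the sample mean of uniform draws (our choice of example; the law of large numbers is exactly this statement with X_n the sample mean and X = 1/2): the probability of an ε-sized deviation shrinks as n grows.

```python
import numpy as np

rng = np.random.default_rng(2)

# X_n = sample mean of n i.i.d. Uniform(0, 1) draws; X = 1/2.
eps = 0.05
for n in (10, 100, 1000):
    means = rng.uniform(size=(2000, n)).mean(axis=1)   # 2000 replicas of X_n
    print(n, np.mean(np.abs(means - 0.5) >= eps))      # estimate of P{|X_n - X| >= eps}
```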

  27. CONSISTENCY AND GENERALIZATION. A desirable property for f_S is consistency: lim_{n→∞} I[f_S] = I[f_H], that is, the expected error of the learned solution must converge to the best expected error attainable in H. Consistency guarantees generalization.
