

  1. Non parametric methods. Course of Machine Learning, Master Degree in Computer Science, University of Rome “Tor Vergata”. Giorgio Gambosi, a.a. 2018-2019.

  2. Probability distribution estimates.
  • The statistical approach to classification requires the (at least approximate) knowledge of $p(C_i \mid x)$: in fact, an item $x$ shall be assigned to the class $C_i$ such that $i = \operatorname{argmax}_k p(C_k \mid x)$.
  • The same holds in the regression case, where $p(y \mid x)$ has to be estimated.

  3. Probability distribution estimates: hypotheses. What do we assume to know of the class distributions, given a training set $X, t$?
  • Case 1. The probabilities $p(x \mid C_i)$ are known: an item $x$ is assigned to the class $C_i$ such that
    $$i = \operatorname{argmax}_j p(C_j \mid x)$$
    where $p(C_j \mid x)$ can be derived through Bayes' rule and the prior probabilities, since $p(C_k \mid x) \propto p(x \mid C_k)\, p(C_k)$.

  4. Probability distribution estimates: hypotheses.
  • Case 2. The type of probability distribution $p(x \mid \theta)$ is known: an estimate of the parameter values $\theta_i$ is performed for each class, taking into account for each class $C_i$ the subset $X_i, t_i$ of items belonging to the class, that is, such that $t = i$. Different approaches to parameter estimation:
    1. Maximum likelihood: $\theta_i^{ML} = \operatorname{argmax}_\theta p(X_i, t_i \mid \theta)$ is computed. Item $x$ is assigned to class $C_i$ if
       $$i = \operatorname{argmax}_j p(C_j \mid x) = \operatorname{argmax}_j p(x \mid \theta_j^{ML})\, p(C_j)$$
    2. Maximum a posteriori: $\theta_i^{MAP} = \operatorname{argmax}_\theta p(\theta \mid X_i, t_i)$ is computed. Item $x$ is assigned to class $C_i$ if
       $$i = \operatorname{argmax}_j p(C_j \mid x) = \operatorname{argmax}_j p(x \mid \theta_j^{MAP})\, p(C_j)$$
    3. Bayesian estimate: the distribution $p(\theta \mid X_i, t_i)$ is estimated for each class and, from it,
       $$p(x \mid C_i) = \int_\theta p(x \mid \theta)\, p(\theta \mid X_i, t_i)\, d\theta$$
       Item $x$ is assigned to class $C_i$ if
       $$i = \operatorname{argmax}_j p(C_j \mid x) = \operatorname{argmax}_j p(C_j)\, p(x \mid C_j) = \operatorname{argmax}_j p(C_j) \int_\theta p(x \mid \theta)\, p(\theta \mid X_j, t_j)\, d\theta$$
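
  As a concrete illustration of Case 2 (not part of the original slides), here is a minimal sketch of maximum-likelihood estimation with univariate Gaussian class-conditional densities followed by Bayes-rule classification; all function and variable names are illustrative assumptions.

  ```python
  import numpy as np

  # Minimal sketch: ML estimation of univariate Gaussian class-conditionals
  # p(x|C_i), followed by classification via argmax_j p(x|theta_j^ML) p(C_j).
  def fit_ml_gaussians(X, t):
      """For each class i, estimate (mean, variance) by maximum likelihood
      and the prior p(C_i) as the class frequency."""
      params = {}
      for i in np.unique(t):
          X_i = X[t == i]
          params[i] = (X_i.mean(), X_i.var(), len(X_i) / len(X))
      return params

  def classify(x, params):
      """Assign x to the class maximizing likelihood times prior."""
      def score(p):
          mu, var, prior = p
          lik = np.exp(-(x - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
          return lik * prior
      return max(params, key=lambda i: score(params[i]))

  # Usage: two classes drawn from different Gaussians
  rng = np.random.default_rng(0)
  X = np.concatenate([rng.normal(0, 1, 100), rng.normal(3, 1, 100)])
  t = np.concatenate([np.zeros(100, dtype=int), np.ones(100, dtype=int)])
  print(classify(2.5, fit_ml_gaussians(X, t)))  # likely prints 1
  ```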

  5. Probability distribution estimates: hypotheses.
  • Case 3. No knowledge of the probabilities is assumed: the class distributions $p(x \mid C_i)$ are estimated directly from the data.
  • In the previous cases, (parametric) models are used for a synthetic description of the data in $X, t$.
  • In this case, there are no models (and no parameters): training set items appear explicitly in the class distribution estimates.
  • These are denoted as non parametric models: indeed, an unbounded number of parameters is used.

  6. Histograms.
  • Elementary type of non parametric estimate.
  • The domain is partitioned into $m$ $d$-dimensional intervals (bins).
  • The probability $P_x$ that an item belongs to the bin containing item $x$ is estimated as $\frac{n(x)}{N}$, where $n(x)$ is the number of elements in that bin and $N$ is the total number of items.
  • The probability density in the interval corresponding to the bin containing $x$ is then estimated as the ratio between the above probability and the interval width $\Delta(x)$ (typically, a constant $\Delta$):
    $$p_H(x) = \frac{n(x)/N}{\Delta(x)} = \frac{n(x)}{N\,\Delta(x)}$$
  [Figure: histogram density estimates of the same sample on $[0, 1]$ with bin widths $\Delta = 0.04$, $\Delta = 0.08$, $\Delta = 0.25$.]
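
  A minimal sketch of the histogram estimator above (illustrative code, not from the slides): bin counts are divided by $N\Delta$ to produce a density; the one-dimensional binning choices here are assumptions.

  ```python
  import numpy as np

  def histogram_density(data, x, n_bins=10):
      """Histogram density estimate p_H(x) = n(x) / (N * delta), 1-D sketch."""
      lo, hi = data.min(), data.max()
      delta = (hi - lo) / n_bins
      counts, _ = np.histogram(data, bins=n_bins, range=(lo, hi))
      # index of the bin containing each query point (clipped to valid bins)
      idx = np.clip(np.floor((np.asarray(x) - lo) / delta), 0, n_bins - 1).astype(int)
      return counts[idx] / (len(data) * delta)

  rng = np.random.default_rng(0)
  sample = rng.normal(0.5, 0.1, 1000)
  print(histogram_density(sample, np.linspace(0.0, 1.0, 5), n_bins=25))
  ```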

  7. Histograms: problems.
  • The density estimate depends on the position of the first bin; in the case of multivariate data, also on the bin orientation.
  • The resulting estimate is not continuous.
  • Curse of dimensionality: the number of bins grows as $M^d$, a polynomial of order $d$ in the number $M$ of intervals per dimension; in high-dimensional spaces many bins may result empty, unless a large number of items is available.
  • In practice, histograms can be applied only to low-dimensional datasets (1 or 2 dimensions).
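
  To make the growth concrete (an added illustration): with $M = 10$ intervals per dimension, a $d = 10$-dimensional domain already requires $10^{10}$ bins, so even a training set of one million items would leave the vast majority of bins empty.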

  8. Kernel density estimators.
  • Probability that an item is in the region $R(x)$ containing $x$:
    $$P_x = \int_{R(x)} p(z)\, dz$$
  • Given $n$ items $x_1, x_2, \ldots, x_n$, the probability that $k$ among them are in $R(x)$ is given by the binomial distribution
    $$p(k) = \binom{n}{k} P_x^k (1 - P_x)^{n-k} = \frac{n!}{k!(n-k)!}\, P_x^k (1 - P_x)^{n-k}$$
  • Mean and variance of the ratio $r = \frac{k}{n}$ are
    $$E[r] = P_x \qquad \operatorname{var}[r] = \frac{P_x (1 - P_x)}{n}$$
  • $P_x$ is the expected fraction of items in $R(x)$, and the ratio $r$ is an estimate of it. As $n \to \infty$ the variance decreases and $r$ tends to $E[r] = P_x$. Hence, in general,
    $$r = \frac{k}{n} \simeq P_x$$
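
  A small numeric check (added for illustration): with $n = 1000$ items and $P_x = 0.1$, we expect $E[k] = 100$ items in $R(x)$, and the ratio $r = k/n$ has standard deviation $\sqrt{P_x(1 - P_x)/n} = \sqrt{0.09/1000} \approx 0.0095$, so $r$ is already a fairly tight estimate of $P_x$.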

  9. Nonparametric estimates.
  • Let the volume of $R(x)$ be sufficiently small. Then the density $p(x)$ is almost constant in the region and
    $$P_x = \int_{R(x)} p(z)\, dz \simeq p(x)\, V$$
    where $V$ is the volume of $R(x)$.
  • Since $P_x \simeq \frac{k}{n}$, it follows that
    $$p(x) \simeq \frac{k}{nV}$$
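
  For instance (an added numeric illustration): if $k = 12$ of $n = 1000$ items fall in a region of volume $V = 0.02$ around $x$, the estimate is $p(x) \simeq 12 / (1000 \cdot 0.02) = 0.6$.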

  10. Approaches to nonparametric estimates. Two alternative ways to exploit the estimate $p(x) \simeq \frac{k}{nV}$:
  1. Fix $V$ and derive $k$ from the data (kernel density estimation).
  2. Fix $k$ and derive $V$ from the data (K-nearest neighbor).
  It can be shown that, in both cases, under suitable conditions, the estimator tends to the true density $p(x)$ as $n \to \infty$.
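
  A minimal sketch of the second approach, K-nearest-neighbor density estimation in one dimension (illustrative code, not from the slides): $k$ is fixed and $V$ is taken as the length of the smallest interval around $x$ containing the $k$ nearest items.

  ```python
  import numpy as np

  def knn_density(data, x, k=10):
      """K-nearest-neighbor density estimate p(x) ~= k / (n * V),
      where V = 2 * (distance to the k-th nearest item). 1-D sketch."""
      data = np.asarray(data)
      x = np.atleast_1d(x)
      # distance from each query point to every item, then k-th smallest
      dists = np.abs(x[:, None] - data[None, :])
      r_k = np.sort(dists, axis=1)[:, k - 1]
      V = 2 * r_k
      return k / (len(data) * V)

  rng = np.random.default_rng(0)
  sample = rng.normal(0, 1, 2000)
  print(knn_density(sample, [0.0, 1.0, 3.0], k=50))
  ```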

  11. Kernel density estimation: Parzen windows.
  • Region associated to a point $x$: hypercube with edge length $h$ (and volume $h^d$) centered on $x$.
  • A kernel function $k(u)$ (Parzen window) is used to count the number of items in the unit hypercube centered on the origin $0$:
    $$k(u) = \begin{cases} 1 & |u_i| \le 1/2, \ i = 1, \ldots, d \\ 0 & \text{otherwise} \end{cases}$$
  • As a consequence, $k\!\left(\frac{x - x'}{h}\right) = 1$ iff $x'$ is in the hypercube of edge length $h$ centered on $x$.
  • The number of items in the hypercube is then
    $$K = \sum_{i=1}^{n} k\!\left(\frac{x - x_i}{h}\right)$$

  12. Kernel density estimation: Parzen windows.
  • The estimated density is
    $$p(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^d}\, k\!\left(\frac{x - x_i}{h}\right)$$
  • Since
    $$k(u) \ge 0 \qquad \text{and} \qquad \int k(u)\, du = 1$$
    it follows that
    $$k\!\left(\frac{x - x_i}{h}\right) \ge 0 \qquad \text{and} \qquad \int k\!\left(\frac{x - x_i}{h}\right) dx = h^d$$

  13. Kernel density estimation: Parzen windows. As a consequence, it results that $p(x)$ is a probability density. In fact,
  $$p(x) = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^d}\, k\!\left(\frac{x - x_i}{h}\right) \ge 0$$
  and
  $$\int p(x)\, dx = \frac{1}{n} \sum_{i=1}^{n} \frac{1}{h^d} \int k\!\left(\frac{x - x_i}{h}\right) dx = \frac{1}{n h^d} \sum_{i=1}^{n} \int k\!\left(\frac{x - x_i}{h}\right) dx = \frac{1}{n h^d}\, n h^d = 1$$
  Clearly, the window size $h$ has a relevant effect on the estimate.
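
  A minimal sketch of the Parzen-window estimator with the hypercube kernel above (illustrative code, not from the slides); it works for $d$-dimensional data.

  ```python
  import numpy as np

  def parzen_window_density(data, x, h=0.5):
      """Parzen-window estimate p(x) = (1/n) * sum_i (1/h^d) * k((x - x_i)/h),
      with k(u) = 1 iff |u_j| <= 1/2 for every coordinate j (hypercube kernel)."""
      data = np.atleast_2d(data)                      # shape (n, d)
      x = np.atleast_2d(x)                            # shape (q, d)
      n, d = data.shape
      u = (x[:, None, :] - data[None, :, :]) / h      # shape (q, n, d)
      inside = np.all(np.abs(u) <= 0.5, axis=2)       # k((x - x_i)/h) per pair
      return inside.sum(axis=1) / (n * h ** d)

  rng = np.random.default_rng(0)
  sample = rng.normal(0, 1, size=(1000, 2))           # 2-D Gaussian sample
  print(parzen_window_density(sample, [[0.0, 0.0], [2.0, 2.0]], h=0.5))
  ```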

  14. Kernel density estimation: Parzen windows. [Figure: Parzen-window estimates of the same sample with window sizes $h = 2$, $h = 1$, and $h = \varepsilon$.]

  15. Kernels and smoothing. Drawbacks of the Parzen window:
  1. discontinuity of the estimates;
  2. items in a region centered on $x$ have uniform weights: their distance from $x$ is not taken into account.
  Solution. Use smooth kernel functions $\kappa_h(u)$ that assign larger weights to points nearer to the origin. Assumed characteristics of $\kappa_h(u)$:
  $$\int \kappa_h(x)\, dx = 1 \qquad \int x\, \kappa_h(x)\, dx = 0 \qquad \int x^2\, \kappa_h(x)\, dx > 0$$

  16. Kernels and smoothing. Usually kernels are based on smooth radial functions (functions of the distance from the origin):
  1. Gaussian: $\kappa(u) = \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{u^2}{2\sigma^2}}$, unlimited support
  2. Epanechnikov: $\kappa(u) = \frac{3}{2}\left(\frac{1}{2} - u^2\right)$ for $|u| \le \frac{1}{2}$, limited support
  3. ...
  [Figure: the hypercube window $k(u)$ and two smooth kernels $\kappa(u)$, plotted for $u \in [-\frac{1}{2}, \frac{1}{2}]$.]
  The resulting estimate:
  $$p(x) = \frac{1}{nh} \sum_{i=1}^{n} \kappa\!\left(\frac{x - x_i}{h}\right) = \frac{1}{n} \sum_{i=1}^{n} \kappa_h(x - x_i)$$
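
  A minimal sketch of the smooth-kernel estimate with a Gaussian kernel (illustrative code, not from the slides); the bandwidth h plays the role of the window size.

  ```python
  import numpy as np

  def gaussian_kde(data, x, h=0.3):
      """Smooth kernel estimate p(x) = (1/(n*h)) * sum_i kappa((x - x_i)/h)
      with the Gaussian kernel kappa(u) = exp(-u^2/2) / sqrt(2*pi). 1-D sketch."""
      data = np.asarray(data)
      x = np.atleast_1d(x)
      u = (x[:, None] - data[None, :]) / h
      kappa = np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
      return kappa.sum(axis=1) / (len(data) * h)

  rng = np.random.default_rng(0)
  sample = np.concatenate([rng.normal(-2, 0.5, 500), rng.normal(2, 1.0, 500)])
  for h in (1.0, 0.5, 0.25):                 # smaller h gives a spikier estimate
      print(h, gaussian_kde(sample, [-2.0, 0.0, 2.0], h=h))
  ```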

  17. Kernels and smoothing. [Figure: kernel density estimate $p(x)$ as a function of $x$, with $h = 1$.]

  18. Kernels and smoothing. [Figure: kernel density estimate $p(x)$ as a function of $x$, with $h = 2$.]

  19. Kernels and smoothing. [Figure: kernel density estimate $p(x)$ as a function of $x$, with $h = 0.5$.]

  20. Kernels and smoothing. [Figure: kernel density estimate $p(x)$ as a function of $x$, with $h = 0.25$.]

  21. Kernel regression. Kernel smoothing methods can also be applied to regression: in this case, the value corresponding to any item $x$ is predicted by referring to items in the training set (and in particular to the items which are closer to $x$). In this case, the conditional expectation
  $$f(x) = E[y \mid x] = \int y\, p(y \mid x)\, dy = \int y\, \frac{p(x, y)}{p(x)}\, dy = \frac{\int y\, p(x, y)\, dy}{p(x)} = \frac{\int y\, p(x, y)\, dy}{\int p(x, y)\, dy}$$
  should be returned. Applying kernels, we have
  $$p(x, y) \approx \frac{1}{n} \sum_{i=1}^{n} \kappa_h(x - x_i)\, \kappa_h(y - t_i)$$

  22. Kernel regression. This results in
  $$f(x) = \frac{\frac{1}{n} \sum_{i=1}^{n} \int y\, \kappa_h(x - x_i)\, \kappa_h(y - t_i)\, dy}{\frac{1}{n} \sum_{i=1}^{n} \int \kappa_h(x - x_i)\, \kappa_h(y - t_i)\, dy} = \frac{\sum_{i=1}^{n} \kappa_h(x - x_i) \int y\, \kappa_h(y - t_i)\, dy}{\sum_{i=1}^{n} \kappa_h(x - x_i) \int \kappa_h(y - t_i)\, dy}$$
  and, since $\int \kappa_h(y - t_i)\, dy = 1$ and $\int y\, \kappa_h(y - t_i)\, dy = t_i$, we get
  $$f(x) = \frac{\sum_{i=1}^{n} \kappa_h(x - x_i)\, t_i}{\sum_{i=1}^{n} \kappa_h(x - x_i)}$$

  23. Kernel regression. By setting
  $$w_i(x) = \frac{\kappa_h(x - x_i)}{\sum_{j=1}^{n} \kappa_h(x - x_j)}$$
  we can write
  $$f(x) = \sum_{i=1}^{n} w_i(x)\, t_i$$
  that is, the predicted value is computed as a linear combination of all target values, weighted by kernels (Nadaraya-Watson).
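
  A minimal sketch of the Nadaraya-Watson estimator above, using a Gaussian kernel (illustrative code, not from the slides).

  ```python
  import numpy as np

  def nadaraya_watson(x_train, t_train, x, h=0.3):
      """Nadaraya-Watson prediction f(x) = sum_i w_i(x) t_i, with weights
      w_i(x) = kappa_h(x - x_i) / sum_j kappa_h(x - x_j) (Gaussian kernel, 1-D)."""
      x_train = np.asarray(x_train)
      t_train = np.asarray(t_train)
      x = np.atleast_1d(x)
      kappa = np.exp(-0.5 * ((x[:, None] - x_train[None, :]) / h) ** 2)
      weights = kappa / kappa.sum(axis=1, keepdims=True)
      return weights @ t_train

  rng = np.random.default_rng(0)
  x_train = np.sort(rng.uniform(0, 2 * np.pi, 200))
  t_train = np.sin(x_train) + rng.normal(0, 0.2, 200)
  print(nadaraya_watson(x_train, t_train, [0.5, np.pi, 5.0], h=0.3))
  ```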

  24. Locally weighted regression. In the Nadaraya-Watson model, the prediction is performed by means of a weighted combination of constant values (the target values in the training set). Locally weighted regression improves on that approach by referring to a weighted version of the sum-of-squared-differences loss function used in regression. If a value $y$ has to be predicted for a given item $x$, a “local” version of the loss function is considered, with weights dependent on the “distance” between $x$ and $x_i$:
  $$L(x) = \sum_{i=1}^{n} \kappa_h(x - x_i)\, (\mathbf{w}^T x_i - t_i)^2$$
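
  A minimal sketch of locally weighted linear regression under the usual reading of the loss above (an assumption on my part, not spelled out on the slide): for each query point $x$, a linear model $\mathbf{w}$ is fit by minimizing $L(x)$ via weighted least squares, and the prediction is $\mathbf{w}^T x$, with a bias term added.

  ```python
  import numpy as np

  def locally_weighted_regression(x_train, t_train, x_query, h=0.5):
      """For each query x, minimize L(x) = sum_i kappa_h(x - x_i) (w^T x_i - t_i)^2
      by weighted least squares and return w^T x. 1-D inputs with a bias feature."""
      X = np.column_stack([np.ones_like(x_train), x_train])   # design matrix (n, 2)
      preds = []
      for x in np.atleast_1d(x_query):
          k = np.exp(-0.5 * ((x - x_train) / h) ** 2)          # kernel weights
          # weighted normal equations: (X^T K X) w = X^T K t
          W = X * k[:, None]
          w = np.linalg.solve(X.T @ W, W.T @ t_train)
          preds.append(w[0] + w[1] * x)
      return np.array(preds)

  rng = np.random.default_rng(0)
  x_train = np.sort(rng.uniform(0, 2 * np.pi, 200))
  t_train = np.sin(x_train) + rng.normal(0, 0.2, 200)
  print(locally_weighted_regression(x_train, t_train, [0.5, np.pi, 5.0], h=0.5))
  ```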
