Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2018
Outline } Non-parametric approach } Unsupervised: Non-parametric density estimation } Parzen Windows } Kn-Nearest Neighbor Density Estimation } Supervised: Instance-based learners } Classification ¨ kNN classification ¨ Weighted (or kernel) kNN } Regression ¨ kNN regression ¨ Locally linear weighted regression
Introduction } Estimation of arbitrary density functions } Parametric density functions cannot usually fit the densities we encounter in practical problems. } e.g., most parametric densities are unimodal. } Non-parametric methods do not assume that the model (form) of the underlying densities is known in advance } Non-parametric methods (for classification) can be categorized into } Generative ¨ Estimate $p(\boldsymbol{x}|\mathcal{C}_i)$ from $\mathcal{D}_i$ using non-parametric density estimation } Discriminative ¨ Estimate $p(\mathcal{C}_i|\boldsymbol{x})$ from $\mathcal{D}$
Parametric vs. nonparametric methods } Parametric methods need to find parameters from data and then use the inferred parameters to decide on new data points } Learning: finding parameters from data } Nonparametric methods } Training examples are explicitly used } Training phase is not required } Both supervised and unsupervised learning methods can be categorized into parametric and non-parametric methods.
Histogram approximation idea } Histogram approximation of an unknown pdf: $P(b_m) \approx k_n(b_m)/n$, $\; m = 1, \dots, M$ } $k_n(b_m)$: number of samples (among the $n$ ones) lying in the bin $b_m$ } The corresponding estimated pdf: $\hat{p}(x) = \dfrac{P(b_m)}{h}$ for $|x - \bar{x}_{b_m}| \le \dfrac{h}{2}$, where $\bar{x}_{b_m}$ is the mid-point of the bin $b_m$ and $h$ is the bin width
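A minimal sketch of this histogram estimate on synthetic 1-D data; the function and variable names below are illustrative, not part of the slides.

```python
import numpy as np

# Sketch of the histogram density estimate p_hat(x) = k_n(b_m) / (n * h):
# count samples per bin and divide by n times the bin width.
def histogram_density(samples, num_bins=10):
    n = len(samples)
    edges = np.linspace(samples.min(), samples.max(), num_bins + 1)
    counts, _ = np.histogram(samples, bins=edges)   # k_n(b_m) for each bin
    h = edges[1] - edges[0]                         # bin width
    return edges, counts / (n * h)                  # piecewise-constant pdf estimate

rng = np.random.default_rng(0)
x = rng.normal(size=500)
edges, dens = histogram_density(x, num_bins=20)
print(dens.sum() * (edges[1] - edges[0]))           # integrates to ~1
```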
Non-parametric density estimation } Probability of falling in a region $\mathcal{R}$: $P = \int_{\mathcal{R}} p(\boldsymbol{x}')\, d\boldsymbol{x}'$ (a smoothed version of $p(\boldsymbol{x})$) } $\mathcal{D} = \{\boldsymbol{x}^{(i)}\}_{i=1}^{n}$: a set of samples drawn i.i.d. according to $p(\boldsymbol{x})$ } The probability that $k$ of the $n$ samples fall in $\mathcal{R}$: $P_k = \binom{n}{k} P^k (1-P)^{n-k}$, with $E[k] = nP$ } This binomial distribution peaks sharply about the mean: $k \approx nP \Rightarrow \frac{k}{n}$ can be used as an estimate for $P$ } More accurate for larger $n$
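A quick numerical sketch (with an assumed value of $P$) of how the fraction $k/n$ concentrates around $P$ as $n$ grows.

```python
import numpy as np

# Sketch: k ~ Binomial(n, P), so k/n gets closer to P for larger n.
rng = np.random.default_rng(0)
P = 0.3                              # assumed probability of falling in region R
for n in (10, 100, 10_000):
    k = rng.binomial(n, P)           # number of the n samples that fall in R
    print(n, k / n)                  # estimate of P improves with n
```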
Non-parametric density estimation } We can estimate the smoothed $p(\boldsymbol{x})$ by estimating $P$: } Assumptions: $p(\boldsymbol{x})$ is continuous and the region $\mathcal{R}$ enclosing $\boldsymbol{x}$ is so small that $p$ is nearly constant in it: $P = \int_{\mathcal{R}} p(\boldsymbol{x}')\, d\boldsymbol{x}' = p(\boldsymbol{x}) \times V$, where $V = \mathrm{Vol}(\mathcal{R})$ and $\boldsymbol{x} \in \mathcal{R}$, so $p(\boldsymbol{x}) = \dfrac{P}{V} \approx \dfrac{k/n}{V}$ } Let $V$ approach zero if we want to find $p(\boldsymbol{x})$ instead of the averaged version.
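A minimal numerical sketch of $p(\boldsymbol{x}) \approx \frac{k/n}{V}$, counting the samples that land in a small cube around the query point; the names and the choice of region are illustrative assumptions.

```python
import numpy as np

# Sketch of p(x) ≈ (k/n)/V with R chosen as a small cube of side h around x.
def local_density(samples, x, h=0.2):
    n, d = samples.shape
    inside = np.all(np.abs(samples - x) <= h / 2, axis=1)
    k = inside.sum()                 # number of samples falling in the region R
    V = h ** d                       # volume of R
    return (k / n) / V

rng = np.random.default_rng(0)
data = rng.normal(size=(100_000, 1))
print(local_density(data, np.array([0.0])))   # close to N(0,1) density at 0 ≈ 0.399
```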
Necessary conditions for convergence } $p_n(\boldsymbol{x})$ is the estimate of $p(\boldsymbol{x})$ using $n$ samples: $p_n(\boldsymbol{x}) = \dfrac{k_n/n}{V_n}$ } $V_n$: the volume of the region around $\boldsymbol{x}$ } $k_n$: the number of samples falling in the region } Necessary conditions for convergence of $p_n(\boldsymbol{x})$ to $p(\boldsymbol{x})$: } $\lim_{n\to\infty} V_n = 0$ } $\lim_{n\to\infty} k_n = \infty$ } $\lim_{n\to\infty} k_n/n = 0$
Non-parametric density estimation: Main approaches } Two approaches to satisfying the conditions: } k-nearest neighbor density estimator: fix $k$ and determine the value of $V$ from the data } Volume grows until it contains the $k$ neighbors of $\boldsymbol{x}$ } converges to the true probability density in the limit $n \to \infty$ when $k$ grows suitably with $n$ (e.g., $k_n = k_1\sqrt{n}$) } Kernel density estimator (Parzen window): fix $V$ and determine $k$ from the data } Number of points falling inside the volume can vary from point to point } converges to the true probability density in the limit $n \to \infty$ when $V$ shrinks suitably with $n$ (e.g., $V_n = V_1/\sqrt{n}$)
Parzen window } Extension of the histogram idea: } Hyper-cubes with side length $h$ (i.e., volume $h^d$) are centered on the samples } Hypercube as a simple window function: $\varphi(\boldsymbol{u}) = \begin{cases} 1 & \text{if } |u_1| \le \frac{1}{2} \wedge \dots \wedge |u_d| \le \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}$ } $p_n(\boldsymbol{x}) = \dfrac{1}{n} \times \dfrac{1}{h_n^d} \sum_{i=1}^{n} \varphi\!\left(\dfrac{\boldsymbol{x} - \boldsymbol{x}^{(i)}}{h_n}\right) = \dfrac{k_n/n}{V_n}$ } $k_n = \sum_{i=1}^{n} \varphi\!\left(\dfrac{\boldsymbol{x} - \boldsymbol{x}^{(i)}}{h_n}\right)$: number of samples in the hypercube around $\boldsymbol{x}$ } $V_n = h_n^d$
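A minimal sketch of this hypercube Parzen estimate; the function names are illustrative, and the check at the end uses the fact that a 2-D standard normal has density $1/(2\pi) \approx 0.159$ at the origin.

```python
import numpy as np

# Sketch of p_n(x) = (1/(n*h^d)) * sum_i phi((x - x_i)/h) with the unit hypercube window.
def phi(u):
    # 1 if every coordinate of u lies in [-1/2, 1/2], else 0
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_hypercube(samples, x, h):
    n, d = samples.shape
    k = phi((x - samples) / h).sum()     # samples inside the cube of side h around x
    return k / (n * h ** d)

rng = np.random.default_rng(1)
data = rng.normal(size=(50_000, 2))
print(parzen_hypercube(data, np.zeros(2), h=0.3))   # roughly 1/(2*pi) ≈ 0.159
```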
Window function } Necessary conditions on the window function so that the result is a legitimate density function: } $\varphi(\boldsymbol{u}) \ge 0$ } $\int \varphi(\boldsymbol{u})\, d\boldsymbol{u} = 1$ } Windows are also called kernels or potential functions.
Density estimation: non-parametric } Gaussian window: $\hat{p}_n(x) = \dfrac{1}{n} \sum_{i=1}^{n} \dfrac{1}{\sqrt{2\pi}\,h}\, e^{-\frac{(x - x^{(i)})^2}{2h^2}} = \dfrac{1}{n} \sum_{i=1}^{n} \mathcal{N}(x \mid x^{(i)}, h^2)$ } Example samples: 1, 1.2, 1.4, 1.5, 1.6, 2, 2.1, 2.15, 4, 4.3, 4.7, 4.75, 5 } With $\sigma = h$: $\hat{p}(x) = \dfrac{1}{n} \sum_{i=1}^{n} \mathcal{N}(x \mid x^{(i)}, \sigma^2) = \dfrac{1}{n} \sum_{i=1}^{n} \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - x^{(i)})^2}{2\sigma^2}}$ } Choice of $\sigma$ is crucial.
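A minimal sketch of this Gaussian-window estimate on the sample values listed above, evaluated at a single query point for the four widths compared on the following slide; the function name and query point are illustrative.

```python
import numpy as np

# Sketch of p_hat(x) = (1/n) * sum_i N(x | x_i, sigma^2) with a Gaussian window.
def gaussian_kde(samples, x, sigma):
    diffs = x - samples
    kernels = np.exp(-diffs**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return kernels.mean()

data = np.array([1, 1.2, 1.4, 1.5, 1.6, 2, 2.1, 2.15, 4, 4.3, 4.7, 4.75, 5])
for sigma in (0.02, 0.1, 0.5, 1.5):           # widths compared on the next slide
    print(sigma, gaussian_kde(data, x=2.0, sigma=sigma))
```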
Density estimation: non-parametric } [Figure: Gaussian-window density estimates of the above samples for $\sigma = 0.02$, $\sigma = 0.1$, $\sigma = 0.5$, $\sigma = 1.5$]
Window (or kernel) function: Width parameter } $p_n(\boldsymbol{x}) = \dfrac{1}{n} \times \dfrac{1}{h_n^d} \sum_{i=1}^{n} \varphi\!\left(\dfrac{\boldsymbol{x} - \boldsymbol{x}^{(i)}}{h_n}\right)$ } Choosing $h_n$: } Too large: low resolution } Too small: much variability [Duda, Hart, and Stork] } For unlimited $n$, by letting $V_n$ slowly approach zero as $n$ increases, $p_n(\boldsymbol{x})$ converges to $p(\boldsymbol{x})$
Parzen Window: Example } Window function $\varphi(u) = \mathcal{N}(0,1)$, true density $p(x) = \mathcal{N}(0,1)$, width $h_n = h_1/\sqrt{n}$ [Duda, Hart, and Stork]
Width parameter } For fixed $n$, a smaller $h$ results in higher variance while a larger $h$ leads to higher bias. } For a fixed $h$, the variance decreases as the number of sample points $n$ tends to infinity } for a large enough number of samples, the smaller $h$ is, the better the accuracy of the resulting estimate } In practice, where only a finite number of samples is available, a compromise between $h$ and $n$ must be made. } $h$ can be set using techniques like cross-validation when the density estimate is used for learning tasks such as classification
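The slide only says that $h$ can be set with techniques like cross-validation; the sketch below assumes one concrete choice, a leave-one-out log-likelihood criterion with a Gaussian window, and uses illustrative names throughout.

```python
import numpy as np

# Sketch: pick h that maximizes the leave-one-out log-likelihood of the data
# under the Gaussian-window Parzen estimate (an assumed selection criterion).
def loo_log_likelihood(samples, h):
    n = len(samples)
    diffs = samples[:, None] - samples[None, :]
    K = np.exp(-diffs**2 / (2 * h**2)) / (np.sqrt(2 * np.pi) * h)
    np.fill_diagonal(K, 0.0)                  # leave each point out of its own estimate
    p_loo = K.sum(axis=1) / (n - 1)
    return np.sum(np.log(p_loo + 1e-300))

rng = np.random.default_rng(2)
data = rng.normal(size=200)
candidates = [0.05, 0.1, 0.2, 0.5, 1.0]
best_h = max(candidates, key=lambda h: loo_log_likelihood(data, h))
print("selected h:", best_h)
```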
Practical issues: Curse of dimensionality } Large $n$ is necessary to find an acceptable density estimate in high-dimensional feature spaces } $n$ must grow exponentially with the dimensionality $d$ } If $n$ equidistant points are required to densely fill a one-dim interval, $n^d$ points are needed to fill the corresponding $d$-dim hypercube ¨ We need an exponentially large quantity of training data to ensure that the cells are not empty } Also complexity (time and storage) requirements } [Figure: grids of cells for $d = 1$, $d = 2$, $d = 3$]
$k_n$-nearest neighbor estimation } Cell volume is a function of the point location } To estimate $p(\boldsymbol{x})$, let the cell around $\boldsymbol{x}$ grow until it captures $k_n$ samples, called the $k_n$ nearest neighbors of $\boldsymbol{x}$ } Two possibilities can occur: } high density near $\boldsymbol{x}$ ⇒ the cell will be small, which provides good resolution } low density near $\boldsymbol{x}$ ⇒ the cell will grow large and stop only when higher density regions are reached
$k_n$-nearest neighbor estimation } $p_n(\boldsymbol{x}) = \dfrac{k_n/n}{V_n} \;\Rightarrow\; V_n \approx \dfrac{k_n/n}{p(\boldsymbol{x})}$ } A family of estimates is obtained by setting $k_n = k_1\sqrt{n}$ and choosing different values for $k_1$: } for $k_1 = 1$: $V_n \approx \dfrac{1}{\sqrt{n}\, p(\boldsymbol{x})}$ } $V_n$ is a function of $\boldsymbol{x}$
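A minimal 1-D sketch of this estimator, assuming Euclidean distance and $k_1 = 1$; the function and variable names are illustrative.

```python
import numpy as np

# Sketch of the k_n-NN density estimate in 1-D: grow an interval around x until
# it contains k_n samples, then p_n(x) = (k_n/n) / V_n with V_n = 2r.
def knn_density_1d(samples, x, k):
    n = len(samples)
    r = np.sort(np.abs(samples - x))[k - 1]   # distance to the k-th nearest neighbour
    V = 2 * r                                 # length of the cell (-r, r) around x
    return (k / n) / V

rng = np.random.default_rng(3)
data = rng.normal(size=1_000)
k = int(np.sqrt(len(data)))                   # k_n = k_1 * sqrt(n) with k_1 = 1
print(knn_density_1d(data, 0.0, k))           # close to N(0,1) density at 0 ≈ 0.399
```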
$k_n$-Nearest Neighbor Estimation: Example } Discontinuities in the slopes [Bishop]
$k_n$-Nearest Neighbor Estimation: Example } $k_n = \sqrt{n}$; for $n = 1$: $p_1(x) = \dfrac{1}{2\,|x - x^{(1)}|}$ [Duda, Hart, and Stork]
Non-parametric density estimation: Summary } Generality of distributions } With enough samples, convergence to an arbitrarily complicated target density can be obtained. } The number of required samples must be very large to assure convergence } grows exponentially with the dimensionality of the feature space } These methods are very sensitive to the choice of window width or number of nearest neighbors } There may be severe requirements for computation time and storage (needed to save all training samples). } 'training' phase simply requires storage of the training set. } computational cost of evaluating $p(\boldsymbol{x})$ grows linearly with the size of the data set.
Nonparametric learners } Memory-based or instance-based learners } lazy learning: (almost) all the work is done at test time. } Generic description: } Memorize the training set $(\boldsymbol{x}^{(1)}, y^{(1)}), \dots, (\boldsymbol{x}^{(n)}, y^{(n)})$. } Given a test point $\boldsymbol{x}$, predict: $\hat{y} = f\big(\boldsymbol{x};\, \boldsymbol{x}^{(1)}, y^{(1)}, \dots, \boldsymbol{x}^{(n)}, y^{(n)}\big)$. } $f$ is typically expressed in terms of the similarity of the test sample $\boldsymbol{x}$ to the training samples $\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(n)}$
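A minimal sketch of such a memory-based learner, taking kNN classification with a majority vote as the function $f$; the class and method names are illustrative.

```python
import numpy as np

# Sketch of a lazy (memory-based) kNN classifier: "training" only memorizes
# the data, and all the work happens at prediction time.
class KNNClassifier:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):                      # memorize (x_i, y_i), i = 1..n
        self.X, self.y = X, y
        return self

    def predict(self, x):                     # majority vote over k nearest neighbours
        dists = np.linalg.norm(self.X - x, axis=1)
        nearest = np.argsort(dists)[:self.k]
        labels, counts = np.unique(self.y[nearest], return_counts=True)
        return labels[np.argmax(counts)]

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(KNNClassifier(k=5).fit(X, y).predict(np.array([2.5, 2.5])))
```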
Parzen window & generative classification } If $\dfrac{\frac{1}{n_1}\sum_{\boldsymbol{x}^{(i)}\in\mathcal{D}_1}\varphi\!\left(\frac{\boldsymbol{x}-\boldsymbol{x}^{(i)}}{h}\right)\times P(\mathcal{C}_1)}{\frac{1}{n_2}\sum_{\boldsymbol{x}^{(i)}\in\mathcal{D}_2}\varphi\!\left(\frac{\boldsymbol{x}-\boldsymbol{x}^{(i)}}{h}\right)\times P(\mathcal{C}_2)} > 1$ decide $\mathcal{C}_1$ } otherwise decide $\mathcal{C}_2$ } $n_c = |\mathcal{D}_c|$ ($c = 1,2$): number of training samples in class $\mathcal{C}_c$ } $\mathcal{D}_c$: set of training samples labeled as $\mathcal{C}_c$ } For large $n$, it has both high time and memory requirements
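A minimal sketch of this decision rule, assuming a Gaussian window for $\varphi$ and using the empirical class frequencies $n_c/n$ as priors; all names are illustrative.

```python
import numpy as np

# Sketch: compare class-conditional Parzen estimates weighted by class priors.
def parzen_class_conditional(X_c, x, h):
    d = X_c.shape[1]
    sq = np.sum((x - X_c) ** 2, axis=1)
    return np.mean(np.exp(-sq / (2 * h**2)) / (np.sqrt(2 * np.pi) * h) ** d)

def classify(x, D1, D2, h):
    n = len(D1) + len(D2)
    g1 = parzen_class_conditional(D1, x, h) * (len(D1) / n)   # p(x|C1) * P(C1)
    g2 = parzen_class_conditional(D2, x, h) * (len(D2) / n)   # p(x|C2) * P(C2)
    return 1 if g1 > g2 else 2

rng = np.random.default_rng(5)
D1 = rng.normal(0, 1, (60, 2))
D2 = rng.normal(3, 1, (40, 2))
print(classify(np.array([0.5, 0.5]), D1, D2, h=0.5))          # expected to print 1
```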
Parzen window & generative classification: Example } [Figure: Parzen-window classification with a smaller $h$ vs. a larger $h$] [Duda, Hart, and Stork]
Estimate the posterior } Consider a cell of volume $V$ around $\boldsymbol{x}$ containing $k$ of the $n$ training samples, $k_c$ of which belong to class $c$ (with $n_c$ samples of class $c$ in total): $p(\boldsymbol{x} \mid y = c) = \dfrac{k_c}{n_c V}$, $\quad p(y = c) = \dfrac{n_c}{n}$, $\quad p(\boldsymbol{x}) = \dfrac{k}{nV}$ } $p(y = c \mid \boldsymbol{x}) = \dfrac{p(\boldsymbol{x} \mid y = c)\, p(y = c)}{p(\boldsymbol{x})} = \dfrac{k_c}{k}$
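A minimal sketch of the resulting rule $p(y = c \mid \boldsymbol{x}) \approx k_c/k$: the posterior for class $c$ is the fraction of the $k$ nearest neighbours of $\boldsymbol{x}$ carrying label $c$; names are illustrative.

```python
import numpy as np

# Sketch of the kNN posterior estimate: among the k nearest neighbours of x,
# return the fraction belonging to each class, i.e. k_c / k.
def knn_posterior(X, y, x, k=10):
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return {c: float(np.mean(y[nearest] == c)) for c in np.unique(y)}

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_posterior(X, y, np.array([1.0, 1.0]), k=15))
```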