Instance-based Learning CE-717: Machine Learning Sharif University of Technology M. Soleymani Fall 2018
Outline } Non-parametric approach } Unsupervised: Non-parametric density estimation } Parzen Windows } Kn-Nearest Neighbor Density Estimation } Supervised: Instance-based learners } Classification ¨ kNN classification ¨ Weighted (or kernel) kNN } Regression ¨ kNN regression ¨ Locally linear weighted regression
Introduction } Estimation of arbitrary density functions } Parametric density functions cannot usually fit the densities we encounter in practical problems. } e.g., most parametric densities are unimodal. } Non-parametric methods do not assume that the model (form) of the underlying densities is known in advance } Non-parametric methods (for classification) can be categorized into } Generative ¨ Estimate $p(\boldsymbol{x}|\mathcal{C}_i)$ from $\mathcal{D}_i$ using non-parametric density estimation } Discriminative ¨ Estimate $p(\mathcal{C}_i|\boldsymbol{x})$ from $\mathcal{D}$
Parametric vs. nonparametric methods } Parametric methods need to find parameters from data and then use the inferred parameters to decide on new data points } Learning: finding parameters from data } Nonparametric methods } Training examples are explicitly used } Training phase is not required } Both supervised and unsupervised learning methods can be categorized into parametric and non-parametric methods.
Histogram approximation idea } Histogram approximation of an unknown pdf: $P(b_m) \approx k_n(b_m)/n$, $\; m = 1, \dots, M$ } $k_n(b_m)$: number of samples (among the $n$ ones) lying in the bin $b_m$ } The corresponding estimated pdf: $\hat{p}(x) = \dfrac{P(b_m)}{h}$ for $|x - \bar{x}_{b_m}| \le \dfrac{h}{2}$, where $\bar{x}_{b_m}$ is the mid-point of the bin $b_m$ and $h$ is the bin width
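A minimal sketch of this histogram estimate on synthetic 1-D data; the function and variable names below are illustrative, not part of the slides.

```python
import numpy as np

# Sketch of the histogram density estimate p_hat(x) = k_n(b_m) / (n * h):
# count samples per bin and divide by n times the bin width.
def histogram_density(samples, num_bins=10):
    n = len(samples)
    edges = np.linspace(samples.min(), samples.max(), num_bins + 1)
    counts, _ = np.histogram(samples, bins=edges)   # k_n(b_m) for each bin
    h = edges[1] - edges[0]                         # bin width
    return edges, counts / (n * h)                  # piecewise-constant pdf estimate

rng = np.random.default_rng(0)
x = rng.normal(size=500)
edges, dens = histogram_density(x, num_bins=20)
print(dens.sum() * (edges[1] - edges[0]))           # integrates to ~1
```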
Non-parametric density estimation } Probability of falling in a region $\mathcal{R}$: $P = \int_{\mathcal{R}} p(\boldsymbol{x}')\, d\boldsymbol{x}'$ (a smoothed version of $p(\boldsymbol{x})$) } $\mathcal{D} = \{\boldsymbol{x}^{(i)}\}_{i=1}^{n}$: a set of samples drawn i.i.d. according to $p(\boldsymbol{x})$ } The probability that $k$ of the $n$ samples fall in $\mathcal{R}$: $P_k = \binom{n}{k} P^k (1-P)^{n-k}$, with $E[k] = nP$ } This binomial distribution peaks sharply about the mean: $k \approx nP \Rightarrow \frac{k}{n}$ can be used as an estimate for $P$ } More accurate for larger $n$
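A quick numerical sketch (with an assumed value of $P$) of how the fraction $k/n$ concentrates around $P$ as $n$ grows.

```python
import numpy as np

# Sketch: k ~ Binomial(n, P), so k/n gets closer to P for larger n.
rng = np.random.default_rng(0)
P = 0.3                              # assumed probability of falling in region R
for n in (10, 100, 10_000):
    k = rng.binomial(n, P)           # number of the n samples that fall in R
    print(n, k / n)                  # estimate of P improves with n
```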
Non-parametric density estimation } We can estimate the smoothed $p(\boldsymbol{x})$ by estimating $P$: } Assumptions: $p(\boldsymbol{x})$ is continuous and the region $\mathcal{R}$ enclosing $\boldsymbol{x}$ is so small that $p$ is nearly constant in it: $P = \int_{\mathcal{R}} p(\boldsymbol{x}')\, d\boldsymbol{x}' = p(\boldsymbol{x}) \times V$, where $V = \mathrm{Vol}(\mathcal{R})$ and $\boldsymbol{x} \in \mathcal{R}$, so $p(\boldsymbol{x}) = \dfrac{P}{V} \approx \dfrac{k/n}{V}$ } Let $V$ approach zero if we want to find $p(\boldsymbol{x})$ instead of the averaged version.
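A minimal numerical sketch of $p(\boldsymbol{x}) \approx \frac{k/n}{V}$, counting the samples that land in a small cube around the query point; the names and the choice of region are illustrative assumptions.

```python
import numpy as np

# Sketch of p(x) ≈ (k/n)/V with R chosen as a small cube of side h around x.
def local_density(samples, x, h=0.2):
    n, d = samples.shape
    inside = np.all(np.abs(samples - x) <= h / 2, axis=1)
    k = inside.sum()                 # number of samples falling in the region R
    V = h ** d                       # volume of R
    return (k / n) / V

rng = np.random.default_rng(0)
data = rng.normal(size=(100_000, 1))
print(local_density(data, np.array([0.0])))   # close to N(0,1) density at 0 ≈ 0.399
```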
Necessary conditions for convergence } $p_n(\boldsymbol{x})$ is the estimate of $p(\boldsymbol{x})$ using $n$ samples: $p_n(\boldsymbol{x}) = \dfrac{k_n/n}{V_n}$ } $V_n$: the volume of the region around $\boldsymbol{x}$ } $k_n$: the number of samples falling in the region } Necessary conditions for convergence of $p_n(\boldsymbol{x})$ to $p(\boldsymbol{x})$: } $\lim_{n\to\infty} V_n = 0$ } $\lim_{n\to\infty} k_n = \infty$ } $\lim_{n\to\infty} k_n/n = 0$
Non-parametric density estimation: Main approaches } Two approaches to satisfying the conditions: } k-nearest neighbor density estimator: fix $k$ and determine the value of $V$ from the data } Volume grows until it contains the $k$ neighbors of $\boldsymbol{x}$ } converges to the true probability density in the limit $n \to \infty$ when $k$ grows suitably with $n$ (e.g., $k_n = k_1\sqrt{n}$) } Kernel density estimator (Parzen window): fix $V$ and determine $k$ from the data } Number of points falling inside the volume can vary from point to point } converges to the true probability density in the limit $n \to \infty$ when $V$ shrinks suitably with $n$ (e.g., $V_n = V_1/\sqrt{n}$)
Parzen window } Extension of the histogram idea: } Hyper-cubes with side length $h$ (i.e., volume $h^d$) are centered on the samples } Hypercube as a simple window function: $\varphi(\boldsymbol{u}) = \begin{cases} 1 & \text{if } |u_1| \le \frac{1}{2} \wedge \dots \wedge |u_d| \le \frac{1}{2} \\ 0 & \text{otherwise} \end{cases}$ } $p_n(\boldsymbol{x}) = \dfrac{1}{n} \times \dfrac{1}{h_n^d} \sum_{i=1}^{n} \varphi\!\left(\dfrac{\boldsymbol{x} - \boldsymbol{x}^{(i)}}{h_n}\right) = \dfrac{k_n/n}{V_n}$ } $k_n = \sum_{i=1}^{n} \varphi\!\left(\dfrac{\boldsymbol{x} - \boldsymbol{x}^{(i)}}{h_n}\right)$: number of samples in the hypercube around $\boldsymbol{x}$ } $V_n = h_n^d$
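A minimal sketch of this hypercube Parzen estimate; the function names are illustrative, and the check at the end uses the fact that a 2-D standard normal has density $1/(2\pi) \approx 0.159$ at the origin.

```python
import numpy as np

# Sketch of p_n(x) = (1/(n*h^d)) * sum_i phi((x - x_i)/h) with the unit hypercube window.
def phi(u):
    # 1 if every coordinate of u lies in [-1/2, 1/2], else 0
    return np.all(np.abs(u) <= 0.5, axis=-1).astype(float)

def parzen_hypercube(samples, x, h):
    n, d = samples.shape
    k = phi((x - samples) / h).sum()     # samples inside the cube of side h around x
    return k / (n * h ** d)

rng = np.random.default_rng(1)
data = rng.normal(size=(50_000, 2))
print(parzen_hypercube(data, np.zeros(2), h=0.3))   # roughly 1/(2*pi) ≈ 0.159
```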
Window function } Necessary conditions on the window function so that the result is a legitimate density function: } $\varphi(\boldsymbol{u}) \ge 0$ } $\int \varphi(\boldsymbol{u})\, d\boldsymbol{u} = 1$ } Windows are also called kernels or potential functions.
Density estimation: non-parametric } Gaussian window: $\hat{p}_n(x) = \dfrac{1}{n} \sum_{i=1}^{n} \dfrac{1}{\sqrt{2\pi}\,h}\, e^{-\frac{(x - x^{(i)})^2}{2h^2}} = \dfrac{1}{n} \sum_{i=1}^{n} \mathcal{N}(x \mid x^{(i)}, h^2)$ } Example samples: 1, 1.2, 1.4, 1.5, 1.6, 2, 2.1, 2.15, 4, 4.3, 4.7, 4.75, 5 } With $\sigma = h$: $\hat{p}(x) = \dfrac{1}{n} \sum_{i=1}^{n} \mathcal{N}(x \mid x^{(i)}, \sigma^2) = \dfrac{1}{n} \sum_{i=1}^{n} \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - x^{(i)})^2}{2\sigma^2}}$ } Choice of $\sigma$ is crucial.
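A minimal sketch of this Gaussian-window estimate on the sample values listed above, evaluated at a single query point for the four widths compared on the following slide; the function name and query point are illustrative.

```python
import numpy as np

# Sketch of p_hat(x) = (1/n) * sum_i N(x | x_i, sigma^2) with a Gaussian window.
def gaussian_kde(samples, x, sigma):
    diffs = x - samples
    kernels = np.exp(-diffs**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return kernels.mean()

data = np.array([1, 1.2, 1.4, 1.5, 1.6, 2, 2.1, 2.15, 4, 4.3, 4.7, 4.75, 5])
for sigma in (0.02, 0.1, 0.5, 1.5):           # widths compared on the next slide
    print(sigma, gaussian_kde(data, x=2.0, sigma=sigma))
```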
Density estimation: non-parametric } [Figure: Gaussian-window density estimates of the above samples for $\sigma = 0.02$, $\sigma = 0.1$, $\sigma = 0.5$, $\sigma = 1.5$]
Window (or kernel) function: Width parameter } $p_n(\boldsymbol{x}) = \dfrac{1}{n} \times \dfrac{1}{h_n^d} \sum_{i=1}^{n} \varphi\!\left(\dfrac{\boldsymbol{x} - \boldsymbol{x}^{(i)}}{h_n}\right)$ } Choosing $h_n$: } Too large: low resolution } Too small: much variability [Duda, Hart, and Stork] } For unlimited $n$, by letting $V_n$ slowly approach zero as $n$ increases, $p_n(\boldsymbol{x})$ converges to $p(\boldsymbol{x})$
Parzen Window: Example } Window function $\varphi(u) = \mathcal{N}(0,1)$, true density $p(x) = \mathcal{N}(0,1)$, width $h_n = h_1/\sqrt{n}$ [Duda, Hart, and Stork]
Width parameter } For fixed $n$, a smaller $h$ results in higher variance while a larger $h$ leads to higher bias. } For a fixed $h$, the variance decreases as the number of sample points $n$ tends to infinity } for a large enough number of samples, the smaller $h$ is, the better the accuracy of the resulting estimate } In practice, where only a finite number of samples is available, a compromise between $h$ and $n$ must be made. } $h$ can be set using techniques like cross-validation when the density estimate is used for learning tasks such as classification
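The slide only says that $h$ can be set with techniques like cross-validation; the sketch below assumes one concrete choice, a leave-one-out log-likelihood criterion with a Gaussian window, and uses illustrative names throughout.

```python
import numpy as np

# Sketch: pick h that maximizes the leave-one-out log-likelihood of the data
# under the Gaussian-window Parzen estimate (an assumed selection criterion).
def loo_log_likelihood(samples, h):
    n = len(samples)
    diffs = samples[:, None] - samples[None, :]
    K = np.exp(-diffs**2 / (2 * h**2)) / (np.sqrt(2 * np.pi) * h)
    np.fill_diagonal(K, 0.0)                  # leave each point out of its own estimate
    p_loo = K.sum(axis=1) / (n - 1)
    return np.sum(np.log(p_loo + 1e-300))

rng = np.random.default_rng(2)
data = rng.normal(size=200)
candidates = [0.05, 0.1, 0.2, 0.5, 1.0]
best_h = max(candidates, key=lambda h: loo_log_likelihood(data, h))
print("selected h:", best_h)
```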
Practical issues: Curse of dimensionality } Large $n$ is necessary to find an acceptable density estimate in high-dimensional feature spaces } $n$ must grow exponentially with the dimensionality $d$ } If $n$ equidistant points are required to densely fill a one-dim interval, $n^d$ points are needed to fill the corresponding $d$-dim hypercube ¨ We need an exponentially large quantity of training data to ensure that the cells are not empty } Also complexity (time and storage) requirements } [Figure: grids of cells for $d = 1$, $d = 2$, $d = 3$]
$k_n$-nearest neighbor estimation } Cell volume is a function of the point location } To estimate $p(\boldsymbol{x})$, let the cell around $\boldsymbol{x}$ grow until it captures $k_n$ samples, called the $k_n$ nearest neighbors of $\boldsymbol{x}$ } Two possibilities can occur: } high density near $\boldsymbol{x}$ ⇒ the cell will be small, which provides good resolution } low density near $\boldsymbol{x}$ ⇒ the cell will grow large and stop only when higher density regions are reached
$k_n$-nearest neighbor estimation } $p_n(\boldsymbol{x}) = \dfrac{k_n/n}{V_n} \;\Rightarrow\; V_n \approx \dfrac{k_n/n}{p(\boldsymbol{x})}$ } A family of estimates is obtained by setting $k_n = k_1\sqrt{n}$ and choosing different values for $k_1$: } for $k_1 = 1$: $V_n \approx \dfrac{1}{\sqrt{n}\, p(\boldsymbol{x})}$ } $V_n$ is a function of $\boldsymbol{x}$
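A minimal 1-D sketch of this estimator, assuming Euclidean distance and $k_1 = 1$; the function and variable names are illustrative.

```python
import numpy as np

# Sketch of the k_n-NN density estimate in 1-D: grow an interval around x until
# it contains k_n samples, then p_n(x) = (k_n/n) / V_n with V_n = 2r.
def knn_density_1d(samples, x, k):
    n = len(samples)
    r = np.sort(np.abs(samples - x))[k - 1]   # distance to the k-th nearest neighbour
    V = 2 * r                                 # length of the cell (-r, r) around x
    return (k / n) / V

rng = np.random.default_rng(3)
data = rng.normal(size=1_000)
k = int(np.sqrt(len(data)))                   # k_n = k_1 * sqrt(n) with k_1 = 1
print(knn_density_1d(data, 0.0, k))           # close to N(0,1) density at 0 ≈ 0.399
```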
$k_n$-Nearest Neighbor Estimation: Example } Discontinuities in the slopes [Bishop]
$k_n$-Nearest Neighbor Estimation: Example } $k_n = \sqrt{n}$; for $n = 1$: $p_1(x) = \dfrac{1}{2\,|x - x^{(1)}|}$ [Duda, Hart, and Stork]
Non-parametric density estimation: Summary } Generality of distributions } With enough samples, convergence to an arbitrarily complicated target density can be obtained. } The number of required samples must be very large to assure convergence } grows exponentially with the dimensionality of the feature space } These methods are very sensitive to the choice of window width or number of nearest neighbors } There may be severe requirements for computation time and storage (needed to save all training samples). } 'training' phase simply requires storage of the training set. } computational cost of evaluating $p(\boldsymbol{x})$ grows linearly with the size of the data set.
Nonparametric learners } Memory-based or instance-based learners } lazy learning: (almost) all the work is done at test time. } Generic description: } Memorize the training set $(\boldsymbol{x}^{(1)}, y^{(1)}), \dots, (\boldsymbol{x}^{(n)}, y^{(n)})$. } Given a test point $\boldsymbol{x}$, predict: $\hat{y} = f\big(\boldsymbol{x};\, \boldsymbol{x}^{(1)}, y^{(1)}, \dots, \boldsymbol{x}^{(n)}, y^{(n)}\big)$. } $f$ is typically expressed in terms of the similarity of the test sample $\boldsymbol{x}$ to the training samples $\boldsymbol{x}^{(1)}, \dots, \boldsymbol{x}^{(n)}$
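A minimal sketch of such a memory-based learner, taking kNN classification with a majority vote as the function $f$; the class and method names are illustrative.

```python
import numpy as np

# Sketch of a lazy (memory-based) kNN classifier: "training" only memorizes
# the data, and all the work happens at prediction time.
class KNNClassifier:
    def __init__(self, k=3):
        self.k = k

    def fit(self, X, y):                      # memorize (x_i, y_i), i = 1..n
        self.X, self.y = X, y
        return self

    def predict(self, x):                     # majority vote over k nearest neighbours
        dists = np.linalg.norm(self.X - x, axis=1)
        nearest = np.argsort(dists)[:self.k]
        labels, counts = np.unique(self.y[nearest], return_counts=True)
        return labels[np.argmax(counts)]

rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(KNNClassifier(k=5).fit(X, y).predict(np.array([2.5, 2.5])))
```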
Parzen window & generative classification } If $\dfrac{\frac{1}{n_1}\sum_{\boldsymbol{x}^{(i)}\in\mathcal{D}_1}\varphi\!\left(\frac{\boldsymbol{x}-\boldsymbol{x}^{(i)}}{h}\right)\times P(\mathcal{C}_1)}{\frac{1}{n_2}\sum_{\boldsymbol{x}^{(i)}\in\mathcal{D}_2}\varphi\!\left(\frac{\boldsymbol{x}-\boldsymbol{x}^{(i)}}{h}\right)\times P(\mathcal{C}_2)} > 1$ decide $\mathcal{C}_1$ } otherwise decide $\mathcal{C}_2$ } $n_c = |\mathcal{D}_c|$ ($c = 1,2$): number of training samples in class $\mathcal{C}_c$ } $\mathcal{D}_c$: set of training samples labeled as $\mathcal{C}_c$ } For large $n$, it has both high time and memory requirements
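A minimal sketch of this decision rule, assuming a Gaussian window for $\varphi$ and using the empirical class frequencies $n_c/n$ as priors; all names are illustrative.

```python
import numpy as np

# Sketch: compare class-conditional Parzen estimates weighted by class priors.
def parzen_class_conditional(X_c, x, h):
    d = X_c.shape[1]
    sq = np.sum((x - X_c) ** 2, axis=1)
    return np.mean(np.exp(-sq / (2 * h**2)) / (np.sqrt(2 * np.pi) * h) ** d)

def classify(x, D1, D2, h):
    n = len(D1) + len(D2)
    g1 = parzen_class_conditional(D1, x, h) * (len(D1) / n)   # p(x|C1) * P(C1)
    g2 = parzen_class_conditional(D2, x, h) * (len(D2) / n)   # p(x|C2) * P(C2)
    return 1 if g1 > g2 else 2

rng = np.random.default_rng(5)
D1 = rng.normal(0, 1, (60, 2))
D2 = rng.normal(3, 1, (40, 2))
print(classify(np.array([0.5, 0.5]), D1, D2, h=0.5))          # expected to print 1
```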
Parzen window & generative classification: Example } [Figure: Parzen-window classification with a smaller $h$ vs. a larger $h$] [Duda, Hart, and Stork]
Estimate the posterior } Consider a cell of volume $V$ around $\boldsymbol{x}$ containing $k$ of the $n$ training samples, $k_c$ of which belong to class $c$ (with $n_c$ samples of class $c$ in total): $p(\boldsymbol{x} \mid y = c) = \dfrac{k_c}{n_c V}$, $\quad p(y = c) = \dfrac{n_c}{n}$, $\quad p(\boldsymbol{x}) = \dfrac{k}{nV}$ } $p(y = c \mid \boldsymbol{x}) = \dfrac{p(\boldsymbol{x} \mid y = c)\, p(y = c)}{p(\boldsymbol{x})} = \dfrac{k_c}{k}$
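A minimal sketch of the resulting rule $p(y = c \mid \boldsymbol{x}) \approx k_c/k$: the posterior for class $c$ is the fraction of the $k$ nearest neighbours of $\boldsymbol{x}$ carrying label $c$; names are illustrative.

```python
import numpy as np

# Sketch of the kNN posterior estimate: among the k nearest neighbours of x,
# return the fraction belonging to each class, i.e. k_c / k.
def knn_posterior(X, y, x, k=10):
    nearest = np.argsort(np.linalg.norm(X - x, axis=1))[:k]
    return {c: float(np.mean(y[nearest] == c)) for c in np.unique(y)}

rng = np.random.default_rng(6)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(knn_posterior(X, y, np.array([1.0, 1.0]), k=15))
```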