8. Nearest neighbors
Foundations of Machine Learning
CentraleSupélec Paris, Fall 2017
8. Nearest neighbors
Chloé-Agathe Azencott
Centre for Computational Biology, Mines ParisTech
chloe-agathe.azencott@mines-paristech.fr
2
Practical matters
- Class representatives
– William PALMER william.palmer@student.ecp.fr
– Léonard BOUSSIOUX leonard.boussioux@student.ecp.fr
- Kaggle project
3
Learning objectives
- Implement the nearest-neighbor and k-nearest-neighbors algorithms.
- Compute distances between real-valued vectors as well as objects represented by categorical features.
- Define the decision boundary of the nearest-neighbor algorithm.
- Explain why kNN might not work well in high dimension.
4
Nearest neighbors
5
- How would you color the blank circles?
6
- How would you color the blank circles?
7
Partitioning the space
- The training data partitions the entire space.
8
Nearest neighbor
- Learning:
– Store all the training examples.
- Prediction:
– For x: predict the label of the training example closest to it.
9
k nearest neighbors
- Learning:
– Store all the training examples.
- Prediction:
– Find the k training examples closest to x.
– Classification?
10
k nearest neighbors
- Learning:
– Store all the training examples.
- Prediction:
– Find the k training examples closest to x.
– Classification:
Majority vote: predict the most frequent label among the k neighbors.
11
k nearest neighbors
- Learning:
– Store all the training examples.
- Prediction:
– Find the k training examples closest to x.
– Classification:
Majority vote: predict the most frequent label among the k neighbors.
– Regression?
12
k nearest neighbors
- Learning:
– Store all the training examples.
- Prediction:
– Find the k training examples closest to x.
– Classification:
Majority vote: predict the most frequent label among the k neighbors.
– Regression:
Predict the average of the labels of the k neighbors.
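As a sketch, the whole learning-plus-prediction pipeline fits in a few lines of NumPy (the function and variable names here are illustrative, not from the course):

```python
import numpy as np
from collections import Counter

def knn_predict(X_train, y_train, x, k=5, classification=True):
    """Predict the label of x from its k nearest training examples."""
    # "Learning" is just storing X_train, y_train; all work happens here.
    # Euclidean distance from x to every stored training example:
    dists = np.linalg.norm(X_train - x, axis=1)
    # Indices of the k closest examples:
    nn = np.argsort(dists)[:k]
    if classification:
        # Majority vote among the k neighbors
        return Counter(y_train[nn]).most_common(1)[0][0]
    # Regression: average of the neighbors' labels
    return y_train[nn].mean()
```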
13
Choice of k
- Small k: noisy.
The idea behind using more than one neighbor is to average out the noise.
- Large k: computationally intensive.
What if k = n?
14
Choice of k
- Small k: noisy.
The idea behind using more than one neighbor is to average out the noise.
- Large k: computationally intensive.
If k = n, we always predict
– for classification: the majority class;
– for regression: the average label.
- Set k by cross-validation.
- Heuristic: k ≈ √n.
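A minimal sketch of setting k by cross-validation with scikit-learn; the dataset and the grid of k values are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = load_breast_cancer(return_X_y=True)  # n = 569, so sqrt(n) ~ 24

# Try a few odd values of k around the sqrt(n) heuristic
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={'n_neighbors': [1, 3, 5, 11, 21, 31]},
                    cv=5, scoring='roc_auc')
grid.fit(X, y)
print(grid.best_params_)  # k selected by 5-fold cross-validation
```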
15
Non-parametric learning
- Non-parametric learning algorithm:
– the complexity of the decision function grows with the number of data points;
– contrast with linear regression (≈ as many parameters as features);
– usually, the decision function is expressed directly in terms of the training examples.
- Examples:
- kNN (this chapter)
- tree-based methods (Chap. 9)
- SVM (Chap. 10)
16
Instance-based learning
- Learning:
– Store the training instances.
- Prediction:
– Compute the label for a new instance based on its similarity with the stored instances.
- Also called lazy learning.
- Similar to case-based reasoning:
– doctors treating a patient based on how patients with similar symptoms were treated;
– judges ruling court cases based on legal precedent.
17
Instance-based learning
- Learning:
– Store the training instances.
- Prediction:
– Compute the label for a new instance based on its similarity with the stored instances. ← where the magic happens!
- Also called lazy learning.
- Similar to case-based reasoning:
– doctors treating a patient based on how patients with similar symptoms were treated;
– judges ruling court cases based on legal precedent.
18
Computing distances & similarities
19
Distances between instances
- Distance (metric): a function $d: \mathcal{X} \times \mathcal{X} \rightarrow \mathbb{R}$ such that, for all x, x', x'':
– $d(x, x') \geq 0$, and $d(x, x') = 0 \Leftrightarrow x = x'$
– $d(x, x') = d(x', x)$ (symmetry)
– $d(x, x'') \leq d(x, x') + d(x', x'')$ (triangle inequality)
21
Distances between instances
- Euclidean distance: $d(x, x') = \sqrt{\sum_{j=1}^p (x_j - x'_j)^2}$
22
Distances between instances
- Euclidean distance: $d(x, x') = \sqrt{\sum_{j=1}^p (x_j - x'_j)^2}$
- Manhattan distance: $d(x, x') = \sum_{j=1}^p |x_j - x'_j|$
Why is this called the Manhattan distance?
23
Distances between instances
- Euclidean distance: $d(x, x') = \sqrt{\sum_{j=1}^p (x_j - x'_j)^2}$
- Manhattan distance: $d(x, x') = \sum_{j=1}^p |x_j - x'_j|$
- Lq-norm: Minkowski distance $d_q(x, x') = \left( \sum_{j=1}^p |x_j - x'_j|^q \right)^{1/q}$
– L1 = Manhattan. – L2 = Euclidean. – L∞?
24
Distances between instances
- Euclidean distance: $d(x, x') = \sqrt{\sum_{j=1}^p (x_j - x'_j)^2}$
- Manhattan distance: $d(x, x') = \sum_{j=1}^p |x_j - x'_j|$
- Lq-norm: Minkowski distance $d_q(x, x') = \left( \sum_{j=1}^p |x_j - x'_j|^q \right)^{1/q}$
– L1 = Manhattan. – L2 = Euclidean. – L∞: $d_\infty(x, x') = \max_j |x_j - x'_j|$ (Chebyshev distance).
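These three distances are easy to compute directly; a small NumPy sketch with made-up vectors:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])
y = np.array([2.0, 0.0, 3.5])

def minkowski(x, y, q):
    """Lq (Minkowski) distance between two vectors."""
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)

print(minkowski(x, y, 1))      # Manhattan: 3.5
print(minkowski(x, y, 2))      # Euclidean: sqrt(1 + 4 + 0.25)
print(np.max(np.abs(x - y)))   # L-infinity (Chebyshev): 2.0
```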
25
Similarity between instances
- Pearson's correlation: $s(x, x') = \frac{\sum_j (x_j - \bar{x})(x'_j - \bar{x}')}{\sqrt{\sum_j (x_j - \bar{x})^2} \sqrt{\sum_j (x'_j - \bar{x}')^2}}$
- Assuming the data is centered: $s(x, x') = \frac{\sum_j x_j x'_j}{\sqrt{\sum_j x_j^2} \sqrt{\sum_j x'^2_j}}$
Geometric interpretation?
26
Similarity between instances
- Pearson's correlation (centered data)
- Cosine similarity: the dot product can be used to measure similarities: $s(x, x') = \frac{\langle x, x' \rangle}{\|x\| \, \|x'\|}$, the cosine of the angle between x and x'.
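Both similarities in a few lines of NumPy (a sketch; the function names are mine):

```python
import numpy as np

def cosine_similarity(x, y):
    """Cosine of the angle between x and y: normalized dot product."""
    return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def pearson(x, y):
    """Pearson's correlation is the cosine similarity of the centered vectors."""
    return cosine_similarity(x - x.mean(), y - y.mean())
```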
27
Categorical features
- Ex: a feature that can take 5 values:
– Sports – World – Culture – Internet – Politics
- Naive encoding: x1 in {1, 2, 3, 4, 5}:
– Why should Sports be closer to World than to Politics?
- One-hot encoding: x1, x2, x3, x4, x5
– Sports: [1, 0, 0, 0, 0] – Internet: [0, 0, 0, 1, 0]
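A minimal sketch of one-hot encoding in plain Python (the category list mirrors the example above):

```python
CATEGORIES = ['Sports', 'World', 'Culture', 'Internet', 'Politics']

def one_hot(value):
    """Encode a categorical value as a 0/1 vector with one slot per category."""
    return [1 if c == value else 0 for c in CATEGORIES]

print(one_hot('Sports'))    # [1, 0, 0, 0, 0]
print(one_hot('Internet'))  # [0, 0, 0, 1, 0]
# With this encoding, every pair of distinct categories is at the same
# Euclidean distance (sqrt(2)), so no spurious ordering is introduced.
```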
28
Categorical features
- Represent an object as the list of presences/absences (or counts) of the features that appear in it.
- Example: small molecules
features = atoms and bonds of a certain type
– C, H, S, O, N... – O-H, O=C, C-N...
29
Hamming distance
- Number of bits that differ: $d_H(x, x') = \sum_{j=1}^p |x_j - x'_j|$
- Equivalent to the Manhattan (L1) distance on binary vectors.
- Binary representation: bit j is 0 if the j-th feature does not occur, and 1 if it occurs one or more times.
31
Tanimoto/Jaccard similarity
- Number of shared features (normalized): $s(x, x') = \frac{\langle x, x' \rangle}{\|x\|^2 + \|x'\|^2 - \langle x, x' \rangle}$
- On binary representations, this is the Jaccard index: $|x \cap x'| / |x \cup x'|$.
32
MinMax similarity
- Number of shared features (normalized): $s(x, x') = \frac{\sum_j \min(x_j, x'_j)}{\sum_j \max(x_j, x'_j)}$
- If x is binary, MinMax and Tanimoto are equivalent.
- Counts representation: entry j is 0 if the j-th feature does not occur, and otherwise its number of occurrences.
33
Categorical features
- Compute the Hamming distance and the Tanimoto and MinMax similarities between the objects on the next slide.
34
Categorical features
- A: binary 100011010110, counts 300011010120
- B: binary 111011011110, counts 211021011120
- C: binary 111011010100, counts 311011010100
35
Categorical features
- A = 100011010110 / 300011010120
- B = 111011011110 / 211021011120
- C = 111011010100 / 311011010100
- Hamming distance:
d(A, B) = 3, d(A, C) = 3, d(B, C) = 2
- Tanimoto similarity:
s(A, B) = 6/9 ≈ 0.67, s(A, C) = 5/8 ≈ 0.63, s(B, C) = 7/9 ≈ 0.78
- MinMax similarity:
s(A, B) = 8/13 ≈ 0.62, s(A, C) = 7/11 ≈ 0.64, s(B, C) = 8/13 ≈ 0.62
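These values can be checked in a few lines of NumPy; a sketch using the counts representations above (the binary representation is recovered as counts > 0):

```python
import numpy as np

# Counts representations from the slide
A = np.array([3, 0, 0, 0, 1, 1, 0, 1, 0, 1, 2, 0])
B = np.array([2, 1, 1, 0, 2, 1, 0, 1, 1, 1, 2, 0])
C = np.array([3, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 0])

def hamming(x, y):
    """Number of differing bits in the binary representations."""
    return np.sum((x > 0) != (y > 0))

def tanimoto(x, y):
    """Shared features over total distinct features (binary)."""
    xb, yb = (x > 0).astype(int), (y > 0).astype(int)
    shared = np.sum(xb * yb)
    return shared / (xb.sum() + yb.sum() - shared)

def minmax(x, y):
    """Sum of elementwise minima over sum of elementwise maxima (counts)."""
    return np.minimum(x, y).sum() / np.maximum(x, y).sum()

print(hamming(A, B), hamming(A, C), hamming(B, C))     # 3 3 2
print(tanimoto(A, B), tanimoto(A, C), tanimoto(B, C))  # 6/9, 5/8, 7/9
print(minmax(A, B), minmax(A, C), minmax(B, C))        # 8/13, 7/11, 8/13
```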
36
Categorical features
- When new data has unknown features: ignore them.
37
Back to nearest neighbors
38
Advantages of kNN
- Training is very fast
– Just store the training examples.
– Can use smart indexing procedures to speed up prediction (at the cost of slower training).
- Keeps the training data
– Useful if we want to do something else with it.
- Rather robust to noisy data (averaging k votes)
- Can learn complex functions
39
Drawbacks of kNN
- Memory requirements
- Prediction can be slow.
– Complexity of labeling 1 new data point?
40
Drawbacks of kNN
- Memory requirements
- Prediction can be slow.
Complexity of labeling 1 new data point: O(np) for n training points in p dimensions. But kNN works best with lots of samples...
→ Efficient data structures (k-d trees, ball trees):
- construction: O(pn) space, O(pn log n) time
- query: O(log n) on average, in low dimension
→ Approximate solutions based on hashing.
- kNN is fooled by irrelevant attributes.
E.g. if p = 1000 but only 10 features are relevant, distances become meaningless.
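For illustration, scikit-learn exposes such a structure directly; a sketch on random data:

```python
import numpy as np
from sklearn.neighbors import KDTree

rng = np.random.default_rng(0)
X = rng.normal(size=(10000, 3))   # n = 10,000 points in p = 3 dimensions

tree = KDTree(X)                  # built once, at "training" time
dist, ind = tree.query(rng.normal(size=(1, 3)), k=5)
print(ind)                        # indices of the 5 nearest training points
```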
41
Decision boundary of kNN
- Classification
- Decision boundary: the line separating the positive from the negative regions.
- What decision boundary is kNN building?
42
Voronoi tessellation
- Voronoi cell of x:
– the set of all points of the space closer to x than to any other point of the training set
– a polyhedron
- Voronoi tessellation of the space: the union of all Voronoi cells.
Draw the Voronoi cell of the blue dot.
43
Voronoi tessellation
- Voronoi cell of x:
– the set of all points of the space closer to x than to any other point of the training set
– a polyhedron
- Voronoi tessellation of the space: the union of all Voronoi cells.
44
Voronoi tessellation
- The Voronoi tessellation defines the decision boundary of the 1-NN.
- The kNN also partitions the space (in a more complex way).
45
Curse of dimensionality
- Remember from Chap. 3:
- When p grows, the proportion of the volume of a hypercube lying outside its inscribed hypersphere approaches 1.
- Volume of a p-sphere of radius r: $V_p(r) = \frac{\pi^{p/2}}{\Gamma(p/2 + 1)} r^p$
- What this means:
– hyperspace is very big
– all points are far apart
– dimensionality reduction is needed.
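A quick numerical illustration of how fast the inscribed hypersphere vanishes inside the unit hypercube:

```python
import math

# Fraction of the unit hypercube [0,1]^p occupied by its inscribed
# hypersphere (radius 1/2): V_p(1/2) = pi^(p/2) / Gamma(p/2 + 1) * (1/2)^p
for p in [2, 3, 10, 20]:
    ball = math.pi ** (p / 2) / math.gamma(p / 2 + 1) * 0.5 ** p
    print(p, ball)
# p=2: ~0.785, p=3: ~0.524, p=10: ~0.0025, p=20: ~2.5e-8
```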
46
kNN variants
- ε-ball neighbors
– Instead of using the k nearest neighbors, use all points within a distance ε of the test point.
– What if there are no such points?
47
kNN variants
- Weighted kNN
– Weight the vote of each neighbor according to its distance to the test point.
– Variant: learn the optimal weights [e.g. Swamidass, Azencott et al. 2009, Influence Relevance Voter]
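Both variants are available in scikit-learn; a minimal sketch (the toy data, radius and k are arbitrary):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier, RadiusNeighborsClassifier

X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([0, 0, 1, 1])

# Weighted kNN: each neighbor's vote is weighted by 1/distance
wknn = KNeighborsClassifier(n_neighbors=3, weights='distance').fit(X, y)

# epsilon-ball neighbors: all points within the radius vote;
# outlier_label answers the "no such points" question above
eball = RadiusNeighborsClassifier(radius=0.8,
                                  outlier_label='most_frequent').fit(X, y)

print(wknn.predict([[1.4]]), eball.predict([[1.4]]))
```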
48
Collaborative filtering
- Collaborative filtering: recommend items that similar users have liked in the past
(similar users = users with similar tastes)
- Item-based kNN
– similarity between items: adjusted cosine similarity
$s(A, B) = \frac{\sum_{u \in U_{AB}} (r_{u,A} - \bar{r}_u)(r_{u,B} - \bar{r}_u)}{\sqrt{\sum_{u \in U_{AB}} (r_{u,A} - \bar{r}_u)^2} \, \sqrt{\sum_{u \in U_{AB}} (r_{u,B} - \bar{r}_u)^2}}$
where $r_{u,A}$ is the rating of item A by user u, $\bar{r}_u$ the average rating by user u, and $U_{AB}$ the set of users who rated both item A and item B.
49
Collaborative filtering
– score of item A for user u: $\hat{r}_{u,A} = \frac{\sum_{B \in N_k^u(A)} s(A, B) \, r_{u,B}}{\sum_{B \in N_k^u(A)} |s(A, B)|}$
where $N_k^u(A)$ is the set of the k nearest neighbors of A according to s among the items rated by user u.
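A sketch of this scoring rule, assuming a precomputed item-item similarity matrix S and a rating matrix R with NaN for missing ratings (all names here are hypothetical):

```python
import numpy as np

def knn_score(u, A, R, S, k=5):
    """Score item A for user u: similarity-weighted average of u's ratings.
    R: (n_users, n_items) rating matrix, np.nan where u did not rate.
    S: (n_items, n_items) adjusted-cosine similarity matrix."""
    rated = np.where(~np.isnan(R[u]))[0]       # items user u has rated
    rated = rated[rated != A]
    # k items most similar to A among those u has rated: the neighbors of A
    nn = rated[np.argsort(-S[A, rated])[:k]]
    return np.sum(S[A, nn] * R[u, nn]) / np.sum(np.abs(S[A, nn]))
```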
50
Summary
- kNN
– very simple training
– prediction can be expensive
- Relies on a “good” distance/similarity between instances.
- Decision boundary = Voronoi tessellation.
- Curse of dimensionality: hyperspace is very big.
51
References
- A Course in Machine Learning. http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf
– kNN: Chap. 3.2–3.3
– Categorical variables: Chap. 3.1
– Curse of dimensionality: Chap. 3.5
- More on:
– k-d trees
https://www.ri.cmu.edu/pub_files/pub1/moore_andrew_1991_1/moore_andrew_1991_1.pdf
http://www.alglib.net/other/nearestneighbors.php
– Voronoi tessellation
http://philogb.github.io/blog/2010/02/12/voronoi-tessellation/
52
Lab
53
54
Even though we use the same scoring strategy, we don’t get the same optimum. That’s because the cross-validation evaluation strategy is different: scikit-learn computes one AUC per fold and averages them.
55
The kNN performs much worse than the linear models. With such a large number of features, this is not unexpected.
56
57