SLIDE 1
  • 8. Nearest neighbors

Foundations of Machine Learning, CentraleSupélec Paris, Fall 2017
Chloé-Agathe Azencott

Centre for Computational Biology, Mines ParisTech

chloe-agathe.azencott@mines-paristech.fr

SLIDE 2

Practical matters

  • Class representatives

– William PALMER william.palmer@student.ecp.fr
– Léonard BOUSSIOUX leonard.boussioux@student.ecp.fr

  • Kaggle project
SLIDE 3

Learning objectives

  • Implement the nearest-neighbor and k-nearest-neighbors algorithms.
  • Compute distances between real-valued vectors as well as objects represented by categorical features.
  • Define the decision boundary of the nearest-neighbor algorithm.
  • Explain why kNN might not work well in high dimension.

SLIDE 4

Nearest neighbors

SLIDE 5

  • How would you color the blank circles?
SLIDE 6

  • How would you color the blank circles?
SLIDE 7

Partitioning the space

The training data partitions the entire space.

SLIDE 8

Nearest neighbor

  • Learning:

– Store all the training examples

  • Prediction:

– For x: the label of the training example closest to it
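
To make the two steps concrete, here is a minimal 1-NN sketch in Python/NumPy (not from the slides; the toy data and the choice of the Euclidean distance are illustrative assumptions):

```python
import numpy as np

# Learning: just store the training examples (toy data, assumed for illustration).
X_train = np.array([[0.0, 0.0], [1.0, 1.0], [3.0, 3.0]])
y_train = np.array([0, 0, 1])

def predict_1nn(x, X_train, y_train):
    # Compute the distance from x to every stored training example...
    dists = np.linalg.norm(X_train - x, axis=1)
    # ...and return the label of the closest one.
    return y_train[np.argmin(dists)]

print(predict_1nn(np.array([2.5, 2.5]), X_train, y_train))  # -> 1
```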

SLIDE 9

k nearest neighbors

  • Learning:

– Store all the training examples

  • Prediction:

– Find the k training examples closest to x
– Classification?

SLIDE 10

k nearest neighbors

  • Learning:

– Store all the training examples

  • Prediction:

– Find the k training examples closest to x
– Classification

Majority vote: Predict the class of the most frequent label among the k neighbors.

SLIDE 11

k nearest neighbors

  • Learning:

– Store all the training examples

  • Prediction:

– Find the k training examples closest to x
– Classification

Majority vote: Predict the class of the most frequent label among the k neighbors.

– Regression?

SLIDE 12

k nearest neighbors

  • Learning:

– Store all the training examples

  • Prediction:

– Find the k training examples closest to x
– Classification

Majority vote: Predict the class of the most frequent label among the k neighbors.

– Regression

Predict the average of the labels of the k neighbors.
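
A sketch of both prediction rules in plain NumPy; the function and variable names are mine, not the course's:

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k=3, classification=True):
    # Find the k training examples closest to x (Euclidean distance).
    dists = np.linalg.norm(X_train - x, axis=1)
    neighbors_y = y_train[np.argsort(dists)[:k]]
    if classification:
        # Majority vote: most frequent label among the k neighbors.
        return Counter(neighbors_y).most_common(1)[0][0]
    # Regression: average of the labels of the k neighbors.
    return neighbors_y.mean()

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]])
y = np.array([0, 0, 0, 1])
print(knn_predict(np.array([0.2, 0.2]), X, y, k=3))                         # -> 0 (vote)
print(knn_predict(np.array([0.2, 0.2]), X, y, k=3, classification=False))   # -> 0.0 (average)
```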

SLIDE 13

Choice of k

  • Small k: noisy

The idea behind using more than 1 neighbor is to average out the noise

  • Large k: computationally intensive

If k = n ?

SLIDE 14

Choice of k

  • Small k: noisy

The idea behind using more than 1 neighbor is to average out the noise

  • Large k: computationally intensive

If k = n, then we predict

– for classification: the majority class
– for regression: the average value

  • Set k by cross-validation
  • Heuristic: k ≈ √n
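
For instance, with scikit-learn (used in the lab), k can be selected by cross-validation; the dataset and the candidate values of k below are made up for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Try several values of k and keep the one with the best cross-validated score.
grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [1, 3, 5, 7, 11, 15]},
                    cv=5)
grid.fit(X, y)
print(grid.best_params_)       # cross-validated choice of k
print(int(np.sqrt(len(X))))    # heuristic k ~ sqrt(n), for comparison
```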
SLIDE 15

Non-parametric learning

Non-parametric learning algorithm:

– the complexity of the decision function grows with the number of data points.

– contrast with linear regression (≈ as many parameters as features).

– Usually: the decision function is expressed directly in terms of the training examples.

– Examples:

  • kNN (this chapter)
  • tree-based methods (Chap. 9)
  • SVM (Chap. 10)
SLIDE 16

Instance-based learning

  • Learning:

– Storing training instances.

  • Predicting:

– Compute the label for a new instance based on its similarity with the stored instances.

  • Also called lazy learning.
  • Similar to case-based reasoning

– Doctors treating a patient based on how patients with similar symptoms were treated,

– Judges ruling court cases based on legal precedent.

SLIDE 17

Instance-based learning

  • Learning:

– Storing training instances.

  • Predicting:

– Compute the label for a new instance based on its similarity with the stored instances.

  • Also called lazy learning.
  • Similar to case-based reasoning

– Doctors treating a patient based on how patients with similar symptoms were treated,

– Judges ruling court cases based on legal precedent.

where the magic happens!

SLIDE 18

Computing distances & similarities

SLIDE 19

Distances between instances

  • Distance
SLIDE 20

Distances between instances

  • Distance
SLIDE 21

Distances between instances

  • Euclidean distance
SLIDE 22

Distances between instances

  • Euclidean distance
  • Manhattan distance

Why is this called the Manhattan distance?

SLIDE 23

Distances between instances

  • Euclidean distance
  • Manhattan distance
  • Lq-norm: Minkowski distance

– L1 = Manhattan.
– L2 = Euclidean.
– L∞ ?

SLIDE 24

Distances between instances

  • Euclidean distance
  • Manhattan distance
  • Lq-norm: Minkowski distance

– L1 = Manhattan.
– L2 = Euclidean.
– L∞ = Chebyshev distance (maximum absolute coordinate difference).
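
As a sketch, the same distances computed with scipy.spatial.distance (the vectors are arbitrary examples):

```python
from scipy.spatial import distance

x = [1.0, 2.0, 3.0]
y = [2.0, 0.0, 3.0]

print(distance.euclidean(x, y))        # L2
print(distance.cityblock(x, y))        # L1 (Manhattan)
print(distance.minkowski(x, y, p=3))   # general Lq (here q = 3)
print(distance.chebyshev(x, y))        # L-infinity: max coordinate difference
```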

SLIDE 25

Similarity between instances

  • Pearson's correlation
  • Assuming the data is centered

Geometric interpretation?

SLIDE 26

Similarity between instances

  • Pearson's correlation (centered data)
  • Cosine similarity: the dot product can be used to measure similarities.
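
A quick numerical check of the claim that, on centered data, Pearson's correlation is exactly the cosine similarity (toy vectors, NumPy only):

```python
import numpy as np

def cosine_similarity(a, b):
    # Dot product normalized by the vector norms.
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

x = np.array([1.0, 3.0, 5.0, 2.0])
y = np.array([2.0, 2.5, 6.0, 1.0])

pearson = np.corrcoef(x, y)[0, 1]
cosine_centered = cosine_similarity(x - x.mean(), y - y.mean())

print(pearson, cosine_centered)   # identical: Pearson = cosine of the centered vectors
```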

SLIDE 27

Categorical features

  • Ex: a feature that can take 5 values

– Sports
– World
– Culture
– Internet
– Politics

  • Naive encoding: x1 in {1, 2, 3, 4, 5}:

– Why is Sports closer to World than to Politics?

  • One-hot encoding: x1, x2, x3, x4, x5

– Sports: [1, 0, 0, 0, 0]
– Internet: [0, 0, 0, 1, 0]
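
One possible implementation of the one-hot encoding with scikit-learn's OneHotEncoder (note that it orders the columns alphabetically, so the vectors may differ from the slide's ordering):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

topics = np.array([["Sports"], ["World"], ["Culture"], ["Internet"], ["Politics"]])

enc = OneHotEncoder(sparse_output=False)   # use sparse=False on older scikit-learn versions
one_hot = enc.fit_transform(topics)

print(enc.categories_)   # column order chosen by the encoder (alphabetical)
print(one_hot)           # each topic becomes a 5-dimensional 0/1 vector
```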

SLIDE 28

Categorical features

  • Represent an object as the list of presence/absence (or counts) of features that appear in it.

  • Example: small molecules

features = atoms and bonds of a certain type

– C, H, S, O, N...
– O-H, O=C, C-N...

SLIDE 29

  • Hamming distance

Number of bits that are different. Equivalent to?

[Figure: binary representation of an object; bit = 0 if no occurrence of the feature, 1 if one or more occurrences (illustrated on the 1st and 10th features).]

SLIDE 30

  • Hamming distance

Number of bits that are different. Equivalent to the Manhattan (L1) distance on binary vectors.

[Figure: binary representation of an object; bit = 0 if no occurrence of the feature, 1 if one or more occurrences.]

SLIDE 31

  • Tanimoto/Jaccard similarity

Number of shared features (normalized): features present in both objects, divided by features present in either object.

[Figure: binary representations of two objects.]

SLIDE 32

  • MinMax similarity

Number of shared features (normalized): sum of the elementwise minima of the two count vectors, divided by the sum of their elementwise maxima. If x is binary, MinMax and Tanimoto are equivalent.

[Figure: counts representation of an object; entry = 0 if no occurrence of the feature, otherwise the number of occurrences (illustrated on the 1st and 10th features).]
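
Minimal NumPy sketches of the three measures; the function names are mine, and the definitions assume the usual shared-over-total normalizations:

```python
import numpy as np

def hamming_distance(a, b):
    # Number of positions where the binary vectors differ.
    return int(np.sum(a != b))

def tanimoto_similarity(a, b):
    # Features present in both objects, over features present in either (Jaccard index).
    return np.sum((a == 1) & (b == 1)) / np.sum((a == 1) | (b == 1))

def minmax_similarity(a, b):
    # Sum of elementwise minima over sum of elementwise maxima (count vectors).
    return np.sum(np.minimum(a, b)) / np.sum(np.maximum(a, b))

a = np.array([1, 0, 1, 1, 0])
b = np.array([1, 1, 1, 0, 0])
print(hamming_distance(a, b), tanimoto_similarity(a, b), minmax_similarity(a, b))
```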

SLIDE 33

Categorical features

  • Features
  • Compute the Hamming distance and the Tanimoto and MinMax similarities between these objects:

?

SLIDE 34

Categorical features

  • Features
  • Compute the Hamming distance and the Tanimoto and MinMax similarities between these objects:

A: 100011010110 (binary) / 300011010120 (counts)
B: 111011011110 (binary) / 211021011120 (counts)
C: 111011010100 (binary) / 311011010100 (counts)

SLIDE 35

Categorical features

  • A = 100011010110 / 300011010120
  • B = 111011011110 / 211021011120
  • C = 111011010100 / 311011010100
  • Hamming distance

d(A, B) = 3    d(A, C) = 3    d(B, C) = 2

  • Tanimoto similarity

s(A, B) = 6/9 ≈ 0.67    s(A, C) = 5/8 ≈ 0.63    s(B, C) = 7/9 ≈ 0.78

  • MinMax similarity

s(A, B) = 8/13 ≈ 0.62    s(A, C) = 7/11 ≈ 0.64    s(B, C) = 8/13 ≈ 0.62
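
A self-contained check of these values (same definitions as the sketch above, restated so the snippet runs on its own):

```python
import numpy as np

def bits(s):
    # Parse a string such as "100011010110" into an integer vector.
    return np.array([int(c) for c in s])

A, B, C = bits("100011010110"), bits("111011011110"), bits("111011010100")     # binary
Ac, Bc, Cc = bits("300011010120"), bits("211021011120"), bits("311011010100")  # counts

hamming  = lambda a, b: int(np.sum(a != b))
tanimoto = lambda a, b: np.sum(a & b) / np.sum(a | b)
minmax   = lambda a, b: np.sum(np.minimum(a, b)) / np.sum(np.maximum(a, b))

print(hamming(A, B), hamming(A, C), hamming(B, C))      # 3, 3, 2
print(tanimoto(A, B), tanimoto(A, C), tanimoto(B, C))   # ~0.67, 0.625, ~0.78
print(minmax(Ac, Bc), minmax(Ac, Cc), minmax(Bc, Cc))   # ~0.62, ~0.64, ~0.62
```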

SLIDE 36

Categorical features

  • Features
  • When new data has unknown features: ignore them.

SLIDE 37

Back to nearest neighbors

SLIDE 38

Advantages of kNN

  • Training is very fast

– Just store the training examples.
– Can use smart indexing procedures to speed up testing (at the cost of slower training).

  • Keeps the training data

– Useful if we want to do something else with it.

  • Rather robust to noisy data (averaging k votes)
  • Can learn complex functions
SLIDE 39

Drawbacks of kNN

  • Memory requirements
  • Prediction can be slow.

– Complexity of labeling 1 new data point?

SLIDE 40

Drawbacks of kNN

  • Memory requirements
  • Prediction can be slow.

Complexity of labeling 1 new data point: O(np), since we compute a distance in p dimensions to each of the n training points. But kNN works best with lots of samples...

→ Efficient data structures (k-d trees, ball trees)

  • construction: O(n log n) time, O(n) space
  • query: O(log n) on average (in low dimension)

→ Approximate solutions based on hashing

  • kNN is fooled by irrelevant attributes.

E.g. if p = 1000 and only 10 features are relevant, distances become meaningless.
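
In scikit-learn the indexing structure can be selected explicitly; a sketch on synthetic data comparing brute force with the tree-based indexes (the dataset is made up):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=5000, n_features=10, random_state=0)

# 'kd_tree' and 'ball_tree' pay an indexing cost at fit time
# to make the neighbor queries at prediction time faster than brute force.
for algorithm in ["brute", "kd_tree", "ball_tree"]:
    clf = KNeighborsClassifier(n_neighbors=5, algorithm=algorithm).fit(X, y)
    print(algorithm, clf.score(X[:100], y[:100]))
```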

SLIDE 41

Decision boundary of kNN

  • Classification
  • Decision boundary: line separating the positive from the negative regions.

  • What decision boundary is the kNN building?
SLIDE 42

Voronoi tessellation

  • Voronoi cell of x:

– set of all points of the space closer to x than to any other point of the training set
– a polyhedron

  • Voronoi tessellation of the space: union of all Voronoi cells.

Draw the Voronoi cell of the blue dot.

SLIDE 43

Voronoi tessellation

  • Voronoi cell of x:

– set of all points of the space closer to x than to any other point of the training set
– a polyhedron

  • Voronoi tessellation of the space: union of all Voronoi cells.

SLIDE 44

Voronoi tessellation

  • The Voronoi tessellation defines the decision boundary of the 1-NN.

  • The kNN also partitions the space (in a more complex way).
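
One way to visualize this claim: color a fine grid by its 1-NN label and overlay the Voronoi diagram of the training points, so that each decision region appears as a union of Voronoi cells (toy data; matplotlib and scipy assumed available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.spatial import Voronoi, voronoi_plot_2d

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(15, 2))        # toy training points
y = (X[:, 0] + X[:, 1] > 1).astype(int)    # toy labels

# 1-NN label of every grid point: the decision regions are unions of Voronoi cells.
xx, yy = np.meshgrid(np.linspace(0, 1, 300), np.linspace(0, 1, 300))
grid = np.c_[xx.ravel(), yy.ravel()]
nearest = np.argmin(((grid[:, None, :] - X[None, :, :]) ** 2).sum(-1), axis=1)
plt.contourf(xx, yy, y[nearest].reshape(xx.shape), alpha=0.3)

voronoi_plot_2d(Voronoi(X), ax=plt.gca(), show_vertices=False)
plt.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
```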

SLIDE 45

Curse of dimensionality

  • Remember from Chap 3
  • When p ↗, the proportion of a hypercube's volume outside of its inscribed hypersphere approaches 1.

  • Volume of a p-sphere of radius r: V_p(r) = π^(p/2) r^p / Γ(p/2 + 1)
  • What this means:

– hyperspace is very big
– all points are far apart
– dimensionality reduction is needed.

SLIDE 46

kNN variants

  • ε-ball neighbors

– Instead of using the k nearest neighbors, use all points within a distance ε of the test point.

– What if there are no such points?
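
scikit-learn ships this variant as RadiusNeighborsClassifier; its outlier_label argument is one way to handle queries with no neighbor within ε (toy data):

```python
import numpy as np
from sklearn.neighbors import RadiusNeighborsClassifier

X = np.array([[0.0], [0.2], [1.0], [1.1]])
y = np.array([0, 0, 1, 1])

# Use every training point within distance radius of the query instead of a fixed k.
clf = RadiusNeighborsClassifier(radius=0.3, outlier_label=-1).fit(X, y)
print(clf.predict([[0.1], [5.0]]))   # [0, -1]: no training point within 0.3 of 5.0
```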

SLIDE 47

kNN variants

  • Weighted kNN

– Weigh the vote of each neighbor according to its distance to the test point.

– Variant: learn the optimal weights [e.g. Swamidass, Azencott et al. 2009, Influence Relevance Voter]
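
Distance-weighted voting is available in scikit-learn via weights='distance'; learning the weights, as in the Influence Relevance Voter, is not in the library, so this only sketches the simple variant (synthetic data):

```python
from sklearn.datasets import make_classification
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=4, random_state=0)

# Each neighbor's vote is weighted by 1 / distance to the test point,
# so closer neighbors count more than distant ones.
clf = KNeighborsClassifier(n_neighbors=7, weights="distance").fit(X, y)
print(clf.predict(X[:5]))
```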

SLIDE 48

Collaborative filtering

  • Collaborative filtering: recommend items that similar users have liked in the past

similar users = users with similar tastes

  • item-based kNN

– similarity between items: adjusted cosine similarity, computed over the users u that rated both item A and item B:

s(A, B) = Σ_u (r(u,A) − r̄(u)) (r(u,B) − r̄(u)) / ( √(Σ_u (r(u,A) − r̄(u))²) √(Σ_u (r(u,B) − r̄(u))²) )

where r(u,A) is the rating of item A by user u and r̄(u) is the average rating by user u.

SLIDE 49

Collaborative filtering

– score of item A for user u: similarity-weighted combination of u's ratings of the k nearest neighbors of A (according to s) among the items rated by user u.
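
A NumPy sketch of item-based collaborative filtering along the lines of these two slides, on a made-up rating matrix (rows = users, columns = items, 0 = not rated); the exact normalization of the score is my assumption:

```python
import numpy as np

# Toy ratings: rows = users, columns = items, 0 = not rated (made-up data).
R = np.array([[5, 3, 0, 1],
              [4, 0, 4, 1],
              [1, 1, 5, 4],
              [0, 1, 5, 4]], dtype=float)

user_mean = np.nanmean(np.where(R > 0, R, np.nan), axis=1)  # average rating per user

def adjusted_cosine(a, b):
    # Similarity of items a and b over the users that rated both,
    # after subtracting each user's average rating.
    both = (R[:, a] > 0) & (R[:, b] > 0)
    da, db = R[both, a] - user_mean[both], R[both, b] - user_mean[both]
    denom = np.linalg.norm(da) * np.linalg.norm(db)
    return float(da @ db / denom) if denom > 0 else 0.0

def score(u, a, k=2):
    # Score of item a for user u: similarity-weighted average of u's ratings
    # of the k most similar items among those u has already rated.
    rated = [b for b in range(R.shape[1]) if b != a and R[u, b] > 0]
    top = sorted(((adjusted_cosine(a, b), b) for b in rated), reverse=True)[:k]
    den = sum(abs(s) for s, _ in top)
    return sum(s * R[u, b] for s, b in top) / den if den > 0 else 0.0

print(score(u=3, a=0))   # predicted score of item 0 for user 3
```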

SLIDE 50

Summary

  • kNN

– very simple training
– prediction can be expensive

  • Relies on a "good" distance/similarity between instances

  • Decision boundary = Voronoi tessellation
  • Curse of dimensionality: hyperspace is very big.
SLIDE 51

References

  • A Course in Machine Learning.

http://ciml.info/dl/v0_99/ciml-v0_99-all.pdf

– kNN: Chap 3.2 – 3.3
– Categorical variables: Chap 3.1
– Curse of dimensionality: Chap 3.5

  • More on

– Kd-trees

https://www.ri.cmu.edu/pub_files/pub1/moore_andrew_1991_1/moore_andrew_1991_1.pdf
http://www.alglib.net/other/nearestneighbors.php

– Voronoi tessellation

http://philogb.github.io/blog/2010/02/12/voronoi-tessellation/

SLIDE 52

Lab

SLIDE 53

SLIDE 54

Even though we use the same scoring strategy, we don't get the same optimum. That's because the cross-validation evaluation strategy is different: scikit-learn computes one AUC per fold and averages them.

SLIDE 55

The kNN performs much worse than the linear models. With such a large number of features, this is not unexpected.

SLIDE 56

SLIDE 57

Computing nearest neighbors based on correlation works better than using Minkowski distances. Indeed, this compares the profiles of the gene expressions (which genes have high / low expression simultaneously). Still, logistic regression works best.