

  1. Deconstructing Data Science
     David Bamman, UC Berkeley
     Info 290
     Lecture 17: Distance models
     Mar 28, 2016

  2. 1853

  3. (12,10) (5,6) 1853

  4. Feature   Jefferson Square   Oakland Square   Lafayette Square   Harrison Square
     x         5                  12               5                  12
     y         6                  10               10                 6

     $\sum_{i=1}^{F} |x_i - y_i|$   ("Manhattan distance")
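A minimal Python sketch of this computation, using two of the coordinates from the table above (the function name and the choice of point pair are mine):

```python
# Manhattan (L1) distance: the sum of absolute coordinate differences.
def manhattan(x, y):
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

# Coordinates from the table above (illustrative pairing).
jefferson = (5, 6)
oakland = (12, 10)
print(manhattan(jefferson, oakland))  # |5 - 12| + |6 - 10| = 11
```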

  5. (12,10) (5,6) 1853

  6. For the points (5,6) and (12,10), the legs of the right triangle are $|x_1 - y_1|$ and $|x_2 - y_2|$; by the Pythagorean theorem $a^2 + b^2 = c^2$, i.e. $c = \sqrt{a^2 + b^2}$, so the distance is $\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$.

  7. Euclidean distance: $\sqrt{\sum_{i=1}^{F} (x_i - y_i)^2} = \left( \sum_{i=1}^{F} (x_i - y_i)^2 \right)^{1/2}$

  8. 1-norm (Manhattan): $\left( \sum_{i=1}^{F} |x_i - y_i|^1 \right)^{1/1}$
     2-norm (Euclidean): $\left( \sum_{i=1}^{F} |x_i - y_i|^2 \right)^{1/2}$
     p-norm: $\left( \sum_{i=1}^{F} |x_i - y_i|^p \right)^{1/p}$

  9. 0-norm (Hamming): $\left( \sum_{i=1}^{F} |x_i - y_i|^0 \right)^{1/0} = \sum_{i=1}^{F} I[x_i \neq y_i]$
     $\infty$-norm (Chebyshev): $\left( \sum_{i=1}^{F} |x_i - y_i|^\infty \right)^{1/\infty} = \max_i |x_i - y_i|$
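The whole family can be sketched in a few lines of Python; the 0-norm and ∞-norm cases are written directly as the count of differing positions and the maximum difference rather than as literal limits (function names are my own):

```python
def minkowski(x, y, p):
    """p-norm distance: (sum_i |x_i - y_i|^p)^(1/p)."""
    return sum(abs(xi - yi) ** p for xi, yi in zip(x, y)) ** (1 / p)

def hamming(x, y):
    """0-norm: the number of positions where x and y differ."""
    return sum(xi != yi for xi, yi in zip(x, y))

def chebyshev(x, y):
    """infinity-norm: the largest single coordinate difference."""
    return max(abs(xi - yi) for xi, yi in zip(x, y))

x, y = (5, 6), (12, 10)
print(minkowski(x, y, 1))  # 11.0 (Manhattan)
print(minkowski(x, y, 2))  # ~8.06 (Euclidean)
print(chebyshev(x, y))     # 7
print(hamming(x, y))       # 2
```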

  10. Metrics
      • $d(x, y) \geq 0$: distances are not negative
      • $d(x, y) = 0$ iff $x = y$: distances are positive, except for identity
      • $d(x, y) = d(y, x)$: distances are symmetric

  11. Metrics
      • $d(x, y) \leq d(x, z) + d(z, y)$: triangle inequality; a detour to another point z can't shorten the "distance" between x and y
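A quick numeric sanity check (not a proof) that Euclidean distance satisfies all four axioms, on three illustrative points of my choosing:

```python
import math

def euclidean(x, y):
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

x, y, z = (5, 6), (12, 10), (8, 2)
assert euclidean(x, y) >= 0                                   # non-negativity
assert euclidean(x, x) == 0                                   # identity
assert euclidean(x, y) == euclidean(y, x)                     # symmetry
assert euclidean(x, y) <= euclidean(x, z) + euclidean(z, y)   # triangle inequality
```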

  12. Feature                              x1   x2   x3
      follow clinton                        1    0    0
      follow trump                          0    1    1
      "benghazi"                            0    0    1
      negative sentiment + "benghazi"       0    1    0
      "illegal immigrants"                  0    1    1
      "republican" in profile               0    0    0
      "democrat" in profile                 0    0    0
      self-reported location = Berkeley     1    0    0

  13. K-nearest neighbors
      • Supervised classification/regression
      • Make a prediction by finding the closest k data points and
        • predicting the majority label among those k points (classification)
        • predicting the average of those k points (regression)

  14. KNN Classification
      Let $N(x_i)$ be the K-nearest neighbors to $x_i$.

      $P(Y = j \mid x) = \frac{1}{K} \sum_{x_i \in N(x)} I[y_i = j]$

      (Pick the value of Y with the highest probability.)
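A minimal brute-force sketch of this rule, assuming a distance function is passed in; the function names, toy coordinates, and labels are mine:

```python
from collections import Counter

def knn_classify(train, labels, query, k, dist):
    """Predict the majority label among the k nearest training points."""
    nearest = sorted(range(len(train)), key=lambda i: dist(train[i], query))[:k]
    return Counter(labels[i] for i in nearest).most_common(1)[0][0]

# Toy data: the four squares from slide 4, with made-up labels.
train = [(5, 6), (12, 10), (5, 10), (12, 6)]
labels = ["west", "east", "west", "east"]
manhattan = lambda a, b: sum(abs(u - v) for u, v in zip(a, b))
print(knn_classify(train, labels, (6, 7), k=3, dist=manhattan))  # "west"
```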

  15. KNN Regression
      Let $N(x_i)$ be the K-nearest neighbors to $x_i$.

      $\hat{y}_i = \frac{1}{K} \sum_{x_j \in N(x_i)} y_j$
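The regression variant is the same neighbor search with an average in place of the vote (again a sketch with names of my own; `values` holds the continuous targets):

```python
def knn_regress(train, values, query, k, dist):
    """Predict the average value of the k nearest training points."""
    nearest = sorted(range(len(train)), key=lambda i: dist(train[i], query))[:k]
    return sum(values[i] for i in nearest) / k
```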

  16. Data http://scott.fortmann-roe.com/docs/BiasVariance.html

  17. K=1 http://scott.fortmann-roe.com/docs/BiasVariance.html

  18. K=100 http://scott.fortmann-roe.com/docs/BiasVariance.html

  19. K=12 http://scott.fortmann-roe.com/docs/BiasVariance.html

  20. KNN
      • Properties:
        • Linear/nonlinear?
        • Complexity of training/testing?
        • Overfitting?
        • How to choose the best K?
        • Impact of data representation

  21. Similarity
      task                         method   distance
      classification/regression    KNN      euclidean, etc.
      classification/regression    SVM      kernel
      duplicate detection
      search

  22. Relevance (IR)
      • Similarity as an end in itself is a different paradigm from what we've been considering so far (classification, regression, clustering).

      task                             x           y
      KNN classification/regression    documents   genres
      duplicate detection              documents

  23. Duplicate detection

  24. Duplicate document detection
      • What are the data points we're comparing?
      • How do we represent each one?
      • How do we measure "similarity"?
      • Evaluation?

  25. Computational concerns
      • Two sources of complexity:
        • Dimensionality of the feature space (every document is represented over a vocabulary of 1M words) [minhashing; see the toy sketch below]
        • Number of documents in the collection to compare (4.64 billion web pages) [locality sensitive hashing]
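For intuition, here is a toy minhash in Python: each of n salted hash functions keeps the minimum hash value over a document's feature set, and the fraction of matching signature positions estimates Jaccard similarity. This is a simplified illustration, not the exact scheme referenced on the slide:

```python
import random

def minhash_signature(features, n_hashes=128, seed=0):
    """One min-hash per salted hash function, over a set of features."""
    rng = random.Random(seed)
    salts = [rng.getrandbits(32) for _ in range(n_hashes)]
    return [min(hash((salt, f)) for f in features) for salt in salts]

def estimated_jaccard(sig_a, sig_b):
    """Fraction of agreeing signature positions ~ Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

# Two toy documents as sets of word features (from slide 26).
x1 = {"the", "and", "obama", "supreme", "court", "four"}
x2 = {"the", "and", "obama", "kansas", "ncaa", "four"}
print(estimated_jaccard(minhash_signature(x1), minhash_signature(x2)))  # ~0.5
```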

  26. Feature   x1   x2   x3
      the        1    1    1
      and        1    1    1
      obama      1    1    0
      supreme    1    0    0
      court      1    0    1
      kansas     0    1    1
      ncaa       0    1    1
      four       1    1    1

  27. Jaccard Similarity

      $J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$

      $|X \cap Y|$: the number of features in both X and Y
      $|X \cup Y|$: the number of features in either X or Y
      (applied to the binary feature vectors x1, x2, x3 from the previous slide)
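With documents as feature sets, Jaccard similarity is a one-liner in Python; the example reuses the x1 and x2 word features from slide 26:

```python
def jaccard(x, y):
    """|X ∩ Y| / |X ∪ Y| over feature sets."""
    x, y = set(x), set(y)
    return len(x & y) / len(x | y)

# The x1 and x2 feature sets from slide 26.
x1 = {"the", "and", "obama", "supreme", "court", "four"}
x2 = {"the", "and", "obama", "kansas", "ncaa", "four"}
print(jaccard(x1, x2))  # 4 shared / 8 total = 0.5
```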

  28. Text Reuse
      "We were many times weaker than his splendid, lacquered machine, so that I did not even attempt to outspeed him. O lente currite noctis equi! O softly run, nightmares!" (Nabokov, Lolita)

  29. Text reuse detection
      • What are the data points we're comparing?
      • How do we represent each one?
      • How do we measure "similarity"?
      • Evaluation?

  30. Information retrieval

  31. Information retrieval
      • What are the data points we're comparing?
      • How do we represent each one?
      • How do we measure "similarity"?
      • Evaluation?

  32. Cosine Similarity

      $\cos(x, y) = \frac{\sum_{i=1}^{F} x_i y_i}{\sqrt{\sum_{i=1}^{F} x_i^2} \sqrt{\sum_{i=1}^{F} y_i^2}}$

      • Euclidean distance measures the magnitude of the distance between two points
      • Cosine similarity measures their orientation
      • Often weighted by TF-IDF to discount the impact of frequent features
      (computed over the binary feature vectors x1, x2, x3 from slide 26)
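A minimal sketch in plain Python over raw count vectors; the TF-IDF weighting mentioned on the slide would simply replace the raw counts:

```python
import math

def cosine(x, y):
    """Cosine similarity: dot product over the product of vector norms."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    return dot / (math.sqrt(sum(xi ** 2 for xi in x)) *
                  math.sqrt(sum(yi ** 2 for yi in y)))

# The binary x1 and x2 vectors from slide 26.
x1 = [1, 1, 1, 1, 1, 0, 0, 1]
x2 = [1, 1, 1, 0, 0, 1, 1, 1]
print(cosine(x1, x2))  # 4 / 6 ≈ 0.667
```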

  33. Modern IR
      • Modern IR accounts for much more information than document similarity:
        • Prominence/reliability of a document (PageRank)
        • Geographic location
        • Search query history
      • This can become a supervised problem: learning how to map these more elaborate features of a query/session to the search ranking. How do we represent our data?

  34. Meme tracking J. Leskovec et al. (2009), "Meme-tracking and the Dynamics of the News Cycle"

  35. Meme tracking J. Leskovec et al. (2009), "Meme-tracking and the Dynamics of the News Cycle"

  36. Meme tracking J. Leskovec et al. (2009), "Meme-tracking and the Dynamics of the News Cycle"

  37. http://mybinder.org/repo/dbamman/dds
