Deconstructing Data Science David Bamman, UC Berkeley Info 290 - PowerPoint PPT Presentation

Deconstructing Data Science
David Bamman, UC Berkeley
Info 290
Lecture 17: Distance models
Mar 28, 2016

[Figure labeled 1853, with the points (12,10) and (5,6) marked]

Feature   Jefferson Square   Oakland Square   Lafayette Square   Harrison Square
x         5                  12               5                  12
y         6                  10               10                 6

"Manhattan distance": $\sum_{i=1}^{F} |x_i - y_i|$
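As a minimal sketch of the Manhattan distance formula, the snippet below sums absolute coordinate differences. The function name `manhattan_distance` is an illustrative choice, and the two points are read from the first two columns of the table (Jefferson Square and Oakland Square).

```python
def manhattan_distance(x, y):
    """Sum of absolute coordinate differences between two equal-length vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of features")
    return sum(abs(xi - yi) for xi, yi in zip(x, y))


# Coordinates taken from the table: (x, y) for each square.
jefferson_square = (5, 6)
oakland_square = (12, 10)

print(manhattan_distance(jefferson_square, oakland_square))  # |5-12| + |6-10| = 11
```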

[Figure: right triangle between (5,6) and (12,10)]
Legs: $|x_1 - y_1|$ and $|x_2 - y_2|$
Hypotenuse: $\sqrt{(x_1 - y_1)^2 + (x_2 - y_2)^2}$
$a^2 + b^2 = c^2$, so $\sqrt{a^2 + b^2} = c$

Euclidean distance
$\sqrt{\sum_{i=1}^{F} (x_i - y_i)^2} = \left( \sum_{i=1}^{F} (x_i - y_i)^2 \right)^{1/2}$
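The Euclidean distance formula can be sketched in a few lines of Python. The function name `euclidean_distance` is an illustrative assumption; the points are the two marked on the earlier slides.

```python
import math


def euclidean_distance(x, y):
    """Square root of the summed squared coordinate differences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of features")
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


# Straight-line distance between (5,6) and (12,10): sqrt(7^2 + 4^2) = sqrt(65)
print(euclidean_distance((5, 6), (12, 10)))
```

For the same pair of points, the Manhattan distance is 11, which is larger because it follows the two legs of the triangle rather than the hypotenuse.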

Metrics
• $d(x, y) \geq 0$: distances are not negative
• $d(x, y) = 0$ iff $x = y$: distances are positive, except for identity
• $d(x, y) = d(y, x)$: distances are symmetric

Metrics
• $d(x, y) \leq d(x, z) + d(z, y)$: triangle inequality
• A detour to another point z can't shorten the "distance" between x and y
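One way to make the four metric properties concrete is to test them by brute force on a small sample of points. The sketch below is illustrative only: the helper `satisfies_metric_properties`, the tolerance, and the sample points are assumptions, and passing on a finite sample does not prove that a function is a metric. It does show that squared Euclidean distance fails the triangle inequality.

```python
import itertools
import math


def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))


def satisfies_metric_properties(points, d, tol=1e-9):
    """Check the four metric properties of d on every pair/triple of sample points."""
    for x, y in itertools.product(points, repeat=2):
        dxy = d(x, y)
        if dxy < 0:                      # non-negativity
            return False
        if (dxy <= tol) != (x == y):     # zero distance iff identical
            return False
        if abs(dxy - d(y, x)) > tol:     # symmetry
            return False
    for x, y, z in itertools.product(points, repeat=3):
        if d(x, y) > d(x, z) + d(z, y) + tol:  # triangle inequality
            return False
    return True


# Hypothetical sample; the first three points are collinear on purpose.
points = [(0, 0), (1, 0), (2, 0), (5, 6), (12, 10)]
print(satisfies_metric_properties(points, euclidean))          # True
print(satisfies_metric_properties(points, squared_euclidean))  # False
```

The squared version fails because going from (0,0) to (2,0) directly costs 4, while the detour through (1,0) costs only 1 + 1 = 2.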

Feature                              x1   x2   x3
follow clinton                       1    0    0
follow trump                         0    1    1
"benghazi"                           0    0    1
negative sentiment + "benghazi"      0    1    0
"illegal immigrants"                 0    1    1
"republican" in profile              0    0    0
"democrat" in profile                0    0    0
self-reported location = Berkeley    1    0    0

K-nearest neighbors
• Supervised classification/regression
• Make a prediction by finding the closest k data points and
  • predicting the majority label among those k points (classification)
  • predicting the average value of those k points (regression)

KNN Classification
Let $N(x_i)$ be the K-nearest neighbors to $x_i$.
$P(Y = j \mid x_i) = \frac{1}{K} \sum_{x_n \in N(x_i)} I[y_n = j]$
(Pick the value of Y with the highest probability)
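A minimal sketch of KNN classification follows, assuming Euclidean distance and a brute-force scan over the training data. The function names and the toy labeled points are hypothetical, chosen only to illustrate the neighbor vote; `math.dist` requires Python 3.8 or later.

```python
import math
from collections import Counter


def knn_class_probabilities(query, examples, k):
    """Fraction of the k nearest neighbors carrying each label.

    examples: list of (feature_vector, label) pairs.
    """
    if not 1 <= k <= len(examples):
        raise ValueError("k must be between 1 and the number of examples")
    # Brute force: sort all examples by Euclidean distance to the query.
    neighbors = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    counts = Counter(label for _, label in neighbors)
    return {label: count / k for label, count in counts.items()}


def knn_classify(query, examples, k):
    """Pick the label with the highest neighbor fraction."""
    probs = knn_class_probabilities(query, examples, k)
    return max(probs, key=probs.get)


# Hypothetical toy data: three points labeled "a", two labeled "b".
examples = [((1, 1), "a"), ((2, 1), "a"), ((3, 3), "a"),
            ((6, 6), "b"), ((7, 6), "b")]

print(knn_class_probabilities((5, 5), examples, k=3))  # "b" gets 2/3, "a" gets 1/3
print(knn_classify((5, 5), examples, k=3))             # b
```

Sorting every example for each query is the simplest option; it costs time proportional to the size of the training set at prediction time.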

KNN Regression
Let $N(x_i)$ be the K-nearest neighbors to $x_i$.
$\hat{y}_i = \frac{1}{K} \sum_{x_j \in N(x_i)} y_j$
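KNN regression differs from classification only in the last step: the prediction is the mean of the neighbors' target values. The sketch below assumes Euclidean distance and uses hypothetical one-dimensional data; the name `knn_regress` is illustrative.

```python
import math


def knn_regress(query, examples, k):
    """Average the target values of the k nearest neighbors.

    examples: list of (feature_vector, numeric_target) pairs.
    """
    if not 1 <= k <= len(examples):
        raise ValueError("k must be between 1 and the number of examples")
    neighbors = sorted(examples, key=lambda ex: math.dist(query, ex[0]))[:k]
    return sum(target for _, target in neighbors) / k


# Hypothetical toy data with a single feature.
examples = [((1.0,), 10.0), ((2.0,), 20.0), ((3.0,), 30.0), ((10.0,), 100.0)]

# The two nearest neighbors of 2.2 are 2.0 and 3.0, so the prediction is 25.0.
print(knn_regress((2.2,), examples, k=2))
```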

[Figure: Data. Source: http://scott.fortmann-roe.com/docs/BiasVariance.html]

[Figure: K=1. Source: http://scott.fortmann-roe.com/docs/BiasVariance.html]

KNN
• Properties:
  • Linear/Nonlinear?
  • Complexity of training/testing?
  • Overfitting?
  • How to choose the best K?
  • Impact of data representation

Similarity
task                        method   distance
classification/regression   KNN      euclidean, etc.
classification/regression   SVM      kernel
duplicate detection
search

Relevance (IR)
• Similarity as an end in itself is a different paradigm from what we've been considering so far (classification, regression, clustering).

task                            x           y
KNN classification/regression   documents   genres
duplicate detection             documents

Duplicate detection

Duplicate document detection
• What are the data points we're comparing?
• How do we represent each one?
• How do we measure "similarity"?
• Evaluation?

Computational concerns
• Two sources of complexity:
  • Dimensionality of the feature space (every document is represented over a vocabulary of 1M words) [minhashing]
  • Number of documents in the collection to compare (4.64 billion web pages) [locality sensitive hashing]
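The slide only names minhashing, so the following is a hedged sketch of the general idea rather than the method used in the lecture. MinHash compresses a large set of features into a short signature such that the fraction of matching signature positions estimates the Jaccard similarity of the two sets (size of the intersection over size of the union). The salted MD5 hash, the signature length of 256, and all function names are assumptions made for illustration.

```python
import hashlib


def _salted_hash(token, seed):
    """Deterministic 64-bit hash of a token, salted with the seed (illustrative choice)."""
    data = f"{seed}:{token}".encode("utf-8")
    return int.from_bytes(hashlib.md5(data).digest()[:8], "big")


def minhash_signature(tokens, num_hashes=256):
    """For each of num_hashes salted hash functions, keep the minimum hash over the set."""
    token_set = set(tokens)
    if not token_set:
        raise ValueError("cannot build a signature for an empty set")
    return [min(_salted_hash(t, seed) for t in token_set)
            for seed in range(num_hashes)]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of positions where two signatures agree."""
    if len(sig_a) != len(sig_b):
        raise ValueError("signatures must have the same length")
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Hypothetical feature sets sharing 50 of 150 distinct features (true Jaccard = 1/3).
doc_a = {f"word{i}" for i in range(0, 100)}
doc_b = {f"word{i}" for i in range(50, 150)}

sig_a = minhash_signature(doc_a)
sig_b = minhash_signature(doc_b)
print(estimate_jaccard(sig_a, sig_b))  # an estimate close to 1/3
```

Each document is reduced to 256 integers regardless of vocabulary size, which addresses the dimensionality concern; the estimate becomes tighter as the signature grows.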

Feature   x1   x2   x3
the       1    1    1
and       1    1    1
obama     1    1    0
supreme   1    0    0
court     1    0    1
kansas    0    1    1
ncaa      0    1    1
four      1    1    1

Jaccard Similarity
$J(X, Y) = \frac{|X \cap Y|}{|X \cup Y|}$
• $|X \cap Y|$: number of features in both X and Y
• $|X \cup Y|$: number of features in either X or Y
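As a worked example, the sketch below computes Jaccard similarity for the three documents x1, x2, x3, treating each one as the set of features marked 1 in the feature table. The function name and the convention for two empty sets are assumptions.

```python
def jaccard_similarity(a, b):
    """Size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumption: two empty sets are treated as identical
    return len(a & b) / len(a | b)


# Features present (value 1) in each document, read from the table.
x1 = {"the", "and", "obama", "supreme", "court", "four"}
x2 = {"the", "and", "obama", "kansas", "ncaa", "four"}
x3 = {"the", "and", "court", "kansas", "ncaa", "four"}

print(jaccard_similarity(x1, x2))  # 4 shared / 8 total = 0.5
print(jaccard_similarity(x1, x3))  # 4 shared / 8 total = 0.5
print(jaccard_similarity(x2, x3))  # 5 shared / 7 total, about 0.714
```

Note that "the", "and", and "four" appear in every document and so inflate every pairwise score, which is one motivation for down-weighting frequent features.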

Text Reuse
"We were many times weaker than his splendid, lacquered machine, so that I did not even attempt to outspeed him. O lente currite noctis equi! O softly run, nightmares!" (Nabokov, Lolita)

Text reuse detection
• What are the data points we're comparing?
• How do we represent each one?
• How do we measure "similarity"?
• Evaluation?

Information retrieval

Information retrieval
• What are the data points we're comparing?
• How do we represent each one?
• How do we measure "similarity"?
• Evaluation?

Cosine Similarity
$\cos(x, y) = \frac{\sum_{i=1}^{F} x_i y_i}{\sqrt{\sum_{i=1}^{F} x_i^2} \, \sqrt{\sum_{i=1}^{F} y_i^2}}$
• Euclidean distance measures the magnitude of distance between two points
• Cosine similarity measures their orientation
• Often weighted by TF-IDF to discount the impact of frequent features.
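The cosine formula can be sketched directly from its definition. The vectors below are the binary columns x1, x2, x3 from the word-feature table (rows in the order the, and, obama, supreme, court, kansas, ncaa, four); the function name and the choice to reject zero vectors are assumptions.

```python
import math


def cosine_similarity(x, y):
    """Dot product divided by the product of the two vector lengths."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of features")
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)


x1 = [1, 1, 1, 1, 1, 0, 0, 1]
x2 = [1, 1, 1, 0, 0, 1, 1, 1]
x3 = [1, 1, 0, 0, 1, 1, 1, 1]

print(cosine_similarity(x1, x2))  # 4 / 6, about 0.667
print(cosine_similarity(x2, x3))  # 5 / 6, about 0.833

# Orientation, not magnitude: scaling a vector leaves the cosine at (about) 1.
doubled = [2 * v for v in x1]
print(cosine_similarity(x1, doubled))
```

The last line illustrates the contrast with Euclidean distance: x1 and its doubled copy are far apart as points but point in the same direction.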

Modern IR
• Modern IR accounts for much more information than document similarity
  • Prominence/reliability of document (PageRank)
  • Geographic location
  • Search query history
• This can become a supervised problem to learn how to map these more elaborate features of a query/session to the search ranking.
How do we represent our data?

Meme tracking J. Leskovec et al. (2009), "Meme-tracking and the Dynamics of the News Cycle"

http://mybinder.org/repo/dbamman/dds
