Measuring distance/ similarity of data objects Multiple data types - PowerPoint PPT Presentation

Measuring distance/ similarity of data objects

Multiple data types • Records of users • Graphs • Images • Videos • Text (webpages, books) • Strings (DNA sequences) • Time series • How do we compare them?

Feature space representation • Usually data objects consist of a set of attributes (also known as dimensions), e.g., (J. Smith, 20, 200K) • If all d dimensions are real-valued, then we can visualize each data object as a point in a d-dimensional space • If all d dimensions are binary, then we can think of each data object as a binary vector

Distance functions • The distance d(x, y) between two objects x and y is a metric if – d(x, y) ≥ 0 (non-negativity) – d(x, x) = 0 (isolation) – d(x, y) = d(y, x) (symmetry) – d(x, y) ≤ d(x, z) + d(z, y) (triangle inequality) [ Why do we need it? ] • The definitions of distance functions are usually different for real, boolean, categorical, and ordinal variables. • Weights may be associated with different variables based on applications and data semantics.
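
The four conditions can be checked by brute force on a small sample of points. The sketch below is only an illustration: the helper name `is_metric_on`, the tolerance, and the sample are assumptions, not part of the slides. Passing on a sample does not prove that a function is a metric, but a single failure does show that it is not.

```python
from itertools import product
import math

def is_metric_on(points, dist, tol=1e-9):
    """Check the four metric conditions for `dist` on a finite sample."""
    for x, y in product(points, repeat=2):
        if dist(x, y) < -tol:                       # non-negativity
            return False
        if abs(dist(x, y) - dist(y, x)) > tol:      # symmetry
            return False
    if any(abs(dist(x, x)) > tol for x in points):  # d(x, x) = 0
        return False
    for x, y, z in product(points, repeat=3):       # triangle inequality
        if dist(x, y) > dist(x, z) + dist(z, y) + tol:
            return False
    return True

def euclidean(x, y):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def squared_euclidean(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y))

sample = [(0.0,), (1.0,), (2.0,)]
print(is_metric_on(sample, euclidean))          # True
print(is_metric_on(sample, squared_euclidean))  # False: 4 > 1 + 1
```

The squared Euclidean distance fails only the triangle inequality here, which is one way to see why that condition has to be checked separately.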

Data structures • Data matrix: one row per tuple/object, one column per attribute/dimension • Distance matrix: objects × objects, where entry (i, j) holds the distance between objects i and j
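
As a rough sketch of how the two structures relate, the code below builds a distance matrix from a data matrix. The function name `distance_matrix` and the toy data are assumptions for illustration; the code also assumes the supplied distance is symmetric with d(x, x) = 0, so only the upper triangle is computed.

```python
import math

def distance_matrix(data, dist):
    """Return the n x n matrix of pairwise distances between the rows of `data`."""
    n = len(data)
    D = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Symmetry lets us fill both (i, j) and (j, i) at once.
            D[i][j] = D[j][i] = dist(data[i], data[j])
    return D

# Data matrix: 3 objects (rows) with 2 real-valued attributes (columns).
data = [(0.0, 0.0), (3.0, 4.0), (6.0, 8.0)]
D = distance_matrix(data, math.dist)  # math.dist is the Euclidean distance
for row in D:
    print(row)
```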

Distance functions for real-valued vectors
• L_p norms or Minkowski distance: $L_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$
• p = 1, L_1, Manhattan (or city block) or Hamming distance: $L_1(x, y) = \sum_{i=1}^{d} |x_i - y_i|$

Distance functions for real-valued vectors
• L_p norms or Minkowski distance: $L_p(x, y) = \left( \sum_{i=1}^{d} |x_i - y_i|^p \right)^{1/p}$
• p = 2, L_2, Euclidean distance: $L_2(x, y) = \left( \sum_{i=1}^{d} (x_i - y_i)^2 \right)^{1/2}$
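
The L_p formula translates almost directly into code. This is a minimal sketch; the function name `minkowski` and the example vectors are assumptions, and only finite p ≥ 1 is handled.

```python
def minkowski(x, y, p):
    """L_p (Minkowski) distance between two equal-length real vectors, for finite p >= 1."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same number of dimensions")
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

x, y = (0, 0), (3, 4)
print(minkowski(x, y, 1))  # Manhattan: |3| + |4| = 7
print(minkowski(x, y, 2))  # Euclidean: sqrt(9 + 16) = 5
```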

Distance functions for real-valued vectors
• Dot product or cosine similarity: $\cos(x, y) = \frac{x \cdot y}{\|x\| \, \|y\|}$
• Can we construct a distance function out of this?
• When should we use one and when the other?
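
One possible answer to the question above, sketched under assumptions: the slides do not say which construction they have in mind, so both functions below are illustrative. `1 - cos(x, y)` is widely used but does not satisfy the triangle inequality in general, whereas the angle `arccos(cos(x, y))` does. Both ignore vector length, so two distinct vectors pointing in the same direction are at distance 0.

```python
import math

def cosine_similarity(x, y):
    """cos(x, y) = (x . y) / (||x|| ||y||); undefined for zero vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_x * norm_y)

def cosine_distance(x, y):
    """1 - cos(x, y): a common dissimilarity, not a metric in general."""
    return 1.0 - cosine_similarity(x, y)

def angular_distance(x, y):
    """Angle between x and y in radians."""
    c = max(-1.0, min(1.0, cosine_similarity(x, y)))  # guard against rounding
    return math.acos(c)

print(cosine_similarity((1, 0), (0, 1)))  # orthogonal vectors
print(cosine_distance((1, 0), (2, 0)))    # same direction, different length
print(angular_distance((1, 0), (0, 1)))   # a right angle
```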

Hamming distance for 0-1 vectors
• x = 0 1 0 0 1 0 0 1 0
• y = 1 0 0 0 0 1 0 1 1
• $L_1(x, y) = \sum_{i=1}^{d} |x_i - y_i|$
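
For 0-1 vectors the L_1 sum simply counts the positions where the two vectors disagree. A minimal sketch using the two vectors from this slide (the function name `hamming` is an assumption):

```python
def hamming(x, y):
    """Number of positions at which two equal-length 0-1 vectors differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(x, y))

x = [0, 1, 0, 0, 1, 0, 0, 1, 0]
y = [1, 0, 0, 0, 0, 1, 0, 1, 1]
print(hamming(x, y))  # 5
```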

How good is Hamming distance for 0-1 vectors? • Drawback • Documents represented as sets (of words) • Two cases – Two very large documents, almost identical except for 5 terms – Two very small documents, with 5 terms each, disjoint

Distance functions for binary vectors or sets
• Jaccard similarity between binary vectors x and y (Range?): $\mathrm{JSim}(x, y) = \frac{|x \cap y|}{|x \cup y|}$
• Jaccard distance (Range?): $\mathrm{JDist}(x, y) = 1 - \frac{|x \cap y|}{|x \cup y|}$

The previous example • Case 1 (very large, almost identical documents): JSim(x, y) almost 1 • Case 2 (small disjoint documents): JSim(x, y) = 0
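
The two cases can be made concrete with a small sketch. The document sizes and word names below are invented for illustration (a 1000-term document versus a copy missing 5 terms, and two disjoint 5-term documents). The Hamming distance between two sets is taken here as the size of their symmetric difference.

```python
def hamming_sets(a, b):
    """Hamming distance between sets: number of elements in exactly one of them."""
    return len(a ^ b)

def jaccard_similarity(a, b):
    """|a ∩ b| / |a ∪ b|; two empty sets are treated as identical (an assumed convention)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

# Case 1: two large, almost identical documents (hypothetical vocabulary).
big1 = {f"w{i}" for i in range(1000)}
big2 = {f"w{i}" for i in range(5, 1000)}  # big1 without 5 of its terms

# Case 2: two small documents with 5 terms each and nothing in common.
small1 = {"a", "b", "c", "d", "e"}
small2 = {"f", "g", "h", "i", "j"}

print(hamming_sets(big1, big2), jaccard_similarity(big1, big2))
print(hamming_sets(small1, small2), jaccard_similarity(small1, small2))
```

Hamming distance puts the two cases within a factor of two of each other (5 versus 10), while Jaccard similarity separates them clearly (0.995 versus 0).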

Jaccard similarity/distance
• Example:
      Q1 Q2 Q3 Q4 Q5 Q6
  X    1  0  0  1  1  1
  Y    0  1  1  0  1  0
• JSim = 1/6
• JDist = 5/6
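
The example can be reproduced by counting positions directly on the binary vectors. This is a sketch; the function names are assumptions, and exact fractions are used so the results match the slide's 1/6 and 5/6 without rounding.

```python
from fractions import Fraction

def jaccard_sim_binary(x, y):
    """Positions where both are 1, divided by positions where at least one is 1."""
    both = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    either = sum(1 for a, b in zip(x, y) if a == 1 or b == 1)
    if either == 0:
        raise ValueError("Jaccard similarity is undefined for two all-zero vectors")
    return Fraction(both, either)

def jaccard_dist_binary(x, y):
    return 1 - jaccard_sim_binary(x, y)

X = [1, 0, 0, 1, 1, 1]
Y = [0, 1, 1, 0, 1, 0]
print(jaccard_sim_binary(X, Y))   # 1/6
print(jaccard_dist_binary(X, Y))  # 5/6
```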

Distance functions for strings • Edit distance between two strings x and y is the minimum number of operations required to transform one string into the other • Operations: replace, delete, insert, transpose, etc.

Distance functions between strings
• Strings x and y have equal length
• Modification of Hamming distance
• Add 1 for all positions that are different
• x = c g t a a c g
• y = g a t t a c a
• Hamming distance = 4
• Drawbacks?

Hamming distance between strings -- drawbacks
• Strings should have equal length
• What about
  x = a g a t t a c
  y = g a t t a c a
• String Hamming distance = 6
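
Both string examples can be checked with a few lines. This is a sketch; the function name `hamming_str` is an assumption.

```python
def hamming_str(x, y):
    """Number of positions at which two equal-length strings differ."""
    if len(x) != len(y):
        raise ValueError("Hamming distance needs strings of equal length")
    return sum(a != b for a, b in zip(x, y))

print(hamming_str("cgtaacg", "gattaca"))  # 4
print(hamming_str("agattac", "gattaca"))  # 6
```

In the second pair, y is just x shifted left by one character, yet almost every position counts as a mismatch. That is the drawback an alignment-based distance addresses.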

Edit Distance • The edit distance between two strings x and y, of lengths n and m respectively, is the minimum number of single-character edits (insertion, deletion, substitution) required to change one string into the other

Example
• I N T E N T I O N
• E X E C U T I O N
• Alignment (* marks a gap; d = deletion, s = substitution, i = insertion):
  I N T E * N T I O N
  * E X E C U T I O N
  d s s   i s

Computing the edit distance • Dynamic programming: form an n×m distance matrix D (x of length n, y of length m), with rows indexed by x and columns by y • D(i, j) is the optimal distance between the prefixes x[1..i] and y[1..j]

Computing the edit distance • How to compute D(i, j)? • Either – match the last two characters (substitution) – match by deleting the last character in one string – match by deleting the last character in the other string

Computing edit distance
• $D(i, j) = \min \{\, D(i-1, j) + \mathrm{del}(X[i]),\ D(i, j-1) + \mathrm{ins}(Y[j]),\ D(i-1, j-1) + \mathrm{sub}(X[i], Y[j]) \,\}$
• Running time? Metric?
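
A minimal sketch of the recurrence follows. The slide allows character-dependent costs del, ins and sub; the code assumes constant costs (1 each by default, with substituting a character by itself costing 0) and adds a row and column for the empty prefix, so the table is (n+1)×(m+1). The function and parameter names are assumptions.

```python
def edit_distance(x, y, ins_cost=1, del_cost=1, sub_cost=1):
    """Edit distance between strings x and y by dynamic programming."""
    n, m = len(x), len(y)
    # D[i][j] = optimal distance between the prefixes x[:i] and y[:j].
    D = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        D[i][0] = D[i - 1][0] + del_cost      # delete all of x[:i]
    for j in range(1, m + 1):
        D[0][j] = D[0][j - 1] + ins_cost      # insert all of y[:j]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = 0 if x[i - 1] == y[j - 1] else sub_cost
            D[i][j] = min(
                D[i - 1][j] + del_cost,       # delete x[i]
                D[i][j - 1] + ins_cost,       # insert y[j]
                D[i - 1][j - 1] + sub,        # substitute or match
            )
    return D[n][m]

print(edit_distance("intention", "execution"))  # 5
print(edit_distance("agattac", "gattaca"))      # 2
```

Each of the (n+1)(m+1) cells takes constant time, so the running time is O(nm). With unit costs this distance is a metric. The shifted pair that had Hamming distance 6 is only 2 edits apart.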

Distance functions between time series • Time series can be seen as vectors • Apply existing distance metrics • L_p norms • What can go wrong?

Distance functions between time series • Euclidean distance between time series [figures from Eamonn Keogh, www.cs.ucr.edu/~eamonn/DTW_myths.ppt]

Dynamic time warping • Alleviate the problems with Euclidean distance [figures from Eamonn Keogh, www.cs.ucr.edu/~eamonn/DTW_myths.ppt]

Dynamic time warping • Quite useful in practice [Figure: sign language time series; figures from Eamonn Keogh, www.cs.ucr.edu/~eamonn/DTW_myths.ppt]

Dynamic time warping • How to compute it? • Dynamic programming [Figure: time series Q and C; figures from Eamonn Keogh, www.cs.ucr.edu/~eamonn/DTW_myths.ppt]
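
The slides give no formula for dynamic time warping, so the following is a sketch of the textbook dynamic program rather than the exact formulation in the figures. It assumes a squared difference as the local cost and returns the accumulated cost of the best warping path; `q` and `c` follow the Q and C labels of the figure.

```python
def dtw(q, c):
    """Accumulated cost of the best warping path between sequences q and c."""
    n, m = len(q), len(c)
    INF = float("inf")
    # D[i][j] = cost of the best path aligning q[:i] with c[:j].
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            D[i][j] = cost + min(
                D[i - 1][j],      # q advances, c[j] is reused
                D[i][j - 1],      # c advances, q[i] is reused
                D[i - 1][j - 1],  # both advance
            )
    return D[n][m]

q = [0, 1, 2, 1, 0, 0]
c = [0, 0, 1, 2, 1, 0]  # the same bump, one step later
squared_euclidean = sum((a - b) ** 2 for a, b in zip(q, c))
print(squared_euclidean)  # 4
print(dtw(q, c))          # 0.0
```

The point-by-point comparison penalizes the shift, while the warping path absorbs it by reusing the flat samples at the two ends.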

Dynamic time warping • Constraints for more efficient computation [Figure: constrained warping between Q and C; figures from Eamonn Keogh, www.cs.ucr.edu/~eamonn/DTW_myths.ppt]
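
One common constraint restricts the warping path to a band of half-width w around the diagonal (often called a Sakoe-Chiba band). Whether the figures show exactly this constraint is an assumption, and the squared-difference local cost and the name `dtw_banded` are likewise illustrative. Only cells with |i - j| ≤ w are evaluated, so roughly n·w cells are filled instead of n·m. This sketch still allocates the full table for simplicity.

```python
def dtw_banded(q, c, w):
    """DTW accumulated cost with the path restricted to |i - j| <= w."""
    n, m = len(q), len(c)
    w = max(w, abs(n - m))  # the band must at least reach the corner (n, m)
    INF = float("inf")
    D = [[INF] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        # Cells outside the band keep the value INF and are never used.
        for j in range(max(1, i - w), min(m, i + w) + 1):
            cost = (q[i - 1] - c[j - 1]) ** 2
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]

q = [0, 0, 0, 1, 0]
c = [0, 1, 0, 0, 0]  # the spike occurs two steps earlier
print(dtw_banded(q, c, 0))  # 2.0: no warping allowed, same as squared Euclidean
print(dtw_banded(q, c, 2))  # 0.0: the band is wide enough to align the spikes
```

A narrow band is faster but can miss alignments that need a larger shift, so w trades accuracy for speed.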
