Locality-Sensitive Hashing & Image Similarity Search Andrew - PowerPoint PPT Presentation

Locality-Sensitive Hashing & Image Similarity Search Andrew Wylie

Overview; LSH
● given a query q (or not), how do we find similar items in a large search set quickly?
  ○ can’t do all pairwise comparisons; there are $\binom{n}{2} = n(n-1)/2$ pairs
● define a measure of similarity for the items, then hash them into buckets using the measure
  ○ items which are similar will (with high probability) be in the same bucket
● then, when given a query q, we hash it and return the items in the same bucket
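
The bucket idea above can be sketched in a few lines. This is only an illustration, not the presenter's code: `build_index`, `query`, and `toy_hash` are hypothetical names, and `toy_hash` (rounding a number down to a grid cell) stands in for a real locality-sensitive hash function.

```python
from collections import defaultdict


def build_index(items, hash_fn):
    """Group items into buckets keyed by their hash value."""
    buckets = defaultdict(list)
    for item in items:
        buckets[hash_fn(item)].append(item)
    return buckets


def query(buckets, hash_fn, q):
    """Return the candidate items that share a bucket with q."""
    return list(buckets.get(hash_fn(q), []))


def toy_hash(x, width=1.0):
    """Hypothetical stand-in hash: nearby numbers share a grid cell."""
    return int(x // width)


index = build_index([1.0, 1.2, 5.0], toy_hash)
candidates = query(index, toy_hash, 1.1)  # only the bucket of 1.1 is inspected
```

Only the query's bucket is examined, which is where the saving over all pairwise comparisons comes from.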

Overview; LSH
● it’s a way to do approximate near-neighbour search
  ○ item signatures used are (mostly) approximate
  ○ whether items hash to the same bucket is probabilistic
● so multiple hash tables are combined for better accuracy
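
To see why several tables help, here is a small calculation. It assumes the usual construction (not spelled out on the slide) in which each table's key concatenates `k` independent hash functions and an item is a candidate if it matches the query in at least one of `num_tables` tables; `p` is the chance that a single hash function agrees on the two items. All three names are illustrative.

```python
def collision_probability(p, k, num_tables):
    """Chance that two items share a bucket in at least one table.

    p          -- probability a single hash function agrees on the pair
    k          -- hash functions concatenated per table key (assumed AND step)
    num_tables -- number of independent tables (assumed OR step)
    """
    return 1.0 - (1.0 - p ** k) ** num_tables


similar_pair = collision_probability(0.8, k=4, num_tables=10)
dissimilar_pair = collision_probability(0.3, k=4, num_tables=10)
```

With these example numbers a pair that agrees 80% of the time per function is almost always found, while a pair that agrees 30% of the time rarely becomes a candidate.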

Overview; LSH
● there are many similarity/distance measures
  ○ Jaccard
  ○ Edit
  ○ Euclidean
  ○ $\chi^2$
  ○ Hamming
  ○ p-stable
  ○ Cosine
  ○ Kernelized
● allows sublinear query time of $O(d\,n^{1/(1+\epsilon)})$
● preprocessing varies based on data & representation

Euclidean Distance
● n-dimensional space
● most often the $l_2$ norm; the $l_1$ & $l_\infty$ norms are also used
● $d_p(v, u) = \left(\sum_i |v_i - u_i|^p\right)^{1/p}$
● eg. x = [7, 2, 3], y = [5, 0, -2]
  ○ $d_2(x, y) = [(7 - 5)^2 + (2 - 0)^2 + (3 - (-2))^2]^{1/2}$
  ○ $d_2(x, y) = 33^{1/2} \approx 5.74$
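
The distance formula can be checked directly on the example vectors. A minimal sketch; `lp_distance` is an illustrative name, and passing `math.inf` for `p` is a convention chosen here to select the $l_\infty$ norm.

```python
import math


def lp_distance(v, u, p=2.0):
    """l_p distance between two equal-length vectors."""
    if len(v) != len(u):
        raise ValueError("vectors must have the same length")
    diffs = [abs(a - b) for a, b in zip(v, u)]
    if math.isinf(p):
        # l_infinity norm: the largest coordinate difference
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


x = [7, 2, 3]
y = [5, 0, -2]
d1 = lp_distance(x, y, p=1)            # 2 + 2 + 5
d2 = lp_distance(x, y, p=2)            # sqrt(4 + 4 + 25)
d_inf = lp_distance(x, y, p=math.inf)  # max(2, 2, 5)
```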

Euclidean Distance & Random Projections
● we won’t compute the distance between the points!
● use a randomly chosen line in 2-space (one for each hash fn)
● select a constant a to divide the line into equal-width segments
● points are projected onto the line; the buckets are the segments
● gives an (a/2, 2a, 1/2, 1/3)-sensitive family
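
A rough sketch of one such hash function in 2-space follows. The random shift `offset` of the segment boundaries is an assumption added here so that the collision chance depends only on how far apart the points are, not on where they sit; `make_projection_hash` and `collision_rate` are illustrative names.

```python
import math
import random


def make_projection_hash(a, rng):
    """Build one hash function: project onto a random line, cut into width-a segments."""
    theta = rng.uniform(0.0, math.pi)            # random direction of the line
    direction = (math.cos(theta), math.sin(theta))
    offset = rng.uniform(0.0, a)                 # assumed random shift of the boundaries

    def h(point):
        position = point[0] * direction[0] + point[1] * direction[1]
        return math.floor((position + offset) / a)

    return h


def collision_rate(p, q, a, trials=2000, seed=0):
    """Fraction of random hash functions that put p and q in the same segment."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        h = make_projection_hash(a, rng)
        hits += h(p) == h(q)
    return hits / trials


near = collision_rate((0.0, 0.0), (0.5, 0.0), a=1.0)  # distance a/2
far = collision_rate((0.0, 0.0), (2.0, 0.0), a=1.0)   # distance 2a
```

Under these assumptions the measured rates should respect the family's bounds: at least 1/2 for points at distance a/2, at most 1/3 for points at distance 2a.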

Cosine Distance
● it’s the angle between two vectors/points (in degrees)
● calculated as the inverse cosine of their dot product divided by the product of their $l_2$ norms
● eg. x = [7, 2, 3], y = [5, 0, -2]
  ○ $\cos\theta = [(7 \cdot 5) + (2 \cdot 0) + (3 \cdot (-2))] / (\|x\|_2 \|y\|_2)$
  ○ $\cos\theta = 29 / (62^{1/2} \cdot 29^{1/2}) \approx 0.684$
  ○ $d(x, y) = \cos^{-1}(0.684)$
  ○ $d(x, y) \approx 46.8$ degrees
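
The worked example can be reproduced with a short function. A minimal sketch; `cosine_distance_degrees` is an illustrative name, and the clamp guards against floating-point values that land slightly outside [-1, 1].

```python
import math


def cosine_distance_degrees(x, y):
    """Angle between two non-zero vectors, in degrees."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    cos_theta = dot / (norm_x * norm_y)
    cos_theta = max(-1.0, min(1.0, cos_theta))  # clamp rounding error
    return math.degrees(math.acos(cos_theta))


angle = cosine_distance_degrees([7, 2, 3], [5, 0, -2])
```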

Cosine Distance & Random Hyperplanes
● don’t actually compute this distance for x & y
● consider a random hyperplane through the origin with normal vector v
● instead compute the dot products $v \cdot x$ & $v \cdot y$

Cosine Distance & Random Hyperplanes
● we’ll say they’re similar if $v \cdot x$ & $v \cdot y$ have the same sign
● family is $(d_1, d_2, (180 - d_1)/180, (180 - d_2)/180)$-sensitive
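
A quick simulation illustrates the sensitivity claim: two vectors at angle d should get the same sign with probability (180 − d)/180. This sketch draws each normal vector with independent Gaussian components, which is one standard way (assumed here) to get a uniformly random direction; the function names are illustrative.

```python
import random


def random_normal_vector(dim, rng):
    """Normal vector of a random hyperplane through the origin."""
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


def hyperplane_bit(v, x):
    """1 if x lies on the non-negative side of the hyperplane with normal v, else 0."""
    return 1 if sum(a * b for a, b in zip(v, x)) >= 0 else 0


def agreement_rate(x, y, trials=4000, seed=0):
    """Fraction of random hyperplanes that leave x and y on the same side."""
    rng = random.Random(seed)
    same = 0
    for _ in range(trials):
        v = random_normal_vector(len(x), rng)
        same += hyperplane_bit(v, x) == hyperplane_bit(v, y)
    return same / trials


rate = agreement_rate([7, 2, 3], [5, 0, -2])
```

For the example vectors the angle is about 46.8 degrees, so the rate should come out near (180 − 46.8)/180 ≈ 0.74.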

p-Stable Distribution Scheme
● locality-sensitive families for the $l_p$ norm using a p-stable distribution
  ○ eg. the Gaussian distribution is 2-stable
● a distribution is p-stable if, for i.i.d. draws $X_1, \dots, X_n$ and $X$ from it,
  ○ $\sum_i v_i X_i$ has the same distribution as $\left(\sum_i |v_i|^p\right)^{1/p} X$
● so with v & X as vectors, the dot product estimates the $l_p$ norm of v
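
The 2-stable case can be checked numerically: if the $X_i$ are standard Gaussians, the dot product with v is Gaussian with standard deviation equal to the $l_2$ norm of v. The sketch below estimates that spread for v = [3, 4], whose norm is 5; the helper names, sample size, and seed are arbitrary choices.

```python
import math
import random


def projection_samples(v, trials, rng):
    """Dot products of v with fresh vectors of standard Gaussian entries."""
    return [sum(vi * rng.gauss(0.0, 1.0) for vi in v) for _ in range(trials)]


def sample_std(values):
    """Sample standard deviation."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((x - mean) ** 2 for x in values) / (len(values) - 1))


rng = random.Random(0)
v = [3.0, 4.0]
estimate = sample_std(projection_samples(v, 20000, rng))  # should be close to 5
```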

p-Stable Distribution Scheme
● dot product is instead used to assign a hash value to v
  ○ projects to a value on the real line
  ○ split the line into equal-width segments of size r for buckets
● two vectors which are close have a difference with a small norm, so their projections are close and they should collide
● $h_{a,b}(v) = \lfloor (a \cdot v + b) / r \rfloor$
  ○ a has entries drawn from the p-stable distribution; b is uniform in [0, r]
● family is $(r_1, r_2, p_1, p_2)$-sensitive
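
A minimal sketch of the hash function for the $l_2$ case, using Gaussian entries for a and a uniform offset b in [0, r]. `hash_value` and `make_hash` are illustrative names, not taken from any library.

```python
import math
import random


def hash_value(a, b, r, v):
    """h_{a,b}(v) = floor((a . v + b) / r)."""
    projection = sum(ai * vi for ai, vi in zip(a, v))
    return math.floor((projection + b) / r)


def make_hash(dim, r, rng):
    """Draw a (Gaussian entries, 2-stable) and b (uniform in [0, r]) once, then reuse them."""
    a = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    b = rng.uniform(0.0, r)
    return lambda v: hash_value(a, b, r, v)


# Fixed a, b, r so the arithmetic is easy to follow: (3 + 0.5) / 2 = 1.75
bucket = hash_value([1.0, 0.0], 0.5, 2.0, [3.0, 7.0])

h = make_hash(dim=3, r=4.0, rng=random.Random(1))
same_bucket = h([1.0, 2.0, 3.0]) == h([1.0, 2.0, 3.0])
```

Using `math.floor` rather than `int()` matters for negative projections, which must round down to the lower segment.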

Image Similarity Search
● consider the case of image search in web search engines
  ○ most engines return image search matches based on
    ■ surrounding text on the page
    ■ image metadata
● could lead to incorrect results for mislabelled images, etc.
● can we do better than this?
  ○ should also match on similar images

Google Image Search (VisualRank)
● uses PageRank for initial candidate results
● feature vectors extracted using SIFT (local features)

Google Image Search (VisualRank)
● clusters images based on similarity
  ○ measured using the p-stable scheme
  ○ Gaussian distribution
  ○ $l_2$ norm

Google Image Search (VisualRank)
● top results selected as graph center
  ○ eigenvector centrality measure
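
Eigenvector centrality on an image-similarity graph can be approximated by power iteration. The sketch below is a generic PageRank-style iteration with a damping factor, not Google's implementation; the 3-image similarity matrix, the damping value 0.85, and the name `centrality_scores` are made up for illustration, and every image is assumed to have some non-zero similarity to another.

```python
def centrality_scores(similarity, damping=0.85, iterations=100):
    """Power iteration on a column-normalised similarity matrix."""
    n = len(similarity)
    col_sums = [sum(similarity[i][j] for i in range(n)) for j in range(n)]
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new_rank = []
        for i in range(n):
            # Score flowing into image i from every image j, weighted by similarity
            flow = sum(
                similarity[i][j] / col_sums[j] * rank[j]
                for j in range(n)
                if col_sums[j] > 0
            )
            new_rank.append(damping * flow + (1.0 - damping) / n)
        rank = new_rank
    return rank


# Hypothetical similarities: image 0 resembles both others, which barely resemble each other
similarity = [
    [0.0, 0.9, 0.8],
    [0.9, 0.0, 0.1],
    [0.8, 0.1, 0.0],
]
scores = centrality_scores(similarity)
center = scores.index(max(scores))
```

The image with the highest score is the one most strongly connected to the rest of the cluster, which matches the "graph center" idea on the slide.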

Image Similarity Search
● other methods have been proposed...
● $\chi^2$ distance scheme
  ○ also based on p-stable
  ○ modified to use the $\chi^2$ distance measure
  ○ similarity is more accurate with respect to global image descriptors
    ■ eg. color histograms (what’s mostly used)
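
For reference, one common form of the $\chi^2$ distance between two non-negative histograms is $\sqrt{\sum_i (x_i - y_i)^2 / (x_i + y_i)}$. Conventions differ (some drop the square root or add a factor of 1/2), so treat the sketch below as one variant; skipping bins where both histograms are empty is a choice made here to avoid dividing by zero.

```python
import math


def chi2_distance(x, y):
    """Chi-squared distance between two non-negative histograms (square-root form)."""
    total = 0.0
    for a, b in zip(x, y):
        if a + b > 0:  # skip bins that are empty in both histograms
            total += (a - b) ** 2 / (a + b)
    return math.sqrt(total)


# Bin terms: 0, (0 - 2)^2 / 2 = 2, (3 - 1)^2 / 4 = 1
distance = chi2_distance([1, 0, 3], [1, 2, 1])
```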

Image Similarity Search
● kernelized LSH (afaik)
  ○ constructed using a kernel function (& some database items)
    ■ eg. gaussian blur, radial basis functions
    ■ method allows kernel functions with unknown embeddings
  ○ given kernelized data & kernel function
    ■ need to use a random hyperplane in the kernel-induced feature space
    ■ construct the hyperplane as a weighted sum of random items
    ■ transform so that the sum is (approximately) normally distributed
    ■ which is used with the (modified) random hyperplane method

Image Similarity Search
● kernelized LSH (example)
  ○ 80 million images; a 384-dimensional vector extracted from each
  ○ image → GIST descriptor → Gaussian RBF kernel
  ○ only 0.098% of all images searched
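
The Gaussian RBF kernel at the end of that pipeline is simple to write down: $k(x, y) = \exp(-\gamma \|x - y\|^2)$. A minimal sketch; the bandwidth parameter `gamma` and its default are assumptions, since the slide gives no value, and the two short vectors stand in for real 384-dimensional descriptors.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian RBF kernel: exp(-gamma * squared Euclidean distance)."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * squared_distance)


identical = rbf_kernel([0.2, 0.5], [0.2, 0.5])          # identical descriptors
one_apart = rbf_kernel([0.0, 0.0], [1.0, 0.0])          # squared distance 1
```

The kernel value is 1 for identical descriptors and decays towards 0 as they move apart.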

References
● Datar, M. and Indyk, P. Locality-Sensitive Hashing Scheme Based on p-Stable Distributions. ACM Press, 2004.
● Jing, Y. and Baluja, S. VisualRank: Applying PageRank to Large-Scale Image Search. 2008.
● Gorisse, D., Cord, M. and Precioso, F. Locality-Sensitive Hashing for Chi2 Distance. 2012.
● Kulis, B. and Grauman, K. Kernelized Locality-Sensitive Hashing. 2012.
● Ullman, J., Rajaraman, A. and Leskovec, J. Mining of Massive Datasets. 2010.
