Importance Sampling via Locality Sensitive Hashing
Anshumali Shrivastava (Rice University), anshumali@rice.edu
COMP 480/580, 7th March 2019
Motivating Problem: Stochastic Gradient Descent

\theta^* = \arg\min_\theta F(\theta) = \arg\min_\theta \frac{1}{N} \sum_{i=1}^{N} f(x_i, \theta) \qquad (1)

Standard GD:

\theta_t = \theta_{t-1} - \eta_t \frac{1}{N} \sum_{i=1}^{N} \nabla f(x_i, \theta_{t-1}) \qquad (2)

SGD: pick a random x_j and update

\theta_t = \theta_{t-1} - \eta_t \nabla f(x_j, \theta_{t-1}) \qquad (3)

SGD is preferred over GD in large-scale optimization: convergence per epoch is slower, but each iteration is O(N) times cheaper, so overall convergence is faster.
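To make updates (2) and (3) concrete, here is a minimal sketch (not part of the original slides), assuming NumPy and a least-squares loss f(x_i, θ) = (θ·x_i − y_i)²; the function names are illustrative only.

```python
import numpy as np

def gd_step(theta, X, y, lr):
    # Full-gradient step (2): average of all N per-example gradients, O(N) work per step.
    grads = 2 * (X @ theta - y)[:, None] * X      # per-example gradients, shape (N, D)
    return theta - lr * grads.mean(axis=0)

def sgd_step(theta, X, y, lr, rng):
    # Stochastic step (3): one uniformly sampled example, O(1) examples per step.
    j = rng.integers(len(X))
    grad_j = 2 * (X[j] @ theta - y[j]) * X[j]
    return theta - lr * grad_j
```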
Better SGD?

Why does SGD work? It is an unbiased estimator:

E[\nabla f(x_j, \theta_{t-1})] = \frac{1}{N} \sum_{i=1}^{N} \nabla f(x_i, \theta_{t-1}). \qquad (4)

Are there better estimators? YES!! Pick x_i with probability proportional to w_i.

Optimal variance (Alain et al. 2015): w_i = ||\nabla f(x_i, \theta_{t-1})||_2.

Many works on other importance weights (e.g., works by Rachel Ward).

The Chicken-and-Egg Loop: maintaining the w_i requires O(N) work. For least squares, w_i = ||\nabla f(x_i, \theta_t)||_2 = |2(\theta_t \cdot x_i - y_i)| \, ||x_i||_2, which changes in every iteration.

Can we break this chicken-and-egg loop? Can we get adaptive sampling in constant time, O(1) per iteration, similar to the cost of SGD? A sketch of the sampling scheme and its O(N) bottleneck follows below.
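A hedged sketch of the importance-sampled update described above (illustrative only, not the constant-time method this talk builds toward): sample x_i with probability p_i proportional to w_i and rescale the gradient by 1/(N p_i) so the update stays unbiased. Recomputing every w_i is exactly the O(N)-per-iteration chicken-and-egg cost.

```python
import numpy as np

def importance_sgd_step(theta, X, y, lr, rng):
    # Least-squares weights: w_i = |2(theta . x_i - y_i)| * ||x_i||_2, which change every iteration.
    residuals = X @ theta - y
    w = np.abs(2 * residuals) * np.linalg.norm(X, axis=1)
    p = w / w.sum()                                   # O(N) work: the bottleneck
    i = rng.choice(len(X), p=p)
    grad_i = 2 * residuals[i] * X[i]
    # Rescaling by 1/(N p_i) keeps the expected update equal to the full gradient.
    return theta - lr * grad_i / (len(X) * p[i])
```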
Detour: Probabilistic Hashing
Probabilistic Fingerprinting (Hashing)

Hashing: a (randomized) function h that maps a given data object (say x ∈ R^D) to an integer key, h : R^D → {0, 1, 2, ..., N}. h(x) serves as a discrete fingerprint.

Locality Sensitive Property:
- If Sim(x, y) is high, then Pr(h(x) = h(y)) is high.
- If Sim(x, y) is low, then Pr(h(x) = h(y)) is low.

Similar points are more likely to have the same hash value (a hash collision) than dissimilar points.

[Figure: similar points are likely, and dissimilar points unlikely, to map to the same bucket under h.]
Popular Hashing Scheme 1: SimHash (SRP)

h_r(x) = \begin{cases} 1 & \text{if } r^T x \ge 0 \\ 0 & \text{otherwise} \end{cases}, \qquad r \in R^D \sim N(0, I)

\Pr_r(h_r(x) = h_r(y)) = 1 - \frac{1}{\pi} \cos^{-1}(\theta), which is monotonic in θ (the cosine similarity).

A classical result from Goemans-Williamson (1995).

[Figure: a random hyperplane with normal r splits the space into the regions r^T x > 0 and r^T x < 0.]
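A minimal sketch of signed random projections (assuming NumPy; variable names are illustrative): each bit is the sign of a projection onto an independent Gaussian direction, and the empirical collision rate between two vectors approaches 1 − (1/π) cos⁻¹(θ).

```python
import numpy as np

def srp_hash(x, R):
    # R: (num_bits, D) matrix with i.i.d. N(0, 1) rows; returns one sign bit per row.
    return (R @ x >= 0).astype(np.uint8)

rng = np.random.default_rng(0)
D, num_bits = 128, 1024
R = rng.standard_normal((num_bits, D))

x = rng.standard_normal(D)
y = x + 0.1 * rng.standard_normal(D)              # a near-duplicate of x
collision_rate = np.mean(srp_hash(x, R) == srp_hash(y, R))
# collision_rate concentrates around 1 - angle(x, y) / pi.
```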
Some Popular Measures that are Hashable

Many popular measures:
- Jaccard similarity (MinHash)
- Cosine similarity (SimHash, and also MinHash if the data is binary)
- Euclidean distance
- Earth Mover Distance, etc.
- Recently, un-normalized inner products¹
  1. With a bounded-norm assumption.
  2. Allowing asymmetry.

¹ SL [NIPS 14 (Best Paper), UAI 15, WWW 15], APRS [PODS 16].
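For the first item above, a hedged MinHash sketch (assuming NumPy; the hash-family parameters a and b are arbitrary illustrative choices): the probability that two sets agree on a single MinHash value equals their Jaccard similarity.

```python
import numpy as np

PRIME = 2_147_483_647

def minhash_signature(items, a, b):
    # One signature entry per hash function: the minimum of (a*x + b) mod PRIME over the set.
    items = np.asarray(list(items), dtype=np.int64)
    return ((np.outer(a, items) + b[:, None]) % PRIME).min(axis=1)

rng = np.random.default_rng(1)
num_hashes = 128
a = rng.integers(1, PRIME, size=num_hashes)
b = rng.integers(0, PRIME, size=num_hashes)

s1, s2 = {1, 2, 3, 4, 5}, {3, 4, 5, 6}
match_rate = np.mean(minhash_signature(s1, a, b) == minhash_signature(s2, a, b))
# E[match_rate] = |s1 ∩ s2| / |s1 ∪ s2| = 3/6 = 0.5, the Jaccard similarity.
```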
Sub-linear Near-Neighbor Search

Given a query q ∈ R^D and a giant collection C of N vectors in R^D, search for p ∈ C such that

p = \arg\max_{x \in C} sim(q, x)

where sim is a similarity such as cosine similarity, resemblance, etc.

Worst case is O(N) for any query. N is huge, and querying is a very frequent operation.

Our goal is to find a sub-linear query time algorithm:
1. An approximate (or inexact) answer suffices.
2. We are allowed to pre-process C once (a costly offline step).
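For reference, the O(N) brute-force baseline that a sub-linear algorithm must beat (a minimal sketch, assuming NumPy and cosine similarity):

```python
import numpy as np

def brute_force_nn(q, C):
    # C: (N, D) database matrix, q: (D,) query; exact linear scan, O(N D) per query.
    sims = (C @ q) / (np.linalg.norm(C, axis=1) * np.linalg.norm(q))
    return int(np.argmax(sims))
```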
Probabilistic Hash Tables

Given: Pr_h(h(x) = h(y)) = f(sim(x, y)), where f is monotonic.

[Figure: two hash functions h_1, h_2 : R^D → {0, 1, 2, 3} index a table of buckets (pointers only); each bucket is keyed by the concatenation of h_1(x) and h_2(x) (00 00, 00 01, ..., 11 11), and some buckets are empty.]

Given a query q, if h_1(q) = 11 and h_2(q) = 01, then probe the bucket with index 1101. It is a good bucket!!

(Locality Sensitive) h_i(q) = h_i(x) is a noisy indicator of high similarity. Doing better than random!!
The Classical LSH Algorithm

[Figure: L independent hash tables (Table 1, ..., Table L); each bucket in a table is keyed by the concatenation of K hash values, and some buckets are empty.]

We use a concatenation of K hashes as the bucket index.

Repeat the process L times (L independent hash tables).

Querying: probe one bucket from each of the L tables and report the union.

Two knobs, K and L, to control. A minimal index sketch follows below.
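A hedged sketch of a (K, L)-parameterized index built on SRP hashes (illustrative; the class and method names are not from the slides): each of the L tables keys a point by the concatenation of K hash bits, and a query probes one bucket per table and returns the union of candidates.

```python
from collections import defaultdict
import numpy as np

class LSHIndex:
    def __init__(self, dim, K, L, seed=0):
        rng = np.random.default_rng(seed)
        # One (K, dim) Gaussian projection matrix per table => K concatenated SRP bits.
        self.projections = [rng.standard_normal((K, dim)) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, t, x):
        bits = (self.projections[t] @ x >= 0).astype(np.uint8)
        return bits.tobytes()                     # the K-bit meta-hash is the bucket index

    def add(self, idx, x):
        for t in range(len(self.tables)):
            self.tables[t][self._key(t, x)].append(idx)

    def query(self, q):
        # Union of the L probed buckets: a candidate set for exact re-ranking.
        candidates = set()
        for t in range(len(self.tables)):
            candidates.update(self.tables[t].get(self._key(t, q), []))
        return candidates
```

Intuitively, larger K makes each bucket purer (fewer spurious candidates), while larger L raises the chance that a true near neighbor collides with the query in at least one table.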
Success of LSH

Similarity search or related problems (reduce n): plenty of applications.²

² Li et al., NIPS 2011.