Querying and Mining Data Streams
Elena Ikonomovska
Jožef Stefan Institute – Department of Knowledge Technologies
Advanced School on Data Exchange, Integration, and Streams – Dagstuhl, November 2010
Outline
- Definitions
  - Data stream models
  - Similarity measures
- Historical background
- Foundations
  - Estimating the L_2 distance
  - Estimating the Jaccard similarity: min-wise hashing
- Key applications
  - Maintaining statistics on streams
  - Hot items
- Some advanced results (Appendix)
  - Estimating rarity and similarity (the windowed model)
  - Tight bounds for approximate histograms and cluster-based summaries
Data stream models: Time series model
- A stream is a vector / point in space.
- Items arrive in order of their indices: {x_1, x_2, x_3, ...} are the coordinates of the vector, so the value of the i-th item is the value of the i-th coordinate.
- The distance (similarity) between two streams is the distance between the two points.
Data stream models: Turnstile model
- Each arriving item is an update to some component of the vector, e.g. the update (2, 4) adds 4 to the 2nd component:
  [10 5 24 12] ⇒ [10 9 24 12]
- x_i^(j) denotes the j-th update to the i-th component (e.g. x_2^(5) is the 5th update to the 2nd component); the current value is x_i = x_i^(1) + x_i^(2) + x_i^(3) + ...
- Updates may be positive or negative; only nonnegative updates ⇒ cash register model.
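A minimal sketch of these update semantics (the function name and the explicit in-memory vector are illustrative; a streaming algorithm would summarize the vector rather than store it):

```python
def apply_update(vector, i, delta, cash_register=False):
    """Apply the stream update (i, delta) to the (1-indexed) component i."""
    if cash_register and delta < 0:
        raise ValueError("the cash register model allows only nonnegative updates")
    vector[i - 1] += delta

v = [10, 5, 24, 12]
apply_update(v, 2, 4)   # the update (2, 4) from the slide
print(v)                # [10, 9, 24, 12]
```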
L_p distances (p ≥ 0)
Stream 1 = {x_1, x_2, x_3, ...} and stream 2 = {y_1, y_2, y_3, ...}, with items in {1, ..., m}
- L_p = (Σ_i |x_i - y_i|^p)^(1/p)
- L_0 distance (Hamming distance) ⇔ the number of indices i such that x_i ≠ y_i; a measure of (dis)similarity of two streams [CDI’02]
- L_∞ = max_i |x_i - y_i|
- L_2 distance = (Σ_i |x_i - y_i|^2)^(1/2)
- L_2 norm (F_2) – for approximating self-join sizes [AGM’99]: Q = COUNT(R ⋈_A R), with |dom(A)| = m
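For reference, a short sketch computing these distances naively on two materialized vectors – exactly the computation the streaming algorithms below approximate without storing the vectors:

```python
def lp_distance(x, y, p):
    """L_p = (sum_i |x_i - y_i|^p)^(1/p), for p > 0."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

def l0_distance(x, y):
    """Hamming distance: the number of indices i with x_i != y_i."""
    return sum(1 for a, b in zip(x, y) if a != b)

def linf_distance(x, y):
    """L_inf = max_i |x_i - y_i|."""
    return max(abs(a - b) for a, b in zip(x, y))

x, y = [1, 5, 3, 2], [1, 4, 3, 6]
print(lp_distance(x, y, 2), l0_distance(x, y), linf_distance(x, y))
# 4.123...  2  4
```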
Basic requirements
Naïve approach: store the points/vectors in memory and compute any distance/similarity measure or statistic (norm, frequency moment).
Typically:
- Large quantities of data – a single pass
- Memory is constrained – O(log m)
- Real-time answers – linear-time algorithms, O(n)
- Approximate answers are allowed – (ε, δ), where ε and δ are user-specified parameters
Historical background
- [AMS’96] approximate F_2 (inserts only)
- [AGM’99] approximate L_2 norm (inserts and deletes)
- [FKS’99] approximate L_1 distance
- [Indyk’00] approximate L_p distance for p ∈ (0, 2], via p-stable distributions (Cauchy is 1-stable, Gaussian is 2-stable)
- [CDI’02] efficient approximation of the L_0 distance
- Approximate distances on windowed streams:
  - [DGI’02] approximate L_p distance
  - [Datar-Muthukrishnan’02] approximate Jaccard similarity
Estimating the L_2 distance [AGM’99]
- Data streams (x_1, x_2, ..., x_n) and (y_1, y_2, ..., y_n)
- For each i = 1, 2, ..., n define an i.i.d. random variable X_i with P[X_i = 1] = P[X_i = -1] = 1/2, so E[X_i] = 0
- Basic idea: simply maintain Z = Σ_{i=1..n} X_i (x_i - y_i); for items (i, x_i^(j)) and (i, y_i^(j)), X_i · x_i^(j) is added and X_i · y_i^(j) is subtracted
- E[(Σ_{i=1..n} X_i (x_i - y_i))^2] = E[Σ_{i=1..n} X_i^2 (x_i - y_i)^2 + Σ_{i≠j} X_i X_j (x_i - y_i)(x_j - y_j)] = Σ_{i=1..n} (x_i - y_i)^2, since E[X_i^2] = 1 and E[X_i X_j] = 0 for i ≠ j
- The problem thus amounts to obtaining an unbiased estimate of the squared L_2 distance
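A minimal sketch of this basic idea, assuming the ±1 variables are stored explicitly (the actual algorithm generates them pseudorandomly from a short seed, as discussed two slides below; the class and method names are illustrative):

```python
import random

class L2Sketch:
    """One atomic AMS-style estimator of the squared L2 distance."""

    def __init__(self, n, seed=0):
        rng = random.Random(seed)
        # X_i in {-1, +1} with probability 1/2 each, so E[X_i] = 0
        self.X = [rng.choice([-1, 1]) for _ in range(n)]
        self.Z = 0.0  # running sum  sum_i X_i (x_i - y_i)

    def update_x(self, i, value):
        self.Z += self.X[i] * value   # item (i, x_i^(j)): add X_i * value

    def update_y(self, i, value):
        self.Z -= self.X[i] * value   # item (i, y_i^(j)): subtract X_i * value

    def estimate(self):
        return self.Z ** 2            # unbiased: E[Z^2] = sum_i (x_i - y_i)^2
```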
Standard boosting technique
Run the algorithm in parallel k = Θ(1/ε^2) times:
1. Maintain the sums Σ_{i=1..n} X_{i,k} (x_i - y_i) for k different random assignments of the random variables X_{i,k}
2. Take the average of their squares for a given run r ⇒ v^(r) (reduces the variance/error – Chebyshev)
3. Repeat the procedure l = Θ(log(1/δ)) times, with random variables X_{i,k,l}
4. Output the median over {v^(1), v^(2), ..., v^(l)} – Chernoff
5. This maintains nkl values in parallel for the random variables
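A sketch of this boosting step built on the L2Sketch class from the previous sketch: average k independent squared sums to shrink the variance (Chebyshev), then take the median of l such averages to boost the confidence (Chernoff). The use of fully materialized vectors is for illustration only:

```python
import statistics

def boosted_l2_squared(x, y, k, l):
    """Median of l averages of k independent atomic estimates of ||x - y||^2."""
    n = len(x)
    averages = []
    for run in range(l):
        estimates = []
        for rep in range(k):
            sk = L2Sketch(n, seed=run * k + rep)  # fresh random signs per copy
            for i in range(n):
                sk.update_x(i, x[i])
                sk.update_y(i, y[i])
            estimates.append(sk.estimate())
        averages.append(sum(estimates) / k)       # variance reduction
    return statistics.median(averages)            # confidence boosting
```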
Result
- The Chebyshev inequality + Chernoff ⇒ this estimates the square of L_2 within a (1 ± ε) factor with probability > 1 - δ
- Random variables needed: nkl! But the random variables can be four-wise independent – this is enough for Chebyshev to still hold [AMS’96] – and can therefore be pseudorandomly generated on the fly
- Space: O(kl) = O(1/ε^2 · log(1/δ)) words, plus an array of seeds of logarithmic length, O(log m)
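A sketch of one standard way to generate such four-wise independent signs on the fly: evaluate a random degree-3 polynomial over a prime field at i. The slides do not fix a particular construction; the Mersenne prime and the bit-extraction step here are illustrative, and the extracted bit is only nearly unbiased (bias O(1/P)):

```python
import random

P = 2**31 - 1  # a Mersenne prime (illustrative choice)

class FourWiseSigns:
    """Four-wise independent ±1 values from a random degree-3 polynomial mod P."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.coeffs = [rng.randrange(P) for _ in range(4)]  # the O(log m) seed

    def sign(self, i):
        h = 0
        for c in self.coeffs:        # Horner evaluation of the polynomial at i
            h = (h * i + c) % P
        return 1 if h & 1 else -1    # extract one (nearly unbiased) bit
```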
Estimating the L_p distance: p-stable distributions [I’00]
D is a p-stable distribution if, for all real numbers a_1, a_2, ..., a_k: if X_1, X_2, ..., X_k are i.i.d. random variables drawn from D, then Σ_i a_i X_i has the same distribution as X · (Σ_i |a_i|^p)^(1/p), for a random variable X with distribution D.
- The Cauchy distribution is 1-stable ⇒ L_1
- The Gaussian distribution is 2-stable ⇒ L_2
The algorithm
- (z_1, z_2, ..., z_n) is the stream vector
- Again: run k = Θ(1/ε^2 · log(1/δ)) procedures in parallel and maintain the sums Σ_i z_i X_i for each run 1, ..., k
- The value of Σ_i z_i X_i in the l-th run is Z^(l); Z^(l) is a random variable itself
- Let D be p-stable: then Z^(l) = X^(l) · (Σ_i |z_i|^p)^(1/p), for some random variable X^(l) drawn from D
Estimating the L_p distance (cont.)
- The output is (1/γ) · median{|Z^(1)|, |Z^(2)|, ..., |Z^(k)|}, where γ is the median of |X| for a random variable X distributed according to D
- Chebyshev: this estimate is within a multiplicative factor (1 ± ε) of the true norm, with probability 1 - δ
- Observation [CDI’02]: L_p is a good approximation of the L_0 norm for p sufficiently small, e.g. p = ε/log(m), where m is the maximum absolute value of any item in the stream
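A sketch of the construction for p = 1, where D is the standard Cauchy distribution: the median of |X| for standard Cauchy X is tan(π/4) = 1, so γ = 1 and no rescaling is needed. Function names and the materialized vector are illustrative:

```python
import math, random, statistics

def l1_estimate(z, k, seed=0):
    """Estimate ||z||_1 from k sums sum_i z_i X_i with X_i ~ Cauchy (1-stable)."""
    rng = random.Random(seed)
    sums = []
    for _ in range(k):
        # standard Cauchy samples via the inverse CDF: tan(pi * (U - 1/2))
        X = [math.tan(math.pi * (rng.random() - 0.5)) for _ in z]
        sums.append(sum(zi * xi for zi, xi in zip(z, X)))
    # each sum is distributed as X * ||z||_1 with X standard Cauchy,
    # and median(|X|) = 1, so gamma = 1 here
    return statistics.median(abs(s) for s in sums)

print(l1_estimate([3, -1, 4, 1, -5], k=1001))  # close to ||z||_1 = 14 for large k
```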
The Jaccard similarity
- Streams S_A = {a_1, a_2, ..., a_n} and S_B = {b_1, b_2, ..., b_n}; let A (and B) denote the set of distinct elements
- Jaccard similarity: simJ(A, B) = |A ∩ B| / |A ∪ B|
Example (view the sets as 0/1 columns over the m = 6 items):

          A  B
  item 1  0  1
  item 2  1  0
  item 3  1  1
  item 4  1  1
  item 5  0  0
  item 6  0  1

|A ∩ B| = 2 and |A ∪ B| = 5, so simJ(A, B) = 2/5 = 0.4
Signature idea
- Represent the sets A and B by signatures Sig(A) and Sig(B), and compute the similarity over the signatures: E[simH(Sig(A), Sig(B))] = simJ(A, B)
- Simplest approach: sample the sets (rows) uniformly at random k times to get a k-bit signature Sig (instead of m bits)
- Problems! Sparsity – sampling might miss important information
Tool: Min-Wise Hashing
- π – a randomly chosen permutation over {1, ..., m}
- For any subset A ⊆ [m], the min-hash of A is h_π(A) = min_{i ∈ A} {π(i)} – the index of the first row with value 1 in a random permutation of the rows
- Each min-hash value gives one entry of the k-entry signature of A, Sig(A)
- When π is chosen uniformly at random from the set of all permutations on [m], then for any two subsets A, B of [m]:
  Pr[h_π(A) = h_π(B)] = |A ∩ B| / |A ∪ B|
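A minimal sketch of min-wise hashing with explicit random permutations (practical implementations replace them with hash functions; the function names are illustrative):

```python
import random

def minhash_signature(A, m, k, seed=0):
    """k min-hash values of A ⊆ {1, ..., m}; use the same seed for both
    sets so that they see the same k random permutations."""
    rng = random.Random(seed)
    sig = []
    for _ in range(k):
        order = list(range(1, m + 1))
        rng.shuffle(order)                            # random permutation of the rows
        sig.append(next(i for i in order if i in A))  # first row belonging to A
    return sig

def estimate_jaccard(sig_a, sig_b):
    """Fraction of permutations on which the min-hash values agree."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

A, B = {1, 3, 4}, {1, 2, 5}   # the sets from the example below
print(estimate_jaccard(minhash_signature(A, 5, 1000),
                       minhash_signature(B, 5, 1000)))  # close to simJ = 0.2
```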
Example
Consider the following permutations for m = 5, each written as the rows in permuted order:
- k = 1: π_1 = (1 2 3 4 5)
- k = 2: π_2 = (5 4 3 2 1)
- k = 3: π_3 = (3 4 5 1 2)
and the sets A = {1, 3, 4} and B = {1, 2, 5}.
The min-hash values (the first row in the permuted order that belongs to the set) are:
- k = 1: h_1(A) = 1, h_1(B) = 1
- k = 2: h_2(A) = 4, h_2(B) = 5
- k = 3: h_3(A) = 3, h_3(B) = 5
⇒ the expectation of the fraction of permutations on which the min-hash values agree is simJ(A, B); here they agree on 1 of 3 permutations, an estimate of the true value simJ(A, B) = |{1}| / |{1, 2, 3, 4, 5}| = 1/5.