QUERYING AND MINING DATA STREAMS


  1. Advanced School on Data Exchange, Integration, and Streams - Dagstuhl, November 2010. QUERYING AND MINING DATA STREAMS. Elena Ikonomovska, Jožef Stefan Institute – Department of Knowledge Technologies

  2. Outline
  - Definitions
    - Data stream models
    - Similarity measures
  - Historical background
  - Foundations
    - Estimating the L_2 distance
    - Estimating the Jaccard similarity: min-wise hashing
  - Key applications
    - Maintaining statistics on streams
    - Hot items
  - Some advanced results (Appendix)
    - Estimating rarity and similarity (the windowed model)
    - Tight bounds for approximate histograms and cluster-based summaries

  3. Data stream models: Time series model
  - A stream is a vector / a point in space
  - Items arrive in order of their indices: {x_1, x_2, x_3, ...} are the coordinates of the vector
  - The value of the i-th item is the value of the i-th coordinate of the vector
  - The distance (similarity) between two streams is the distance between the two points

  4. Data stream models: Turnstile model
  - Each arriving item is an update to some component of the vector: e.g., the update (2, +4) turns (10, 5, 24, 12) into (10, 9, 24, 12)
  - (2, x_2^(5)) indicates the 5th update to the 2nd component of the vector
  - Value: x_i = x_i^(1) + x_i^(2) + x_i^(3) + ...
  - Updates may be positive or negative
  - Only nonnegative updates ⇒ cash register model
  (A minimal sketch of this model follows below.)
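A small Python illustration of the turnstile model, replaying the slide's example update; the function name and the cash-register flag are illustrative, not from the slides:

```python
# Turnstile model: each arriving item (i, delta) updates component i of the vector.

def apply_update(x, i, delta, cash_register=False):
    """Apply the update (i, delta); i is 1-based, as on the slide."""
    if cash_register and delta < 0:
        raise ValueError("the cash register model allows only nonnegative updates")
    x[i - 1] += delta

x = [10, 5, 24, 12]        # the vector from the slide
apply_update(x, 2, 4)      # the update (2, +4)
print(x)                   # [10, 9, 24, 12]
```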

  5. L_p distances (p ≥ 0)
  - Stream 1 {x_1, x_2, x_3, ...} & stream 2 {y_1, y_2, y_3, ...} with values in {1, ..., m}
  - L_p = (Σ_i |x_i − y_i|^p)^(1/p)
  - L_0 distance (Hamming distance) ⇔ the number of indices i such that x_i ≠ y_i; a measure of (dis)similarity of two streams [CDI02]
  - L_∞ = max_i |x_i − y_i|
  - L_2 = (Σ_i |x_i − y_i|^2)^(1/2) distance
  - L_2 norm (F_2) for approximating self-join sizes [AGM'99]: Q = COUNT(R ⋈_A R), |dom(A)| = m
  (These distances are computed exactly in the sketch below.)
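For reference, a sketch of the exact (offline) computations of these distances on two in-memory streams; the streaming algorithms on the following slides approximate these without storing x and y:

```python
# Exact L_p distances between two in-memory streams.

def l_p(x, y, p):
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1 / p)

def l_0(x, y):      # Hamming distance: number of indices where the streams differ
    return sum(a != b for a, b in zip(x, y))

def l_inf(x, y):
    return max(abs(a - b) for a, b in zip(x, y))

x, y = [1, 5, 3, 7], [1, 2, 3, 9]
print(l_0(x, y), l_p(x, y, 2), l_inf(x, y))   # 2  3.605...  3
```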

  6. Basic requirements
  - Naïve approach: store the points/vectors in memory and compute any distance/similarity measure or statistic (norm, frequency moment)
  - Typically:
    - Large quantities of data – single pass
    - Memory is constrained – O(log m)
    - Real-time answers – linear-time algorithms, O(n)
    - Approximate answers are allowed: (ε, δ)
  - ε & δ are user-specified parameters

  7. Historical background
  - [AMS'96] approximate F_2 (inserts only)
  - [AGM'99] approximate L_2 norm (inserts and deletes)
  - [FKS'99] approximate L_1 distance
  - [Indyk'00] approximate L_p distance for p ∈ (0, 2]
    - p-stable distributions (Cauchy is 1-stable, Gaussian is 2-stable)
  - [CDI'02] efficient approximation of the L_0 distance
  - Approximate distances on windowed streams:
    - [DGI'02] approximate L_p distance
    - [Datar-Muthukrishnan'02] approximate Jaccard similarity

  8. Estimating the L_2 distance [AGM'99]
  - Data streams (x_1, x_2, ..., x_n) and (y_1, y_2, ..., y_n)
  - For each i = 1, 2, ..., n define an i.i.d. random variable X_i with P[X_i = 1] = P[X_i = −1] = 1/2 ⇒ E[X_i] = 0
  - Basic idea: simply maintain Z = Σ_{i=1,...,n} X_i (x_i − y_i)
  - For items (i, x_i^(j)) and (i, y_i^(j)): X_i · x_i^(j) is added and X_i · y_i^(j) is subtracted
  - E[Z^2] = E[Σ_{i=1,...,n} X_i^2 (x_i − y_i)^2 + Σ_{i≠j} X_i X_j (x_i − y_i)(x_j − y_j)] = Σ_{i=1,...,n} (x_i − y_i)^2, since X_i^2 = 1 and E[X_i X_j] = 0 for i ≠ j
  - The problem amounts to obtaining an unbiased estimate (one counter is sketched below)
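A sketch of a single counter in this style; the helper names, the update format, and the example values are illustrative, not from the slides:

```python
import random

# One AGM-style counter: maintain Z = Σ_i X_i (x_i - y_i) with X_i in {-1, +1}.
# Z² is an unbiased estimate of the squared L_2 distance, because the cross
# terms X_i X_j vanish in expectation.

def make_signs(n, seed):
    rng = random.Random(seed)
    return [rng.choice((-1, 1)) for _ in range(n)]

def sketch_counter(updates_x, updates_y, signs):
    """updates are (index, value) pairs; x-updates add X_i*v, y-updates subtract it."""
    z = 0.0
    for i, v in updates_x:
        z += signs[i] * v
    for i, v in updates_y:
        z -= signs[i] * v
    return z

# x = (1, 5, 0, 0) and y = (1, 2, 0, 0) as streams of updates
signs = make_signs(4, seed=0)
z = sketch_counter([(0, 1), (1, 5)], [(0, 1), (1, 2)], signs)
print(z * z)   # one sample whose expectation is L_2(x, y)² = 9
```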

  9. Standard boosting technique
  Run the algorithm in parallel k = Θ(1/ε^2) times:
  1. Maintain the sums Σ_{i=1,...,n} X_i (x_i − y_i) for k different random assignments of the random variables X_{i,k}
  2. Take the average of their squares for a given run r ⇒ v^(r) (reduces the variance/error! – Chebyshev)
  3. Repeat the procedure l = Θ(log(1/δ)) times (random variables X_{i,k,l})
  4. Output the median over {v^(1), v^(2), ..., v^(l)} (Chernoff)
  5. Maintains nkl values in parallel for the random variables
  (A sketch of this average-then-median boosting follows below.)
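A hedged sketch of the boosting step, reusing the sketch_counter helper and update format from the previous sketch; the concrete k and l here are illustrative rather than tuned to any particular (ε, δ):

```python
import random, statistics

def estimate_l2_squared(updates_x, updates_y, n, k, l, seed=0):
    rng = random.Random(seed)
    means = []
    for _ in range(l):                       # l independent groups (median: Chernoff)
        squares = []
        for _ in range(k):                   # k counters per group (average: Chebyshev)
            signs = [rng.choice((-1, 1)) for _ in range(n)]
            squares.append(sketch_counter(updates_x, updates_y, signs) ** 2)
        means.append(sum(squares) / k)
    return statistics.median(means)

# Same streams as above; the true squared L_2 distance is 9
print(estimate_l2_squared([(0, 1), (1, 5)], [(0, 1), (1, 2)], n=4, k=50, l=5))
```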

  10. Result
  - The Chebyshev inequality + Chernoff bounds ⇒ this estimates the square of L_2 within a (1 ± ε) factor with probability > (1 − δ)
  - Random variables needed: nkl!
  - The random variables can be four-wise independent
    - This is enough so that Chebyshev still holds [AMS'96]
    - They can be pseudorandomly generated on the fly
  - O(kl) = O(1/ε^2 · log(1/δ)) words + a logarithmic-length array of seeds, O(log m)

  11. Estimating the L_p distance
  - p-stable distributions [I'00]: D is a p-stable distribution if, for all real numbers a_1, a_2, ..., a_k and i.i.d. random variables X_1, X_2, ..., X_k drawn from D, Σ_i a_i X_i has the same distribution as X · (Σ_i |a_i|^p)^(1/p) for a random variable X with distribution D
  - Cauchy distribution is 1-stable → L_1
  - Gaussian distribution is 2-stable → L_2
  (An empirical check of 1-stability is sketched below.)
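A quick numerical check of the 1-stable case; the coefficients and sample count are arbitrary choices for illustration:

```python
import numpy as np

# Empirical check of 1-stability: for i.i.d. Cauchy X_i and any coefficients a_i,
# sum(a_i * X_i) is distributed as (sum |a_i|) * X. Since median|X| = 1 for a
# standard Cauchy variable, the sample median of |sum| should be close to sum |a_i|.

rng = np.random.default_rng(0)
a = np.array([2.0, -1.0, 3.0])                 # arbitrary coefficients
samples = rng.standard_cauchy((100_000, 3)) @ a
print(np.median(np.abs(samples)))              # ≈ Σ|a_i| = 6
```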

  12. The algorithm
  - z_1, z_2, ..., z_n is the stream vector
  - Again... run in parallel k = Θ(1/ε^2 · log(1/δ)) procedures & maintain the sums Σ_i z_i X_i for each run 1, ..., k
  - The value of Σ_i z_i X_i in the l-th run is Z^(l)
  - Z^(l) is a random variable itself
  - Let D be p-stable: Z^(l) = X^(l) · (Σ_i |z_i|^p)^(1/p) for some random variable X^(l) drawn from D

  13. Estimating the L_p distance cont.
  - The output is: (1/γ) · median{|Z^(1)|, |Z^(2)|, ..., |Z^(k)|}, where γ is the median of |X| for a random variable X distributed according to D
  - Chebyshev: this estimate is within a multiplicative factor (1 ± ε) of the true norm with probability (1 − δ)
  - Observation [CDI'02]:
    - L_p is a good approximation of the L_0 norm for p sufficiently small
    - p = ε / log(m), where m is the maximum absolute value of any item in the stream
  (A Cauchy-based L_1 instance of this sketch follows below.)
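A minimal Cauchy-based (p = 1) instance of this sketch; the stream vector and the choice k = 400 are illustrative:

```python
import numpy as np

# Maintain k sums Z^(l) = Σ_i z_i X_i^(l) with Cauchy X_i^(l), then output
# (1/γ)·median{|Z^(l)|}. For the standard Cauchy distribution γ = median|X| = 1.

rng = np.random.default_rng(1)
n, k = 6, 400
X = rng.standard_cauchy((k, n))                 # one row of Cauchy variables per run

z = np.array([3.0, -1.0, 0.0, 2.0, 0.0, -4.0])  # stream vector (e.g., x - y)
Z = X @ z                                       # the k sums; each is maintainable online
gamma = 1.0                                     # median of |X| for the standard Cauchy
print(np.median(np.abs(Z)) / gamma)             # ≈ L_1(z) = 10
```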

  14. The Jaccard similarity
  - S_A = {a_1, a_2, ..., a_n}, S_B = {b_1, b_2, ..., b_n}
  - Let A (and B) denote the set of distinct elements
  - |A ∩ B| / |A ∪ B| = Jaccard similarity
  - Example (view the sets as columns), m = 6:

    item  A  B
    1     0  1
    2     1  0
    3     1  1
    4     1  1
    5     0  0
    6     0  1

    |A ∪ B| = 5, |A ∩ B| = 2, simJ(A, B) = 2/5 = 0.4
  (See the snippet below.)
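The same example with the columns read as Python sets; the row indices follow the table reconstructed above:

```python
# Each set holds the items with a 1 in its column.

A = {2, 3, 4}            # items with a 1 in column A
B = {1, 3, 4, 6}         # items with a 1 in column B

print(len(A & B) / len(A | B))   # 2/5 = 0.4
```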

  15. Signature idea
  - Represent the sets A and B by signatures Sig(A) and Sig(B)
  - Compute the similarity over the signatures
  - E[simH(Sig(A), Sig(B))] = simJ(A, B)
  - Simplest approach: sample the sets (rows) uniformly at random k times to get a k-bit signature Sig (instead of m bits)
  - Problems! Sparsity – sampling might miss important information

  16. Tool: Min-Wise Hashing
  - π – a randomly chosen permutation over {1, ..., m}
  - For any subset A ⊆ [m] the min-hash of A is: h_π(A) = min_{i ∈ A} {π(i)}
    - The index of the first row with value 1 under a random permutation of the rows
    - One bit of the k-bit signature of A, Sig(A)
  - When π is chosen uniformly at random from the set of all permutations on [m], then for any two subsets A, B of [m]: Pr[h_π(A) = h_π(B)] = |A ∩ B| / |A ∪ B|
  (A small simulation of this identity follows below.)
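A simulation of the agreement probability, using the sets from the Jaccard example; the trial count is an arbitrary illustrative choice:

```python
import random

# Estimate Pr[h_pi(A) = h_pi(B)] over random permutations of {1, ..., m};
# it should approach |A ∩ B| / |A ∪ B|.

def min_hash(A, pi):
    return min(pi[i] for i in A)

m, trials = 6, 10_000
A, B = {2, 3, 4}, {1, 3, 4, 6}
rng = random.Random(0)
agree = 0
for _ in range(trials):
    ranks = list(range(1, m + 1))
    rng.shuffle(ranks)
    pi = {item: rank for item, rank in zip(range(1, m + 1), ranks)}
    agree += min_hash(A, pi) == min_hash(B, pi)
print(agree / trials)   # ≈ |A ∩ B| / |A ∪ B| = 0.4
```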

  17. Example
  - Consider the following permutations, for m = 5:
    - π_1 = (1 2 3 4 5)
    - π_2 = (5 4 3 2 1)
    - π_3 = (3 4 5 1 2)
  - And the sets: A = {1, 3, 4}, B = {1, 2, 5}
  - The min-hash values are as follows:
    - h_π1(A) = 1, h_π1(B) = 1
    - h_π2(A) = 4, h_π2(B) = 5
    - h_π3(A) = 3, h_π3(B) = 5
  - ⇒ the expectation of the fraction of permutations on which the min-hash values agree is simJ(A, B)
  (The values are reproduced in the snippet below.)
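Reproducing the slide's values: here each permutation lists the rows in their permuted order, and the min-hash of a set is the first listed row that belongs to the set (the helper name is illustrative):

```python
def first_in(order, S):
    """First element of the permuted order that belongs to the set S."""
    return next(e for e in order if e in S)

perms = [(1, 2, 3, 4, 5), (5, 4, 3, 2, 1), (3, 4, 5, 1, 2)]
A, B = {1, 3, 4}, {1, 2, 5}
for order in perms:
    print(first_in(order, A), first_in(order, B))
# prints 1 1, 4 5, 3 5 -- the hashes agree on one of the three permutations
```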
