IEEE International Conference on Data Mining, Miami, Florida, USA, 06-09/12/2009
Finding Associations and Computing Similarity via Biased Pair Sampling
General framework 2/35
• Running example: co-starring actors
• Are there actors who have acted together in most of their movies? The answer will follow...
• (Support = #occurrences).
A. Campagna, R. Pagh - Finding Associations and Computing Similarity via Biased Pair Sampling
General framework 3/35
• Market basket model
General framework 4/35
• Market basket model
• Association rules (binary);
‒ “Do people buying diapers also buy beer?”;
‒ Interesting associations are problem dependent: what about caviar and (expensive) vodka?
• More generally: similarity rules;
‒ This paper: Cosine, Jaccard, All confidence & more;
Previous work 5/35
Two approaches so far. First one:
• Identification of frequent item pairs (i,j), counting the number of co-occurrences;
  (1,5,6,10,21,30,100)
  (2,3,6,9,11,12,15,20,25,30,99)
  (1,7,9,13,14,16,22,26,50,80)
  (4,6,9,16,17,23,24,26,30,41,47,81,84,98)
  C(6,30) = 3
• Suppose the average transaction size is b; when b is large, the cost of this operation is huge (quadratic in b, times the number of transactions m).
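The co-occurrence count on this slide can be reproduced with a few lines of Python. This is a minimal sketch of plain pair counting, not code from the talk; `count_pairs` is a hypothetical helper name, and the fourth transaction assumes that the garbled entry `916` stands for the two items 9 and 16.

```python
from collections import Counter
from itertools import combinations

def count_pairs(transactions):
    """Count co-occurrences C(i, j) of every item pair, keyed with i < j."""
    counts = Counter()
    for t in transactions:
        # A transaction with b distinct items contributes b*(b-1)/2 pairs,
        # which is where the quadratic cost in b comes from.
        for pair in combinations(sorted(set(t)), 2):
            counts[pair] += 1
    return counts

transactions = [
    (1, 5, 6, 10, 21, 30, 100),
    (2, 3, 6, 9, 11, 12, 15, 20, 25, 30, 99),
    (1, 7, 9, 13, 14, 16, 22, 26, 50, 80),
    # Assumption: "916" in the transcript is read as the items 9 and 16.
    (4, 6, 9, 16, 17, 23, 24, 26, 30, 41, 47, 81, 84, 98),
]
pair_counts = count_pairs(transactions)
print(pair_counts[(6, 30)])  # 3
```

Items 6 and 30 appear together in the first, second and fourth transactions, which gives the C(6,30) = 3 shown above.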
Previous work 6/35
Pair counting: a two-phase affair:
1. Identify frequent pairs (i,j) and count their occurrences (expensive!);
2. Find, among the phase 1 output, the pairs with highest affinity (using the number of occurrences);
The focus has usually been on space usage
• Support pruning
Previous work 7/35
Two approaches so far. Second one:
• Computation of a signature for each item and use of the signatures to infer the value of the similarity.
m' ≪ m
Previous work 8/35
Signatures:
1. Good performance; poor scalability to a large number of items;
2. Only some similarities can be computed (Charikar 2002 shows that Dice and Overlap_coef do not admit an LSH);
It is necessary to extend the scope.
Note: also in this case, support pruning can help.
Previous work 9/35
Support Pruning
• Pruning threshold = 50
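As a rough illustration of what support pruning does, the sketch below drops every item whose support (number of transactions containing it) is below a threshold. The function name `support_prune` and the toy data are assumptions made for this example; the slide only states the threshold value.

```python
from collections import Counter

def support_prune(transactions, threshold):
    """Remove items that occur in fewer than `threshold` transactions."""
    # Support of an item = number of transactions that contain it.
    support = Counter(item for t in transactions for item in set(t))
    frequent = {item for item, count in support.items() if count >= threshold}
    return [[item for item in t if item in frequent] for t in transactions]

# Toy data with a small threshold; the slide uses a threshold of 50.
toy = [[1, 2, 3], [1, 2], [1, 4], [1, 2, 5]]
print(support_prune(toy, threshold=2))  # [[1, 2], [1, 2], [1], [1, 2]]
```

Pruning shrinks the transactions before pairs are counted, at the price of losing every pair that involves a rare item, however similar the two items are.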
10/35
BiSam algorithms
BiSam algorithms 11/35
Framework: Focus on CPU!
• Larger internal memory;
• Faster I/O devices (ioDrive, Fusion-io);
• High bandwidth.
BiSam algorithms 12/35
Framework: differences with previous approaches
• Go directly for the result, avoiding the two phases;
• No need for support pruning;
• No signatures.
• A new sampling paradigm;
‒ False positives and negatives;
‒ At the cost of efficiency, the errors can be arbitrarily reduced.
BiSam algorithms 13/35
Sampling: how?
Notation:
‒ T_1,...,T_m is a sequence of transactions;
‒ ∀ i ∈ {1,...,m}, T_i ⊂ [n] = [1,n] ⊂ ℕ;
‒ ∀ i ∈ {1,...,n}, let S_i = { j | i ∈ T_j };
|S_i ∩ S_j| · f(|S_i|, |S_j|) = s(i,j)
Choices of f(|S_i|, |S_j|):
‒ m / (|S_i| · |S_j|)
‒ 1 / (|S_i| · |S_j|)^{1/2}
‒ 1 / max{|S_i|, |S_j|}
‒ 1 / (|S_i| + |S_j|)
‒ 1 / min{|S_i|, |S_j|}
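The notation can be tried out on a toy input. The sketch below builds the sets S_i and evaluates s(i,j) = |S_i ∩ S_j| · f(|S_i|, |S_j|) for the five functions listed. The measure names used as dictionary keys are labels I am assuming for the five formulas (they are not printed next to the formulas here), and `occurrence_sets`, `make_measures` and `similarity` are hypothetical helper names.

```python
import math
from collections import defaultdict

def occurrence_sets(transactions):
    """S_i = set of (1-based) transaction indices j with item i in T_j."""
    S = defaultdict(set)
    for j, t in enumerate(transactions, start=1):
        for i in t:
            S[i].add(j)
    return S

def make_measures(m):
    """The five f(|S_i|, |S_j|) from the slide; key names are assumed labels."""
    return {
        "lift": lambda a, b: m / (a * b),
        "cosine": lambda a, b: 1 / math.sqrt(a * b),
        "all_confidence": lambda a, b: 1 / max(a, b),
        "dice": lambda a, b: 1 / (a + b),
        "overlap_coef": lambda a, b: 1 / min(a, b),
    }

def similarity(S, f, i, j):
    """s(i, j) = |S_i ∩ S_j| * f(|S_i|, |S_j|)."""
    return len(S[i] & S[j]) * f(len(S[i]), len(S[j]))

toy = [{1, 2}, {1, 2, 3}, {2, 3}, {1, 2}]
S = occurrence_sets(toy)
measures = make_measures(m=len(toy))
for name, f in measures.items():
    print(name, similarity(S, f, 1, 2))
```

In the toy input, item 1 occurs in three transactions and item 2 in all four, always together with item 1, so for example the all-confidence value is 3/4.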
BiSam algorithms 14/35
Sampling: how?
• Characteristics of f():
‒ |S_i ∩ S_j| · f(|S_i|, |S_j|) = s(i,j);
‒ f(|S_i|, |S_j|) can be used as (almost) the sampling probability of the pair (i,j);
BiSam algorithms 15/35
Sampling: how?
• We can use such a function to sample pairs in transactions:
‒ For each transaction we roll a die with values in [0,1); call this value r;
‒ If Pr[i,j] > r, the pair (i,j) goes into the sample; since Pr[i,j] is derived from f(), this depends on the pair's similarity;
• Given a user-defined threshold Δ, we want to find those pairs satisfying s(i,j) ≥ Δ;
• The sampling probability is scaled by a factor µ: Pr[i,j] = f(|S_i|, |S_j|) · µ / Δ;
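The sampling rule above can be written down directly. This is a naive sketch that examines every pair of every transaction; it is not the authors' code. `naive_biased_sample` and `cosine_f` are hypothetical names, the cosine choice of f is just one option, and a probability above 1 simply means the pair is always taken in this simplified version.

```python
import math
import random
from collections import Counter
from itertools import combinations

def cosine_f(size_i, size_j):
    """f for the cosine measure: 1 / sqrt(|S_i| * |S_j|)."""
    return 1.0 / math.sqrt(size_i * size_j)

def naive_biased_sample(transactions, f, delta, mu, rng=random):
    """Return the multiset M of sampled pairs, scanning all pairs."""
    # sizes[i] = |S_i|, the number of transactions containing item i.
    sizes = Counter(i for t in transactions for i in set(t))
    M = Counter()
    for t in transactions:
        r = rng.random()  # one roll in [0, 1) per transaction
        for i, j in combinations(sorted(set(t)), 2):
            # Pr[i, j] = f(|S_i|, |S_j|) * mu / delta
            if f(sizes[i], sizes[j]) * mu / delta > r:
                M[(i, j)] += 1
    return M

toy = [{1, 2}, {1, 2, 3}, {2, 3}, {1, 2}]
M = naive_biased_sample(toy, cosine_f, delta=0.8, mu=2, rng=random.Random(0))
print(M)
```

Because every pair in every transaction is inspected, this version costs time quadratic in the transaction size, which is the problem the later slides address.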
BiSam algorithms 16/35
Sampling: how?
• In this way pairs would be sampled a number of times proportional to their similarity!
• M = multiset of samples; M(i,j) := # of times the pair (i,j) is sampled
• s(i,j) ≥ Δ ⇒ M(i,j) ≥ µ (in expectation)
• s(i,j) < Δ ⇒ M(i,j) < µ (in expectation)
• Caveat: how much time would this take?
‒ Ω(mb²)!!!
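The link between similarity and sample counts follows from linearity of expectation. A short derivation, assuming that every pair probability Pr[i,j] is at most 1 and that the pair gets one chance in each transaction containing both items:

```latex
\mathbb{E}[M(i,j)]
  = \sum_{t \,:\, \{i,j\} \subseteq T_t} \Pr[i,j]
  = |S_i \cap S_j| \cdot f(|S_i|,|S_j|) \cdot \frac{\mu}{\Delta}
  = s(i,j) \cdot \frac{\mu}{\Delta}
```

So the expected count is at least µ exactly when s(i,j) ≥ Δ. The actual count fluctuates around this value, which is where false positives and false negatives come from.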
BiSam algorithms 17/35
A closer look at f()
|S_i ∩ S_j| · f(|S_i|, |S_j|) = s(i,j)
Choices of f(|S_i|, |S_j|):
‒ m / (|S_i| · |S_j|)
‒ 1 / (|S_i| · |S_j|)^{1/2}
‒ 1 / max{|S_i|, |S_j|}
‒ 1 / (|S_i| + |S_j|)
‒ 1 / min{|S_i|, |S_j|}
Non-increasing in both parameters! If the items of each transaction are sorted according to the cardinalities of S_i, only a very limited number of pairs is scanned within a transaction.
BiSam algorithms 18/35
A new sampling technique
• Characteristics of f():
‒ f : ℕ × ℕ → ℝ⁺ is non-increasing in both parameters;
‒ |S_i ∩ S_j| · f(|S_i|, |S_j|) = s(i,j)
‒ computable in constant time;
• Given a user-defined threshold Δ, we want to find those pairs satisfying s(i,j) ≥ Δ;
• The sampling probability is scaled such that for all pairs with s(i,j) ≥ Δ, we expect to see µ occurrences of (i,j) in the sample.
BiSam algorithms 19/35
A new sampling technique
• T_t[j] represents the jth element in transaction t (both before and after sorting);
• ItemCount() returns a function computing the number of occurrences of each item;
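The pseudocode shown on this slide is not part of the transcript. The following is a rough Python sketch of the idea as described on the surrounding slides: count item occurrences, sort each transaction by those counts, and stop scanning as soon as the pair probability drops to the rolled value r or below. `bisam_sample_sketch` and `cosine_f` are hypothetical names, and the early-exit logic relies only on f being non-increasing in both parameters; the authors' actual algorithm may differ in its details.

```python
import math
import random
from collections import Counter

def cosine_f(size_i, size_j):
    """f for the cosine measure: 1 / sqrt(|S_i| * |S_j|)."""
    return 1.0 / math.sqrt(size_i * size_j)

def bisam_sample_sketch(transactions, f, delta, mu, rng=random):
    """Biased pair sampling with early termination on sorted transactions."""
    # Item-count step: sizes[i] = |S_i|.
    sizes = Counter(i for t in transactions for i in set(t))
    M = Counter()
    for t in transactions:
        # Sort items by increasing |S_i| (ties broken by item id).
        items = sorted(set(t), key=lambda i: (sizes[i], i))
        r = rng.random()  # one roll in [0, 1) per transaction
        for a in range(len(items) - 1):
            sampled_any = False
            for b in range(a + 1, len(items)):
                p = f(sizes[items[a]], sizes[items[b]]) * mu / delta
                if p <= r:
                    break  # later partners have larger |S|, so p only shrinks
                M[tuple(sorted((items[a], items[b])))] += 1
                sampled_any = True
            if not sampled_any:
                break  # every remaining pair has both sizes at least as large
    return M

toy = [{1, 2}, {1, 2, 3}, {2, 3}, {1, 2}]
M = bisam_sample_sketch(toy, cosine_f, delta=0.8, mu=2, rng=random.Random(0))
print(M)
```

For a given sequence of rolls this selects the same pairs as a scan over all pairs would, but it only touches the pairs that are sampled plus at most one rejected pair per scanned item.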
BiSam algorithms 19/35
A new sampling technique
• T_t[j] represents the jth element in transaction t (both before and after sorting);
• Sampling phase;
Δ = 0.6, µ = 10, r = 0.4
f(33, 36, Δ) · µ = 10 / [0.6 · (33 · 36)^{1/2}] = 0.48
Since 0.48 > r, the pair is sampled.
BiSam algorithms 19/35
A new sampling technique
• T_t[j] represents the jth element in transaction t (both before and after sorting);
• Sampling phase;
Δ = 0.6, µ = 10, r = 0.5
f(33, 36, Δ) · µ = 10 / [0.6 · (33 · 36)^{1/2}] = 0.48
Since 0.48 < r, the pair is not sampled.
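The arithmetic of this example is easy to check. The snippet below recomputes the scaled probability for the cosine measure with item counts 33 and 36 and compares it against the two rolls used in the example, r = 0.4 and r = 0.5; `sampling_probability` is a hypothetical helper name.

```python
import math

def sampling_probability(size_i, size_j, delta, mu):
    """Cosine-based pair probability: mu / (delta * sqrt(|S_i| * |S_j|))."""
    return mu / (delta * math.sqrt(size_i * size_j))

p = sampling_probability(33, 36, delta=0.6, mu=10)
print(round(p, 2))  # 0.48
for r in (0.4, 0.5):
    # The pair enters the sample only when the probability exceeds the roll.
    print(r, p > r)
```

With r = 0.4 the comparison holds and the pair is sampled; with r = 0.5 it does not.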
BiSam algorithms 20/35
A new sampling technique
1. r = 0.8
2. r = 0.2  M = M ∪ {(OH,SL)}
3. r = 0.7
4. r = 0.3
5. r = 0.1  M = M ∪ {(OH,SL)}
6. r = 0.5
7. r = 0.5
8. r = 0.6
9. r = 0.2  M = M ∪ {(OH,SL)}
...
Δ = 0.6, µ = 15
Cosine: f() = 1 / (115 · 107)^{1/2} = 0.009
Pr = 0.009 · 15 / 0.6 = 0.23
cos(OH,SL) = 0.9285
We expect to see |S_OH ∩ S_SL| · 0.23 > µ samples.
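The numbers of this example can be recomputed from the item counts 115 and 107. In the sketch below the size of the intersection is not taken from the slide; it is derived from the stated cosine value 0.9285 and should be read as an inferred quantity.

```python
import math

delta, mu = 0.6, 15
f = 1 / math.sqrt(115 * 107)   # cosine choice of f for the pair (OH, SL)
pr = f * mu / delta            # scaled sampling probability

rolls = [0.8, 0.2, 0.7, 0.3, 0.1, 0.5, 0.5, 0.6, 0.2]
# Transactions (1-based) in which the pair is added to M.
sampled = [k for k, r in enumerate(rolls, start=1) if pr > r]
print(round(f, 3), round(pr, 2), sampled)  # 0.009 0.23 [2, 5, 9]

# Inferred, not stated on the slide: |S_OH ∩ S_SL| = cos(OH, SL) / f.
intersection = round(0.9285 / f)
expected_samples = intersection * pr
print(intersection, expected_samples > mu)
```

The pair is added at rolls 2, 5 and 9, as in the list above, and the expected total number of samples exceeds µ, consistent with cos(OH,SL) being well above Δ.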
BiSam algorithms 21/35
A new sampling technique
• T_t[j] represents the jth element in transaction t (both before and after sorting);
• Output phase;