
IEEE International Conference on Data Mining, Miami, Florida, USA, 06-09/12/2009
Finding Associations and Computing Similarity via Biased Pair Sampling

General framework
• Running example: co-starring actors
• Are there actors who have acted together in most of their movies? The answer will follow...
• (Support = #occurrences).
A. Campagna, R. Pagh - Finding Associations and Computing Similarity via Biased Pair Sampling

General framework
• Market basket model

General framework
• Market basket model
• Association rules (binary);
‒ “Do people buying diapers also buy beer?”;
‒ Interesting associations are problem dependent: what about caviar and (expensive) vodka?
• More generally: similarity rules;
‒ This paper: Cosine, Jaccard, All confidence & more;

Previous work
Two approaches so far. First one:
• Identification of frequent item pairs (i,j), counting the number of co-occurrences;
(1,5,6,10,21,30,100)
(2,3,6,9,11,12,15,20,25,30,99)
(1,7,9,13,14,16,22,26,50,80)
(4,6,9,16,17,23,24,26,30,41,47,81,84,98)
C(6,30) = 3
• Suppose the average transaction size is b; when b is large, the cost of this operation is huge (quadratic in b, times the number of transactions m).
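The pair-counting approach above can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `count_pairs` is hypothetical, and the fourth transaction assumes that "916" in the slide is a typo for "9,16".

```python
from collections import Counter
from itertools import combinations


def count_pairs(transactions):
    """Count co-occurrences of every item pair (hypothetical helper).

    A transaction of size b contributes b*(b-1)/2 pairs, so the total
    cost grows as m * b^2 for m transactions of average size b.
    """
    counts = Counter()
    for t in transactions:
        for i, j in combinations(sorted(set(t)), 2):
            counts[(i, j)] += 1
    return counts


# The four transactions from the slide (assuming "916" means "9,16").
transactions = [
    (1, 5, 6, 10, 21, 30, 100),
    (2, 3, 6, 9, 11, 12, 15, 20, 25, 30, 99),
    (1, 7, 9, 13, 14, 16, 22, 26, 50, 80),
    (4, 6, 9, 16, 17, 23, 24, 26, 30, 41, 47, 81, 84, 98),
]
pair_counts = count_pairs(transactions)
print(pair_counts[(6, 30)])  # 3
```

Items 6 and 30 appear together in the first, second and fourth transactions, which gives the count of 3 shown on the slide.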

Previous work
Pair counting: a two-phase affair:
1. Identify frequent pairs (i,j) and count their occurrences (expensive!);
2. Find, among the phase 1 output, the pairs with highest affinity (using the number of occurrences);
The focus has usually been on space usage: support pruning.

Previous work
Two approaches so far. Second one:
• Computation of a signature for each item, and use of the signatures to infer the value of the similarity.
m’ << m

Previous work
Signatures:
1. Good performance; poor scalability to a large number of items;
2. Only some similarities can be computed (Charikar 2002 shows that Dice and Overlap_coef do not admit an LSH);
It is necessary to extend the scope.
Note: also in this case, support pruning can help.

Previous work
Support Pruning
• Pruning threshold = 50
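As a rough sketch of what support pruning does, assuming it is applied to single items (the slide only gives the threshold, so the helper name `support_prune` and the item-level interpretation are assumptions):

```python
from collections import Counter


def support_prune(transactions, threshold):
    """Drop every item whose support (#occurrences) is below threshold.

    Assumption: pruning is done per item, before any pair is counted.
    """
    support = Counter(i for t in transactions for i in set(t))
    return [[i for i in t if support[i] >= threshold] for t in transactions]


# Tiny example with threshold 2 (the slide uses 50 on a real data set).
pruned = support_prune([[1, 2], [1, 3], [1, 2]], threshold=2)
print(pruned)  # [[1, 2], [1], [1, 2]]
```

Item 3 occurs only once, so it disappears from every transaction, and with it every pair it could have formed.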

BiSam algorithms

BiSam algorithms
Framework: Focus on CPU!
• Larger internal memory;
• Faster I/O devices (IODrive – Fusion-io);
• High bandwidth.

BiSam algorithms
Framework: differences with previous approaches
• Go directly for the result, avoiding the two phases;
• No need for support pruning;
• No signatures.
• A new sampling paradigm;
‒ False positives and negatives;
‒ At the cost of efficiency, the errors can be made arbitrarily small.

BiSam algorithms
Sampling: how?
Notation:
‒ $T_1,\dots,T_m$ is a sequence of transactions;
‒ $\forall i \in \{1,\dots,m\}$, $T_i \subset [n] = [1,n] \subset \mathbb{N}$;
‒ $\forall i \in \{1,\dots,n\}$, let $S_i = \{\, j \mid i \in T_j \,\}$;
‒ $|S_i \cap S_j| \cdot f(|S_i|,|S_j|) = s(i,j)$
$f(|S_i|,|S_j|)$:
$m / (|S_i| \cdot |S_j|)$
$1 / (|S_i| \cdot |S_j|)^{1/2}$
$1 / \max\{|S_i|,|S_j|\}$
$1 / (|S_i| + |S_j|)$
$1 / \min\{|S_i|,|S_j|\}$
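The notation above can be made concrete with a short sketch. The slide lists only the five formulas for f; the measure names used as dictionary keys below are assumptions based on the standard definitions (only cosine is confirmed by a later slide), and the helper names are hypothetical.

```python
import math


def make_measures(m):
    """f(|S_i|, |S_j|) for each measure; m is the number of transactions.

    Key names are assumed, the formulas are the ones on the slide.
    """
    return {
        "lift": lambda a, b: m / (a * b),
        "cosine": lambda a, b: 1 / math.sqrt(a * b),
        "all_confidence": lambda a, b: 1 / max(a, b),
        "dice": lambda a, b: 1 / (a + b),  # as on the slide, no factor 2
        "overlap_coef": lambda a, b: 1 / min(a, b),
    }


def occurrence_sets(transactions):
    """S_i = set of transaction indices j such that item i is in T_j."""
    S = {}
    for j, t in enumerate(transactions):
        for i in t:
            S.setdefault(i, set()).add(j)
    return S


def similarity(S, i, j, f):
    """s(i,j) = |S_i ∩ S_j| * f(|S_i|, |S_j|)."""
    return len(S[i] & S[j]) * f(len(S[i]), len(S[j]))


transactions = [{1, 2}, {1, 2, 3}, {1, 3}, {2}]
S = occurrence_sets(transactions)
f = make_measures(len(transactions))
print(round(similarity(S, 1, 2, f["cosine"]), 3))  # 0.667
```

Here items 1 and 2 each occur in three transactions and share two of them, so the cosine similarity is 2 / (3·3)^{1/2} = 2/3.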

BiSam algorithms
Sampling: how?
• Characteristics of $f()$:
‒ $|S_i \cap S_j| \cdot f(|S_i|,|S_j|) = s(i,j)$;
‒ $f(|S_i|,|S_j|)$ can be used as (almost) the sampling probability of the pair $(i,j)$;

BiSam algorithms
Sampling: how?
• We can use such a function to sample pairs in transactions:
‒ We roll a $[0,1)$ die for each transaction; call this value $r$;
‒ If $\Pr[i,j] > r$, the pair goes into the sample; since $\Pr[i,j]$ is derived from $f()$, this depends on the pair similarity;
• Given a user-defined threshold $\Delta$, we want to find those pairs satisfying $s(i,j) \geq \Delta$;
• The sampling probability is scaled by a factor $\mu$: $\Pr[i,j] = f(|S_i|,|S_j|)\,\mu/\Delta$;

BiSam algorithms
Sampling: how?
• In this way pairs would be sampled a number of times proportional to their similarity!
• $M$ = multiset of samples; $M(i,j)$ := number of times the pair $(i,j)$ is sampled
• In expectation:
‒ $s(i,j) \geq \Delta \Rightarrow M(i,j) \geq \mu$
‒ $s(i,j) < \Delta \Rightarrow M(i,j) < \mu$
• Caveat: how much time would this take?
‒ $\Omega(mb^2)$!!!
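A naive version of this sampling scheme, the one that costs Ω(mb²) because it looks at every pair of every transaction, might look like the sketch below. The function names, the choice of cosine for f, and the fixed random seed are assumptions made for illustration only.

```python
import math
import random
from collections import Counter
from itertools import combinations


def cosine_f(a, b):
    """f(|S_i|, |S_j|) for the cosine measure."""
    return 1 / math.sqrt(a * b)


def naive_bisam(transactions, f, delta, mu, rng=None):
    """Sample pairs with probability f * mu / delta; report M(i,j) >= mu."""
    if rng is None:
        rng = random.Random(0)  # fixed seed, only for reproducibility
    # size[i] = |S_i|, the number of transactions containing item i.
    size = Counter(i for t in transactions for i in set(t))
    M = Counter()  # multiset of sampled pairs
    for t in transactions:
        r = rng.random()  # one roll in [0, 1) per transaction
        for i, j in combinations(sorted(set(t)), 2):
            if f(size[i], size[j]) * mu / delta > r:
                M[(i, j)] += 1
    reported = {pair for pair, count in M.items() if count >= mu}
    return reported, M


transactions = [(1, 2)] * 4 + [(1, 2, 4)]
reported, M = naive_bisam(transactions, cosine_f, delta=0.5, mu=3)
print(reported)  # {(1, 2)}
```

In this toy input every sampling probability exceeds 1, so the outcome does not depend on the rolls: the pair (1,2) is sampled five times and reported, while (1,4) and (2,4) are sampled once each and stay below µ.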

BiSam algorithms
A closer look at $f()$
$|S_i \cap S_j| \cdot f(|S_i|,|S_j|) = s(i,j)$
$f(|S_i|,|S_j|)$:
$m / (|S_i| \cdot |S_j|)$
$1 / (|S_i| \cdot |S_j|)^{1/2}$
$1 / \max\{|S_i|,|S_j|\}$
$1 / (|S_i| + |S_j|)$
$1 / \min\{|S_i|,|S_j|\}$
Non-increasing in both parameters!
If transactions are sorted according to the cardinalities of $S_i$, only a very limited number of pairs is scanned within a transaction.
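One way to exploit this monotonicity is sketched below: items in a transaction are ordered by increasing |S_i|, and each scan stops at the first pair that fails the test, because every later pair has a smaller or equal f. This is an illustration of the idea under that assumption, not necessarily the authors' exact procedure; `sample_transaction` and `cosine_f` are hypothetical names.

```python
import math


def cosine_f(a, b):
    """f(|S_i|, |S_j|) for the cosine measure."""
    return 1 / math.sqrt(a * b)


def sample_transaction(t, size, f, delta, mu, r):
    """Return the pairs of transaction t with f * mu / delta > r.

    size[i] is |S_i|. Requires f to be non-increasing in both arguments.
    """
    items = sorted(set(t), key=lambda i: size[i])  # ascending |S_i|
    sampled = []
    for a in range(len(items) - 1):
        i = items[a]
        found = False
        for b in range(a + 1, len(items)):
            j = items[b]
            if f(size[i], size[j]) * mu / delta <= r:
                break  # later j have larger |S_j|, so they fail as well
            sampled.append((i, j))
            found = True
        if not found:
            break  # the best partner of i failed; larger i fail as well
    return sampled


size = {1: 2, 2: 3, 3: 50, 4: 60}
print(sample_transaction((1, 2, 3, 4), size, cosine_f, delta=0.5, mu=2, r=0.5))
# [(1, 2)]
```

With r = 0.5 only three of the six pairs are examined before both loops stop, which is the saving the slide refers to.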

BiSam algorithms
A new sampling technique
• Characteristics of $f()$:
‒ $f : \mathbb{N} \times \mathbb{N} \to \mathbb{R}^+$ is non-increasing in both parameters;
‒ $|S_i \cap S_j| \cdot f(|S_i|,|S_j|) = s(i,j)$;
‒ computable in constant time;
• Given a user-defined threshold $\Delta$, we want to find those pairs satisfying $s(i,j) \geq \Delta$;
• The sampling probability is scaled such that for all pairs with $s(i,j) \geq \Delta$, we expect to see at least $\mu$ occurrences of $(i,j)$ in the sample.

BiSam algorithms
A new sampling technique
• $T_t[j]$ denotes the $j$-th element of transaction $t$ (both before and after sorting);
• ItemCount() returns a function computing the number of occurrences of each item;

BiSam algorithms
A new sampling technique
• Sampling phase;
$\Delta = 0.6$, $\mu = 10$, $r = 0.4$
$f(33,36,\Delta)\,\mu = 10 / [0.6 \cdot (33 \cdot 36)^{1/2}] \approx 0.48$
Since $0.48 > r$, the pair is sampled.

BiSam algorithms
A new sampling technique
• Sampling phase;
$\Delta = 0.6$, $\mu = 10$, $r = 0.5$
$f(33,36,\Delta)\,\mu = 10 / [0.6 \cdot (33 \cdot 36)^{1/2}] \approx 0.48$
Since $0.48 < r$, the pair is not sampled.

BiSam algorithms
A new sampling technique
$\Delta = 0.6$, $\mu = 15$
Cosine $\Rightarrow f() = 1 / (115 \cdot 107)^{1/2} \approx 0.009$
$\Pr = 0.009 \cdot 15 / 0.6 \approx 0.23$
1. $r = 0.8$
2. $r = 0.2$: $M = M \cup \{(\mathrm{OH},\mathrm{SL})\}$
3. $r = 0.7$
4. $r = 0.3$
5. $r = 0.1$: $M = M \cup \{(\mathrm{OH},\mathrm{SL})\}$
6. $r = 0.5$
7. $r = 0.5$
8. $r = 0.6$
9. $r = 0.2$: $M = M \cup \{(\mathrm{OH},\mathrm{SL})\}$
...
We expect to see $|S_{\mathrm{OH}} \cap S_{\mathrm{SL}}| \cdot 0.23 > \mu$ samples, since $\cos(\mathrm{OH},\mathrm{SL}) = 0.9285 \geq \Delta$.
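The arithmetic of this example can be checked with a few lines. The variable names are chosen here for illustration; the numbers (115, 107, the nine rolls, 0.9285) are the ones on the slide.

```python
import math

delta, mu = 0.6, 15
size_oh, size_sl = 115, 107  # |S_OH| and |S_SL| from the slide

f = 1 / math.sqrt(size_oh * size_sl)  # cosine case
p = f * mu / delta                    # sampling probability of (OH, SL)

# The nine rolls listed on the slide, one per transaction.
rolls = [0.8, 0.2, 0.7, 0.3, 0.1, 0.5, 0.5, 0.6, 0.2]
hits = [k for k, r in enumerate(rolls, start=1) if p > r]

# Expected number of samples: |S_OH ∩ S_SL| * p = cos(OH, SL) * mu / delta.
expected = 0.9285 * mu / delta

print(round(f, 3), round(p, 2), hits)  # 0.009 0.23 [2, 5, 9]
print(expected > mu)                   # True
```

The pair is added to M on rolls 2, 5 and 9, and the expected total of about 23 samples exceeds µ = 15, so the pair would be reported.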

BiSam algorithms
A new sampling technique
• Output phase;
