Altobelli B. Mantuan and Leandro A. F. Fernandes
{amantuan, laffernandes}@ic.uff.br
prograf.ic.uff.br
Acknowledgments: Graphics Processing Research Laboratory at IC-UFF, Brazil
Itemset mining
• Frequent pattern / itemset: a set of one or more items that occurs frequently in a dataset
  ▪ k-itemset: X = {x_1, …, x_k}
• Finding all the itemsets is a combinatorial problem

Items bought (example transactions):
  Beer, Nuts, Diaper
  Beer, Coffee, Diaper
  Beer, Diaper, Eggs
  Nuts, Eggs, Milk
  Nuts, Coffee, Diaper, Eggs, Beer

Example: itemset = {Beer, Diaper}, support = 4 (computed in the sketch below)
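To make the support definition concrete, here is a minimal Python sketch (not from the paper) that counts the support of {Beer, Diaper} over the example transactions above:

```python
# Minimal sketch: support of an itemset = number of transactions containing all of its items.
transactions = [
    {"Beer", "Nuts", "Diaper"},
    {"Beer", "Coffee", "Diaper"},
    {"Beer", "Diaper", "Eggs"},
    {"Nuts", "Eggs", "Milk"},
    {"Nuts", "Coffee", "Diaper", "Eggs", "Beer"},
]

def support(itemset, transactions):
    """Count the transactions that contain every item of the itemset."""
    return sum(1 for t in transactions if itemset <= t)

print(support({"Beer", "Diaper"}, transactions))  # -> 4
```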
Itemset mining procedure
• Existing solutions for generating frequent itemsets
  ▪ Threshold parameters
    o It is difficult to perceive the influence of the parameter
  ▪ Amount of itemsets retrieved
  ▪ Search space
    o A large area of the search space is unnecessarily explored
• It would be interesting to use other information to reduce the search space for generating itemsets
Flowchart – SCIM
Contributions
• The spatial contextualization of items for mining interesting itemsets in transactional databases
• A procedure for clustering items in the Solution Space of the Dual Scaling mapping
• A procedure for generating closed itemsets based on spatial contextualization
Summary
• Dual Scaling
• Overlapping clustering procedure
• SC-Close procedure
• Results
• Conclusion
Dual Scaling
Example (from Nishisato): a subjects × stimuli indicator table with 15 subjects and 18 stimulus categories (low/mid/high anxiety; low/average/high blood pressure; rare/occasional/frequent migraines; young/middle age/old; light/medium/heavy; short/average/tall), where each cell is 1 if the subject presents that category.
• The greater the co-occurrence of a set of stimuli in the database, the smaller the distance between these stimuli
• The lower the frequency of a stimulus, the further from the origin the stimulus is positioned
Nishisato, Shizuhiko (1994). Elements of Dual Scaling: An Introduction to Practical Data Analysis.
Dual Scaling – Working example
• Subject = transaction; stimulus = item
• Transactions are not mapped to the solution space
• The ψ-square distance is used for pairs of items (an illustrative computation is sketched below)
• The reference distance is the distance of each item to the origin of the space

Transaction database (15 transactions over items 1–6; 1 = item present):
  1:  1 0 0 1 0 0
  2:  1 0 0 1 0 0
  3:  0 0 1 1 0 0
  4:  0 0 1 0 1 0
  5:  0 0 1 0 1 0
  6:  0 0 1 0 1 0
  7:  0 0 1 0 1 0
  8:  0 0 1 0 1 0
  9:  0 0 1 0 0 1
  10: 0 1 0 0 1 0
  11: 0 1 0 0 0 1
  12: 0 1 0 0 0 1
  13: 0 1 0 0 0 1
  14: 0 1 0 1 0 0
  15: 0 1 0 0 0 1

ψ-square distance matrix for pairs of items:
  item      1        2        3        4        5        6
  1        0      16.6052  20.5072   3.2338  21.9519  18.4428
  2       16.6052   0      10.9090  14.0740  10.4710   1.7637
  3       20.5072  10.9090   0      17.5109   1.8114   9.7269
  4        3.2338  14.0740  17.5109   0      21.4269  17.5219
  5       21.9519  10.4710   1.8114  21.4269   0      10.8555
  6       18.4428   1.7637   9.7269  17.5219  10.8555   0

Reference distance (distance of each item to the origin of the space):
  item 1: 9.7467   item 2: 3.2046   item 3: 3.7234   item 4: 8.2790   item 5: 4.6087   item 6: 3.7871
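The slides do not spell out the ψ-square distance formula. As a rough illustration of how items can be mapped into a dual-scaling-style solution space and compared by distance, the sketch below applies standard correspondence-analysis coordinates to the 15 × 6 example matrix; this is an assumption on my part, and the resulting numbers are not expected to reproduce the ψ-square values above exactly.

```python
import numpy as np

# Illustrative sketch: correspondence-analysis (dual-scaling-style) coordinates for the
# items of the 15x6 example transaction matrix, followed by pairwise item distances and
# the distance of each item to the origin. The paper's psi-square distance follows
# Nishisato's dual scaling and may be normalized differently.
F = np.array([
    [1,0,0,1,0,0],[1,0,0,1,0,0],[0,0,1,1,0,0],[0,0,1,0,1,0],[0,0,1,0,1,0],
    [0,0,1,0,1,0],[0,0,1,0,1,0],[0,0,1,0,1,0],[0,0,1,0,0,1],[0,1,0,0,1,0],
    [0,1,0,0,0,1],[0,1,0,0,0,1],[0,1,0,0,0,1],[0,1,0,1,0,0],[0,1,0,0,0,1],
], dtype=float)

P = F / F.sum()                        # correspondence matrix
r, c = P.sum(axis=1), P.sum(axis=0)    # row (transaction) and column (item) masses
S = (P - np.outer(r, c)) / np.sqrt(np.outer(r, c))  # standardized residuals
U, s, Vt = np.linalg.svd(S, full_matrices=False)

# Principal coordinates of the items (columns) in the solution space.
item_coords = (Vt.T * s) / np.sqrt(c)[:, None]

# Pairwise distances between items and distance of each item to the origin.
diff = item_coords[:, None, :] - item_coords[None, :, :]
pair_dist = np.sqrt((diff ** 2).sum(axis=-1))
origin_dist = np.sqrt((item_coords ** 2).sum(axis=1))
print(np.round(pair_dist, 4))
print(np.round(origin_dist, 4))
```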
Overlapping clustering procedure
ψ-square distance matrix:
         X1       X2       X3       X4       X5       X6
  X1     0      16.6052  20.5072   3.2338  21.9519  18.4428
  X2    16.6052   0      10.9090  14.0740  10.4710   1.7637
  X3    20.5072  10.9090   0      17.5109   1.8114   9.7269
  X4     3.2338  14.0740  17.5109   0      21.4269  17.5219
  X5    21.9519  10.4710   1.8114  21.4269   0      10.8555
  X6    18.4428   1.7637   9.7269  17.5219  10.8555   0

• The ψ-square distance is computed between pairs of item points
• Each item point y_j is the center of a cluster D_j
• The parameter es is used to define ds_j (an illustrative sketch follows)
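A minimal sketch of the overlapping clustering idea, assuming each item is the center of its own cluster and that ds_j acts as a radius derived from es as a fraction of the mean distance from the center to the other items. This radius rule and the value es = 0.4 are assumptions for illustration only; the paper's definition of ds_j may differ.

```python
import numpy as np

# Hypothetical sketch: every item point y_j is the center of a cluster D_j, and D_j
# collects all items whose psi-square distance to y_j does not exceed a radius ds_j.
dist = np.array([
    [ 0.0000, 16.6052, 20.5072,  3.2338, 21.9519, 18.4428],
    [16.6052,  0.0000, 10.9090, 14.0740, 10.4710,  1.7637],
    [20.5072, 10.9090,  0.0000, 17.5109,  1.8114,  9.7269],
    [ 3.2338, 14.0740, 17.5109,  0.0000, 21.4269, 17.5219],
    [21.9519, 10.4710,  1.8114, 21.4269,  0.0000, 10.8555],
    [18.4428,  1.7637,  9.7269, 17.5219, 10.8555,  0.0000],
])

def overlapping_clusters(dist, es):
    n = dist.shape[0]
    clusters = []
    for j in range(n):
        ds_j = es * dist[j, np.arange(n) != j].mean()      # assumed radius rule
        D_j = {i for i in range(n) if dist[j, i] <= ds_j}  # center j is always included
        clusters.append(D_j)
    return clusters

for j, D_j in enumerate(overlapping_clusters(dist, es=0.4), start=1):
    members = ", ".join(f"X{i + 1}" for i in sorted(D_j))
    print(f"D_{j} = {{{members}}}")
```

Because every item is a center and radii are local, the same item can appear in several clusters, which is what produces the overlap.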
SC-Close
• SC-Close uses the compact vertical structure FP-tree
  ▪ Clusters reduce the search space on the FP-tree
• SC-Close uses a CFI-tree to report only closed itemsets
• Itemset formation rules (sketched below)
  1. The itemset must belong to the cluster coverage
  2. At least one item from the itemset must belong to the minimum coverage
  3. Condition 2 is ignored if the minimum coverage only includes the center point
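A minimal sketch of the itemset formation rules, assuming cluster_coverage, minimum_coverage, and center are already available for a cluster D_j; how these sets are built is not detailed on the slide, so the example sets below are purely illustrative.

```python
# Hypothetical sketch of the three itemset formation rules listed above.
def satisfies_formation_rules(itemset, cluster_coverage, minimum_coverage, center):
    # Rule 1: every item of the candidate itemset must lie in the cluster coverage.
    if not itemset <= cluster_coverage:
        return False
    # Rule 3: if the minimum coverage is just the center point, rule 2 is skipped.
    if minimum_coverage == {center}:
        return True
    # Rule 2: at least one item of the itemset must lie in the minimum coverage.
    return bool(itemset & minimum_coverage)

# Toy usage with the example items X1..X6 (coverage sets chosen only for illustration).
cluster_coverage = {"X1", "X3", "X4"}
minimum_coverage = {"X1", "X4"}
print(satisfies_formation_rules({"X1", "X3"}, cluster_coverage, minimum_coverage, "X1"))  # True
print(satisfies_formation_rules({"X3"}, cluster_coverage, minimum_coverage, "X1"))        # False
```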
Results
• Using 11 databases from LUCS-KDD
• Compared algorithms
  ▪ SCIM, FPClose, Slim, and TopPI
• Metrics
  ▪ Mean All-Confidence, MDL, and processing time
Mean All-Confidence
MDL and processing time
• MDL metric: the relative total compressed size (L%) achieved by the set of patterns retrieved by each algorithm
• Processing time: Slim and TopPI ran in a single thread

Letter recognition (n = 20,000; q = 16; m = 80)
  ▪ Slim: 1,231 patterns, 34.30 s, L% = 31.89
  ▪ TopPI (k = 7): 447 patterns, 0.62 s, L% = 73.58
  ▪ SCIM (dr = 0.03): 89 patterns, 0.48 s, L% = 75.28
mFeat (n = 2,000; q = 240; m = 1,648)
  ▪ Slim: 5,121 patterns, 10,053.93 s, L% = 35.82
  ▪ TopPI (k = 3): 3,567 patterns, 1.35 s, L% = 72.89
  ▪ SCIM (dr = 0.00): 11,943 patterns, 3,351.34 s, L% = 56.00
Wine (n = 178; q = 13; m = 65)
  ▪ FPClose: 13,169 patterns, 0.42 s, L% = 112.52
  ▪ Slim: 55 patterns, 0.22 s, L% = 76.79
  ▪ TopPI (k = 2): 68 patterns, 0.25 s, L% = 90.22
  ▪ SCIM (dr = 0.03): 60 patterns, 0.04 s, L% = 88.92
Page blocks (n = 5,473; q = 10; m = 41)
  ▪ FPClose: 714 patterns, 0.20 s, L% = 3.90
  ▪ Slim: 40 patterns, 0.19 s, L% = 3.84
  ▪ TopPI (k = 1): 31 patterns, 0.29 s, L% = 83.09
  ▪ SCIM (dr = 0.00): 54 patterns, 0.13 s, L% = 38.23
Pen digits (n = 10,992; q = 16; m = 79)
  ▪ Slim: 1,220 patterns, 45.23 s, L% = 38.67
  ▪ TopPI (k = 7): 401 patterns, 0.55 s, L% = 76.14
  ▪ SCIM (dr = 0.04): 148 patterns, 0.37 s, L% = 76.51
Waveform (n = 5,000; q = 21; m = 98)
  ▪ Slim: 717 patterns, 8.49 s, L% = 38.84
  ▪ TopPI (k = 9): 734 patterns, 0.44 s, L% = 77.53
  ▪ SCIM (dr = 0.00): 724 patterns, 0.39 s, L% = 72.35
Ecoli (n = 336; q = 7; m = 26)
  ▪ FPClose: 530 patterns, 0.12 s, L% = 53.59
  ▪ Slim: 25 patterns, 0.16 s, L% = 40.15
  ▪ TopPI (k = 3): 60 patterns, 0.25 s, L% = 68.67
  ▪ SCIM (dr = 0.02): 66 patterns, 0.03 s, L% = 59.50
Connect-4 (n = 67,557; q = 42; m = 126)
  ▪ Slim: 1,506 patterns, 88.79 s, L% = 12.12
  ▪ TopPI (k = 8): 965 patterns, 2.87 s, L% = 67.39
  ▪ SCIM (dr = 0.00): 1,002 patterns, 9.98 s, L% = 54.28
Tic-tac-toe (n = 958; q = 9; m = 27)
  ▪ FPClose: 42,684 patterns, 2.98 s, L% = 145.81
  ▪ Slim: 125 patterns, 0.22 s, L% = 53.19
  ▪ TopPI (k = 3): 250 patterns, 0.31 s, L% = 86.27
  ▪ SCIM (dr = 0.02): 330 patterns, 0.06 s, L% = 91.32
Led7 (n = 3,200; q = 7; m = 14)
  ▪ FPClose: 1,936 patterns, 0.20 s, L% = 25.77
  ▪ Slim: 78 patterns, 0.16 s, L% = 25.34
  ▪ TopPI (k = 3): 67 patterns, 0.30 s, L% = 69.06
  ▪ SCIM (dr = 0.02): 15 patterns, 0.04 s, L% = 70.54
Pima (n = 768; q = 8; m = 36)
  ▪ FPClose: 1,608 patterns, 0.17 s, L% = 41.47
  ▪ Slim: 55 patterns, 0.21 s, L% = 31.35
  ▪ TopPI (k = 3): 717 patterns, 0.31 s, L% = 54.57
  ▪ SCIM (dr = 0.02): 522 patterns, 0.06 s, L% = 42.38
Conclusions
• We showed that spatial contextualization can be used to guide closed itemset generation
• We provided an unsupervised clustering heuristic with cluster overlapping
• We presented a procedure to reduce the search space in the FP-tree for the generation of closed itemsets
Thank you!
Project webpage: prograf.ic.uff.br