The 8M Algorithm from Today’s Perspective
Radim Belohlavek, Martin Trnecka
Department of Computer Science, Palacký University Olomouc
CLA 2018: 14th International Conference on Concept Lattices and Their Applications
Olomouc, Czech Republic, June 12–14, 2018
Our Contributions
Boolean matrix factorization (BMF)
Current research = design of new factorization algorithms
Present and analyze the 8M method
  – unknown in present research on BMF
  – (first) complete description of the 8M algorithm
  – improvement of the 8M algorithm (8M+)
  – lessons for the performance of existing algorithms
R. Belohlavek, M. Trnecka (Palacký University Olomouc) 1 / 18
Boolean Matrix Factorization
A general aim: for a given matrix $I \in \{0,1\}^{n \times m}$, find matrices $A \in \{0,1\}^{n \times k}$ and $B \in \{0,1\}^{k \times m}$ for which $I$ (approximately) equals $A \circ B$, with $k$ reasonably small.
$\circ$ is the Boolean matrix product:
$$(A \circ B)_{ij} = \max_{l=1}^{k} \min(A_{il}, B_{lj}).$$
$$\begin{pmatrix} 1&0&1&1&1 \\ 0&1&1&0&1 \\ 0&1&0&0&1 \\ 1&0&1&1&0 \end{pmatrix} = \begin{pmatrix} 1&1&0 \\ 0&1&1 \\ 0&0&1 \\ 1&0&0 \end{pmatrix} \circ \begin{pmatrix} 1&0&1&1&0 \\ 0&0&1&0&1 \\ 0&1&0&0&1 \end{pmatrix}$$
Various terminology and notation (including FCA)
Factors = interesting patterns that help explain data
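The Boolean matrix product can be checked directly on the 4 × 5 example from this slide. The following is a minimal sketch in plain Python; the function name `boolean_product` and the list-of-lists representation are assumptions made for illustration, not part of the original slides.

```python
def boolean_product(A, B):
    """Boolean matrix product: (A o B)[i][j] = max over l of min(A[i][l], B[l][j]).

    A is n x k and B is k x m, both given as lists of 0/1 rows (k >= 1 assumed).
    """
    k = len(B)
    m = len(B[0])
    return [
        [max(min(row[l], B[l][j]) for l in range(k)) for j in range(m)]
        for row in A
    ]


# Matrices from the example on the slide.
I = [
    [1, 0, 1, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0],
]
A = [
    [1, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
    [1, 0, 0],
]
B = [
    [1, 0, 1, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1],
]

product = boolean_product(A, B)
print(product == I)  # the factorization is exact for this example
```

Each row of `A` selects which rows of `B` (factors) are combined by element-wise maximum, which is why the first row of `I` is the union of the first two rows of `B`.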
Error Measure
$I$ (approximately) equals $A \circ B$
Assessed by means of the metric $E(\cdot,\cdot)$:
$$E(C, D) = \sum_{i=1}^{n} \sum_{j=1}^{m} |C_{ij} - D_{ij}|.$$
Two components of $E$:
$$E(I, A \circ B) = E_u(I, A \circ B) + E_o(I, A \circ B),$$
where
$$E_u(I, A \circ B) = |\{\langle i,j \rangle ;\ I_{ij} = 1,\ (A \circ B)_{ij} = 0\}|,$$
$$E_o(I, A \circ B) = |\{\langle i,j \rangle ;\ I_{ij} = 0,\ (A \circ B)_{ij} = 1\}|.$$
Non-symmetry of undercovering and overcovering error
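The split of E into an undercovering part (1s of I that the product misses) and an overcovering part (0s of I that the product turns into 1s) can be sketched as follows. The function name `cover_errors` and the two approximating matrices `P_under` and `P_over` are hypothetical, chosen only to illustrate the definitions.

```python
def cover_errors(I, P):
    """Return (E_u, E_o) for input matrix I and an approximation P of the same shape.

    E_u counts entries with I = 1 and P = 0 (undercovering),
    E_o counts entries with I = 0 and P = 1 (overcovering).
    """
    e_u = sum(1 for ri, rp in zip(I, P) for x, y in zip(ri, rp) if x == 1 and y == 0)
    e_o = sum(1 for ri, rp in zip(I, P) for x, y in zip(ri, rp) if x == 0 and y == 1)
    return e_u, e_o


def total_error(C, D):
    """E(C, D): sum of |C_ij - D_ij| over all entries."""
    return sum(abs(x - y) for rc, rd in zip(C, D) for x, y in zip(rc, rd))


I = [
    [1, 0, 1, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0],
]

# Hypothetical approximation that misses three 1s of I and adds none.
P_under = [
    [1, 0, 1, 1, 1],
    [0, 0, 1, 0, 1],
    [0, 0, 0, 0, 0],
    [1, 0, 1, 1, 0],
]

# Hypothetical approximation that covers all of I but adds one extra 1.
P_over = [
    [1, 0, 1, 1, 1],
    [0, 1, 1, 0, 1],
    [1, 1, 0, 0, 1],
    [1, 0, 1, 1, 0],
]

print(cover_errors(I, P_under))  # (3, 0)
print(cover_errors(I, P_over))   # (0, 1)
```

For 0/1 matrices the two counts always add up to `total_error(I, P)`, which mirrors the decomposition E = E_u + E_o.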
8M
Statistical software package known as BMDP
Developed in the 1960s at the University of California in Los Angeles (W. J. Dixon)
Developed by: M. R. Mickey, L. Engelman and P. Mundle
The 8M method was added to BMDP in the late 1970s
Probably the oldest BMF method
No longer available
Dixon, W. J. (ed.): BMDP Statistical Software Manual. Berkeley, CA: University of California Press (1992)
Incomplete description → several blind spots
Partially black-box analysis of 8M
Basic Idea of 8M
Input:
  – $I \in \{0,1\}^{n \times m}$ ... Boolean matrix
  – k ... number of desired factors
  – init ... number of initial factors
  – cost ... determines significance of overcovering
Output:
  – $A \in \{0,1\}^{n \times k}$ and $B \in \{0,1\}^{k \times m}$
Basic Idea of 8M: main procedure
Algorithm 1: 8M
  B ← ComputeInitialFactors(init)
  A ← 0_{n × init}
  f ← init
  RefineMatricesAB(A, B, I, cost)
  kReached ← 0
  while kReached < 2 or I ≤ A ○ B do
    foreach ⟨i, j⟩ do
      if I_ij > (A ○ B)_ij then Δ⁺_ij ← 1 else Δ⁺_ij ← 0
    add column j of Δ⁺ with the largest count of 1s as new column to A
    add row of 0s as new row to B and set entry j of this row to 1
    f ← f + 1
    RefineMatricesAB(A, B, I, cost)
    if another two new factors were added then
      remove column A_{_(f−2)} from A and row B_{(f−2)_} from B
      f ← f − 1
      RefineMatricesAB(A, B, I, cost)
    if f = k then kReached ← kReached + 1
  return A, B
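The step that seeds a new factor (build Δ⁺, pick its densest column, add a unit row to B) can be sketched in isolation. This is only an illustration of that single step under stated assumptions: the function name `seed_new_factor` is hypothetical, ties between columns are broken by taking the first one, and the refinement that follows in Algorithm 1 is not included.

```python
def seed_new_factor(I, A, B):
    """Return (new_column_for_A, new_row_for_B) built from the uncovered 1s of I.

    delta[i][j] = 1 exactly when I[i][j] = 1 and (A o B)[i][j] = 0.
    Assumption: ties between equally dense columns go to the lowest index.
    """
    n, m, f = len(I), len(I[0]), len(B)
    # Boolean product A o B (an empty factor set gives the zero matrix).
    prod = [
        [max((min(A[i][l], B[l][j]) for l in range(f)), default=0) for j in range(m)]
        for i in range(n)
    ]
    delta = [[int(I[i][j] > prod[i][j]) for j in range(m)] for i in range(n)]
    counts = [sum(delta[i][j] for i in range(n)) for j in range(m)]
    j_best = max(range(m), key=counts.__getitem__)
    new_col = [delta[i][j_best] for i in range(n)]
    new_row = [int(j == j_best) for j in range(m)]
    return new_col, new_row


I = [
    [1, 0, 1, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0],
]
# Two factors already present (hypothetical intermediate state).
A = [[1, 1], [0, 1], [0, 0], [1, 0]]
B = [[1, 0, 1, 1, 0], [0, 0, 1, 0, 1]]

new_col, new_row = seed_new_factor(I, A, B)
print(new_col)  # [0, 1, 1, 0]
print(new_row)  # [0, 1, 0, 0, 0]
```

In this state the only uncovered 1s sit in the second and fifth columns of I, and the second column has more of them, so it seeds the new factor.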
Basic Idea of 8M: refine matrices
Algorithm 2: RefineMatricesAB
  repeat
    RefineMatrixA(A, B, I, cost)
    RefineMatrixB(A, B, I, cost)
  until loop executed 3 times or A and B did not change
Algorithm 3: RefineMatrixA
  foreach row i ∈ {1, ..., n} do
    y ← I_{i_}; Z ← B; A_{i_} ← 0
    repeat
      foreach factor l ∈ {1, ..., f} do
        m_l ← ∑_{j=1}^{m} y_j ⋅ Z_lj − cost ⋅ ∑_{j=1}^{m} (1 − y_j) ⋅ Z_lj
      select p for which m_p = max_l m_l
      if m_p > 0 then
        A_ip ← 1
        foreach j ∈ {1, ..., m} do
          if Z_pj = 1 then Z_{_j} ← 0; y_j ← 0
    until m_p ≤ 0
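The per-row greedy step of RefineMatrixA can be sketched as below. It assumes the inner loop keeps selecting factors while the best score is positive and stops otherwise; the function name `refine_row` and the first-index tie-breaking are assumptions for illustration.

```python
def refine_row(row, B, cost):
    """Greedy Boolean regression of one row of I on the factor rows of B.

    Returns the 0/1 vector saying which factors are used for this row.
    Assumptions: at least one factor, ties go to the lowest factor index.
    """
    y = list(row)                  # part of the row still to be covered
    Z = [list(r) for r in B]       # working copy of the factors
    f, m = len(Z), len(y)
    a = [0] * f
    while True:
        # Score = newly covered 1s minus cost times overcovered 0s.
        scores = [
            sum(y[j] * Z[l][j] for j in range(m))
            - cost * sum((1 - y[j]) * Z[l][j] for j in range(m))
            for l in range(f)
        ]
        p = max(range(f), key=scores.__getitem__)
        if scores[p] <= 0:
            return a
        a[p] = 1
        for j in range(m):
            if Z[p][j] == 1:
                # Column j is now handled: clear it in every factor and in y.
                for l in range(f):
                    Z[l][j] = 0
                y[j] = 0


I = [
    [1, 0, 1, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0],
]
B = [
    [1, 0, 1, 1, 0],
    [0, 0, 1, 0, 1],
    [0, 1, 0, 0, 1],
]

A = [refine_row(row, B, cost=1) for row in I]
print(A)  # [[1, 1, 0], [0, 1, 1], [0, 0, 1], [1, 0, 0]]
```

With the factors of the small example from the Boolean Matrix Factorization slide and cost = 1, the greedy choice recovers the matrix A of that example row by row.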
Basic Idea of 8M: initialization
Algorithm 4: ComputeInitialFactors
  C ← m × m Boolean matrix with all entries equal to 0
  foreach C_ij do
    if I_{_i} ≤ I_{_j} and |I_{_i}| > 0 then C_ij ← 1
  remove all duplicate and empty rows from C
  f ← 0
  foreach row i ∈ {1, ..., m} of matrix C do
    if row C_{i_} has entry j for which C_ij = 1 and C_kj = 0 for all k < i then
      f ← f + 1
      add row C_{i_} as a new row to B
      if f = init then return B
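A minimal sketch of the initialization, read literally from the pseudocode: C records which columns of I are contained in which other columns, and a row of C becomes an initial factor if it has a 1 in a position where no earlier row of C has one. The function name `compute_initial_factors` is hypothetical, and returning fewer than init rows when C runs out is an assumption, since the slide does not cover that case.

```python
def compute_initial_factors(I, init):
    """Return up to `init` rows for B, derived from column inclusions in I."""
    n, m = len(I), len(I[0])
    cols = [[I[r][c] for r in range(n)] for c in range(m)]

    # C[i][j] = 1 when column i of I is non-empty and contained in column j.
    C = [
        [
            int(any(cols[i]) and all(a <= b for a, b in zip(cols[i], cols[j])))
            for j in range(m)
        ]
        for i in range(m)
    ]

    # Drop empty and duplicate rows, keeping the first occurrence of each.
    seen, rows = set(), []
    for row in C:
        key = tuple(row)
        if any(row) and key not in seen:
            seen.add(key)
            rows.append(row)

    B = []
    covered = [0] * m  # positions that hold a 1 in some earlier row of C
    for row in rows:
        if any(r == 1 and c == 0 for r, c in zip(row, covered)):
            B.append(row)
            if len(B) == init:
                return B
        covered = [c | r for c, r in zip(covered, row)]
    return B  # assumption: fewer than init factors were available


I = [
    [1, 0, 1, 1, 1],
    [0, 1, 1, 0, 1],
    [0, 1, 0, 0, 1],
    [1, 0, 1, 1, 0],
]

print(compute_initial_factors(I, 2))  # [[1, 0, 1, 1, 0], [0, 1, 0, 0, 1]]
```

On the small example from the Boolean Matrix Factorization slide, the two initial factors coincide with two of the three rows of B in that example.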
Basic Idea of 8M
1. Computing init initial factors
  – similarity with the Asso algorithm
2. Iteratively computes new factors until k factors are obtained
3. Generating new factor via Boolean regression
4. Previously generated factors are revisited and dropped
  – adds two factors, then removes the factor generated two steps back
  – k = 6, sequence: 2, 3, 4, 3, 4, 5, 4, 5, 6, 5, 6
Comparison with Other Methods
Tiling
  Geerts, Goethals, Mielikainen: Tiling databases. In: Discovery Science 2004 (2004).
Asso
  Miettinen, Mielikainen, Gionis, Das, Mannila: The discrete basis problem. IEEE Trans. Knowledge and Data Eng. (2008).
GreConD
  Belohlavek, Vychodil: Discovery of optimal factors in binary data via a novel method of matrix decomposition. J. Comput. Syst. Sci. (2010).
Hyper
  Xiang, Jin, Fuhry, Dragan: Summarizing transactional databases with overlapped hyperrectangles. Data Mining and Know. Discovery (2011).
PaNDa
  Lucchese, Orlando, Perego: Mining top-K patterns from binary datasets in presence of noise. In: SIAM DM 2010 (2010).
Comparison with Other Methods: results
[Figure with two panels, (a) Mushroom and (b) Set X1. Horizontal axis: k (number of factors); vertical axis: coverage. Curves: 8M, Tiling, Asso, GreConD, PaNDa, Hyper.]
Figure: Coverage quality of the first l factors on real and synthetic data.
8M from Today’s Perspective
Improvements of 8M
Lessons from 8M
Improvements of 8M
8M+
New initialization step
Very fast strategy of the GreConD algorithm
No overcovering error
Comparison of 8M and 8M+
[Figure with two panels, (a) Mushroom and (b) DNA. Horizontal axis: l (number of factors); vertical axis: coverage. Curves: 8M, 8M+.]
Figure: Coverage quality of the first l factors on real data: 8M vs. 8M+.
Lesson from 8M
Revisiting the previously generated factors
Significant aspect
Non-symmetry of undercovering and overcovering error
Existing algorithms do not use any kind of revisiting
Improvement of existing algorithms
Removes factors driven by parameter p
Lesson from 8M: improvement of GreConD
                               p
Dataset          orig.   0       0.01    0.02    0.03    0.04    0.05
Emea         k   42      34      29      26      25      24      23
             c   1.000   1.000   0.992   0.981   0.975   0.963   0.956
Chess        k   124     119     72      62      55      51      47
             c   1.000   1.000   0.991   0.981   0.970   0.962   0.952
Firewall 1   k   66      65      17      10      8       7       6
             c   1.000   1.000   0.990   0.981   0.972   0.964   0.953
Firewall 2   k   10      10      4       4       4       4       3
             c   1.000   1.000   0.998   0.998   0.998   0.998   0.958
Mushroom     k   120     113     81      73      69      65      61
             c   1.000   1.000   0.990   0.980   0.970   0.960   0.951
Conclusions
Detailed description of 8M
Improvement of 8M
New ideas for current BMF algorithms
Explore revisiting of factors
Thank you