How to assess quality of BMF algorithms?
Radim Belohlavek, Jan Outrata, Martin Trnecka
Department of Computer Science, Palacký University Olomouc, Czech Republic
IEEE International Conference on Intelligent Systems IS’16, Sofia, Bulgaria, September 4-6, 2016

Motivation
Boolean matrix factorization (BMF): a method for the analysis of Boolean data.
Various algorithms exist (more than 25).
How to assess their quality? Poorly discussed in the literature.
R. Belohlavek, J. Outrata, M. Trnecka (Palacký University Olomouc) Sofia, Bulgaria, September 2016 1 / 15

Boolean Matrix Factorization
A general aim: for a given matrix $I \in \{0,1\}^{n \times m}$, find matrices $A \in \{0,1\}^{n \times k}$ and $B \in \{0,1\}^{k \times m}$ for which $I$ (approximately) equals $A \circ B$.
$\circ$ is the Boolean matrix product:
$$(A \circ B)_{ij} = \max_{l=1}^{k} \min(A_{il}, B_{lj}).$$
Example:
$$\begin{pmatrix} 1&0&1&1&1 \\ 0&1&1&0&1 \\ 0&1&0&0&1 \\ 1&0&1&1&0 \end{pmatrix} = \begin{pmatrix} 1&1&0 \\ 0&1&1 \\ 0&0&1 \\ 1&0&0 \end{pmatrix} \circ \begin{pmatrix} 1&0&1&1&0 \\ 0&0&1&0&1 \\ 0&1&0&0&1 \end{pmatrix}$$
Discovery of $k$ factors that exactly or approximately explain the data.
Factors = interesting patterns (rectangles) in data.
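The max-min definition of the Boolean product can be checked directly on the example matrices. The following is a minimal sketch in plain Python; the function name `boolean_product` and the list-of-lists representation are illustrative assumptions, not part of the presentation.

```python
def boolean_product(A, B):
    """Boolean matrix product: (A o B)[i][j] = max over l of min(A[i][l], B[l][j])."""
    k = len(B)       # number of factors (columns of A, rows of B)
    m = len(B[0])    # number of columns of the data matrix
    return [
        [max(min(row[l], B[l][j]) for l in range(k)) for j in range(m)]
        for row in A
    ]


# Matrices from the slide example (4 x 5 data, k = 3 factors).
I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]
A = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
B = [[1, 0, 1, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 1, 0, 0, 1]]

print(boolean_product(A, B) == I)  # prints True: the factorization is exact
```

For 0/1 entries, `min` acts as logical AND and `max` as logical OR, so the same function could be written with `and`/`or` or bitwise operators.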

Computational Complexity
Basic feature of each algorithm. We prefer the algorithm with the smaller complexity.
Big O notation (hides several issues).
Better way: relative time complexity, e.g. “One algorithm is three times faster than the other.”
Time (and space) complexity is not a critical issue for most current algorithms: they are runnable on an ordinary PC.

Approximation Factor
The optimization version of the basic decomposition problem is NP-hard.
No polynomial-time algorithm computing an exact solution exists (unless P = NP).
Algorithms are therefore based on heuristics → approximation factor.
Recent results on inapproximability: the basic decomposition problem is NP-hard to approximate within a factor of $n^{1-\epsilon}$.
Chalermsook P., Heydrich S., Holm E., Karrenbauer A.: Nearly tight approximability results for minimum biclique cover and partition. ESA 2014, pp. 235–246.
The lower bound is not encouraging. Current algorithms produce much better results.

Quality of Factors
1. Geometry of factorization → coverage of the entries containing 1s by rectangles
$$\begin{pmatrix} 1&0&1&1&1 \\ 0&1&1&0&1 \\ 0&1&0&0&1 \\ 1&0&1&1&0 \end{pmatrix} = \begin{pmatrix} 1&1&0 \\ 0&1&1 \\ 0&0&1 \\ 1&0&0 \end{pmatrix} \circ \begin{pmatrix} 1&0&1&1&0 \\ 0&0&1&0&1 \\ 0&1&0&0&1 \end{pmatrix}$$
$$\begin{pmatrix} 1&0&1&1&1 \\ 0&1&1&0&1 \\ 0&1&0&0&1 \\ 1&0&1&1&0 \end{pmatrix} = \begin{pmatrix} 1&0&1&1&0 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \\ 1&0&1&1&0 \end{pmatrix} \vee \begin{pmatrix} 0&0&1&0&1 \\ 0&0&1&0&1 \\ 0&0&0&0&0 \\ 0&0&0&0&0 \end{pmatrix} \vee \begin{pmatrix} 0&0&0&0&0 \\ 0&1&0&0&1 \\ 0&1&0&0&1 \\ 0&0&0&0&0 \end{pmatrix}$$
2. Interpretability of individual factors
   Knowledge discovery view → maximal rectangles
3. Quality of a set of extracted factors
   Reduction of dimensionality
   Explanatory view
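The geometric reading can be made concrete: factor $l$ corresponds to the rectangle obtained from column $l$ of $A$ and row $l$ of $B$, and the Boolean product is the entrywise OR of these rectangles. Below is a small sketch under that reading; the helper names `rectangle`, `rectangles`, and `union` are hypothetical and chosen only for this illustration.

```python
def rectangle(column, row):
    """Outer product of a 0/1 column of A and a 0/1 row of B."""
    return [[c & r for r in row] for c in column]


def rectangles(A, B):
    """One rectangle per factor l: column l of A combined with row l of B."""
    return [rectangle([a_row[l] for a_row in A], B[l]) for l in range(len(B))]


def union(matrices):
    """Entrywise Boolean OR (max) of equally sized 0/1 matrices."""
    return [[max(vals) for vals in zip(*rows)] for rows in zip(*matrices)]


# Matrices from the slide example.
I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]
A = [[1, 1, 0],
     [0, 1, 1],
     [0, 0, 1],
     [1, 0, 0]]
B = [[1, 0, 1, 1, 0],
     [0, 0, 1, 0, 1],
     [0, 1, 0, 0, 1]]

rects = rectangles(A, B)
for r in rects:
    print(r)
print(union(rects) == I)  # prints True: the three rectangles cover exactly the 1s of I
```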

Explanation of Data by Factors
How large a portion of the data is explained by the factors?
Distance (error function):
$$E(C, D) = \|C - D\| = \sum_{i=1}^{n} \sum_{j=1}^{m} |C_{ij} - D_{ij}|.$$
Two components of $E$:
$$E(I, A \circ B) = E_u(I, A \circ B) + E_o(I, A \circ B),$$
where
$$E_u(I, A \circ B) = |\{\langle i, j \rangle ;\ I_{ij} = 1,\ (A \circ B)_{ij} = 0\}|,$$
$$E_o(I, A \circ B) = |\{\langle i, j \rangle ;\ I_{ij} = 0,\ (A \circ B)_{ij} = 1\}|.$$
Coverage quality for $A \in \{0,1\}^{n \times l}$ and $B \in \{0,1\}^{l \times m}$, where $\|I\|$ is the number of 1s in $I$:
$$c(l) = 1 - E(I, A \circ B) / \|I\|.$$
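As a worked instance of these definitions, the sketch below counts the two error components and the coverage for the example matrix $I$ when only the first two of its three factors are used. The function names and the hard-coded product `P` are assumptions made for this illustration.

```python
def error_components(I, P):
    """Return (E_u, E_o): uncovered 1s of I, and 0s of I that P overcovers."""
    pairs = [(x, y) for ri, rp in zip(I, P) for x, y in zip(ri, rp)]
    e_u = sum(1 for x, y in pairs if x == 1 and y == 0)
    e_o = sum(1 for x, y in pairs if x == 0 and y == 1)
    return e_u, e_o


def coverage(I, P):
    """Coverage 1 - E(I, P) / ||I||, where ||I|| is the number of 1s in I."""
    e_u, e_o = error_components(I, P)
    ones = sum(sum(row) for row in I)
    return 1 - (e_u + e_o) / ones


I = [[1, 0, 1, 1, 1],
     [0, 1, 1, 0, 1],
     [0, 1, 0, 0, 1],
     [1, 0, 1, 1, 0]]

# Boolean product of the first two factors of the example (third factor dropped).
P = [[1, 0, 1, 1, 1],
     [0, 0, 1, 0, 1],
     [0, 0, 0, 0, 0],
     [1, 0, 1, 1, 0]]

print(error_components(I, P))  # (3, 0): three 1s are still uncovered, nothing is overcovered
print(coverage(I, P))          # 0.75, since ||I|| = 12 and E = 3
```

Because the first two factors only place 1s where $I$ has 1s, the overcover component is zero here; an approximation that also flips 0s to 1s would have both components positive.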

Two Basic Viewpoints on BMF
Discrete Basis Problem (DBP)
– Given $I \in \{0,1\}^{n \times m}$ and a positive integer $k$, find $A \in \{0,1\}^{n \times k}$ and $B \in \{0,1\}^{k \times m}$ that minimize $\|I - A \circ B\|$.
– Emphasizes the importance of the first few (presumably most important) factors.
– Miettinen P., Mielikäinen T., Gionis A., Das G., Mannila H., The discrete basis problem, IEEE Transactions on Knowledge and Data Engineering 20(10)(2008), 1348–1362.
Approximate Factorization Problem (AFP)
– Given $I$ and a prescribed error $\varepsilon \geq 0$, find $A \in \{0,1\}^{n \times k}$ and $B \in \{0,1\}^{k \times m}$ with $k$ as small as possible such that $\|I - A \circ B\| \leq \varepsilon$.
– Emphasizes the need to account for (and thus to explain) a prescribed (presumably reasonably large) portion of data.
– Belohlavek R., Trnecka M., From-below approximations in Boolean matrix factorization: Geometry and new algorithm, Journal of Computer and System Sciences 81(8)(2015), 1678–1697.
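The two viewpoints ask different questions about the same sequence of factors. The sketch below assumes an algorithm has already produced factors one by one and that `errors[j]` holds the error $\|I - A \circ B\|$ after the first `j` factors. It only evaluates such a sequence and is not a solver for either problem. The function names are hypothetical.

```python
def error_with_k_factors(errors, k):
    """DBP-style question: how much error is left after the first k factors?"""
    return errors[min(k, len(errors) - 1)]


def factors_needed(errors, eps):
    """AFP-style question: smallest number of factors with error <= eps (None if never)."""
    for j, e in enumerate(errors):
        if e <= eps:
            return j
    return None


# Error profile of the 4 x 5 example: ||I|| = 12 ones, and the three factors,
# taken in order, leave 6, 3 and 0 entries unexplained.
errors = [12, 6, 3, 0]

print(error_with_k_factors(errors, 2))  # 3
print(factors_needed(errors, 3))        # 2
print(factors_needed(errors, 0))        # 3 (an exact factorization needs all three factors)
```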

Quality Measure
Weights:
$w_l = l/k$ for the DBP view,
$w_l = 1 + (E(I, A \circ B) - \varepsilon) / (\|I\| - \varepsilon)$ for the AFP view,
$w_l = \left( l/k + 1 + (E(I, A \circ B) - \varepsilon) / (\|I\| - \varepsilon) \right) / 2$ for the combined view.
Quality:
$$q = \left( \sum_{j=0}^{l} w_j \left( 1 - \frac{E(I, A \circ B)}{\|I\|} \right) \right) / \left( \sum_{j=0}^{l} w_j \right).$$
In the $j$-th term, $A$ and $B$ consist of the first $j$ factors, so the term in parentheses is the coverage $c(j)$ and $q$ is a weighted average of the coverage values.
Reflects natural requirements for a good decomposition.
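Read as a weighted average of the coverage values $c(0), \dots, c(l)$, the measure is straightforward to compute once the weights are fixed. The sketch below uses the DBP weights $w_j = j/k$ with $l = k$; the function names, the choice $l = k$, and the small coverage profile are assumptions for illustration only. Other weightings can be passed in the same way.

```python
def quality(coverages, weights):
    """Weighted average of coverages c(0..l) with weights w_0..w_l."""
    if len(coverages) != len(weights):
        raise ValueError("coverages and weights must have the same length")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    return sum(w * c for w, c in zip(weights, coverages)) / total


def dbp_weights(l, k):
    """DBP-view weights w_j = j / k for j = 0..l."""
    return [j / k for j in range(l + 1)]


# Coverage after 0, 1, 2 and 3 factors of the 4 x 5 example (errors 12, 6, 3, 0 out of 12 ones).
coverages = [0.0, 0.5, 0.75, 1.0]

q_dbp = quality(coverages, dbp_weights(3, 3))
q_uniform = quality(coverages, [1, 1, 1, 1])
print(round(q_dbp, 4))      # 0.8333
print(round(q_uniform, 4))  # 0.5625
```

With uniform weights the measure is the plain mean of the coverage curve; the DBP weights shift the emphasis toward the coverage reached near $j = k$.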

Interpretation
[Figure: Measure of quality of a BMF algorithm. Coverage $c(j)$ on the vertical axis (0 to 1) against the number of factors $j$ on the horizontal axis (0 to $l = 99$), with $q$ marked on the plot.]

Experimental Evaluation
Asso: Miettinen P., Mielikäinen T., Gionis A., Das G., Mannila H., The discrete basis problem, IEEE Transactions on Knowledge and Data Engineering 20(10)(2008), 1348–1362.
NaiveCol: Ene A. et al., Fast exact and heuristic methods for role minimization problems, Proc. SACMAT 2008, pp. 1–10.
GreConD: Belohlavek R., Vychodil V., Discovery of optimal factors in binary data via a novel method of matrix decomposition, Journal of Computer and System Sciences 76(1)(2010), 3–20.
PaNDa: Lucchese C., Orlando S., Perego R., Mining top-K patterns from binary datasets in presence of noise, SIAM DM 2010, pp. 165–176.
Hyper: Xiang Y., Jin R., Fuhry D., Dragan F. F., Summarizing transactional databases with overlapped hyperrectangles, Data Mining and Knowledge Discovery 23(2011), 215–251.
GreEss: Belohlavek R., Trnecka M., From-below approximations in Boolean matrix factorization: Geometry and new algorithm, Journal of Computer and System Sciences 81(8)(2015), 1678–1697.

Results
Table: Numbers of factors and coverage quality (dataset: Mushroom)

             GreConD  NaiveCol  GreEss  PaNDa  Hyper   Asso
Number of factors needed to reach coverage c:
c = 80 %          19        29      32     42     NA     31
c = 90 %          34        46      47     57     NA     47
c = 95 %          50        62      62     70     NA     61
c = 100 %         NA       120     110    123     NA    105
Coverage reached with the first k factors:
k = 10         0.556     0.582   0.512  0.285  0.346  0.546
k = 20         0.652     0.715   0.674  0.502  0.346  0.696
k = 30         0.720     0.812   0.789  0.664  0.346  0.793
k = 40         0.765     0.873   0.862  0.780  0.346  0.865

Results
Table: BMF algorithm quality (dataset: Mushroom)

              GreConD  NaiveCol  GreEss  PaNDa  Hyper   Asso
q_{0.8}         0.622     0.740   0.729  0.657  0.344  0.733
q_{0.9}         0.695     0.801   0.786  0.709  0.344  0.794
q_{0.95}        0.725     0.827   0.810  0.728  0.344  0.819
q_{1}           0.745     0.844   0.827  0.749  0.344  0.835
q_{10}          0.556     0.582   0.511  0.285  0.346  0.545
q_{20}          0.650     0.712   0.671  0.498  0.346  0.693
q_{30}          0.715     0.805   0.781  0.654  0.346  0.786
q_{40}          0.756     0.861   0.848  0.760  0.346  0.851
q_{10, 0.9}     0.764     0.876   0.863  0.798  0.344  0.870
q_{20, 0.8}     0.763     0.874   0.860  0.792  0.344  0.867

General Discussion
GreConD → good from both the DBP and the AFP view.
GreEss → outperforms GreConD.
Asso → good from the DBP view, bad from the AFP view.
NaiveCol → good from the AFP view, bad from the DBP view.
PaNDa → very poor results (MDL as the main criterion).

Conclusion
We point out an important problem in BMF: assessing the quality of BMF algorithms.
We identify key aspects of such assessment.
We propose quantitative ways to measure the quality of BMF algorithms.

Thank you
