
An Algorithm for the Multi-Relational Boolean Factor Analysis based on Essential Elements


  1. An Algorithm for the Multi-Relational Boolean Factor Analysis based on Essential Elements Martin Trnecka, Marketa Trneckova DEPARTMENT OF COMPUTER SCIENCE PALACKÝ UNIVERSITY, OLOMOUC CLA: Concept Lattices and Their Applications Košice, Slovakia, October 7-10, 2014

  2. Introduction: Boolean factor analysis (BFA) is an established method for the analysis and preprocessing of Boolean data. The basic task in BFA is to find new variables (factors) that explain or describe the original input data. Finding factors is an important step in understanding and managing data. In its classic setting, BFA can handle only one input data table, but many real-world data sets are more complex than a single data table. Multi-relational data are data composed of several tables interconnected via relations between the objects or attributes of these tables. Our goal: propose an algorithm for multi-relational Boolean factor analysis. M. Trnecka, M. Trneckova (Palacký University, Olomouc) Košice, Slovakia, October 2014 1 / 16

  3. Previous Work: Krmelova M., Trnecka M.: Boolean Factor Analysis of Multi-Relational Data. In: M. Ojeda-Aciego, J. Outrata (Eds.): CLA 2013: Proceedings of the 10th International Conference on Concept Lattices and Their Applications, 2013, pp. 187-198. Problem setting: we have two Boolean data tables $C_1$ and $C_2$ that are interconnected by a relation $R_{C_1 C_2}$. The relation is defined over the objects of the first data table $C_1$ and the attributes of the second data table $C_2$, i.e., it is an objects-attributes relation. That work introduced the notion of a multi-relational factor, i.e., a pair of classic factors, one from each data table. An algorithm for computing multi-relational factors was missing.

  4. Satisfying Relation: The previous work introduced three approaches: the narrow approach, the wide approach, and the α-approach. We use the most natural one, the narrow approach. Idea of the narrow approach: we connect two factors $F^{C_1}_i$ and $F^{C_2}_j$ if the non-empty set of attributes (if such a set exists) that are common, in the relation $R_{C_1 C_2}$, to all objects of the first factor $F^{C_1}_i$ is a subset of the attributes of the second factor $F^{C_2}_j$.

  5. Naive Algorithm:
     Table: $C_1$
       | a | b | c | d |
     1 |   | × | × | × |
     2 | × |   | × |   |
     3 |   | × |   | × |
     4 | × | × | × | × |
     Table: $C_2$
       | e | f | g | h |
     5 | × |   |   | × |
     6 |   | × | × |   |
     7 | × | × | × |   |
     8 |   |   | × | × |
     [Table $R_{C_1 C_2}$: objects 1-4 of $C_1$ against attributes e, f, g, h of $C_2$]
     Factors of data table $C_1$ are $F^{C_1}_1 = \langle \{1, 4\}, \{b, c, d\} \rangle$, $F^{C_1}_2 = \langle \{2, 4\}, \{a, c\} \rangle$, $F^{C_1}_3 = \langle \{1, 3, 4\}, \{b, d\} \rangle$, and factors of data table $C_2$ are $F^{C_2}_1 = \langle \{6, 7\}, \{f, g\} \rangle$, $F^{C_2}_2 = \langle \{5\}, \{e, h\} \rangle$, $F^{C_2}_3 = \langle \{5, 7\}, \{e\} \rangle$, $F^{C_2}_4 = \langle \{8\}, \{g, h\} \rangle$. These factors can be connected into two multi-relational factors, $\langle F^{C_1}_1, F^{C_2}_1 \rangle$ and $\langle F^{C_1}_3, F^{C_2}_1 \rangle$. It is usually problematic to connect all factors from each data table, because there are only a few connections between them. This leads to multi-relational factors of poor quality.
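The naive procedure can be sketched in Python as follows: take the factors of each table and test every pair with the narrow approach. The factor lists are the ones given on the slide. The `relation` dictionary is an assumption: it is one relation consistent with the two connections reported above, not necessarily the original table. The function names are hypothetical.

```python
# Factors from the slide, written as (objects, attributes) pairs.
factors_c1 = {
    1: ({1, 4}, {"b", "c", "d"}),
    2: ({2, 4}, {"a", "c"}),
    3: ({1, 3, 4}, {"b", "d"}),
}
factors_c2 = {
    1: ({6, 7}, {"f", "g"}),
    2: ({5}, {"e", "h"}),
    3: ({5, 7}, {"e"}),
    4: ({8}, {"g", "h"}),
}

# ASSUMPTION: an illustrative relation R (object of C1 -> attributes of C2).
# It is consistent with the slide's result but is not taken from the slide.
relation = {
    1: {"f", "g"},
    2: {"e", "g"},
    3: {"e", "f", "g"},
    4: {"e", "f", "g", "h"},
}


def common_attributes(objects, relation):
    """Attributes that every given object has in the relation."""
    rows = [relation[o] for o in objects]
    return set.intersection(*rows) if rows else set()


def can_connect(factor_1, factor_2, relation):
    """Narrow approach: the common attributes are non-empty and lie in the second intent."""
    common = common_attributes(factor_1[0], relation)
    return bool(common) and common <= factor_2[1]


def naive_connections(factors_1, factors_2, relation):
    """Try every pair of factors and keep the pairs that can be connected."""
    return [
        (i, j)
        for i, f1 in factors_1.items()
        for j, f2 in factors_2.items()
        if can_connect(f1, f2, relation)
    ]


print(naive_connections(factors_c1, factors_c2, relation))
```

Under the assumed relation only the pairs (1, 1) and (3, 1) survive, which illustrates how few of the twelve possible pairs the naive approach may connect.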

  6. Essential Elements: The notion of essential elements was introduced in: Belohlavek R., Trnecka M.: From-Below Approximations in Boolean Matrix Factorization: Geometry and New Algorithm. http://arxiv.org/abs/1306.4905, 2013. Essential elements of a Boolean data table are entries of the table that are sufficient for covering the whole table by factors (concepts): if we take factors that cover all these entries, we automatically cover all entries of the input data table. Formally, essential elements of the data table $\langle X, Y, C \rangle$ are defined via minimal intervals in the concept lattice. The entry $C_{ij}$ is essential iff the interval bounded by the formal concepts $\langle i^{\uparrow\downarrow}, i^{\uparrow} \rangle$ and $\langle j^{\downarrow}, j^{\downarrow\uparrow} \rangle$ is non-empty and minimal w.r.t. $\subseteq$ (i.e., it does not contain any other such interval). If the table entry $C_{ij}$ is essential, then the interval $I_{ij}$ represents the set of all formal concepts (factors) that cover this entry. It is sufficient to take one arbitrary concept from each interval to create an exact Boolean decomposition of $\langle X, Y, C \rangle$. The essential part of the input data table can be easily constructed.
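A minimal Python sketch of how the essential part might be constructed. It rests on an assumption of this sketch rather than on a statement from the slide: an interval $I_{ij}$ is treated as minimal when no other object with entry $j$ has a strictly smaller row and no other attribute of object $i$ has a strictly smaller column. The table `table_c1` is also assumed (it is the union of the factors of $C_1$ from the running example), and `essential_part` is a hypothetical name.

```python
# ASSUMPTION: C1 from the running example, as object -> set of attributes.
table_c1 = {
    1: {"b", "c", "d"},
    2: {"a", "c"},
    3: {"b", "d"},
    4: {"a", "b", "c", "d"},
}


def essential_part(table):
    """Return, per object, the attributes whose entries are essential.

    An entry (i, j) is kept when the interval I_ij contains no other interval,
    which this sketch tests through strict row and column containment.
    """
    attributes = set().union(*table.values())
    columns = {a: {o for o, row in table.items() if a in row} for a in attributes}
    essential = {}
    for i, row in table.items():
        essential[i] = set()
        for j in row:
            # Another object with entry j whose row is a proper subset of row i.
            smaller_row = any(j in other and other < row for other in table.values())
            # Another attribute of object i whose column is a proper subset of column j.
            smaller_column = any(columns[l] < columns[j] for l in row)
            if not smaller_row and not smaller_column:
                essential[i].add(j)
    return essential


print(essential_part(table_c1))
```

For the assumed table, object 4 ends up with no essential entries: every entry in its row is already forced by a smaller row.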

  7. Idea of Algorithm:
     Table: $C_1$
       | a | b | c | d |
     1 |   | × | × | × |
     2 | × |   | × |   |
     3 |   | × |   | × |
     4 | × | × | × | × |
     Table: $C_2$
       | e | f | g | h |
     5 | × |   |   | × |
     6 |   | × | × |   |
     7 | × | × | × |   |
     8 |   |   | × | × |
     Table: $Ess(C_1)$
       | a | b | c | d |
     1 |   |   | × |   |
     2 | × |   |   |   |
     3 |   | × |   | × |
     4 |   |   |   |   |
     Table: $Ess(C_2)$
       | e | f | g | h |
     5 | × |   |   | × |
     6 |   | × |   |   |
     7 | × |   |   |   |
     8 |   |   | × | × |
     [Figure: concept lattices of $C_1$ and $C_2$ with highlighted intervals]

  8. Idea of Algorithm: If we take the highlighted intervals, we obtain up to four possible connections. The first highlighted interval contains two concepts, $c_1 = \langle \{1, 2, 4\}, \{c\} \rangle$ and $c_2 = \langle \{1, 4\}, \{b, c, d\} \rangle$. The second consists of the concepts $d_1 = \langle \{6, 7, 8\}, \{g\} \rangle$ and $d_2 = \langle \{8\}, \{g, h\} \rangle$. Only two connections ($c_1$ with $d_1$ and $c_1$ with $d_2$) satisfy the relation $R_{C_1 C_2}$, i.e., can be connected. Search space reduction: for two intervals it is not necessary to try all combinations of factors. If we are not able to connect a concept $\langle A, B \rangle$ from the first interval with a concept $\langle C, D \rangle$ from the second interval, we are not able to connect $\langle A, B \rangle$ with any concept $\langle E, F \rangle$ from the second interval where $\langle C, D \rangle \subseteq \langle E, F \rangle$. Likewise, if we are not able to connect a concept $\langle A, B \rangle$ from the first interval with a concept $\langle E, F \rangle$ from the second interval, we are not able to connect any concept $\langle C, D \rangle$ from the first interval, where $\langle C, D \rangle \subseteq \langle A, B \rangle$, with the concept $\langle E, F \rangle$.
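The first pruning rule can be sketched in Python as follows. The concepts are the ones named on the slide. The rows of the relation for objects 1, 2 and 4 are an assumption chosen to agree with the two connections reported above. The sketch also assumes that the concepts of the second interval form a chain listed from the smallest to the largest, which holds for $d_2$ and $d_1$ but not for intervals in general. All function names are hypothetical.

```python
# ASSUMPTION: rows of R for objects 1, 2 and 4, for illustration only.
relation = {
    1: {"f", "g"},
    2: {"e", "g"},
    4: {"e", "f", "g", "h"},
}

# Concepts of the two highlighted intervals, as (objects, attributes) pairs.
first_interval = {
    "c1": ({1, 2, 4}, {"c"}),
    "c2": ({1, 4}, {"b", "c", "d"}),
}
# Listed from the smallest concept to the largest one (d2 lies below d1).
second_interval_ascending = {
    "d2": ({8}, {"g", "h"}),
    "d1": ({6, 7, 8}, {"g"}),
}


def common_attributes(objects, relation):
    """Attributes shared in the relation by all objects (objects must be non-empty)."""
    return set.intersection(*(relation[o] for o in objects))


def connect_with_pruning(first, second_ascending, relation):
    """Return the connected pairs and the number of subset checks performed."""
    found, checks = [], 0
    for name_1, (objects_1, _) in first.items():
        common = common_attributes(objects_1, relation)
        for name_2, (_, attributes_2) in second_ascending.items():
            checks += 1
            if not common or not (common <= attributes_2):
                # Every larger concept has a smaller intent, so it fails as well.
                break
            found.append((name_1, name_2))
    return found, checks


print(connect_with_pruning(first_interval, second_interval_ascending, relation))
```

With these inputs three checks are made instead of four: once $c_2$ fails against $d_2$, the pair $c_2$, $d_1$ is skipped.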

  9. Searching the intervals is still time consuming. Heuristic: in the intervals of the second data table, take the object concepts (the bottom elements of each interval). In the intervals of the first data table, take the greatest concepts that can be connected via the relation (i.e., whose set of common attributes in the relation is non-empty). The idea behind this heuristic: a bigger set of objects tends to have a smaller set of common attributes in the relation, which gives a higher chance of connecting this factor with some factor from the second data table. Applying this heuristic to the data from the example, we obtain three factors in the first data table, $F^{C_1}_1 = \langle \{2, 4\}, \{a, c\} \rangle$, $F^{C_1}_2 = \langle \{1, 3, 4\}, \{b, d\} \rangle$, $F^{C_1}_3 = \langle \{1, 2, 4\}, \{c\} \rangle$, and four factors $F^{C_2}_1 = \langle \{5\}, \{e, h\} \rangle$, $F^{C_2}_2 = \langle \{6, 7\}, \{f, g\} \rangle$, $F^{C_2}_3 = \langle \{7\}, \{e, f, g\} \rangle$, $F^{C_2}_4 = \langle \{8\}, \{g, h\} \rangle$ from the second one. Between these factors there are six connections satisfying the relation:
                 | $F^{C_2}_1$ | $F^{C_2}_2$ | $F^{C_2}_3$ | $F^{C_2}_4$ |
     $F^{C_1}_1$ |             |             | ×           |             |
     $F^{C_1}_2$ |             | ×           | ×           |             |
     $F^{C_1}_3$ |             | ×           | ×           | ×           |
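The four factors listed for the second data table can be reproduced as the concepts generated by single objects, $\langle i^{\uparrow\downarrow}, i^{\uparrow} \rangle$. The Python sketch below shows the two derivation operators involved. The table `table_c2` is an assumption (the union of the factors of $C_2$ from the running example), and `up`, `down` and `object_concept` are hypothetical names.

```python
# ASSUMPTION: C2 from the running example, as object -> set of attributes.
table_c2 = {
    5: {"e", "h"},
    6: {"f", "g"},
    7: {"e", "f", "g"},
    8: {"g", "h"},
}


def up(objects, table):
    """Attributes shared by all given objects (all attributes for the empty set)."""
    if not objects:
        return set().union(*table.values())
    return set.intersection(*(table[o] for o in objects))


def down(attributes, table):
    """Objects that have all given attributes."""
    return {o for o, row in table.items() if attributes <= row}


def object_concept(obj, table):
    """Concept generated by one object: its row as intent, the closure as extent."""
    intent = up({obj}, table)
    return down(intent, table), intent


for obj in sorted(table_c2):
    print(obj, object_concept(obj, table_c2))
```

Object 6, for example, generates the extent {6, 7}, because object 7 also has both f and g.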

  10. Final Algorithm for MBMF:
      Input: Boolean matrices $C_1$, $C_2$, a relation $R_{C_1 C_2}$ between them, and $p \in [0, 1]$
      Output: set $M$ of multi-relational factors
      $E_{C_1} \leftarrow Ess(C_1)$
      $E_{C_2} \leftarrow Ess(C_2)$
      $U_{C_1} \leftarrow C_1$
      $U_{C_2} \leftarrow C_2$
      while $(|U_{C_1}| + |U_{C_2}|) / (|C_1| + |C_2|) \geq p$ do
        foreach essential element $(E_{C_1})_{ij}$ do
          compute the best candidate $\langle a, b \rangle$ from the interval $I_{ij}$
        end
        $\langle A, B \rangle \leftarrow$ select, from the set of candidates, one that maximizes the cover of $C_1$
        select a non-empty row $i$ in $E_{C_2}$ for which $A^{\uparrow_{R_{C_1 C_2}}} \subseteq (C_2)_{i\_}^{\downarrow\uparrow_{C_2}}$ and which maximizes the cover of $C_1$ and $C_2$
        $\langle C, D \rangle \leftarrow \langle (C_2)_{i\_}^{\uparrow\downarrow_{C_2}}, (C_2)_{i\_}^{\uparrow_{C_2}} \rangle$
        if the value of the cover function for $C_1$ and $C_2$ is equal to zero then
          break
        end
        add $\langle \langle A, B \rangle, \langle C, D \rangle \rangle$ to $M$
        set $(U_{C_1})_{ij} = 0$ where $i \in A$ and $j \in B$
        set $(U_{C_2})_{ij} = 0$ where $i \in C$ and $j \in D$
      end
      return $M$

  11. Remarks: In each step we connect the factors that cover the biggest part of the still uncovered part of the data tables $C_1$ and $C_2$. First, we obtain the multi-relational factor $\langle F^{C_1}_2, F^{C_2}_2 \rangle$, which covers 50 percent of the data. Then we obtain the factor $\langle F^{C_1}_3, F^{C_2}_4 \rangle$, which together with the first factor covers 75 percent of the data, and last we obtain the factor $\langle F^{C_1}_1, F^{C_2}_3 \rangle$. Together, these three factors cover 90 percent of the data. Adding other factors does not improve the coverage of the input data. These three factors cover the same part of the input data as the six connections from the previous table. Multi-relational factors are not always able to explain the whole data. This is due to the nature of the data: there is simply no information on how to connect some classic factors. E.g., in the example, no set of objects from $C_1$ has in $R_{C_1 C_2}$ a set of common attributes equal to $\{e, h\}$ (or only $\{e\}$ or only $\{h\}$). For this reason we are not able to connect any factor from $C_1$ with the factor $F^{C_2}_1$.
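The coverage figures can be checked with a simplified greedy sketch in Python. This is not the algorithm from the previous slide: instead of searching intervals of essential elements, it repeatedly picks, from a fixed list of connections, the pair that covers the most still uncovered entries. The tables, the intent $\{b, d\}$ of $F^{C_1}_2$, and the list of six connections are assumptions reconstructed from the running example, and all names are hypothetical.

```python
# ASSUMPTION: data tables of the running example, as object -> attributes.
table_c1 = {1: "bcd", 2: "ac", 3: "bd", 4: "abcd"}
table_c2 = {5: "eh", 6: "fg", 7: "efg", 8: "gh"}

# Factors obtained by the heuristic, as (objects, attributes) pairs.
factors_c1 = {1: ({2, 4}, "ac"), 2: ({1, 3, 4}, "bd"), 3: ({1, 2, 4}, "c")}
factors_c2 = {1: ({5}, "eh"), 2: ({6, 7}, "fg"), 3: ({7}, "efg"), 4: ({8}, "gh")}

# ASSUMPTION: the six connections (index in C1, index in C2).
connections = [(1, 3), (2, 2), (2, 3), (3, 2), (3, 3), (3, 4)]


def entries(table):
    """All (object, attribute) entries equal to 1."""
    return {(o, a) for o, row in table.items() for a in row}


def rectangle(factor):
    """Entries covered by one factor."""
    objects, attributes = factor
    return {(o, a) for o in objects for a in attributes}


def gain(connection, uncovered_1, uncovered_2):
    """Number of still uncovered entries that a connection would cover."""
    i, j = connection
    return (len(rectangle(factors_c1[i]) & uncovered_1)
            + len(rectangle(factors_c2[j]) & uncovered_2))


def greedy_cover(connections):
    uncovered_1, uncovered_2 = entries(table_c1), entries(table_c2)
    total = len(uncovered_1) + len(uncovered_2)
    chosen, coverage = [], []
    while True:
        best = max(connections, key=lambda c: gain(c, uncovered_1, uncovered_2))
        if gain(best, uncovered_1, uncovered_2) == 0:
            break  # no connection covers anything new
        uncovered_1 -= rectangle(factors_c1[best[0]])
        uncovered_2 -= rectangle(factors_c2[best[1]])
        chosen.append(best)
        coverage.append((total - len(uncovered_1) - len(uncovered_2)) / total)
    return chosen, coverage


print(greedy_cover(connections))
```

Under these assumptions the sketch stops after three connections with cumulative coverage 0.5, 0.75 and 0.9. The second and third picks tie at five newly covered entries, so their order can differ from the order given in the remarks.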
