SAT-Based Data Mining Saïd Jabbour CRIL - CNRS UMR 8188 Université d’Artois, France GDR-IA - GT CAVIAR Orléans May 27, 2019
Outline
◮ Frequent Itemsets Mining
◮ Propositional Logic and SAT problem
◮ (Parallel) SAT-based Solvers for Enumerating all (C, M)FIM on (Uncertain) Transaction Databases
◮ Association Rules Mining
◮ Gradual Itemsets Mining
◮ Symmetry Breaking in Frequent Itemsets Mining
◮ FIM for CNF Formulas compression
Data Mining
◮ Discovering interesting knowledge from large amounts of data.
  ◮ Frequent itemsets
  ◮ Sequential patterns
  ◮ Association rules
  ◮ Emerging patterns
  ◮ ...
◮ Frequent itemset mining is an important part of data mining.
◮ A wide variety of applications: healthcare, business, education, disaster prevention, etc.
Frequent Itemset Mining
◮ A set of items: Ω = {a, b, c, ...}.
◮ An itemset I over Ω is a subset of Ω, i.e., I ⊆ Ω.
◮ A transaction is a couple (tid, I), where tid is the transaction identifier and I is an itemset, i.e., I ⊆ Ω.
◮ A transaction database D is a set of transactions.

TID   Transaction
T1    a b c d
T2    a b c e
T3    a e
T4    a d e
T5    a b
T6    b d
T7    b e

◮ A transaction (tid, I) supports an itemset J if J ⊆ I.
◮ The cover of an itemset I: Cover(I, D) = {tid | (tid, J) ∈ D, I ⊆ J}.
  ◮ Cover({a, b}, D) = {T1, T2, T5}
◮ The support of an itemset I in D: Supp(I, D) = |Cover(I, D)|.
  ◮ Supp({a, b}, D) = 3
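The definitions of cover and support can be checked directly on the seven-transaction example. The following is a minimal sketch; the names `D`, `cover` and `supp` are chosen here for illustration and do not come from the slides.

```python
# Example database from the slide: transaction id -> itemset.
D = {
    "T1": {"a", "b", "c", "d"},
    "T2": {"a", "b", "c", "e"},
    "T3": {"a", "e"},
    "T4": {"a", "d", "e"},
    "T5": {"a", "b"},
    "T6": {"b", "d"},
    "T7": {"b", "e"},
}

def cover(itemset, db):
    """Ids of the transactions that contain every item of `itemset`."""
    items = set(itemset)
    return {tid for tid, transaction in db.items() if items <= transaction}

def supp(itemset, db):
    """Support of `itemset`: the size of its cover."""
    return len(cover(itemset, db))

print(sorted(cover({"a", "b"}, D)))  # ['T1', 'T2', 'T5']
print(supp({"a", "b"}, D))           # 3
```

The empty itemset is supported by every transaction, so its support is |D|.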
Frequent Itemset Mining
◮ FIM(D, θ) = {I ⊆ Ω | Supp(I, D) ≥ θ}
◮ An itemset I is frequent if its support is greater than or equal to the minimum support threshold θ.
Frequent Itemset Mining
◮ CFIM(D, θ) = {I ∈ FIM(D, θ) | ∀ J ⊃ I, Supp(I, D) > Supp(J, D)}
◮ An itemset I is closed if I is frequent and no proper superset J ⊃ I has the same support as I.
Frequent Itemset Mining
◮ MFIM(D, θ) = {I ∈ FIM(D, θ) | ∀ J ⊃ I, Supp(J, D) < θ}
◮ An itemset I is maximal (a max-pattern) if I is frequent and no proper superset J ⊃ I is frequent.
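To make the three families concrete, here is a brute-force sketch on the seven-transaction example database. It enumerates every subset of Ω, so it only illustrates the definitions and is not a mining algorithm; the function names `fim`, `cfim` and `mfim` are assumptions made for this example.

```python
from itertools import combinations

# Seven-transaction example database (itemsets only).
D = [set("abcd"), set("abce"), set("ae"), set("ade"), set("ab"), set("bd"), set("be")]
OMEGA = sorted(set().union(*D))

def supp(itemset, db):
    return sum(1 for transaction in db if itemset <= transaction)

def fim(db, theta):
    """All itemsets (including the empty one) with support >= theta."""
    return [frozenset(c)
            for k in range(len(OMEGA) + 1)
            for c in combinations(OMEGA, k)
            if supp(frozenset(c), db) >= theta]

def cfim(db, theta):
    # Support is anti-monotone, so testing supersets with one extra item suffices.
    return [i for i in fim(db, theta)
            if all(supp(i | {a}, db) < supp(i, db) for a in OMEGA if a not in i)]

def mfim(db, theta):
    return [i for i in fim(db, theta)
            if all(supp(i | {a}, db) < theta for a in OMEGA if a not in i)]

theta = 2
print(len(fim(D, theta)), len(cfim(D, theta)), len(mfim(D, theta)))  # 14 11 5
```

With θ = 2 the itemset {c} is frequent but not closed, because {a, b, c} has the same support; the maximal itemsets are {a, b, c}, {a, d}, {a, e}, {b, d} and {b, e}.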
Frequent Itemset Mining: FIM Approaches
Specialized Approaches
◮ Apriori [Agrawal’93]
◮ FP-growth [Han’00]
◮ ECLAT [Zaki’00]
◮ LCM [Uno’04], ...
Declarative Approaches
◮ CP [De Raedt’08]
◮ SAT [Jabbour’13]
◮ ASP [Gebser’16]
◮ ...
Propositional Logic
Formal language of propositional formulas: Prop
Syntax
◮ Logical constants: ⊥, ⊤
◮ Propositional symbols: a, b, c, ... (atomic sentences)
◮ Wrapping parentheses: ( ... )
◮ Sentences are combined by connectives: ¬, ∧, ∨, →, ⇔.
If Φ1, Φ2 ∈ Prop, then the following formulas are in Prop: ¬Φ1, (Φ1 ∧ Φ2), (Φ1 ∨ Φ2), (Φ1 → Φ2), (Φ1 ⇔ Φ2)
Propositional Logic: SAT
Semantics: an interpretation B is a function from Prop to {0, 1} (0: false; 1: true), defined inductively as:
  B(⊥) = 0
  B(⊤) = 1
  B(¬F) = 1 − B(F)
  B(F ∧ G) = min(B(F), B(G))
  B(F ∨ G) = max(B(F), B(G))
◮ A model of Φ is an interpretation B satisfying Φ, i.e., B(Φ) = 1.
◮ A formula Φ is satisfiable if there exists a model of Φ.
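The inductive definition translates almost line by line into a recursive evaluator. The sketch below assumes a hypothetical encoding of formulas as nested tuples (`("not", F)`, `("and", F, G)`, `("or", F, G)`), with strings for propositional symbols and the integers 0 and 1 for ⊥ and ⊤; this representation is chosen for the example only.

```python
def evaluate(formula, b):
    """Value in {0, 1} of `formula` under the interpretation `b` (symbol -> 0/1)."""
    if formula in (0, 1):            # logical constants: bottom = 0, top = 1
        return formula
    if isinstance(formula, str):     # propositional symbol
        return b[formula]
    op = formula[0]
    if op == "not":
        return 1 - evaluate(formula[1], b)
    if op == "and":
        return min(evaluate(formula[1], b), evaluate(formula[2], b))
    if op == "or":
        return max(evaluate(formula[1], b), evaluate(formula[2], b))
    raise ValueError(f"unknown connective: {op}")

# (a or not b) and c
phi = ("and", ("or", "a", ("not", "b")), "c")
print(evaluate(phi, {"a": 0, "b": 0, "c": 1}))  # 1, so this interpretation is a model
print(evaluate(phi, {"a": 0, "b": 1, "c": 1}))  # 0
```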
Propositional Logic: SAT
SAT problem: decide whether a formula in CNF is satisfiable. [NP-Complete’71]
◮ CNF: conjunction of clauses $c_1 \land \dots \land c_n$
◮ Clause: disjunction of literals $(l_1 \lor \dots \lor l_k)$
◮ Literal: a variable or its negation
$\Phi = \underbrace{(a \lor b \lor c)}_{C_1} \land \underbrace{(\neg a \lor b)}_{C_2} \land \underbrace{(b \lor c)}_{C_3} \land \underbrace{(\neg c \lor a)}_{C_4}$
Various applications: Model Checking, Planning, Data Mining, etc. → easier formulation → efficient solving
SAT Problem
◮ Model enumeration problem
  ◮ Variant of the propositional satisfiability problem (SAT)
$\Phi = \underbrace{(a \lor b \lor c)}_{C_1} \land \underbrace{(\neg a \lor b)}_{C_2} \land \underbrace{(b \lor c)}_{C_3} \land \underbrace{(\neg c \lor a)}_{C_4}$
$\mathcal{M}(\Phi) = \{\, \{a = 1, b = 1, c = 1\},\ \{a = 1, b = 1, c = 0\},\ \{a = 0, b = 1, c = 0\} \,\}$
◮ Different application domains:
  ◮ Data mining
  ◮ Bounded model checking
  ◮ Knowledge compilation
  ◮ ...
◮ The model enumeration problem has received little attention compared to other SAT issues.
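The set of models of Φ can be verified by exhaustive enumeration over the three variables. This is a brute-force sketch for illustration, exponential in the number of variables; clauses are encoded here as lists of (variable, polarity) pairs, which is an assumption of this example.

```python
from itertools import product

# Phi = (a or b or c) and (not a or b) and (b or c) and (not c or a)
PHI = [
    [("a", 1), ("b", 1), ("c", 1)],
    [("a", 0), ("b", 1)],
    [("b", 1), ("c", 1)],
    [("c", 0), ("a", 1)],
]

def models(cnf, variables):
    """All interpretations over `variables` that satisfy every clause of `cnf`."""
    found = []
    for values in product([0, 1], repeat=len(variables)):
        b = dict(zip(variables, values))
        if all(any(b[v] == polarity for v, polarity in clause) for clause in cnf):
            found.append(b)
    return found

for m in models(PHI, ["a", "b", "c"]):
    print(m)
```

Clause C2 forces b whenever a is true and C4 forces a whenever c is true, which is why only three of the eight interpretations survive.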
Itemsets Mining
Ω: items (finite set of symbols)
I: itemset (subset of Ω)
T_i = (i, I_i): transaction, with i ∈ N the transaction identifier and I_i an itemset
D: transactional database (set of transactions)

id   a b c d e f g   transaction
1    0 0 1 1 1 1 1   c d e f g
2    0 0 1 1 1 1 1   c d e f g
3    1 1 1 1 0 0 0   a b c d
4    1 1 1 1 0 1 0   a b c d f
5    1 1 1 1 0 0 0   a b c d
6    0 0 1 0 1 0 0   c e
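Symbolic approach [ECML/PKDD’13]
Find {I ⊆ Ω | Supp(I, D) ≥ θ}, θ ∈ N
Cast frequent itemset extraction as the enumeration of the models of a CNF formula ((anti-)monotonicity). A variable $p_a$ states that item $a$ belongs to the itemset; a variable $q_i$ states that transaction $T_i$ belongs to its cover.

cover: $\Phi_{cov} = \bigwedge_{i=1}^{m} \Big( \neg q_i \leftrightarrow \bigvee_{a \in \Omega \setminus T_i} p_a \Big)$
frequency: $\Phi_{freq} = \sum_{i=1}^{m} q_i \geq \theta$
closedness: $\Phi_{clos} = \bigwedge_{a \in \Omega} \Big( p_a \lor \bigvee_{T_i \in D \mid a \notin T_i} q_i \Big)$

Instance for the six-transaction database over Ω = {a, ..., g}:

$\Phi_{cov}$:
$\neg q_1 \leftrightarrow p_a \lor p_b$
$\neg q_2 \leftrightarrow p_a \lor p_b$
$\neg q_3 \leftrightarrow p_e \lor p_f \lor p_g$
$\neg q_4 \leftrightarrow p_e \lor p_g$
$\neg q_5 \leftrightarrow p_e \lor p_f \lor p_g$
$\neg q_6 \leftrightarrow p_a \lor p_b \lor p_d \lor p_f \lor p_g$

$\Phi_{freq}$:
$q_1 + q_2 + q_3 + q_4 + q_5 + q_6 \geq \theta$

$\Phi_{clos}$:
$(q_1 \lor q_2 \lor q_6 \lor p_a) \land (q_1 \lor q_2 \lor q_6 \lor p_b) \land (p_c) \land (q_6 \lor p_d) \land (q_3 \lor q_4 \lor q_5 \lor p_e) \land (q_3 \lor q_5 \lor q_6 \lor p_f) \land (q_3 \lor q_4 \lor q_5 \lor q_6 \lor p_g)$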
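The correspondence between models and closed frequent itemsets can be checked by brute force on the six-transaction database. The sketch below evaluates the three constraints semantically over all assignments of the p and q variables instead of calling a SAT solver; the function names and the `closed` flag are assumptions made for this illustration.

```python
from itertools import product

D = [set("cdefg"), set("cdefg"), set("abcd"), set("abcdf"), set("abcd"), set("ce")]
OMEGA = "abcdefg"

def is_model(p, q, theta, closed=True):
    # Phi_cov: q_i holds iff no selected item lies outside T_i.
    cov = all(q[i] == (not any(p[a] for a in OMEGA if a not in t))
              for i, t in enumerate(D))
    # Phi_freq: at least theta transactions are in the cover.
    freq = sum(q) >= theta
    # Phi_clos: an unselected item must be missing from some covered transaction.
    clos = all(p[a] or any(q[i] for i, t in enumerate(D) if a not in t)
               for a in OMEGA)
    return cov and freq and (clos or not closed)

def enumerate_itemsets(theta, closed=True):
    """Itemsets read off the p variables of every model."""
    found = []
    for p_values in product([False, True], repeat=len(OMEGA)):
        p = dict(zip(OMEGA, p_values))
        for q in product([False, True], repeat=len(D)):
            if is_model(p, q, theta, closed):
                found.append("".join(a for a in OMEGA if p[a]))
    return found

print(enumerate_itemsets(theta=2))
```

Because the cover constraint fixes every q variable once the p variables are chosen, each itemset appears in exactly one model. With θ = 2 the closed frequent itemsets are c, cd, ce, cdf, abcd and cdefg; the empty itemset is excluded because item c occurs in every transaction.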
Symbolic approach
Declarativity: easy extension to mine particular patterns (add new constraints)

Constraints and their sizes:
$\Phi_{cov} = \bigwedge_{i=1}^{m} \Big( \neg q_i \leftrightarrow \bigvee_{a \in \Omega \setminus T_i} p_a \Big)$, size $\sum_{T \in D} (|\Omega| - |T| + 1) \approx |D| \times |\Omega|$
$\Phi_{freq} = \sum_{i=1}^{m} q_i \geq \theta$, size $O(m \log_2(min\_supp))$
$\Phi_{clos} = \bigwedge_{a \in \Omega} \Big( p_a \lor \bigvee_{T_i \in D \mid a \notin T_i} q_i \Big)$, size $|D| - |Supp(\{a\})|$ for each item $a$
$\Phi_{len} = \sum_{a \in \Omega} p_a \geq min\_length$

Instance    θ       #Tran    #Items   Type of Data                    #CFIM
Retail      10      88162    16470    market basket data              > 1·10^5
Kosarak     1000    990002   41267    Hungarian on-line news portal   ≃ 5·10^5
Accidents   40000   340183   468      traffic accidents               ≃ 6·10^6

◮ The number of closed frequent itemsets is often very large.
SAT-based Solvers for Enumerating all CFIM
[Figure: solver architecture. Components: Decisions (VSIDS), Restarts, Boolean Propagation, Conflict Analysis, Implication Graph, Conflict clause, Backjumping, Model analysis, Literal model, Backtrack.]
◮ For enumerating CFIM, a DPLL-based SAT solver is more efficient than a CDCL-based one.
DPLL-based procedure for CFIM [SGAI’16]
◮ DPLL-Enum+VSIDS: Variable State Independent Decaying Sum branching heuristic
◮ DPLL-Enum+JW: Jeroslow-Wang branching heuristic, based on the maximum number of occurrences of the variables
◮ DPLL-Enum+RAND: random variable selection
[Figure: running time in seconds (0 to 1000) as a function of the quorum (up to 300) for CDCL+Enum, DPLL-Enum+RAND, DPLL-Enum+VSIDS and DPLL-Enum+JW.]
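For readers unfamiliar with DPLL-style enumeration, the following is a minimal sketch of the general technique: unit propagation, branching on a chosen variable, and chronological backtracking over both polarities so that every model is reported. It is not the solver evaluated in the talk. Literals are DIMACS-style integers, and the `pick` parameter is a hypothetical hook standing in for a branching heuristic such as VSIDS, JW or random selection.

```python
def simplify(clauses, lit):
    """Make `lit` true: drop satisfied clauses, shorten the others; None on conflict."""
    out = []
    for clause in clauses:
        if lit in clause:
            continue
        reduced = [l for l in clause if l != -lit]
        if not reduced:
            return None  # empty clause: conflict
        out.append(reduced)
    return out

def dpll_enum(clauses, variables, assignment=(), pick=min):
    """Yield every total model as a tuple of literals ordered by variable."""
    # Unit propagation.
    while True:
        unit = next((c[0] for c in clauses if len(c) == 1), None)
        if unit is None:
            break
        clauses = simplify(clauses, unit)
        if clauses is None:
            return
        assignment += (unit,)
    assigned = {abs(l) for l in assignment}
    free = [v for v in variables if v not in assigned]
    if not free:
        yield tuple(sorted(assignment, key=abs))
        return
    var = pick(free)  # branching heuristic (hypothetical hook)
    for lit in (var, -var):
        reduced = simplify(clauses, lit)
        if reduced is not None:
            yield from dpll_enum(reduced, variables, assignment + (lit,), pick)

# Phi = (a or b or c) and (not a or b) and (b or c) and (not c or a), with a=1, b=2, c=3
PHI = [[1, 2, 3], [-1, 2], [2, 3], [-3, 1]]
print(list(dpll_enum(PHI, [1, 2, 3])))
```

Passing `pick=random.choice` would give a random selection in the spirit of the RAND variant; the search explores disjoint branches, so no model is reported twice and no blocking clauses are needed.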
Limitations

Constraints and their sizes:
$\Phi_{cov} = \bigwedge_{i=1}^{m} \Big( \neg q_i \leftrightarrow \bigvee_{a \in \Omega \setminus T_i} p_a \Big)$, size $\sum_{T \in D} (|\Omega| - |T| + 1) \approx |D| \times |\Omega|$
$\Phi_{freq} = \sum_{i=1}^{m} q_i \geq \theta$, size $O(m \log_2(min\_supp))$
$\Phi_{clos} = \bigwedge_{a \in \Omega} \Big( p_a \lor \bigvee_{T_i \in D \mid a \notin T_i} q_i \Big)$, size $|D| - |Supp(\{a\})|$ for each item $a$
$\Phi_{len} = \sum_{a \in \Omega} p_a \geq min\_length$

Instance    θ       #Tran    #Items   Type of Data                    #Clauses      #CFIM
Retail      10      88162    16470    market basket data              1451119564    > 1·10^5
Kosarak     1000    990002   41267    Hungarian on-line news portal   40846393519   ≃ 5·10^5
Accidents   40000   340183   468      traffic accidents               147704774     ≃ 6·10^6

◮ Scalability problem: the number of clauses of the SAT encodings is very large.
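The clause count of the cover constraint follows from its CNF translation: each equivalence yields one long clause plus one binary clause per item missing from the transaction, hence |Ω| − |T| + 1 clauses. The sketch below generates these clauses for the small six-transaction database used earlier; the string naming of literals (`q1`, `p_a`, a leading `-` for negation) is an assumption of this example.

```python
def cov_clauses(db, omega):
    """CNF of Phi_cov: for each T_i, (q_i or p_a1 or ...) and (not p_a or not q_i)."""
    clauses = []
    for i, transaction in enumerate(db, start=1):
        missing = [a for a in omega if a not in transaction]
        # not q_i -> some missing item is selected
        clauses.append([f"q{i}"] + [f"p_{a}" for a in missing])
        # a selected missing item -> not q_i
        clauses.extend([f"-p_{a}", f"-q{i}"] for a in missing)
    return clauses

D = [set("cdefg"), set("cdefg"), set("abcd"), set("abcdf"), set("abcd"), set("ce")]
OMEGA = "abcdefg"

n_clauses = len(cov_clauses(D, OMEGA))
expected = sum(len(OMEGA) - len(t) + 1 for t in D)
upper_bound = len(D) * len(OMEGA)
print(n_clauses, expected, upper_bound)
```

On sparse databases such as Retail, transactions are short compared with |Ω|, so the count stays close to the |D| × |Ω| bound, which is consistent with the clause numbers reported in the table.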