
SAT-Based Data Mining Saïd Jabbour CRIL - CNRS UMR 8188 Université d’Artois, France GDR-IA - GT CAVIAR Orléans May 27, 2019

Outline
◮ Frequent Itemsets Mining
◮ Propositional Logic and SAT problem
◮ (Parallel) SAT-based Solvers for Enumerating all (C, M)FIM on (Uncertain) Transaction Databases
◮ Association Rules Mining
◮ Gradual Itemsets Mining
◮ Symmetry Breaking in Frequent Itemsets Mining
◮ FIM for CNF Formulas compression

Data Mining
◮ Discovering interesting knowledge from large amounts of data.
  ◮ Frequent itemsets
  ◮ Sequential patterns
  ◮ Association rules
  ◮ Emerging patterns
  ◮ . . .
◮ Frequent itemset mining is an important part of data mining.
◮ A wide variety of applications: healthcare, business, education, disaster prevention, etc.

Frequent Itemset Mining
◮ A set of items: Ω = {a, b, c, . . .}.
◮ An itemset I over Ω is a subset of Ω, i.e., I ⊆ Ω.
◮ A transaction is a couple (tid, I), where tid is the transaction identifier and I is an itemset, i.e., I ⊆ Ω.
◮ A transaction database D is a set of transactions.
◮ A transaction (tid, I) supports an itemset J if J ⊆ I.
◮ The cover of an itemset I: Cover(I, D) = {tid | (tid, J) ∈ D, I ⊆ J}.
  ◮ Cover({a, b}, D) = {T1, T2, T5}
◮ The support of an itemset I in D: Supp(I, D) = |Cover(I, D)|.
  ◮ Supp({a, b}, D) = 3

TID   Transactions
T1    a b c d
T2    a b c e
T3    a e
T4    a d e
T5    a b
T6    b d
T7    b e
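The cover and support definitions can be checked directly on the seven-transaction example. This is a minimal sketch; the names `DB`, `cover` and `supp` are illustrative and not taken from the slides.

```python
# Example database from the slide: transaction identifier -> itemset.
DB = {
    "T1": {"a", "b", "c", "d"},
    "T2": {"a", "b", "c", "e"},
    "T3": {"a", "e"},
    "T4": {"a", "d", "e"},
    "T5": {"a", "b"},
    "T6": {"b", "d"},
    "T7": {"b", "e"},
}


def cover(itemset, db):
    """Identifiers of the transactions whose itemset contains `itemset`."""
    return {tid for tid, items in db.items() if set(itemset) <= items}


def supp(itemset, db):
    """Support = size of the cover."""
    return len(cover(itemset, db))


print(sorted(cover({"a", "b"}, DB)))  # ['T1', 'T2', 'T5']
print(supp({"a", "b"}, DB))           # 3
```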

Frequent Itemset Mining
◮ FIM(D, θ) = {I ⊆ Ω | Supp(I, D) ≥ θ}
◮ An itemset I is frequent if its support is greater than or equal to the minimum support threshold θ (minsup).

Frequent Itemset Mining
◮ CFIM(D, θ) = {I ∈ FIM(D, θ) | ∀ J ⊃ I, Supp(I, D) > Supp(J, D)}
◮ An itemset I is closed if I is frequent and no superset J ⊃ I has the same support as I.

Frequent Itemset Mining
◮ MFIM(D, θ) = {I ∈ FIM(D, θ) | ∀ J ⊃ I, Supp(J, D) < θ}
◮ An itemset I is maximal (a max-pattern) if I is frequent and has no frequent superset J ⊃ I.
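The frequent, closed and maximal definitions can be compared by brute force on the seven-transaction example database (T1 = abcd, T2 = abce, T3 = ae, T4 = ade, T5 = ab, T6 = bd, T7 = be). The sketch below is only an illustration, not one of the mining algorithms discussed in the talk: it enumerates every non-empty itemset (leaving out the empty itemset is an assumption made here for readability), and the function names are hypothetical. Because support can only decrease when an item is added, testing one-item extensions is enough to decide closedness and maximality.

```python
from itertools import combinations

ITEMS = "abcde"
DB = [set("abcd"), set("abce"), set("ae"), set("ade"),
      set("ab"), set("bd"), set("be")]


def supp(itemset):
    """Number of transactions containing every item of `itemset`."""
    return sum(1 for t in DB if itemset <= t)


def fim(theta):
    """All non-empty itemsets with support >= theta."""
    return {frozenset(c)
            for k in range(1, len(ITEMS) + 1)
            for c in combinations(ITEMS, k)
            if supp(frozenset(c)) >= theta}


def cfim(theta):
    """Frequent itemsets whose one-item extensions all lose support."""
    return {i for i in fim(theta)
            if all(supp(i | {a}) < supp(i) for a in ITEMS if a not in i)}


def mfim(theta):
    """Frequent itemsets whose one-item extensions are all infrequent."""
    return {i for i in fim(theta)
            if all(supp(i | {a}) < theta for a in ITEMS if a not in i)}


for name, result in (("FIM", fim(2)), ("CFIM", cfim(2)), ("MFIM", mfim(2))):
    print(name, sorted("".join(sorted(i)) for i in result))
```

With θ = 2, the itemsets c, ac and bc are frequent but not closed, since abc has the same support of 2.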

Frequent Itemset Mining
FIM Approaches

Specialized Approaches       Declarative Approaches
◮ Apriori [Agrawal’93]       ◮ CP [De Raedt’08]
◮ FP-growth [Han’00]         ◮ SAT [Jabbour’13]
◮ ECLAT [Zaki’00]            ◮ ASP [Gebser’16]
◮ LCM [Uno’04], . . .        ◮ . . .

Propositional Logic
Formal language of propositional formulas: Prop
Syntax
◮ Logical constants: ⊥, ⊤
◮ Propositional symbols: a, b, c, . . . (atomic sentences)
◮ Wrapping parentheses: ( . . . )
◮ Sentences are combined by connectives: ¬, ∧, ∨, →, ⇔.
If Φ1, Φ2 ∈ Prop, then the following formulas are in Prop: ¬Φ1, (Φ1 ∧ Φ2), (Φ1 ∨ Φ2), (Φ1 → Φ2), (Φ1 ⇔ Φ2).

Propositional Logic: SAT
Semantics: an interpretation B is a function from Prop to {0, 1} (0: false; 1: true), defined inductively as:

$$
B : \mathit{Prop} \to \{0, 1\}, \qquad
\begin{cases}
B(\bot) = 0 \\
B(\top) = 1 \\
B(F \wedge G) = \min(B(F), B(G)) \\
B(\neg F) = 1 - B(F) \\
B(F \vee G) = \max(B(F), B(G))
\end{cases}
$$

◮ A model of Φ is an interpretation B satisfying Φ, i.e., B(Φ) = 1.
◮ A formula Φ is satisfiable if there exists a model of Φ.
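The inductive definition of B translates almost line by line into a recursive evaluator. The sketch below assumes a formula representation as nested tuples (`("var", "a")`, `("not", F)`, `("and", F, G)`, `("or", F, G)`, `("bot",)`, `("top",)`); this representation and the name `evaluate` are illustrative choices, and only the connectives listed in the definition are covered.

```python
def evaluate(formula, valuation):
    """Return B(formula) in {0, 1}, where `valuation` maps symbols to 0 or 1."""
    op = formula[0]
    if op == "bot":
        return 0
    if op == "top":
        return 1
    if op == "var":
        return valuation[formula[1]]
    if op == "not":
        return 1 - evaluate(formula[1], valuation)
    if op == "and":
        return min(evaluate(formula[1], valuation),
                   evaluate(formula[2], valuation))
    if op == "or":
        return max(evaluate(formula[1], valuation),
                   evaluate(formula[2], valuation))
    raise ValueError(f"unknown connective: {op}")


# (a or b) and (not a)
phi = ("and", ("or", ("var", "a"), ("var", "b")), ("not", ("var", "a")))
print(evaluate(phi, {"a": 0, "b": 1}))  # 1: this interpretation is a model
print(evaluate(phi, {"a": 1, "b": 1}))  # 0
```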

Propositional Logic: SAT
SAT problem: decide whether a formula in CNF is satisfiable [NP-Complete’71].
CNF: conjunction of clauses c_1 ∧ . . . ∧ c_n
Clause: disjunction of literals (l_1 ∨ . . . ∨ l_k)
Literal: a variable or its negation

$$
\Phi = \overbrace{(a \vee b \vee c)}^{C_1} \wedge \overbrace{(\neg a \vee b)}^{C_2} \wedge \overbrace{(b \vee c)}^{C_3} \wedge \overbrace{(\neg c \vee a)}^{C_4}
$$

Various applications: Model Checking, Planning, Data Mining, etc.
→ easier formulation
→ efficient solving

SAT Problem
◮ Models enumeration problem
  ◮ Variant of the propositional satisfiability problem (SAT)

$$
\Phi = \overbrace{(a \vee b \vee c)}^{C_1} \wedge \overbrace{(\neg a \vee b)}^{C_2} \wedge \overbrace{(b \vee c)}^{C_3} \wedge \overbrace{(\neg c \vee a)}^{C_4}
$$

$$
M(\Phi) = \big\{ \{a = 1, b = 1, c = 1\},\ \{a = 1, b = 1, c = 0\},\ \{a = 0, b = 1, c = 0\} \big\}
$$

◮ Different application domains:
  ◮ Data mining
  ◮ Bounded model checking
  ◮ Knowledge compilation
  ◮ . . .
◮ The models enumeration problem has received little attention compared to other SAT issues.
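For a formula this small, the set of models can be confirmed by testing all 2^3 assignments. The sketch below is a naive check, not a solver; the clause representation (a list of `(variable, polarity)` pairs) and the names are assumptions made for illustration.

```python
from itertools import product

VARIABLES = ("a", "b", "c")

# Phi = (a or b or c) and (not a or b) and (b or c) and (not c or a).
# Each literal is (variable, polarity); polarity 0 means negated.
PHI = [
    [("a", 1), ("b", 1), ("c", 1)],
    [("a", 0), ("b", 1)],
    [("b", 1), ("c", 1)],
    [("c", 0), ("a", 1)],
]


def is_model(assignment, clauses):
    """A CNF is satisfied when every clause has at least one true literal."""
    return all(any(assignment[var] == pol for var, pol in clause)
               for clause in clauses)


def all_models(clauses):
    """Try every total assignment and keep the satisfying ones."""
    found = []
    for values in product((0, 1), repeat=len(VARIABLES)):
        assignment = dict(zip(VARIABLES, values))
        if is_model(assignment, clauses):
            found.append(assignment)
    return found


for model in all_models(PHI):
    print(model)
```

Every model sets b to 1: an assignment with b = 0 forces a = 0 through C2 and c = 1 through C3, which violates C4.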

Itemsets Mining
Ω: items (finite set of symbols)
I: itemset (subset of Ω)
T_i = (i, I_i): transaction, with i ∈ ℕ the transaction identifier and I_i an itemset
D: transactional database (set of transactions)

id   a b c d e f g   transactions
1    0 0 1 1 1 1 1   c d e f g
2    0 0 1 1 1 1 1   c d e f g
3    1 1 1 1 0 0 0   a b c d
4    1 1 1 1 0 1 0   a b c d f
5    1 1 1 1 0 0 0   a b c d
6    0 0 1 0 1 0 0   c e

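Symbolic approach [ECML/PKDD’13]
Find {I ⊆ Ω | Supp(I, D) ≥ θ}, θ ∈ ℕ
Cast frequent itemset extraction as the enumeration of the models of a CNF formula ((anti-)monotonicity). Variable p_a states that item a belongs to the itemset; variable q_i states that transaction T_i supports it.

$$
\underbrace{\bigwedge_{i=1}^{m} \Big( \neg q_i \leftrightarrow \bigvee_{a \in \Omega \setminus T_i} p_a \Big)}_{\text{cover: } \Phi_{cov}}
\qquad
\underbrace{\sum_{i=1}^{m} q_i \geq \theta}_{\text{frequency: } \Phi_{freq}}
\qquad
\underbrace{\bigwedge_{a \in \Omega} \Big( p_a \vee \bigvee_{T_i \in D \,\mid\, a \notin T_i} q_i \Big)}_{\text{closedness: } \Phi_{clos}}
$$

Example on the six-transaction database (T1 = T2 = cdefg, T3 = T5 = abcd, T4 = abcdf, T6 = ce):

cover (Φ_cov)                                   closedness (Φ_clos)
¬q1 ↔ (p_a ∨ p_b)                               (q1 ∨ q2 ∨ q6 ∨ p_a) ∧
¬q2 ↔ (p_a ∨ p_b)                               (q1 ∨ q2 ∨ q6 ∨ p_b) ∧
¬q3 ↔ (p_e ∨ p_f ∨ p_g)                         (p_c) ∧
¬q4 ↔ (p_e ∨ p_g)                               (q6 ∨ p_d) ∧
¬q5 ↔ (p_e ∨ p_f ∨ p_g)                         (q3 ∨ q4 ∨ q5 ∨ p_e) ∧
¬q6 ↔ (p_a ∨ p_b ∨ p_d ∨ p_f ∨ p_g)             (q3 ∨ q5 ∨ q6 ∨ p_f) ∧
                                                (q3 ∨ q4 ∨ q5 ∨ q6 ∨ p_g)

frequency (Φ_freq): q1 + q2 + q3 + q4 + q5 + q6 ≥ θ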
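The semantics of the three constraints can be sanity-checked without a SAT solver on the six-transaction database (T1 = T2 = cdefg, T3 = T5 = abcd, T4 = abcdf, T6 = ce). In the sketch below, which is an illustration and not the encoding pipeline of the cited work, each candidate itemset fixes the p variables, the cover constraint then determines every q_i, and the frequency and closedness constraints are evaluated directly. The result is compared with closed frequent itemsets computed from the definition. All function names are hypothetical.

```python
from itertools import combinations

ITEMS = "abcdefg"
DB = ["cdefg", "cdefg", "abcd", "abcdf", "abcd", "ce"]  # T1..T6


def q_values(itemset):
    """Cover constraint: q_i is false iff some chosen item lies outside T_i."""
    return [not any(a not in t for a in itemset) for t in DB]


def satisfies_encoding(itemset, theta):
    """Evaluate the frequency and closedness constraints for one itemset."""
    q = q_values(itemset)
    frequent = sum(q) >= theta
    # Closedness: every item left out must be missing from a supporting transaction.
    closed = all(a in itemset or
                 any(q[i] for i, t in enumerate(DB) if a not in t)
                 for a in ITEMS)
    return frequent and closed


def encoding_models(theta):
    """Itemsets (p assignments) that satisfy all three constraints."""
    return {frozenset(c)
            for k in range(len(ITEMS) + 1)
            for c in combinations(ITEMS, k)
            if satisfies_encoding(c, theta)}


def supp(itemset):
    return sum(1 for t in DB if set(itemset) <= set(t))


def closed_frequent(theta):
    """Closed frequent itemsets straight from the definition."""
    result = set()
    for k in range(len(ITEMS) + 1):
        for c in combinations(ITEMS, k):
            i = frozenset(c)
            s = supp(i)
            if s >= theta and all(supp(i | {a}) < s for a in ITEMS if a not in i):
                result.add(i)
    return result


print(sorted("".join(sorted(i)) for i in encoding_models(2)))
print(encoding_models(2) == closed_frequent(2))
```

The unit clause (p_c) appears because item c occurs in every transaction, so no closed itemset can omit it.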

Symbolic approach
Declarativity: easy extension to mine particular patterns (add new constraints)

$$
\begin{aligned}
\Phi_{cov} &= \bigwedge_{i=1}^{m} \Big( \neg q_i \leftrightarrow \bigvee_{a \in \Omega \setminus T_i} p_a \Big)
  && \text{size: } \sum_{T \in D} (|\Omega| - |T| + 1) \approx |D| \times |\Omega| \\
\Phi_{freq} &= \sum_{i=1}^{m} q_i \geq \theta
  && \text{size: } O(m \log_2(\mathit{min\_supp})) \\
\Phi_{clos} &= \bigwedge_{a \in \Omega} \Big( p_a \vee \bigvee_{T_i \in D \,\mid\, a \notin T_i} q_i \Big)
  && \text{size: } |D| - \mathit{Supp}(\{a\}, D) \text{ per item } a \\
\Phi_{len} &= \sum_{a \in \Omega} p_a \geq \mathit{min\_length}
\end{aligned}
$$

Instance    θ       #Tran    #Items   Type of Data                    #CFIM
Retail      10      88162    16470    market basket data              > 1·10^5
Kosarak     1000    990002   41267    Hungarian on-line news portal   ≃ 5·10^5
Accidents   40000   340183   468      traffic accidents               ≃ 6·10^6

◮ The number of closed frequent itemsets is often significant.

SAT-based Solvers for Enumerating all CFIM
[Diagram: SAT solver loop with the components Decisions (VSIDS), Restarts, Boolean Propagation, Implication Graph, Conflict Analysis, Conflict clause, Backjumping, Model analysis, Literal model and Backtrack.]
◮ A DPLL SAT-based solver is more efficient for enumerating CFIM.

DPLL-based procedure for CFIM [SGAI’16]
◮ DPLL-Enum+VSIDS: Variable State Independent Decaying Sum branching heuristic
◮ DPLL-Enum+JW (Jeroslow-Wang): branching heuristic based on the maximum number of occurrences of the variables
◮ DPLL-Enum+RAND: random variable selection
[Plot: time (seconds, 0 to 1000) against quorum (50 to 300) for CDCL+Enum, DPLL-Enum+RAND, DPLL-Enum+VSIDS and DPLL-Enum+JW.]
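A DPLL-style enumeration can be sketched in a few lines: propagate unit clauses, branch on a free variable, and report a model whenever every variable is assigned. The code below is a simplified illustration and not the solver evaluated in the talk; in particular, it branches on the first free variable in a fixed order rather than using VSIDS, JW or random selection, and it has no clause learning or restarts. Clauses use the DIMACS convention of signed integers (here a = 1, b = 2, c = 3).

```python
def unit_propagate(clauses, assignment):
    """Extend `assignment` with forced literals; return None on a conflict."""
    assignment = dict(assignment)
    changed = True
    while changed:
        changed = False
        for clause in clauses:
            unassigned, satisfied = [], False
            for lit in clause:
                var, wanted = abs(lit), lit > 0
                if var not in assignment:
                    unassigned.append(lit)
                elif assignment[var] == wanted:
                    satisfied = True
                    break
            if satisfied:
                continue
            if not unassigned:
                return None  # every literal is false: conflict
            if len(unassigned) == 1:
                lit = unassigned[0]
                assignment[abs(lit)] = lit > 0  # unit clause forces the literal
                changed = True
    return assignment


def enumerate_models(clauses, variables, assignment=None):
    """Yield every total assignment over `variables` that satisfies `clauses`."""
    assignment = unit_propagate(clauses, assignment or {})
    if assignment is None:
        return  # conflict: backtrack
    free = [v for v in variables if v not in assignment]
    if not free:
        yield assignment  # no conflict and nothing left to assign: a model
        return
    var = free[0]  # static branching order (assumption, not VSIDS/JW)
    for value in (True, False):
        yield from enumerate_models(clauses, variables, {**assignment, var: value})


# (a or b or c) and (not a or b) and (b or c) and (not c or a)
PHI = [[1, 2, 3], [-1, 2], [2, 3], [-3, 1]]
for model in enumerate_models(PHI, [1, 2, 3]):
    print({name: int(model[v]) for name, v in (("a", 1), ("b", 2), ("c", 3))})
```

Because the two branches on a variable are disjoint, each model is produced exactly once and no blocking clauses are needed.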

Limitations

$$
\begin{aligned}
\Phi_{cov} &= \bigwedge_{i=1}^{m} \Big( \neg q_i \leftrightarrow \bigvee_{a \in \Omega \setminus T_i} p_a \Big)
  && \text{size: } \sum_{T \in D} (|\Omega| - |T| + 1) \approx |D| \times |\Omega| \\
\Phi_{freq} &= \sum_{i=1}^{m} q_i \geq \theta
  && \text{size: } O(m \log_2(\mathit{min\_supp})) \\
\Phi_{clos} &= \bigwedge_{a \in \Omega} \Big( p_a \vee \bigvee_{T_i \in D \,\mid\, a \notin T_i} q_i \Big)
  && \text{size: } |D| - \mathit{Supp}(\{a\}, D) \text{ per item } a \\
\Phi_{len} &= \sum_{a \in \Omega} p_a \geq \mathit{min\_length}
\end{aligned}
$$

Instance    θ       #Tran    #Items   Type of Data                    #Clauses      #CFIM
Retail      10      88162    16470    market basket data              1451119564    > 1·10^5
Kosarak     1000    990002   41267    Hungarian on-line news portal   40846393519   ≃ 5·10^5
Accidents   40000   340183   468      traffic accidents               147704774     ≃ 6·10^6

◮ Scalability problem: the number of clauses of the SAT encodings is very large.
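The reported clause counts can be set against the |D| × |Ω| approximation given for the cover constraint. The short calculation below uses only the figures from the table; the dictionary layout and the function name are illustrative.

```python
# (transactions, items, reported number of clauses) from the table above.
DATASETS = {
    "Retail": (88162, 16470, 1451119564),
    "Kosarak": (990002, 41267, 40846393519),
    "Accidents": (340183, 468, 147704774),
}


def cover_size_estimate(n_transactions, n_items):
    """Upper-level approximation |D| x |Omega| of the cover encoding size."""
    return n_transactions * n_items


for name, (n_tran, n_items, reported) in DATASETS.items():
    estimate = cover_size_estimate(n_tran, n_items)
    ratio = estimate / reported
    print(f"{name}: estimate {estimate:,}, reported {reported:,}, ratio {ratio:.3f}")
```

For these three instances the product stays within about 8% of the reported count, so the size of the encoding tracks |D| × |Ω| closely.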
