Analyzing manuscript traditions using constraint-based data mining
A case study in declarative data mining
Tara Andrews, Hendrik Blockeel, Bart Bogaerts, Maurice Bruynooghe, Marc Denecker, Stef De Pooter, Caroline Macé, Jan Ramon
KU Leuven, Department of Computer Science; KU Leuven, Faculty of Arts

Overview
- Principles of “declarative data mining”
- IDP: a modeling language based on first-order logic
- Using IDP for data analysis in stemmatology
Declarative data mining
Data mining
- Current state of the art in data mining: a large variety of tasks, methods, and systems
- Data analysis is limited to mapping your problem onto one of the predefined tasks, then running an existing system
- Not much flexibility
[Figure: the available methods shown as separate points: ANN, DT, association rules, PCA, k-means, SVM]
Constraint-based data mining
- More flexibility: define the task more precisely by imposing constraints on the solutions you want to find
- E.g.: “find all association rules that have Cheese & Beer in the IF part”; “find the clustering with minimal SSE that has a & b in the same cluster, and c & d in another cluster” (must-link/cannot-link constraints), ...
- The basic task structure remains the same
[Figure: the same separate methods: ANN, DT, association rules, PCA, k-means, SVM]
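To make the must-link/cannot-link example concrete, here is a minimal sketch in Python. It is not the approach of any particular constrained-clustering system: it simply enumerates every labeling of a tiny 1-D dataset and keeps the one with minimal SSE among those that satisfy the constraints. The function names and the data are hypothetical.

```python
from itertools import product

def sse(points, labels, k):
    """Sum of squared distances of 1-D points to their cluster mean."""
    total = 0.0
    for c in range(k):
        members = [p for p, l in zip(points, labels) if l == c]
        if members:
            mean = sum(members) / len(members)
            total += sum((p - mean) ** 2 for p in members)
    return total

def constrained_clustering(points, k, must_link=(), cannot_link=()):
    """Exhaustive search over all labelings (only feasible for tiny inputs).

    must_link / cannot_link are pairs of point indices. Returns a tuple
    (sse, labels) for the best admissible labeling, or None if none exists.
    """
    best = None
    for labels in product(range(k), repeat=len(points)):
        if any(labels[i] != labels[j] for i, j in must_link):
            continue  # violates a must-link constraint
        if any(labels[i] == labels[j] for i, j in cannot_link):
            continue  # violates a cannot-link constraint
        score = sse(points, labels, k)
        if best is None or score < best[0]:
            best = (score, labels)
    return best

# Hypothetical data: two obvious groups, {1, 2} and {9, 10}.
points = [1.0, 2.0, 9.0, 10.0]
unconstrained = constrained_clustering(points, 2)
# Force points 0 and 3 together and points 0 and 1 apart.
constrained = constrained_clustering(
    points, 2, must_link=[(0, 3)], cannot_link=[(0, 1)]
)
```

The constraints change which solutions are admissible, but the task (minimize SSE over clusterings) stays the same, which is the point the slide makes.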
Inductive querying
- Fits in the “inductive databases” viewpoint (Imielinski & Mannila, 1996)
  - patterns are DB objects that can be stored, queried, manipulated
  - data mining = “querying for patterns”
- Most inductive query languages still focus on particular types of data mining approaches (e.g., the MINE RULE extension to SQL, Meo et al. 1998: association rule mining)
[Figure: ANN, DT, association rules, PCA, k-means, SVM, with the label “a unified approach”]
Is a more generic approach possible?
- A general-purpose modeling language for data mining?
  - one that allows modeling the task, the background knowledge, the inputs, constraints on the outputs, ...
- In (numerical) ML, linear algebra & optimization play a similar role
- First steps towards this in DM: Nijssen & Guns (2010) rephrase itemset mining in a constraint programming framework and demonstrate the efficiency of the approach
- This work continues in that direction
[Figure: ANN, DT, association rules, PCA, k-means, SVM, with the label “DM task modeling system”]
IDP
IDP
- An environment for knowledge-based programming (Wittocx et al. 2008)
- Combines imperative and declarative elements
  - declarative objects: vocabularies, theories, structures
  - (predefined) procedures to
    - create and manipulate these objects
    - perform inference on them (model expansion, ...)
- Includes a state-of-the-art model generator (ref. ASP competition)

FO(.)^IDP
- FO(.) = family of extensions of first-order logic
- IDP supports FO(.)^IDP, an FO(.) language that supports
  - integer & real algebra
  - aggregates
  - inductive definitions
  - ...
Inductive definitions
- Inductive definition: model = “minimal” interpretation that fulfills the constraints (“minimal” = the set of true facts is as small as possible)
- Set of constraints: model = any interpretation that fulfills the constraints
- FO itself cannot express inductive definitions; it needs an extension for that

Inductive definition:
    { integer(0).
      integer(s(X)) <- integer(X). }

Set of constraints:
    integer(0).
    integer(s(X)) <= integer(X).
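The difference between the two readings can be illustrated on a finite example. The sketch below is plain Python, not IDP, and uses a hypothetical two-edge graph: it computes the least relation closed under the two reachability rules (the "definition" reading), and then checks that a strictly larger relation also satisfies the same rules when they are read as ordinary implications (the "constraint" reading).

```python
def least_fixpoint(edges):
    """Smallest relation closed under:
       reaches(x,y) <- edge(x,y).
       reaches(x,y) <- reaches(x,z) & reaches(z,y).
    """
    reaches = set(edges)
    changed = True
    while changed:
        changed = False
        for (x, z) in list(reaches):
            for (z2, y) in list(reaches):
                if z == z2 and (x, y) not in reaches:
                    reaches.add((x, y))
                    changed = True
    return reaches

def satisfies_rules(edges, reaches):
    """True if `reaches` satisfies both rules read as plain implications."""
    base = all(e in reaches for e in edges)
    step = all(
        (x, y) in reaches
        for (x, z) in reaches
        for (z2, y) in reaches
        if z == z2
    )
    return base and step

# Hypothetical graph: A -> B -> C.
nodes = {"A", "B", "C"}
edges = {("A", "B"), ("B", "C")}

defined = least_fixpoint(edges)                       # definition reading
everything = {(x, y) for x in nodes for y in nodes}   # one of many constraint models
```

Both `defined` and `everything` pass `satisfies_rules`, but only `defined` is the relation an inductive definition denotes; the constraint reading cannot rule out the larger one.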
Example: find shortest path
The theory is satisfied <=> edgeOnPath represents a path from ‘from’ to ‘to’.

    vocabulary sp_voc {
        type node
        from, to: node
        edge(node,node)
        edgeOnPath(node,node)
        reaches(node,node)
    }

    theory sp_theory: sp_voc {
        // subgraph
        ! x y : edgeOnPath(x,y) => edge(x,y).
        // begins in from, ends in to
        ~(? x : edgeOnPath(x,from)) & ~(? x : edgeOnPath(to,x)).
        // not branching
        !x: (?<2 y: edgeOnPath(y,x)) & (?<2 y: edgeOnPath(x,y)).
        {
            reaches(x,y) <- edgeOnPath(x,y).
            reaches(x,y) <- reaches(x,z) & reaches(z,y).
        }
        // must connect from & to
        reaches(from,to).
        // connected
        ! x y : edgeOnPath(x,y) => reaches(from,y).
    }
Example: find shortest path (input structure and objective)
The vocabulary sp_voc and the theory sp_theory stay as before; the input graph and the term to minimize are added.

    // input graph (= partial interpretation of sp_voc)
    structure sp_struct: sp_voc {
        node = {A..D}    // shorthand for A,B,C,D
        edge = {A,B; B,C; C,D; A,D}
        from = A
        to = D
    }

    // defines the length of the path
    term lengthOfPath: sp_voc {
        #{ x y : edgeOnPath(x,y) }
    }
Example: find shortest path (main procedure)
The vocabulary, theory, structure and term stay as before; the main procedure is added.

    // main procedure: finds the path with minimal length in the given graph;
    // prints it, if it exists
    procedure main() {
        sols = minimize(sp_theory,sp_struct,lengthOfPath)
        if sols then
            print(sols[1])
        else
            print("No models exist.\n")
        end
    }
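For readers without IDP at hand, the same model can be mimicked by brute force. The sketch below is a Python illustration, not how the IDP solver works: it enumerates candidate edge sets in order of increasing size and returns the first one that passes checks corresponding to the constraints of sp_theory. The function names are hypothetical; the graph is the one from sp_struct.

```python
from itertools import combinations

def is_path(selected, source, target):
    """Check a candidate edgeOnPath relation against the sp_theory constraints."""
    if any(y == source for _, y in selected):   # no edge enters `from`
        return False
    if any(x == target for x, _ in selected):   # no edge leaves `to`
        return False
    nodes = {n for edge in selected for n in edge}
    for n in nodes:                             # not branching
        if sum(1 for _, y in selected if y == n) >= 2:
            return False
        if sum(1 for x, _ in selected if x == n) >= 2:
            return False
    # reaches = transitive closure of the selected edges
    reaches = set(selected)
    changed = True
    while changed:
        changed = False
        for (x, z) in list(reaches):
            for (z2, y) in list(reaches):
                if z == z2 and (x, y) not in reaches:
                    reaches.add((x, y))
                    changed = True
    if (source, target) not in reaches:         # must connect from & to
        return False
    # connected: every selected edge ends in a node reachable from `from`
    return all((source, y) in reaches for _, y in selected)

def shortest_path(edges, source, target):
    """Smallest subset of `edges` forming a path, or None if there is none.

    Candidates are drawn from `edges`, so the subgraph constraint holds
    by construction.
    """
    ordered = sorted(edges)
    for size in range(len(ordered) + 1):
        for selected in combinations(ordered, size):
            if is_path(selected, source, target):
                return set(selected)
    return None

# The input graph of sp_struct.
edges = {("A", "B"), ("B", "C"), ("C", "D"), ("A", "D")}
result = shortest_path(edges, "A", "D")
```

Enumerating by increasing size plays the role of `minimize` over `lengthOfPath`; the search is exponential and only meant to show what the declarative model specifies.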
Example: find frequent itemsets

    vocabulary FrequentItemsetMiningVoc {
        type Transaction
        type Item
        Freq: int
        Includes(Transaction,Item)
        FrequentItemset(Item)    // FrequentItemset represents a set of items
    }

    theory FrequentItemsetMiningTh: FrequentItemsetMiningVoc {
        // informally: #{t: FrequentItemset ⊆ t} >= Freq.
        #{t: !i: FrequentItemset(i) => Includes(t,i) } >= Freq.
    }

    structure Input : FrequentItemsetMiningVoc {
        Freq = 7                            // threshold for frequent itemsets
        Transaction = { t1; ... ; tn }      // n transactions
        Item = { i1; ... ; im }             // m items
        Includes = { t1,i2; t1,i7; ... }    // items of transactions
    }
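The single constraint of the theory says: the number of transactions that include every item of the candidate set is at least Freq. A naive Python rendering of that condition, assuming a small hypothetical transaction database (the names and data below are invented for illustration), might look like this:

```python
from itertools import combinations

def frequent_itemsets(transactions, min_freq):
    """Return {itemset: cover size} for every non-empty itemset whose cover
    (the transactions containing all its items) has at least min_freq members.

    Note: the logical theory also admits the empty itemset; it is skipped here.
    """
    items = sorted(set().union(*transactions.values()))
    result = {}
    for size in range(1, len(items) + 1):
        for candidate in combinations(items, size):
            # cover = #{t : !i : FrequentItemset(i) => Includes(t,i)}
            cover = [t for t, content in transactions.items()
                     if set(candidate) <= content]
            if len(cover) >= min_freq:
                result[frozenset(candidate)] = len(cover)
    return result

# Hypothetical input, playing the role of the `Input` structure.
transactions = {
    "t1": {"beer", "cheese", "bread"},
    "t2": {"beer", "cheese"},
    "t3": {"beer", "bread"},
    "t4": {"cheese"},
}
frequent = frequent_itemsets(transactions, min_freq=2)
```

Each key of `frequent` corresponds to one model of the theory for the given input; a dedicated miner would prune the search instead of enumerating all subsets.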
IDP for stemmatology
Stemmatology (stemmatics)
- Subfield of philology concerned with studying relationships between surviving variants of an old text (for instance, in order to reconstruct a lost original)
- Monks copied manuscripts manually and made changes -> “evolution” of the story
- Stemma = “family tree” of a set of manuscripts
- Somewhat similar to phylogenetic trees in bioinformatics
  - but there are some differences...
  - solutions specific to stemmatology are needed
Stemma
- stemma = connected DAG with one root (“rooted DAG”)
[Figure: example stemma with nodes A to H; annotations mark a multifurcation and a contamination]

Stemma with witnesses
[Figure: the same stemma with some nodes marked as “witness”, including a non-leaf witness]
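The definition "connected DAG with one root" can be checked mechanically. The following Python sketch assumes a stemma is given as a node list plus (parent, child) edges; the edge list is made up for illustration and is not recovered from the figure on the slide.

```python
def is_stemma(nodes, edges):
    """True if (nodes, edges) is acyclic and has exactly one root.

    In an acyclic graph with a single in-degree-0 node, every node is
    reachable from that node, so the graph is a rooted (connected) DAG.
    """
    children = {n: [] for n in nodes}
    indegree = {n: 0 for n in nodes}
    for parent, child in edges:
        children[parent].append(child)
        indegree[child] += 1

    roots = [n for n in nodes if indegree[n] == 0]
    if len(roots) != 1:
        return False

    # Kahn's algorithm: the graph is acyclic iff every node gets removed.
    remaining = dict(indegree)
    stack = list(roots)
    removed = 0
    while stack:
        n = stack.pop()
        removed += 1
        for c in children[n]:
            remaining[c] -= 1
            if remaining[c] == 0:
                stack.append(c)
    return removed == len(nodes)

# Hypothetical stemma: B has three children (a multifurcation),
# H has two parents (a contamination).
nodes = ["A", "B", "C", "D", "E", "F", "G", "H"]
edges = [("A", "B"), ("A", "F"), ("B", "C"), ("B", "D"),
         ("B", "E"), ("F", "G"), ("E", "H"), ("G", "H")]
ok = is_stemma(nodes, edges)
```

Unlike a phylogenetic tree check, nothing here requires nodes to have a single parent, which is what allows contamination.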
The data
Given:
- A set of manuscripts, which differ in particular places
- Each manuscript is described by a fixed set of attributes; an attribute indicates, for a particular position, which variant occurs there

            P1         P2      P3
    text1   ... has    Fred    “no”, he said
    text2   ... had    he      he said no
    text3   ... has    he      “never”, he said
The “classical” task
- Classical task: given the data, hypothesize a stemma
  - a DAG indicating relationships between the documents
  - it may include nodes for “lost” documents, the existence of which is hypothesized
- But this is not the only task we can consider (nor the task our philologists were interested in)

Other tasks
- Here, for a number of datasets, a stemma is given together with the data
  - for synthetic data: the correct stemma
  - for real data: the current best guess
- Analyze the relationship between the stemmata & the data in order to learn something about the evolution of manuscript traditions
  - E.g., which types of copying errors are more/less commonly made, ...?
Task 1
Tara’s original question: “Is there an algorithm that solves the following problem: given a directed graph, with some nodes assigned to particular ‘groups’, is it possible to complete the groups such that each node occurs in at most one group, and each group is connected?”

DAG formulation
- In a DAG with some groups of nodes defined, complete the groups such that each group forms a rooted DAG itself (“is connected”)
[Figure: a given partially grouped DAG, and a solution]

How to solve?
- Several algorithms had been tried; all but one were found incorrect on at least one case
  - “I haven’t been able to find any case where my latest algorithm won’t work - but I can’t prove it’s correct either.” (370 lines of Perl code, excluding I/O etc.)
- So we tried a declarative approach
  - v1: model groups using an equivalence relation
  - v2: model groups using labels
  - v3: use the concept of “source” (we only discuss this one)

Terminology
- A source of a variant = a document where the variant first occurred (= its parents do not have that variant)
- The problem reduces to: “given a partially labeled DAG, can you complete the labeling such that each label has only one source?”
IDP formulation

    /* ---------- Knowledge base ------------------------- */
    vocabulary V {
        type Manuscript
        type Variant
        CopiedBy(Manuscript,Manuscript)
        VariantIn(Manuscript): Variant
    }

    vocabulary Vsrc {
        extern vocabulary V
        SourceOf(Variant): Manuscript
    }

    theory Tsrc : Vsrc {
        ! x : (x ~= SourceOf(VariantIn(x))) => ? y: CopiedBy(y,x) & VariantIn(y) = VariantIn(x).
    }
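The theory Tsrc can be read as: every manuscript that has no parent with the same variant must be the (unique) source of its variant. Since SourceOf is a function, a labeling is acceptable exactly when no variant has two such manuscripts. The Python sketch below is a brute-force illustration of that reading, not the IDP inference itself; the function names, the `parents` encoding of CopiedBy, and the example data are assumptions made for the sketch.

```python
from itertools import product

def sources_ok(parents, variant_in):
    """True if every variant has at most one source.

    parents[m] lists the manuscripts that m was copied from;
    variant_in[m] is the variant carried by manuscript m.
    """
    sources = {}
    for m, v in variant_in.items():
        has_same_parent = any(variant_in[p] == v for p in parents.get(m, ()))
        if not has_same_parent:       # m must be the source of v
            if v in sources:          # ...but v already has another source
                return False
            sources[v] = m
    return True

def complete_labeling(manuscripts, parents, partial, variants):
    """Try every completion of a partial labeling; return one that gives
    each variant a single source, or None if no completion does."""
    unlabeled = [m for m in manuscripts if m not in partial]
    for choice in product(variants, repeat=len(unlabeled)):
        labeling = dict(partial)
        labeling.update(zip(unlabeled, choice))
        if sources_ok(parents, labeling):
            return labeling
    return None

# Hypothetical stemma: A is copied by B and C, B is copied by D.
manuscripts = ["A", "B", "C", "D"]
parents = {"B": ["A"], "C": ["A"], "D": ["B"]}
variants = ["v1", "v2"]

# A is unlabeled (e.g. a lost manuscript); a completion exists.
solution = complete_labeling(
    manuscripts, parents, {"B": "v1", "C": "v2", "D": "v1"}, variants
)

# Here B and C would both have to be sources of v1: no completion exists.
impossible = complete_labeling(
    ["A", "B", "C"], {"B": ["A"], "C": ["A"]},
    {"A": "v2", "B": "v1", "C": "v1"}, variants
)
```

The search is exponential in the number of unlabeled manuscripts; it only serves to spell out what a model of Tsrc looks like for a given stemma.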