Stochastic models for semi-structured document mining
P. Gallinari, in collaboration with G. Wisniewski, L. Denoyer, F. Maes
LIP6, University Pierre & Marie Curie, France
Outline
- Context
- Generative tree models
- 3 problems: Classification, Clustering, Document mapping
- Experiments
- Conclusion and future work
XML Document Mining Challenge
2006-04-27 - LIPN - P. Gallinari
Context - Machine learning in the structured domain
- Model, classify, cluster structured data
  - Domains: chemistry, biology, XML, etc.
  - Models: discriminant (e.g. kernels), generative (e.g. tree densities)
- Predict structured outputs
  - Domains: natural language parsing, taxonomies, etc.
  - Models: relational learning, large-margin extensions
- Learn to associate structured representations (aka tree mapping)
  - Domains: databases, semi-structured data
Context - Machine learning in the structured domain
- Structure only vs. structure + content
- Central complexity issue
  - Representation space (#words, #tags, #relations)
  - Search space for structured outputs: idem
- Large corpora need simple and approximate methods
Context - XML semi-structured documents
[Figure: example tag tree - <article> with children <hdr> and <bdy>; <bdy> contains <fig> (with <fgc> and text) and <sec> (with <st>, <p>, and text).]
Outline
- Context
- Generative tree models
- 3 problems: Classification, Clustering, Document restructuring
- Experiments
- Conclusion and future work
XML Document Mining Challenge
Document model
A document d is a pair of a structure and a content: $d = (s^d, t^d)$.

$P(D = d \mid \Theta) = P(S = s^d, T = t^d \mid \Theta) = P(S = s^d \mid \Theta) \, P(T = t^d \mid S = s^d, \Theta)$

i.e. a structural probability times a content probability.
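The factorization above can be sketched in a few lines; the two probabilities below are hypothetical toy values, not figures from the talk.

```python
# P(d) = P(s) * P(t | s): structural probability times content probability.
# Toy values for illustration only.
p_structure = 0.2          # P(S = s^d): probability of the tag tree
p_content_given_s = 0.05   # P(T = t^d | S = s^d): probability of the content
p_document = p_structure * p_content_given_s
print(p_document)
```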
Document model: structure (belief networks)
[Figure: example document trees with nodes Document, Title, Body, Intro, Section, Paragraph; e.g. "This section contains two paragraphs", "The second section contains no paragraphs".]

Three structural models, from richest to simplest:

$P(s^d) = \prod_{i=1}^{|d|} P\big(s_i^d \mid \mathrm{label}(\mathrm{parent}(n_i)), \mathrm{label}(\mathrm{previous}(n_i))\big)$

$P(s^d) = \prod_{i=1}^{|d|} P\big(s_i^d \mid \mathrm{label}(\mathrm{parent}(n_i))\big)$

$P(s^d) = \prod_{i=1}^{|d|} P(s_i^d)$
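The parent-conditioned structural model is a product of per-node terms; a minimal sketch, with a hypothetical toy tree and conditional probability table:

```python
# P(s) = prod_i P(label(n_i) | label(parent(n_i))).
# Tree nodes are (label, parent_label) pairs; root has parent None.
# All numbers are hypothetical toy values.
tree = [
    ("Document", None),
    ("Intro", "Document"),
    ("Section", "Document"),
    ("Paragraph", "Section"),
]
label_given_parent = {
    ("Document", None): 1.0,
    ("Intro", "Document"): 0.3,
    ("Section", "Document"): 0.6,
    ("Paragraph", "Section"): 0.8,
}

def structure_probability(tree, table):
    """Product over nodes of P(label | parent label)."""
    p = 1.0
    for label, parent in tree:
        p *= table[(label, parent)]
    return p

print(structure_probability(tree, label_given_parent))  # ~0.144
```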
Document model: content
Content vector: $t^d = (t_1^d, \ldots, t_{|d|}^d)$.

First-order dependency:
$P(t^d \mid s^d, \theta) = \prod_{i=1}^{|d|} P(t_i^d \mid s_i^d, \theta)$

Use of a local generative model for each label:
$P(t_i^d \mid s_i^d, \theta) = P(t_i^d \mid \theta_{s_i^d})$
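Each label owning its own local content model can be sketched with per-label unigram (naive-Bayes-style) word models; the vocabulary and probabilities below are hypothetical.

```python
# One local generative model theta_label per tag label; here a unigram
# word model. Toy vocabulary and probabilities for illustration only.
import math

theta = {
    "Title":     {"xml": 0.5, "mining": 0.4, "the": 0.1},
    "Paragraph": {"xml": 0.2, "mining": 0.1, "the": 0.7},
}

def log_content_probability(nodes):
    """nodes: list of (label, [words]); returns log P(t | s, theta)."""
    return sum(math.log(theta[label][w])
               for label, words in nodes for w in words)

doc = [("Title", ["xml", "mining"]), ("Paragraph", ["the", "xml"])]
print(log_content_probability(doc))
```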
Final network
[Figure: toy tree - Document with children Intro, Section, Section; the sections contain Paragraph nodes; node texts T1-T6: T1 = "This document is an example of a tree-structured document", T2 = "This is the first section of the document", T3 = "The first paragraph", T4 = "The second paragraph", T5 = "The second section", T6 = "The third paragraph".]

$P(d) = P(\mathrm{Intro} \mid \mathrm{Document}) \, P(\mathrm{Section} \mid \mathrm{Document})^2 \, P(\mathrm{Paragraph} \mid \mathrm{Section})^3$
$\quad \times\; P(T_1 \mid \mathrm{Intro}) \, P(T_2 \mid \mathrm{Section}) \, P(T_3 \mid \mathrm{Paragraph}) \, P(T_4 \mid \mathrm{Paragraph}) \, P(T_5 \mid \mathrm{Section}) \, P(T_6 \mid \mathrm{Paragraph})$
Different learning techniques
Likelihood maximization:

$-\log L = -\sum_{d \in D_{TRAIN}} \log P(d \mid \Theta) = -\sum_{d \in D_{TRAIN}} \Big[ \log P(s^d \mid \Theta) + \sum_{i=1}^{|d|} \log P(t_i^d \mid s_i^d, \Theta) \Big]$

so $L = L_{structure} + L_{content}$.

Discriminant learning (logistic function, error minimization):

$P(c \mid x) = \dfrac{1}{1 + e^{-\sum_i \theta^c_{x_i, pa(x_i)} \, n_{x_i, pa(x_i)}}}$
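Maximum-likelihood training of the structural part reduces to relative-frequency counting of (parent, label) edges in the training trees; a minimal sketch with hypothetical toy training data:

```python
# ML estimate: P(label | parent) = count(parent, label) / count(parent).
# Edge list is hypothetical toy data.
from collections import Counter

train_edges = [
    ("Document", "Intro"), ("Document", "Section"), ("Document", "Section"),
    ("Section", "Paragraph"), ("Section", "Paragraph"), ("Section", "Title"),
]
pair_counts = Counter(train_edges)
parent_counts = Counter(parent for parent, _ in train_edges)

def p_label_given_parent(label, parent):
    return pair_counts[(parent, label)] / parent_counts[parent]

print(p_label_given_parent("Section", "Document"))   # 2 of 3 Document children
print(p_label_given_parent("Paragraph", "Section"))  # 2 of 3 Section children
```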
Fisher kernel
Fisher score: $U_X = \nabla_\theta \log P(X \mid \theta)$.
Hypothesis: the gradient of the log-likelihood is informative about how much a feature "participates" in the generation of an example.
Fisher kernel: $K(X, Y) = K(U_X, U_Y)$.
Use with the model

$U_d = \nabla_\Theta \Big[ \log P(s^d \mid \Theta) + \sum_i \log P(t_i^d \mid s_i^d, \Theta) \Big] = \nabla_\Theta \log P(s^d \mid \Theta) + \sum_{l \in \Lambda} \nabla_{\theta_l} \sum_{i : s_i^d = l} \log P(t_i^d \mid s_i^d, \Theta)$

The score vector thus decomposes into sub-vectors: one per label $l \in \Lambda$ (the gradient for the nodes labeled $l$, i.e. for the content model $\theta_l$) and one for the structure model.
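For one label's unigram content model, the Fisher score has a closed form: the derivative of the log-likelihood with respect to a word probability is the word count divided by that probability. A minimal sketch, with hypothetical toy parameters and a plain dot-product kernel over the scores:

```python
# Fisher score for a unigram model: d(log P)/d(theta_w) = n_w / theta_w,
# where n_w is the count of word w. Kernel = dot product of score vectors.
# Vocabulary and parameters are hypothetical toy values.
theta = {"xml": 0.5, "mining": 0.3, "tree": 0.2}

def fisher_score(word_counts):
    return {w: word_counts.get(w, 0) / theta[w] for w in theta}

def fisher_kernel(x_counts, y_counts):
    ux, uy = fisher_score(x_counts), fisher_score(y_counts)
    return sum(ux[w] * uy[w] for w in theta)

print(fisher_kernel({"xml": 2}, {"xml": 1, "tree": 1}))  # (2/0.5)*(1/0.5) = 8.0
```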
Remark
Fisher kernels yield a very large number of parameters. On INEX:
- flat models: 200,000 parameters
- structure models: 20 million parameters
Conclusion about this family of generative models
- A natural setting for modeling semi-structured multimedia documents
  - Structural probability (belief network)
  - Content probability (local generative model)
- Learning with maximum likelihood or cross-entropy
- Discriminant learning and Fisher kernel
Outline
- Context
- Generative tree models
- 3 problems: Classification, Clustering, Document restructuring
- Experiments
- Conclusion and future work
XML Document Mining Challenge
Classification
- One model for each category
- 3 XML corpora + 1 multimedia corpus
  - INEX: 12,000 articles from 18 IEEE journals
  - WebKB: 8,000 web pages (course, department, ...; 7 topics)
  - WIPO: XML patent documents, patent categories
  - NetProtect (European project): 100,000 web pages, pornographic or not
Categorization: generative models

Corpus   Model      F1 micro  F1 macro
INEX     NB         0.59      0.605
         Structure  0.619     0.622
WebKB    NB         0.801     0.706
         Structure  0.827     0.743
WIPO     NB         0.662     0.565
         Structure  0.677     0.604
Discriminant models

INEX                    F1 micro  F1 macro
NB                      0.59      0.605
Structure model         0.619     0.622
SVM TF-IDF              0.534     0.564
Fisher kernel           0.661     0.668
Discriminant learning   0.575     0.600

WebKB                   F1 micro  F1 macro
NB                      0.801     0.706
Structure model         0.827     0.743
SVM TF-IDF              0.737     0.651
Fisher kernel           0.823     0.738
Discriminant learning   0.868     0.792

WIPO                    F1 micro  F1 macro
NB                      0.662     0.565
Structure model         0.677     0.604
SVM TF-IDF              0.822     0.71
Fisher kernel           0.862     0.715
Multimedia model
[Figure: example multimedia document - a Reuters article, "Director Ang Lee Takes Risks with Mean Green 'Hulk'", mixing text passages and pictures.]

Model                                Macro-average recall  Micro-average recall
NB                                   89.9 [89.2; 90.4]     88.4 [87.7; 89]
Structure model with text            92.5 [91.9; 93]       92.9 [92.3; 93.3]
Structure model with pictures        83   [82.2; 83.7]     82.7 [81.9; 83.4]
Structure model, text and pictures   93.6 [93.1; 94]       94.7 [94.2; 95.1]
Classification: conclusion
- The structure model is able to handle both structure and content information
- Both structure and content carry class information
- Multimedia categorization
- Not in this talk: categorization of parts of documents; categorization of trees (structure only)
Outline
- Context
- Generative tree models
- 3 problems: Classification, Clustering, Document restructuring
- Experiments
- Conclusion and future work
XML Document Mining Challenge