
Stochastic models for semi-structured document mining - P. Gallinari



  1. Stochastic models for semi-structured document mining. P. Gallinari, in collaboration with G. Wisniewski, L. Denoyer and F. Maes. LIP6, University Pierre and Marie Curie, France.

  2. Outline: Context; Generative tree models; 3 problems (classification, clustering, document mapping); Experiments; Conclusion and future work; XML Document Mining Challenge. 2006-04-27 - LIPN - P. Gallinari

  3. Context: machine learning in the structured domain
     - Model, classify and cluster structured data
       - Domains: chemistry, biology, XML, etc.
       - Models: discriminant (e.g. kernels), generative (e.g. tree densities)
     - Predict structured outputs
       - Domains: natural language parsing, taxonomies, etc.
       - Models: relational learning, large-margin extensions
     - Learn to associate structured representations (a.k.a. tree mapping)
       - Domains: databases, semi-structured data

  4. Context: machine learning in the structured domain
     - Structure only vs. structure + content
     - Central complexity issue
       - Representation space (#words, #tags, #relations)
       - Search space for structured outputs (same issue)
     - Large corpora need simple, approximate methods

  5. Context: XML semi-structured documents. [Figure: example XML document tree with element nodes <article>, <hdr>, <bdy>, <fig>, <fgc>, <sec>, <st> and <p>; text appears under <fgc>, <st> and <p>.]

  6. Outline: Context; Generative tree models; 3 problems (classification, clustering, document restructuring); Experiments; Conclusion and future work; XML Document Mining Challenge.

  7. Document model
     - A document is a pair of structure and content: $d = (s_d, t_d)$
     - $P(D = d \mid \Theta) = P(S = s_d, T = t_d \mid \Theta) = P(S = s_d \mid \Theta) \, P(T = t_d \mid S = s_d, \Theta)$
     - The first factor is the structural probability, the second is the content probability

  8. Document model: structure
     - Belief networks
     - [Figure: an example document (a document title, a section containing two paragraphs, a second section containing no paragraphs) is mapped to a tree of labelled nodes (Document, Title, Body, Intro, Section, Paragraph), drawn as belief networks with different dependency structures.]
     - Structural probabilities shown under the networks, for a document with $|d|$ nodes $n_i$ and structure variables $s_d^i$:
       - independent nodes: $P(s_d) = \prod_{i=1}^{|d|} P(s_d^i)$
       - dependence on the parent label: $P(s_d) = \prod_{i=1}^{|d|} P(s_d^i \mid \mathrm{label}(\mathrm{parent}(n_i)))$
       - dependence on the parent and preceding-node labels: $P(s_d) = \prod_{i=1}^{|d|} P(s_d^i \mid \mathrm{label}(\mathrm{parent}(n_i)), \mathrm{label}(\mathrm{previous}(n_i)))$

  9. Document model: content
     - One content variable per node: $t_d = (t_d^1, \ldots, t_d^{|d|})$
     - First-order dependency: $P(t_d \mid s_d, \theta) = \prod_{i=1}^{|d|} P(t_d^i \mid s_d^i, \theta)$
     - A local generative model is used for each label: $P(t_d^i \mid s_d^i, \theta) = P(t_d^i \mid \theta_{s_d^i})$

  10. Final network
     - [Figure: belief network for an example document. The Document node has three children (Section, Intro, Section), the Section nodes have three Paragraph children in total, and each node emits a text T1 to T6. Edges carry the factors P(Intro | Document), P(Section | Document), P(Paragraph | Section) and P(Ti | label).]
     - Texts: T1 = "This document is an example of a tree-structured document" (Intro); T2 = "This is the first section of the document" (Section); T3 = "The first paragraph", T4 = "The second paragraph", T6 = "The third paragraph" (Paragraph); T5 = "The second section" (Section)
     - $P(d) = P(\mathrm{Intro} \mid \mathrm{Document}) \, P(\mathrm{Section} \mid \mathrm{Document})^2 \, P(\mathrm{Paragraph} \mid \mathrm{Section})^3 \times P(T_1 \mid \mathrm{Intro}) \, P(T_2 \mid \mathrm{Section}) \, P(T_3 \mid \mathrm{Paragraph}) \times P(T_4 \mid \mathrm{Paragraph}) \, P(T_5 \mid \mathrm{Section}) \, P(T_6 \mid \mathrm{Paragraph})$
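The product on the final-network slide can be checked numerically. The following is a minimal sketch only: every probability value is hypothetical and chosen for illustration, and the names `structure_prob`, `content_prob` and `document_log_prob` do not come from the talk. It sums logs rather than multiplying probabilities, which avoids underflow on larger trees.

```python
import math

# Hypothetical structural parameters P(label | parent label); illustrative values only.
structure_prob = {
    ("Intro", "Document"): 0.2,
    ("Section", "Document"): 0.5,
    ("Paragraph", "Section"): 0.6,
}

# Hypothetical content probabilities P(T_i | label of the node emitting T_i).
content_prob = {"T1": 0.01, "T2": 0.02, "T3": 0.03, "T4": 0.03, "T5": 0.02, "T6": 0.03}

# The example tree as (label, parent label, text id) triples.
nodes = [
    ("Intro", "Document", "T1"),
    ("Section", "Document", "T2"),
    ("Section", "Document", "T5"),
    ("Paragraph", "Section", "T3"),
    ("Paragraph", "Section", "T4"),
    ("Paragraph", "Section", "T6"),
]


def document_log_prob(nodes, structure_prob, content_prob):
    """Return log P(d): one structural and one content factor per node."""
    log_p = 0.0
    for label, parent, text_id in nodes:
        log_p += math.log(structure_prob[(label, parent)])  # structural factor
        log_p += math.log(content_prob[text_id])            # content factor
    return log_p


log_p_d = document_log_prob(nodes, structure_prob, content_prob)
```

Exponentiating `log_p_d` gives the same value as the explicit product, with P(Section | Document) squared and P(Paragraph | Section) cubed.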

  11. Different learning techniques
     - Likelihood maximization: $\log L = \sum_{d \in D_{TRAIN}} \log P(d \mid \Theta) = \sum_{d \in D_{TRAIN}} \log P(s_d \mid \Theta^s) + \sum_{d \in D_{TRAIN}} \sum_{i=1}^{|d|} \log P(t_d^i \mid s_d^i, \Theta^t) = L_{structure} + L_{content}$
     - Discriminant learning: the class posterior $P(c \mid x)$ is a logistic function of a score built from the generative model, i.e. from the class-conditional parameters $\theta^c_{x_i, pa(x_i)}$ of each node $x_i$ given its parent $pa(x_i)$; training is done by error minimization
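For the structural part, maximizing the likelihood of a categorical model P(label | parent label) reduces to relative-frequency counting. The sketch below assumes each training document is given as a list of (label, parent label) pairs; that representation and the function name `estimate_structure_params` are illustrative assumptions, and no smoothing is applied.

```python
from collections import Counter


def estimate_structure_params(documents):
    """Maximum-likelihood estimate of P(label | parent label) by counting.

    documents: list of documents, each a list of (label, parent_label) pairs.
    """
    pair_counts = Counter()    # occurrences of (label, parent) edges
    parent_counts = Counter()  # number of children observed under each parent label
    for doc in documents:
        for label, parent in doc:
            pair_counts[(label, parent)] += 1
            parent_counts[parent] += 1
    # Relative frequency of each child label among the children of a parent label.
    return {
        (label, parent): count / parent_counts[parent]
        for (label, parent), count in pair_counts.items()
    }


# Toy training set: a single document shaped like the final-network example.
train = [[
    ("Intro", "Document"),
    ("Section", "Document"),
    ("Section", "Document"),
    ("Paragraph", "Section"),
    ("Paragraph", "Section"),
    ("Paragraph", "Section"),
]]
params = estimate_structure_params(train)
```

On this toy set the estimates for children of `Document` sum to one, as expected for a conditional distribution.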

  12. Fisher kernel
     - Fisher score: $U_X = \nabla_\theta \log P(X \mid \theta)$
     - Hypothesis: the gradient of the log-likelihood is informative about how much a feature "participates" in the generation of an example
     - Fisher kernel: $K(X, Y) = K(U_X, U_Y)$
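As a small illustration of the Fisher score, consider a flat unigram model where log P(X | θ) is the sum over words of n_w log θ_w. Treating the θ_w as unconstrained parameters (an assumption made here for simplicity), the gradient component for word w is n_w / θ_w. The kernel below is a plain dot product between scores; the slide only states K(U_X, U_Y), so the choice of a linear kernel is also an assumption, as are the function names.

```python
def fisher_score(counts, theta):
    """Gradient of sum_w n_w * log(theta_w) with respect to each theta_w.

    counts: dict word -> count in the example; theta: dict word -> probability.
    Components are ordered by sorted vocabulary so that scores are comparable.
    """
    return [counts.get(word, 0) / theta[word] for word in sorted(theta)]


def linear_fisher_kernel(counts_x, counts_y, theta):
    """Dot product of the two Fisher scores (a linear kernel on the scores)."""
    u_x = fisher_score(counts_x, theta)
    u_y = fisher_score(counts_y, theta)
    return sum(a * b for a, b in zip(u_x, u_y))


# Hypothetical three-word model and two toy examples.
theta = {"a": 0.5, "b": 0.25, "c": 0.25}
x = {"a": 2, "b": 1}
y = {"a": 1, "c": 1}
k_xy = linear_fisher_kernel(x, y, theta)
```

The score has one component per model parameter, which is why its dimension grows quickly once a separate content model is kept for each label.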

  13. Use with the model
     - $U_d = \nabla_\Theta \Big( \log P(s_d \mid \Theta^s) + \sum_{l \in \Lambda} \sum_{i : s_d^i = l} \log P(t_d^i \mid s_d^i, \Theta^{t_l}) \Big) = \nabla_\Theta \log P(s_d \mid \Theta^s) + \nabla_\Theta \sum_{l \in \Lambda} \sum_{i : s_d^i = l} \log P(t_d^i \mid s_d^i, \Theta^{t_l})$
     - As a vector: $U_d = \Big( \nabla_{\Theta^s} \log P(s_d \mid \Theta^s), \; \nabla_{\Theta^{t_{l_1}}} \sum_{i : s_d^i = l_1} \log P(t_d^i \mid s_d^i, \Theta^{t_{l_1}}), \; \ldots, \; \nabla_{\Theta^{t_{l_{|\Lambda|}}}} \sum_{i : s_d^i = l_{|\Lambda|}} \log P(t_d^i \mid s_d^i, \Theta^{t_{l_{|\Lambda|}}}) \Big)$
     - The first sub-vector is the gradient with respect to the structure model; each following sub-vector is the gradient for the nodes of one label, from $l_1$ to $l_{|\Lambda|}$

  14. Remark
     - Fisher kernels: very large number of parameters
     - On INEX:
       - with flat models: 200 000 parameters
       - with structure models: 20 million parameters

  15. Conclusion about this family of generative models
     - Natural setting for modeling semi-structured multimedia documents
       - Structural probability (belief network)
       - Content probability (local generative model)
     - Learning with maximum likelihood or cross-entropy
     - Discriminant learning and Fisher kernel

  16. Outline: Context; Generative tree models; 3 problems (classification, clustering, document restructuring); Experiments; Conclusion and future work; XML Document Mining Challenge.

  17. Classification
     - One model for each category
     - 3 XML corpora + 1 multimedia corpus
       - INEX: 12 000 articles from IEEE; 18 journals
       - WebKB: Web pages (8K pages); 7 topics (course, department, …)
       - WIPO: XML documents of patents; categories of patents
       - NetProtect (European project): 100 000 web pages; pornographic or not
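With one generative model per category, a natural decision rule is to assign a document to the category whose model gives it the highest likelihood. The sketch below is an illustration under stated assumptions: it scores structure only (content factors are omitted for brevity), it uses uniform class priors, and the probability floor for unseen events, the class names and the parameter values are all hypothetical.

```python
import math


def log_likelihood(nodes, model, floor=1e-9):
    """Structure-only log-likelihood of a document under one class model.

    nodes: list of (label, parent_label) pairs.
    model: dict (label, parent_label) -> probability.
    floor: hypothetical probability given to pairs unseen in training.
    """
    return sum(math.log(model.get(pair, floor)) for pair in nodes)


def classify(nodes, class_models):
    """Return the category whose model maximizes the document log-likelihood."""
    return max(class_models, key=lambda c: log_likelihood(nodes, class_models[c]))


# Two hypothetical category models with illustrative parameters.
class_models = {
    "course": {("Section", "Document"): 0.8, ("Paragraph", "Section"): 0.9},
    "department": {("Section", "Document"): 0.3, ("Paragraph", "Section"): 0.2},
}
doc = [("Section", "Document"), ("Paragraph", "Section")]
predicted = classify(doc, class_models)
```

Adding a log prior per category, or the content terms of the document model, would only change what is summed inside `log_likelihood`.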

  18. Categorization: generative models

      Corpus  Model      F1 micro  F1 macro
      INEX    NB         0.59      0.605
      INEX    Structure  0.619     0.622
      WebKB   NB         0.801     0.706
      WebKB   Structure  0.827     0.743
      WIPO    NB         0.662     0.565
      WIPO    Structure  0.677     0.604

  19. Discriminant models

      Corpus  Model                  F1 micro  F1 macro
      INEX    NB                     0.59      0.605
      INEX    Structure model        0.619     0.622
      INEX    SVM TF-IDF             0.534     0.564
      INEX    Fisher kernel          0.661     0.668
      INEX    Discriminant learning  0.575     0.600
      WebKB   NB                     0.801     0.706
      WebKB   Structure model        0.827     0.743
      WebKB   SVM TF-IDF             0.737     0.651
      WebKB   Fisher kernel          0.823     0.738
      WebKB   Discriminant learning  0.868     0.792
      WIPO    NB                     0.662     0.565
      WIPO    Structure model        0.677     0.604
      WIPO    SVM TF-IDF             0.822     0.71
      WIPO    Fisher kernel          0.862     0.715
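The results above are reported as micro- and macro-averaged F1. As a reminder of how the two averages differ, here is a minimal sketch computing both from per-class counts of true positives, false positives and false negatives. The counts are made up for illustration and are unrelated to the corpora in the talk.

```python
def f1(tp, fp, fn):
    """F1 from counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_macro_f1(per_class):
    """per_class: dict class -> (tp, fp, fn). Returns (micro F1, macro F1)."""
    # Macro: average the per-class F1 scores, so every class weighs the same.
    macro = sum(f1(*counts) for counts in per_class.values()) / len(per_class)
    # Micro: pool the counts over classes first, so large classes dominate.
    tp = sum(c[0] for c in per_class.values())
    fp = sum(c[1] for c in per_class.values())
    fn = sum(c[2] for c in per_class.values())
    return f1(tp, fp, fn), macro


# Made-up counts: one frequent, well-classified class and one rare, hard class.
per_class = {"a": (8, 2, 2), "b": (1, 1, 3)}
micro, macro = micro_macro_f1(per_class)
```

When a rare class is poorly classified, as class `b` is here, the macro average drops more than the micro average.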

  20. Multimedia model
     - Example document: news article "Director Ang Lee Takes Risks with Mean Green 'Hulk'". LOS ANGELES (Reuters) - Taiwan-born director Ang Lee, perhaps best known for his Oscar-winning "Crouching Tiger, Hidden Dragon," is taking a big risk with the splashy summer popcorn flick …... FAMILY DRAMA, BIG ACTION: For loyal comic book fans who may think Lee's "Hulk" will be too touchy-feely, think again. "This is a drama, a family drama," said Lee, "but with big action." His slumping shoulders twitch and he laughs…..
     - Results (intervals in brackets):

      Model                                   Macroaverage recall  Microaverage recall
      NB                                      89.9 [89.2; 90.4]    88.4 [87.7; 89]
      Structure model with text               92.5 [91.9; 93]      92.9 [92.3; 93.3]
      Structure model with pictures           83 [82.2; 83.7]      82.7 [81.9; 83.4]
      Structure model with text and pictures  93.6 [93.1; 94]      94.7 [94.2; 95.1]

  21. Classification: conclusion
     - The structure model is able to handle both structure and content information
     - Both structure and content carry class information
     - Multimedia categorization
     - Not in this talk:
       - Categorization of parts of documents
       - Categorization of trees (structure only)

  22. Outline: Context; Generative tree models; 3 problems (classification, clustering, document restructuring); Experiments; Conclusion and future work; XML Document Mining Challenge.
