

  1. Neural Article Pair Modeling for Wikipedia Sub-article Matching. Muhao Chen 1, Changping Meng 2, Gang Huang 3, and Carlo Zaniolo 1. 1 University of California, Los Angeles; 2 Purdue University, West Lafayette; 3 Google, Mountain View

  2. Outline • Background • Modeling • Experimental Evaluation • Future Work

  3. Wikipedia: the source of knowledge for people and computing research • An essential source of knowledge for people: 45,567,563 encyclopedia articles and 34,248,801 users (as of 21 August 2018) • The basis of countless knowledge-driven technologies: knowledge bases, semantic analysis, semantic search, open-domain question answering, named entity recognition, etc.

  4. Article-as-concept Assumption: a 1-to-1 mapping between entities and Wikipedia articles. Wikipedia-based computing technologies that rely on this assumption: • Automated knowledge base construction • Semantic search of entities • Explicit and implicit semantic representations • Cross-lingual knowledge alignment • etc.

  5. Recent Editing Trends of Wikipedia • Splitting different aspects of an entity into multiple articles: this enhances human readability, but is problematic for Wikipedia-based technologies and applications. • A main-article summarizes an entity; a sub-article comprehensively describes an aspect or a subtopic of the main-article.

  6. Violation of Article-as-concept Causes Problems for Existing Technologies • Automated knowledge base construction: infoboxes and links are separated across multiple pages. • Cross-lingual knowledge alignment and Wikification: the one-to-one match no longer holds. • Semantic search: descriptions of entities are diffused. • Semantic representations: affected by all of the above. • … We need to piece the scattered Wikipedia articles back together.

  7. Problem Definition of Sub-article Matching • Input: a pair of Wikipedia pages (A i, A j), with their text contents, titles, and links • Target: identify whether A j is the sub-article of A i • Criteria of the sub-article relation: 1. A j describes an aspect or a subtopic of A i; 2. The text content of A j can be inserted as a section of A i without breaking the topic of A i. • The sub-article relation conforms to anti-symmetry.

  8. Our Approach • A deep neural document pair model that incorporates: 1. latent semantic features of articles and titles; 2. comprehensive explicit features that measure the symbolic and structural aspects of article pairs. The model obtains near-perfect performance on the contributed data. + A scalable solution that extracts high-quality main/sub-article matches from the entire English Wikipedia with a thousand-machine MapReduce. + A large contributed dataset of 196k English Wikipedia article pairs for this task.

  9. Overall Learning Architecture [Figure: for each article pair (A i, A j), document encoders E t and E c embed the title t and the text content c of each article; the embeddings, together with the explicit features F(A i, A j), are fed through MLPs to produce the output scores (s+, s-).] • Learning objective: minimize the binary cross-entropy loss.
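The binary cross-entropy objective named above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code; the function name and the example values are assumptions.

```python
import math

def binary_cross_entropy(scores, labels):
    """Mean binary cross-entropy over predicted match probabilities
    (each score in (0, 1)) against gold labels (1 = sub-article pair)."""
    total = 0.0
    for s, y in zip(scores, labels):
        total += -(y * math.log(s) + (1 - y) * math.log(1 - s))
    return total / len(scores)

# A confident correct prediction on each pair yields a small loss.
loss = binary_cross_entropy([0.9, 0.1], [1, 0])
```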

  10. Neural Document Encoders • Note: the document encoders read only the first paragraph of a Wikipedia article. • Three types of neural document encoders: 1. CNN + dynamic max-pooling; 2. GRU; 3. GRU + self-attention. • Word embedding layer: entity-annotated SkipGram. [Figure: encoders E t and E c applied to the title t i and the text content c i.]
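The self-attention component of the third encoder can be illustrated with a small pooling sketch: each word vector is weighted by the softmax of its score against a query vector, and the weighted vectors are summed into one document vector. This is a hand-rolled toy in plain Python, with hypothetical names; the paper's encoder operates on GRU hidden states rather than raw word vectors.

```python
import math

def attention_pool(vectors, query):
    """Weight each word vector by softmax(dot(vector, query)),
    then return the weighted sum as a single pooled vector."""
    scores = [sum(v_k * q_k for v_k, q_k in zip(v, query)) for v in vectors]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]
```

With a query strongly aligned to the first word vector, the pooled output is dominated by that vector.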

  11. Explicit Features Based on [Lin et al. 2017]: • r tto: token overlap ratio of titles • r st: maximum token overlap ratio of section titles • r mt: article template token overlap ratio • f TF: normalized term frequency of the A i title in the A i text content • r indeg: relative in-degree centrality • r outdeg: relative out-degree centrality • d MW: Milne-Witten index. Additional: • r dt: token overlap ratio of text contents • d te: average embedding distance of title tokens. Grouped as: 1. symbolic similarity measures (r tto, r st, r mt, f TF, r dt); 2. structural measures (r indeg, r outdeg, d MW); 3. semantic measure (d te).
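Two of these features can be sketched concretely: the title token overlap ratio and the Milne-Witten index, the latter computed from the in-link sets of the two articles (Milne & Witten, 2008). This is an illustrative reimplementation under stated assumptions (overlap normalized by the smaller title; unrelated pairs mapped to 1.0), not the exact formulation used in the paper.

```python
import math

def token_overlap_ratio(title_i, title_j):
    """r_tto sketch: shared title tokens over the smaller title's token set."""
    a, b = set(title_i.lower().split()), set(title_j.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def milne_witten(inlinks_a, inlinks_b, n_articles):
    """d_MW sketch: link-based distance of two articles from the sets
    of articles linking to each; 0 = identical link sets, larger = less related."""
    a, b = set(inlinks_a), set(inlinks_b)
    common = a & b
    if not common:
        return 1.0  # no shared in-links: treat as maximally distant
    num = math.log(max(len(a), len(b))) - math.log(len(common))
    den = math.log(n_articles) - math.log(min(len(a), len(b)))
    return num / den
```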

  12. WAP196k — A Large Corpus of Main- and Sub-article Pairs 1. Candidate sub-article selection: article titles that concatenate two Wikipedia entity names, either directly or with a preposition (e.g., German Army, Fictional Universe of Harry Potter). 2. Massive crowdsourcing: annotators decide whether the candidates from step 1 are sub-articles and, if so, find the corresponding main-articles; candidate article pairs (positive and some negative matches) are selected based on total agreement. 3. Negative case generation, with three rule patterns: (i) invert positive matches; (ii) pair two sub-articles of the same main-article; (iii) randomly corrupt the main-article of a positive match with an adjacent article. The corpus keeps a 1:10 ratio of positive to negative cases.
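The three negative-generation patterns can be sketched as below. This is a minimal illustration assuming each positive is an ordered (main_article, sub_article) pair; the function and argument names are hypothetical.

```python
import random

def generate_negatives(positives, main_to_subs, adjacent, seed=0):
    """Apply the three rule patterns to each positive (main, sub) pair:
    1. invert the positive match;
    2. pair two sub-articles of the same main-article;
    3. replace the main-article with an adjacent article."""
    rng = random.Random(seed)
    negatives = []
    for main, sub in positives:
        negatives.append((sub, main))  # pattern 1: inversion
        siblings = [s for s in main_to_subs.get(main, []) if s != sub]
        if siblings:
            negatives.append((sub, rng.choice(siblings)))  # pattern 2
        if adjacent.get(main):
            negatives.append((rng.choice(adjacent[main]), sub))  # pattern 3
    return negatives
```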

  13. Experimental Evaluation • Task 1: 10-fold cross validation • Metrics: Precision, Recall and F1 for identifying positive cases • Baselines and model variants 1. Statistical classification algorithms based on explicit features: Logistic Regression, NBC, LinearSVM, DecisionTree, Adaboost+DT, Random Forest, kNN. [Lin et al. 2017] 2. Neural document pair models with latent semantics only (CNN, GRU, AGRU) 3. Neural document pair models with latent semantics + Explicit feature (CNN+F, GRU+F, AGRU+F)

  14. 10-fold Cross Validation Results • Semantic features are more effective than explicit features • Incorporating both feature types reaches near-perfect performance

  15. Feature Ablation Analysis • Titles are the most important features (consistent with human cognition) • Topological measures are relatively less important

  16. Experimental Evaluation • Task 2: large-scale sub-article relation mining from the entire English Wikipedia • Model: CNN+F trained on the full WAP196k • Candidate space: 108 million ordered article pairs linked with at least one inline hyperlink • Workload: ~9 hours with a 3,000-machine MapReduce
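The large-scale extraction pipeline can be sketched as a map step that emits one ordered candidate pair per inline hyperlink and a reduce step that keeps model-scored matches. This is a single-machine toy of the MapReduce pattern, with hypothetical names and a hypothetical score threshold; the actual job runs on 3,000 machines.

```python
from collections import defaultdict

def map_candidates(article, inline_links):
    """Map step: emit one ordered (article, link_target) pair per hyperlink."""
    return [(article, target) for target in inline_links]

def reduce_matches(scored_pairs, threshold=0.5):
    """Reduce step: keep pairs the trained model scored as sub-article
    matches and group predicted sub-articles under each main-article."""
    main_to_subs = defaultdict(list)
    for (main, sub), score in scored_pairs:
        if score >= threshold:
            main_to_subs[main].append(sub)
    return dict(main_to_subs)
```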

  17. Extraction Results • ~85.7% Precision @200k • On average 4.9 sub-articles per main-article • Sub-article matching and Google Knowledge Graph

  18. Future Work • Document classification 1. Learning to differentiate main and sub-articles 2. Learning to differentiate sub-articles that describe refined entities and those that describe abstract sub-concepts • Extending the proposed model to populate the incomplete cross-lingual alignment

  19. References 1. Lin, Y., Yu, B., Hall, A., Hecht, B.: Problematizing and addressing the article-as-concept assumption in Wikipedia. In: CSCW. ACM (2017) 2. Chen, M., Tian, Y., et al.: Multilingual knowledge graph embeddings for cross-lingual knowledge alignment. In: IJCAI (2017) 3. Chen, M., Tian, Y., et al.: Co-training embeddings of knowledge graphs and entity descriptions for cross-lingual entity alignment. In: IJCAI (2018) 4. Chen, M., Tian, Y., et al.: On2vec: Embedding-based relation prediction for ontology population. In: SDM (2018) 5. Dhingra, B., Liu, H., et al.: Gated-attention readers for text comprehension. In: ACL (2017) 6. Kim, Y.: Convolutional neural networks for sentence classification. In: EMNLP (2014) 7. Jozefowicz, R., Zaremba, W., et al.: An empirical exploration of recurrent network architectures. In: ICML (2015) 8. Milne, D., Witten, I.H.: Learning to link with Wikipedia. In: CIKM (2008) 9. Strube, M., Ponzetto, S.P.: WikiRelate! Computing semantic relatedness using Wikipedia. In: AAAI (2006) 10. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: IJCAI (2007) 11. Chen, D., et al.: Reading Wikipedia to answer open-domain questions. In: ACL (2017)

  20. Thank You
