

  1. Neural Article Pair Modeling for Wikipedia Sub-article Matching. Muhao Chen 1, Changping Meng 2, Gang Huang 3, and Carlo Zaniolo 1. 1 University of California, Los Angeles; 2 Purdue University, West Lafayette; 3 Google, Mountain View

  2. Outline • Background • Modeling • Experimental Evaluation • Future Work

  3. Wikipedia: the source of knowledge for people and computing research • An essential source of knowledge for people: 45,567,563 encyclopedia articles and 34,248,801 users (as of 21 August 2018) • The basis of countless knowledge-driven technologies: knowledge bases, semantic analysis, semantic search, open-domain question answering, named entity recognition, etc.

  4. Article-as-concept Assumption: a 1-to-1 mapping between entities and Wikipedia articles. Wikipedia-based computing technologies that rely on this assumption: • Automated knowledge base construction • Semantic search of entities • Explicit and implicit semantic representations • Cross-lingual knowledge alignment • etc.

  5. Recent Editing Trends of Wikipedia • Splitting different aspects of an entity into multiple articles: this enhances human readability, but is problematic for Wikipedia-based technologies and applications. • A main-article summarizes an entity; a sub-article comprehensively describes an aspect or a subtopic of the main-article.

  6. Violation of Article-as-concept Causes Problems for Existing Technologies • Automated knowledge base construction: infoboxes and links are separated across multiple pages. • Cross-lingual knowledge alignment and Wikification: the one-to-one match no longer holds. • Semantic search: descriptions of entities are diffused. • Semantic representations: affected by all of the above. • … We need to piece the scattered Wikipedia articles back together.

  7. Problem Definition of Sub-article Matching • Input: a pair of Wikipedia pages (A i, A j), with their text contents, titles, and links • Target: identify whether A j is the sub-article of A i • Criteria of the sub-article relation: 1. A j describes an aspect or a subtopic of A i; 2. The text content of A j can be inserted as a section of A i without breaking the topic of A i. • The sub-article relation conforms to anti-symmetry.

  8. Our Approach • A deep neural document pair model that incorporates: 1. latent semantic features of articles and titles; 2. comprehensive explicit features that measure the symbolic and structural aspects of article pairs. The model obtains near-perfect performance on the contributed data. + A scalable solution that extracts high-quality main/sub-article matches from the entire English Wikipedia with a thousand-machine MapReduce. + A large contributed dataset of 196k English Wikipedia article pairs for this task.

  9. Overall Learning Architecture [Figure: for each article pair (A i, A j), document encoders E t and E c embed the title t and the text content c of each article; the embeddings, together with the explicit features F(A i, A j), are fed through MLPs to produce the output scores (s+, s-).] • Learning objective: minimize the binary cross-entropy loss.
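The binary cross-entropy objective named above can be sketched as follows. This is a minimal illustrative implementation, not the authors' code; the function name and the example values are assumptions.

```python
import math

def binary_cross_entropy(scores, labels):
    """Mean binary cross-entropy over predicted match probabilities
    (each score in (0, 1)) against gold labels (1 = sub-article pair)."""
    total = 0.0
    for s, y in zip(scores, labels):
        total += -(y * math.log(s) + (1 - y) * math.log(1 - s))
    return total / len(scores)

# A confident correct prediction on each pair yields a small loss.
loss = binary_cross_entropy([0.9, 0.1], [1, 0])
```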

  10. Neural Document Encoders • Note: the document encoders read only the first paragraph of a Wikipedia article. • Three types of neural document encoders: 1. CNN + dynamic max-pooling; 2. GRU; 3. GRU + self-attention. • Word embedding layer: entity-annotated SkipGram. [Figure: encoders E t and E c applied to the title t i and the text content c i.]
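The self-attention component of the third encoder can be illustrated with a small pooling sketch: each word vector is weighted by the softmax of its score against a query vector, and the weighted vectors are summed into one document vector. This is a hand-rolled toy in plain Python, with hypothetical names; the paper's encoder operates on GRU hidden states rather than raw word vectors.

```python
import math

def attention_pool(vectors, query):
    """Weight each word vector by softmax(dot(vector, query)),
    then return the weighted sum as a single pooled vector."""
    scores = [sum(v_k * q_k for v_k, q_k in zip(v, query)) for v in vectors]
    m = max(scores)                      # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]
```

With a query strongly aligned to the first word vector, the pooled output is dominated by that vector.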

  11. Explicit Features Based on [Lin et al. 2017]: • r tto: token overlap ratio of titles • r st: maximum token overlap ratio of section titles • r mt: article template token overlap ratio • f TF: normalized term frequency of the A i title in the A i text content • r indeg: relative in-degree centrality • r outdeg: relative out-degree centrality • d MW: Milne-Witten index. Additional: • r dt: token overlap ratio of text contents • d te: average embedding distance of title tokens. Grouped as: 1. symbolic similarity measures (r tto, r st, r mt, f TF, r dt); 2. structural measures (r indeg, r outdeg, d MW); 3. semantic measure (d te).
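Two of these features can be sketched concretely: the title token overlap ratio and the Milne-Witten index, the latter computed from the in-link sets of the two articles (Milne & Witten, 2008). This is an illustrative reimplementation under stated assumptions (overlap normalized by the smaller title; unrelated pairs mapped to 1.0), not the exact formulation used in the paper.

```python
import math

def token_overlap_ratio(title_i, title_j):
    """r_tto sketch: shared title tokens over the smaller title's token set."""
    a, b = set(title_i.lower().split()), set(title_j.lower().split())
    if not a or not b:
        return 0.0
    return len(a & b) / min(len(a), len(b))

def milne_witten(inlinks_a, inlinks_b, n_articles):
    """d_MW sketch: link-based distance of two articles from the sets
    of articles linking to each; 0 = identical link sets, larger = less related."""
    a, b = set(inlinks_a), set(inlinks_b)
    common = a & b
    if not common:
        return 1.0  # no shared in-links: treat as maximally distant
    num = math.log(max(len(a), len(b))) - math.log(len(common))
    den = math.log(n_articles) - math.log(min(len(a), len(b)))
    return num / den
```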

  12. WAP196k — A Large Corpus of Main- and Sub-article Pairs 1. Candidate sub-article selection: article titles that concatenate two Wikipedia entity names, either directly or with a preposition (e.g., German Army, Fictional Universe of Harry Potter). 2. Massive crowdsourcing: annotators decide whether the candidates from step 1 are sub-articles and, if so, find the corresponding main-articles; candidate article pairs (positive and some negative matches) are selected based on total agreement. 3. Negative case generation, with three rule patterns: (i) invert positive matches; (ii) pair two sub-articles of the same main-article; (iii) randomly corrupt the main-article of a positive match with an adjacent article. The corpus keeps a 1:10 ratio of positive to negative cases.
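The three negative-generation patterns can be sketched as below. This is a minimal illustration assuming each positive is an ordered (main_article, sub_article) pair; the function and argument names are hypothetical.

```python
import random

def generate_negatives(positives, main_to_subs, adjacent, seed=0):
    """Apply the three rule patterns to each positive (main, sub) pair:
    1. invert the positive match;
    2. pair two sub-articles of the same main-article;
    3. replace the main-article with an adjacent article."""
    rng = random.Random(seed)
    negatives = []
    for main, sub in positives:
        negatives.append((sub, main))  # pattern 1: inversion
        siblings = [s for s in main_to_subs.get(main, []) if s != sub]
        if siblings:
            negatives.append((sub, rng.choice(siblings)))  # pattern 2
        if adjacent.get(main):
            negatives.append((rng.choice(adjacent[main]), sub))  # pattern 3
    return negatives
```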

  13. Experimental Evaluation • Task 1: 10-fold cross validation • Metrics: Precision, Recall and F1 for identifying positive cases • Baselines and model variants 1. Statistical classification algorithms based on explicit features: Logistic Regression, NBC, LinearSVM, DecisionTree, Adaboost+DT, Random Forest, kNN. [Lin et al. 2017] 2. Neural document pair models with latent semantics only (CNN, GRU, AGRU) 3. Neural document pair models with latent semantics + Explicit feature (CNN+F, GRU+F, AGRU+F)

  14. 10-fold Cross Validation Results • Semantic features are more effective than explicit features • Incorporating both feature types reaches near-perfect performance

  15. Feature Ablation Analysis • Titles are the most important features (consistent with human cognition) • Topological measures are relatively less important

  16. Experimental Evaluation • Task 2: large-scale sub-article relation mining from the entire English Wikipedia • Model: CNN+F trained on the full WAP196k • Candidate space: 108 million ordered article pairs linked with at least one inline hyperlink • Workload: ~9 hours with a 3,000-machine MapReduce
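The large-scale extraction pipeline can be sketched as a map step that emits one ordered candidate pair per inline hyperlink and a reduce step that keeps model-scored matches. This is a single-machine toy of the MapReduce pattern, with hypothetical names and a hypothetical score threshold; the actual job runs on 3,000 machines.

```python
from collections import defaultdict

def map_candidates(article, inline_links):
    """Map step: emit one ordered (article, link_target) pair per hyperlink."""
    return [(article, target) for target in inline_links]

def reduce_matches(scored_pairs, threshold=0.5):
    """Reduce step: keep pairs the trained model scored as sub-article
    matches and group predicted sub-articles under each main-article."""
    main_to_subs = defaultdict(list)
    for (main, sub), score in scored_pairs:
        if score >= threshold:
            main_to_subs[main].append(sub)
    return dict(main_to_subs)
```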

  17. Extraction Results • ~85.7% Precision @200k • On average 4.9 sub-articles per main-article • Sub-article matching and Google Knowledge Graph

  18. Future Work • Document classification 1. Learning to differentiate main and sub-articles 2. Learning to differentiate sub-articles that describe refined entities and those that describe abstract sub-concepts • Extending the proposed model to populate the incomplete cross-lingual alignment

  19. References 1. Lin, Y., Yu, B., Hall, A., Hecht, B.: Problematizing and addressing the article-as-concept assumption in Wikipedia. In: CSCW. ACM (2017) 2. Chen, M., Tian, Y., et al.: Multilingual knowledge graph embeddings for cross-lingual knowledge alignment. In: IJCAI (2017) 3. Chen, M., Tian, Y., et al.: Co-training embeddings of knowledge graphs and entity descriptions for cross-lingual entity alignment. In: IJCAI (2018) 4. Chen, M., Tian, Y., et al.: On2vec: Embedding-based relation prediction for ontology population. In: SDM (2018) 5. Dhingra, B., Liu, H., et al.: Gated-attention readers for text comprehension. In: ACL (2017) 6. Kim, Y.: Convolutional neural networks for sentence classification. In: EMNLP (2014) 7. Jozefowicz, R., Zaremba, W., et al.: An empirical exploration of recurrent network architectures. In: ICML (2015) 8. Milne, D., Witten, I.H.: Learning to link with Wikipedia. In: CIKM (2008) 9. Strube, M., Ponzetto, S.P.: WikiRelate! Computing semantic relatedness using Wikipedia. In: AAAI (2006) 10. Gabrilovich, E., Markovitch, S.: Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In: IJCAI (2007) 11. Chen, D., et al.: Reading Wikipedia to answer open-domain questions. In: ACL (2017)

  20. Thank You
