

  1. Semantic Tagging Using Topic Models Exploiting Wikipedia Category Network Nitesh Prakash, Duncan Rule, Boh Young Suh

  2. Introduction Goal: tag web articles with most probable Wikipedia categories - What is the article “about” in terms of categories? - Helpful for information access and retrieval

  3. Model Overview (sOntoLDA) Modify LDA to suit the problem's needs - Pre-define topics as Wiki categories - Use prior knowledge to improve the topic-word distribution - Wikipedia articles labeled with categories - Represent prior knowledge with the prior matrix λ - LDA: φ ~ Dir(β); sOntoLDA: φ ~ Dir(β × λ)
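
  To make the contrast the slide draws concrete, here is a minimal numpy sketch, assuming K categories, a vocabulary of size V, a symmetric hyperparameter beta, and a word-category prior matrix lam standing in for the tf-idf matrix built on the next slide (all variable names are illustrative, not from the slides):

      import numpy as np

      rng = np.random.default_rng(0)
      K, V = 4, 10      # number of categories (topics) and vocabulary size
      beta = 0.01       # symmetric Dirichlet hyperparameter

      # Standard LDA: each topic-word distribution phi_k is drawn from Dir(beta),
      # the same flat prior for every topic.
      phi_lda = rng.dirichlet(np.full(V, beta), size=K)

      # sOntoLDA (as the slide describes it): the Dirichlet prior for category k
      # is scaled element-wise by that category's row of the prior matrix, so
      # words prominent in the category's Wikipedia articles get more prior mass.
      lam = rng.random((K, V)) + 1e-3   # placeholder for the tf-idf prior matrix
      phi_sonto = np.vstack([rng.dirichlet(beta * lam[k]) for k in range(K)])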

  4. Building Prior Matrix (λ) How do we represent prior word-topic knowledge? - Start with a tf-idf matrix - Each “document” is the set of Wiki articles tagged with a given category - Add subcategories down to a specific level ℓ
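
  A rough sketch of that construction using scikit-learn's TfidfVectorizer; the function name, the input dict, and the assumption that subcategory expansion down to level ℓ has already been done upstream are mine, not from the slides:

      from sklearn.feature_extraction.text import TfidfVectorizer

      def build_prior_matrix(category_to_articles):
          """Build the word-category prior matrix from tf-idf.

          category_to_articles maps a category name to the texts of the Wikipedia
          articles tagged with it (already expanded with the articles of its
          subcategories down to the chosen level).
          """
          categories = sorted(category_to_articles)
          # Each tf-idf "document" is the concatenation of all articles under one category.
          pseudo_docs = [" ".join(category_to_articles[c]) for c in categories]
          vectorizer = TfidfVectorizer()
          prior = vectorizer.fit_transform(pseudo_docs)  # rows: categories, columns: words
          return categories, vectorizer.get_feature_names_out(), prior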

  5. Inference using Gibbs Sampling ● Now that we have a generative LDA model and the λ priors, we need to reverse the process to infer the categories from the observed documents ● The denominator cannot be computed directly: it has C^n terms, where n is the number of words in the vocabulary ● Collapsed Gibbs Sampling, which uses a Markov Chain Monte Carlo, converges to a posterior distribution over categories c, conditioned on the observed words w and the hyperparameters α and β

  6. Inference using Gibbs Sampling ● Probability of a category given a document ● Probability of a word given a category
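
  The slides show these two quantities as formulas; written out in the standard collapsed Gibbs sampling form for an LDA-style model with C categories and vocabulary size V (the sOntoLDA-specific change would presumably replace the symmetric β with the λ-scaled prior from slide 3), they are roughly:

      p(c_i = c \mid \mathbf{c}_{-i}, \mathbf{w}) \;\propto\;
          \frac{n^{-i}_{d,c} + \alpha}{\sum_{c'} n^{-i}_{d,c'} + C\alpha} \cdot
          \frac{n^{-i}_{c,w_i} + \beta}{\sum_{v} n^{-i}_{c,v} + V\beta}

      \hat{\theta}_{d,c} = \frac{n_{d,c} + \alpha}{\sum_{c'} n_{d,c'} + C\alpha}
          \quad \text{(probability of a category given a document)}

      \hat{\phi}_{c,w} = \frac{n_{c,w} + \beta}{\sum_{v} n_{c,v} + V\beta}
          \quad \text{(probability of a word given a category)}

  Here n_{d,c} counts the words in document d assigned to category c, n_{c,w} counts the assignments of word w to category c, and the -i superscript excludes the word currently being resampled.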

  7. Health Tagging Example ● Structure of and relationships between Wikipedia categories as represented by SKOS properties ● Sub-categories and super-categories ● Consider super-categories in addition to exact match ● Categories assigned to an article on “tooth brushing” and the related category hierarchy [Figure: category hierarchy around “Tooth Brushing”, including Health, Health Care, Health Sciences, Self Care, Dentistry, Personal Hygiene, Hygiene Products, Dentistry Branches, Chiropractic Treatment Techniques, and Oral Hygiene, with assigned category probabilities (0.1533, 0.0478, 0.0403, 0.0302, 0.0227)]
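
  The slides do not spell out the matching rule, but one plausible reading of “consider super-categories in addition to exact match” is the check below, where super_categories (a hypothetical input) maps each category to its parent categories in the Wikipedia hierarchy:

      def hierarchical_match(predicted, gold_categories, super_categories):
          """Count a predicted category as correct if it, or one of its
          super-categories, appears among the gold categories."""
          if predicted in gold_categories:
              return True
          return any(parent in gold_categories
                     for parent in super_categories.get(predicted, ()))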

  8. Experiments 1. How well does the model predict the categories of a collection of Wikipedia articles? 2. Assign Wikipedia tags to Reuters news articles and compare top-k topics

  9. Preprocessing: Final Topic Graph ● 1,353 categories ● 30,300 articles ● Vocabulary size 99,665

  10. Evaluation metrics: Precision@k and Mean Average Precision (MAP)
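
  For reference, a small sketch of both metrics as they are usually defined; the function names and the choice to divide average precision by the number of relevant categories are mine, not from the slides:

      def precision_at_k(predicted, relevant, k):
          """Fraction of the top-k predicted categories that are relevant."""
          return sum(1 for c in predicted[:k] if c in relevant) / k

      def mean_average_precision(all_predicted, all_relevant):
          """Mean over documents of average precision, where average precision
          averages precision@i over every rank i holding a relevant category."""
          ap_scores = []
          for predicted, relevant in zip(all_predicted, all_relevant):
              hits, precisions = 0, []
              for i, c in enumerate(predicted, start=1):
                  if c in relevant:
                      hits += 1
                      precisions.append(hits / i)
              ap_scores.append(sum(precisions) / max(len(relevant), 1))
          return sum(ap_scores) / len(ap_scores)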

  11. Tagging Wikipedia articles: results

  12. Real-world document set - Evaluation on Reuters news (2,914 articles) - Applied the “Hierarchical match” method used for the Wikipedia dataset - Removed words not defined in the Prior Matrix (λ)

  13. Example of topic and word distribution

  14. Conclusions ● Prior knowledge from Wikipedia’s hierarchical ontology can be used successfully for semantic tagging of documents ● Future work - Expand to other topics - Explore richer topic models - Incorporate the hierarchical structure of categories

  15. Questions?
