Semantic Tagging Using Topic Models Exploiting Wikipedia Category Network Nitesh Prakash, Duncan Rule, Boh Young Suh
Introduction
Goal: tag web articles with their most probable Wikipedia categories
- What is the article "about" in terms of categories?
- Helpful for information access and retrieval
Model Overview (sOntoLDA)
Modify LDA to suit the problem's needs
- Pre-define topics as Wikipedia categories
- Use prior knowledge to improve the topic-word distribution
- Wikipedia articles are labeled with categories
- Represent the prior knowledge with the λ matrix

LDA: θ ~ Dir(α), φ ~ Dir(β)    sOntoLDA: θ ~ Dir(α), φ ~ Dir(β × λ)
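The change to the generative process can be illustrated with a toy sketch (all sizes, the random λ matrix, and the hyperparameter values here are hypothetical; in sOntoLDA, λ would come from the Wikipedia prior):

```python
import numpy as np

rng = np.random.default_rng(0)

n_categories, vocab_size, doc_len = 3, 5, 8
alpha, beta = 0.1, 0.01

# Hypothetical prior matrix lambda: rows = categories, cols = words
lam = rng.random((n_categories, vocab_size)) + 0.1

# LDA:      phi_c ~ Dir(beta)
# sOntoLDA: phi_c ~ Dir(beta * lambda_c) -- prior knowledge reshapes
# each category's word distribution
phi = np.array([rng.dirichlet(beta * lam[c]) for c in range(n_categories)])

# Per-document category mixture, as in standard LDA: theta_d ~ Dir(alpha)
theta = rng.dirichlet(alpha * np.ones(n_categories))

# Generate a document: draw a category per token, then a word from it
categories = rng.choice(n_categories, size=doc_len, p=theta)
words = [rng.choice(vocab_size, p=phi[c]) for c in categories]
```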
Building the Prior Matrix (λ)
How do we represent prior word-topic knowledge?
- Start with a tf-idf matrix
- Each "document" is the set of Wikipedia articles tagged with a given category
- Add subcategories down to a specific level ℓ
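A minimal sketch of this construction, using a hypothetical toy corpus in which each category's "document" concatenates the text of its tagged articles (subcategories down to level ℓ would already be merged in):

```python
import math
from collections import Counter

# Hypothetical toy data: one aggregate "document" per category
category_docs = {
    "Health":    "disease treatment hospital medicine patient medicine",
    "Dentistry": "tooth brushing enamel dentist medicine",
    "Finance":   "market stock bond interest rate",
}

docs = {c: text.split() for c, text in category_docs.items()}
vocab = sorted({w for toks in docs.values() for w in toks})
n_docs = len(docs)

# Document frequency of each word across the category documents
df = {w: sum(w in toks for toks in docs.values()) for w in vocab}

# lam[c][w] = tf * idf: the word-topic prior matrix (lambda)
lam = {}
for c, toks in docs.items():
    counts = Counter(toks)
    lam[c] = {w: (counts[w] / len(toks)) * math.log(n_docs / df[w])
              for w in vocab}
```

Category-specific words (e.g. "tooth" for Dentistry) get a high weight, while words shared across many categories are down-weighted by the idf term.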
Inference Using Gibbs Sampling
Now that we have the generative sOntoLDA model and the λ priors, we need to:
- Reverse the process to infer categories from the observed documents
- The denominator of the posterior cannot be computed directly: it has C^n terms, where C is the number of categories and n is the number of words in the vocabulary
- Use collapsed Gibbs sampling, a Markov chain Monte Carlo method that converges to the posterior distribution over categories c, conditioned on the observed words w and the hyperparameters α and β
Inference Using Gibbs Sampling
From the sample counts we estimate:
- The probability of a category given a document, p(c | d)
- The probability of a word given a category, p(w | c)
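The two slides above can be sketched as a minimal collapsed Gibbs sampler (the toy corpus, sizes, and uniform λ are assumptions; in sOntoLDA the standard LDA conditional's β term is scaled by the λ prior):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy corpus: documents as lists of word ids (hypothetical)
docs = [[0, 1, 1, 2], [2, 3, 3, 4], [0, 4, 1, 3]]
C, V = 2, 5              # number of categories, vocabulary size
alpha, beta = 0.1, 0.01
lam = np.ones((C, V))    # prior matrix; uniform here, tf-idf in practice

# Count tables and random initial category assignments
n_dc = np.zeros((len(docs), C))   # category counts per document
n_cw = np.zeros((C, V))           # word counts per category
z = []
for d, doc in enumerate(docs):
    zd = rng.integers(C, size=len(doc))
    z.append(zd)
    for w, c in zip(doc, zd):
        n_dc[d, c] += 1
        n_cw[c, w] += 1

for _ in range(50):               # Gibbs sweeps
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            c = z[d][i]
            n_dc[d, c] -= 1; n_cw[c, w] -= 1   # remove current assignment
            # p(z_i=c | rest) ∝ (n_dc + alpha) * (n_cw + beta*lam) / (n_c. + beta*lam_c.)
            p = (n_dc[d] + alpha) * (n_cw[:, w] + beta * lam[:, w]) \
                / (n_cw.sum(axis=1) + beta * lam.sum(axis=1))
            c = rng.choice(C, p=p / p.sum())
            z[d][i] = c
            n_dc[d, c] += 1; n_cw[c, w] += 1

# Posterior estimates: theta[d, c] ≈ p(c | d) and phi[c, w] ≈ p(w | c)
theta = (n_dc + alpha) / (n_dc.sum(axis=1, keepdims=True) + C * alpha)
phi = (n_cw + beta * lam) / (n_cw.sum(axis=1, keepdims=True)
                             + beta * lam.sum(axis=1, keepdims=True))
```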
Health Tagging Example
- Structure of, and relationships between, Wikipedia categories as represented by SKOS properties
- Sub-categories and super-categories
- Consider super-categories in addition to exact matches
- Categories assigned to the article on "tooth brushing" and the related category hierarchy
[Figure: category hierarchy around "Tooth Brushing" — Health, Health Care, Health Sciences, Self Care, Dentistry, Personal Hygiene, Hygiene, Dentistry Branches, Hygiene Products, Oral Hygiene, Chiropractic Treatment Techniques — annotated with assigned category probabilities (0.1533, 0.0478, 0.0403, 0.0302, 0.0227)]
Experiments
1. How well does the model predict the categories of a collection of Wikipedia articles?
2. Assign Wikipedia tags to Reuters news articles and compare the top-k topics
Preprocessing: Final Topic Graph
- 1,353 categories
- 30,300 articles
- Vocabulary size: 99,665
Evaluation Metrics
- Precision@k
- Mean Average Precision (MAP)
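Both metrics follow their standard information-retrieval definitions; a minimal sketch (the example tags at the bottom are hypothetical):

```python
def precision_at_k(predicted, relevant, k):
    """Fraction of the top-k predicted tags that are relevant."""
    hits = sum(1 for tag in predicted[:k] if tag in relevant)
    return hits / k

def average_precision(predicted, relevant):
    """Mean of P@k over the ranks k at which a relevant tag appears."""
    score, hits = 0.0, 0
    for k, tag in enumerate(predicted, start=1):
        if tag in relevant:
            hits += 1
            score += hits / k
    return score / len(relevant) if relevant else 0.0

def mean_average_precision(rankings):
    """MAP over (predicted, relevant) pairs for a document collection."""
    return sum(average_precision(p, r) for p, r in rankings) / len(rankings)

# Hypothetical example: ranked predicted tags vs. gold tags
preds = ["Dentistry", "Finance", "Hygiene"]
gold = {"Dentistry", "Hygiene"}
# precision_at_k(preds, gold, 2) -> 0.5
```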
Results: Tagging Wikipedia Articles
Real-World Document Set
Evaluation on Reuters news articles (2,914 articles)
- Applied the "hierarchical match" method used for the Wikipedia dataset
- Removed words not defined in the prior matrix (λ)
Example of topic and word distribution
Conclusions
- Prior knowledge from Wikipedia's hierarchical ontology can be used successfully for the semantic tagging of documents
Future work:
- Expand to other topics
- Explore richer topic models
- Incorporate the hierarchical structure of categories
Questions?