Representing Documents via Latent Keyphrase Inference


  1. Representing Documents via Latent Keyphrase Inference (April 15th, 2016)

  2. Document Representation in Vector Space: critical for document retrieval and categorization

  3. Traditional Methods
     - Bag-of-Words or Phrases
     - Cons: sparse on short texts

  4. Topic models [LDA]
     - Each topic is a distribution over words; each document is a mixture of corpus-wide topics
     - Cons: difficult for humans to infer topic semantics

  5. Concept-based models [ESA]
     - Every Wikipedia article represents a concept, e.g. concept Panthera: cat [0.92], leopard [0.84], roar [0.77]
     - Article words are associated with the concept (TF-IDF weights), which helps infer concepts from a document
     - Cons: low coverage of concepts in human-curated knowledge bases
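
A minimal sketch of the ESA scoring idea above, assuming a toy inverted index (the weights mirror the slide's Panthera example; the index and function name are illustrative):

```python
from collections import Counter

# Hypothetical inverted index: word -> {concept: TF-IDF weight}.
# In real ESA these weights come from the text of Wikipedia articles.
word_to_concepts = {
    "cat":     {"Panthera": 0.92},
    "leopard": {"Panthera": 0.84, "Leopard": 0.95},
    "roar":    {"Panthera": 0.77},
}

def esa_vector(doc_tokens):
    """Score each concept by summing the TF-IDF weights of the document's words."""
    scores = Counter()
    for token in doc_tokens:
        for concept, weight in word_to_concepts.get(token.lower(), {}).items():
            scores[concept] += weight
    return dict(scores)

print(esa_vector(["leopard", "roar"]))  # {'Panthera': ~1.61, 'Leopard': 0.95}
```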

  6. Word/Document embedding models [word2vec, paragraph2vec]
     - Cons: difficult to explain what each dimension means

  7. Document Representation Using Keyphrases
     - From the corpus, mine domain keyphrases <K1, K2, ..., KM> and use them as the entries of the vector
     - Identify document keyphrases (a subset of the domain keyphrases) by evaluating the relatedness between the document and each domain keyphrase
     - Unsupervised model (see the sketch below)
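
To make the target representation concrete, a minimal sketch (the keyphrases and scores here are illustrative, not from the paper):

```python
# Illustrative domain keyphrases mined from a computer-science corpus.
domain_keyphrases = ["machine learning", "deep learning",
                     "support vector machine", "data mining"]

def doc_vector(doc_keyphrase_scores):
    """Map {document keyphrase: relatedness score} onto the fixed
    domain-keyphrase vector space; absent keyphrases get 0."""
    return [doc_keyphrase_scores.get(k, 0.0) for k in domain_keyphrases]

print(doc_vector({"machine learning": 0.9, "support vector machine": 0.7}))
# [0.9, 0.0, 0.7, 0.0]
```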

  8. Challenges
     - Where to get domain keyphrases for a given corpus? Mining Quality Phrases from Massive Text Corpora [SIGMOD15]
     - How to identify document keyphrases?
       - Mentions can be latent (short text)
       - Relatedness scores are needed

  9. How to Identify Document Keyphrases?
     - Powered by Bayesian inference on "domain keyphrase silhouettes"
     - Domain keyphrase silhouette: a topic centered on a domain keyphrase
       - "Reverse" topic models
       - Learned from the corpus

  10. Framework for Latent Keyphrase Inference (LAKI)

  12. Domain Keyphrase Silhouette
     - Learning a hierarchical Bayesian network (DAG) over binary variables
     - Task 1, model learning: learning link weights
     - Task 2, structure learning: learning the network structure

  13. Task 1: Model Learning Given the Structure
     - Use Z to represent both K (domain keyphrases) and T (content units)
     - Noisy-OR conditional distributions (toy example on the slide; a minimal sketch follows):
       - A parent node activates its children more easily when the link weight is larger
       - A child node is influenced by all of its parents
       - Noise/prior: aggregated over all the other links connected to a node
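
A minimal sketch of a noisy-OR conditional probability, using one common parameterization in which a leak term plays the role of the slide's noise/prior (the paper's exact parameterization may differ):

```python
def noisy_or(parent_states, weights, leak=0.01):
    """P(child = 1 | parents) under a noisy-OR CPD.

    parent_states: list of 0/1 activations of the parent nodes.
    weights: per-link probabilities that an active parent alone
             turns the child on (larger weight -> easier activation).
    leak: prior probability that the child fires with no active parent.
    """
    p_all_fail = 1.0 - leak
    for z, w in zip(parent_states, weights):
        if z:                       # only active parents can fire their links
            p_all_fail *= 1.0 - w
    return 1.0 - p_all_fail

# Toy example: two parents, only the first one active.
print(noisy_or([1, 0], [0.8, 0.5]))   # 1 - (1-0.01)*(1-0.8) = 0.802
```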

  14. Maximum Likelihood Estimation
     - Training data: documents, with document keyphrases partially observed and content units fully observed
     - Expectation step: for each document, collect sufficient statistics
       - Link-firing probability (parent and child both activated)
       - Node-activation probability
     - Maximization step: update the link weights (see the sketch below)
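
A minimal sketch of the M-step under this scheme, assuming the E-step has already aggregated expected link-firing and parent-activation counts over the corpus (the names and the ratio update are an illustration of the standard noisy-OR EM update, not code from the paper):

```python
def m_step(expected_link_firings, expected_parent_activations):
    """Re-estimate each link weight as
    (expected times the link fired) / (expected times its parent was active)."""
    return {
        link: expected_link_firings[link] /
              max(expected_parent_activations[link], 1e-12)
        for link in expected_link_firings
    }

# Toy statistics produced by the E-step:
print(m_step({("panthera", "roar"): 3.2},
             {("panthera", "roar"): 4.0}))   # {('panthera', 'roar'): 0.8}
```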

  15. Task 2: Structure Learning
     - Domain keyphrases are connected to content units, which helps infer document keyphrases from content units
     - Domain keyphrases are interconnected, which helps infer document keyphrases from other keyphrases

  16. A Heuristic Approach
     - Data-driven DAG, similar to an ontology
     - Heuristic: two nodes are connected only if they are
       - closely related (word2vec similarity), and
       - co-occur frequently
     - Links always point to the less frequent node
     - Works well in practice (see the sketch below)
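
A minimal sketch of this heuristic, with illustrative thresholds (sim_threshold and min_cooccur are assumptions, not values from the paper):

```python
import numpy as np

def build_edges(nodes, vectors, cooccur, freq,
                sim_threshold=0.6, min_cooccur=5):
    """Connect two nodes when their word2vec vectors are close AND they
    co-occur often enough; direct each link toward the less frequent node,
    which keeps the graph acyclic when frequencies are distinct."""
    edges = []
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            sim = np.dot(vectors[a], vectors[b]) / (
                np.linalg.norm(vectors[a]) * np.linalg.norm(vectors[b]))
            count = cooccur.get((a, b), cooccur.get((b, a), 0))
            if sim >= sim_threshold and count >= min_cooccur:
                parent, child = (a, b) if freq[a] >= freq[b] else (b, a)
                edges.append((parent, child))
    return edges
```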

  18. Inference
     - Exact inference is slow: computing posterior probabilities in noisy-OR networks is NP-hard
     - Use approximate inference instead:
       - Prune irrelevant nodes with an efficient scoring function
       - Gibbs sampling (a toy sampler is sketched below)
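
A toy Gibbs sampler for a two-layer version of the network (latent keyphrases over observed content units), skipping the pruning step; the flat prior and all names are illustrative assumptions:

```python
import random

def gibbs_keyphrases(obs, parents_of, weight, prior=0.05, leak=0.01, iters=200):
    """Estimate P(keyphrase active | observed content units) by Gibbs
    sampling in a two-layer noisy-OR network.
    obs:        {content unit: 0/1 observation}
    parents_of: {content unit: [keyphrase, ...]}
    weight:     {(keyphrase, content unit): link weight}"""
    keyphrases = sorted({k for ps in parents_of.values() for k in ps})
    z = {k: 0 for k in keyphrases}          # current activation states
    hits = {k: 0 for k in keyphrases}

    def p_unit_on(unit):
        # Noisy-OR: the unit stays off only if the leak and every
        # active parent's link all fail.
        fail = 1.0 - leak
        for k in parents_of[unit]:
            if z[k]:
                fail *= 1.0 - weight[(k, unit)]
        return 1.0 - fail

    for _ in range(iters):
        for k in keyphrases:
            score = {}
            for val in (0, 1):              # unnormalized P(z_k = val | rest)
                z[k] = val
                p = prior if val else 1.0 - prior
                for unit, x in obs.items():
                    if k in parents_of[unit]:
                        q = p_unit_on(unit)
                        p *= q if x else 1.0 - q
                score[val] = p
            z[k] = 1 if random.random() * (score[0] + score[1]) < score[1] else 0
            hits[k] += z[k]
    return {k: hits[k] / iters for k in keyphrases}
```

In the full model the keyphrase layer is itself hierarchical, so a keyphrase's conditional would also involve its keyphrase parents and children, not just the content units.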

  19. Experiments
     - Two text-related tasks to evaluate document representation quality: phrase relatedness and document classification
     - Two datasets

  20. Methods
     - ESA (Explicit Semantic Analysis)
     - KBLink: uses the link structure in Wikipedia
     - BoW (bag-of-words)
     - ESA-C: extends ESA by replacing Wikipedia with the domain corpus
     - LSA (Latent Semantic Analysis)
     - LDA (Latent Dirichlet Allocation)
     - Word2Vec: a neural network computing word embeddings
     - EKM: uses explicit keyphrase detection

  21. Results [figures: phrase relatedness correlation and document classification]

  22. Case Study

  24. Time Complexity [figures: running time (ms) vs. #samples, #quality phrases after pruning, and #words, on the Academia and Yelp datasets]

  25. Breakdown of Processing Time

  26. Conclusion
     - We have introduced a novel document representation method using latent keyphrases
       - Each dimension is explainable
       - Works for short text
       - Works for closed-domain text
     - We have developed an efficient inference method for real-time keyphrase identification
     - Future work
       - Better structure-learning approaches
       - Combination with knowledge bases
       - Inference methods other than Gibbs sampling
     - Code available at http://jialu.info
