  1. A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy
     Akshat Pandey, Dipshil Agrawal, Shrirag Kodoor
     March 27th, 2018

  2. 1. Introduction
     1.1. Aim: Create a high-quality hierarchical organization of the concepts in a data set
     1.2. Motivations:
     ◮ Summarization: Users can familiarize themselves with a new domain by browsing the hierarchy
     ◮ Search: Users can discover which phrases are representative of their topic of interest
     ◮ Browsing: Users can search for relevant work done by others, possibly discovering subtopics to focus on
     1.3. Data set:
     ◮ Content-representative documents: concise descriptions of an accompanying full document, e.g. titles of scientific papers
     ◮ Probabilistic priors for which terms are most likely to generate representative phrases

  3. 1. Introduction
     1.4. Framework Features
     ◮ Phrase-centric approach: Determine the topical frequency of phrases instead of unigrams
     ◮ Ranking of topical phrases: Rank phrases based on 4 specified criteria (see Problem Formulation)
     ◮ Recursive clustering for hierarchy construction: Topic inference is based on term co-occurrence network clustering, which can be performed recursively

  4. 2. Problem Formulation
     2.1. Problem
     ◮ Traditional hierarchy formulation: A phrase is a consecutive sequence of unigrams. This is too sensitive to term-order variation and morphological structure.
     ◮ Proposed solution: Define a phrase as an order-free set of terms appearing in the same document (see the sketch below)
     2.2. Example
     ◮ Mining frequent patterns: { "mining frequent patterns", "frequent pattern mining", "mining top-k frequent closed patterns" }
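A minimal sketch of the order-free phrase definition in Python; the toy titles and the crude stemmer are illustrative assumptions, not part of the paper:

```python
# Order-free phrase matching: a phrase is a *set* of terms that co-occur in
# the same short document, so word order and extra terms in between don't matter.

titles = [
    "mining frequent patterns",
    "frequent pattern mining",
    "mining top-k frequent closed patterns",
]

def normalize(token: str) -> str:
    # Crude stand-in stemmer for the example: "patterns" -> "pattern".
    return token.rstrip("s")

phrase = frozenset({"mining", "frequent", "pattern"})

# A title contains the phrase iff every phrase term appears in it.
matches = [t for t in titles if phrase <= {normalize(w) for w in t.split()}]
print(len(matches))  # 3 -- all three order variants match the same phrase
```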

  5. 2. Problem Formulation
     2.3. Criteria
     ◮ Coverage: A phrase for a topic should cover many documents within that topic
     ◮ Purity: Frequent in documents belonging to that topic, but not in documents within other topics
     ◮ Phraseness: A group of terms is a phrase if they co-occur significantly more often than the expected chance co-occurrence frequency, given that each term in the phrase occurs independently
     ◮ Completeness: A phrase is not complete if it is a subset of a longer phrase
     2.4. Topical Frequency
     ◮ Measures which represent these criteria can all be characterized by topical frequency
     ◮ The topical frequency of a phrase is the number of times the phrase is attributed to a topic (toy example below)
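A toy illustration of topical frequency with invented counts; the purity ratio here is a crude illustration of the idea, not the paper's exact measure:

```python
# Topical frequency: how many of a phrase's occurrences are attributed
# to each topic. The criteria above can all be computed from such counts.
P = ("support", "vector", "machine")
ftop = {"ML": {P: 85.0}, "DB": {P: 5.0}}     # invented f_z(P) values

total = sum(f[P] for f in ftop.values())
print(ftop["ML"][P] / total)  # ~0.94 -- the phrase is nearly pure to ML
```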

  6. 3. CATHY Framework
     3.1 Overview
     ◮ Construct the term co-occurrence network for the entire document collection
     ◮ For a topic t, cluster the co-occurrence network G_t into subtopic sub-networks and estimate the sub-topical frequency of its sub-topical phrases using a generative model
     ◮ For each topic, extract candidate phrases based on estimated topical frequency
     ◮ For each topic, rank the topical phrases based on topical frequency
     ◮ Recursively apply the clustering, extraction, and ranking steps to each subtopic to construct the hierarchy in a top-down fashion (a runnable skeleton follows)
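A runnable, high-level skeleton of this top-down recursion; the step functions are trivial stand-ins of my own (not the authors' implementation), with more faithful sketches of each step given on the following slides:

```python
# Top-down hierarchy construction: cluster a topic's co-occurrence network,
# extract and rank phrases per subtopic, then recurse into each sub-network.

def cluster_network(network, k):
    # Stand-in clustering: split links round-robin into k sub-networks.
    subnets = [dict() for _ in range(k)]
    for idx, (edge, weight) in enumerate(sorted(network.items())):
        subnets[idx % k][edge] = weight
    return subnets

def extract_and_rank(network):
    # Stand-in extraction/ranking: each link is a 2-term phrase, ranked by weight.
    return sorted(network, key=lambda e: -network[e])

def build_hierarchy(network, depth, max_depth=2, k=2):
    node = {"phrases": extract_and_rank(network), "children": []}
    if depth < max_depth:
        for subnet in cluster_network(network, k):
            node["children"].append(build_hierarchy(subnet, depth + 1, max_depth, k))
    return node

toy_network = {("mining", "pattern"): 5, ("query", "database"): 4,
               ("frequent", "pattern"): 3, ("query", "optimization"): 2}
print(build_hierarchy(toy_network, depth=0))
```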

  7. 3. CATHY Framework
     3.1 Clustering: Estimating Topical Frequency
     ◮ 3.1.1 A generative model for term co-occurrence network analysis
     ◮ This approach uses a generative model of the term co-occurrence network to estimate topical frequency
     ◮ The observed information is the total number of links between every pair of nodes
     ◮ The parameters which must be learned are the role of each node in each topic and the expected number of links in each topic

  8. 3. CATHY Framework
     3.1.1 A generative model for term co-occurrence network analysis
     ◮ $\theta^z_i$, $\theta^z_j$ are $p(w_i \mid z)$, $p(w_j \mid z)$ for a multinomial distribution, with $\sum_i \theta^z_i = 1$
     ◮ $\rho_z$ is the number of iterations in which links are generated
     ◮ $e^z_{ij} \sim \mathrm{Poisson}(\rho_z \theta^z_i \theta^z_j)$ when $\rho_z$ is large
     ◮ $e_{ij} = \sum_{z=1}^{k} e^z_{ij}$, where $k$ is the number of topics and $z$ indexes the topic
     ◮ $E\big(\sum_{i,j} e^z_{ij}\big) = \rho_z \sum_{i,j} \theta^z_i \theta^z_j = \rho_z$ by the expectation property of the Poisson distribution, so $\rho_z$ is the expected number of links in topic $z$
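A small simulation of this generative process (toy parameters assumed): each topic z emits Poisson link counts between word pairs, and the observed network is their sum:

```python
import numpy as np

rng = np.random.default_rng(0)
rho = np.array([200.0, 100.0])           # expected link count per topic
theta = np.array([                       # per-topic multinomials over 4 words
    [0.5, 0.3, 0.1, 0.1],
    [0.1, 0.1, 0.4, 0.4],
])

# e^z_ij ~ Poisson(rho_z * theta^z_i * theta^z_j); observed e_ij sums topics.
e = sum(rng.poisson(r * np.outer(th, th)) for r, th in zip(rho, theta))
print(e)                   # observed link counts e_ij
print(e.sum(), rho.sum())  # total links is ~ rho_1 + rho_2 in expectation
```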

  9. 3. CATHY Framework
     3.1.1 A generative model for term co-occurrence network analysis
     ◮ Likelihood: $p(\{e_{ij}\} \mid \Theta, \rho) = \prod_{w_i, w_j \in W} \frac{\left(\sum_{z=1}^{k} \rho_z \theta^z_i \theta^z_j\right)^{e_{ij}} \exp\left(-\sum_{z=1}^{k} \rho_z \theta^z_i \theta^z_j\right)}{e_{ij}!}$
     ◮ E-step: $\hat{e}^z_{ij} = e_{ij} \frac{\rho_z \theta^z_i \theta^z_j}{\sum_{t=1}^{k} \rho_t \theta^t_i \theta^t_j}$
     ◮ M-step: $\rho_z = \sum_{i,j} \hat{e}^z_{ij}$ and $\theta^z_i = \frac{\sum_j \hat{e}^z_{ij}}{\rho_z}$
     ◮ If $\hat{e}^z_{ij} \geq 1$, apply the same model recursively on the sub-network
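A compact EM sketch for this model on a planted toy network; the initialization, iteration count, and variable names are my assumptions:

```python
import numpy as np

def em(e, k, iters=200, seed=0):
    rng = np.random.default_rng(seed)
    theta = rng.random((k, e.shape[0]))
    theta /= theta.sum(axis=1, keepdims=True)
    rho = np.full(k, e.sum() / k)
    for _ in range(iters):
        # E-step: attribute each observed link count e_ij to the k topics.
        rates = rho[:, None, None] * np.einsum("zi,zj->zij", theta, theta)
        hat_e = e * rates / rates.sum(axis=0).clip(min=1e-12)
        # M-step: rho_z = sum_ij hat_e^z_ij; theta^z_i = sum_j hat_e^z_ij / rho_z.
        rho = hat_e.sum(axis=(1, 2))
        theta = hat_e.sum(axis=2) / rho[:, None].clip(min=1e-12)
    return rho, theta

# Two planted topics over 4 words: {w0, w1} co-occur, {w2, w3} co-occur.
e = np.array([[0, 30, 1, 1], [30, 0, 1, 1], [1, 1, 0, 20], [1, 1, 20, 0]])
rho, theta = em(e, k=2)
print(np.round(rho), np.round(theta, 2))
```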

  10. 3. CATHY Framework
      ◮ 3.1.2 Topical frequency estimation
      ◮ The topical frequency estimation is based on two assumptions:
      ◮ When generating a topic-z phrase, each of its terms is generated from the multinomial distribution $\Theta_z$
      ◮ The total number of topic-z phrases of length $n$ is proportional to $\rho_z$
      ◮ $f_z(P) = f_{par(z)}(P) \, \frac{\rho_z \prod_{i=1}^{n} \theta^z_{x_i}}{\sum_{t \in C_{par(z)}} \rho_t \prod_{i=1}^{n} \theta^t_{x_i}}$, where the sum runs over $C_{par(z)}$, the child topics of $z$'s parent
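A direct transcription of the estimate above; rho and theta are assumed to come from the EM step, and the numbers are invented:

```python
import numpy as np

def topical_frequency(phrase, z, f_parent, rho, theta):
    # f_z(P) = f_par(z)(P) * rho_z * prod_i theta^z_{x_i} / (sum over siblings t)
    scores = rho * np.array([theta[t][phrase].prod() for t in range(len(rho))])
    return f_parent * scores[z] / scores.sum()

rho = np.array([50.0, 30.0])                    # sibling subtopics' rho_t
theta = np.array([[0.5, 0.3, 0.1, 0.1],
                  [0.1, 0.1, 0.4, 0.4]])        # theta^t_i per subtopic
phrase = np.array([0, 1])                       # term indices of P = {w_0, w_1}
print(topical_frequency(phrase, z=0, f_parent=40.0, rho=rho, theta=theta))
# ~38.5: most of the 40 parent-topic occurrences go to subtopic 0
```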

  11. 3. CATHY Framework
      3.2 Topical Phrase Extraction
      ◮ The approach defines an algorithm to mine frequent topical patterns
      ◮ The goal is to extract patterns with topical frequency larger than some threshold for every topic z
      ◮ Steps:
      ◮ Get candidate phrases using a traditional pattern mining algorithm
      ◮ Filter them using the topical frequency estimation
      ◮ To remove incomplete phrases, use the notion of closed and maximal patterns: a phrase P is removed if there is a phrase P' in the topic such that $P \subset P'$ and $f_z(P') \geq \gamma f_z(P)$
      ◮ $0 \leq \gamma \leq 1$, where $\gamma$ closer to 0 yields maximal patterns and closer to 1 yields closed patterns (filter sketch below)
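A sketch of the completeness filter under the γ rule above; the phrase frequencies are invented:

```python
# Drop a phrase P when some superset phrase P' retains at least a gamma
# fraction of its topical frequency. Phrases are frozensets of terms;
# f_z maps phrase -> estimated topical frequency.

def filter_incomplete(f_z, gamma=0.8):
    kept = {}
    for P, freq in f_z.items():
        superseded = any(
            P < Q and f_z[Q] >= gamma * freq   # strict superset, high frequency
            for Q in f_z
        )
        if not superseded:
            kept[P] = freq
    return kept

f_z = {
    frozenset({"frequent", "pattern"}): 100.0,
    frozenset({"mining", "frequent", "pattern"}): 90.0,
    frozenset({"closed", "pattern"}): 40.0,
}
print(filter_incomplete(f_z, gamma=0.8))
# {"frequent pattern"} is dropped: its superset keeps 90% of its frequency
```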

  12. 3. CATHY Framework
      3.3 Ranking
      ◮ Comparability: the ranking function must be able to compare phrases of different lengths
      ◮ Consider the occurrence probability of seeing a phrase P in a random document with topic t

  13. 3. CATHY Framework
      3.3 Ranking
      ◮ The occurrence probability of a phrase P conditioned on topic z is $p(P \mid z) = \frac{f_z(P)}{m_z}$, where $m_z$ is the number of documents in which at least one phrase has been attributed to topic $z$
      ◮ The independent occurrence probability assumes each term occurs independently: $p^{\text{indep}}(P \mid z) = \prod_{i=1}^{n} \frac{f_z(w_{x_i})}{m_z}$
      ◮ The mixture contrastive probability is the probability of P conditioned on a mixture of multiple sibling topics Z: $p(P \mid Z) = \frac{\sum_{t \in Z} f_t(P)}{\sum_{t \in Z} m_t}$
      ◮ The three criteria are unified by the ranking function $r_z(P) = p(P \mid z)\left(\log \frac{p(P \mid z)}{p(P \mid Z)} + \omega \log \frac{p(P \mid z)}{p^{\text{indep}}(P \mid z)}\right)$
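A literal sketch of this ranking function on invented counts; the document counts m, the unigram frequencies, and the weight ω = 0.5 are assumptions for illustration:

```python
import math

def rank(P_terms, z, f_phrase, f_word, m, siblings, omega=0.5):
    # r_z(P) = p(P|z) * (log p(P|z)/p(P|Z) + omega * log p(P|z)/p_indep(P|z))
    p_pz = f_phrase[z] / m[z]
    p_indep = math.prod(f_word[z][w] / m[z] for w in P_terms)
    p_mix = sum(f_phrase[t] for t in siblings) / sum(m[t] for t in siblings)
    return p_pz * (math.log(p_pz / p_mix) + omega * math.log(p_pz / p_indep))

# Invented numbers: phrase "query processing" in topic 0 vs sibling topic 1.
f_phrase = {0: 120.0, 1: 10.0}                       # f_t(P)
f_word = {0: {"query": 300.0, "processing": 200.0},  # f_t(w) unigram freqs
          1: {"query": 40.0, "processing": 30.0}}
m = {0: 1000, 1: 1000}
print(rank(("query", "processing"), 0, f_phrase, f_word, m, siblings=(0, 1)))
```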

  14. 4. Related Work
      Ontology Learning
      ◮ Topic hierarchies, concept hierarchies, and ontologies provide a hierarchical organization of data at different levels of granularity
      ◮ This framework is different from a subsumption hierarchy
      ◮ The approach uses statistics-based techniques, without resorting to external knowledge resources
      Topical key phrase extraction and ranking
      ◮ Key phrases are traditionally extracted as n-grams using statistical modeling
      ◮ This approach relaxes the restriction that a phrase must be a consecutive n-gram, and instead uses document co-location, which is effective for the content-representative documents used

  15. 4. Related Work
      Topic Modeling
      ◮ Traditional topic-modeling techniques (e.g. LDA) have a more restrictive definition of phrases and cannot find hierarchical topics
      ◮ These techniques are not used due to the sparseness of the data set and because they cannot be applied recursively

  16. 5. Experiments
      5.1 Datasets
      ◮ DBLP: titles of CS papers related to Databases, IR, ML, and NLP
      ◮ Library: University of Illinois Library catalogue in 6 categories; titles of books from Architecture, Literature, Mass Media, Motion Pictures, Music, and Theater

  17. 5. Experiments
      5.2 Methods for Comparison
      ◮ SpecClus: Baseline; extracts all concepts from the text and then hierarchically clusters them. Similarity between two phrases is their co-occurrence count in the data set
      ◮ hPAM: Second baseline; a state-of-the-art topic modeling approach that takes documents as input and outputs a specified number of supertopics and subtopics
      ◮ hPAMrr: A method that re-ranks the unigrams in each topic generated by hPAM
      ◮ CATHYcp: A version of CATHY that only considers the coverage and purity criteria
      ◮ CATHY: All criteria are used

  18. 5. Experiments
      5.3 Topical Hierarchy of DBLP Paper Titles
      ◮ Assesses, via a human study, each method's ability to construct topical phrases that appear high quality to human judges
      ◮ Hierarchies are created using all 5 methods
      ◮ Topic Intrusion tasks
      ◮ Judges are shown a topic t and T candidate child topics
      ◮ One of the candidates is not actually a child topic; the judge must pick the intruder
      ◮ Tests the quality of parent-child relationships
      ◮ Phrase Intrusion tasks
      ◮ Judges are shown T phrases; all but one come from the same topic
      ◮ The judge must pick the intruding phrase
      ◮ Evaluates how well the hierarchy separates phrases in different topics

  19. 5. Experiments 5.3 Topical Hierarchy of DBLP Paper Titles (results figure)

  20. 5. Experiments
      5.4 Topical Hierarchy of Book Titles
      ◮ Examines how well a high-quality topical phrase can predict its category, and vice versa
      ◮ Construct a hierarchy and measure the coverage-conscious mutual information (CCMI) at K between the category labels and the top-level branches
      5.5 On Defining Term Co-occurrence
      ◮ Traditional methods of key phrase extraction only consider phrases to be sequences of terms which explicitly occur in the text
      ◮ This approach instead consistently defines term co-occurrence to mean co-occurring in the same document

  21. 5. Experiments 5.4 Topical Hierarchy of Book Titles (results figure)

  22. Questions?
