A Phrase Mining Framework for Recursive Construction of a Topical Hierarchy
Akshat Pandey, Dipshil Agrawal, Shrirag Kodoor
March 27th, 2018
1. Introduction
1.1. Aim: Creation of a high-quality hierarchical organization of the concepts in a data set
1.2. Motivations:
◮ Summarization: Users could familiarize themselves with a new domain by browsing the hierarchy
◮ Search: Users could discover which phrases are representative of their topic of interest
◮ Browsing: Users could search for relevant work done by others, possibly discovering subtopics to focus on
1.3. Data set:
◮ Content-representative documents: concise descriptions of accompanying full documents, e.g. titles of scientific papers
◮ Probabilistic priors for which terms are most likely to generate representative phrases
1. Introduction
1.4. Framework Features
◮ Phrase-centric approach: Determine the topical frequency of phrases instead of unigrams
◮ Ranking of topical phrases: Rank phrases based on the four criteria specified in the Problem Formulation
◮ Recursive clustering for hierarchy construction: Topic inference is based on term co-occurrence network clustering, which can be performed recursively
2. Problem Formulation
2.1. Problem
◮ Traditional hierarchy formulation: A phrase is a consecutive sequence of unigrams. This is too sensitive to term-order variation and morphological structure.
◮ Proposed solution: Define a phrase as an order-free set of terms appearing in the same document (see the sketch after the example)
2.2. Example
◮ Mining frequent patterns: { "mining frequent patterns", "frequent pattern mining", "mining top-k frequent closed patterns" }
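A minimal sketch of the order-free representation (illustrative only; the crude `rstrip`-based stemmer is a stand-in for real preprocessing, not the paper's):

```python
# Toy sketch: represent a phrase as an order-free set of crudely stemmed
# terms, so word-order and simple morphological variants collapse to the
# same pattern.
def as_pattern(phrase: str) -> frozenset:
    stem = lambda w: w.rstrip("s")  # deliberately crude stemmer
    return frozenset(stem(w) for w in phrase.lower().split())

# All three surface forms from the example relate as sets:
assert as_pattern("mining frequent patterns") == as_pattern("frequent pattern mining")
assert as_pattern("mining frequent patterns") < as_pattern("mining top-k frequent closed patterns")
```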
2. Problem Formulation
2.3. Criteria
◮ Coverage: A phrase for a topic should cover many documents within that topic
◮ Purity: A pure phrase is frequent in documents belonging to its topic and infrequent in documents belonging to other topics
◮ Phraseness: A set of terms is a phrase if the terms co-occur significantly more often than their expected chance co-occurrence frequency, i.e. more often than if each term occurred independently
◮ Completeness: A phrase is not complete if it is a subset of a longer phrase
2.4. Topical Frequency
◮ Measures representing these criteria can all be characterized in terms of topical frequency
◮ The topical frequency of a phrase is the number of times the phrase is attributed to a topic (see the worked example below)
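A worked reading of this definition, with numbers invented purely for illustration: a phrase's total frequency decomposes into per-topic attributions,

```latex
f(P) = \sum_{z=1}^{k} f_z(P),
\qquad\text{e.g.}\quad
\underbrace{100}_{f(P)} \;=\; \underbrace{85}_{f_{z_1}(P)} \;+\; \underbrace{15}_{f_{z_2}(P)}
```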
3. CATHY Framework
3.1 Overview
◮ Step 1: Construct the term co-occurrence network for the entire document collection
◮ Step 2: For a topic t, cluster the co-occurrence network $G_t$ into subtopic sub-networks and estimate the topical frequency of each subtopic's phrases using a generative model
◮ Step 3: For each topic, extract candidate phrases based on estimated topical frequency
◮ Step 4: For each topic, rank the topical phrases based on topical frequency
◮ Step 5: Recursively apply steps 2-5 to each subtopic to construct the hierarchy in a top-down fashion (see the loop sketched below)
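A hedged sketch of this recursive loop; `cluster_network`, `extract_phrases`, and `rank_phrases` are hypothetical helpers standing in for steps 2-4, not the authors' implementation:

```python
# Recursive top-down construction: each subtopic's sub-network is
# clustered again until the desired depth is reached (step 5).
def build_hierarchy(network, depth=0, max_depth=2, n_subtopics=5):
    if depth == max_depth:
        return []
    subnetworks = cluster_network(network, n_subtopics)  # step 2: EM clustering
    topics = []
    for sub in subnetworks:
        phrases = extract_phrases(sub)                        # step 3: candidates
        ranked = rank_phrases(phrases, siblings=subnetworks)  # step 4: ranking
        topics.append({
            "phrases": ranked,
            "children": build_hierarchy(sub, depth + 1, max_depth, n_subtopics),
        })
    return topics
```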
3. CATHY Framework
3.1 Clustering: Estimating Topical Frequency
◮ 3.1.1 A generative model for term co-occurrence network analysis
◮ This approach uses a generative model of the term co-occurrence network to estimate topical frequency
◮ The observed information is the total number of links between every pair of nodes
◮ The parameters to be learned are the role of each node in each topic and the expected number of links in each topic
3. CATHY Framework
3.1.1 A generative model for term co-occurrence network analysis
◮ $\theta^z_i = p(w_i \mid z)$ and $\theta^z_j = p(w_j \mid z)$ are parameters of a multinomial distribution over terms, with $\sum_i \theta^z_i = 1$
◮ $\rho_z$ is the number of iterations in which links are generated for topic $z$
◮ $e^z_{ij} \approx \mathrm{Poisson}(\rho_z \theta^z_i \theta^z_j)$ when $\rho_z$ is large
◮ $e_{ij} = \sum_{z=1}^{k} e^z_{ij}$, where $k$ is the number of topics and $z$ is the topic index
◮ $E(e^z_{ij}) = \rho_z \theta^z_i \theta^z_j$, so by the expectation property of the Poisson distribution $\sum_{i,j} E(e^z_{ij}) = \rho_z$, the expected number of links in topic $z$
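A toy simulation of this generative process (all parameter values invented for illustration; NumPy sketch, not the paper's code):

```python
import numpy as np

# For each topic z, draw e_ij^z ~ Poisson(rho_z * theta_i^z * theta_j^z),
# then sum over topics to get the observed link counts e_ij.
rng = np.random.default_rng(0)
k, n_terms = 2, 4
rho = np.array([50.0, 30.0])                     # expected links per topic
theta = rng.dirichlet(np.ones(n_terms), size=k)  # each theta[z] sums to 1

e = np.zeros((n_terms, n_terms), dtype=np.int64)
for z in range(k):
    rate = rho[z] * np.outer(theta[z], theta[z])  # E(e_ij^z)
    e += rng.poisson(rate)                        # e_ij = sum_z e_ij^z
```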
3. CATHY Framework
3.1.1 A generative model for term co-occurrence network analysis
◮ Likelihood: $p(\{e_{ij}\} \mid \Theta, \rho) = \prod_{w_i, w_j \in W} \dfrac{\left(\sum_{z=1}^{k} \rho_z \theta^z_i \theta^z_j\right)^{e_{ij}} \exp\left(-\sum_{z=1}^{k} \rho_z \theta^z_i \theta^z_j\right)}{e_{ij}!}$
◮ E-step: $\hat{e}^z_{ij} = e_{ij} \dfrac{\rho_z \theta^z_i \theta^z_j}{\sum_{t=1}^{k} \rho_t \theta^t_i \theta^t_j}$
◮ M-step: $\rho_z = \sum_{i,j} \hat{e}^z_{ij}$ and $\theta^z_i = \dfrac{\sum_j \hat{e}^z_{ij}}{\rho_z}$ (see the EM sketch below)
◮ If $\hat{e}^z_{ij} \geq 1$, apply the same model recursively on the sub-network
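A compact illustration of these E/M updates, assuming a symmetric integer co-occurrence matrix `e` (an illustrative reimplementation, not the authors' released code; the small `eps` guards are mine):

```python
import numpy as np

def em_cooccurrence(e, k, n_iter=100, seed=0, eps=1e-12):
    """Fit the k-topic Poisson link mixture to count matrix e via EM."""
    rng = np.random.default_rng(seed)
    n = e.shape[0]
    theta = rng.dirichlet(np.ones(n), size=k)  # theta[z, i] = p(w_i | z)
    rho = np.full(k, e.sum() / k)              # expected link count per topic
    for _ in range(n_iter):
        # E-step: split each observed count e_ij across topics in
        # proportion to rho_z * theta_i^z * theta_j^z.
        rates = np.einsum('z,zi,zj->zij', rho, theta, theta)
        resp = rates / np.maximum(rates.sum(axis=0), eps)
        e_hat = resp * e                       # e_hat[z] estimates e_ij^z
        # M-step: re-estimate rho and theta from the fractional counts.
        rho = e_hat.sum(axis=(1, 2))
        theta = e_hat.sum(axis=2) / np.maximum(rho[:, None], eps)
    return theta, rho, e_hat
```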
3. CATHY Framework
◮ 3.1.2 Topical frequency estimation
◮ The topical frequency estimation is based on two assumptions:
◮ When generating a topic-$z$ phrase, each of its terms is generated from the multinomial distribution $\Theta^z$
◮ The total number of topic-$z$ phrases of length $n$ is proportional to $\rho_z$
◮ For a phrase $P = \{w_{x_1}, \dots, w_{x_n}\}$, its parent-topic frequency is split among the child topics $C_{par(z)}$: $f_z(P) = f_{par(z)}(P) \cdot \dfrac{\rho_z \prod_{i=1}^{n} \theta^z_{x_i}}{\sum_{t \in C_{par(z)}} \rho_t \prod_{i=1}^{n} \theta^t_{x_i}}$ (see the sketch below)
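A sketch of this recursive split; `phrase` is a list of term indices (the $x_i$), `theta` and `rho` come from the EM fit of the parent topic's sub-network, and `f_parent` is $f_{par(z)}(P)$ (names are illustrative, not the paper's API):

```python
import numpy as np

def topical_frequencies(phrase, f_parent, theta, rho):
    # score_t = rho_t * prod_i theta[t, x_i] for every child topic t
    scores = rho * np.prod(theta[:, phrase], axis=1)
    # Normalized split: vector of f_z(P) over the child topics z.
    return f_parent * scores / scores.sum()
```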
3. CATHY Framework
3.2 Topical Phrase Extraction
◮ The approach defines an algorithm to mine frequent topical patterns
◮ The goal is to extract patterns with topical frequency larger than some threshold for every topic z
◮ Steps
◮ Obtain candidate phrases using a traditional frequent pattern mining algorithm
◮ Filter them using the topical frequency estimation
◮ To remove incomplete phrases, use the notion of closed patterns and maximal patterns: a phrase $P$ in a topic is removed if there is a longer phrase $P' \supset P$ with $f_z(P') \geq \gamma f_z(P)$ (see the filter sketch below)
◮ $0 \leq \gamma \leq 1$: values closer to 0 keep only maximal patterns, values closer to 1 keep closed patterns
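A minimal sketch of this completeness filter (hypothetical helper; `freqs` maps a frozenset of terms to its estimated topical frequency $f_z$):

```python
def filter_incomplete(freqs, gamma=0.5):
    # Drop a phrase P when some strict superset P' has f_z(P') >= gamma * f_z(P).
    kept = {}
    for p, f in freqs.items():
        dominated = any(p < q and fq >= gamma * f  # p < q: strict subset test
                        for q, fq in freqs.items())
        if not dominated:
            kept[p] = f
    return kept
```

With gamma near 0 almost any superset dominates, so only maximal patterns survive; with gamma near 1 only supersets of (near-)equal frequency dominate, which keeps closed patterns.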
3. CATHY Framework
3.3 Ranking
◮ Comparability: the ranking function must be able to compare phrases of different lengths
◮ Consider the occurrence probability of seeing a phrase P in a random document with topic z
3. CATHY Framework
3.3 Ranking
◮ The occurrence probability of a phrase $P$ conditioned on topic $z$ is $p(P \mid z) = \dfrac{f_z(P)}{m_z}$, where $m_z$ is the number of documents containing at least one phrase attributed to topic $z$
◮ The independent occurrence probability treats each term of the phrase as occurring independently: $p_{indep}(P \mid z) = \prod_{i=1}^{n} \dfrac{f_z(w_{x_i})}{m_z}$
◮ The mixture contrastive probability is the probability of the phrase conditioned on a mixture of multiple sibling topics $Z$: $p(P \mid Z) = \dfrac{\sum_{t \in Z} f_t(P)}{\sum_{t \in Z} m_t}$
◮ The three criteria (coverage, purity, phraseness) are unified by the following ranking function (sketched in code below): $r_z(P) = p(P \mid z) \left( \log \dfrac{p(P \mid z)}{p(P \mid Z)} + \omega \log \dfrac{p(P \mid z)}{p_{indep}(P \mid z)} \right)$
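An illustrative implementation of this ranking function; `f` maps each topic to a dict from frozenset-phrase to topical frequency, `m` gives per-topic document counts, and `omega` weights phraseness against purity (container shapes and `eps` guard are my assumptions, not the paper's data layout):

```python
import math

def rank_score(P, z, siblings, f, m, omega=0.5, eps=1e-12):
    p_z = f[z][P] / m[z]                                         # coverage
    p_Z = sum(f[t].get(P, 0.0) for t in siblings) / sum(m[t] for t in siblings)
    p_ind = math.prod(f[z][frozenset([w])] / m[z] for w in P)    # independence
    return p_z * (math.log(p_z / (p_Z + eps))                    # purity term
                  + omega * math.log(p_z / (p_ind + eps)))       # phraseness term
```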
4. Related Work
Ontology Learning
◮ Topic hierarchies, concept hierarchies, and ontologies provide a hierarchical organization of data at different levels of granularity
◮ This framework is different from a subsumption hierarchy
◮ The approach uses statistics-based techniques, without resorting to external knowledge resources
Topical keyphrase extraction and ranking
◮ Keyphrases are traditionally extracted as n-grams using statistical modeling
◮ This approach relaxes the restriction that a phrase must be a consecutive n-gram, and instead uses document co-location, which is effective for the content-representative documents used here
4. Related Work
Topic Modeling
◮ Traditional topic-modeling techniques (e.g. LDA) have a more restrictive definition of phrases and cannot find hierarchical topics
◮ These techniques are not used due to the sparseness of the data set and because they cannot be applied recursively
5. Experiments 5.1 Datasets ◮ DBLP - titles of CS papers related to Databases, IR, ML, and NLP ◮ Library - University of Illinois Library catalogue in 6 categories: Titles of books from Architecture, Literature, Mass Media, Motion Pictures, Music, and Theater
5. Experiments
5.2 Methods for Comparison
◮ SpecClus: baseline that extracts all concepts from the text and then hierarchically clusters them; the similarity between two phrases is their co-occurrence count in the data set
◮ hPAM: second baseline, a state-of-the-art topic modeling approach that takes documents as input and outputs a specified number of supertopics and subtopics
◮ hPAMrr: a method that re-ranks the unigrams in each topic generated by hPAM
◮ CATHYcp: a version of CATHY that only considers the coverage and purity criteria
◮ CATHY: all criteria are used
5. Experiments
5.3 Topical Hierarchy of DBLP Paper Titles
◮ Assesses, via a human study, the method's ability to construct topical phrases that appear to be of high quality to human judges
◮ Hierarchies are created using all 5 methods
◮ Topic Intrusion tasks
◮ Judges are shown a topic t and T candidate child topics
◮ One of the candidates is not actually a child topic; the judge must pick the intruder
◮ Tests the quality of parent-child relationships
◮ Phrase Intrusion tasks
◮ Judges are shown T phrases; all but one come from the same topic
◮ The judge must pick the phrase that does not belong
◮ Evaluates how well the hierarchy separates phrases in different topics
5. Experiments
5.3 Topical Hierarchy of DBLP Paper Titles (results figure)
5. Experiments
5.4 Topical Hierarchy of Book Titles
◮ Examines how well a high-quality topical phrase can predict its category, and vice versa
◮ Construct a hierarchy and measure the coverage-conscious mutual information (CCMI) at K between the category labels and the top-level branches
5.5 On Defining Term Co-occurrence
◮ Traditional methods of keyphrase extraction only consider phrases to be sequences of terms that explicitly occur in the text
◮ This approach consistently defines term co-occurrence as co-occurring in the same document
5. Experiments
5.4 Topical Hierarchy of Book Titles (results figure)
Questions?