Supervised and Relational Topic Models
David M. Blei
Department of Computer Science, Princeton University
October 5, 2009
Joint work with Jonathan Chang and Jon McAuliffe
Topic modeling
• Large electronic archives of document collections require new statistical tools for analyzing text.
• Topic models have emerged as a powerful technique for unsupervised analysis of large document collections.
• Topic models posit latent topics in text using hidden random variables, and uncover that structure with posterior inference.
• Useful for tasks like browsing, search, information retrieval, etc.
Examples of topic modeling
[Figure: top words from topics discovered in a corpus of legal documents, e.g., contracts (contractual, expectation, promises, breach, agreement, perform), labor and employment (employment, industrial, jobs, employees, relations), sex discrimination (female, men, women, sexual, discrimination, harassment, gender), corporate finance (markets, earnings, investors, sec, firm, managers, risk), and criminal justice (criminal, discretion, justice, parole, inmates).]
Examples of topic modeling
[Figure: topics discovered from a corpus of computer science abstracts, with topic word lists (e.g., quantum/automata, routing/networks, scheduling, learning/constraints, databases/queries, logic/verification, queuing networks) linked to representative paper titles such as "Quantum lower bounds by polynomials", "How bad is selfish routing?", "Authoritative sources in a hyperlinked environment", "On XML integrity constraints in the presence of DTDs", "A new approach to the maximum-flow problem", "A mechanical proof of the Church-Rosser theorem", and "The maximum concurrent flow problem".]
Examples of topic modeling
[Figure: top words of a single topic tracked by decade from 1880 to 2000, drifting from terms such as "electric", "machine", "power", "engine", "steam", and "apparatus" in the early decades toward "materials", "devices", "current", "gate", "silicon", and "technology" by 2000.]
Examples of topic modeling
[Figure: top words from many topics discovered in a corpus of Science articles, covering themes such as neuroscience (neurons, brain, stimulus, synaptic), molecular biology (tyrosine phosphorylation, kinase, p53, cell cycle), genetics (gene, sequence, genome, mutations), immunology and disease (antigen, virus, hiv, patients), physics (laser, quantum, electrons, superconductivity), astronomy (stars, galaxies, universe, planets), earth science (earthquakes, mantle, fossil record, volcanic), and climate (co2, ozone, climate change).]
Supervised topic models
• These applications of topic modeling work in the same way.
• Fit a model using a likelihood criterion. Then, hope that the resulting model is useful for the task at hand.
• Supervised topic models and relational topic models fit topics explicitly to perform prediction.
• Useful for building topic models that can
  • Predict the rating of a review
  • Predict the category of an image
  • Predict the links emitted from a document
Outline
1 Unsupervised topic models
2 Supervised topic models
3 Relational topic models
Probabilistic modeling
1 Treat data as observations that arise from a generative probabilistic process that includes hidden variables
  • For documents, the hidden variables reflect the thematic structure of the collection.
2 Infer the hidden structure using posterior inference
  • What are the topics that describe this collection?
3 Situate new data into the estimated model.
  • How does this query or new document fit into the estimated topic structure?
Intuition behind LDA
Simple intuition: Documents exhibit multiple topics.
Generative model
[Figure: cartoon of the LDA generative process, showing topics as distributions over words (e.g., gene 0.04, dna 0.02, genetic 0.01, ...; life 0.02, evolve 0.01, organism 0.01, ...; brain 0.04, neuron 0.02, nerve 0.01, ...; data 0.02, number 0.02, computer 0.01, ...), a document whose words are assigned to topics, and the document's topic proportions.]
• Each document is a random mixture of corpus-wide topics
• Each word is drawn from one of those topics
The posterior distribution
[Figure: the same cartoon with the topics, topic proportions, and topic assignments hidden; only the documents are observed.]
• In reality, we only observe the documents
• Our goal is to infer the underlying topic structure
Latent Dirichlet allocation
[Graphical model: Dirichlet parameter α → per-document topic proportions θ_d → per-word topic assignment Z_{d,n} → observed word W_{d,n} ← topics β_k ← topic hyperparameter η, with plates over N words, D documents, and K topics.]
Each piece of the structure is a random variable.
Latent Dirichlet allocation
β_k ∼ Dir(η)                                          k = 1, ..., K
θ_d ∼ Dir(α)                                          d = 1, ..., D
Z_{d,n} | θ_d ∼ Mult(1, θ_d)                          d = 1, ..., D,  n = 1, ..., N
W_{d,n} | θ_d, z_{d,n}, β_{1:K} ∼ Mult(1, β_{z_{d,n}})    d = 1, ..., D,  n = 1, ..., N
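Taken together, these conditionals define the joint distribution over the hidden and observed variables. Written out (a standard expansion of the model, not shown on the original slide):

p(β_{1:K}, θ_{1:D}, z, w) = ∏_{k=1}^{K} p(β_k | η)  ∏_{d=1}^{D} p(θ_d | α)  ∏_{n=1}^{N} p(z_{d,n} | θ_d) p(w_{d,n} | β_{1:K}, z_{d,n})

Conditioning this joint distribution on the observed words w gives the posterior over the hidden topic structure.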
Latent Dirichlet allocation
1 Draw each topic β_k ∼ Dir(η), for k ∈ {1, ..., K}.
2 For each document:
  1 Draw topic proportions θ_d ∼ Dir(α).
  2 For each word:
    1 Draw Z_{d,n} ∼ Mult(θ_d).
    2 Draw W_{d,n} ∼ Mult(β_{z_{d,n}}).
(A simulation of this process is sketched below.)
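A minimal sketch of this generative process in Python with NumPy; the corpus sizes, vocabulary size, and hyperparameter values are illustrative choices, not taken from the slides:

import numpy as np

rng = np.random.default_rng(0)

K, D, N, V = 4, 100, 50, 1000   # topics, documents, words per document, vocabulary size
alpha, eta = 0.5, 0.1           # symmetric Dirichlet hyperparameters (illustrative values)

# Draw each topic beta_k ~ Dir(eta): a distribution over the V vocabulary words.
beta = rng.dirichlet(np.full(V, eta), size=K)      # shape (K, V)

documents = []
for d in range(D):
    # Draw topic proportions theta_d ~ Dir(alpha).
    theta = rng.dirichlet(np.full(K, alpha))       # shape (K,)
    words = np.empty(N, dtype=int)
    for n in range(N):
        # Draw the topic assignment Z_{d,n} ~ Mult(1, theta_d) ...
        z = rng.choice(K, p=theta)
        # ... and the word W_{d,n} ~ Mult(1, beta_{z_{d,n}}).
        words[n] = rng.choice(V, p=beta[z])
    documents.append(words)

Running this produces a synthetic corpus of D documents, each a length-N array of word indices drawn from the K topics.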
Latent Dirichlet allocation
• From a collection of documents, infer
  • Per-word topic assignments z_{d,n}
  • Per-document topic proportions θ_d
  • Per-corpus topic distributions β_k
• Use posterior expectations to perform the task at hand, e.g., information retrieval, document similarity, etc. (see the sketch below)
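As one concrete example of the last point, a sketch of posterior inference and a downstream similarity computation using the gensim library; the toy corpus, the number of topics, and the Hellinger-distance comparison are illustrative assumptions rather than anything from the slides:

import numpy as np
from gensim import corpora, models

# Toy corpus: each document is a list of tokens (illustrative data).
texts = [
    ["gene", "dna", "genetic", "dna", "organism"],
    ["brain", "neuron", "nerve", "brain", "stimulus"],
    ["data", "number", "computer", "data", "brain"],
]

dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]

# Fit LDA; the posterior is approximated with variational inference.
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2,
                      alpha="auto", passes=10, random_state=0)

# Posterior expectation of a document's topic proportions theta_d.
def theta(bow, k=2):
    dense = np.zeros(k)
    for topic, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        dense[topic] = prob
    return dense

print(lda.show_topic(0, topn=5))        # top words of topic 0 (posterior over beta_0)

# Use the proportions for a downstream task, e.g., document similarity
# via the Hellinger distance between topic distributions.
t0, t1 = theta(corpus[0]), theta(corpus[1])
hellinger = np.sqrt(0.5 * np.sum((np.sqrt(t0) - np.sqrt(t1)) ** 2))
print("Hellinger distance:", hellinger)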