

  1. Neural Models for Documents with Metadata Dallas Card, Chenhao Tan, Noah A. Smith July 18, 2018

  2. Outline Main points of this talk: 1. Introducing Scholar (Sparse Contextual Hidden and Observed Language Autoencoder): a neural model for documents with metadata. Background (LDA, SAGE, SLDA, etc.); model and related work; experiments and results. 2. Power of neural variational inference for interactive modeling

  3. Latent Dirichlet Allocation: Blei, Ng, and Jordan. Latent Dirichlet Allocation. JMLR, 2003. David Blei. Probabilistic topic models. Comm. ACM, 2012.

  4. Types of metadata: date or time, author(s), rating, sentiment, ideology, etc.

  5. Variations and extensions: Author-topic model (Rosen-Zvi et al., 2004), Supervised LDA (SLDA; McAuliffe and Blei, 2008), Dirichlet multinomial regression (Mimno and McCallum, 2008), Sparse additive generative models (SAGE; Eisenstein et al., 2011), Structural topic model (Roberts et al., 2014), ...

  6–10. Desired features of model: Fast, scalable inference. Easy modification by end-users. Incorporation of metadata: covariates, features that influence the text (as in SAGE); labels, features to be predicted along with the text (as in SLDA). Possibility of sparse topics. Incorporation of additional prior knowledge. → Use variational autoencoder (VAE) style of inference (Kingma and Welling, 2014)

  11–13. Desired outcome: Coherent groupings of words (something like topics), with offsets for observed metadata. Encoder to map from documents to latent representations. Classifier to predict labels from the latent representation.

  14–26. Model
      Generator network: p(w_i | θ_i) = f_g(θ_i)
      Encoder network: q(θ_i | w_i) = f_e(w_i)
      ELBO = E_q[log p(words | θ_i)] − D_KL[q(θ_i | words) ‖ p(θ_i)]
      Reparameterization: θ_i = softmax(r_i), with r_i = μ_q + σ_q ⊙ ε^(s) and ε ~ N(0, I), giving
      ELBO ≈ (1/S) Σ_{s=1}^S log p(words | r_i^(s)) − D_KL[q(r_i | words) ‖ p(r_i)]
      (Srivastava and Sutton, 2017; Miao et al., 2016)
      Metadata enters the model by adding labels y_i and covariates c_i to the generator, and by conditioning the encoder on words, c_i, and y_i.
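
A minimal sketch of this encoder/generator setup with the reparameterized ELBO, written in PyTorch for illustration only (class and variable names are mine, and the standard-normal prior is a simplification of the Dirichlet-like prior the actual model approximates):

```python
# Minimal sketch (illustrative only): a VAE-style topic model in PyTorch.
# Assumes bag-of-words inputs x of shape (batch, vocab_size); names are hypothetical.
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralTopicModel(nn.Module):
    def __init__(self, vocab_size, n_topics, hidden=300):
        super().__init__()
        # Encoder f_e: words -> parameters of q(r_i | words)
        self.encoder = nn.Sequential(nn.Linear(vocab_size, hidden), nn.Softplus())
        self.mu = nn.Linear(hidden, n_topics)
        self.logsigma = nn.Linear(hidden, n_topics)
        # Generator f_g: theta_i -> distribution over words (bias plays the role of a background term)
        self.beta = nn.Linear(n_topics, vocab_size)

    def forward(self, x):
        h = self.encoder(x)
        mu, logsigma = self.mu(h), self.logsigma(h)
        # Reparameterization trick: r_i = mu + sigma * eps, with eps ~ N(0, I)
        eps = torch.randn_like(mu)
        r = mu + torch.exp(logsigma) * eps
        theta = F.softmax(r, dim=-1)                          # theta_i = softmax(r_i)
        log_p_words = F.log_softmax(self.beta(theta), dim=-1)
        # Reconstruction term: single-sample estimate of E_q[log p(words | r_i)]
        recon = (x * log_p_words).sum(dim=-1)
        # KL[q(r_i | words) || p(r_i)] against a standard-normal prior (a simplifying assumption here)
        kl = -0.5 * (1 + 2 * logsigma - mu.pow(2) - torch.exp(2 * logsigma)).sum(dim=-1)
        elbo = recon - kl
        return -elbo.mean()                                   # loss to minimize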

  27–30. Scholar
      Generator network: p(word | θ_i, c_i) = softmax(d + θ_i^T B^(topic) + c_i^T B^(cov))
      Optionally include interactions between topics and covariates
      p(y_i | θ_i, c_i) = f_y(θ_i, c_i)
      Encoder: μ_i = f_μ(words, c_i, y_i), log σ_i = f_σ(words, c_i, y_i)
      Optional incorporation of word vectors to embed the input
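
A sketch of what this generator could look like in code, again in PyTorch with assumed names; B^(topic) and B^(cov) are deviation matrices added to a shared background vector d before the softmax:

```python
# Illustrative sketch of the Scholar-style generator (not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F

class ScholarGenerator(nn.Module):
    def __init__(self, vocab_size, n_topics, n_covariates, n_labels):
        super().__init__()
        self.d = nn.Parameter(torch.zeros(vocab_size))               # background log-frequencies
        self.B_topic = nn.Parameter(torch.zeros(n_topics, vocab_size))
        self.B_cov = nn.Parameter(torch.zeros(n_covariates, vocab_size))
        self.f_y = nn.Linear(n_topics + n_covariates, n_labels)      # label predictor f_y

    def forward(self, theta, c):
        # p(word | theta_i, c_i) = softmax(d + theta_i^T B^(topic) + c_i^T B^(cov))
        eta = self.d + theta @ self.B_topic + c @ self.B_cov
        log_p_word = F.log_softmax(eta, dim=-1)
        # p(y_i | theta_i, c_i) = f_y(theta_i, c_i)
        y_logits = self.f_y(torch.cat([theta, c], dim=-1))
        return log_p_word, y_logits
```

Topic-covariate interactions, when enabled, would add a third deviation term whose rows are indexed by (topic, covariate) pairs.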

  31. Optimization: Stochastic optimization using mini-batches of documents. Tricks from Srivastava and Sutton, 2017: Adam optimizer with a high learning rate to bypass mode collapse; batch-norm layers to avoid divergence; annealing away from the batch-norm output to keep results interpretable.
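
One way to read the last trick, as a hedged sketch (the linear schedule and all names here are assumptions, not taken from the paper or its code): keep a batch-norm branch early in training and interpolate toward the un-normalized logits as training proceeds.

```python
# Sketch of annealing away from a batch-norm output (assumed linear schedule).
import torch
import torch.nn as nn

def annealed_logits(eta, bn_layer, step, total_steps):
    """Blend batch-normalized and raw logits; the batch-norm weight decays to zero."""
    w_bn = max(0.0, 1.0 - step / total_steps)    # 1.0 early in training, 0.0 at the end
    return w_bn * bn_layer(eta) + (1.0 - w_bn) * eta

# Example usage with hypothetical shapes:
bn = nn.BatchNorm1d(5000, affine=False)
eta = torch.randn(32, 5000)                      # unnormalized word logits for a mini-batch
logits = annealed_logits(eta, bn, step=100, total_steps=1000)
```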

  32–34. Output of Scholar
      B^(topic), B^(cov): coherent groupings of positive and negative deviations from the background (~ topics)
      f_μ, f_σ: encoder network mapping from words to topics: θ̂_i = softmax(f_e(words, c_i, y_i, ε))
      f_y: classifier mapping from θ̂_i to labels: ŷ = f_y(θ̂_i, c_i)
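
Since B^(topic) is a topics-by-vocabulary matrix of deviations, reading off topics amounts to sorting each row; a small illustrative helper (names assumed):

```python
# Illustrative: list the most positive (and most negative) words for each topic,
# given a deviation matrix B_topic of shape (n_topics, vocab_size).
import numpy as np

def top_words(B_topic, vocab, n=10):
    topics = []
    for row in np.asarray(B_topic):
        order = np.argsort(row)
        positive = [vocab[i] for i in order[::-1][:n]]   # largest positive deviations
        negative = [vocab[i] for i in order[:n]]         # largest negative deviations
        topics.append((positive, negative))
    return topics
```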

  35. Evaluation 1. Performance as a topic model, without metadata (perplexity, coherence) 2. Performance as a classifier, compared to SLDA 3. Exploratory data analysis
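
Coherence here is typically an internal NPMI-style score over each topic's top words; a minimal sketch of that computation, which may differ in detail from the exact variant reported in the paper:

```python
# Minimal NPMI topic-coherence sketch over binary document-word co-occurrence.
# `docs` is a list of token sets; `topic_words` is the list of one topic's top words.
import itertools
import math

def npmi_coherence(topic_words, docs, eps=1e-12):
    n_docs = len(docs)
    def p(*words):
        return sum(all(w in d for w in words) for d in docs) / n_docs
    scores = []
    for w1, w2 in itertools.combinations(topic_words, 2):
        p1, p2, p12 = p(w1), p(w2), p(w1, w2)
        if p12 == 0:
            scores.append(-1.0)                      # words never co-occur: minimum NPMI
            continue
        pmi = math.log(p12 / (p1 * p2 + eps))
        scores.append(pmi / (-math.log(p12) + eps))  # normalize PMI to [-1, 1]
    return sum(scores) / len(scores)
```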

  36–41. Quantitative results: basic model. [Bar charts of perplexity, coherence, and sparsity for LDA, SAGE, NVDM, Scholar, Scholar +wv, and Scholar +sparsity on the IMDB dataset (Maas, 2011).]

  42. Classification results. [Bar chart of accuracy for LR, SLDA, Scholar (labels), and Scholar (covariates) on the IMDB dataset (Maas, 2011).]

  43. Exploratory Data Analysis. Data: Media Frames Corpus (Card et al., 2015), a collection of thousands of news articles annotated in terms of tone and framing. Relevant metadata: year of publication, newspaper, etc.

  44. Tone as a label. [Figure: immigration topics (word groups such as "visas visa applications students citizenship", "labor jobs workers percent study wages", "arrested charged charges agents operation") plotted along p(pro-immigration | topic) from 0 to 1.]
