MiTextExplorer: Text Exploration using Linked Brushing and Mutual Information on Document Covariates
http://brenocon.com/te
Brendan O’Connor
Carnegie Mellon → UMass Amherst
June 2014 presentation, ILLVI WS at ACL
http://nlp.stanford.edu/events/illvi2014/
Monday, June 30, 14
How are X and Y related? (Anscombe 1973)

Four datasets, each with r = 0.82:

 x      y   |  x      y   |  x      y   |  x      y
10   9.14   | 10   8.04   |  8   6.58   | 10   7.46
 8   8.14   |  8   6.95   |  8   5.76   |  8   6.77
13   8.74   | 13   7.58   |  8   7.71   | 13  12.74
 9   8.77   |  9   8.81   |  8   8.84   |  9   7.11
11   9.26   | 11   8.33   |  8   8.47   | 11   7.81
14   8.10   | 14   9.96   |  8   7.04   | 14   8.84
 6   6.13   |  6   7.24   |  8   5.25   |  6   6.08
 4   3.10   |  4   4.26   | 19  12.50   |  4   5.39
12   9.13   | 12  10.84   |  8   5.56   | 12   8.15
 7   7.26   |  7   4.82   |  8   7.91   |  7   6.42
 5   4.74   |  5   5.68   |  8   6.89   |  5   5.73

[Figure: Anscombe's quartet, one scatterplot per dataset]
How are X and Y related? (Anscombe 1973)

Pearson correlation: r = 0.82
$$ r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}} $$
assumes $(x, y) \sim N(\mu, \Sigma)$

Scatterplot: simple, non-parametric
x = horizontal position, y = vertical position

Is there an analogue to the scatterplot, when text is a variable?
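The point of the slide can be checked numerically. The sketch below computes the Pearson correlation for the four Anscombe datasets using the values listed on the slides; the function name `pearson_r` and the `panel_*` labels are names chosen here for illustration, not anything from the talk.

```python
import math

# Anscombe's quartet as (x, y) pairs, in the order the slides list them.
# The panel_* keys are labels invented for this sketch.
QUARTET = {
    "panel_1": [(10, 9.14), (8, 8.14), (13, 8.74), (9, 8.77), (11, 9.26),
                (14, 8.10), (6, 6.13), (4, 3.10), (12, 9.13), (7, 7.26), (5, 4.74)],
    "panel_2": [(10, 8.04), (8, 6.95), (13, 7.58), (9, 8.81), (11, 8.33),
                (14, 9.96), (6, 7.24), (4, 4.26), (12, 10.84), (7, 4.82), (5, 5.68)],
    "panel_3": [(8, 6.58), (8, 5.76), (8, 7.71), (8, 8.84), (8, 8.47),
                (8, 7.04), (8, 5.25), (19, 12.50), (8, 5.56), (8, 7.91), (8, 6.89)],
    "panel_4": [(10, 7.46), (8, 6.77), (13, 12.74), (9, 7.11), (11, 7.81),
                (14, 8.84), (6, 6.08), (4, 5.39), (12, 8.15), (7, 6.42), (5, 5.73)],
}


def pearson_r(points):
    """Pearson correlation of a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    # Numerator: sum of products of deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # Denominator: product of the root sums of squared deviations.
    ss_x = sum((x - mean_x) ** 2 for x, _ in points)
    ss_y = sum((y - mean_y) ** 2 for _, y in points)
    return cov / (math.sqrt(ss_x) * math.sqrt(ss_y))


correlations = {name: pearson_r(points) for name, points in QUARTET.items()}
for name, r in correlations.items():
    print(f"{name}: r = {r:.2f}")
```

All four values round to 0.82, even though the scatterplots look nothing alike, which is why a single summary statistic is not a substitute for looking at the data.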
Linking and brushing
GGobi software (Cook and Swayne 2007, Buja et al. 1996, etc.)
Is there an analogue to linking/brushing, when text is a variable?
Text and document covariates
• X : Text
  • Discrete, high-dimensional (e.g. bag of words)
• Y : Document covariates (metadata)
  • Time, author attributes, social context, geography, community membership...
  • Discrete or continuous
  • Lower dimensional
• Goal is exploratory data analysis: first-cut insight into relationship(X,Y)
• Requirement: speed for interactivity
Demo
Linked views of the data
(A) Covariate display
(C) Covariate-word associations
(E) Keyword-in-context text display
[A] → [C]: words related to covariate query Q
Q selection: “brushing” (Scatterplot → Ranked list)
Rank w by $\frac{p(w \mid Q)}{p(w)}$, where $p(w \mid Q) \ge \text{TermProbThresh}$ and $\text{count}_Q(w) \ge \text{TermCountThresh}$
(Exponentiated) Pointwise Mutual Information (a.k.a. lift)
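A minimal sketch of this ranking, under stated assumptions: documents are token lists, the brushed selection Q is a set of document indices, and both p(w | Q) and p(w) are estimated as token-level relative frequencies. The slide does not say how the probabilities are estimated, so that choice, the function name `rank_by_lift`, the default threshold values, and the toy corpus are all hypothetical.

```python
from collections import Counter


def rank_by_lift(docs, selected, term_prob_thresh=0.0, term_count_thresh=1):
    """Rank words in the selected documents by p(w | Q) / p(w).

    docs: list of token lists; selected: set of indices into docs (the query Q).
    Probabilities are token-level relative frequencies (an assumption of this sketch).
    """
    corpus_counts = Counter(tok for doc in docs for tok in doc)
    query_counts = Counter(tok for i in selected for tok in docs[i])
    corpus_total = sum(corpus_counts.values())
    query_total = sum(query_counts.values())

    ranked = []
    for word, count_q in query_counts.items():
        p_w_given_q = count_q / query_total
        # Apply both thresholds from the slide before scoring.
        if p_w_given_q < term_prob_thresh or count_q < term_count_thresh:
            continue
        p_w = corpus_counts[word] / corpus_total
        ranked.append((word, p_w_given_q / p_w))

    # Highest lift first; ties broken alphabetically for a stable order.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


# Toy corpus: documents 0 and 1 play the role of the brushed selection Q.
docs = [
    ["rain", "snow", "cold"],
    ["snow", "ice", "cold"],
    ["sun", "beach", "warm"],
    ["sun", "rain", "warm"],
]
ranked = rank_by_lift(docs, selected={0, 1}, term_count_thresh=2)
print(ranked)
```

With no count threshold, "ice" would also get a lift of 2 from a single occurrence. Rare words easily reach high lift, which is presumably why the slide pairs the ratio with TermProbThresh and TermCountThresh.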
[C] → [D]: word-word associations
Rank v by $\frac{p(v \mid w \in \text{doc})}{p(v)}$
(Exponentiated) Pointwise Mutual Information (a.k.a. lift)
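The word-word ranking can be sketched the same way: condition on the documents that contain a pivot word w, then score every word v by how much more frequent it is there than in the corpus overall. As before, token-level relative frequencies, the function name `word_associations`, and the toy corpus are assumptions made for illustration.

```python
from collections import Counter


def word_associations(docs, pivot):
    """Rank words v by p(v | pivot in doc) / p(v).

    docs: list of token lists. Probabilities are token-level relative
    frequencies, which is an assumption of this sketch.
    """
    corpus_counts = Counter(tok for doc in docs for tok in doc)
    corpus_total = sum(corpus_counts.values())

    # Restrict to the documents that contain the pivot word.
    containing = [doc for doc in docs if pivot in doc]
    cond_counts = Counter(tok for doc in containing for tok in doc)
    cond_total = sum(cond_counts.values())

    ranked = [
        (v, (count / cond_total) / (corpus_counts[v] / corpus_total))
        for v, count in cond_counts.items()
    ]
    # Highest lift first; ties broken alphabetically for a stable order.
    ranked.sort(key=lambda pair: (-pair[1], pair[0]))
    return ranked


docs = [
    ["rain", "snow", "cold"],
    ["snow", "ice", "cold"],
    ["sun", "beach", "warm"],
    ["sun", "rain", "warm"],
]
associations = word_associations(docs, "sun")
print(associations)
```

In this toy corpus "rain" co-occurs with "sun" once but is equally common elsewhere, so its lift is 1 and it ranks last. The pivot word itself appears in the output; a real interface might filter it out.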
KWIC (keyword-in-context)
KWIC reveals word senses
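For readers unfamiliar with the display, here is a minimal sketch of a keyword-in-context view: each occurrence of a keyword is shown with a fixed window of tokens on either side. The function name `kwic`, the window size, and the example sentence are invented for illustration and are not taken from the tool.

```python
def kwic(tokens, keyword, window=3):
    """Return (left context, keyword, right context) for each keyword occurrence."""
    rows = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            rows.append((left, tok, right))
    return rows


# Toy sentence in which "bank" appears in two different senses.
tokens = "the river bank flooded while the bank raised interest rates".split()
rows = kwic(tokens, "bank")
for left, word, right in rows:
    # Right-align the left context so the keyword lines up in a column.
    print(f"{left:>20} | {word} | {right}")
```

Lining up the keyword in a column makes the two senses easy to tell apart from their neighbors ("river" versus "interest rates").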
Covariate -- word analysis: direct PMI -vs- topic model bottleneck
[Diagram: words, K topics, covariates]
Direct PMI:
• Feature selection
• Monroe et al. (2008)
Topic model bottleneck:
• p( text | covariates ): Dirichlet-Multinomial Regression, Author-Topic Model, Labeled LDA, Structural Topic Model ...
• p( text, covariates ): Supervised LDA, MedLDA, GeoTM ...
Related work: Text Exploration
• Voyant/Voyeur (Rockwell et al. 2010)
• WordSeer (Shrikumar 2013)
• Jigsaw (Görg et al. 2013)
• Topical Guide (Gardner et al. 2010)
• etc...
• Other uses [thanks to Molly Roberts]
  • Figure out NLP models and parameters (what should be a stopword?)
  • Select documents to read in an intelligent way (by covariates)
  • What variables to use in a model?
  • Identify coding (hand labeling) errors in the data
• Questions
  • Platform?
  • Interactive labeling?
Demo session today
Prototype available: http://brenocon.com/te