From Emotion Analysis and Topic Extraction to Narrative Modeling
Andreea Kremm, Mohammed Ibraaz Syed
About us
• Andreea Kremm
o Founder of Netex Group (www.netex.ai)
o M.Sc. Psychology (University of Roehampton, London)
o Research Interests: combining the power of AI, neural networks, and psychology for economic applications through Narrative Economics
• Mohammed Ibraaz Syed
o B.A. Economics, B.Sc. Mathematics (University of Maryland, College Park)
o Master’s in Applied Economics (UCLA)
o Research Interests: applying AI and machine learning to extract narratives from text
What are Narratives? How do Narratives affect the Economy?
"This past year has been the most difficult and painful year of my career. It was excruciating."
Elon Musk, New York Times interview, 08/16/2018
https://www.nytimes.com/2018/08/16/business/elon-musk-interview-tesla.html
Musk's Narrative effect on Tesla's stock: https://finance.yahoo.com/chart/TSLA
What is Narrative Economics?
How do Narratives spread?
• Kermack-McKendrick (1927) mathematical theory of disease epidemics
• SIR model: S = susceptible, I = infected, R = recovered, where N = S + I + R is assumed constant
• Powerful narratives spread, mutate, and propagate like a virus
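The SIR dynamics named on this slide can be sketched numerically. This is only a minimal illustration, not the presenters' implementation: the contagion rate `beta`, the recovery rate `gamma`, the step size `dt`, and all starting values are hypothetical, the infection term is written in the population-normalized form `beta * S * I / N`, and a simple forward-Euler step stands in for a proper ODE solver.

```python
def simulate_sir(s0, i0, r0, beta, gamma, days, dt=0.1):
    """Forward-Euler integration of the SIR equations.

    dS/dt = -beta * S * I / N
    dI/dt =  beta * S * I / N - gamma * I
    dR/dt =  gamma * I

    beta, gamma, and dt are illustrative parameters, not values from the talk.
    Returns a list of (time, S, I, R) tuples.
    """
    n = s0 + i0 + r0  # total population, constant by construction
    s, i, r = float(s0), float(i0), float(r0)
    history = [(0.0, s, i, r)]
    steps = int(round(days / dt))
    for k in range(1, steps + 1):
        new_infections = beta * s * i / n * dt  # S -> I flow in this step
        new_recoveries = gamma * i * dt         # I -> R flow in this step
        s -= new_infections
        i += new_infections - new_recoveries
        r += new_recoveries
        history.append((k * dt, s, i, r))
    return history


if __name__ == "__main__":
    # Hypothetical narrative: 10 initial "carriers" in a population of 1,000.
    run = simulate_sir(990, 10, 0, beta=0.5, gamma=0.1, days=100)
    peak_time, _, peak_infected, _ = max(run, key=lambda row: row[2])
    print(f"peak of {peak_infected:.0f} carriers around day {peak_time:.1f}")
```

Because every unit leaving S enters I and every unit leaving I enters R, the sum S + I + R stays equal to N at each step, which mirrors the constant-population assumption on the slide.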
The Structure of a Narrative
• The Plot (Overcoming the Monster, Rags to Riches, Voyage, Return, Comedy, Tragedy, Rebirth, etc.)
• The Characters (Hero, Villain, Maiden, King, etc.)
• Emotionally engaging
• Take-Away / Lesson, Call to Action
• A good story is easily remembered and gladly retold
Narrative Modeling Algorithm
• Analyze a Narrative:
o Emotion Analysis
o Entity-Relation Extraction
o Topic Extraction
o Subject Modeling
• Insert the results into an SIR disease-epidemic model equation
• Predict Narrative Spread and Economic Consequences
Emotion Analysis Showcase
• Task: recognize emotions in written English text
• Solution: Bi-LSTM trained as a classifier
• Resources:
o NRC-EmoLex (National Research Council Canada Word-Emotion Association Lexicon)
o Facebook's FastText
o Training dataset: 7,665 emotion-labeled sentences from the Association for the Advancement of Affective Computing (AAAC)
Methodology
Emotion Analysis Results
• Random accuracy: 20% (Baseline)
• Softmax (word counting) accuracy: 21%
• IBM Watson Tone Analyzer: 39%
• IBM Watson NLU: 58%
• Bi-LSTM with 128 LSTM cells in one layer: 66%
• Bi-LSTM with 32 LSTM cells in four layers: 71%
Visualizing Entity Embeddings
Challenges and Limitations
• Limited size of the training dataset
• Limited size of NRC-EmoLex
• Single-label emotions
• No subject modeling
• No information about the author's context
• Topic was disregarded
Topic Extraction Showcase
• Four Key Problems to Solve:
1. Where do we find an appropriate data set of narrative-rich text?
2. How do we pre-process the data to facilitate narrative extraction?
3. How do we estimate the number of narratives (topics)?
4. How do we estimate narrative similarity and model their evolution?
Selecting an Appropriate Dataset
• Politicians are often responsible for spreading narratives
o Press releases issued by politicians
o Politicians’ social media accounts
• Social media messages often lack context
• News data is often labeled with categories / topics and related issues
• Social media and news data can complement each other
Data and Pre-Processing
• Data sets selected (solution to 1st problem):
o White House Press Briefings from January 20, 2017 onwards
o Tweets by President Donald Trump from January 20, 2017 onwards
• Narrative extraction-specific pre-processing (solution to 2nd problem):
o Pre-existing labels incorporated into document strings
o Summaries of documents also added to their strings
• 2017 data divided into six 2-month time periods:
o Period 1: January – February
o Period 2: March – April
o Period 3: May – June
o Period 4: July – August
o Period 5: September – October
o Period 6: November – December
Methodology (1)
• Additional pre-processing
o Stop words removed
o Terms appearing in 90%+ of documents ignored
o Unigrams and bigrams considered
• Conversion into TFIDF matrix, to bring out the most important words
o Documents as rows, Words as columns
o Entries correspond to word counts in each document
o Entries of words occurring in multiple documents downweighted
o Different matrix for each time period
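The pre-processing and TFIDF steps on this slide can be sketched in plain Python. The slide does not name a library or an exact weighting formula, so everything specific here is an assumption: the tiny `STOP_WORDS` list, the choice to form bigrams after stop-word removal, the reading of "90%+" as a document frequency of at least 0.9, and the smoothed weight `ln((1 + n) / (1 + df)) + 1`, which is one common TFIDF variant.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list for illustration only.
STOP_WORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "on", "for"}


def tokenize(text):
    """Lowercase, drop stop words, and return unigrams followed by bigrams."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOP_WORDS]
    # Assumption: bigrams are built from the stop-word-filtered sequence.
    bigrams = [f"{a} {b}" for a, b in zip(words, words[1:])]
    return words + bigrams


def tfidf_matrix(documents, max_df=0.9):
    """Return (vocabulary, matrix) with documents as rows and terms as columns."""
    counts = [Counter(tokenize(doc)) for doc in documents]
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    doc_freq = Counter(term for c in counts for term in c)
    # Ignore terms that appear in 90% or more of the documents.
    vocab = sorted(t for t, df in doc_freq.items() if df / n_docs < max_df)
    # Terms spread over many documents get a smaller weight.
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    matrix = [[c[t] * idf[t] for t in vocab] for c in counts]
    return vocab, matrix


if __name__ == "__main__":
    docs = [
        "the supreme court nomination",
        "the supreme court hearing",
        "hurricane relief for the court",
    ]
    vocab, matrix = tfidf_matrix(docs)
    print(vocab)
```

In this toy corpus "court" occurs in every document and is therefore dropped, while "nomination" (one document) ends up with a larger weight than "supreme" (two documents). One such matrix would be built per time period.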
Methodology (2)
• Hierarchical (agglomerative) clustering algorithm (solution to 3rd problem):
o HAC used on each of the 6 TFIDF matrices
o Linkage criterion: Ward’s method (minimizes variance of new clusters)
o Cut-off of 70% of final merge used to estimate optimal number of clusters
• Output: [figure omitted]
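The cluster-count estimate described on this slide can be sketched as follows. This is an illustration under stated assumptions rather than the presenters' code: "70% of final merge" is read here as a threshold of 0.7 times the height of the last merge (the same rule SciPy's `dendrogram` uses for its default colour threshold), the Ward merge height is taken as `sqrt(2 * n_a * n_b / (n_a + n_b)) * ||c_a - c_b||`, and the naive cubic-time loop is only suitable for small inputs. The function names and the sample points are hypothetical.

```python
import math


def ward_merge_heights(points):
    """Naive agglomerative clustering with Ward's criterion.

    Returns the list of merge heights in the order the merges happen.
    """
    clusters = [(list(p), 1) for p in points]  # each cluster: (centroid, size)
    heights = []
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                (ca, na), (cb, nb) = clusters[a], clusters[b]
                sq = sum((x - y) ** 2 for x, y in zip(ca, cb))
                # Ward cost of merging a and b (variance increase, rescaled).
                cost = math.sqrt(2.0 * na * nb / (na + nb) * sq)
                if best is None or cost < best[0]:
                    best = (cost, a, b)
        cost, a, b = best
        (ca, na), (cb, nb) = clusters[a], clusters[b]
        centroid = [(na * x + nb * y) / (na + nb) for x, y in zip(ca, cb)]
        clusters = [c for k, c in enumerate(clusters) if k not in (a, b)]
        clusters.append((centroid, na + nb))
        heights.append(cost)
    return heights


def estimate_n_clusters(heights, fraction=0.7):
    """Count the clusters left when the tree is cut at fraction * final merge height."""
    threshold = fraction * max(heights)
    # Every merge above the threshold is undone by the cut, adding one cluster.
    return 1 + sum(1 for h in heights if h > threshold)


if __name__ == "__main__":
    # Hypothetical 2-D points standing in for rows of a TFIDF matrix.
    pts = [(0, 0), (0, 1), (10, 0), (10, 1), (20, 0), (20, 1)]
    print(estimate_n_clusters(ward_merge_heights(pts)))
```

The estimated count from this step is what would then be passed on as the number of topics for each time period.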
Methodology (3)
• Hierarchical clustering thresholds: the number of clusters increases non-linearly as the threshold changes
[Figure panels: 2 clusters, 5 clusters, 16 clusters]
Methodology (4)
• Latent Dirichlet Allocation (LDA) algorithm used to extract topics
o Each topic comes with probabilities of generating particular words
o Used separately for each time period (6 times total)
o Cutoff from hierarchical clustering used to determine # of topics
• Sample Outputs of LDA:
o Supreme Court Nomination topic: [figure omitted]
o Federal Emergency topic: [figure omitted]
Methodology (5)
• Dissimilarity / distance measures can be used to compare:
o Two points in space (straight line)
o Two points on a sphere (great circle distance)
o Even two probability distributions
• Hellinger Distance used to compare topics (solution to 4th problem):
o Can effectively determine similar topics
o Can be applied to track topic evolution over time
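The Hellinger distance between two discrete distributions P and Q is H(P, Q) = sqrt(sum_i (sqrt(p_i) - sqrt(q_i))^2) / sqrt(2), which ranges from 0 (identical) to 1 (no overlap). The sketch below shows how it could be used to match a topic from one period to its closest counterpart in the next. It assumes both topics are word-probability vectors over the same, identically ordered vocabulary; the helper `most_similar_topic` and the sample numbers are hypothetical.

```python
import math


def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions.

    Both inputs must list probabilities for the same words in the same order.
    """
    if len(p) != len(q):
        raise ValueError("distributions must cover the same vocabulary")
    total = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(total / 2.0)


def most_similar_topic(topic, candidates):
    """Return (index, distance) of the candidate topic closest to `topic`."""
    distances = [hellinger(topic, c) for c in candidates]
    best = min(range(len(candidates)), key=distances.__getitem__)
    return best, distances[best]


if __name__ == "__main__":
    # Hypothetical word distributions over a 4-word shared vocabulary.
    topic_period_1 = [0.6, 0.3, 0.1, 0.0]
    topics_period_2 = [
        [0.0, 0.1, 0.2, 0.7],
        [0.5, 0.4, 0.1, 0.0],
    ]
    print(most_similar_topic(topic_period_1, topics_period_2))
```

Chaining such nearest-topic matches from one time period to the next is one simple way to follow a topic as its vocabulary shifts.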
Key Results (1)
• Estimated # of clusters from HAC led to coherent topics
• Similar topics (through Hellinger distance) could be compared over time to track topic evolution:
[Figure panels: Time period 1 (January & February, 2017); Time period 2 (March & April, 2017); Time period 3 (May & June, 2017)]
Key Results (2)
• A “Make America Great” topic was generally the most common
• Discovered the Supreme Court nomination process as a major topic in early 2017
• Criticism of the media was a major topic through multiple time periods
• Model was able to distinguish between unique topics:
o Various foreign policy topics
o Natural disasters: Hurricanes Harvey (Aug. 2017) & Maria (Sep. 2017)
Conclusions and Limitations
• Different tools can be effectively combined to model narratives
• Can generate quantitative data on narratives and their evolution
• Narrative Economics is a very young field
• Various avenues for future research:
o New data sources
o Alternate pre-processing methods
o Different thresholds / time intervals / other parameter tuning
Future Research
• Analyze a Narrative:
o Emotion Analysis
o Entity-Relation Extraction
o Topic Extraction
o Subject Modeling
• Insert the results into an SIR disease-epidemic model equation
• Predict Narrative Spread and Economic Consequences
Acknowledgments
www.narrativeeconomics.com
• Naveed Ghaffar, co-founder Narrative Economics (naveedgh@gmail.com)
• Dr. Rashed Iqbal, co-founder Narrative Economics (rashed_iqbal@econ.ucla.edu)
Thank you for listening! Any Questions?
Get in touch:
• Mohammed Ibraaz Syed: ibraaz@g.ucla.edu
• Andreea Kremm: kremm@netex.ai