Detecting and Characterizing Events Unsupervised Machine Learning for Social Science Allison J.B. Chaney Princeton University
supervised learning vs. unsupervised learning. methods: input, output. goals: prediction, exploration. [Figure: grid of example digits, one marked "?".] 2 / 50
Examples of Unsupervised ML for Social Science: Recommendation Systems
• A Probabilistic Model for Using Social Networks in Personalized Item Recommendation. Chaney, Blei, and Eliassi-Rad. RecSys, 2015.
• A Large-scale Exploration of Group Viewing Patterns. Chaney, Gartrell, Hofman, Guiver, Koenigstein, Kohli, and Paquet. TVX, 2014.
• How Algorithmic Confounding in Recommendation Systems Increases Homogeneity and Decreases Utility. Chaney, Stewart, and Engelhardt. arXiv, 2017.
3 / 50
Examples of Unsupervised ML for Social Science: Text Analysis / Topic Models
• Visualizing topic models. Chaney and Blei. ICWSM, 2012.
• The Power of Aggregation for Topic Models Used For Measurement. Chaney, Shiraito, and Stewart. Text as Data, 2017.
• Detecting and Characterizing Events. Chaney, Wallach, Blei, and Connelly. EMNLP, 2016.
4 / 50
Why are events important? 5 / 50
Our Task: Given a huge corpus of primary source documents, identify events of potential interest and characterize them with relevant words and sources. (Or, make life easier for historians dealing with millions of primary source documents.) 6 / 50
Matthew Connelly’s History Lab at Columbia: U.S. State Department Cables. Message content; date sent; authoring entity; … 7 / 50
Matthew Connelly’s History Lab at Columbia: U.S. State Department Cables. 2,674,486 messages sent between 1973 and 1978; 34,204 unique sending entities. 8 / 50
What is an event?
• Event detection from time series data. Guralnik and Srivastava. KDD, 1999.
• Earthquake shakes Twitter users: real-time event detection by social sensors. Sakaki, et al. WWW, 2010.
• A study of retrospective and on-line event detection. Yang, et al. SIGIR, 1998.
• Text classification and named entities for new event detection. Kumaran and Allan. SIGIR, 2004.
• A novel burst-based text representation model for scalable event detection. Zhao, et al. ACL, 2012.
• Leadline: Interactive visual analysis of text data through event identification and exploration. Dou, et al. VAST, 2012.
9 / 50
What is an event? change point location cluster of sources temporary deviation from business-as-usual 10 / 50
Mayaguez Incident
[Figure: count over time, 1974 to 1979 (y-axis: count, 0 to 200), annotated with three example cables.]
HONG KONG → BANGKOK on 5/12/1975: U.S. FLAG VESSEL IN DISTRESS 1. HONG KONG REP OF SEALAND ORIENT HAS ADVISED ORIG THAT THEIR VESSEL SS MAYAGUEZ IS REPORTED UNDER FIRE AND IN DISTRESS IN GULF OF THAILAND. LAST POSIT 102PT53E 94PT80N TIME OF POSIT UNKNOWN BUT EST BY SEALAND TO BE APPROX 1800 LOCAL. 2. SEALAND REP IS NOT SURE OF TYPE OF FIRE, I.E. SMALL ARMS OR MED CALIBER AND DOES NOT KNOW IF OTHER SHIPS ARE INVOLVED. REP DOES STATE THAT […] NOT LIKELY TO PANIC.
STATE → BANGKOK on 5/13/1975: SS MAYAGUEZ FOR MASTERS FROM ZURHELLEN REGRET FAST MOVING SITUATION HERE HAS MADE IT IMPOSSIBLE TO KEEP YOU FULLY INFORMED AS WE WOULD OTHERWISE INTEND. MATTERS YOU RAISE ARE CURRENTLY UNDER DISCUSSION AND WE HOPE TO HAVE WORD FOR YOU SOON. MEANTIME, PLEASE DO NOT, REPEAT NOT, RAISE THIS MATTER FURTHER WITH THAIS. INGERSOLL
MEXICO → STATE on 5/16/1975: REACTION TO AMAYAGUEZ INCIDENT 1. MOST MEXICO CITY NEWSPAPERS GAVE LEAD TREATMENT TO MAYAGUEZ INCIDENT IN MAY 15 EDITIONS. REPORTS, BASED ON WIRE SERVICE DESPATCHES, WERE MOSTLY PLAYED STRAIGHT ALTHOUGH HEADLINE WRITERS GAVE REIN TO USUAL EDITORIAL BIASES TO CONVEY MORE OR LESS UNFAVORABLE IMPRESSION OF U.S. ACTION (E.G., INFLUENTIAL MODERATE-LEFT EXCELSIOR SUGGESTED MILITARY ACTIONS OCCURRED AFTER PHNOM PENH HAD ALREADY ANNOUNCED IT WAS FREEING MAYAGUEZ AND ITS CREW; LEFT-LEANING
11 / 50
Carnation Revolution [Figure: count over time, 1973 to 1979 (y-axis: count, 0 to 400), one series per word: coup, portugal, portuguese.] 12 / 50
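The per-word counts plotted on this slide can be approximated with a simple tally. The following is a minimal sketch, assuming messages are available as (date sent, text) pairs; the toy messages and the `monthly_counts` helper are hypothetical and are not taken from the cables corpus or the authors' code.

```python
from collections import Counter
from datetime import date

# Hypothetical toy messages: (date sent, message text). Not real cables.
messages = [
    (date(1974, 4, 24), "ROUTINE VISA MATTERS"),
    (date(1974, 4, 25), "COUP IN PORTUGAL REPORTED"),
    (date(1974, 4, 26), "PORTUGUESE MILITARY COUP: PORTUGAL SITUATION"),
    (date(1974, 5, 2), "PORTUGAL CABINET"),
]


def monthly_counts(messages, word):
    """Count occurrences of `word` per (year, month), case-insensitively."""
    counts = Counter()
    for sent, text in messages:
        # Crude tokenization (an assumption): lowercase, drop colons, split on whitespace.
        tokens = text.lower().replace(":", " ").split()
        counts[(sent.year, sent.month)] += tokens.count(word.lower())
    return counts


# One time series per word, keyed by (year, month).
counts = {w: monthly_counts(messages, w) for w in ("coup", "portugal", "portuguese")}
```

A short-lived spike in such a series is the kind of pattern the slide illustrates; a raw tally like this does not separate an event from ordinary fluctuations in message volume.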
common themes; authoring entities; events; messages [Figure: the three example Mayaguez cables: HONG KONG → BANGKOK on 5/12/1975, STATE → BANGKOK on 5/13/1975, MEXICO → STATE on 5/16/1975.] 13 / 50
modeling messages. Example topic words: MILITARY, FORCE, OFFICER, ARMY, DEFENSE, NAVY, … Latent Dirichlet allocation. Blei, Ng, and Jordan, 2003. 14 / 50
modeling messages. Example topic words: GOVERNMENT, PRESIDENT, NATIONAL, MINISTER, …, ARMY, … Latent Dirichlet allocation. Blei, Ng, and Jordan, 2003. 14 / 50
modeling messages
Topics (rows) over vocabulary words (columns):
0.1 0.3 0.1 0.0 0.5 0.0
0.7 0.0 0.1 0.2 0.0 0.0
0.0 0.1 0.2 0.4 0.0 0.3
$\beta_k \sim \text{Dirichlet}(\cdot)$
Latent Dirichlet allocation. Blei, Ng, and Jordan, 2003. GaP: A Factor Model for Discrete Data. Canny, 2004. 15 / 50
modeling messages
[Figure: messages × topics matrix θ next to the topics × vocabulary words matrix β.]
$\beta_k \sim \text{Dirichlet}(\cdot)$
$\theta_{dk} \sim \text{Gamma}(\cdot)$
Latent Dirichlet allocation. Blei, Ng, and Jordan, 2003. GaP: A Factor Model for Discrete Data. Canny, 2004. 16 / 50
modeling messages
[Figure: messages × topics matrix θ, topics × vocabulary words matrix β, and messages × vocabulary words count matrix n.]
$\beta_k \sim \text{Dirichlet}(\cdot)$
$\theta_{dk} \sim \text{Gamma}(\cdot)$
$n_{dv} \sim \text{Poisson}(\cdots)$
Latent Dirichlet allocation. Blei, Ng, and Jordan, 2003. GaP: A Factor Model for Discrete Data. Canny, 2004. 17 / 50
modeling messages
[Figure: messages × topics matrix θ, topics × vocabulary words matrix β, and messages × vocabulary words count matrix n.]
$\beta_k \sim \text{Dirichlet}(\cdot)$
$\theta_{dk} \sim \text{Gamma}(\cdot)$
$n_{dv} \sim \text{Poisson}\left(\sum_{k=1}^{K} \theta_{dk} \beta_{kv}\right)$
Latent Dirichlet allocation. Blei, Ng, and Jordan, 2003. GaP: A Factor Model for Discrete Data. Canny, 2004. 18 / 50
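To make the generative model on this slide concrete, the sketch below draws topics from a Dirichlet, message-topic weights from a Gamma, and word counts from a Poisson whose rate is the sum over topics of θ times β. The toy sizes, the seed, and the hyperparameter values (0.5 and 1.0) are assumptions chosen for illustration; the slide leaves the Dirichlet and Gamma parameters unspecified.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes (assumptions): messages D, topics K, vocabulary words V.
D, K, V = 6, 3, 6

# beta[k] ~ Dirichlet(.): each topic is a distribution over vocabulary words.
# The concentration 0.5 is an assumed value.
beta = rng.dirichlet(np.full(V, 0.5), size=K)  # shape (K, V)

# theta[d, k] ~ Gamma(.): nonnegative weight of topic k in message d.
# The shape 0.5 and scale 1.0 are assumed values.
theta = rng.gamma(shape=0.5, scale=1.0, size=(D, K))  # shape (D, K)

# n[d, v] ~ Poisson(sum_k theta[d, k] * beta[k, v])
rate = theta @ beta  # shape (D, V)
n = rng.poisson(rate)  # word counts per message
```

Each row of `beta` sums to one, while rows of `theta` need not, so the sum of a row of `theta` sets the expected length of that message.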