Why Is It Difficult to Detect Outbreaks in Twitter?
Avaré Stewart, Nattiya Kanhabua, Sara Romano, Ernesto Diaz-Aviles, Wolf Siberski, and Wolfgang Nejdl
L3S Research Center / Leibniz Universität Hannover, Germany
SIGIR 2013 Workshop on Health Search and Discovery, 1 August 2013, Dublin, Ireland
Motivation • Numerous works use Twitter to infer the existence and magnitude of real-world events in real-time – Earthquake [Sakaki et al., 2010] – Predicting financial time series [Ruiz et al., 2012] – Influenza epidemics [Culotta, 2010; Lampos et al., 2011; Paul et al., 2011]
Early Warnings
Health related tweets • User status updates or news related to public health are common in Twitter – I have the mumps...am I alone? – my baby girl has a Gastroenteritis so great!! Please do not give it to meee – #Cholera breaks out in #Dadaab refugee camp in #Kenya http://t.co/.... – As many as 16 people have been found infected with Anthrax in Shahjadpur upazila of the Sirajganj district in Bangladesh.
Matching Tweets [Kanhabua et al., CIKM’12]
Twitter vs. Official Source
M-Eco System Medical Ecosystem: Personalized Event-based Surveillance http://www.meco-project.eu/
Data Collection • Official outbreak reports – ~3,000 ProMED-mail reports from 2011 – WHO reports have very small coverage • Twitter data – ~1,200 health-related terms (i.e., infectious diseases, their synonyms, pathogens and symptoms) – Over 112 million tweets from 2011 • Series of NLP tools including – OpenNLP (tokenization, sentence splitting, POS tagging) – OpenCalais (named entity recognition) – HeidelTime (temporal expression extraction)
Ground Truths [Kanhabua et al., TAIA’ 12]
Event Extraction • An event is a sentence containing two entities – (1) medical condition and (2) geographic expression – A minimum requirement by domain experts • A victim and the time of an event can be identified from the sentence itself, or its surrounding context • Output: a set of event candidates • Example: “Reported by World Health Organization (WHO) on 29 July 2012 about an ongoing Ebola outbreak in Uganda since the beginning of July 2012” [Kanhabua et al., TAIA’ 12]
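The two-entity rule above can be illustrated with a minimal sketch. It is not the M-Eco pipeline: the lexicons `MEDICAL_CONDITIONS` and `LOCATIONS`, the naive sentence splitter, and the function names are hypothetical stand-ins for the medical vocabulary and NLP tools (sentence splitting, named entity recognition) a real system would use.

```python
import re

# Hypothetical toy lexicons (assumption); a real system would use a medical
# vocabulary and a named entity recognizer for these two entity types.
MEDICAL_CONDITIONS = {"ebola", "cholera", "anthrax", "mumps"}
LOCATIONS = {"uganda", "kenya", "bangladesh", "dadaab"}


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def extract_event_candidates(text):
    """Return sentences that mention at least one condition and one location."""
    candidates = []
    for sentence in split_sentences(text):
        tokens = set(re.findall(r"[a-z]+", sentence.lower()))
        conditions = sorted(tokens & MEDICAL_CONDITIONS)
        locations = sorted(tokens & LOCATIONS)
        # Minimum requirement: both entity types occur in the same sentence.
        if conditions and locations:
            candidates.append(
                {
                    "sentence": sentence,
                    "conditions": conditions,
                    "locations": locations,
                }
            )
    return candidates


if __name__ == "__main__":
    text = "WHO reported an ongoing Ebola outbreak in Uganda. I have the mumps."
    for event in extract_event_candidates(text):
        print(event)
```

In this sketch the second sentence is dropped because it names a condition but no location. Victim and time extraction from the surrounding context is deliberately left out.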
Message Filtering: Challenges • Ambiguity – having several meanings – used in different contexts • Incompleteness – missing or under-reported events – data processing errors
Message Filtering: Challenges • Ambiguity: example tweets by category
– Literature: A two hour train journey, Love In the Time of Cholera ...
– Music: Dengue Fever’s “Uku,” Mixed by Paul Dreux Smith Universal Audio...
– Marketing: Exclusive distributor of high quality #HIV/AIDS Blood & Urine and #Hepatitis #Self-testers.
– General: Identification of genotype 4 Hepatitis E virus binding proteins on swine liver cells: Hepatitis E virus...
– Negative: i dont have sniffles and no real coughing..well its coughing but not like an influenza cough.
– Joke: Thought I had Bieber Fever. Ends up I just had a combo of the mumps, mono, measles & the hershey squ...
Challenge I. Noisy/evolving • Evolving data – Relevant features change over time
Challenge I. Noisy/evolving
Approach for Noisy Data • MedISys 1 – providing a list of negative keywords created by medical experts • Urban Dictionary 2 – a Web-based dictionary of slang, ethnic culture words or phrases 1 http://medusa.jrc.it/medisys/homeedition/en/home.html 2 http://www.urbandictionary.com/
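One simple way to apply such negative keyword lists is to drop any tweet that contains a listed phrase. The sketch below assumes this substring-matching strategy; the phrases in `NEGATIVE_PHRASES` are invented for illustration and are not taken from MedISys or Urban Dictionary, and the function names are hypothetical.

```python
import re

# Hypothetical negative phrases (assumption), NOT the actual MedISys or
# Urban Dictionary entries.
NEGATIVE_PHRASES = {"bieber fever", "love in the time of cholera", "mixed by"}


def normalize(text):
    """Lowercase and collapse runs of whitespace to single spaces."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def is_noise(tweet, negative_phrases=NEGATIVE_PHRASES):
    """True if the tweet contains any negative phrase as a substring."""
    normalized = normalize(tweet)
    return any(phrase in normalized for phrase in negative_phrases)


def filter_tweets(tweets, negative_phrases=NEGATIVE_PHRASES):
    """Keep only tweets that match no negative phrase."""
    return [t for t in tweets if not is_noise(t, negative_phrases)]


if __name__ == "__main__":
    tweets = [
        "Thought I had Bieber Fever.",
        "#Cholera breaks out in #Dadaab refugee camp in #Kenya",
        "A two hour train journey, Love In the Time of Cholera ...",
    ]
    print(filter_tweets(tweets))
```

A static list like this cannot follow evolving vocabulary, which is why the slides treat feature change as a separate problem.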
[Kanhabua and Nejdl, WOW’ 13]
Approach for Feature Changes
Signal Generation: Challenges • Temporal Dynamics – seasonal infectious diseases – rare and spontaneous outbreaks • Location Dynamics – frequency and duration – levels of prevalence or severity
Signal Generation: Challenges [Rortais et al., 2010 in Journal of Food Research International]
Signal Generation: Challenges [Emch et al., 2008 in International Journal of Health Geographics]
Outbreak Categorization
Outbreak Categorization How to generate a reliable signal for low aggregate counts?
Approach [Kanhabua and Nejdl, WOW’ 13]
Temporal Diversity • Refined Jaccard Index (RDJ-index) – average Jaccard similarity of all object pairs – computed over (1) top-k terms and (2) entities

$$\mathrm{RDJ} = \frac{2}{n(n-1)} \sum_{1 \le i < j \le n} \mathrm{JS}(O_i, O_j), \qquad \mathrm{JS}(O_i, O_j) = \frac{|O_i \cap O_j|}{|O_i \cup O_j|}$$

• Note: lower RDJ corresponds to higher diversity • Problem: “All-Pair comparison” • Solution: Estimation algorithms with probabilistic error bound guarantees [Deng et al., CIKM’ 12]
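To make the definition concrete, the following sketch computes the RDJ-index exactly by comparing all pairs. It is the naive quadratic baseline that the slide calls the "All-Pair comparison" problem, not the estimation algorithms of Deng et al. The function names and the convention that two empty objects have similarity 0 are assumptions.

```python
from itertools import combinations


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two collections of items."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumption: two empty objects count as dissimilar
    return len(a & b) / len(union)


def rdj_index(objects):
    """Average Jaccard similarity over all unordered pairs of objects.

    Exact all-pair computation, O(n^2) in the number of objects.
    Lower values correspond to higher diversity.
    """
    objects = list(objects)
    n = len(objects)
    if n < 2:
        raise ValueError("RDJ needs at least two objects")
    total = sum(jaccard(a, b) for a, b in combinations(objects, 2))
    return 2.0 * total / (n * (n - 1))


if __name__ == "__main__":
    # Each object is the set of terms (or entities) representing one tweet.
    tweets = [
        {"cholera", "kenya"},
        {"cholera", "kenya"},
        {"flu"},
    ]
    print(rdj_index(tweets))
```

Here only one of the three pairs overlaps (fully), so the index is one third; a set of identical objects would give 1, the least diverse case.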
Threat Assessment: Challenge • Users are overwhelmed by the large number of tweets
Approach • Personalized Tweet Ranking for Epidemic Intelligence – Learning to rank and recommender systems – User's context as implicit criteria for recommendation [Diaz-Aviles et al., WWW’ 12, Diaz-Aviles et al., ICWSM’ 12]
Approach
Signal Search Prototype
Future Work • Real-Time Analysis of Big and Fast Social Web Streams – Scalable, efficient methods for filtering and generating signals in real-time – Effective methods for aggregating and visualizing information in a meaningful way
Thank you! kanhabua@L3S.de
References
• [Culotta, 2010] A. Culotta. Towards detecting influenza epidemics by analyzing twitter messages. In Proceedings of the First Workshop on Social Media Analytics (SOMA’2010), 2010.
• [Diaz-Aviles et al., 2012a] E. Diaz-Aviles, A. Stewart, E. Velasco, K. Denecke, and W. Nejdl. Towards personalized learning to rank for epidemic intelligence based on social media streams. In Proceedings of the 21st World Wide Web Conference (WWW’2012), 2012.
• [Diaz-Aviles et al., 2012b] E. Diaz-Aviles, A. Stewart, E. Velasco, K. Denecke, and W. Nejdl. Epidemic intelligence for the crowd, by the crowd. In Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM’2012), 2012.
• [Kanhabua et al., 2012a] N. Kanhabua, S. Romano, and A. Stewart. Identifying Relevant Temporal Expressions for Real-world Events. In SIGIR 2012 Workshop on Time-aware Information Access (TAIA’2012), 2012.
• [Kanhabua et al., 2012b] N. Kanhabua, S. Romano, A. Stewart, and W. Nejdl. Supporting Temporal Analytics for Health Related Events in Microblogs. In Proceedings of CIKM’2012, 2012.
• [Kanhabua and Nejdl, 2013] N. Kanhabua and W. Nejdl. Understanding the Diversity of Tweets in the Time of Outbreaks. In Proceedings of the First International Web Observatory Workshop (WOW’2013) at WWW’2013, 2013.
• [Lampos et al., 2011] V. Lampos and N. Cristianini. Nowcasting events from the social web with statistical learning. ACM TIST, 3, 2011.
• [Paul et al., 2011] M. J. Paul and M. Dredze. You are what you tweet: Analyzing twitter for public health. In Proceedings of International AAAI Conference on Weblogs and Social Media (ICWSM’2011), 2011.
• [Ruiz et al., 2012] E. J. Ruiz, V. Hristidis, C. Castillo, A. Gionis, and A. Jaimes. Correlating financial time series with micro-blogging activity. In Proceedings of WSDM’2012, 2012.
• [Sakaki et al., 2010] T. Sakaki, M. Okazaki, and Y. Matsuo. Earthquake shakes twitter users: real-time event detection by social sensors. In Proceedings of WWW’2010, 2010.