Overview of the NSF-CDI project (Year-3) and Research Progress

Mapping Ideas from Cyberspace to Realspace. Funded by the NSF Cyber-Enabled Discovery and Innovation (CDI) program, Award #1028177 (2010-2014). http://mappingideas.sdsu.edu/


  1. Mapping Ideas from Cyberspace to Realspace. Funded by the NSF Cyber-Enabled Discovery and Innovation (CDI) program, Award #1028177 (2010-2014). http://mappingideas.sdsu.edu/ Overview of the NSF-CDI project (Year-3) and Research Progress. Principal Investigator: Dr. Ming-Hsiang (Ming) Tsou, mtsou@mail.sdsu.edu, Professor (Geography), San Diego State University, USA. Co-PIs: Dr. Dipak K. Gupta (Political Science), Dr. Jean Mark Gawron (Linguistics), Dr. Brian Spitzberg (Communication), Dr. Li An (Geography).

  2. Starting Date: October 1, 2010 (four years, $1.38M total).
     • Goal 1: Establish a new multidisciplinary research framework to represent the spatiotemporal diffusion of ideas and the semantic web on the Internet.
     • Goal 2: Create effective visualization and analysis methods for the dynamic geospatial information landscape with three selected topics (e.g. natural disasters, continuous threats for human beings, and radical social movements).
     • Goal 3: Build domain-specific ontology, citation, and (provocative) event knowledge bases with thesaurus and citation networks for the three selected topics and their Semantic Webs.
     • Goal 4: Develop theoretical model(s) capable of integrating the individual (semantic usage, online motivations) and societal (diffusion) motives and practices associated with the spatiotemporal diffusion of ideas.

  3. Goal 1: Establish a new multidisciplinary research framework: Knowledge Discovery in Cyberspace (KDC). Similar to the multidisciplinary research field called "knowledge discovery in databases (KDD)" (Fayyad et al. 1996), this emerging research field, knowledge discovery in cyberspace (KDC), will focus on how to handle and analyze very large volumes of information and human messages collected from cyberspace and social media. The purpose of KDC is to scale up our research capability to handle the millions of records and information items available in social media (such as Twitter) or web pages (searched by the Google, Yahoo, or Bing search engines). (Cited from: Ming-Hsiang Tsou & Michael Leitner (2013): Visualization of social media: seeing a mirage or a message?, Cartography and Geographic Information Science, 40:2, 55-60.)

  4. The Uniqueness of KDC: a human-centered Triangular Knowledge Base with three interdependent elements.
     • Place (scale, space, context): San Diego, New York, 92119, SDSU, bus stops, Sea World…
     • Time (dynamic): August 23, 2012 (snapshot), one week, two months, before / after, etc.
     • Messages (content / function: who, what, how, media): tweets, web pages, emails, short messages

  5. KDC: Knowledge Discovery in Cyberspace (7 steps). Each step turns one data state into the next.
     • Cyberspace: Social Media (Twitter, Facebook, Flickr, YouTube); Web Pages, Weblogs, News, RSS, Emails, etc.
     • Step 1, Selection (research focus), produces the Target Data: Tweets (keywords, regions, API types); Web Pages (keywords, web search engines).
     • Step 2, Collection (tools, APIs), produces the Collected Data: SQL databases (tweet contents); Excel files (web search results).
     • Step 3, Preprocessing (reduce noise, data cleaning, select regions, time scale/series ???), produces the Preprocessed Data: SQL-output tweets with errors and duplicates removed; geocoded Excel files with lat/long added and improved geolocation results.
     • Step 4, Transformation (mapping + graphs), produces the Transformed Data: graphics, bar charts, word clouds, etc.; original point maps (each point represents one web page or one tweet).
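
The preprocessing step ("remove errors and duplicated") could be sketched roughly as follows. This is a minimal illustration, not the project's actual pipeline: the row layout (`id`, `text` keys), the function name `preprocess_tweets`, and the choice to treat identical normalized text as a duplicate are all assumptions made for the example.

```python
def preprocess_tweets(rows):
    """Drop malformed rows and duplicate tweets.

    Assumptions (not from the slides): each row is a dict with "id" and
    "text" keys; a row without an id or with empty text counts as an error;
    two rows are duplicates if they share an id or the same normalized text.
    """
    seen_ids = set()
    seen_text = set()
    clean = []
    for row in rows:
        tweet_id = row.get("id")
        text = (row.get("text") or "").strip()
        if tweet_id is None or not text:
            continue  # error record: skip it
        key = " ".join(text.lower().split())  # case- and whitespace-insensitive
        if tweet_id in seen_ids or key in seen_text:
            continue  # duplicate: skip it
        seen_ids.add(tweet_id)
        seen_text.add(key)
        clean.append(row)
    return clean
```

Whether repeated text (for example retweets) should count as duplication depends on the research question; that rule is the easiest part of the sketch to swap out.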

  6. All seven steps are systematic, algorithm-based procedures (continued from last page).
     • Transformed Data: graphics, bar charts, word clouds, etc.; original point maps (each point represents one web page or one tweet).
     • Step 5, Explore/Compare Methods (select algorithms), produces the Visualized Data: kernel density maps, differential KD maps, point density maps; Excel files (web search results).
     • Step 6, Information Mining (analyze space-time-information relationships), produces Pattern Recognition: decision trees and rules; nonlinear regression and classification methods; example-based methods (nearest-neighbor classification); probabilistic graphic dependency models; relational learning models.
     • Step 7, Interpretation / Evaluation, produces Knowledge Formalization: verification (city mayor maps, movie tweets); discovery; prediction (election); description (outbreaks, election).
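
To make the "kernel density maps" and "differential KD maps" in step 5 concrete, here is a small sketch. The slides do not define how the differential map is computed, so this example assumes it is the cell-by-cell difference of two normalized Gaussian kernel density surfaces. It also assumes planar (projected) coordinates, and the function names are hypothetical.

```python
import math


def kernel_density(points, grid, bandwidth):
    """Gaussian kernel density at each grid location, normalized by point count.

    points and grid are lists of (x, y) pairs in the same planar units as
    bandwidth. points must be non-empty.
    """
    norm = len(points) * 2 * math.pi * bandwidth ** 2
    densities = []
    for gx, gy in grid:
        total = 0.0
        for px, py in points:
            d2 = (gx - px) ** 2 + (gy - py) ** 2
            total += math.exp(-d2 / (2 * bandwidth ** 2))
        densities.append(total / norm)
    return densities


def differential_density(points_a, points_b, grid, bandwidth):
    """Assumed definition: density of A minus density of B at each grid cell."""
    da = kernel_density(points_a, grid, bandwidth)
    db = kernel_density(points_b, grid, bandwidth)
    return [a - b for a, b in zip(da, db)]
```

Normalizing each surface by its own point count keeps the comparison meaningful when the two keywords return very different numbers of tweets or pages; positive cells are where the first keyword is relatively stronger.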

  7. VISION: "V"isualizing "I"nformation "S"pace "I"n "O"ntological "N"etworks.
     • Cyber Information Space (BIG DATA: web pages, social media, weblogs, forums, news) and the Real World.
     • Information Mining Tools: Twitter_GeoSearch_Tool (Search API, Streaming API); CyberDiscovery Tools (Yahoo API, Bing API, Google API).
     • Ontological Analysis Platform: Who, Where, When, What → WHY? (networks).
     • Computational Linguistics Tools / Methods; Spatial Analysis and Visualization Tools / Methods.
     • Place – Time – Messages (content/functions) → New Theories (explanation), New Models (simulation), New Knowledge.

  8. Information Communication Channels in Cyberspace • Web Pages (Semi-Public Information Communication) • Social Media (Twitter: Semi-Private Information Communication). Web Pages: Use web search engines (Google, Yahoo, and Bing) to retrieve up to 1,000 web pages per keyword, then analyze their contents in relation to their ranks and geolocations. Social Media (Tweets): Use Twitter APIs to retrieve tweets based on keywords or #hashtags and geolocations (self-defined hometowns or GPS locations).

  9. Collect Web Page Contents, Ranks, and Locations: We developed the Cyber-Discovery Search Engine (retrieves up to 1,000 results from Yahoo or Bing).

  10. Twitter – Spatial Search API. Twitter APIs: • REST API • Stream API • Search API. Example spatial search: Center: 41.961295, -93.281859; Radius: 180 miles. Search API limitations: 1. A spatial search can only trace back up to seven days. (A regular search can trace back up to 14 days.) 2. Each search cannot return more than 1,500 tweets.
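
A center-plus-radius search like the one above amounts to a great-circle distance test. The sketch below does not call any Twitter API. It only shows, under assumptions, how already-collected tweets with coordinates could be checked against the slide's center and 180-mile radius using the haversine formula. The dict keys `lat` and `lon` and the function names are hypothetical.

```python
import math

EARTH_RADIUS_MILES = 3958.8  # mean Earth radius


def haversine_miles(lat1, lon1, lat2, lon2):
    """Great-circle distance in miles between two lat/lon points (degrees)."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_MILES * math.asin(math.sqrt(a))


def within_radius(tweets, center, radius_miles):
    """Keep tweets (dicts with assumed "lat"/"lon" keys) inside the search circle."""
    lat0, lon0 = center
    return [
        t for t in tweets
        if haversine_miles(lat0, lon0, t["lat"], t["lon"]) <= radius_miles
    ]


# Center and radius taken from the slide; the tweets are made-up examples.
CENTER = (41.961295, -93.281859)
sample = [
    {"id": "a", "lat": 41.5868, "lon": -93.6250},   # Des Moines, IA
    {"id": "b", "lat": 32.7157, "lon": -117.1611},  # San Diego, CA
]
nearby = within_radius(sample, CENTER, 180)
```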

  11. Web Pages Search Results vs. Tweets

  12. Web Page Visualization maps (using Google or Yahoo search engine results, with web page IP addresses converted into lat/lon via MaxMind lookup tables). IP geolocation gives the "registration location of the Web server" (not the physical location of the machines). (What is the veracity of geolocation?)
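
Converting an IP address to lat/lon with a lookup table is essentially a range search. The sketch below illustrates the idea only: the table layout, the coordinates, and the function name `ip_to_latlon` are made up for the example (the IP ranges are reserved documentation blocks) and do not reflect MaxMind's actual file format or API.

```python
import bisect
import ipaddress

# Hypothetical lookup table: (first_ip, last_ip, lat, lon).
RANGES = [
    ("192.0.2.0", "192.0.2.255", 32.7157, -117.1611),
    ("198.51.100.0", "198.51.100.255", 40.7128, -74.0060),
]

# Convert to integers and sort by range start so we can binary-search.
_TABLE = sorted(
    (int(ipaddress.ip_address(first)), int(ipaddress.ip_address(last)), lat, lon)
    for first, last, lat, lon in RANGES
)
_STARTS = [row[0] for row in _TABLE]


def ip_to_latlon(ip):
    """Return (lat, lon) for the range containing ip, or None if unknown."""
    n = int(ipaddress.ip_address(ip))
    i = bisect.bisect_right(_STARTS, n) - 1  # last range starting at or before n
    if i >= 0 and _TABLE[i][0] <= n <= _TABLE[i][1]:
        return _TABLE[i][2], _TABLE[i][3]
    return None
```

A `None` result corresponds to the "unknown" class in the accuracy slides; a hit still reflects only where the address block is registered, which is the caveat raised above.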

  13. Classifying different types of web pages and social media for content and linguistic analysis. Comparison between the Bing and Yahoo engines ("Jerry Sanders" keyword; % in 12 different web page categories defined by our team members): Bing search returns more commercial, informational (wiki), and social media pages. Yahoo search returns more blogs, news, and educational pages. (In general, though, the two engines show some similarity.)

  14. Spatial Accuracy of Web Page Categories based on IP address geo-conversion.
     • Highest: Educational 73.86%, Social Media 68.97%, Government 60.98%
     • Lowest: Blog 10.81%, Special Interest Group 12.81%, NGO 20.93%

  15. Geolocation Accuracy for Different Keywords • Green (correct) • Blue (incorrect) • Gray (unknown) • Highest spatial accuracy: McGinn – 33.57% • Lowest spatial accuracy: Santorum – 21.29% • Highest N/A: Flu – 35.52%
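
Percentages like those on the two accuracy slides can be tallied from manually checked records. The sketch below is an assumption-laden illustration: the slides do not say whether "unknown" results count in the denominator, so this version counts every record, and the record layout and function name are hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Percent of geolocations judged correct, per group.

    records: iterable of (group, status) pairs, where group is a web page
    category or a keyword and status is "correct", "incorrect", or "unknown".
    Assumption: unknown results stay in the denominator.
    """
    counts = defaultdict(lambda: {"correct": 0, "total": 0})
    for group, status in records:
        counts[group]["total"] += 1
        if status == "correct":
            counts[group]["correct"] += 1
    return {
        group: round(100.0 * c["correct"] / c["total"], 2)
        for group, c in counts.items()
    }
```

Dropping unknowns from the denominator instead would raise every figure, so the convention should be stated alongside any reported accuracy.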

  16. Web Page Information Landscape (2012 Presidential Election). Ming-Hsiang Tsou, Jiue-An Yang, Daniel Lusher, Su Han, Brian Spitzberg, Jean Mark Gawron, Dipak Gupta & Li An (2013): Mapping social activities and concepts with social media (Twitter) and web search engines (Yahoo and Bing): a case study in 2012 US Presidential Election, Cartography and Geographic Information Science, DOI:10.1080/15230406.2013.799738

  17. http://mappingideas.sdsu.edu/mapshowcase/election/webpage/election3.html

  18. Twitter Case Study #1: 2012 Summer, Comparing FIVE Movie Tweets & Box Office. 1) Select 30 major U.S. cities and, within a 17-mile radius of each, collect tweets containing movie keywords (TED, Spider-Man, etc.). 2) Compare the daily movie box office results with the number of tweets containing each movie keyword.

  19. Five Movies Correlation Test (Daily / Weekly)
     • TED: 0.8826 / 0.9989
     • Spider-Man: 0.9409 / 0.9725
     • Ice Age: 0.8895 / 0.9528
     • Dark Knight: 0.9523 / 0.9375
     • Step Up: 0.8931 / 0.8123
     Daily: Daily_Tweets vs. Daily_Box_Revenue. Weekly: 8_to_13_days_before, one_week_before, release_day, one_week_after, two_weeks_after, three_weeks_after, four_weeks_after.
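
The slide does not name the correlation statistic used. Assuming it is the Pearson correlation coefficient, a daily comparison of tweet counts against box office revenue could be computed as in this sketch. The function name is hypothetical and the numbers in the usage lines are made up for illustration, not project data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Made-up daily series for one movie: tweets per day and revenue per day.
daily_tweets = [1200, 5400, 4100, 2500, 1800]
daily_box_revenue = [2.0e6, 9.5e6, 7.0e6, 4.2e6, 3.1e6]
r_daily = pearson_r(daily_tweets, daily_box_revenue)
```

The weekly figure would use the same function on the seven aggregated windows listed on the slide; with only seven points, a high coefficient deserves more caution than the daily one.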

  20. Tweet_Daily and Box_Daily (TED). [Chart: daily tweets (Tweet_Daily, axis 0 to 60,000) and daily box office revenue (Box_Daily, axis 0 to 25,000,000) for day -13 through day 28 relative to release, with markers at Release, One Week, and Two Week.]

  21. Tweet_Weekly and Box_Weekly (TED). [Chart: weekly tweets (Tweets_weekly, axis 0 to 180,000) and weekly box office revenue (Box_weekly, axis 0 to 90,000,000) for week -2 through week 4, with the release day as week 0.]

  22. Case Study #2: 2012 Presidential Election (Tweets). [Map panels: Before Hurricane Sandy; After Hurricane Sandy.]

  23. Sentiment Analysis (case study: 2012 Presidential Election) (Before / After Hurricane Sandy)
