Analytics Building Blocks, Duen Horng (Polo) Chau, Assistant Professor



  1. http://poloclub.gatech.edu/cse6242 
 CSE6242 / CX4242: Data & Visual Analytics 
 Analytics Building Blocks Duen Horng (Polo) Chau 
 Assistant Professor 
 Associate Director, MS Analytics 
 Georgia Tech 
 Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John Stasko, Christos Faloutsos

  4. What is Data & Visual Analytics? No formal definition! 
 Polo’s definition: the interdisciplinary science of combining computation techniques and interactive visualization to transform and model data to aid discovery, decision making, etc.

  6. What are the “ingredients”? Need to worry (a lot) about: storage, complex system design, scalability of algorithms, visualization techniques, interaction techniques, statistical tests, etc. It wasn’t this complex before the big data era. Why?

  7. http://spanning.com/blog/choosing-between-storage-based-and-unlimited-storage-for-cloud-data-backup/

  8. What is big data? Why care? (“Big data” is a buzzword, and so is “IoT”, the Internet of Things.)
 • Many companies’ businesses are based on big data (Google, Facebook, Amazon, Apple, Symantec, LinkedIn, and many more)
 • Web search
   • Rank webpages (PageRank algorithm)
   • Predict what you’re going to type
 • Advertisement (e.g., on Facebook)
   • Infer users’ interest; show relevant ads
   • Infer what you like, based on what your friends like
 • Recommendation systems (e.g., Netflix, Pandora, Amazon)
 • Online education
 • Health IT: patient records (EMR)
 • Bio and chemical modeling
 • Finance
 • Cybersecurity
 • Internet of Things (IoT)

  9. Good news! Many jobs! Most companies are looking for “data scientists”.
 “The data scientist role is critical for organizations looking to extract insight from information assets for ‘big data’ initiatives and requires a broad combination of skills that may be fulfilled better as a team.”
 - Gartner (http://www.gartner.com/it-glossary/data-scientist)
 Breadth of knowledge is important. This course helps you learn some important skills.

  10. Analytics Building Blocks

  11. Collection, Cleaning, Integration, Analysis, Visualization, Presentation, Dissemination

  12. Building blocks, not “steps”
 • Can skip some
 • Can go back (two-way street)
 • Examples
   • Data types inform visualization design
   • Data informs choice of algorithms
   • Visualization informs data cleaning (dirty data)
   • Visualization informs algorithm design (user finds that results don’t make sense)
 (Sidebar: Collection, Cleaning, Integration, Analysis, Visualization, Presentation, Dissemination)

  13. How does big data affect the process? The Vs of big data (3Vs, 4Vs, now 7Vs):
 • Volume: “billions”, “petabytes” are common
 • Velocity: think Twitter, fraud detection, etc.
 • Variety: text (webpages), video (YouTube)…
 • Veracity: uncertainty of data
 • Variability
 • Visualization
 • Value
 (Sidebar: Collection, Cleaning, Integration, Analysis, Visualization, Presentation, Dissemination)
 http://www.ibmbigdatahub.com/infographic/four-vs-big-data 
 http://dataconomy.com/seven-vs-big-data/

  14. Gartner's 2016 Hype Cycle http://www.gartner.com/newsroom/id/3412017 https://en.wikipedia.org/wiki/Hype_cycle

  15. “Artificial Intelligence”

  16. We’re in the 3rd wave of the “AI” boom
 • Two “AI winters” before (https://en.wikipedia.org/wiki/History_of_artificial_intelligence)
 • We should be cautiously optimistic (Polo’s motto)

  17. AI Safety

  18. Good Read about AI: 
 White House Report: Preparing for the Future of Artificial Intelligence 
 https://www.whitehouse.gov/sites/default/files/whitehouse_files/microsites/ostp/NSTC/preparing_for_the_future_of_ai.pdf

  19. “The Current State of AI 
 Remarkable progress has been made on what is known as Narrow AI, which addresses specific application areas such as playing strategic games, language translation, self-driving vehicles, and image recognition. Narrow AI underpins many commercial services such as trip planning, shopper recommendation systems, and ad targeting, and is finding important applications in medical diagnosis, education, and scientific research. These have all had significant societal benefits and have contributed to the economic vitality of the Nation.

  20. General AI (sometimes called Artificial General Intelligence, or AGI) refers to a notional future AI system that exhibits apparently intelligent behavior at least as advanced as a person across the full range of cognitive tasks. A broad chasm seems to separate today’s Narrow AI from the much more difficult challenge of General AI. Attempts to reach General AI by expanding Narrow AI solutions have made little headway over many decades of research. The current consensus of the private-sector expert community, with which the NSTC Committee on Technology concurs, is that General AI will not be achieved for at least decades.”

  21. No Matrix or SkyNet in Your Lifetime

  22. Schedule: Collection, Cleaning, Integration, Analysis, Visualization, Presentation, Dissemination

  23. Two Example Projects 
 from Polo Club

  24. Apolo Graph Exploration: 
 Machine Learning + Visualization 
 Apolo: Making Sense of Large Network Data by Combining Rich User Interaction and Machine Learning. 
 Duen Horng (Polo) Chau, Aniket Kittur, Jason I. Hong, Christos Faloutsos. CHI 2011.

  26. Beautiful, Hairball, Death Star, Spaghetti

  29. Finding More Relevant Nodes 
 Citation network: HCI paper, data mining paper 
 Apolo uses guilt-by-association (Belief Propagation)

  30. Demo: Mapping the Sensemaking Literature
 • Nodes: 80k papers from Google Scholar (node size: number of citations)
 • Edges: 150k citations

  31. Key Ideas (Recap)
 • Specify exemplars
 • Find other relevant nodes (BP)
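
 The recap above (specify exemplars, then find other relevant nodes with BP) can be illustrated with a small sketch. This is not Apolo's actual implementation: it is a minimal loopy belief propagation over two states (relevant, not relevant), and the function name `belief_propagation`, the `homophily` parameter, the prior values, and the toy paper IDs are all assumptions made for illustration.

```python
from collections import defaultdict


def belief_propagation(edges, priors, homophily=0.9, iterations=20):
    """Loopy sum-product BP on an undirected graph, two states per node.

    edges:  iterable of (u, v) pairs
    priors: dict node -> (p_state0, p_state1) for the user's exemplars;
            every other node gets an uninformative (0.5, 0.5) prior
    """
    neighbors = defaultdict(set)
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    nodes = list(neighbors)

    # psi[a][b]: how compatible states a and b are on neighbouring nodes.
    # homophily > 0.5 encodes guilt-by-association (neighbours tend to agree).
    psi = [[homophily, 1 - homophily], [1 - homophily, homophily]]
    phi = {n: priors.get(n, (0.5, 0.5)) for n in nodes}

    # messages[(u, v)] is what u tells v about v's state; start uniform.
    messages = {(u, v): [0.5, 0.5] for u in nodes for v in neighbors[u]}

    for _ in range(iterations):
        updated = {}
        for (u, v) in messages:
            out = [0.0, 0.0]
            for state_v in (0, 1):
                for state_u in (0, 1):
                    # Product of messages into u from everyone except v.
                    incoming = 1.0
                    for w in neighbors[u]:
                        if w != v:
                            incoming *= messages[(w, u)][state_u]
                    out[state_v] += phi[u][state_u] * psi[state_u][state_v] * incoming
            total = out[0] + out[1]
            updated[(u, v)] = [out[0] / total, out[1] / total]
        messages = updated

    # Belief = prior times all incoming messages, normalised to sum to 1.
    beliefs = {}
    for n in nodes:
        b = [phi[n][0], phi[n][1]]
        for w in neighbors[n]:
            b[0] *= messages[(w, n)][0]
            b[1] *= messages[(w, n)][1]
        total = b[0] + b[1]
        beliefs[n] = (b[0] / total, b[1] / total)
    return beliefs


# Toy citation chain (hypothetical IDs): p1 is marked relevant, p4 not relevant.
citations = [("p1", "p2"), ("p2", "p3"), ("p3", "p4")]
exemplars = {"p1": (0.9, 0.1), "p4": (0.1, 0.9)}
ranked = sorted(belief_propagation(citations, exemplars).items(),
                key=lambda item: item[1][0], reverse=True)
```

 In this toy chain, unlabelled papers closer to the relevant exemplar end up with a higher belief in state 0 (relevant), which is the sense in which the method ranks "other relevant nodes".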

  32. What did Apolo go through?
 • Collection: Scrape Google Scholar. No API :(
 • Cleaning
 • Integration
 • Analysis: Design inference algorithm (Which nodes to show next?)
 • Visualization: Interactive visualization you just saw
 • Presentation: Paper, talks, lectures
 • Dissemination

  33. Apolo: Making Sense of Large Network Data by Combining Rich User Interaction and Machine Learning. Duen Horng (Polo) Chau, Aniket Kittur, Jason I. Hong, Christos Faloutsos. ACM Conference on Human Factors in Computing Systems (CHI) 2011. May 7-12, 2011.

  34. NetProbe: Fraud Detection in Online Auction 
 NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks. Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. WWW 2007

  35. NetProbe: The Problem 
 Find bad sellers (fraudsters) on eBay who don’t deliver their items 
 [Figure: buyer sends $$$ to seller] 
 Auction fraud was the #3 online crime in 2010 (source: www.ic3.gov)

  37. NetProbe: Key Ideas
 • Fraudsters fabricate their reputation by “trading” with their accomplices
 • Fake transactions form near-bipartite cores
 • How to detect them?

  38. NetProbe: Key Ideas 
 Use Belief Propagation 
 [Figure legend: F = Fraudster, A = Accomplice, H = Honest; darker means more likely]

  39. NetProbe: Main Results

  42. “Belgian Police”

  44. What did NetProbe go through?
 • Collection: Scraping (built a “scraper”/“crawler”)
 • Cleaning
 • Integration
 • Analysis: Design detection algorithm
 • Visualization
 • Presentation: Paper, talks, lectures
 • Dissemination: Not released

  45. NetProbe: A Fast and Scalable System for Fraud Detection in Online Auction Networks. Shashank Pandit, Duen Horng (Polo) Chau, Samuel Wang, Christos Faloutsos. International Conference on World Wide Web (WWW) 2007. May 8-12, 2007. Banff, Alberta, Canada. Pages 201-210.

  46. Homework 1 (out next week; tasks subject to change)
 • Simple “end-to-end” analysis
 • Collect data using an API
   • Movies (actors, directors, related movies, etc.)
 • Store in SQLite database
 • Transform data to movie-movie network
 • Analyze, using SQL queries (e.g., create graph’s degree distribution)
 • Visualize, using Gephi
 • Describe your discoveries
 (Sidebar: Collection, Cleaning, Integration, Analysis, Visualization, Presentation, Dissemination)
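
 As a rough illustration of the "store in SQLite, then analyze with SQL" tasks above, the sketch below computes a degree distribution for a tiny movie-movie network. It is not the homework solution: the table name `movie_edges`, its columns, and the four sample edges are hypothetical, and the real schema will depend on the data collected from the API.

```python
import sqlite3

# In-memory database so the sketch leaves no files behind.
conn = sqlite3.connect(":memory:")

# Hypothetical schema: one row per undirected movie-movie edge.
conn.execute("CREATE TABLE movie_edges (source INTEGER, target INTEGER)")
sample_edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
conn.executemany("INSERT INTO movie_edges VALUES (?, ?)", sample_edges)

# Step 1: list every edge endpoint (each edge contributes to two degrees).
# Step 2: count endpoints per movie to get its degree.
# Step 3: count how many movies share each degree value.
query = """
WITH endpoints AS (
    SELECT source AS movie FROM movie_edges
    UNION ALL
    SELECT target AS movie FROM movie_edges
),
degrees AS (
    SELECT movie, COUNT(*) AS degree
    FROM endpoints
    GROUP BY movie
)
SELECT degree, COUNT(*) AS num_movies
FROM degrees
GROUP BY degree
ORDER BY degree
"""
degree_distribution = conn.execute(query).fetchall()
conn.close()

for degree, num_movies in degree_distribution:
    print(f"degree {degree}: {num_movies} movie(s)")
```

 The `UNION ALL` keeps duplicate endpoints on purpose, since a movie's degree is the number of times it appears on either side of an edge.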
