
Part I: Introductory Materials: Introduction to Data Mining


  1. Part I: Introductory Materials. Introduction to Data Mining
     Dr. Nagiza F. Samatova
     Department of Computer Science, North Carolina State University
     and Computer Science and Mathematics Division, Oak Ridge National Laboratory

  2. What is common among all of them?

  3. Who are the data producers? What data? (Application → Data)
     • Application Category: Finance
     • Producer: Wall Street
     • Data: stocks, stock prices, stock purchases, …
     • Application Category: Academia
     • Producer: NCSU
     • Data: student admission data (name, DOB, GRE scores, transcripts, GPA, university/school attended, recommendation letters, personal statement, etc.)

  4. Application Categories
     • Finance (e.g., banks)
     • Entertainment (e.g., games)
     • Science (e.g., weather forecasting)
     • Medicine (e.g., disease diagnostics)
     • Cybersecurity (e.g., terrorists, identity theft)
     • Commerce (e.g., e-Commerce)
     • …

  5. What questions to ask about the data? (Data → Questions)
     • Academia:NCSU:Admission data
     1. Is there any correlation between the students’ GRE scores and their successful completion of a PhD program?
     2. What are the groups of students that share common academic performance?
     3. Are there any admitted students who would stand out as an anomaly? What type of anomaly is that?
     4. If a student majors in Physics, what other major is he or she likely to double-major in?

  6. Questions by Types?
     • Correlation, similarity, comparison, …
     • Association, causality, co-occurrence, …
     • Grouping, clustering, …
     • Categorization, classification, …
     • Frequency or rarity of occurrence, …
     • Anomalous or normal objects, events, behaviors, …
     • Forecasting: future classes, future activity, …
     • …

  7. What information do we need to answer the questions? (Questions → Data Objects and Object Features)
     • Academia:NCSU:Admission data
       – Objects: Students
       – Object’s Features = Variables = Attributes = Dimensions, and their Types
         • Name:String (e.g., Name=Neil Shah)
         • GPA:Numeric (e.g., GPA=5.0)
         • Recommendation:Text (e.g., … the top 2% in my career…)
         • Etc.

  8. How to compare two objects? (Data Object → Object Pairs)
     • Academia:NCSU:Admission data
       – Objects: Students
       – Based on a single feature:
         • Similar GPA
         • The same first letter in the last name
       – Based on a set of features:
         • Similar academic records (GPA, GRE, etc.)
         • Similar demographic records
       – Can you compute a numerical value for the similarity measure used for comparison? Why or why not?

  9. How to represent data mathematically? (Data Object & its Features → Data Model)
     • What mathematical objects have you studied?
       – Scalar
       – Points
       – Vectors
       – Vector spaces
       – Matrices
       – Sets
       – Graphs, networks (maybe)
       – Tensors (maybe)
       – Time series (maybe)
       – Topological manifolds (maybe)
       – …

  10. Data object as vector with components…
      Vector components:
      • Features, or
      • Attributes, or
      • Dimensions
      City=(Latitude, Longitude) -- 2-dimensional object
      Raleigh=(35.46, 78.39)
      Boston=(42.21, 71.5)
      Proximity(Raleigh, Boston)=?
      • Geodesic distance
      • Euclidean distance
      • Length of the interstate route
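Two of the proximity measures named on this slide can be sketched in a few lines. This is an illustration only: it assumes the slide's pairs are (latitude in degrees north, longitude in degrees west) in decimal degrees, uses the haversine formula on a spherical Earth as a stand-in for the geodesic distance, and the function names are hypothetical.

```python
import math

# Coordinates copied from the slide. Reading them as decimal degrees
# (latitude north, longitude west) is an assumption.
RALEIGH = (35.46, 78.39)
BOSTON = (42.21, 71.5)

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; a spherical approximation


def euclidean_distance(p, q):
    """Straight-line distance between two vectors, in the units of their components."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def geodesic_distance_km(p, q):
    """Great-circle distance via the haversine formula (spherical Earth assumed)."""
    lat1, lon1 = math.radians(p[0]), math.radians(-p[1])  # west longitude is negative
    lat2, lon2 = math.radians(q[0]), math.radians(-q[1])
    dlat, dlon = lat2 - lat1, lon2 - lon1
    a = math.sin(dlat / 2) ** 2 + math.cos(lat1) * math.cos(lat2) * math.sin(dlon / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


print("Euclidean (in degrees):", euclidean_distance(RALEIGH, BOSTON))
print("Geodesic (in km):", geodesic_distance_km(RALEIGH, BOSTON))
```

The two numbers are in different units and answer different questions, which is the point of the slide: the choice of proximity measure is part of the data model.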

  11. A set of data objects as vector spaces
      [Figure: 3-dimensional vector space with axes Latitude, Longitude, and Altitude; points Moscow and Raleigh]
      Mining such data ~ studying vector spaces

  12. Multi-dimensional vectors…
      Vector components:
      • Features, or
      • Attributes, or
      • Dimensions
      Student=(Name, GPA, Weight, Height, Income in K, …) -- multi-dimensional
      S1=(John Smith, 5.0, 180, 6.0, 200)
      S2=(Jane Doe, 3.0, 140, 5.4, 70)
      Proximity(S1, S2)=?
      • How to compare when vector components are of heterogeneous types, or on different scales?
      • How to show the results of the comparison?
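One common answer to the different-scales question is to rescale each numeric feature to [0, 1] before measuring distance. The sketch below uses the numeric components of S1 and S2 from the slide; the feature ranges in `RANGES` are hypothetical values chosen for illustration, and min-max scaling is only one of several possible normalizations.

```python
import math

# Numeric components of S1 and S2 from the slide (the Name component is left out,
# since a string needs its own similarity measure).
S1 = {"gpa": 5.0, "weight": 180.0, "height": 6.0, "income_k": 200.0}
S2 = {"gpa": 3.0, "weight": 140.0, "height": 5.4, "income_k": 70.0}

# Hypothetical (min, max) range per feature; not taken from the slides.
RANGES = {
    "gpa": (0.0, 5.0),
    "weight": (100.0, 300.0),
    "height": (4.0, 7.0),
    "income_k": (0.0, 500.0),
}


def min_max_scale(student):
    """Map every feature to [0, 1] using the assumed ranges."""
    return {k: (v - RANGES[k][0]) / (RANGES[k][1] - RANGES[k][0]) for k, v in student.items()}


def euclidean(u, v):
    """Euclidean distance over the shared feature names."""
    return math.sqrt(sum((u[k] - v[k]) ** 2 for k in u))


raw = euclidean(S1, S2)
scaled = euclidean(min_max_scale(S1), min_max_scale(S2))
print("raw distance:", raw)        # dominated by the income and weight differences
print("scaled distance:", scaled)  # each feature contributes on a comparable scale
```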

  13. as matrices…
      Example: A collection of text documents on the Web

      Original Documents:
        D1: Child Safety at Home
        D2: Infant & Toddler First Aid
        D3: Your Baby's Health and Safety: From Infant to Toddler

      Parsed Documents:
        D1: Child Safety Home
        D2: Infant Toddler
        D3: Bab Health Safety Infant Toddler

      t-d term-document matrix (Terms = Features = Dimensions)

                       D1   D2   D3
        T1: Bab         0    0    1
        T2: Child       1    0    0
        T3: Health      0    0    1
        T4: Home        1    0    0
        T5: Infant      0    1    1
        T6: Safety      1    0    1
        T7: Toddler     0    1    1

      Mining such data ~ studying matrices
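The term-document matrix on this slide can be rebuilt from the parsed documents. The sketch below starts from the parsed token lists exactly as shown (the parsing and stemming step that produced "Bab" is not reproduced here), and the variable names are illustrative.

```python
# Parsed documents as listed on the slide.
parsed_docs = {
    "D1": ["Child", "Safety", "Home"],
    "D2": ["Infant", "Toddler"],
    "D3": ["Bab", "Health", "Safety", "Infant", "Toddler"],
}

# Terms are the sorted vocabulary; each term becomes one row (T1..T7).
terms = sorted({term for tokens in parsed_docs.values() for term in tokens})
doc_ids = list(parsed_docs)

# Binary entries: 1 if the term occurs in the document, 0 otherwise.
matrix = [[1 if term in parsed_docs[d] else 0 for d in doc_ids] for term in terms]

print("           " + "  ".join(doc_ids))
for i, (term, row) in enumerate(zip(terms, matrix), start=1):
    print(f"T{i}: {term:<8}" + "   ".join(str(x) for x in row))
```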

  14. or as trees
      The t-d term-document matrix (columns D1, D2, D3; rows T1 to T7) is repeated from the previous slide.
      • Is D2 similar to D3?
      • What if there are 10,000 terms?
      [Figure: tree relating documents and terms. Term groups shown: (president, government, party, election, political, elected, national, districts, held, district, independence, vice, minister, parties); (population, area, climate, city, miles, province, land, topography, total, season, 1999, square, rate); (economy, million, products, 1996, growth, copra, economic, 1997, food, scale, exports, rice, fish)]
      Mining such data ~ studying trees
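One way to put a number on "Is D2 similar to D3?" is the cosine similarity between the documents' columns of the term-document matrix. The slide does not name a measure, so cosine similarity is an assumption here, chosen because it is a common option for term vectors.

```python
import math

# Columns of the term-document matrix, in term order T1..T7
# (Bab, Child, Health, Home, Infant, Safety, Toddler).
D1 = [0, 1, 0, 1, 0, 1, 0]
D2 = [0, 0, 0, 0, 1, 0, 1]
D3 = [1, 0, 1, 0, 1, 1, 1]


def cosine_similarity(u, v):
    """Cosine of the angle between two term vectors (1.0 means same direction)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


print("sim(D2, D3) =", cosine_similarity(D2, D3))  # share Infant and Toddler
print("sim(D1, D2) =", cosine_similarity(D1, D2))  # share no terms
```

With 10,000 terms the same function still works, but most entries would be zero, which is why sparse representations and dimensionality reduction become relevant.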

  15. or as networks, or graphs w/ nodes & links
      • Nodes = Documents
      • Links = Document similarity (e.g., if a document references another document)
      [Figure: graph of documents labeled by term groups: (president, government, party, election, political, elected, national, districts, held, district, independence, vice, minister, parties); (population, area, climate, city, miles, province, land, topography, total, season, 1999, square, rate); (economy, million, products, 1996, growth, copra, economic, 1997, food, scale, exports, rice, fish)]
      Mining such data ~ studying graphs, or graph mining

  16. What apps naturally deal w/ graphs?
      • Semantic Web
      • Social Networks
      • World Wide Web
      • Drug Design, Chemical compounds
      • Computer networks
      • Sensor networks
      Credit: Images are from Google images via search of keywords

  17. What questions to ask about graph data? (Graph Data → Graph Mining Questions)
      • Academia:NCSU:Admission data
      1. Nodes=students; links=similar academics/demographics
      2. How many distinct academic-performance groups are there among students admitted to NCSU?
      3. Which academic group is the largest?
      4. Given a new student applicant, can we predict which academic group the student will likely belong to?
      5. Do groups of students with similar demographics usually share similar academic performance?
      6. Over the last decade, has the diversity in demographics of accepted student groups increased or decreased?
      7. …
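Questions 2 and 3 can be illustrated on a toy graph. In the sketch below, a "group" is taken to be a connected component of the similarity graph, which is only one possible definition; the student identifiers and links are hypothetical and not taken from any real admission data.

```python
from collections import deque

# Hypothetical students (nodes) and similarity links (edges), for illustration only.
students = ["s1", "s2", "s3", "s4", "s5", "s6"]
similar_pairs = [("s1", "s2"), ("s2", "s3"), ("s4", "s5")]

# Undirected adjacency list.
neighbors = {s: set() for s in students}
for a, b in similar_pairs:
    neighbors[a].add(b)
    neighbors[b].add(a)


def connected_components(adjacency):
    """Return the list of connected components, each as a sorted list of nodes."""
    seen, components = set(), []
    for start in adjacency:
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:  # breadth-first search from the unvisited start node
            node = queue.popleft()
            component.append(node)
            for nxt in adjacency[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(sorted(component))
    return components


groups = connected_components(neighbors)
largest = max(groups, key=len)
print("number of groups:", len(groups))
print("largest group:", largest)
```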

  18. Recap: Data Mining and Graph Mining
      Application → Data → Questions → Data Objects + Features → Mathematical Data Representation (Data Model)
      Data models: Vectors, Matrices, Graphs, Tensors, Time series, Manifolds, Sets
      • Not one hat fits all
      • More than one model is needed
      • Models are related

  19. How much data?
      [Figure: data volumes (1 PB/year, 20-40 TB/simulation, 30 TB/day, 850 TB) across domains (Ecology, Biology, Climate, Astrophysics, Cosmology, Web)]
      My laptop: 60 GB
      • 1 GB (GigaByte) = 10^9 Bytes
      • 1 TB (TeraByte) = 10^12 Bytes
      • 1 PB (PetaByte) = 10^15 Bytes

  20. It is not just the Size – but the Complexity
      [Figure labels: Petabytes, Data]

  21. Data Describes Complex Patterns/Phenomena
      How to untangle the riddles of the complexity?
      [Figure labels: Single gene; Complex regulation; ~30k genes; 50 trans elements control single gene expression]
      • Analytical tools that find the “dots” from data significantly reduce data.
      • Challenge: How to “connect the dots” to answer important science/business questions?

  22. Connecting the Dots
      Finding the Dots: Sheer Volume of Data
      • Climate. Now: 20-40 Terabytes/year; in 5 years: 5-10 Petabytes/year
      • Fusion. Now: 100 Megabytes/15 min; in 5 years: 1000 Megabytes/2 min
      Connecting the Dots: Advanced Math + Algorithms
      • Huge dimensional space
      • Combinatorial challenge
      • Complicated by noisy data
      • Requires high-performance computers
      Understanding the Dots: Providing Predictive Understanding
      • Produce bioenergy
      • Stabilize CO2
      • Clean toxic waste

  23. Why Would Data Mining Matter?
      Enables solving many large-scale data problems
      Finding the Dots → Connecting the Dots → Understanding the Dots
      Science Questions:
      • How to effectively produce bioenergy?
      • How to stabilize carbon dioxide?
      • How to convert toxic into non-toxic waste?
      ...

  24. How to Move and Access the Data?
      Technology trends are a rate-limiting factor
      • Most of these data will NEVER be touched!
      • Data doubles every 9 months; CPU: 18 months.
      • Naturally distributed, but effectively immovable
      • Streaming/Dynamic, but not re-computable
      [Figure: Latency and Speed – Storage Performance. Retrieval rate (Mbytes/s) vs. log10(object size in bytes) for Memory, Disk, and Tape]
      [Figure: CPU, Disk, Network Trend (MIPS/$M, GB/$M, kB/s). Doubling: CPU every 1.2 years; Disk every 1.4 years; WAN every 0.7 years]
      Src: Richard Mount, SLAC; J. W. Toigo, Avoiding a Data Crunch, Scientific American, May 2000

  25. How to Make Sense of Data? Know Your Limits & Be Smart
      • Not humanly possible to browse a petabyte of data.
      • Analysis must reduce data to quantities of interest.
      • Ultrascale Computations: Must be smart about which probe combinations to see!
      • Physical Experiments: Must be smart about probe placement!
      [Figure: more analysis on more data, reducing Petabytes → Terabytes → Gigabytes → Megabytes]
      To see 1 percent of a petabyte at 10 megabytes per second takes 35 8-hour days!
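The closing figure can be checked with a few lines of arithmetic. The sketch assumes decimal units as defined earlier in the deck (1 PB = 10^15 bytes) and 1 MB = 10^6 bytes; the variable names are illustrative.

```python
PETABYTE = 10**15        # bytes, as defined on the "How much data?" slide
MEGABYTE = 10**6         # bytes (decimal megabyte, an assumption)

bytes_to_read = PETABYTE // 100          # 1 percent of a petabyte
rate_bytes_per_s = 10 * MEGABYTE         # 10 megabytes per second

seconds = bytes_to_read / rate_bytes_per_s
hours = seconds / 3600
eight_hour_days = hours / 8

print(f"{seconds:.0f} s = {hours:.1f} h = {eight_hour_days:.1f} eight-hour days")
```

The result rounds to the 35 eight-hour days quoted on the slide.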
