
Introduction Motivation: Business Intelligence Customer information - PowerPoint PPT Presentation



  1. Introduction

  2. Motivation: Business Intelligence
     Customer information (customer-id, gender, age, home-address, occupation, income, family-size, …)
     Product information (Product-id, category, manufacturer, made-in, stock-price, …)
     Sales information (customer-id, product-id, #units, unit-price, sales-representative, …)
     Business queries:
     Jian Pei: CMPT 741/459 Data Mining -- Introduction 2

  3. Techniques: Business Intelligence
     • Multidimensional data analysis
     • Online query answering
     • Interactive data exploration
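
As a rough illustration of multidimensional data analysis, the sketch below rolls up sales revenue along any chosen combination of dimensions. The three tables mirror the customer, product, and sales information listed on the motivation slide; the records, the field names, and the `revenue_by` helper are all hypothetical and are not taken from the lecture.

```python
from collections import defaultdict

# Hypothetical records shaped like the customer, product, and sales tables.
customers = {1: {"gender": "F", "age": 34}, 2: {"gender": "M", "age": 51}}
products = {10: {"category": "dairy"}, 11: {"category": "bakery"}}
sales = [
    {"customer_id": 1, "product_id": 10, "units": 2, "unit_price": 3.0},
    {"customer_id": 1, "product_id": 11, "units": 1, "unit_price": 4.0},
    {"customer_id": 2, "product_id": 10, "units": 5, "unit_price": 3.0},
]


def revenue_by(dimensions):
    """Sum revenue per combination of the given dimension names (a group-by)."""
    totals = defaultdict(float)
    for sale in sales:
        # Join each sale with its customer and product attributes.
        row = {**customers[sale["customer_id"]], **products[sale["product_id"]]}
        key = tuple(row[d] for d in dimensions)
        totals[key] += sale["units"] * sale["unit_price"]
    return dict(totals)


print(revenue_by(["category"]))            # one dimension
print(revenue_by(["gender", "category"]))  # drill down to two dimensions
```

Passing fewer dimensions gives a coarser summary and passing more gives a finer one, which is the roll-up and drill-down pattern behind interactive exploration.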

  4. Motivation: Store Layout Design
     Image source: http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg

  5. Techniques: Store Layout Design
     • Customer purchase patterns
     • Business strategies
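
One simple way to look for customer purchase patterns is to count how often two products appear in the same basket. The sketch below does that with made-up baskets; `frequent_pairs` and its `min_support` threshold are illustrative names, and this is only one of many possible pattern-finding approaches.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of products per purchase.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
]


def frequent_pairs(baskets, min_support):
    """Return product pairs bought together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


print(frequent_pairs(baskets, min_support=2))
```

Pairs that clear the threshold are candidates for being shelved near each other, which is where the business strategy comes in.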

  6. Motivation: Community Detection
     Image source: http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-social-media-1-728.jpg?cb=1308736811

  7. Techniques: Community Detection
     • Similarity between objects
     • Partitioning objects into groups
       – No guidance about what a group is
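
The two ingredients above, a similarity measure and a partitioning step with no predefined groups, can be sketched as follows. This example assumes Jaccard similarity over sets of interests and links any two people whose similarity reaches a threshold, then reports the connected components. The data, the threshold, and the function names are hypothetical, and real community detection methods are considerably more refined.

```python
def jaccard(a, b):
    """Jaccard similarity of two sets: size of intersection over size of union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def group_by_similarity(items, threshold):
    """Partition item names into groups linked by similarity >= threshold."""
    names = list(items)
    parent = {name: name for name in names}

    def find(name):
        # Union-find lookup with path halving.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if jaccard(items[a], items[b]) >= threshold:
                parent[find(a)] = find(b)

    groups = {}
    for name in names:
        groups.setdefault(find(name), set()).add(name)
    return list(groups.values())


# Hypothetical people described by their interests.
interests = {
    "ann": {"hiking", "chess", "jazz"},
    "bob": {"hiking", "chess", "rock"},
    "cat": {"jazz", "chess", "hiking"},
    "dan": {"cooking", "tennis"},
    "eve": {"cooking", "tennis", "yoga"},
}
groups = group_by_similarity(interests, threshold=0.5)
print(sorted(sorted(g) for g in groups))
# [['ann', 'bob', 'cat'], ['dan', 'eve']]
```

No group labels are supplied anywhere; the groups emerge only from the similarity measure and the threshold.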

  8. Motivation: Disease Prediction
     What medical problems does this patient have?
     Symptoms: overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat …

  9. Techniques: Disease Prediction
     • Features
     • Model

  10. Motivation: Fraud Detection
      Image source: http://i.imgur.com/ckkoAOp.gif

  11. Techniques: Fraud Detection
      • Features
      • Dissimilarity
      • Groups and noise
      Image source: http://i.stack.imgur.com/tRDGU.png
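
A minimal sketch of the "groups and noise" idea: describe each transaction by numeric features, use Euclidean distance as the dissimilarity, and flag points with too few close neighbors as noise. The feature values, `radius`, and `min_neighbors` below are invented for illustration, and this is a simple density check rather than any particular method from the lecture. It uses `math.dist`, which requires Python 3.8 or later.

```python
import math


def find_noise(points, radius, min_neighbors):
    """Return points with fewer than min_neighbors other points within radius."""
    noise = []
    for i, p in enumerate(points):
        neighbors = sum(
            1
            for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if neighbors < min_neighbors:
            noise.append(p)
    return noise


# Hypothetical two-feature transactions: two dense groups and one stray point.
points = [
    (1.0, 1.0), (1.1, 0.9), (0.9, 1.2),
    (5.0, 5.1), (5.2, 4.9), (4.9, 5.0),
    (9.0, 0.5),
]
print(find_noise(points, radius=0.5, min_neighbors=1))  # [(9.0, 0.5)]
```

Points that belong to no dense group are the candidates worth inspecting as potential fraud.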

  12. What Is Data Science About?
      • Data
      • Extraction of knowledge from data
      • Continuation of data mining and knowledge discovery from data (KDD)

  13. What Is Data?
      • Values of qualitative or quantitative variables belonging to a set of items
      • Represented in a structure, e.g., a tabular, tree, or graph structure
      • Typically the results of measurements
      • As an abstract concept, data can be viewed as the lowest level of abstraction, from which information and then knowledge are derived

  14. What Is Information?
      • “Knowledge communicated or received concerning a particular fact or circumstance”
      • Conceptually, information is the message (utterance or expression) being conveyed
      • Cannot be predicted
      • Can resolve uncertainty

  15. What Is Knowledge?
      • Familiarity with someone or something, which can include facts, information, descriptions, or skills acquired through experience or education
      • Implicit knowledge: practical skill or expertise
      • Explicit knowledge: theoretical understanding of a subject

  16. Data Systems
      • A data system answers queries based on data acquired in the past
      • Base data – the rawest data, not derived from anywhere else
      • Knowledge – information derived from the base data

  17. Dealing with Data – Querying
      • Given a set of student records about name, age, courses taken and grades
      • Simple queries
        – What is John Doe’s age?
      • Aggregate queries
        – What is the average GPA of all students at this school?
      • Queries can be arbitrarily complicated
        – Find the students X and Y whose grades are less than 3% apart in as many courses as possible
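
The three kinds of query on this slide can be sketched against a small in-memory data set. The student records below are invented, and the sketch assumes that "less than 3% apart" means an absolute difference of fewer than 3 percentage points between two grades in the same course; other readings are possible.

```python
from itertools import combinations

# Hypothetical student records: age, GPA, and percentage grades per course.
students = {
    "John Doe": {"age": 20, "gpa": 3.0,
                 "grades": {"math": 80, "physics": 75, "art": 90}},
    "Jane Roe": {"age": 22, "gpa": 3.5,
                 "grades": {"math": 82, "physics": 76, "art": 60}},
    "Sam Poe": {"age": 21, "gpa": 2.5,
                "grades": {"math": 65, "physics": 70, "art": 91}},
}


def average_gpa(students):
    """Aggregate query: mean GPA over all students."""
    return sum(s["gpa"] for s in students.values()) / len(students)


def closest_pair(students, tolerance=3):
    """Complex query: the pair with the most courses graded within tolerance."""
    def close_courses(x, y):
        gx, gy = students[x]["grades"], students[y]["grades"]
        shared = gx.keys() & gy.keys()
        return sum(1 for c in shared if abs(gx[c] - gy[c]) < tolerance)

    return max(combinations(students, 2), key=lambda pair: close_courses(*pair))


print(students["John Doe"]["age"])  # simple query: 20
print(average_gpa(students))        # aggregate query: 3.0
print(closest_pair(students))       # ('John Doe', 'Jane Roe')
```

The simple query is a single lookup, the aggregate touches every record once, and the pairwise query compares every pair of students, which hints at how quickly query cost can grow.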

  18. Queries
      • A precise request for information
      • Subjects in databases and information retrieval
        – Databases: structured queries on structured (e.g., relational) data
        – Information retrieval: unstructured queries on unstructured (e.g., text, image) data
      • Important assumptions
        – Information needs
        – Query languages

  19. Data-driven Exploration
      • What should be the next strategy of a company?
        – A lot of data: sales, human resources, production, tax, service cost, …
      • The question cannot be translated into a precise request for information (i.e., a query)
      • Developing familiarity (knowledge) and actionable items (decisions) by interactively analyzing data

  20. Data-driven Thinking
      • Starting with some simple queries
      • New queries are raised by consuming the results of previous queries
      • No ultimate query in design!
        – But many queries can be answered using DB/IR techniques

  21. The Art of Data-driven Thinking
      • The way of generating queries remains an art!
        – Different people may derive different results using the same data
      “If you torture the data long enough, it will confess” – Ronald H. Coase
      • More often than not, more data may be needed
        – datafication

  22. Queries for Data-driven Thinking
      • Probe queries – finding information about specific individuals
      • Aggregation – finding information about groups
      • Pattern finding – finding commonality in population
      • Association and correlation – finding connections among individuals and groups
      • Causality analysis – finding causes and consequences

  23. What Is Data Mining?
      • Broader sense: the art of data-driven thinking
      • Technical sense: the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [Fayyad, Piatetsky-Shapiro, Smyth, 96]
        – Methods and tools of answering various types of queries in the data mining process in the broader sense

  24. Machine Learning
      “A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” – Tom M. Mitchell
      • Essentially, learn the distribution of data

  25. Data Mining vs. Machine Learning
      • Machine learning focuses on prediction, based on known properties learned from the training data
      • Data mining focuses on the discovery of (previously) unknown properties in the data

  26. The KDD Process
      Data → Selection → Target data → Preprocessing → Preprocessed data → Transformation → Transformed data → Data mining → Patterns → Interpretation/evaluation → Knowledge

  27. Data Mining R&D
      • New problem identification
      • Data collection and transformation
      • Algorithm design and implementation
      • Evaluation
        – Effectiveness evaluation
        – Efficiency & scalability evaluation
      • Deployment and business solution

  28. Data Mining on Big Data
      “Data is so widely available and so strategically important that the scarce thing is the knowledge to extract wisdom from it” – Hal Varian, Google’s Chief Economist

  29. What Is Big Data?
      • No quantitative definition!
      • “Big data is like teenage sex: everyone talks about it, nobody really knows how to do it, everyone thinks everyone else is doing it, so everyone claims they are doing it...” – Dan Ariely

  30. Data Volume vs. Storage Cost
      • The unit cost of disk storage decreases dramatically
        Year   Unit cost
        1956   $10,000/MB
        1980   $193/MB
        1990   $9/MB
        2000   $6.9/GB
        2010   $0.08/GB
        2013   $0.06/GB
      http://ns1758.ca/winch/winchest.html
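
To make the scale of the drop concrete, the sketch below converts the unit costs on this slide to a common dollars-per-gigabyte basis and compares years. It assumes 1 GB = 1000 MB and reads the 2013 figure as US dollars; `cost_of` and `drop_factor` are illustrative helper names.

```python
MB_PER_GB = 1000  # assumption: decimal units

# Unit costs from the slide, converted to dollars per GB.
unit_cost_per_gb = {
    1956: 10_000 * MB_PER_GB,
    1980: 193 * MB_PER_GB,
    1990: 9 * MB_PER_GB,
    2000: 6.9,
    2010: 0.08,
    2013: 0.06,  # assumption: the slide's "0.06/GB" is in dollars
}


def cost_of(gigabytes, year):
    """Cost in dollars of storing the given volume at that year's unit price."""
    return gigabytes * unit_cost_per_gb[year]


def drop_factor(start_year, end_year):
    """How many times cheaper a gigabyte became between two years."""
    return unit_cost_per_gb[start_year] / unit_cost_per_gb[end_year]


print(f"1 TB in 2013: ${cost_of(1000, 2013):,.2f}")
print(f"1 TB in 1956: ${cost_of(1000, 1956):,.0f}")
print(f"1956 to 2013: about {drop_factor(1956, 2013):.2e} times cheaper")
```

Under these assumptions a terabyte that costs about sixty dollars at the 2013 price would have cost ten billion dollars at the 1956 price.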

  31. Big Data – Volume
      “Data sets with sizes beyond the ability of commonly-used software tools to capture, curate, manage, and process the data within a tolerable elapsed time” – Wikipedia

  32. H1N1 Pandemic Crisis (2009)
      • A new flu virus combining elements of the viruses that cause bird flu and swine flu
      • The US Centers for Disease Control and Prevention (CDC) requested reports from doctors and tabulated the numbers once a week
        – a two-week lag in data collection
      • Google used user search keywords to predict the spread of winter flu
        – A supervised approach based on more than 3 billion search queries every day, examining 450 million different models, using 2007-2008 data from CDC
      • Some things can be done with large-scale data but cannot be done with smaller-scale data
