Introduction
Motivation: Business Intelligence
Customer information (customer-id, gender, age, home-address, occupation, income, family-size, …)
Product information (product-id, category, manufacturer, made-in, stock-price, …)
Sales information (customer-id, product-id, #units, unit-price, sales-representative, …)
Business queries:
Jian Pei: CMPT 741/459 Data Mining -- Introduction 2
Techniques: Business Intelligence
• Multidimensional data analysis
• Online query answering
• Interactive data exploration
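As a rough illustration of multidimensional data analysis, the sketch below sums revenue over a few sales facts at different levels of detail. The records, the choice of dimensions, and the helper name `roll_up` are all hypothetical; only the idea of grouping sales by customer and product attributes comes from the slides.

```python
from collections import defaultdict

# Hypothetical sales facts: (customer gender, product category, #units, unit price).
sales = [
    ("F", "grocery", 3, 2.50),
    ("M", "grocery", 1, 2.50),
    ("F", "electronics", 1, 199.00),
    ("M", "electronics", 2, 199.00),
    ("F", "grocery", 2, 4.00),
]

def roll_up(facts, dims):
    """Sum revenue grouped by the chosen dimensions (0 = gender, 1 = category)."""
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[d] for d in dims)
        totals[key] += fact[2] * fact[3]  # revenue = units * unit price
    return dict(totals)

# The same facts viewed at three levels of detail.
by_gender_and_category = roll_up(sales, dims=(0, 1))
by_category = roll_up(sales, dims=(1,))
grand_total = roll_up(sales, dims=())
```

Dropping a dimension from `dims` gives a coarser view of the same data, which is the kind of step an analyst repeats during interactive exploration.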
Motivation: Store Layout Design
http://buildipedia.com/images/masterformat/Channels/In_Studio/Todays_Grocery_Store/Todays_Grocery_Store_Layout-Figure_B.jpg
Techniques: Store Layout Design
• Customer purchase patterns
• Business strategies
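One simple way to look for customer purchase patterns is to count which pairs of products are bought together. The sketch below is only an illustration: the baskets, the support threshold, and the helper name `frequent_pairs` are hypothetical, and the slides do not name a specific algorithm.

```python
from collections import Counter
from itertools import combinations

# Hypothetical shopping baskets, one set of items per customer visit.
baskets = [
    {"bread", "milk"},
    {"bread", "butter", "milk"},
    {"beer", "chips"},
    {"bread", "butter"},
    {"beer", "chips", "milk"},
]

def frequent_pairs(baskets, min_support):
    """Return item pairs that appear together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # Sort so that each pair is always counted under the same key.
        for pair in combinations(sorted(basket), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

pairs = frequent_pairs(baskets, min_support=2)
```

Pairs that occur together often are candidates for placing the products near each other, or deliberately far apart, depending on the business strategy.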
Motivation: Community Detection
http://image.slidesharecdn.com/communitydetectionitilecture22june2011-110622095259-phpapp02/95/community-detection-in-social-media-1-728.jpg?cb=1308736811
Techniques: Community Detection
• Similarity between objects
• Partitioning objects into groups
  – No guidance about what a group is
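A minimal sketch of partitioning without guidance: link any two objects whose similarity reaches a threshold and treat each connected component as a group. Jaccard similarity, the threshold of 0.5, the sample data, and the helper names are assumptions made for illustration, not the method prescribed by the slides.

```python
def jaccard(a, b):
    """Similarity of two non-empty sets: size of intersection over size of union."""
    return len(a & b) / len(a | b)

def group_by_similarity(objects, threshold):
    names = list(objects)
    # Link every pair of objects whose similarity reaches the threshold.
    neighbors = {name: set() for name in names}
    for i, x in enumerate(names):
        for y in names[i + 1:]:
            if jaccard(objects[x], objects[y]) >= threshold:
                neighbors[x].add(y)
                neighbors[y].add(x)
    # Each connected component of the resulting graph is one group.
    groups, seen = [], set()
    for start in names:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(neighbors[node] - component)
        seen |= component
        groups.append(sorted(component))
    return groups

# Hypothetical users described by their interests.
interests = {
    "ann": {"hiking", "jazz", "chess"},
    "bob": {"hiking", "jazz"},
    "cat": {"jazz", "chess"},
    "dan": {"soccer", "poker"},
    "eve": {"soccer", "poker", "golf"},
}
communities = group_by_similarity(interests, threshold=0.5)
```

No group labels are supplied anywhere; the groups emerge only from the similarity measure and the threshold, so changing either can change the partition.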
Motivation: Disease Prediction
What medical problems does this patient have?
Symptoms: overweight, high blood pressure, back pain, shortness of breath, chest pain, cold sweat, …
Techniques: Disease Prediction
• Features
• Model
Motivation: Fraud Detection
http://i.imgur.com/ckkoAOp.gif
Techniques: Fraud Detection
• Features
• Dissimilarity
• Groups and noise
http://i.stack.imgur.com/tRDGU.png
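The three ingredients above can be sketched with a simple distance-based check: describe each transaction by features, measure dissimilarity as Euclidean distance, and flag as noise any point with too few close neighbors. The features, the radius, the neighbor count, and the helper name `find_noise` are hypothetical, and the sketch assumes the features are on comparable scales.

```python
import math

def find_noise(points, radius, min_neighbors):
    """Return the points that have fewer than min_neighbors other points within radius."""
    noise = []
    for i, p in enumerate(points):
        close = sum(
            1
            for j, q in enumerate(points)
            if i != j and math.dist(p, q) <= radius
        )
        if close < min_neighbors:
            noise.append(p)
    return noise

# Hypothetical transactions described by two features: (amount in dollars, hour of day).
transactions = [(20, 9), (22, 10), (19, 11), (21, 9), (950, 3)]
suspicious = find_noise(transactions, radius=5.0, min_neighbors=2)
```

The four similar transactions form a group, while the large purchase at 3 a.m. belongs to no group and is reported as noise, which makes it a candidate for closer inspection rather than proof of fraud.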
What Is Data Science About?
• Data
• Extraction of knowledge from data
• Continuation of data mining and knowledge discovery from data (KDD)
What Is Data?
• Values of qualitative or quantitative variables belonging to a set of items
• Represented in a structure, e.g., a tabular, tree, or graph structure
• Typically the results of measurements
• As an abstract concept, data can be viewed as the lowest level of abstraction, from which information and then knowledge are derived
What Is Information?
• “Knowledge communicated or received concerning a particular fact or circumstance”
• Conceptually, information is the message (utterance or expression) being conveyed
• Cannot be predicted
• Can resolve uncertainty
What Is Knowledge?
• Familiarity with someone or something, which can include facts, information, descriptions, or skills acquired through experience or education
• Implicit knowledge: practical skill or expertise
• Explicit knowledge: theoretical understanding of a subject
Data Systems
• A data system answers queries based on data acquired in the past
• Base data – the rawest data, not derived from anywhere else
• Knowledge – information derived from the base data
Dealing with Data – Querying
• Given a set of student records about name, age, courses taken and grades
• Simple queries
  – What is John Doe’s age?
• Aggregate queries
  – What is the average GPA of all students at this school?
• Queries can be arbitrarily complicated
  – Find the students X and Y whose grades are less than 3% apart in as many courses as possible
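The three kinds of queries above can be sketched over a small in-memory set of records. Everything here is hypothetical: the students other than John Doe, the course names, the grades, and the reading of "less than 3% apart" as an absolute difference of fewer than 3 percentage points. The aggregate uses average percentage grades as a stand-in for GPA.

```python
from itertools import combinations

# Hypothetical student records: age plus percentage grades per course.
students = {
    "John Doe": {"age": 20, "grades": {"CMPT 354": 80, "CMPT 459": 90, "MATH 232": 70}},
    "Jane Roe": {"age": 22, "grades": {"CMPT 354": 82, "CMPT 459": 91, "MATH 232": 61}},
    "Sam Lee": {"age": 21, "grades": {"CMPT 354": 70, "CMPT 459": 88}},
}

# Simple query: what is John Doe's age?
john_age = students["John Doe"]["age"]

# Aggregate query: average over all students of each student's mean grade.
def mean_grade(name):
    grades = students[name]["grades"].values()
    return sum(grades) / len(grades)

school_average = sum(mean_grade(name) for name in students) / len(students)

# Complicated query: the pair of students whose grades differ by fewer than
# `tolerance` points in as many shared courses as possible.
def close_courses(a, b, tolerance=3):
    shared = students[a]["grades"].keys() & students[b]["grades"].keys()
    return sum(
        1
        for course in shared
        if abs(students[a]["grades"][course] - students[b]["grades"][course]) < tolerance
    )

best_pair = max(combinations(students, 2), key=lambda pair: close_courses(*pair))
```

The first two queries map directly onto a lookup and an aggregate, while the last one compares every pair of students, which hints at why complicated queries quickly become expensive on large data.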
Queries
• A precise request for information
• Subjects in databases and information retrieval
  – Databases: structured queries on structured (e.g., relational) data
  – Information retrieval: unstructured queries on unstructured (e.g., text, image) data
• Important assumptions
  – Information needs
  – Query languages
Data-driven Exploration
• What should be the next strategy of a company?
  – A lot of data: sales, human resources, production, tax, service cost, …
• The question cannot be translated into a precise request for information (i.e., a query)
• Developing familiarity (knowledge) and actionable items (decisions) by interactively analyzing data
Data-driven Thinking
• Starting with some simple queries
• New queries are raised by consuming the results of previous queries
• No ultimate query in design!
  – But many queries can be answered using DB/IR techniques
The Art of Data-driven Thinking
• The way of generating queries remains an art!
  – Different people may derive different results using the same data
“If you torture the data long enough, it will confess” – Ronald H. Coase
• More often than not, more data may be needed – datafication
Queries for Data-driven Thinking
• Probe queries – finding information about specific individuals
• Aggregation – finding information about groups
• Pattern finding – finding commonality in population
• Association and correlation – finding connections among individuals and groups
• Causality analysis – finding causes and consequences
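As one example of an association and correlation query, the sketch below computes the Pearson correlation coefficient between two columns. The variables `ad_spend` and `units_sold` and their values are hypothetical, and Pearson correlation is just one common measure, chosen here for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical monthly figures for one product.
ad_spend = [1, 2, 3, 4, 5]
units_sold = [12, 15, 15, 18, 20]
r = pearson(ad_spend, units_sold)
```

A value of `r` close to 1 indicates a strong linear association between the two columns. It does not by itself show that one causes the other, which is why causality analysis is listed as a separate kind of query.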
What Is Data Mining?
• Broader sense: the art of data-driven thinking
• Technical sense: the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data [Fayyad, Piatetsky-Shapiro, Smyth, 96]
  – Methods and tools of answering various types of queries in the data mining process in the broader sense
Machine Learning
“A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E” – Tom M. Mitchell
• Essentially, learn the distribution of data
Data Mining vs. Machine Learning
• Machine learning focuses on prediction, based on known properties learned from the training data
• Data mining focuses on the discovery of (previously) unknown properties in the data
The KDD Process
Data → (Selection) → Target data → (Preprocessing) → Preprocessed data → (Transformation) → Transformed data → (Data mining) → Patterns → (Interpretation/evaluation) → Knowledge
Data Mining R&D
• New problem identification
• Data collection and transformation
• Algorithm design and implementation
• Evaluation
  – Effectiveness evaluation
  – Efficiency & scalability evaluation
• Deployment and business solution
Data Mining on Big Data
“Data is so widely available and so strategically important that the scarce thing is the knowledge to extract wisdom from it” – Hal Varian, Google’s Chief Economist
What Is Big Data?
• No quantitative definition!
• “Big data is like teenage sex
  – everyone talks about it,
  – nobody really knows how to do it,
  – everyone thinks everyone else is doing it,
  – so everyone claims they are doing it...” – Dan Ariely
Data Volume vs. Storage Cost
• The unit cost of disk storage has decreased dramatically
Year    Unit cost
1956    $10,000/MB
1980    $193/MB
1990    $9/MB
2000    $6.9/GB
2010    $0.08/GB
2013    $0.06/GB
http://ns1758.ca/winch/winchest.html
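To compare the figures above on one scale, the sketch below converts every entry to dollars per GB and computes the overall decrease. It assumes decimal units (1 GB = 1000 MB) and reads the 2013 entry as dollars per GB; the helper name `fold_decrease` is hypothetical.

```python
MB_PER_GB = 1000  # assumption: decimal units; use 1024 for binary units

# Unit costs from the slide, all expressed in dollars per GB.
cost_per_gb = {
    1956: 10_000 * MB_PER_GB,
    1980: 193 * MB_PER_GB,
    1990: 9 * MB_PER_GB,
    2000: 6.9,
    2010: 0.08,
    2013: 0.06,
}

def fold_decrease(start_year, end_year):
    """How many times cheaper one GB is in end_year than in start_year."""
    return cost_per_gb[start_year] / cost_per_gb[end_year]

overall = fold_decrease(1956, 2013)
```

Under these assumptions, `overall` comes out at roughly 167 million, meaning one GB in 2013 cost about 167 million times less than in 1956.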
Big Data – Volume
“Data sets with sizes beyond the ability of commonly-used software tools to capture, curate, manage, and process the data within a tolerable elapsed time” – Wikipedia
H1N1 Pandemic Crisis (2009)
• A new flu virus combining elements of the viruses that cause bird flu and swine flu
• The US Centers for Disease Control and Prevention (CDC) requested reports from doctors and tabulated the numbers once a week
  – A two-week lag in data collection
• Google used user search keywords to predict the spread of winter flu
  – A supervised approach based on more than 3 billion search queries every day, examining 450 million different models, using 2007-2008 data from the CDC
• Some things can be done with large-scale data that cannot be done with smaller-scale data