
Introduction to Data Mining - PowerPoint PPT Presentation



  1. Introduction to Data Mining

     Introduction
     • Motivation: Why data mining?
     • What is data mining?
     • Data mining functionalities
     • Major issues in data mining

     Motivation: “Necessity is the Mother of Invention”
     • Data explosion problem
     • Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
     • There is a tremendous increase in the amount of data recorded and stored on digital media
     • We are producing over two exabytes (1 exabyte = 10^18 bytes) of data per year
     • Storage capacity, for a fixed price, appears to be doubling approximately every 9 months

     Motivation: “Necessity is the Mother of Invention”
     • We are drowning in data, but starving for knowledge!
     • “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)
     • Solution: Data warehousing and data mining
       • Data warehousing and On-Line Analytical Processing (OLAP)
       • Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

  2. Largest databases in 2003
     • Commercial databases:
       • Winter Corp. 2003 Survey: France Telecom has largest decision-support DB, ~30 TB; AT&T ~26 TB
     • Web
       • Alexa internet archive: 7 years of data, 500 TB
       • Google searches 4+ Billion pages, many hundreds TB
       • IBM WebFountain, 160 TB (2003)
       • Internet Archive (www.archive.org), ~300 TB

     Data Growth Rate
     • Twice as much information was created in 2002 as in 1999 (~30% growth rate)
     • Other growth rate estimates even higher
     • Very little data will ever be looked at by a human
     • Knowledge Discovery is NEEDED to make sense and use of data.

     “The key in business is to know something that nobody else knows.” (Aristotle Onassis)
     “To understand is to perceive patterns.” (Sir Isaiah Berlin)

     “Every time the amount of data increases by a factor of ten, we should totally rethink the way we analyze it” (Jerome Friedman, “Data Mining and Statistics: What’s the Connection”, paper, 1997)

  3. Problems Suitable for Data-Mining
     • Require knowledge-based decisions
     • Have a changing environment
     • Have sub-optimal current methods
     • Have accessible, sufficient, and relevant data
     • Provide high payoff for the right decisions!
     Privacy considerations are important if personal data is involved

     An Application Example
     • A person buys a book (product) at Amazon.com.
     • Task: Recommend other books (products) this person is likely to buy
     • Amazon does clustering based on books bought:
       • customers who bought “Advances in Knowledge Discovery and Data Mining” also bought “Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations”
     • Recommendation program is quite successful

     What is Data Mining?
     • Knowledge Discovery in Databases
       • Is the non-trivial process of identifying
         • implicit (by contrast to explicit)
         • valid (patterns should be valid on new data)
         • novel (novelty can be measured by comparing to expected values)
         • potentially useful (should lead to useful actions)
         • understandable (to humans)
       • patterns in data
     • Data Mining
       • Is a step in the KDD process

     What Is Data Mining?
     • Alternative names:
       • Data Mining: a misnomer? (knowledge mining from data?)
       • Knowledge discovery (mining) in databases (KDD),
       • knowledge extraction,
       • data/pattern analysis,
       • data archeology,
       • data dredging,
       • information harvesting,
       • business intelligence, etc.
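
The Amazon application example above ("customers who bought X also bought Y") can be illustrated with simple co-purchase counting. This is only a minimal sketch and not Amazon's actual method, which the slide describes simply as clustering based on books bought. The function names, the third book title and the purchase histories are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations

AKDD = "Advances in Knowledge Discovery and Data Mining"
DMPT = "Data Mining: Practical Machine Learning Tools and Techniques"
OTHER = "Hypothetical Book C"  # made-up title, for illustration only


def build_co_purchase_counts(baskets):
    """For every book, count how often each other book was bought by the same customer."""
    counts = defaultdict(Counter)
    for basket in baskets:
        # set() ignores duplicate purchases; permutations yields both (a, b) and (b, a)
        for a, b in permutations(set(basket), 2):
            counts[a][b] += 1
    return counts


def recommend(counts, book, top_n=3):
    """Return the books most often bought together with `book`."""
    ranked = sorted(counts.get(book, Counter()).items(),
                    key=lambda kv: (-kv[1], kv[0]))  # ties broken by title
    return [title for title, _ in ranked[:top_n]]


# Hypothetical purchase histories, one list per customer.
baskets = [
    [AKDD, DMPT],
    [AKDD, DMPT, OTHER],
    [AKDD],
    [OTHER, DMPT],
]

counts = build_co_purchase_counts(baskets)
print(recommend(counts, AKDD))
```

With these made-up histories, the book bought most often alongside the first title is ranked first. A real recommender would also have to handle popularity bias and sparse data, and the privacy considerations noted on the slide.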

  4. Data Mining and the Knowledge Discovery Process

     KDD Process
     [Figure: DB → Cleaning and Integration → DW → Selection and Transformation → Data Mining → Evaluation and Presentation → Knowledge]

     Steps of a KDD Process
     • Data cleaning: missing values, noisy data, and inconsistent data
     • Data integration: merging data from multiple data stores
     • Data selection: select the data relevant to the analysis
     • Data transformation: aggregation (daily sales to weekly or monthly sales) or generalisation (street to city; age to young, middle age and senior)
     • Data mining: apply intelligent methods to extract patterns
     • Pattern evaluation: interesting patterns should contradict the user’s belief or confirm a hypothesis the user wished to validate
     • Knowledge presentation: visualisation and representation techniques to present the mined knowledge to the users

     More on the KDD Process
     • 60 to 80% of the KDD effort is about preparing the data and the remaining 20% is about mining
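
The data preparation steps listed under "Steps of a KDD Process" can be sketched in a few lines. This is a minimal illustration under stated assumptions: the record layout (`day`, `amount`), the function names and the age cut-offs are all hypothetical and do not come from the slides.

```python
from collections import defaultdict
from datetime import date


def clean(records):
    """Data cleaning: drop records whose amount is missing or negative (noisy)."""
    return [r for r in records
            if r["amount"] is not None and r["amount"] >= 0]


def weekly_sales(records):
    """Aggregation: roll daily sales up to totals per ISO (year, week)."""
    totals = defaultdict(float)
    for r in records:
        iso = r["day"].isocalendar()  # (ISO year, ISO week, ISO weekday)
        totals[(iso[0], iso[1])] += r["amount"]
    return dict(totals)


def age_group(age, young_max=29, middle_max=59):
    """Generalisation: map an exact age to a coarse category.

    The cut-offs are assumptions chosen for this example only.
    """
    if age <= young_max:
        return "young"
    if age <= middle_max:
        return "middle age"
    return "senior"


# Hypothetical daily sales records; one has a missing value.
records = [
    {"day": date(2003, 1, 6), "amount": 10.0},
    {"day": date(2003, 1, 7), "amount": 20.0},
    {"day": date(2003, 1, 8), "amount": None},
    {"day": date(2003, 1, 13), "amount": 5.0},
]

print(weekly_sales(clean(records)))
print([age_group(a) for a in (22, 45, 70)])
```

Dropping incomplete records is only one possible cleaning policy; imputing missing values is another, and the right choice depends on the data.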

  5. More on the KDD Process
     • A data mining project should always start with an analysis of the data with traditional query tools
       • 80% of the interesting information can be extracted using SQL
         • how many transactions per month include item number 15?
         • show me all the items purchased by Sandy Smith.
       • 20% of hidden information requires more advanced techniques
         • which items are frequently purchased together by my customers?
         • how should I classify my customers in order to decide whether future loan applicants will be given a loan or not?

     Data Mining Applications

     Data Mining - Applications
     • Fraud detection and management
       • Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
       • Examples
         • auto insurance: detect a group of people who stage accidents to collect on insurance
         • money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
         • medical insurance: detect professional patients and ring of doctors and ring of references (ex. doc. prescribes expensive drug to a Medicare patient. Patient gets prescription filled, gets drug and sells drug unopened, which is sold back to pharmacy)

     Data Mining - Applications
     • Market analysis and management
       • Target marketing, customer relation management, market basket analysis, cross selling, market segmentation
       • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
       • Determine customer purchasing patterns over time
     • Risk analysis and management
       • Forecasting, customer retention, improved underwriting, quality control, competitive analysis, credit scoring
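
The contrast drawn on the "More on the KDD Process" slide, between questions a plain SQL query can answer and questions that need a mining step, can be sketched with Python's built-in `sqlite3` module. The `purchases` table, its columns, the customer "Lee Wong" and all rows are hypothetical. Pair counting is used here as a simple stand-in for a frequent-itemset algorithm, which the slides do not name.

```python
import sqlite3
from collections import Counter, defaultdict
from itertools import combinations

# Hypothetical rows: (transaction_id, purchase_date, customer, item_id)
ROWS = [
    (1, "2003-01-05", "Sandy Smith", 15), (1, "2003-01-05", "Sandy Smith", 7),
    (2, "2003-01-20", "Lee Wong", 15),    (2, "2003-01-20", "Lee Wong", 7),
    (3, "2003-02-02", "Sandy Smith", 15), (3, "2003-02-02", "Sandy Smith", 9),
    (4, "2003-02-10", "Lee Wong", 7),
]


def load_example_db():
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (transaction_id INTEGER, "
                 "purchase_date TEXT, customer TEXT, item_id INTEGER)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", ROWS)
    return conn


def monthly_transactions_with_item(conn, item_id):
    """SQL question: how many transactions per month include a given item?"""
    sql = """
        SELECT strftime('%Y-%m', purchase_date) AS month,
               COUNT(DISTINCT transaction_id)
        FROM purchases
        WHERE item_id = ?
        GROUP BY month
        ORDER BY month
    """
    return conn.execute(sql, (item_id,)).fetchall()


def items_bought_by(conn, customer):
    """SQL question: show all the items purchased by one customer."""
    sql = ("SELECT DISTINCT item_id FROM purchases "
           "WHERE customer = ? ORDER BY item_id")
    return [row[0] for row in conn.execute(sql, (customer,))]


def frequent_pairs(conn, min_count=2):
    """Mining question: which item pairs occur together in at least min_count transactions?"""
    baskets = defaultdict(set)
    for tid, item in conn.execute(
            "SELECT transaction_id, item_id FROM purchases"):
        baskets[tid].add(item)
    pair_counts = Counter()
    for items in baskets.values():
        pair_counts.update(combinations(sorted(items), 2))
    return {pair: n for pair, n in pair_counts.items() if n >= min_count}


conn = load_example_db()
print(monthly_transactions_with_item(conn, 15))
print(items_bought_by(conn, "Sandy Smith"))
print(frequent_pairs(conn))
```

The first two functions are single queries. The third has to enumerate item combinations, which is why it scales poorly and why dedicated frequent-itemset algorithms exist.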

  6. Fraud Detection and Management
     • Detecting inappropriate medical treatment
       • Charging for unnecessary services, e.g. performing $400,000 worth of heart & lung tests on people suffering from no more than a common cold. These tests are done either by the doctor himself or by associates who are part of the scheme. A more common variant involves administering more expensive blanket screening tests, rather than tests for specific symptoms

     Fraud Detection and Management
     • Detecting telephone fraud
       • Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
       • British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
       • ex. an inmate in prison has a friend on the outside set up an account at a local abandoned house. Calls are forwarded to inmate’s girlfriend three states away. Free calling until phone company shuts down account 90 days later.

     Other Applications
     • Sports
       • IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
     • Space Science:
       • SKICAT automated the analysis of over 3 Terabytes of image data for a sky survey with 94% accuracy
     • Internet Web Surf-Aid
       • IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preference and behavior pages, analyzing effectiveness of Web marketing, improving Web site organization, etc.

     Data Mining: On What Kind of Data?
     • DM should be applicable to any kind of info. repository.
     • Relational databases
     • Data warehouses
     • Transactional databases
     • Advanced DB and information repositories
       • Object-oriented and object-relational databases
       • Spatial databases
       • Time-series data and temporal data
       • Text databases and multimedia databases
       • Heterogeneous and legacy databases
       • WWW
       • Scientific data (DNA)
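
The telephone fraud slide's idea of analyzing "patterns that deviate from an expected norm" can be sketched with a z-score test on call duration. This is a deliberately simplified illustration, not the method used by British Telecom or any real system. The function name, the threshold of three standard deviations and the call durations are assumptions, and a realistic call model would also use destination and time of day.

```python
from statistics import mean, pstdev


def flag_unusual_calls(history, new_calls, threshold=3.0):
    """Return the new call durations that deviate strongly from a caller's norm.

    history   -- past call durations in minutes, taken as the expected norm
    new_calls -- durations to check against that norm
    threshold -- number of standard deviations treated as suspicious (assumed)
    """
    mu = mean(history)
    sigma = pstdev(history)
    if sigma == 0:
        # No variation in the history: anything different is a deviation.
        return [d for d in new_calls if d != mu]
    return [d for d in new_calls if abs(d - mu) / sigma > threshold]


# Hypothetical durations in minutes for one caller.
history = [4, 5, 6, 5, 4, 6, 5, 5]
new_calls = [5, 6, 90]

print(flag_unusual_calls(history, new_calls))
```

A flagged call is only a candidate for review. Deciding whether it is actually fraudulent still needs the evaluation step of the KDD process.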
