Introduction to Data Mining

Introduction
• Motivation: Why data mining?
• What is data mining?
• Data Mining: On what kind of data?
• Data mining functionalities
• Major issues in data mining

Motivation: “Necessity is the Mother of Invention”
• Data explosion problem
  • Automated data collection tools and mature database technology lead to tremendous amounts of data stored in databases, data warehouses and other information repositories
  • There is a tremendous increase in the amount of data recorded and stored on digital media
  • We are producing over two exabytes (10^18 bytes) of data per year
  • Storage capacity, for a fixed price, appears to be doubling approximately every 9 months
• We are drowning in data, but starving for knowledge!
• “The greatest problem of today is how to teach people to ignore the irrelevant, how to refuse to know things, before they are suffocated. For too many facts are as bad as none at all.” (W.H. Auden)
• Solution: Data warehousing and data mining
  • Data warehousing and On-Line Analytical Processing (OLAP)
  • Extraction of interesting knowledge (rules, regularities, patterns, constraints) from data in large databases

Evolution of Database Technology
• 1960s:
  • Data collection, database creation, files
• 1970s:
  • Data access
  • Relational data model (Codd 1970), relational DBMS implementation
• 1980s:
  • SQL (1979: the first system with SQL was produced)
  • RDBMS as a standard, advanced data models (extended-relational, OO, deductive, etc.) and application-oriented DBMS (spatial, temporal, multimedia, etc.)
• 1990s-2000s:
  • Data warehousing (1993 Codd white paper coined the OLAP term)
  • Data mining: Association Rules 1994

“Every time the amount of data increases by a factor of ten, we should totally rethink the way we analyze it” (Jerome Friedman, Data Mining and Statistics: What’s the Connection, paper 1997)

Why Data Mining?
• “The key in business is to know something that nobody else knows.” (Aristotle Onassis)
• “To understand is to perceive patterns.” (Sir Isaiah Berlin)
• We are data rich, but information poor.

What Is Data Mining?
• Knowledge Discovery in Databases
  • Is the non-trivial process of identifying
    • implicit (by contrast to explicit)
    • valid (patterns should be valid on new data)
    • novel (novelty can be measured by comparing to expected values)
    • potentially useful (should lead to useful actions)
    • understandable (to humans)
  • patterns in data
• Data Mining
  • Is a step in the KDD process
• Alternative names:
  • Data Mining: a misnomer? (knowledge mining from data?)
  • Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.

Data Mining and the Knowledge Discovery Process
[Figure: KDD Process. DB → Cleaning and Integration → DW → Selection and Transformation → Data Mining → Evaluation and Presentation → Knowledge]

Steps of a KDD Process
• Data cleaning: missing values, noisy data, and inconsistent data
• Data integration: merging data from multiple data stores
• Data selection: select the data relevant to the analysis
• Data transformation: aggregation (daily sales to weekly or monthly sales) or generalisation (street to city; age to young, middle age and senior)
• Data mining: apply intelligent methods to extract patterns
• Pattern evaluation: interesting patterns should contradict the user’s belief or confirm a hypothesis the user wished to validate
• Knowledge presentation: visualisation and representation techniques to present the mined knowledge to the users

More on the KDD Process
• 60 to 80% of the KDD effort is about preparing the data and the remaining 20 to 40% is about mining
• A data mining project should always start with an analysis of the data with traditional query tools
  • 80% of the interesting information can be extracted using SQL
    • how many transactions per month include item number 15?
    • show me all the items purchased by Sandy Smith.
  • 20% of hidden information requires more advanced techniques
    • which items are frequently purchased together by my customers?
    • how should I classify my customers in order to decide whether future loan applicants will be given a loan or not?
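
The split between questions SQL can answer and questions that need mining can be sketched in Python. This is a minimal illustration under stated assumptions: the `purchases` table, its columns (`tid`, `day`, `item`), the toy rows, and both function names are hypothetical and do not come from the slides. The first function answers "how many transactions per month include item number 15?" with a plain SQL query. The second counts co-occurring item pairs, which is only a simple stand-in for "which items are frequently purchased together?" and not a full association-rule algorithm.

```python
import sqlite3
from collections import Counter
from itertools import combinations

# Hypothetical toy data: (transaction id, purchase date, item number).
ROWS = [
    (1, "2024-01-05", 15), (1, "2024-01-05", 7),
    (2, "2024-01-20", 15), (2, "2024-01-20", 7), (2, "2024-01-20", 3),
    (3, "2024-02-02", 7), (3, "2024-02-02", 3),
    (4, "2024-02-11", 15), (4, "2024-02-11", 7),
]


def transactions_per_month_with_item(rows, item):
    """Traditional query: count distinct transactions per month containing `item`."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE purchases (tid INTEGER, day TEXT, item INTEGER)")
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    result = conn.execute(
        "SELECT substr(day, 1, 7) AS month, COUNT(DISTINCT tid) "
        "FROM purchases WHERE item = ? GROUP BY month ORDER BY month",
        (item,),
    ).fetchall()
    conn.close()
    return result


def frequent_pairs(rows, min_count):
    """Mining-style question: item pairs bought together in at least `min_count` transactions."""
    baskets = {}
    for tid, _day, item in rows:
        baskets.setdefault(tid, set()).add(item)
    counts = Counter()
    for items in baskets.values():
        # Sort so each pair is always counted under the same (low, high) key.
        counts.update(combinations(sorted(items), 2))
    return {pair: n for pair, n in counts.items() if n >= min_count}
```

The first question has a known shape (filter, group, count), so a query language expresses it directly. The second has to examine every combination of items, which is why it falls into the part of the work that needs more advanced techniques.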

Data Mining - Applications
• Market analysis and management
  • Target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
  • Find clusters of “model” customers who share the same characteristics: interest, income level, spending habits, etc.
  • Determine customer purchasing patterns over time
• Risk analysis and management
  • Forecasting, customer retention, improved underwriting, quality control, competitive analysis, credit scoring
• Fraud detection and management
  • Use historical data to build models of fraudulent behavior and use data mining to help identify similar instances
  • Examples
    • auto insurance: detect a group of people who stage accidents to collect on insurance
    • money laundering: detect suspicious money transactions (US Treasury's Financial Crimes Enforcement Network)
    • medical insurance: detect professional patients, rings of doctors and rings of references (ex. a doctor prescribes an expensive drug to a Medicare patient. The patient gets the prescription filled and sells the drug unopened, and it is sold back to the pharmacy)

Fraud Detection and Management
• Detecting inappropriate medical treatment
  • Charging for unnecessary services, e.g. performing $400,000 worth of heart & lung tests on people suffering from no more than a common cold. These tests are done either by the doctor himself or by associates who are part of the scheme. A more common variant involves administering more expensive blanket screening tests, rather than tests for specific symptoms
• Detecting telephone fraud
  • Telephone call model: destination of the call, duration, time of day or week. Analyze patterns that deviate from an expected norm.
  • British Telecom identified discrete groups of callers with frequent intra-group calls, especially mobile phones, and broke a multimillion dollar fraud.
  • ex. an inmate in prison has a friend on the outside set up an account at a local abandoned house. Calls are forwarded to the inmate’s girlfriend three states away. Free calling until the phone company shuts down the account 90 days later.
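
The idea of analyzing call patterns that deviate from an expected norm can be sketched as below. This is only an illustrative assumption about how such a check might look: it models a single attribute (call duration in minutes), uses a z-score against one caller's own history, and picks a threshold of 3 standard deviations arbitrarily. The function name and the sample numbers are hypothetical, and the slides do not say which technique was actually used in practice.

```python
from statistics import mean, stdev


def flag_unusual_calls(history, new_calls, threshold=3.0):
    """Return the new call durations that deviate strongly from the caller's norm.

    history   -- past call durations (minutes) defining the expected norm
    new_calls -- durations to check against that norm
    threshold -- number of standard deviations treated as a deviation (assumed value)
    """
    mu = mean(history)
    sigma = stdev(history)  # sample standard deviation; needs at least 2 values
    if sigma == 0:
        # No variation in the history: anything different from the norm stands out.
        return [d for d in new_calls if d != mu]
    return [d for d in new_calls if abs(d - mu) / sigma > threshold]


# Hypothetical caller whose calls usually last 3 to 6 minutes.
usual = [3, 5, 4, 6, 5, 4, 5, 6]
suspicious = flag_unusual_calls(usual, [5, 7, 95])
```

A fuller call model would combine destination, duration, and time of day or week, as the slide lists, rather than a single number.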

Other Applications
• Sports
  • IBM Advanced Scout analyzed NBA game statistics (shots blocked, assists, and fouls) to gain competitive advantage for New York Knicks and Miami Heat
• Space Science
  • SKICAT automated the analysis of over 3 Terabytes of image data for a sky survey with 94% accuracy
• Internet Web Surf-Aid
  • IBM Surf-Aid applies data mining algorithms to Web access logs for market-related pages to discover customer preferences and behavior, analyze the effectiveness of Web marketing, improve Web site organization, etc.

Data Mining: On What Kind of Data?
• DM should be applicable to any kind of information repository.
• Relational databases
• Data warehouses
• Transactional databases
• Advanced DB and information repositories
  • Object-oriented and object-relational databases
  • Spatial databases
  • Time-series data and temporal data
  • Text databases and multimedia databases
  • Heterogeneous and legacy databases
  • WWW
• Scientific data (DNA)

Data Mining - On What Kind of Data
• Flat files: most common data source; can be text (or HTML) or binary, may contain transactions, statistical data, measurements, etc.
• Relational database: a collection of tables, each of which is assigned a unique name. Each table consists of a set of attributes (columns or fields) and usually stores a large set of tuples (records or rows). Each tuple in a relational table represents an object identified by a unique key and described by a set of attribute values.
• Transactional database: consists of a file where each record represents a transaction.
• Data warehouse: a repository of information collected from multiple sources, stored under a unified schema, and which usually resides at a single site.
• Object-oriented databases: based on the object-oriented programming paradigm, where in general terms, each entity is considered as an object.
• Multimedia databases: usually very high-dimensional
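
The difference between a relational tuple and a transactional record can be sketched in Python. Everything here is a hypothetical illustration: the `customers` table, its attribute names, the `TID: item, item` file layout, and the `parse_transactions` helper are assumptions made for the example, since the slides do not prescribe any concrete format.

```python
# Relational table (hypothetical): each tuple is identified by a unique key
# and described by a set of attribute values.
customers = {
    101: {"name": "Sandy Smith", "age_group": "middle age"},
    102: {"name": "Alex Jones", "age_group": "young"},
}


def parse_transactions(lines):
    """Parse a transactional file where each record is one transaction.

    Assumed record layout: "<transaction id>: <item>, <item>, ...".
    Returns a dict mapping transaction id to its list of item numbers.
    """
    transactions = {}
    for line in lines:
        tid, _sep, items = line.partition(":")
        transactions[tid.strip()] = [
            int(item) for item in items.split(",") if item.strip()
        ]
    return transactions


records = ["T100: 15, 7", "T200: 3", "T300: 15, 7, 3"]
baskets = parse_transactions(records)
```

The relational form has a fixed set of attributes per tuple, while a transactional record carries a variable-length list of items, which is one reason the two kinds of data tend to call for different mining techniques.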