Principles of Knowledge Discovery in Data
Fall 2004
Chapter 5: Data Summarization
Dr. Osmar R. Zaïane
University of Alberta
Source: Dr. Jiawei Han
Dr. Osmar R. Zaïane, 1999-2004

Summary of Last Chapter
• What is the motivation for ad-hoc mining process?
• What defines a data mining task?
• Can we define an ad-hoc mining language?

Course Content
• Introduction to Data Mining
• Data warehousing and OLAP
• Data cleaning
• Data mining operations
• Data summarization
• Association analysis
• Classification and prediction
• Clustering
• Web Mining
• Spatial and Multimedia Data Mining
• Other topics if time permits

Chapter 4 Objectives
Understand Characterization and Discrimination of data.
See some examples of data summarization.

Data Summarization Outline
• What are summarization and generalization?
• What are the methods for descriptive data mining?
• What is the difference with OLAP?
• Can we discriminate between data classes?

Descriptive vs. Predictive Data Mining
• Descriptive mining: describe concepts or task-relevant data sets in concise, informative, discriminative forms.
• Predictive mining: Based on data and analysis, construct models for the database, and predict the trend and properties of unknown data.
Concept description:
• Characterization: provides a concise and succinct summarization of the given collection of data.
• Comparison: provides descriptions comparing two or more collections of data.

Need for Hierarchies in Descriptive Mining
• Defined by database schema:
– Some attributes naturally form a hierarchy:
• Address (street, city, province, country, continent)
– Some hierarchies are formed with different attribute combinations:
• food (category, brand, content_spec, package_size, price).
• Defined by set-grouping operations (by users/experts).
• { chemistry, math, physics } ⊂ science.
• Generated automatically by data distribution analysis.
• Adjusted automatically based on the existing hierarchy.

Creating Hierarchies
• Schema hierarchy
– Ex: house_number < street < city < province < country
• define hierarchy as [ house_number, street, city, province, country ]
• Instance-based (Set-Grouping Hierarchy):
– Ex: { freshman, ..., senior } ⊂ undergraduate.
define hierarchy statusHier as
level2: {freshman, sophomore, junior, senior} < level1: undergraduate;
level2: {M.Sc, Ph.D} < level1: graduate;
level1: {undergraduate, graduate} < level0: allStatus
• Rule-based:
– undergraduate(x) ∧ gpa(x) > 3.5 → good(x).
• Operation-based:
– aggregation, approximation, clustering, etc.
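
The statusHier set-grouping hierarchy above can be sketched in Python as a map from each concept to its parent. This is only an illustration under stated assumptions: the dictionary encoding and the helper names `ancestors` and `generalize` are hypothetical and are not part of the slides' hierarchy definition language.

```python
# Hypothetical encoding of statusHier: each concept points to its parent.
STATUS_HIER = {
    "freshman": "undergraduate",
    "sophomore": "undergraduate",
    "junior": "undergraduate",
    "senior": "undergraduate",
    "M.Sc": "graduate",
    "Ph.D": "graduate",
    "undergraduate": "allStatus",
    "graduate": "allStatus",
}


def ancestors(value, hier):
    """Return the path from value up to the root, starting with value."""
    path = [value]
    while path[-1] in hier:
        path.append(hier[path[-1]])
    return path


def generalize(value, hier, steps=1):
    """Climb up to `steps` levels; a value with no parent is left unchanged."""
    for _ in range(steps):
        value = hier.get(value, value)
    return value


print(ancestors("freshman", STATUS_HIER))
print(generalize("Ph.D", STATUS_HIER, steps=2))
```

Storing only parent links keeps generalization (climbing) cheap; specialization (drilling down) would need the inverse map, which is left out of this sketch.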

Methods for Automatic Generation of Hierarchies
• Categorical hierarchies: (Cardinality heuristics)
– Observation: the higher the hierarchy level, the smaller the cardinality.
• card(country) < card(state) < card(city).
– There are exceptions, e.g., {day, month, quarter, year}.
– Automatic generation of categorical hierarchies based on cardinality heuristic:
• location: {country, street, city, region, big-region, province}.
• Numerical hierarchies:
– Many algorithms are applicable for generation of hierarchies based on data distribution.
– Range-based vs. distribution-based (different binning methods)

Automatic Generation of Numeric Hierarchies
[Figure: histogram of Count (0 to 40) against Amount (10000 to 90000), with the numeric hierarchy derived from it:]
2000-97000
  2000-25000
    2000-12000
    12000-25000
  25000-97000
    25000-38000
    38000-97000

Dynamic Adjustment of Concept Hierarchies
• Why adjust hierarchies dynamically?
– Different applications may view data differently.
– Example: Geography in the eyes of politicians, researchers, and merchants.
• How to adjust the hierarchy?
– Maximally preserve the given hierarchy shape.
– Node merge and split based on a certain weighted measure (such as count, sum, etc.)
• E.g., small nodes (such as small provinces) should be merged and big nodes should be split.

Automatic Hierarchy Adjustment
[Figure: two concept hierarchies for CANADA, with counts in parentheses]
Original concept hierarchy:
CANADA
  Western
    B.C. (68)
    Prairies
      Alberta (40)
      Manitoba (8)
      Saskatchewan (15)
  Central
    Ontario (212)
    Quebec (97)
  Maritime
    Nova Scotia (15)
    New Brunswick (9)
    Newfoundland (9)
Adjusted concept hierarchy:
CANADA
  Western
    B.C. (68)
    Alberta (40)
    Man+Sas (23)
      Manitoba (8)
      Saskatchewan (15)
  Central
    Ontario (212)
    Quebec (97)
  (Maritime)
    Maritime (33)
      Nova Scotia (15)
      New Brunswick (9)
      Newfoundland (9)
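
The node-merge step in the Canada example above can be sketched as follows. The slides do not state the measure threshold or the merge procedure, so the function `merge_small_siblings` and the cut-off of 20 are assumptions chosen only so that the sketch reproduces the merged counts in the figure (Man+Sas = 23, Maritime = 33). Node splitting and the removal of the Prairies level are not covered.

```python
def merge_small_siblings(children, min_count):
    """Merge sibling nodes whose count is below min_count into one node.

    children: dict mapping node name -> count (the weighted measure).
    Returns a new dict. If fewer than two siblings are small, nothing
    is merged. Otherwise the small siblings are replaced by one node
    named with their names joined by '+', with their counts summed.
    """
    small = {name: c for name, c in children.items() if c < min_count}
    if len(small) < 2:
        return dict(children)
    merged = {name: c for name, c in children.items() if c >= min_count}
    merged["+".join(small)] = sum(small.values())
    return merged


# Counts taken from the original concept hierarchy in the figure.
prairies = {"Alberta": 40, "Manitoba": 8, "Saskatchewan": 15}
maritime = {"Nova Scotia": 15, "New Brunswick": 9, "Newfoundland": 9}

MIN_COUNT = 20  # hypothetical threshold, not given in the slides

print(merge_small_siblings(prairies, MIN_COUNT))
print(merge_small_siblings(maritime, MIN_COUNT))
```

Merging only among siblings is one way to respect the "maximally preserve the given hierarchy shape" guideline, since no node ever moves to a different parent.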

Data Summarization Outline
• What are summarization and generalization?
• What are the methods for descriptive data mining?
• What is the difference with OLAP?
• Can we discriminate between data classes?

Methods of Descriptive Data Mining
• Data cube-based approach:
– Dimensions: Attributes form concept hierarchies
– Measures: sum, count, avg, max, standard-deviation, etc.
– Drilling: generalization and specialization.
– Limitations: dimension/measure types, intelligent analysis.
• Attribute-oriented induction:
– Proposed in 1989 (KDD'89 workshop).
– Not confined to categorical data nor particular measures.
– Can be presented in both table and rule forms.

Basic Principles of Attribute-Oriented Induction
• Data focusing: task-relevant data, including dimensions, and the result is the initial relation.
• Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher level concepts are expressed in terms of other attributes.
• Attribute-generalization: If there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
• Attribute-threshold control: typical 2-8, specified/default.
• Generalized relation threshold control: control the final relation/rule size.

Basic Algorithm for Attribute-Oriented Induction
• InitialRel: Query processing of task-relevant data, deriving the initial relation.
• PreGen: Based on the analysis of the number of distinct values in each attribute, determine generalization plan for each attribute: removal? or how high to generalize?
• PrimeGen: Based on the PreGen plan, perform generalization to the right level to derive a "prime generalized relation".
• Presentation: User interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations.
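
The attribute-removal, attribute-generalization, and attribute-threshold principles above can be sketched as one small Python function. This is a simplified reading, not the published algorithm: it folds PreGen and PrimeGen into a single pass, applies one threshold to every attribute, and omits the generalized-relation threshold and the presentation step. The function name `aoi`, the sample students, and the dictionary encoding of the hierarchies are all hypothetical.

```python
from collections import Counter


def aoi(rows, hierarchies, threshold):
    """Simplified attribute-oriented induction.

    rows: initial relation as a list of dicts with the same keys.
    hierarchies: attribute -> {child: parent} concept hierarchy.
    threshold: maximum number of distinct values kept per attribute.
    Returns (kept_attributes, Counter of generalized tuples).
    """
    if not rows:
        return [], Counter()
    attrs = list(rows[0])
    rows = [dict(r) for r in rows]  # work on a copy
    for a in attrs:
        while len({r[a] for r in rows}) > threshold:
            hier = hierarchies.get(a, {})
            if not any(r[a] in hier for r in rows):
                # Attribute removal: too many values, nothing left to climb.
                for r in rows:
                    del r[a]
                break
            # Attribute generalization: climb one level.
            for r in rows:
                r[a] = hier.get(r[a], r[a])
    kept = [a for a in attrs if a in rows[0]]
    # Identical generalized tuples are merged and their counts accumulated.
    counts = Counter(tuple(r[a] for a in kept) for r in rows)
    return kept, counts


# Hypothetical initial relation (sample data invented for illustration).
students = [
    {"name": "Ann", "status": "freshman", "major": "math"},
    {"name": "Bob", "status": "junior", "major": "physics"},
    {"name": "Cid", "status": "M.Sc", "major": "chemistry"},
    {"name": "Dee", "status": "Ph.D", "major": "math"},
    {"name": "Eve", "status": "senior", "major": "physics"},
]
hierarchies = {
    "status": {
        "freshman": "undergraduate", "sophomore": "undergraduate",
        "junior": "undergraduate", "senior": "undergraduate",
        "M.Sc": "graduate", "Ph.D": "graduate",
        "undergraduate": "allStatus", "graduate": "allStatus",
    },
    "major": {"chemistry": "science", "math": "science", "physics": "science"},
}

kept, relation = aoi(students, hierarchies, threshold=2)
print(kept)
for row, count in relation.items():
    print(row, count)
```

With a threshold of 2, `name` is removed because it has five distinct values and no hierarchy, `status` climbs one level, and `major` collapses to `science`, leaving two generalized tuples with counts 3 and 2.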