Principles of Knowledge Discovery in Data, Chapter 5: Data Summarization

Principles of Knowledge Discovery in Data
Fall 2004
Chapter 5: Data Summarization
Dr. Osmar R. Zaïane, University of Alberta
Source: Dr. Jiawei Han
© Dr. Osmar R. Zaïane, 1999-2004

Summary of Last Chapter
• What is the motivation for an ad-hoc mining process?
• What defines a data mining task?
• Can we define an ad-hoc mining language?

Course Content
• Introduction to Data Mining
• Data warehousing and OLAP
• Data cleaning
• Data mining operations
• Data summarization
• Association analysis
• Classification and prediction
• Clustering
• Web Mining
• Spatial and Multimedia Data Mining
• Other topics if time permits

Chapter 5 Objectives
• Understand characterization and discrimination of data.
• See some examples of data summarization.

Data Summarization Outline
• What are summarization and generalization?
• What are the methods for descriptive data mining?
• What is the difference with OLAP?
• Can we discriminate between data classes?

Descriptive vs. Predictive Data Mining
• Descriptive mining: describe concepts or task-relevant data sets in concise, informative, discriminative forms.
• Predictive mining: based on data and analysis, construct models for the database, and predict the trend and properties of unknown data.
Concept description:
• Characterization: provides a concise and succinct summarization of the given collection of data.
• Comparison: provides descriptions comparing two or more collections of data.

Need for Hierarchies in Descriptive Mining
• Defined by database schema:
  – Some attributes naturally form a hierarchy:
    • Address (street, city, province, country, continent)
  – Some hierarchies are formed with different attribute combinations:
    • food (category, brand, content_spec, package_size, price).
• Defined by set-grouping operations (by users/experts).
    • { chemistry, math, physics } ⊂ science.
• Generated automatically by data distribution analysis.
• Adjusted automatically based on the existing hierarchy.

Creating Hierarchies
• Schema hierarchy
  – Ex: house_number < street < city < province < country
    • define hierarchy as [ house_number, street, city, province, country ]
• Instance-based (Set-Grouping Hierarchy):
  – Ex: { freshman, ..., senior } ⊂ undergraduate.
    define hierarchy statusHier as
      level2: {freshman, sophomore, junior, senior} < level1: undergraduate;
      level2: {M.Sc, Ph.D} < level1: graduate;
      level1: {undergraduate, graduate} < level0: allStatus
• Rule-based:
  – undergraduate(x) ∧ gpa(x) > 3.5 → good(x).
• Operation-based:
  – aggregation, approximation, clustering, etc.

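The set-grouping and rule-based hierarchies on the "Creating Hierarchies" slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the PARENT dictionary encoding and the helper names generalize, ancestors and is_good are hypothetical and do not come from the slides; only the concepts and the rule itself do.

```python
# Hypothetical encoding of the statusHier set-grouping hierarchy from the slide.
# Each concept maps to its parent one level up; "allStatus" is the root (level0).
PARENT = {
    "freshman": "undergraduate",
    "sophomore": "undergraduate",
    "junior": "undergraduate",
    "senior": "undergraduate",
    "M.Sc": "graduate",
    "Ph.D": "graduate",
    "undergraduate": "allStatus",
    "graduate": "allStatus",
}


def generalize(value, steps=1):
    """Climb `steps` levels up the hierarchy, stopping at the root."""
    for _ in range(steps):
        if value not in PARENT:
            break
        value = PARENT[value]
    return value


def ancestors(value):
    """Return the path from `value` up to the root, most specific first."""
    path = [value]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def is_good(status, gpa):
    """Rule-based concept: undergraduate(x) and gpa(x) > 3.5 implies good(x)."""
    return "undergraduate" in ancestors(status) and gpa > 3.5
```

With this encoding, generalize("junior") climbs one level to "undergraduate", and is_good("senior", 3.8) holds because "senior" generalizes to "undergraduate" and the GPA exceeds 3.5.
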
Methods for Automatic Generation of Hierarchies
• Categorical hierarchies: (Cardinality heuristics)
  – Observation: the higher the level in the hierarchy, the smaller the cardinality.
    • card(country) < card(state) < card(city).
  – There are exceptions, e.g., {day, month, quarter, year}.
  – Automatic generation of categorical hierarchies based on the cardinality heuristic:
    • location: {country, street, city, region, big-region, province}.
• Numerical hierarchies:
  – Many algorithms are applicable for generation of hierarchies based on data distribution.
  – Range-based vs. distribution-based (different binning methods)

Automatic Generation of Numeric Hierarchies
[Figure: histogram of Count (0 to 40) against Amount (10000 to 90000), with the generated range hierarchy below]
2000-97000
  2000-25000
    2000-12000
    12000-25000
  25000-97000
    25000-38000
    38000-97000

Dynamic Adjustment of Concept Hierarchies
• Why adjust hierarchies dynamically?
  – Different applications may view data differently.
  – Example: Geography in the eyes of politicians, researchers, and merchants.
• How to adjust the hierarchy?
  – Maximally preserve the given hierarchy shape.
  – Node merge and split based on a certain weighted measure (such as count, sum, etc.)
    • E.g., small nodes (such as small provinces) should be merged and big nodes should be split.

Automatic Hierarchy Adjustment
Original concept hierarchy:
CANADA
  Western
    B.C. (68)
    Prairies
      Alberta (40)
      Manitoba (8)
      Saskatchewan (15)
  Central
    Ontario (212)
    Quebec (97)
  Maritime
    Nova Scotia (15)
    New Brunswick (9)
    Newfoundland (9)
Adjusted concept hierarchy:
CANADA
  Western
    B.C. (68)
    Alberta (40)
    Man+Sas (23)
      Manitoba (8)
      Saskatchewan (15)
  Central
    Ontario (212)
    Quebec (97)
  (Maritime)
    Maritime (33)
      Nova Scotia (15)
      New Brunswick (9)
      Newfoundland (9)
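
The merge half of the hierarchy adjustment described above can be sketched as follows. The slides do not state a merge rule or a threshold, so both are assumptions here: the hypothetical function merge_small_siblings combines every sibling whose count falls below min_count, and min_count=20 is chosen only because it reproduces the counts on the "Automatic Hierarchy Adjustment" slide (Man+Sas = 23, Maritime = 33). Splitting of big nodes is not covered.

```python
def merge_small_siblings(children, min_count):
    """Merge sibling nodes whose count is below min_count into one combined node.

    children: dict mapping node name -> count for the children of one parent.
    Returns a new dict; nodes at or above min_count are kept unchanged.
    """
    kept = {name: c for name, c in children.items() if c >= min_count}
    small = {name: c for name, c in children.items() if c < min_count}
    if len(small) > 1:
        # Combine all small siblings into a single node and add up their counts.
        kept["+".join(small)] = sum(small.values())
    else:
        # A lone small node has nothing to merge with, so it stays as it is.
        kept.update(small)
    return kept


# Counts taken from the slide; the threshold of 20 is an assumption.
prairies = {"Alberta": 40, "Manitoba": 8, "Saskatchewan": 15}
maritime = {"Nova Scotia": 15, "New Brunswick": 9, "Newfoundland": 9}

adjusted_prairies = merge_small_siblings(prairies, min_count=20)
adjusted_maritime = merge_small_siblings(maritime, min_count=20)
```

Under this assumed rule Alberta survives on its own, Manitoba and Saskatchewan are merged into one node of 23, and the three Maritime provinces collapse into one node of 33.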

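The cardinality heuristic for categorical hierarchies can be illustrated with a short sketch. The function name order_by_cardinality and all distinct-value counts below are hypothetical; the slides give only the ordering principle and the {day, month, quarter, year} exception.

```python
def order_by_cardinality(distinct_counts):
    """Order attributes from most general (fewest distinct values) to most specific."""
    return sorted(distinct_counts, key=distinct_counts.get)


# Hypothetical distinct-value counts for three location attributes.
location_counts = {"city": 3567, "country": 15, "state": 365}
location_order = order_by_cardinality(location_counts)

# Hypothetical counts for date attributes: day of month, month, quarter,
# and a database covering 20 years.
date_counts = {"day": 31, "month": 12, "quarter": 4, "year": 20}
date_order = order_by_cardinality(date_counts)
```

For the location attributes the heuristic places country above state above city, as expected. For the date attributes it places quarter and month above year, which is wrong and shows why the slide lists dates as an exception: the heuristic proposes a hierarchy that still needs review.
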
Data Summarization Outline
• What are summarization and generalization?
• What are the methods for descriptive data mining?
• What is the difference with OLAP?
• Can we discriminate between data classes?

Methods of Descriptive Data Mining
• Data cube-based approach:
  – Dimensions: attributes form concept hierarchies.
  – Measures: sum, count, avg, max, standard-deviation, etc.
  – Drilling: generalization and specialization.
  – Limitations: dimension/measure types, intelligent analysis.
• Attribute-oriented induction:
  – Proposed in 1989 (KDD'89 workshop).
  – Not confined to categorical data or to particular measures.
  – Can be presented in both table and rule forms.

Basic Principles of Attribute-Oriented Induction
• Data focusing: select the task-relevant data, including dimensions; the result is the initial relation.
• Attribute-removal: remove attribute A if there is a large set of distinct values for A but (1) there is no generalization operator on A, or (2) A's higher level concepts are expressed in terms of other attributes.
• Attribute-generalization: if there is a large set of distinct values for A, and there exists a set of generalization operators on A, then select an operator and generalize A.
• Attribute-threshold control: the maximum number of distinct values per attribute, typically 2-8, user-specified or default.
• Generalized relation threshold control: control the final relation/rule size.

Basic Algorithm for Attribute-Oriented Induction
• InitialRel: query processing of task-relevant data, deriving the initial relation.
• PreGen: based on the analysis of the number of distinct values in each attribute, determine the generalization plan for each attribute: removal, or how high to generalize?
• PrimeGen: based on the PreGen plan, perform generalization to the right level to derive a "prime generalized relation".
• Presentation: user interaction: (1) adjust levels by drilling, (2) pivoting, (3) mapping into rules, cross tabs, visualization presentations.
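
The attribute-removal, attribute-generalization and threshold steps above can be put together in a minimal sketch. It is not the authors' implementation: the function name, the row and hierarchy encodings, and the student data are all hypothetical, and the generalized relation threshold and the presentation step are left out. The status and major hierarchies reuse concepts that appear on the earlier hierarchy slides.

```python
from collections import Counter


def attribute_oriented_induction(rows, hierarchies, threshold):
    """Generalize rows until each kept attribute has at most `threshold` distinct values.

    rows: list of dicts sharing the same keys (the initial relation).
    hierarchies: attribute -> {value: parent value}; an attribute missing
        from this dict has no generalization operator.
    Returns a Counter mapping each generalized tuple to its count.
    """
    rows = [dict(r) for r in rows]  # work on a copy of the initial relation
    attributes = list(rows[0]) if rows else []
    for attr in attributes:
        while len({r[attr] for r in rows}) > threshold:
            parent = hierarchies.get(attr)
            if parent is None or not any(r[attr] in parent for r in rows):
                # Attribute removal: too many values and nothing left to climb.
                for r in rows:
                    del r[attr]
                break
            # Attribute generalization: replace each value by its parent concept.
            for r in rows:
                r[attr] = parent.get(r[attr], r[attr])
    # Merge identical generalized tuples and accumulate a count measure.
    return Counter(tuple(sorted(r.items())) for r in rows)


# Hypothetical initial relation and hierarchies.
students = [
    {"name": "Ann", "status": "freshman", "major": "chemistry"},
    {"name": "Bob", "status": "junior", "major": "math"},
    {"name": "Cho", "status": "M.Sc", "major": "physics"},
    {"name": "Dee", "status": "Ph.D", "major": "math"},
    {"name": "Eli", "status": "senior", "major": "physics"},
]
hierarchies = {
    "status": {
        "freshman": "undergraduate", "sophomore": "undergraduate",
        "junior": "undergraduate", "senior": "undergraduate",
        "M.Sc": "graduate", "Ph.D": "graduate",
        "undergraduate": "allStatus", "graduate": "allStatus",
    },
    "major": {"chemistry": "science", "math": "science", "physics": "science"},
}
result = attribute_oriented_induction(students, hierarchies, threshold=2)
```

With a threshold of 2, name is removed because it has five distinct values and no hierarchy, status is generalized one level, and major is generalized to science. The five input tuples merge into two generalized tuples, with counts 3 (undergraduate) and 2 (graduate).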

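The data cube-based approach listed under "Methods of Descriptive Data Mining" can be sketched as a single roll-up. The fact rows, the city-to-province mapping and the function name roll_up are hypothetical; the sketch only shows how generalizing a dimension one level regroups the measures named on the slide.

```python
from collections import defaultdict


def roll_up(facts, dimension, parent, measure):
    """Aggregate `measure` after generalizing `dimension` one level via `parent`."""
    groups = defaultdict(list)
    for fact in facts:
        groups[parent[fact[dimension]]].append(fact[measure])
    return {
        key: {
            "count": len(values),
            "sum": sum(values),
            "avg": sum(values) / len(values),
            "max": max(values),
        }
        for key, values in groups.items()
    }


# Hypothetical fact rows; city -> province is one level of a schema hierarchy.
facts = [
    {"city": "Edmonton", "amount": 100},
    {"city": "Calgary", "amount": 300},
    {"city": "Vancouver", "amount": 200},
]
city_to_province = {"Edmonton": "Alberta", "Calgary": "Alberta", "Vancouver": "B.C."}
by_province = roll_up(facts, "city", city_to_province, "amount")
```

Rolling up from city to province merges the two Alberta cities into one cell; drilling down is the reverse step, returning to the city-level rows.
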