Hierarchy
• An arrangement or classification of things according to inclusiveness
• A natural means of abstraction, summarization, compression, and simplification for understanding
• Typical setting: organize a given set of objects into a hierarchy
  – No or very little supervision
  – Only heuristic guidance on the quality of the hierarchy
Hierarchical Clustering
• Group data objects into a tree of clusters
• Top-down versus bottom-up
[Figure: five objects a, b, c, d, e. Agglomerative clustering (AGNES) proceeds bottom-up (steps 0–4), merging a+b and d+e, then c with de, and finally all five into abcde; divisive clustering (DIANA) performs the same steps top-down (steps 0–4).]
AGNES (Agglomerative Nesting)
• Initially, each object is a cluster of its own
• Merge clusters step by step until all objects form a single cluster
• Single-link approach
  – Each cluster is represented by all of the objects in the cluster
  – The similarity between two clusters is the similarity of the closest pair of data points belonging to different clusters
Dendrogram
• Shows how clusters are merged hierarchically
• Decomposes the data objects into a multi-level nested partitioning (a tree of clusters)
• A clustering of the data objects: cut the dendrogram at the desired level
  – Each connected component forms a cluster
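The slides contain no code, but a minimal sketch of AGNES-style single-link clustering followed by a dendrogram cut, assuming SciPy is available and using a small made-up 2-D dataset, could look like this:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Small made-up 2-D dataset: two well-separated groups.
X = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
              [8.0, 8.0], [8.3, 7.9], [7.8, 8.2]])

# AGNES with the single-link criterion: the distance between two
# clusters is the distance between their closest pair of points.
Z = linkage(X, method='single')

# Cut the dendrogram so that exactly two clusters remain; each
# connected component below the cut becomes one cluster.
labels = fcluster(Z, t=2, criterion='maxclust')
print(labels)  # e.g. [1 1 1 2 2 2]
```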
DIANA (Divisive ANAlysis)
• Initially, all objects are in one cluster
• Split clusters step by step until each cluster contains only one object
[Figure: three scatter plots on a 0–10 grid showing one cluster being progressively split into smaller clusters.]
Distance Measures
• Minimum distance: $d_{\min}(C_i, C_j) = \min_{p \in C_i,\, q \in C_j} d(p, q)$
• Maximum distance: $d_{\max}(C_i, C_j) = \max_{p \in C_i,\, q \in C_j} d(p, q)$
• Mean distance: $d_{\mathrm{mean}}(C_i, C_j) = d(m_i, m_j)$
• Average distance: $d_{\mathrm{avg}}(C_i, C_j) = \frac{1}{n_i n_j} \sum_{p \in C_i} \sum_{q \in C_j} d(p, q)$
where $C_i$ is a cluster, $m_i$ its mean, and $n_i$ the number of objects in it
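As an illustration (my own addition, not from the slides), the four measures can be computed with NumPy; the two clusters below are made-up point sets and the point distance is Euclidean:

```python
import numpy as np

def cluster_distances(Ci, Cj):
    """Compute the four inter-cluster distance measures for two
    clusters given as (n_i, d) and (n_j, d) arrays."""
    # Pairwise Euclidean distances between all cross-cluster pairs.
    diffs = Ci[:, None, :] - Cj[None, :, :]
    D = np.sqrt((diffs ** 2).sum(axis=2))
    d_min = D.min()    # closest pair (single link)
    d_max = D.max()    # farthest pair (complete link)
    d_mean = np.linalg.norm(Ci.mean(axis=0) - Cj.mean(axis=0))  # distance of means
    d_avg = D.mean()   # average over all cross-cluster pairs
    return d_min, d_max, d_mean, d_avg

Ci = np.array([[0.0, 0.0], [1.0, 0.0]])
Cj = np.array([[4.0, 0.0], [5.0, 0.0]])
print(cluster_distances(Ci, Cj))  # (3.0, 5.0, 4.0, 4.0)
```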
Challenges
• Hard to choose merge/split points
  – Merging/splitting can never be undone
  – Merging/splitting decisions are critical
• High complexity: O(n²)
• Integrating hierarchical clustering with other techniques
  – BIRCH, CURE, CHAMELEON, ROCK
BIRCH
• Balanced Iterative Reducing and Clustering using Hierarchies
• CF (Clustering Feature) tree: a hierarchical data structure summarizing object information
  – Clustering objects → clustering leaf nodes of the CF tree
Clustering Feature Vector
• Clustering Feature: CF = (N, LS, SS)
  – N: number of data points
  – LS: linear sum of the N points, $\sum_{i=1}^{N} o_i$
  – SS: square sum of the N points, $\sum_{i=1}^{N} o_i^2$
• Example: the five points (3,4), (2,6), (4,5), (4,7), (3,8) give CF = (5, (16,30), (54,190))
[Figure: the five example points plotted on a 0–10 grid.]
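A short sketch (my own addition) computing the CF for the slide's five example points; the helper name clustering_feature is hypothetical, and SS is kept per dimension to match the example:

```python
import numpy as np

def clustering_feature(points):
    """BIRCH clustering feature CF = (N, LS, SS) for a set of points."""
    pts = np.asarray(points, dtype=float)
    N = len(pts)
    LS = pts.sum(axis=0)          # linear sum, one entry per dimension
    SS = (pts ** 2).sum(axis=0)   # square sum, one entry per dimension
    return N, LS, SS

pts = [(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]
print(clustering_feature(pts))   # (5, array([16., 30.]), array([ 54., 190.]))
```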
CF-tree in BIRCH
• Clustering feature:
  – Summarizes the statistics of a cluster
  – Many cluster quality measures (e.g., radius, distance) can be derived from it
  – Additivity: CF₁ + CF₂ = (N₁ + N₂, LS₁ + LS₂, SS₁ + SS₂)
• A CF tree: a height-balanced tree storing the clustering features for a hierarchical clustering
  – A nonleaf node in the tree has descendants or “children”
  – A nonleaf node stores the sums of the CFs of its children
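Building on the hypothetical helper above, this sketch (again my own, not from the slides) shows CF additivity and derives a cluster radius from (N, LS, SS) alone, using the standard identity $R^2 = (\sum SS - \|LS\|^2/N)/N$ for the root-mean-squared distance of points to the centroid:

```python
import numpy as np

def merge_cf(cf1, cf2):
    """CF additivity: merging two clusters just adds their CFs."""
    (n1, ls1, ss1), (n2, ls2, ss2) = cf1, cf2
    return n1 + n2, ls1 + ls2, ss1 + ss2

def radius(cf):
    """Cluster radius derived from the CF alone, with no access
    to the raw points: R^2 = (sum(SS) - ||LS||^2 / N) / N."""
    n, ls, ss = cf
    return np.sqrt((ss.sum() - ls @ ls / n) / n)

cf = (5, np.array([16.0, 30.0]), np.array([54.0, 190.0]))
print(radius(cf))  # 1.6 for the slide's five-point example
```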
CF Tree
[Figure: a CF tree with branching factor B = 7 and leaf capacity L = 6. The root holds entries CF₁ … CF₆, each with a child pointer; a nonleaf node holds entries CF₁ … CF₅ with child pointers; leaf nodes hold up to six CF entries and are chained together by prev/next pointers.]
Parameters of a CF-tree
• Branching factor: the maximum number of children per nonleaf node
• Threshold: the maximum diameter of the sub-clusters stored at the leaf nodes
BIRCH Clustering
• Phase 1: scan the database to build an initial in-memory CF tree (a multi-level compression of the data that tries to preserve its inherent clustering structure)
• Phase 2: use an arbitrary clustering algorithm to cluster the leaf nodes of the CF tree
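As a hedged usage sketch (not part of the slides), scikit-learn ships a Birch estimator that exposes the two CF-tree parameters just described; the blob dataset here is made up:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# Made-up data: three Gaussian blobs.
X = np.vstack([rng.normal(c, 0.3, size=(50, 2))
               for c in [(0, 0), (4, 4), (0, 4)]])

# threshold: the CF-tree threshold bounding the size of leaf sub-clusters
# branching_factor: the maximum number of children per CF-tree node
# n_clusters: the Phase-2 global clustering applied to the leaf CFs
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
print(np.bincount(labels))  # roughly [50 50 50]
```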
Pros & Cons of BIRCH
• Linear scalability
  – Good clustering with a single scan
  – Quality can be further improved by a few additional scans
• Can handle only numeric data
• Sensitive to the order of the data records
Drawbacks of Square-Error-Based Methods
• One representative per cluster
  – Good only for convex-shaped clusters of similar size and density
• k, the number of clusters, is a parameter
  – Good only if k can be reasonably estimated
CURE: the Ideas
• Each cluster has c representatives
  – Choose c well-scattered points in the cluster
  – Shrink them towards the mean of the cluster by a fraction α
  – The representatives capture the physical shape and geometry of the cluster
• Merge the two closest clusters
  – Distance between two clusters: the distance between their two closest representatives
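A minimal sketch (my own) of the representative-selection and shrinking step; it uses a farthest-point heuristic as a stand-in for "well scattered," and the helper name cure_representatives and parameters c and alpha are hypothetical:

```python
import numpy as np

def cure_representatives(points, c=4, alpha=0.3):
    """Pick c well-scattered points via a farthest-point heuristic,
    then shrink each towards the cluster mean by fraction alpha."""
    pts = np.asarray(points, dtype=float)
    mean = pts.mean(axis=0)
    # Start from the point farthest from the mean, then greedily add
    # the point farthest from the representatives chosen so far.
    reps = [pts[np.argmax(np.linalg.norm(pts - mean, axis=1))]]
    while len(reps) < min(c, len(pts)):
        d = np.min([np.linalg.norm(pts - r, axis=1) for r in reps], axis=0)
        reps.append(pts[np.argmax(d)])
    reps = np.array(reps)
    # Shrinking dampens the effect of outliers on the cluster boundary.
    return reps + alpha * (mean - reps)

pts = np.array([[0, 0], [0, 1], [1, 0], [1, 1], [0.5, 0.5]])
print(cure_representatives(pts, c=3, alpha=0.3))
```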
CURE: the Algorithm
• Draw a random sample S
• Partition the sample into p partitions
• Partially cluster each partition
• Eliminate outliers
  – Random sampling + removing clusters that grow too slowly
• Cluster the partial clusters until only k clusters are left
  – Shrink the representatives of clusters towards the cluster center
Data Partitioning and Clustering
[Figure: a sequence of x–y scatter plots showing the sample being partitioned, each partition partially clustered, and the partial clusters merged.]
Shrinking Representative Points
• Shrink the multiple representative points towards the gravity center by a fraction α
• Representatives capture the cluster's shape
[Figure: before/after x–y plots of a cluster's representative points being shrunk towards its center.]
Clustering Categorical Data: ROCK
• Robust Clustering using links
  – Links: the number of common neighbors between two points
  – Use links to measure similarity/proximity
  – Not distance-based
  – Complexity: $O(n^2 + n m_m m_a + n^2 \log n)$, where $m_m$ and $m_a$ are the maximum and average numbers of neighbors
• Basic ideas:
  – Similarity function and neighbors: $\mathrm{Sim}(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$
  – Example: let $T_1 = \{1,2,3\}$ and $T_2 = \{3,4,5\}$; then $\mathrm{Sim}(T_1, T_2) = \frac{|\{3\}|}{|\{1,2,3,4,5\}|} = \frac{1}{5} = 0.2$
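A small sketch (my own, not the full ROCK algorithm) of the similarity function and link counting; theta, the neighbor threshold, and the helper names are assumptions:

```python
def sim(t1, t2):
    """ROCK's Jaccard similarity between two transactions (sets)."""
    return len(t1 & t2) / len(t1 | t2)

def links(points, theta):
    """Number of common neighbors ("links") for each pair, where two
    points are neighbors if their similarity is at least theta."""
    n = len(points)
    neighbors = [{j for j in range(n)
                  if j != i and sim(points[i], points[j]) >= theta}
                 for i in range(n)]
    return {(i, j): len(neighbors[i] & neighbors[j])
            for i in range(n) for j in range(i + 1, n)}

T1, T2 = {1, 2, 3}, {3, 4, 5}
print(sim(T1, T2))  # 0.2, matching the slide's example
```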
Limitations
• Merging decisions are based on static modeling
  – Special characteristics of individual clusters are not considered
[Figure: two pairs of clusters, C1/C2 and C1′/C2′. CURE and BIRCH would merge C1 and C2, but C1′ and C2′ are the more appropriate pair to merge.]
Chameleon
• Hierarchical clustering using dynamic modeling
• Measures similarity based on a dynamic model
  – Two clusters are merged only if the interconnectivity and closeness (proximity) between them are high relative to the internal interconnectivity of the clusters and the closeness of items within them
• A two-phase algorithm
  – Phase 1: use a graph-partitioning algorithm to cluster objects into a large number of relatively small sub-clusters
  – Phase 2: find the genuine clusters by repeatedly combining sub-clusters
Overall Framework of CHAMELEON
Data Set → Construct Sparse Graph → Partition the Graph → Merge Partitions → Final Clusters
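As a hedged sketch of the first box only (not code from the slides), the sparse k-nearest-neighbor graph that CHAMELEON starts from can be built with scikit-learn's kneighbors_graph; the data and k are made up:

```python
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 2))  # made-up data

# Sparse k-NN graph: each object is connected only to its k nearest
# neighbors, so distant objects share no edge. This is the sparse
# graph that the graph-partitioning phase operates on.
G = kneighbors_graph(X, n_neighbors=5, mode='distance')
print(G.shape, G.nnz)  # (100, 100), 500 stored edges
```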
To-Do List
• Read Chapter 10.3
• (For thesis-based graduate students only) Read the paper “BIRCH: an efficient data clustering method for very large databases”