
SLIDE 1

Tian Zhang, Raghu Ramakrishnan, Miron Livny

HelenJr, “Birches”. Online Image. Flickr. 07 Nov 2009.

BIRCH: An Efficient Data Clustering Method For Very Large Databases

CPSC 504 Presenter: Kendric Wang Discussion Leader: Sophia (Xueyao) Liang

Sunday, November 8, 2009

SLIDE 2

CPSC 504 Data Management (Fall, 2009) Kendric Wang

Outline

  • What is Data Clustering?
  • Advantages of BIRCH Algorithm
  • Clustering Feature (CF) and CF Tree
  • BIRCH Clustering Algorithm
  • Applications of BIRCH
  • Conclusion

SLIDE 3

What is Data Clustering?

  • Given a large set of multi-dimensional data points
  • the data space is usually not uniformly occupied
  • Can group closely related points into a “cluster”
  • points are similar according to some distance-based measurement function
  • choose the desired number of clusters, K
  • discover distribution patterns in the dataset
  • help visualize data and guide data analysis

SLIDE 4

What is Data Clustering?

  • A popular data mining problem, studied in several fields:
  • Machine Learning -- probability-based approaches
  • Statistics -- distance-based approaches
  • Databases -- emphasis on limited memory
  • The problem as BIRCH defines it:
  • partition the dataset to minimize the “size” (radius/diameter) of each cluster
  • the dataset may be larger than available memory
  • minimize I/O costs

SLIDE 5

Advantages of BIRCH vs. Other Clustering Algorithms

  • BIRCH is “local”
  • clusters a point without having to check it against all other data points or clusters
  • Can remove outliers (“noise”)
  • Produces good clusters with a single scan of the dataset
  • Linearly scalable
  • minimizes running time
  • adjusts the quality of the result to the available memory

SLIDE 6

Clustering Feature (CF)

  • Compact - no need to store the individual points belonging to a cluster
  • Three parts: CFi = (Ni, LSi, SSi), i = 1, 2, ..., M (M = no. of clusters)
  • N → number of data points in the cluster
  • LS → linear sum of the N data points
  • SS → square sum of the N data points
  • Sufficient to compute distances between two clusters
  • When merging two clusters, simply add the CFs
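
The additivity of clustering features can be sketched in a few lines of Python (NumPy assumed; the helper names `cf` and `merge` are illustrative, not from the paper):

```python
import numpy as np

def cf(points):
    """Clustering feature (N, LS, SS) of a set of points."""
    pts = np.asarray(points, dtype=float)
    return len(pts), pts.sum(axis=0), float((pts ** 2).sum())

def merge(cf1, cf2):
    """Merging two clusters is component-wise addition of their CFs."""
    n1, ls1, ss1 = cf1
    n2, ls2, ss2 = cf2
    return n1 + n2, ls1 + ls2, ss1 + ss2

a = cf([[1.0, 2.0], [3.0, 4.0]])   # (2, [4, 6], 30)
b = cf([[5.0, 6.0]])               # (1, [5, 6], 61)
merged = merge(a, b)               # (3, [9, 12], 91)
```

Note that `merged` is identical to computing the CF of all three points at once, which is why clusters can be combined without revisiting their points.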

SLIDE 7

CF Tree

[Diagram: a height-balanced CF tree with a root, non-leaf nodes, and chained leaf nodes]

  • Root and non-leaf nodes: up to B entries of the form [CFi, childi], where CFi summarizes the subtree of childi
  • Leaf nodes: up to L CF entries, chained together by prev/next pointers
  • Branching factor B = max no. of CF entries in a non-leaf node
  • L = max no. of CF entries in a leaf node
  • Threshold requirement: T = max radius/diameter of each CF (in a leaf)

SLIDE 8

CF Tree

  • Tree size is a function of T
  • larger T → more points per cluster → smaller tree
  • a good choice reduces the number of rebuilds
  • if T is too low, it can be increased dynamically
  • if T is too high, the CF tree is less detailed
  • a heuristic is used to estimate the next threshold value
  • The CF tree is built dynamically as the data is scanned and inserted
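
The threshold test needs a cluster's radius, which is computable from the CF triple alone — a sketch (the formula follows from expanding the average squared distance to the centroid):

```python
import numpy as np

def radius(n, ls, ss):
    """Average distance from a cluster's points to its centroid,
    computed purely from the CF triple (N, LS, SS)."""
    centroid = ls / n
    # R^2 = SS/N - ||centroid||^2
    return float(np.sqrt(max(ss / n - float(centroid @ centroid), 0.0)))

# two points at distance 2: centroid midway, radius 1
r = radius(2, np.array([2.0, 0.0]), 4.0)   # 1.0
```

Because R depends only on (N, LS, SS), the tree can check the threshold without ever touching the raw points.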

SLIDE 9

CF Tree Insertion

  • Identify the appropriate leaf:
  • start with the CF entries at the root node and find the closest cluster (using the CF values)
  • descend to that entry’s child and again find the closest entry
  • and so on, until a leaf node is reached
  • Modify the leaf:
  • find the closest leaf entry and test whether it can absorb the new point without violating the threshold condition
  • if it cannot, add a new entry to the leaf
  • leaves have a maximum size (L), so a split may be needed
  • Modify the path:
  • once the point has been added, update the CFs of all ancestors
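
The leaf-modification step above can be sketched as a tentative merge that is kept only if the threshold still holds (illustrative helper; CFs represented as (N, LS, SS) tuples):

```python
import numpy as np

def try_absorb(entry, point, T):
    """Tentatively merge a point into a leaf CF entry; keep the merge
    only if the cluster's radius stays within the threshold T."""
    n, ls, ss = entry
    p = np.asarray(point, dtype=float)
    n2, ls2, ss2 = n + 1, ls + p, ss + float(p @ p)
    centroid = ls2 / n2
    r = np.sqrt(max(ss2 / n2 - float(centroid @ centroid), 0.0))
    return (n2, ls2, ss2) if r <= T else None

entry = (1, np.array([0.0, 0.0]), 0.0)            # leaf entry holding one point
absorbed = try_absorb(entry, [1.0, 0.0], T=1.0)   # radius 0.5 <= T: absorbed
rejected = try_absorb(entry, [1.0, 0.0], T=0.3)   # radius 0.5 > T: new entry needed
```

When `try_absorb` returns `None`, the caller adds a fresh entry to the leaf, splitting the leaf if it already holds L entries.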

SLIDE 10

BIRCH Clustering Algorithm

  • Phase 1: load the data into memory by building a CF tree
  • Phase 2 (optional): condense into a desirable range by building a smaller CF tree
  • Phase 3: global clustering
  • Phase 4 (optional): cluster refining
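
For experimentation, scikit-learn ships a BIRCH implementation whose parameters map onto the phases above (`threshold` plays the role of T, `branching_factor` of B, and `n_clusters` drives the global clustering step); a minimal usage sketch on synthetic data:

```python
import numpy as np
from sklearn.cluster import Birch

rng = np.random.default_rng(0)
# three well-separated Gaussian blobs of 100 points each
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2))
               for c in ([0, 0], [5, 5], [0, 5])])

# threshold ~ T, branching_factor ~ B; n_clusters selects the
# number of final clusters produced by the global step (Phase 3)
model = Birch(threshold=0.5, branching_factor=50, n_clusters=3)
labels = model.fit_predict(X)
```

Setting `n_clusters=None` instead returns the raw leaf subclusters from Phase 1 without any global step.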

SLIDE 11

BIRCH - Phase 1

  • Start with an initial threshold T and insert points into the tree
  • If memory runs out, increase T and rebuild
  • re-insert leaf entries from the old tree into the new tree
  • remove outliers
  • Methods for initializing and adjusting T are ad hoc
  • After Phase 1:
  • the data is “reduced” to fit in memory
  • subsequent processing occurs entirely in memory (no I/O)

SLIDE 12

BIRCH - Phase 2

  • Optional
  • The number of subclusters produced in Phase 1 may not be suitable for the algorithms used in Phase 3
  • Shrink the tree as necessary
  • remove more outliers
  • merge crowded subclusters

SLIDE 13

BIRCH - Phase 3

  • Problems remaining after Phase 1:
  • input order affects the results
  • node splitting is triggered by node size, not by the data distribution
  • Use the leaf entries of the CF tree as input to a standard (“global”) clustering algorithm
  • e.g., k-means or hierarchical clustering (HC)
  • Phase 1 has reduced the size of the input dataset enough that the standard algorithm can run entirely in memory
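
The global step can be sketched as k-means run over the leaf subcluster centroids, weighting each centroid by its point count N rather than revisiting the raw data (illustrative, pure NumPy):

```python
import numpy as np

def weighted_kmeans(centroids, weights, k, iters=20, seed=0):
    """k-means over subcluster centroids, each weighted by its point
    count N, instead of over the raw data points."""
    rng = np.random.default_rng(seed)
    centers = centroids[rng.choice(len(centroids), size=k, replace=False)]
    for _ in range(iters):
        # assign each subcluster centroid to its nearest center
        d = np.linalg.norm(centroids[:, None] - centers[None], axis=2)
        assign = d.argmin(axis=1)
        # recompute each center as the weighted mean of its members
        for j in range(k):
            members = assign == j
            if members.any():
                centers[j] = np.average(centroids[members], axis=0,
                                        weights=weights[members])
    return assign, centers

subcluster_centroids = np.array([[0.0, 0.0], [0.1, 0.0],
                                 [5.0, 5.0], [5.1, 5.0]])
subcluster_counts = np.array([10.0, 5.0, 8.0, 7.0])
assign, centers = weighted_kmeans(subcluster_centroids, subcluster_counts, k=2)
```

Because only the (few) leaf centroids are clustered, this step runs entirely in memory regardless of the raw dataset's size.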

SLIDE 14

BIRCH - Phase 4

  • Optional
  • Scan the data again and assign each data point to a cluster
  • choose the cluster whose centroid is closest
  • This redistributes data points amongst clusters more accurately than the original CF clustering
  • Can be repeated for further refinement of the clusters
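
The reassignment pass is just a nearest-centroid scan over the raw data (sketch; `refine_labels` is an illustrative name):

```python
import numpy as np

def refine_labels(X, centers):
    """Assign every raw data point to the cluster whose centroid is
    closest -- one extra pass over the data."""
    d = np.linalg.norm(X[:, None] - centers[None], axis=2)
    return d.argmin(axis=1)

X = np.array([[0.0, 0.0], [4.0, 4.0], [0.5, 0.0]])
centers = np.array([[0.0, 0.0], [5.0, 5.0]])
labels = refine_labels(X, centers)
```

Repeating this pass (and recomputing centroids in between) is exactly one iteration of k-means seeded with the Phase 3 centroids.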

SLIDE 15

Applications of Data Clustering

  • Helps identify natural groupings that exist within a dataset
  • Image processing
  • separate regions of an image with similar properties

SLIDE 16

Applications of Data Clustering

  • Bioinformatics
  • identifying genes that are regulated by common mechanisms
  • Market analysis
  • distinguishing groups of consumers with similar tastes

SLIDE 17

Conclusion

  • BIRCH performs better than other existing algorithms on large datasets
  • reduces I/O
  • accounts for memory constraints
  • Produces good clustering from only one scan of the entire dataset: O(n)
  • Handles outliers
