Scalable Clustering of Categorical Data and Applications University of Trento Periklis Andritsos periklis@dit.unitn.it
Problem Definition
o Clustering is a procedure that groups members of a population into similar categories, or clusters
o Why is clustering important?
    Get insight into the way data is distributed
    Preprocess an initial data set
January 9, 2006 Periklis Andritsos
Exercise from the real world
o In September 2004, The New York Times reported the launch of Clusty (http://www.clusty.com/)
    Given a query, the meta-search engine places relevant web documents into groups
    Clusty uses clustering technology on ten different types of web content, including material from the web, image, news and shopping databases
o Example:
… of Digital Cameras
o IDC (http://www.idc.com/) employs clustering for customer segmentation purposes
o Example: Digital camera companies investigated ways to increase sales during the Christmas holidays
o IDC surveyed 1,000 U.S. consumers at 50 malls
    “Likely buyers will be more motivated to buy a digital camera knowing that digital images can be displayed on a TV, printed using a PC-less photo quality printer, or printed at traditional film developer outlets”
    “Likely buyers capture more images per month on their film cameras than unlikely buyers”
o Companies were able to target the proper market segment
Source: http://www.imaging-resource.com/NEWS/1037573998.html
Understanding software systems
[Figure: graph of a software system's modules, including main, Event, boxParserClass, boxScannerClass, boxClass, error, stackOfAny, stackOfAnyWithTop, NodeOfAny, Lexer, Globals, hashGlobals, hashedBoxes, Fonts, MathLib, EdgeClass, ColorTable and GenerateProlog] [MMBRCG '98]
What if we have ….
Some information cannot be depicted
[Andritsos, Miller: IEEE Int'l Workshop on Program Comprehension, 2001]
Integrated Information
[Figure: heterogeneous sources (an O-O information system, relational databases, XML documents) feeding an Integrated Information Repository]
o We deal with data that:
    are stored in heterogeneous sources
    exist under different formats
    are available online (with schemas)
    Schema: a type specification of a collection of data
o We often need to integrate data, which introduces errors
Cluster Analysis Stages
[Figure: pipeline of stages: Data Collection, Initial Screening, Representation, Clustering Strategy, Validation, Interpretation; part of the pipeline is marked "Focus of my work"]
o The intention was not to build yet another clustering algorithm, but one that adheres to real-world constraints
Requirements
o Perform good-quality clustering on different data types
    The majority of existing commercial algorithms cluster objects expressed over numerical values
o Scalability
    The optimal solution to clustering is hard to find, and existing heuristic techniques do not necessarily perform well with large inputs
o Parameter setting
    Many algorithms expect the user to supply a set of (sometimes) unintuitive parameters
o Inclusion of descriptive information in software clustering
    Software clustering techniques use structural information exclusively
Calculating Distance
o Numerical data: L_p metrics are defined (Euclidean, Manhattan)

    employee   salary    age
    John       $5,000    25
    Mary       $6,000    26
    Peter      $2,500    30
    Jenny      $60,000   32

o Categorical data: no single ordering of values

    movie           director    actor        genre
    Godfather II    Scorsese    De Niro      Crime
    Good Fellas     Coppola     De Niro      Crime
    Vertigo         Hitchcock   J. Stewart   Thriller
    N by NW         Hitchcock   C. Grant     Thriller
    Bishop's Wife   Koster      C. Grant     Comedy
    Harvey          Koster      J. Stewart   Comedy
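As a minimal sketch of the numerical case, the L_p metrics can be applied directly to the (salary, age) rows of the employee table. The function name `lp_distance` and the dictionary layout are illustrative assumptions, not something given on the slide. No comparable computation is possible on the movie rows without first imposing an ordering on directors, actors or genres, which is the point of the slide.

```python
def lp_distance(x, y, p):
    """L_p (Minkowski) distance between two equal-length numeric vectors."""
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)

# (salary, age) rows taken from the employee table on the slide
employees = {
    "John": (5000, 25),
    "Mary": (6000, 26),
    "Peter": (2500, 30),
    "Jenny": (60000, 32),
}

# p = 1 gives the Manhattan distance, p = 2 the Euclidean distance
manhattan = lp_distance(employees["John"], employees["Mary"], p=1)
euclidean = lp_distance(employees["John"], employees["Mary"], p=2)
print(manhattan, euclidean)
```

Note that salary dominates both distances because the attributes are not rescaled; any normalization step would be an additional design choice.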
Agglomerative Clustering
[Figure: agglomerative, or hierarchical, clustering in Euclidean space on 6 points (A to F), with the resulting dendrogram]
Need to compute distance between objects as well as between objects and sub-clusters
Contributions
o Developed LIMBO, an algorithm that
    is hierarchical
    clusters categorical data using a small number of parameters
    is scalable as the size of the input increases
    International Conference on Extending Database Technology (EDBT'04)
o Studied software systems using both structural and non-structural information
    The algorithm incorporates information such as the Developer, Lines Of Code or Directory structure
    International Working Conference on Reverse Engineering (WCRE'03)
    IEEE Transactions on Software Engineering (TSE'05)
o Proposed a set of Information-Theoretic tools to discover structure in large data sets
    ACM International Conference on the Management of Data (SIGMOD'04)
    International Workshop on Information Integration on the Web (WEB'02)
    IEEE Data Engineering Bulletin 2002, 2003
Roadmap
o Introduction
o Contributions
o Motivating example
o LIMBO Algorithm
o Studying Software Systems
o Identifying Structure
o Conclusions & Future Work
Clustering Categorical Data
o Cluster rows (objects) in order to preserve as much information as possible about the attribute values

    movie           director    actor        genre
    Godfather II    Scorsese    De Niro      Crime
    Good Fellas     Coppola     De Niro      Crime
    Vertigo         Hitchcock   J. Stewart   Thriller
    N by NW         Hitchcock   C. Grant     Thriller
    Bishop's Wife   Koster      C. Grant     Comedy
    Harvey          Koster      J. Stewart   Comedy

    Godfather II, Good Fellas: preserves information for actor, genre; two choices for director
    Vertigo, N by NW, Bishop's Wife, Harvey: two choices for director, actor, and genre
Clustering Categorical Data
o Cluster rows (objects) in order to preserve as much information as possible about the attribute values

    movie           director    actor        genre
    Godfather II    Scorsese    De Niro      Crime
    Good Fellas     Coppola     De Niro      Crime
    Vertigo         Hitchcock   J. Stewart   Thriller
    N by NW         Hitchcock   C. Grant     Thriller
    Bishop's Wife   Koster      C. Grant     Comedy
    Harvey          Koster      J. Stewart   Comedy

    Godfather II, Good Fellas, Vertigo, N by NW: three choices for director, actor, and two for genre
    Bishop's Wife, Harvey: preserves information for director, genre; two choices for actor
Roadmap
o Introduction
o Contributions
o Motivating example
o LIMBO Algorithm
o Studying Software Systems
o Identifying Structure
o Conclusions & Future Work
Information Theory Basics
o Entropy: $H(X) = -\sum_{x} p(x) \log p(x)$
    Measures the uncertainty in a random variable
o Conditional Entropy: $H(X|Y)$
    Measures the uncertainty of one variable knowing the values of another
o Mutual Information: $I(X;Y) = H(X) - H(X|Y)$
    Measures the dependence of two random variables
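The three quantities above can be sketched in a few lines of Python for a discrete joint distribution. The representation of the joint distribution as a dictionary `{(x, y): probability}`, the helper names, and the choice of base-2 logarithms (bits) are assumptions made for illustration.

```python
from collections import defaultdict
from math import log2

def marginal(joint, axis):
    """Marginal distribution of one coordinate of a joint {(x, y): prob}."""
    m = defaultdict(float)
    for pair, p in joint.items():
        m[pair[axis]] += p
    return dict(m)

def entropy(dist):
    """H(X) = -sum_x p(x) log p(x), in bits."""
    return -sum(p * log2(p) for p in dist.values() if p > 0)

def conditional_entropy(joint):
    """H(X|Y) = -sum_{x,y} p(x,y) log p(x|y), in bits."""
    p_y = marginal(joint, 1)
    return -sum(p * log2(p / p_y[y]) for (_, y), p in joint.items() if p > 0)

def mutual_information(joint):
    """I(X;Y) = H(X) - H(X|Y)."""
    return entropy(marginal(joint, 0)) - conditional_entropy(joint)

# Two fair coins: independent in the first case, identical in the second
independent = {(0, 0): 0.25, (0, 1): 0.25, (1, 0): 0.25, (1, 1): 0.25}
identical = {(0, 0): 0.5, (1, 1): 0.5}
print(mutual_information(independent), mutual_information(identical))
```

Independent variables share no information, while two identical fair coins share one full bit, which matches the reading of mutual information as a measure of dependence.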
Information Theoretic Clustering
o T: a random variable that ranges over the rows
o V: a random variable that ranges over the attribute values
o I(T;V): mutual information of T and V
o Information Bottleneck Method [TPB '99]
    Compress T into a clustering C_k so that the information preserved about V is maximum (k = number of clusters)
o Optimization criterion: minimize $I(T;V) - I(C_k;V)$, i.e., minimize the Information Loss
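A small worked sketch of the criterion on the six movie rows follows. It assumes that every row is equally likely, p(t) = 1/n, and that p(V|t) is uniform over the row's attribute values; the slides do not state these definitions at this point, so treat them as assumptions. The two groupings compared at the end are illustrative choices, and all function names are hypothetical.

```python
from collections import defaultdict
from math import log2

ATTRS = ("director", "actor", "genre")
ROWS = [
    ("Scorsese", "De Niro", "Crime"),        # 0: Godfather II
    ("Coppola", "De Niro", "Crime"),         # 1: Good Fellas
    ("Hitchcock", "J. Stewart", "Thriller"), # 2: Vertigo
    ("Hitchcock", "C. Grant", "Thriller"),   # 3: N by NW
    ("Koster", "C. Grant", "Comedy"),        # 4: Bishop's Wife
    ("Koster", "J. Stewart", "Comedy"),      # 5: Harvey
]

def row_distribution(row):
    """p(V|t): assumed uniform over the row's (attribute, value) pairs."""
    return {(a, v): 1.0 / len(row) for a, v in zip(ATTRS, row)}

def cluster_distribution(members):
    """p(V|c): average of the member rows' distributions (equiprobable rows)."""
    dist = defaultdict(float)
    for t in members:
        for v, p in row_distribution(ROWS[t]).items():
            dist[v] += p / len(members)
    return dict(dist)

def information(clustering):
    """I(C;V) in bits; a clustering is a list of lists of row indices."""
    n = len(ROWS)
    p_v = cluster_distribution(range(n))
    total = 0.0
    for members in clustering:
        p_c = len(members) / n
        for v, p in cluster_distribution(members).items():
            total += p_c * p * log2(p / p_v[v])
    return total

def information_loss(clustering):
    """I(T;V) - I(C;V): what is lost by replacing rows with clusters."""
    singletons = [[t] for t in range(len(ROWS))]
    return information(singletons) - information(clustering)

loss_a = information_loss([[0, 1], [2, 3, 4, 5]])
loss_b = information_loss([[0, 1, 2, 3], [4, 5]])
print(loss_a, loss_b)
```

Under these assumptions the first grouping loses less information than the second, so the criterion would prefer it.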
Computing Information Loss
o Representation: every cluster, c_i, is represented by
    its probability, $p(c_i) = n(c_i)/n$
    the conditional probability of the values in V given the cluster, $p(V|c_i)$
o This information is sufficient to compute the Information Loss
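To illustrate why p(c_i) and p(V|c_i) are sufficient, the sketch below computes the loss of merging two clusters from exactly those two quantities. It uses the standard identity from the agglomerative Information Bottleneck literature, loss = (p(c_1) + p(c_2)) times the weighted Jensen-Shannon divergence of p(V|c_1) and p(V|c_2); that formula is not printed on the slide and is brought in here as an assumption, as are the function names.

```python
from math import log2

def kl(p, q):
    """Kullback-Leibler divergence D(p || q) in bits; q must cover p's support."""
    return sum(pv * log2(pv / q[v]) for v, pv in p.items() if pv > 0)

def merge_loss(p_c1, dist1, p_c2, dist2):
    """Information loss incurred by merging two clusters.

    p_c1, p_c2 : cluster probabilities p(c_i) = n(c_i) / n
    dist1, dist2 : conditional distributions p(V|c_i) as {value: prob}
    """
    p_merged = p_c1 + p_c2
    w1, w2 = p_c1 / p_merged, p_c2 / p_merged
    # p(V|c*) of the merged cluster is the weighted mixture of the two
    mix = {v: w1 * dist1.get(v, 0.0) + w2 * dist2.get(v, 0.0)
           for v in set(dist1) | set(dist2)}
    js = w1 * kl(dist1, mix) + w2 * kl(dist2, mix)
    return p_merged * js

# Identical clusters merge for free; disjoint ones are the most costly
same = merge_loss(0.25, {"a": 1.0}, 0.25, {"a": 1.0})
disjoint = merge_loss(0.25, {"a": 1.0}, 0.25, {"b": 1.0})
print(same, disjoint)
```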
Agglomerative IB [ST99]
o Computes an (n x n) Distance Matrix using Information Loss as distance
o Merges the sub-clusters with the minimum Information Loss

    movie           director    actor        genre
    Godfather II    Scorsese    De Niro      Crime
    Good Fellas     Coppola     De Niro      Crime
    Vertigo         Hitchcock   J. Stewart   Thriller
    N by NW         Hitchcock   C. Grant     Thriller
    Bishop's Wife   Koster      C. Grant     Comedy
    Harvey          Koster      J. Stewart   Comedy
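The two steps above can be sketched as a naive loop over the movie rows: start from singleton clusters and repeatedly merge the pair with the smallest information loss. This is an illustrative sketch, not the implementation of [ST99]; it recomputes all pairwise losses at every step instead of maintaining a distance matrix, and it assumes equiprobable rows with p(V|t) uniform over each row's attribute values. All names are hypothetical.

```python
from itertools import combinations
from math import log2

ATTRS = ("director", "actor", "genre")
ROWS = [
    ("Scorsese", "De Niro", "Crime"),        # 0: Godfather II
    ("Coppola", "De Niro", "Crime"),         # 1: Good Fellas
    ("Hitchcock", "J. Stewart", "Thriller"), # 2: Vertigo
    ("Hitchcock", "C. Grant", "Thriller"),   # 3: N by NW
    ("Koster", "C. Grant", "Comedy"),        # 4: Bishop's Wife
    ("Koster", "J. Stewart", "Comedy"),      # 5: Harvey
]

def merge(c1, c2):
    """Merge two clusters, each a tuple (p(c), p(V|c), member row indices)."""
    p = c1[0] + c2[0]
    w1, w2 = c1[0] / p, c2[0] / p
    dist = {v: w1 * c1[1].get(v, 0.0) + w2 * c2[1].get(v, 0.0)
            for v in set(c1[1]) | set(c2[1])}
    return p, dist, c1[2] + c2[2]

def kl(p, q):
    return sum(pv * log2(pv / q[v]) for v, pv in p.items() if pv > 0)

def merge_loss(c1, c2):
    """Information loss of a merge: p(c*) times the weighted JS divergence."""
    p, mix, _ = merge(c1, c2)
    return c1[0] * kl(c1[1], mix) + c2[0] * kl(c2[1], mix)

def aib(rows, k):
    """Greedy agglomerative clustering down to k clusters."""
    n = len(rows)
    clusters = [(1.0 / n, {(a, v): 1.0 / len(r) for a, v in zip(ATTRS, r)}, [i])
                for i, r in enumerate(rows)]
    while len(clusters) > k:
        # Examine every pair (quadratic in the number of clusters)
        i, j = min(combinations(range(len(clusters)), 2),
                   key=lambda ij: merge_loss(clusters[ij[0]], clusters[ij[1]]))
        merged = merge(clusters[i], clusters[j])
        clusters = [c for idx, c in enumerate(clusters) if idx not in (i, j)]
        clusters.append(merged)
    return sorted(sorted(c[2]) for c in clusters)

print(aib(ROWS, 3))
```

The pairwise scan inside the loop is the quadratic cost that makes this approach hard to scale to large inputs.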
scaLable InforMation BOttleneck
o The agglomerative approach (AIB) has quadratic complexity, since we need to compute an (n x n) distance matrix
o The LIMBO algorithm
    produces a summary of the data
    applies agglomerative clustering on the summary
o Summary = Distributional Cluster Features: $DCF(c) = (n(c), p(V|c))$
o DCFs can be computed incrementally:
    $DCF(c^*) = \left( n(c_1) + n(c_2),\ \frac{n(c_1)}{n(c_1)+n(c_2)}\, p(V|c_1) + \frac{n(c_2)}{n(c_1)+n(c_2)}\, p(V|c_2) \right)$
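The incremental DCF update can be sketched as a single merge function. A DCF is represented here as a tuple `(n, p)` with `p` a dictionary from attribute values to probabilities; that representation and the name `merge_dcf` are assumptions for illustration. How LIMBO organizes DCFs into a summary is not shown on this slide and is not modeled here.

```python
def merge_dcf(dcf1, dcf2):
    """Combine two Distributional Cluster Features.

    Each DCF is (n(c), p(V|c)). The merged count is the sum of the counts,
    and the merged distribution is the count-weighted average.
    """
    n1, p1 = dcf1
    n2, p2 = dcf2
    n = n1 + n2
    p = {v: (n1 * p1.get(v, 0.0) + n2 * p2.get(v, 0.0)) / n
         for v in set(p1) | set(p2)}
    return n, p

# A cluster of two rows that all carry value "a", merged with one row carrying "b"
n, p = merge_dcf((2, {"a": 1.0}), (1, {"b": 1.0}))
print(n, p)
```

Because the merged DCF depends only on the two input DCFs, a summary can absorb new rows without revisiting the rows it has already seen.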