symbolic data analysis



  1. Symbolic data analysis: Clustering of large data sets of mixed units. Vladimir Batagelj, IMFM Ljubljana and IAM UP Koper. CRoNoS meeting, 16-17 April 2016, Colegio de España, Paris.

  2. Outline: (1) Symbolic data analysis; (2) Clustering and optimization; (3) Leaders method; (4) Agglomerative method; (5) Examples; (6) References.

  3. Symbolic data. In a data table, rows correspond to units and columns to variables; the cell for unit $i$ and variable $j$ holds $\mathrm{value}_{i,j}$.

     In classical data analysis, $\mathrm{value}_{i,j}$ is a single element (a number or a label) measured on a standard measurement scale (absolute, ratio, interval, ordinal, nominal).

     In a symbolic data table, $\mathrm{value}_{i,j}$ can also be a complex object such as an interval, a set of values, a distribution (in the general sense), a time series, a tree, a text, a function, etc. Rules linking variables and taxonomies of values can also be specified.

  4. Symbolic data analysis. Symbolic data analysis aims to extend existing data analysis methods to symbolic data and to develop new ones. It was introduced in 1987 by Edwin Diday [Diday, E. (1987)].

     Three European projects:
     • SODAS - Symbolic Official Data Analysis System (1996-99),
     • ISO-3D - Interpretation of Symbolic Objects with 3D Representation (1998-01),
     • ASSO - Analysis System of Symbolic Official Data (2001-03)
     resulted in a program for symbolic data analysis, SODAS 2.

     The results were published in many papers in conference proceedings and scientific journals, and in three books [Bock, H-H., Diday, E. (2000), Billard, L., Diday, E. (2006), Diday, E., Noirhomme, M. (2008)]. Two additional books are to appear soon.

  5. Symbolic data analysis. The SDA group regularly meets at workshops: Wienerwaldhof (2009), Namur (2011), Beijing (2011), Madrid (2012), Taipei (2014), Orléans (2015).

     Three packages for SDA are available in R:
     • RSDA,
     • symbolicDA,
     • Clamix.

     There is also a Symbolic data analysis group at LinkedIn.

  6. Symbolic data analysis and big data.

     $$\text{big data} \xrightarrow{\ \text{aggregation}\ } \text{symbolic data table}$$

     Aggregating data into symbolic data preserves much more information than the standard approach using mean values.

     Let $\Sigma(S, V)$ denote a summary, that is, a symbolic value of the variable $V$ over the subset of units $S$. A good summary satisfies the condition: for $S_1 \cap S_2 = \emptyset$ it holds that

     $$\Sigma(S_1 \cup S_2, V) = f(\Sigma(S_1, V), \Sigma(S_2, V))$$

     With my collaborators (Simona Korenjak-Černe and Nataša Kejžar) we are developing clustering algorithms for symbolic objects described by modal valued symbolic data [Korenjak-Černe, S., Batagelj, V. (1998), Korenjak-Černe, S., Batagelj, V. (2002), Batagelj, V. et al. (2014)].
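
The summary condition can be illustrated with a minimal sketch, assuming the symbolic value is a frequency distribution of one categorical variable stored as a `collections.Counter`. The names `summarize` and `merge` and the sample units are hypothetical; `merge` plays the role of $f$.

```python
from collections import Counter

def summarize(units, variable):
    """Sigma(S, V): frequency distribution of `variable` over the units in S."""
    return Counter(unit[variable] for unit in units)

def merge(summary_1, summary_2):
    """f: combine the summaries of two disjoint subsets of units."""
    return summary_1 + summary_2

# Hypothetical units, each described by one categorical variable.
s1 = [{"color": "red"}, {"color": "blue"}]
s2 = [{"color": "red"}, {"color": "green"}, {"color": "red"}]

direct = summarize(s1 + s2, "color")                             # Sigma(S1 ∪ S2, V)
merged = merge(summarize(s1, "color"), summarize(s2, "color"))   # f(Sigma(S1, V), Sigma(S2, V))
print(direct == merged)
```

Because the two subsets can be summarized separately and combined afterwards, such a summary can be computed in chunks over a large data set.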

  7. Clustering symbolic data. For clustering of symbolic objects (SOs) we adapted two classical clustering methods:
     • the leaders method (a generalization of the k-means method [Hartigan, J. A. (1975)] and of dynamic clouds [Diday, E. (1979)]),
     • Ward's hierarchical clustering method [Ward, J. H. (1963)].

     Both adapted methods are based on the same criterion function, so they solve the same clustering problem.

     With the leaders method, the set of units is reduced to a manageable number of leaders. The obtained leaders can be further clustered with the compatible agglomerative hierarchical clustering method to reveal relations among them and, using the dendrogram, to decide on an appropriate number of clusters.

  8. Symbolic objects described with distributions. An SO $X$ is described by a list $X = [x_i]$ of descriptions of variables $V_i$. The values NA (not available) are treated as an additional category for each variable. In our model, each variable is described with a frequency distribution (bar chart) of its values

     $$f_{xi} = [f_{xi1}, f_{xi2}, \ldots, f_{xik_i}].$$

     With

     $$x_i = [p_{xi1}, p_{xi2}, \ldots, p_{xik_i}]$$

     we denote the corresponding probability distribution,

     $$\sum_{j=1}^{k_i} p_{xij} = 1, \quad i = 1, \ldots, m.$$
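
A minimal sketch of the step from a frequency distribution to the corresponding probability distribution, with NA counted as an extra category. The helper name `to_probabilities` and the sample counts are hypothetical.

```python
def to_probabilities(frequencies):
    """Turn frequencies [f_1, ..., f_k] into probabilities [p_1, ..., p_k]."""
    total = sum(frequencies)
    if total == 0:
        raise ValueError("the frequency distribution is empty")
    return [f / total for f in frequencies]

# Hypothetical variable with categories [A, B, C, NA]; NA is an ordinary category.
f_x = [6, 3, 0, 1]
p_x = to_probabilities(f_x)
print(p_x, sum(p_x))
```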

  9. Clustering and optimization. We approach the clustering problem as an optimization problem over the set of feasible clusterings $\Phi_k$, the partitions of units into $k$ clusters. The criterion function has the following form

     $$P(\mathbf{C}) = \sum_{C \in \mathbf{C}} p(C). \quad (1)$$

     The total error $P(\mathbf{C})$ of the clustering $\mathbf{C}$ is a sum of cluster errors $p(C)$.

  10. The cluster error. There are many ways to express the cluster error $p(C)$. In this paper we shall assume a model in which the error of a cluster is the sum of the dissimilarities of its units from the cluster's representative $T$

     $$p(C, T) = \sum_{X \in C} d(X, T). \quad (2)$$

     Note that, in general, the representative need not be from the same "space" (set) as the units.

  11. Representatives. The best representative is called a leader

     $$T_C = \operatorname*{argmin}_{T} p(C, T). \quad (3)$$

     Then we define

     $$p(C) = p(C, T_C) = \min_{T} \sum_{X \in C} d(X, T). \quad (4)$$

     The SO $X$ is described by a list $X = [x_i]$. Assume that representatives are also described in the same way: $T = [t_i]$, $t_i = [t_{i1}, t_{i2}, \ldots, t_{ik_i}]$.

  12. Dissimilarity between SOs. We introduce a dissimilarity measure between SOs with

     $$d(X, T) = \sum_{i} \alpha_i \, d(x_i, t_i), \quad \alpha_i \ge 0, \quad \sum_{i} \alpha_i = 1, \quad (5)$$

     where

     $$d(x_i, t_i) = \sum_{j=1}^{k_i} w_{xij} \, \delta(p_{xij}, t_{ij}), \quad w_{xij} \ge 0. \quad (6)$$

     This is a kind of generalization of the squared Euclidean distance. For the same unit $X$, the weights $w_{xij}$ can differ from variable to variable (this is needed in descriptions of ego-centric networks, population pyramids, etc.).
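
Equations (5) and (6) can be sketched in a few lines. This is only an illustration: the basic dissimilarity $\delta(p, t) = (p - t)^2$ is an assumption (the slides up to this point do not fix $\delta$), and the function names and sample values are hypothetical.

```python
def delta(p, t):
    # Assumed basic dissimilarity: squared difference.
    return (p - t) ** 2

def d_variable(x_i, t_i, w_i):
    """Equation (6): weighted sum of delta over the categories of one variable."""
    return sum(w * delta(p, t) for p, t, w in zip(x_i, t_i, w_i))

def d_object(X, T, W, alpha):
    """Equation (5): convex combination of the per-variable dissimilarities."""
    return sum(a * d_variable(x_i, t_i, w_i)
               for a, x_i, t_i, w_i in zip(alpha, X, T, W))

# Hypothetical SO with two variables (2 and 3 categories) and a representative.
X = [[0.5, 0.5], [1.0, 0.0, 0.0]]
T = [[0.0, 1.0], [0.5, 0.5, 0.0]]
W = [[1.0, 1.0], [2.0, 2.0, 2.0]]   # weights w_xij, different for each variable
alpha = [0.5, 0.5]                  # alpha_i >= 0, summing to 1

print(d_object(X, T, W, alpha))
```

With unit weights and this choice of $\delta$, `d_variable` reduces to the squared Euclidean distance between the two distributions.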

  13. Leaders method. The leaders method is a generalization of the popular nonhierarchical k-means clustering method. The idea is to get an "optimal" clustering into a pre-specified number of clusters with the following iterative procedure:

     determine an initial clustering
     repeat
         determine the leaders of the clusters in the current clustering;
         assign each unit to the nearest new leader, producing a new clustering
     until the leaders stabilize.
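
The iterative procedure can be sketched as follows. This is a simplified illustration, not the author's implementation: each unit is assumed to be a single probability vector, the dissimilarity is assumed to be the squared Euclidean distance with unit weights, and under those assumptions the leader of a cluster is its componentwise mean. All function names and the sample data are hypothetical.

```python
def dissimilarity(x, t):
    """Squared Euclidean distance (assumed delta with unit weights)."""
    return sum((p - q) ** 2 for p, q in zip(x, t))

def leader(cluster):
    """Componentwise mean, which minimizes the summed squared differences."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]

def leaders_method(units, assignment, k, max_iter=100):
    """Start from an initial clustering (list of cluster indices) and iterate."""
    leaders = [None] * k
    for _ in range(max_iter):
        # Determine the leaders of the clusters in the current clustering.
        new_leaders = []
        for c in range(k):
            members = [x for x, a in zip(units, assignment) if a == c]
            if members:
                new_leaders.append(leader(members))
            elif leaders[c] is not None:
                new_leaders.append(leaders[c])  # empty cluster keeps its old leader
            else:
                raise ValueError("initial clusters must be non-empty")
        if new_leaders == leaders:  # the leaders have stabilized
            break
        leaders = new_leaders
        # Assign each unit to the nearest new leader.
        assignment = [
            min(range(k), key=lambda c: dissimilarity(x, leaders[c]))
            for x in units
        ]
    return assignment, leaders

# Hypothetical units: distributions over two categories.
units = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
assignment, leaders = leaders_method(units, [0, 0, 1, 0], k=2)
print(assignment, leaders)
```

Keeping the old leader for an empty cluster is one of several possible conventions; it is chosen here only to keep the sketch short.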

  14. Selection of the new leaders. Given a cluster $C$, the corresponding leader $T_C$ is the solution of the problem

     $$T_C = \operatorname*{argmin}_{T} \sum_{X \in C} d(X, T) = \Big[ \operatorname*{argmin}_{t_i} \sum_{X \in C} d(x_i, t_i) \Big]_{i=1}^{m}$$

     Therefore $T_C = [t_i^*]$ and $t_i^* = \operatorname*{argmin}_{t_i} \sum_{X \in C} d(x_i, t_i)$. To simplify the notation we omit the index $i$.

     $$t^* = \operatorname*{argmin}_{t} \sum_{X \in C} d(x, t) = \Big[ \operatorname*{argmin}_{t_j \in \mathbb{R}} \sum_{X \in C} w_{xj} \, \delta(p_{xj}, t_j) \Big]_{j=1}^{k}$$

  15. Leaders. Again we omit the index $j$:

     $$t^* = \operatorname*{argmin}_{t \in \mathbb{R}} \sum_{X \in C} w_x \, \delta(p_x, t)$$

     This is a standard optimization problem with one real variable. The solution has to satisfy the condition

     $$\frac{\partial}{\partial t} \sum_{X \in C} w_x \, \delta(p_x, t) = 0$$

     or

     $$\sum_{X \in C} w_x \frac{\partial \delta(p_x, t)}{\partial t} = 0 \quad (7)$$
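
As a worked instance of condition (7), assume the basic dissimilarity is the squared difference, $\delta(p_x, t) = (p_x - t)^2$. This choice is an assumption made here for illustration; the slides up to this point leave $\delta$ general. Then

$$\sum_{X \in C} w_x \frac{\partial (p_x - t)^2}{\partial t} = -2 \sum_{X \in C} w_x (p_x - t) = 0$$

and, provided $\sum_{X \in C} w_x > 0$,

$$t^* = \frac{\sum_{X \in C} w_x \, p_x}{\sum_{X \in C} w_x}.$$

The second derivative, $2 \sum_{X \in C} w_x$, is positive, so this stationary point is a minimum. Under this assumption the leader's component is the weighted mean of the units' components, and with all $w_x = 1$ it is the ordinary mean, as in the k-means method.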
