Statistical Tools for Data Science, Part II: Unsupervised Learning
Massih-Reza Amini, Université Grenoble Alpes, Laboratoire d'Informatique de Grenoble
Massih-Reza.Amini@imag.fr
Clustering
❑ The aim of clustering is to identify disjoint groups of observations within a given collection.
⇒ The aim is to find homogeneous groups by assembling observations that are close to one another, while best separating those that are different.
❑ Let G be a partition found over the collection C of N observations. An element of G is called a group (or cluster). A group G_k, where 1 ≤ k ≤ |G|, corresponds to a subset of observations in C.
❑ A representative of a group G_k, generally its center of gravity r_k, is called its prototype.
Classification vs. Clustering
❑ In classification: we have pairs of examples made up of observations and their associated class labels, (x, y) ∈ R^d × {1, ..., K}.
❑ The class information is provided by an expert, and the aim is to find a prediction function f : R^d → Y that captures the association between the inputs and the outputs, following the ERM or the SRM principle.
❑ In clustering: the class information does not exist, and the aim is to find homogeneous clusters or groups reflecting the relationships between observations.
❑ The main hypothesis here is that these relationships can be found from the layout of the examples in the feature space.
❑ The exact number of groups for a problem is very difficult to determine, and it is generally fixed beforehand to some arbitrary value.
❑ The partitioning is usually done iteratively, and it mainly depends on the initialization.
K-means algorithm [MacQueen, 1967]
❑ The K-means algorithm seeks the partition for which the sum of squared distances between the observations and the prototype r_k of their group is minimised:

argmin_G ∑_{k=1}^{K} ∑_{x ∈ G_k} ||x − r_k||²

❑ From a given set of centroids, the algorithm then iteratively
  1. assigns each observation to the centroid to which it is the closest, resulting in new clusters;
  2. estimates new centroids for the clusters that have been found.
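The two alternating steps can be written compactly; a minimal NumPy sketch is given below (the function name kmeans, the toy data X, and the convergence test are illustrative choices, not the course's reference implementation).

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Plain K-means: alternate assignment and centroid update."""
    rng = np.random.default_rng(seed)
    # Initialise the K centroids with K observations drawn at random.
    r = X[rng.choice(len(X), size=K, replace=False)]
    for _ in range(n_iter):
        # Assignment step: each observation goes to its closest centroid.
        d2 = ((X[:, None, :] - r[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: recompute each centroid as the mean of its cluster.
        new_r = np.array([X[labels == k].mean(axis=0) if np.any(labels == k)
                          else r[k] for k in range(K)])
        if np.allclose(new_r, r):   # stop when the centroids no longer move
            break
        r = new_r
    return labels, r

# Toy usage on two random 2-D blobs.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 5])
labels, centroids = kmeans(X, K=2)
```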
Clustering with K-means (illustrative figures, shown over several slides)
But also ... (illustrative figures, shown over several slides)
Different forms of clustering
There are two main forms of clustering:
1. Flat partitioning, where the groups are supposed to be independent of one another. The user then chooses a number of clusters and a threshold over the similarity measure.
2. Hierarchical partitioning, where the groups are structured in the form of a taxonomy, which in general is a binary tree (each group has two children).
Hierarchical partitioning
❑ Hierarchical clustering constructs a tree, which can be built
  ❑ bottom-up, by growing the tree from the individual observations (agglomerative techniques), or
  ❑ top-down, by splitting the collection starting from the root (divisive techniques).
❑ Hierarchical methods are purely deterministic and do not require the number of groups to be fixed beforehand.
❑ On the other hand, their complexity is in general quadratic in the number of observations (N)!
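As an illustration of the agglomerative (bottom-up) variant, here is a small sketch using SciPy's hierarchical-clustering routines; the Ward linkage criterion and the toy data are arbitrary choices, not prescribed by the slides.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data: two loose groups in the plane.
X = np.vstack([np.random.randn(20, 2), np.random.randn(20, 2) + 4])

# Agglomerative (bottom-up) clustering: the linkage matrix encodes the whole
# binary tree, built by repeatedly merging the two closest groups.
Z = linkage(X, method="ward")

# The tree can be cut afterwards at any level, e.g. into 2 flat clusters;
# the number of groups is thus not needed during the construction itself.
labels = fcluster(Z, t=2, criterion="maxclust")
```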
Steps of clustering
Clustering is an iterative process including the following steps:
1. Choose a similarity measure and possibly compute a similarity matrix.
2. Clustering:
   a. choose a family of partitioning methods;
   b. choose an algorithm within that family.
3. Validate the obtained groups.
4. Return to step 2, modifying the parameters of the clustering algorithm or the family of partitioning methods.
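One way to carry out steps 2-4 in practice is sketched below, assuming scikit-learn; the silhouette score is used here as one possible validation criterion for step 3, and the range of cluster numbers tried is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.randn(200, 5)                # the collection of observations

best_k, best_score = None, -1.0
for k in range(2, 8):                      # step 2: run a partitioning method
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    score = silhouette_score(X, labels)    # step 3: validate the groups
    if score > best_score:                 # step 4: adjust the parameter and retry
        best_k, best_score = k, score
```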
Similarity measures
There exist several similarity measures or distances; the most common ones are:
❑ the Jaccard measure, which estimates the proportion of common terms within two documents. In the case where the feature values are between 0 and 1, this measure takes the form:

sim_Jaccard(x, x') = (∑_{i=1}^{d} x_i x'_i) / (∑_{i=1}^{d} (x_i + x'_i − x_i x'_i))

❑ the Dice coefficient, which takes the form:

sim_Dice(x, x') = (2 ∑_{i=1}^{d} x_i x'_i) / (∑_{i=1}^{d} x_i² + ∑_{i=1}^{d} (x'_i)²)
Similarity measures
❑ the cosine similarity writes:

sim_cos(x, x') = (∑_{i=1}^{d} x_i x'_i) / (√(∑_{i=1}^{d} x_i²) √(∑_{i=1}^{d} (x'_i)²))

❑ the Euclidean distance is given by:

dist_eucl(x, x') = ||x − x'||₂ = √(∑_{i=1}^{d} (x_i − x'_i)²)

This distance is then transformed into a similarity measure, for example by taking its opposite.
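The four measures above translate directly into NumPy; a minimal sketch follows (the function names and the example vectors are illustrative).

```python
import numpy as np

def sim_jaccard(x, xp):
    """Jaccard similarity for feature values in [0, 1]."""
    return np.dot(x, xp) / np.sum(x + xp - x * xp)

def sim_dice(x, xp):
    """Dice coefficient."""
    return 2 * np.dot(x, xp) / (np.dot(x, x) + np.dot(xp, xp))

def sim_cos(x, xp):
    """Cosine similarity."""
    return np.dot(x, xp) / (np.linalg.norm(x) * np.linalg.norm(xp))

def dist_eucl(x, xp):
    """Euclidean distance; its opposite can serve as a similarity."""
    return np.linalg.norm(x - xp)

x  = np.array([1.0, 0.0, 0.5])
xp = np.array([0.5, 0.5, 0.5])
print(sim_jaccard(x, xp), sim_dice(x, xp), sim_cos(x, xp), dist_eucl(x, xp))
```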
Mixture models
❑ With probabilistic approaches, we suppose that each group G_k is generated by a probability density with parameters θ_k.
❑ Following the law of total probability, an observation x is then supposed to be generated with probability

P(x, Θ) = ∑_{k=1}^{K} π_k P(x | y = k, θ_k),  where π_k = P(y = k)

and Θ = {π_k, θ_k; k ∈ {1, ..., K}} are the parameters of the mixture.
❑ The aim is then to find the parameters Θ with which the mixture model best fits the observations.
Mixture models (2)
❑ If we have a collection of N observations, x_{1:N}, the log-likelihood writes:

L_M(Θ) = ∑_{i=1}^{N} ln [ ∑_{k=1}^{K} π_k P(x_i | y = k, θ_k) ]

❑ The aim is then to find the parameters Θ* that maximize this criterion:

Θ* = argmax_Θ L_M(Θ)

❑ The direct maximisation of this criterion is not possible in closed form, because it involves a sum of logarithms of sums.
Mixture models (3)
❑ We then use iterative methods for its maximisation (e.g. the EM algorithm).
❑ Once the optimal parameters of the mixture are found, each observation is assigned to a group following the Bayesian decision rule:

x ∈ G_k ⇔ k = argmax_ℓ P(y = ℓ | x, Θ*)

where, for all ℓ ∈ {1, ..., K},

P(y = ℓ | x, Θ*) = π*_ℓ P(x | y = ℓ, θ*_ℓ) / P(x, Θ*) ∝ π*_ℓ P(x | y = ℓ, θ*_ℓ)
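Assuming Gaussian component densities as a concrete example, this fit-then-assign procedure can be sketched with scikit-learn's GaussianMixture (the toy data and the number of components are illustrative); predict_proba returns the posteriors P(y = ℓ | x, Θ*) used by the decision rule.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Two Gaussian groups in 2-D.
X = np.vstack([np.random.randn(100, 2), np.random.randn(100, 2) + 4])

# Fit the mixture parameters Theta = {pi_k, theta_k} by maximum likelihood
# (scikit-learn maximises the likelihood with the EM algorithm internally).
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

# Posterior P(y = l | x, Theta*) for every observation ...
posteriors = gmm.predict_proba(X)
# ... and the Bayesian decision rule: assign x to its most probable group.
labels = posteriors.argmax(axis=1)          # equivalent to gmm.predict(X)
```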
EM algorithm [Dempster et al., 1977]
❑ The idea behind the algorithm is to introduce hidden random variables Z such that, if Z were known, the value of the parameters maximizing the likelihood would be simple to find:

L_M(Θ) = ln ∑_Z P(x_{1:N} | Z, Θ) P(Z | Θ)

❑ Denoting the current estimates of the parameters at iteration t by Θ^(t), the next iteration t+1 consists in finding new parameters Θ that maximize L_M(Θ) − L_M(Θ^(t)):

L_M(Θ) − L_M(Θ^(t)) = ln ∑_Z P(Z | x_{1:N}, Θ^(t)) · [ P(x_{1:N} | Z, Θ) P(Z | Θ) ] / [ P(Z | x_{1:N}, Θ^(t)) P(x_{1:N} | Θ^(t)) ]
EM algorithm [Dempster et al., 1977]
❑ From Jensen's inequality and the concavity of the logarithm, it follows that:

L_M(Θ) − L_M(Θ^(t)) ≥ ∑_Z P(Z | x_{1:N}, Θ^(t)) ln [ P(x_{1:N} | Z, Θ) P(Z | Θ) ] / [ P(x_{1:N} | Θ^(t)) P(Z | x_{1:N}, Θ^(t)) ]

❑ Let

Q(Θ, Θ^(t)) = L_M(Θ^(t)) + ∑_Z P(Z | x_{1:N}, Θ^(t)) ln [ P(x_{1:N} | Z, Θ) P(Z | Θ) ] / [ P(x_{1:N} | Θ^(t)) P(Z | x_{1:N}, Θ^(t)) ]
EM algorithm [Dempster et al., 1977]
[Figure: the lower bound Q(Θ, Θ^(t)) coincides with L_M(Θ) at Θ = Θ^(t); its maximizer Θ^(t+1) satisfies L_M(Θ^(t+1)) ≥ L_M(Θ^(t)).]
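To make the E and M steps concrete, here is a minimal from-scratch sketch of EM for a one-dimensional Gaussian mixture (NumPy/SciPy assumed; the function em_gmm_1d, its random initialization, and the fixed number of iterations are illustrative simplifications, with no convergence test on L_M).

```python
import numpy as np
from scipy.stats import norm

def em_gmm_1d(x, K=2, n_iter=50, seed=0):
    """EM for a 1-D Gaussian mixture: alternate the E and M steps."""
    rng = np.random.default_rng(seed)
    pi = np.full(K, 1.0 / K)                    # mixing proportions pi_k
    mu = rng.choice(x, size=K, replace=False)   # component means
    sigma = np.full(K, x.std())                 # component standard deviations
    for _ in range(n_iter):
        # E-step: posterior responsibilities P(y = k | x_i, Theta^(t)).
        resp = pi * norm.pdf(x[:, None], mu, sigma)      # shape (N, K)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: re-estimate the parameters that maximize Q(Theta, Theta^(t)).
        Nk = resp.sum(axis=0)
        pi = Nk / len(x)
        mu = (resp * x[:, None]).sum(axis=0) / Nk
        sigma = np.sqrt((resp * (x[:, None] - mu) ** 2).sum(axis=0) / Nk)
    return pi, mu, sigma

# Toy usage: two well-separated 1-D Gaussian samples.
x = np.concatenate([np.random.randn(200), np.random.randn(200) + 5])
print(em_gmm_1d(x, K=2))
```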