Lecture 6: Clustering Felix Held, Mathematical Sciences - PowerPoint PPT Presentation


  1. Lecture 6: Clustering
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
5th April 2019

  2. Projects
▶ Focus on challenging the algorithms and their assumptions (all groups)
▶ Keep your presentations short (∼10 min)
▶ Send in your presentation and code by 10.00 on Friday
▶ There are 30 groups across 3 rooms, i.e. not every group might get to present (it is not to your disadvantage if you cannot present because there is not enough time)
▶ We will group similar topics to allow for better discussion

  3. Importance of standardisation (I)
The overall issue: Subjectivity vs Objectivity
(Co-)variance is scale dependent: If we have a sample (size $n$) of variables $x$ and $y$, then their empirical covariance is
$$s_{xy} = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$
If $x$ is scaled by a factor $c$, i.e. $z = c \cdot x$, then
$$s_{zy} = \frac{1}{n-1} \sum_{i=1}^{n} (z_i - \bar{z})(y_i - \bar{y}) = \frac{1}{n-1} \sum_{i=1}^{n} (c \cdot x_i - c \cdot \bar{x})(y_i - \bar{y}) = c \cdot s_{xy}$$
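
To make the scale dependence concrete, here is a minimal numpy sketch (not part of the lecture material; the simulated data and seed are arbitrary) checking that rescaling x by a factor c rescales the empirical covariance by the same factor:

    import numpy as np

    rng = np.random.default_rng(0)        # arbitrary seed, for reproducibility only
    n = 200
    x = rng.normal(size=n)
    y = 0.5 * x + rng.normal(size=n)

    c = 10.0
    z = c * x                             # x measured on a different scale

    s_xy = np.cov(x, y, ddof=1)[0, 1]     # empirical covariance of x and y
    s_zy = np.cov(z, y, ddof=1)[0, 1]     # covariance after rescaling x by c

    print(np.isclose(s_zy, c * s_xy))     # True: s_zy = c * s_xy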

  4. Importance of standardisation (II)
(Co-)variance is scale dependent: $s_{zy} = c \cdot s_{xy}$ where $z = c \cdot x$
▶ By scaling variables we can therefore make them as large/influential or small/insignificant as we want, which is a very subjective process
▶ By standardising variables we can get rid of scaling and reach an objective point-of-view
▶ Do we get rid of information?
▶ The typical range of a variable is compressed, but if most samples for a variable fall into that range, then it is not very informative after all
▶ Real data is not a perfect Gaussian point cloud and therefore there will still be dominating directions after standardisation
▶ Outliers will still be outliers
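
A small sketch of the standardisation step itself (an illustration, not lecture code; the helper name is made up): after centring and scaling to unit sample standard deviation, x and its rescaled copy z = c · x become indistinguishable, so the arbitrary scale is gone.

    import numpy as np

    def standardise(v):
        # centre and scale a 1-D array to zero mean and unit sample standard deviation
        return (v - v.mean()) / v.std(ddof=1)

    rng = np.random.default_rng(1)
    x = rng.normal(loc=5.0, scale=2.0, size=100)
    z = 1000.0 * x                        # same information, different (arbitrary) units

    print(np.allclose(standardise(x), standardise(z)))   # True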

  5. Importance of standardisation (III)
UCI Wine dataset (three different types of wine with $p = 13$ characteristics)
[Figure: scatter plots of Proline vs Alcohol and PC2 vs PC1, comparing the raw data with the centred + standardised data]
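
The comparison can be reproduced in spirit with scikit-learn (a sketch under the assumption that scikit-learn is available; the lecture's own plotting code is not shown here):

    from sklearn.datasets import load_wine
    from sklearn.decomposition import PCA
    from sklearn.preprocessing import StandardScaler

    X, y = load_wine(return_X_y=True)      # 178 wines, p = 13 characteristics, 3 classes

    pca_raw = PCA(n_components=2).fit(X)
    pca_std = PCA(n_components=2).fit(StandardScaler().fit_transform(X))

    # On the raw data the large-valued Proline variable dominates the first component;
    # after standardisation the variance is spread across several directions.
    print(pca_raw.explained_variance_ratio_)
    print(pca_std.explained_variance_ratio_)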

  6. Class-related dimension reduction

  7. Better data projection for classification?
Idea: Find directions along which projections result in minimal within-class scatter and maximal between-class separation.
[Figure: projection onto the first principal component (PC1) vs projection onto the first discriminant (LD1), together with the LDA decision boundary]

  8. Classification and principal components
Note: The principal component directions do not take class labels into account. Classification after projection on these directions can be problematic.
[Figure: three class means $\mu_1, \mu_2, \mu_3$ and their projection onto the first principal component PC1]
In LDA the covariance matrix of the features within each class is $\hat{\Sigma}$. Now we will consider the within-class scatter matrix $\hat{\Sigma}_W = (n - K) \hat{\Sigma}$. In addition define
$$\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i,$$
and the between-class scatter matrix
$$\hat{\Sigma}_B = \sum_{k=1}^{K} n_k (\hat{\mu}_k - \hat{\mu})(\hat{\mu}_k - \hat{\mu})^T.$$
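
A plain numpy sketch of the two scatter matrices as defined above (the function name and interface are illustrative, not from the lecture):

    import numpy as np

    def scatter_matrices(X, labels):
        # X: (n, p) data matrix, labels: (n,) array of class labels
        n, p = X.shape
        mu = X.mean(axis=0)                       # overall mean
        S_W = np.zeros((p, p))                    # within-class scatter, equals (n - K) * pooled covariance
        S_B = np.zeros((p, p))                    # between-class scatter
        for k in np.unique(labels):
            Xk = X[labels == k]
            mu_k = Xk.mean(axis=0)
            S_W += (Xk - mu_k).T @ (Xk - mu_k)
            d = (mu_k - mu).reshape(-1, 1)
            S_B += Xk.shape[0] * (d @ d.T)
        return S_W, S_B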

  9. Fisher’s Problem
Recall: The variance of the data projected on a direction given by $\mathbf{a}$ can be calculated as $\mathbf{a}^T \hat{\Sigma}_W \mathbf{a}$. In analogy, the variance between class centres along $\mathbf{a}$ is calculated as $\mathbf{a}^T \hat{\Sigma}_B \mathbf{a}$.
Optimization goal: Maximize over $\mathbf{a}$ subject to $\|\mathbf{a}\| = 1$
$$J(\mathbf{a}) = \frac{\mathbf{a}^T \hat{\Sigma}_B \mathbf{a}}{\mathbf{a}^T \hat{\Sigma}_W \mathbf{a}},$$
which is a more general form of a Rayleigh quotient and is called Fisher’s problem. The goal is to maximize variance between class centres while simultaneously minimizing variance within each class.
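
The objective itself is a one-liner; a sketch using the scatter matrices from the previous sketch (the function name is illustrative):

    import numpy as np

    def rayleigh_quotient(a, S_B, S_W):
        # J(a) = (a^T S_B a) / (a^T S_W a); invariant to rescaling of a
        a = np.asarray(a, dtype=float)
        return (a @ S_B @ a) / (a @ S_W @ a)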

  10. Solving Fisher’s Problem
Computation of solutions:
1. Compute the eigen-decomposition (the matrix is real and symmetric)
$$\hat{\Sigma}_W^{-1/2} \hat{\Sigma}_B \hat{\Sigma}_W^{-1/2} = \mathbf{V} \mathbf{D} \mathbf{V}^T$$
where $\mathbf{V} \in \mathbb{R}^{p \times p}$ is orthogonal and $\mathbf{D} \in \mathbb{R}^{p \times p}$ is diagonal.
2. Set $\mathbf{A} = \hat{\Sigma}_W^{-1/2} \mathbf{V}$. The columns of $\mathbf{A}$ solve Fisher’s problem (as with PCA, the $k$-th solution maximizes Fisher’s problem on the orthogonal complement of the first $k - 1$ solutions).
Note: There are at most $K - 1$ solutions $\mathbf{a}_k$ to Fisher’s problem (because $\hat{\Sigma}_B$ has rank $\leq K - 1$).
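
A sketch of these two computational steps (numpy/scipy assumed; variable names are illustrative and follow the reconstruction above):

    import numpy as np
    from scipy.linalg import eigh

    def fisher_directions(S_W, S_B, K):
        # Symmetric inverse square root of S_W (assumed positive definite)
        w_vals, w_vecs = eigh(S_W)
        W_inv_sqrt = w_vecs @ np.diag(1.0 / np.sqrt(w_vals)) @ w_vecs.T
        # Step 1: eigen-decomposition of the real symmetric matrix S_W^{-1/2} S_B S_W^{-1/2}
        e_vals, V = eigh(W_inv_sqrt @ S_B @ W_inv_sqrt)
        order = np.argsort(e_vals)[::-1]          # eigh returns ascending order, so put largest first
        # Step 2: map back; the columns of A solve Fisher's problem
        A = W_inv_sqrt @ V[:, order]
        return A[:, :K - 1]                       # at most K - 1 useful directions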

  11. Discriminant Variables and Reduced-rank LDA
▶ The vectors $\mathbf{a}_k$ determined by solving Fisher’s problem can be used like PCA, but are aware of class labels and give the optimal separation of projected class centroids
▶ Projecting the data onto the $k$-th solution gives the $k$-th discriminant variable $\mathbf{a}_k^T \mathbf{x}$
▶ Using only the first $m < K - 1$ is called reduced-rank LDA
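
For completeness, a minimal scikit-learn sketch of the reduced-rank idea on the Wine data (the lecture does not prescribe a library; the choice m = 1 is for illustration only):

    from sklearn.datasets import load_wine
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_wine(return_X_y=True)                 # K = 3 classes, p = 13 features

    lda = LinearDiscriminantAnalysis(n_components=1)  # keep only m = 1 < K - 1 discriminant variable
    Z = lda.fit_transform(X, y)                       # discriminant variable a_1^T x for each sample
    print(Z.shape)                                    # (178, 1)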
