Lecture 6: Clustering
Felix Held, Mathematical Sciences
MSA220/MVE440 Statistical Learning for Big Data
5th April 2019
Projects (all groups)

▶ Keep your presentations short (∼10 min)
▶ Focus on challenging the algorithms and their assumptions
▶ Send in your presentation and code by 10.00 on Friday
▶ There are 30 groups across 3 rooms, i.e. not every group might get to present (it is not to your disadvantage if you cannot present because there is not enough time)
▶ We will group similar topics to allow for better discussion
Importance of standardisation (I)

The overall issue: Subjectivity vs Objectivity

(Co-)variance is scale dependent: If we have a sample (size n) of variables x and y, then their empirical covariance is

s_xy = 1/(n − 1) ∑_{i=1}^{n} (x_i − x̄)(y_i − ȳ)

If x is scaled by a factor c, i.e. z = c ⋅ x, then

s_zy = 1/(n − 1) ∑_{i=1}^{n} (c ⋅ x_i − c ⋅ x̄)(y_i − ȳ) = c ⋅ s_xy
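This scale dependence is easy to verify numerically. Below is a minimal NumPy sketch; the data are simulated and the factor c = 10 is arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)             # variable x
y = 0.5 * x + rng.normal(size=100)   # a variable y correlated with x
c = 10.0                             # arbitrary scaling factor
z = c * x                            # rescaled version of x

s_xy = np.cov(x, y)[0, 1]            # empirical covariance of x and y
s_zy = np.cov(z, y)[0, 1]            # empirical covariance of z and y

print(s_zy, c * s_xy)                # equal up to floating point: s_zy = c * s_xy
```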
Importance of standardisation (II)

(Co-)variance is scale dependent: s_zy = c ⋅ s_xy where z = c ⋅ x

▶ By scaling variables we can therefore make them as large/influential or as small/insignificant as we want, which is a very subjective process
▶ By standardising variables we can get rid of scaling and reach an objective point of view
▶ Do we get rid of information?
▶ The typical range of a variable is compressed, but if most samples for a variable fall into that range, then it is not very informative after all
▶ Real data is not a perfect Gaussian point cloud and therefore there will still be dominating directions after standardisation
▶ Outliers will still be outliers
Importance of standardisation (III)

UCI Wine dataset (three different types of wine with p = 13 characteristics)

[Figure: scatter plots of Proline vs Alcohol and of PC2 vs PC1, for the raw data and for the centred + standardised data]
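The comparison in the figure can be reproduced along the following lines. This is only a sketch: it assumes scikit-learn's bundled copy of the UCI Wine data and omits the plotting code.

```python
from sklearn.datasets import load_wine
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X, wine_type = load_wine(return_X_y=True)   # n = 178 wines, p = 13 characteristics

# PCA on the raw data: Proline, with values in the hundreds, dominates PC1
pca_raw = PCA(n_components=2).fit(X)

# PCA after centring and scaling each variable to unit variance
X_std = StandardScaler().fit_transform(X)
pca_std = PCA(n_components=2).fit(X_std)

print(pca_raw.explained_variance_ratio_)    # first PC captures almost all variance
print(pca_std.explained_variance_ratio_)    # variance is spread over more directions
```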
Class-related dimension reduction
Better data projection for classification?

Idea: Find directions along which projections result in minimal within-class scatter and maximal between-class separation.

[Figure: projection onto the first principal component (PC1) vs projection onto the first discriminant (LD1), shown together with the LDA decision boundary]
Classification and principal components

Note: The principal component directions do not take class labels into account. Classification after projection on these directions can be problematic.

[Figure: class means μ̂₁, μ̂₂, μ̂₃ and the first principal component (PC1) of Σ̂_W]

In LDA the covariance matrix of the features within each class is Σ̂. Now we will consider the within-class scatter matrix Σ̂_W = (n − K) Σ̂. In addition define the between-class scatter matrix

Σ̂_B = ∑_{k=1}^{K} n_k (μ̂_k − μ̄)(μ̂_k − μ̄)^T,

where n_k and μ̂_k are the size and mean of class k, and μ̄ = (1/n) ∑_{i=1}^{n} 𝐱_i is the overall mean.
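For concreteness, a minimal NumPy sketch of these definitions; the function name scatter_matrices is illustrative, and X, y stand for a generic labelled dataset.

```python
import numpy as np

def scatter_matrices(X, y):
    """Within-class (S_W) and between-class (S_B) scatter matrices.

    X is an (n, p) data matrix, y an (n,) vector of class labels.
    """
    n, p = X.shape
    mu_bar = X.mean(axis=0)                    # overall mean
    S_W = np.zeros((p, p))
    S_B = np.zeros((p, p))
    for k in np.unique(y):
        X_k = X[y == k]                        # samples in class k
        mu_k = X_k.mean(axis=0)                # class mean
        S_W += (X_k - mu_k).T @ (X_k - mu_k)   # adds (n_k - 1) * Sigma_hat_k
        d = (mu_k - mu_bar).reshape(-1, 1)
        S_B += X_k.shape[0] * (d @ d.T)        # n_k (mu_k - mu_bar)(mu_k - mu_bar)^T
    return S_W, S_B
```

Summing (n_k − 1) Σ̂_k over the classes gives exactly (n − K) Σ̂, i.e. the within-class scatter matrix defined above.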
Fisher’s Problem

Recall: The variance of the data projected onto a direction 𝐚 can be calculated as a quadratic form. The within-class scatter along 𝐚 is 𝐚^T Σ̂_W 𝐚 and, in analogy, the variance between class centres along 𝐚 is 𝐚^T Σ̂_B 𝐚.

Optimization goal: Maximize over 𝐚

J(𝐚) = (𝐚^T Σ̂_B 𝐚) / (𝐚^T Σ̂_W 𝐚)   subject to ‖𝐚‖ = 1,

which is a more general form of a Rayleigh quotient and is called Fisher’s problem. The goal is to maximize the variance between class centres while simultaneously minimizing the variance within each class.
Solving Fisher’s Problem

Note: There are at most K − 1 solutions 𝐚_k to Fisher’s problem (because Σ̂_B has rank ≤ K − 1). As with PCA, the k-th solution maximizes Fisher’s problem on the orthogonal complement of the first k − 1 solutions.

Computation of solutions:
1. Compute the eigen-decomposition

   Σ̂_W^{−1/2} Σ̂_B Σ̂_W^{−1/2} = 𝐕 𝐃 𝐕^T

   (the matrix on the left is real and symmetric), where 𝐕 ∈ ℝ^{p×p} is orthogonal and 𝐃 ∈ ℝ^{p×p} is diagonal.
2. Set 𝐀 = Σ̂_W^{−1/2} 𝐕. The columns of 𝐀 solve Fisher’s problem.
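A minimal NumPy sketch of these two steps; it assumes Σ̂_W is positive definite, and the function name fisher_directions is illustrative, not a library routine.

```python
import numpy as np

def fisher_directions(S_W, S_B):
    """Solve Fisher's problem via the symmetric eigen-decomposition above."""
    # Inverse square root of S_W (assumed positive definite)
    w_vals, w_vecs = np.linalg.eigh(S_W)
    S_W_inv_sqrt = w_vecs @ np.diag(1.0 / np.sqrt(w_vals)) @ w_vecs.T

    # Step 1: eigen-decomposition of the real symmetric matrix, M = V D V^T
    M = S_W_inv_sqrt @ S_B @ S_W_inv_sqrt
    d_vals, V = np.linalg.eigh(M)              # eigenvalues in ascending order

    # Step 2: back-transform and sort so the best direction comes first
    order = np.argsort(d_vals)[::-1]
    A = S_W_inv_sqrt @ V[:, order]
    return A                                   # columns solve Fisher's problem
```

Only the first K − 1 columns correspond to non-zero eigenvalues, matching the remark above on the number of solutions.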
Discriminant Variables and Reduced-rank LDA

▶ The vectors 𝐚_k determined by solving Fisher’s problem can be used like principal component directions, but they are aware of the class labels and give the optimal separation of the projected class centroids
▶ Projecting the data onto the k-th solution gives the k-th discriminant variable 𝐚_k^T 𝐱
▶ Using only the first q < K − 1 discriminant variables is called reduced-rank LDA (a short sketch follows below)
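Continuing the sketches above, the discriminant variables are simply projections onto the leading columns of 𝐀; scatter_matrices and fisher_directions are the illustrative helpers from the previous slides, and X, y a hypothetical labelled dataset.

```python
S_W, S_B = scatter_matrices(X, y)      # within- and between-class scatter
A = fisher_directions(S_W, S_B)        # columns solve Fisher's problem

q = 2                                  # choose q < K - 1 for reduced-rank LDA
Z = X @ A[:, :q]                       # first q discriminant variables a_k^T x
```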