unsupervised learning

Unsupervised learning, D. Dubhashi - PowerPoint PPT Presentation



  1. Clustering and Mixture Models (TDA231)
     Devdatt Dubhashi, dubhashi@chalmers.se
     Dept. of Computer Science and Engg., Chalmers University, March 2016

     Introduction
     ◮ Everything we've seen so far has been supervised: we were given a set of x_n and associated targets t_n.
     ◮ What if we just have x_n?
     ◮ For example: x_n is a binary vector indicating the products customer n has bought.
       ◮ Can group customers that buy similar products.
       ◮ Can group products bought together.
     ◮ This is known as clustering, and it is an example of unsupervised learning.
     ◮ "Supervised learning is just the icing on the cake which is unsupervised learning." Yann LeCun, NIPS 2016

     What we'll cover
     ◮ 2 algorithms: K-means and mixture models.
     ◮ The two are somewhat related.
     ◮ We'll also see how K-means can be kernelised.

     Clustering
     [Figure: left, the raw data; right, the same data after clustering, with points coloured according to cluster membership.]
     ◮ In this example each object has two attributes: x_n = [x_n1, x_n2]^T.

  2. K-means
     ◮ Assume that there are K clusters.
     ◮ Each cluster is defined by a position in the input space: µ_k = [µ_k1, µ_k2]^T.
     ◮ Each x_n is assigned to its closest cluster.
     ◮ Distance is normally the (squared) Euclidean distance: d_nk = (x_n − µ_k)^T (x_n − µ_k).
     [Figure: a 2-D data set, axes x_1 and x_2.]

     How do we find µ_k?
     ◮ No analytical solution – we can't write down µ_k as a function of X.
     ◮ Use an iterative algorithm (a code sketch follows after this slide):
       1. Guess µ_1, µ_2, ..., µ_K.
       2. Assign each x_n to its closest µ_k.
       3. Set z_nk = 1 if x_n is assigned to µ_k (0 otherwise).
       4. Update µ_k to the average of the x_n assigned to it:
          µ_k = (Σ_{n=1}^N z_nk x_n) / (Σ_{n=1}^N z_nk)
       5. Return to 2 until the assignments do not change.
     ◮ The algorithm will converge: it will reach a point where the assignments don't change.

     K-means – example
     [Figure: cluster means randomly assigned (top left); points assigned to their closest mean; cluster means updated to the mean of the assigned points.]
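     A minimal NumPy sketch of the iterative procedure above (steps 1-5). The function name kmeans, the initialisation from random data points, and the max_iter cap are illustrative choices, not part of the slides.

```python
import numpy as np

def kmeans(X, K, max_iter=100, seed=0):
    """Steps 1-5 above: guess means, assign points, update means, repeat."""
    rng = np.random.default_rng(seed)
    # Step 1: guess the K means by picking K distinct data points at random.
    mu = X[rng.choice(len(X), size=K, replace=False)].astype(float)
    assign = None
    for _ in range(max_iter):
        # Steps 2-3: assign each x_n to its closest mean (squared Euclidean distance).
        d = ((X[:, None, :] - mu[None, :, :]) ** 2).sum(axis=2)   # shape (N, K)
        new_assign = d.argmin(axis=1)
        # Step 5: stop once the assignments no longer change.
        if assign is not None and np.array_equal(new_assign, assign):
            break
        assign = new_assign
        # Step 4: update each mean to the average of the points assigned to it.
        for k in range(K):
            if np.any(assign == k):
                mu[k] = X[assign == k].mean(axis=0)
    return mu, assign
```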

  3. K-means – example (continued)
     [Figure: points re-assigned to their closest mean; cluster means updated to the mean of the assigned points.]
     [Figure: the solution at convergence.]

     Two issues with K-means
     ◮ What value of k should we use?
     ◮ How should we pick the initial centers?
     ◮ Both of these significantly affect the resulting clustering (a small demo follows after this slide).
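     A small demo of the second issue, reusing the kmeans sketch above on made-up synthetic data: different random initialisations can converge to different local solutions with different within-cluster variance.

```python
import numpy as np

# Three well-separated blobs of synthetic data (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(50, 2))
               for c in [(0, 0), (3, 0), (0, 3)]])

# Run K-means from different random initialisations and compare the
# within-cluster sum of squares of the resulting solutions.
for seed in range(3):
    mu, assign = kmeans(X, K=3, seed=seed)
    wss = sum(((X[assign == k] - mu[k]) ** 2).sum() for k in range(3))
    print(f"seed={seed}: within-cluster sum of squares = {wss:.2f}")
```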

  4. Initializing centers
     ◮ Pick k random points.
     ◮ Pick k points at random from the input points.
     ◮ Assign points at random to k groups and then take the centers of these groups.
     ◮ Pick a random input point for the first center, the next center at a point as far away from this as possible, the next as far away from the first two, ...

     k-means++ (D. Arthur and S. Vassilvitskii, 2007)
     ◮ Start with C_1 := {x}, where x is chosen at random from the input points.
     ◮ For i ≥ 2, pick a point x according to the probability distribution
          ν_i(x) = d^2(x, C_{i−1}) / Σ_y d^2(y, C_{i−1})
       and set C_i := C_{i−1} ∪ {x}.
     ◮ Gives a provably good O(log n) approximation to the optimal clustering (a code sketch follows after this slide).

     Choosing k
     ◮ Intra-cluster variance: W_k := (1/|C_k|) Σ_{x ∈ C_k} (x − µ_k)^2.
     ◮ W := Σ_k W_k.
     ◮ Pick k to minimize W.
     ◮ Elbow heuristic, Gap Statistic, ...

     Sum of Norms (SON) convex relaxation
     ◮ SON relaxation (Lindsten et al., 2011):
          min_µ Σ_i ‖x_i − µ_i‖^2 + λ Σ_{i<j} ‖µ_i − µ_j‖_2
     ◮ If you take only the first term, then µ_i = x_i for all i.
     ◮ If you take only the second term, then µ_i = µ_j for all i, j.
     ◮ By varying λ, we steer between these two extremes.
     ◮ We do not need to know k in advance and do not need to do careful initialization.
     ◮ Fast scalable algorithm with guarantees, under submission later today to ICML ...
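     A short sketch of the k-means++ seeding rule above. The function name and the NumPy vectorisation are assumptions; the returned centers would be used as the initial µ_k for K-means.

```python
import numpy as np

def kmeans_pp_init(X, K, seed=0):
    """k-means++: each new center is drawn with probability proportional to
    its squared distance d^2(x, C_{i-1}) to the nearest center chosen so far."""
    rng = np.random.default_rng(seed)
    # C_1: one center chosen uniformly at random from the input points.
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, K):
        C = np.array(centers)
        # d^2(x, C_{i-1}) for every point x.
        d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=2).min(axis=1)
        # nu_i(x) = d^2(x, C_{i-1}) / sum_y d^2(y, C_{i-1})
        probs = d2 / d2.sum()
        centers.append(X[rng.choice(len(X), p=probs)])
    return np.array(centers)
```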

  5. When does K-means break?
     [Figure: a data set consisting of two concentric rings (x_1 vs x_2), shown twice.]
     ◮ The data has clear cluster structure.
     ◮ The outer cluster cannot be represented as a single point.

     Kernelising K-means
     ◮ Maybe we can kernelise K-means?
     ◮ Cluster means: µ_k = (Σ_{m=1}^N z_mk x_m) / (Σ_{m=1}^N z_mk).
     ◮ Distances can be written as (defining N_k = Σ_n z_nk):
          (x_n − µ_k)^T (x_n − µ_k) = (x_n − N_k^{-1} Σ_{m=1}^N z_mk x_m)^T (x_n − N_k^{-1} Σ_{m=1}^N z_mk x_m)
     ◮ Multiply out:
          x_n^T x_n − 2 N_k^{-1} Σ_{m=1}^N z_mk x_n^T x_m + N_k^{-2} Σ_{m,l} z_mk z_lk x_m^T x_l
     ◮ Kernel substitution:
          k(x_n, x_n) − 2 N_k^{-1} Σ_{m=1}^N z_mk k(x_n, x_m) + N_k^{-2} Σ_{m,l=1}^N z_mk z_lk k(x_m, x_l)
       (a code sketch of this distance follows after this slide).
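     A sketch of the kernelised distance above as a single NumPy function, assuming a precomputed kernel matrix Kmat with entries k(x_m, x_l), a one-hot assignment matrix Z (entries z_nk), and no empty clusters; the names are assumptions.

```python
import numpy as np

def kernel_distances(Kmat, Z):
    """Distance of every point to every cluster 'center' in feature space:
    k(x_n,x_n) - 2/N_k sum_m z_mk k(x_n,x_m) + 1/N_k^2 sum_{m,l} z_mk z_lk k(x_m,x_l)."""
    Nk = Z.sum(axis=0)                        # cluster sizes N_k, shape (K,)
    term1 = np.diag(Kmat)[:, None]            # k(x_n, x_n), shape (N, 1)
    term2 = -2.0 * (Kmat @ Z) / Nk            # cross term, shape (N, K)
    term3 = np.diag(Z.T @ Kmat @ Z) / Nk**2   # within-cluster term, shape (K,)
    return term1 + term2 + term3              # shape (N, K)
```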

  6. Kernel K-means
     ◮ Algorithm (a code sketch follows after this slide):
       1. Choose a kernel and any necessary parameters.
       2. Start with random assignments z_nk.
       3. Assign each x_n to the nearest 'center', where the distance is defined as:
          k(x_n, x_n) − 2 N_k^{-1} Σ_{m=1}^N z_mk k(x_n, x_m) + N_k^{-2} Σ_{m,l=1}^N z_mk z_lk k(x_m, x_l)
       4. If the assignments have changed, return to 3.
     ◮ Note – there is no explicit µ_k. It would be N_k^{-1} Σ_n z_nk φ(x_n), but we don't know φ(x_n) for kernels; we only know φ(x_n)^T φ(x_m) (last week)...

     Kernel K-means – example
     [Figure: the concentric-ring data over successive iterations; continue re-assigning until convergence.]
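     A sketch of the full algorithm, reusing kernel_distances from the previous sketch. The RBF kernel and its gamma parameter are illustrative choices; any kernel matrix could be plugged in.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    # k(x, y) = exp(-gamma * ||x - y||^2), computed for all pairs of rows of X.
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-gamma * sq)

def kernel_kmeans(X, K, gamma=1.0, max_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    Kmat = rbf_kernel(X, gamma)                    # step 1: choose a kernel
    assign = rng.integers(K, size=len(X))          # step 2: random assignments z_nk
    for _ in range(max_iter):
        Z = np.eye(K)[assign]                      # one-hot z_nk, shape (N, K)
        new_assign = kernel_distances(Kmat, Z).argmin(axis=1)   # step 3
        if np.array_equal(new_assign, assign):     # step 4: stop when unchanged
            break
        assign = new_assign
    return assign
```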

  7. Kernel K-means – summary
     ◮ Makes the simple K-means algorithm more flexible.
     ◮ But we now have to set additional parameters.
     ◮ Very sensitive to initial conditions – lots of local optima.

     K-means – summary
     ◮ Simple (and effective) clustering strategy.
     ◮ Converges to a (local) minimum of: Σ_n Σ_k z_nk (x_n − µ_k)^T (x_n − µ_k).
     ◮ Sensitive to initialisation.
     ◮ How do we choose K? Tricky: the quantity above always decreases as K increases.
     ◮ Can use CV if we have a measure of 'goodness'; for clustering these will be application specific.

     Mixture models – thinking generatively
     [Figure: a 2-D data set that appears to fall into three groups.]
     ◮ Could we hypothesise a model that could have created this data?
     ◮ Each x_n seems to have come from one of three distributions.

     A generative model
     ◮ Assumption: each x_n comes from one of K different distributions.
     ◮ To generate X, for each n (a code sketch follows after this slide):
       1. Pick one of the K components.
       2. Sample x_n from this distribution.
     ◮ We already have X.
     ◮ Define the parameters of all these distributions as ∆.
     ◮ We'd like to reverse-engineer this process: learn ∆, which we can then use to find which component each point came from.
     ◮ Maximise the likelihood!
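     A toy version of the generative process above, with K Gaussian components; the mixing proportions, means and variance stand in for the parameters ∆ and are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
pi = np.array([0.5, 0.3, 0.2])                      # probability of picking each component
mu = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]) # component means
var = 0.3                                           # shared spherical variance

N = 200
components = rng.choice(len(pi), size=N, p=pi)      # step 1: pick one of the K components
X = mu[components] + rng.normal(scale=np.sqrt(var), size=(N, 2))  # step 2: sample x_n

# Learning reverses this process: given only X, estimate the parameters by
# maximising the likelihood and use them to infer which component each
# point came from.
```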
