

  1. Clustering via Uncoupled REgression (CURE). Kaizheng Wang, Department of ORFE, Princeton University. May 8th, 2020.

  2. Collaborators: Yuling Yan (Princeton ORFE), Mateo Díaz (Cornell CAM)

  3. Clustering

  4. Spherical Clusters: $\{x_i\}_{i=1}^n \sim \frac{1}{2}N(\mu, I_d) + \frac{1}{2}N(-\mu, I_d)$

  5. Spherical Clusters: $\{x_i\}_{i=1}^n \sim \frac{1}{2}N(\mu, I_d) + \frac{1}{2}N(-\mu, I_d)$
  • PCA: $\max_{\beta \in S^{d-1}} \frac{1}{n}\sum_{i=1}^n (\beta^\top x_i)^2$
  • k-means: $\min_{\mu_1, \mu_2, y} \frac{1}{n}\sum_{i=1}^n \| x_i - \mu_{y_i} \|_2^2$
  • SDP relaxations of k-means, etc.
  • Density-based methods require large samples
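  For concreteness, here is a minimal sketch of the two baselines on a synthetic spherical mixture, assuming NumPy and scikit-learn are available; the sample size, dimension, and signal strength are illustrative choices, not values from the slides.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n, d = 1000, 50
mu = np.zeros(d)
mu[0] = 2.0                                          # illustrative signal strength
y = rng.choice([-1, 1], size=n)                      # balanced hidden labels
X = y[:, None] * mu + rng.standard_normal((n, d))    # 1/2 N(mu, I_d) + 1/2 N(-mu, I_d)

# PCA baseline: project onto the top principal direction, cluster by sign.
Xc = X - X.mean(axis=0)
beta_pca = np.linalg.svd(Xc, full_matrices=False)[2][0]
y_pca = np.sign(Xc @ beta_pca)

# k-means baseline with two centers.
y_km = 2 * KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X) - 1

def cluster_error(y_hat, y_true):
    """Misclustering rate up to a global label swap."""
    return min(np.mean(y_hat != y_true), np.mean(-y_hat != y_true))

print("PCA:", cluster_error(y_pca, y), " k-means:", cluster_error(y_km, y))
```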

  6. Finding a Needle in a Haystack
  They are powerful but not omnipotent. Mixture $\frac{1}{2}N(\mu, \Sigma) + \frac{1}{2}N(-\mu, \Sigma)$; covariance: $\mu\mu^\top + \Sigma$.
  • PCA: the max-variance direction need not be useful unless $\Sigma \approx I$ or $\|\mu\|_2^2 / \|\Sigma\|_2 \gg 1$.
  • Reduction to the spherical case? Estimation of $\Sigma$ is difficult!

  7. Headaches
  • PCA and many others: need nice shapes & large separations.
  • Learning with non-convex losses: 1. Initialization (e.g. spectral methods); 2. Refinement (e.g. gradient descent). Commonly-used assumptions: isotropic, Gaussian, uniform, etc.
  [Figure omitted] Stretched mixtures can be catastrophic.

  8. Clustering via Uncoupled REgression
  • The CURE methodology
  • Theoretical guarantees

  9. Vanilla CURE
  Given centered $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, want $\beta \in \mathbb{R}^d$ such that $\beta^\top x_i \approx y_i$, $i \in [n]$.

  10. Vanilla CURE
  Given centered $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, want $\beta \in \mathbb{R}^d$ such that $\beta^\top x_i \approx y_i$, $i \in [n]$.
  Clustering via Uncoupled REgression: $\frac{1}{n}\sum_{i=1}^n \delta_{\beta^\top x_i} \approx \frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}$.

  11. Vanilla CURE
  Given centered $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, want $\beta \in \mathbb{R}^d$ such that $\beta^\top x_i \approx y_i$, $i \in [n]$.
  Clustering via Uncoupled REgression: $\frac{1}{n}\sum_{i=1}^n \delta_{\beta^\top x_i} \approx \frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}$.
  CURE: take $f$ with valleys at $\pm 1$, e.g. $f(x) = (x^2 - 1)^2$; solve $\min_{\beta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n f(\beta^\top x_i)$; return $\hat{y}_i = \mathrm{sgn}(\hat{\beta}^\top x_i)$.
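  A minimal sketch of vanilla CURE as stated on this slide, using plain gradient descent; the step size, iteration count, and initialization scale are illustrative assumptions rather than the paper's exact settings.

```python
import numpy as np

def vanilla_cure(X, n_iter=500, lr=0.05, seed=0):
    """Vanilla CURE: minimize (1/n) * sum_i f(beta^T x_i) with f(t) = (t^2 - 1)^2
    by gradient descent, then label each point by the sign of its projection."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    Xc = X - X.mean(axis=0)                      # the slide assumes centered data
    beta = rng.standard_normal(d)
    beta /= np.linalg.norm(beta)                 # random start on the unit sphere
    for _ in range(n_iter):
        t = Xc @ beta                            # projections beta^T x_i
        grad = Xc.T @ (4 * t * (t**2 - 1)) / n   # gradient of the quartic loss
        beta -= lr * grad
    return np.sign(Xc @ beta), beta
```

  Note that the returned labels are identified only up to a global sign flip, which is inherent to the uncoupled formulation.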

  12. Vanilla CURE
  $\frac{1}{n}\sum_{i=1}^n f(\beta^\top x_i)$ is non-convex by nature.
  • Projection pursuit (Friedman and Tukey, 1974), ICA (Hyvärinen and Oja, 2000)
  ‣ Maximize deviation from the null (Gaussian);
  ‣ Limited algorithmic guarantees.
  • Phase retrieval (Candès et al. 2011)
  ‣ Isotropic measurements, spectral initialization.

  13. Vanilla CURE with Intercept
  Given $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, find $\beta \in \mathbb{R}^d$ and $\alpha \in \mathbb{R}$ s.t. $\frac{1}{n}\sum_{i=1}^n \delta_{\alpha + \beta^\top x_i} \approx \frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}$.
  The naïve extension $\min_{\alpha \in \mathbb{R}, \beta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n f(\alpha + \beta^\top x_i)$ yields trivial solutions $(\hat{\alpha}, \hat{\beta}) = (\pm 1, \mathbf{0})$: it only forces $|\alpha + \beta^\top x_i| \approx 1$ rather than $\#\{i : \alpha + \beta^\top x_i \approx 1\} \approx n/2$.

  14. Vanilla CURE with Intercept
  Given $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, find $\beta \in \mathbb{R}^d$ and $\alpha \in \mathbb{R}$ s.t. $\frac{1}{n}\sum_{i=1}^n \delta_{\alpha + \beta^\top x_i} \approx \frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}$.
  CURE: $\min_{\alpha \in \mathbb{R}, \beta \in \mathbb{R}^d} \left\{ \frac{1}{n}\sum_{i=1}^n f(\alpha + \beta^\top x_i) + \frac{1}{2}(\alpha + \beta^\top \bar{x})^2 \right\}$.

  15. Vanilla CURE with Intercept
  Given $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, find $\beta \in \mathbb{R}^d$ and $\alpha \in \mathbb{R}$ s.t. $\frac{1}{n}\sum_{i=1}^n \delta_{\alpha + \beta^\top x_i} \approx \frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}$.
  CURE: $\min_{\alpha \in \mathbb{R}, \beta \in \mathbb{R}^d} \left\{ \frac{1}{n}\sum_{i=1}^n f(\alpha + \beta^\top x_i) + \frac{1}{2}(\alpha + \beta^\top \bar{x})^2 \right\}$.
  • $\frac{1}{n}\sum_{i=1}^n f(\alpha + \beta^\top x_i)$: forces $|\alpha + \beta^\top x_i| \approx 1$;
  • $\frac{1}{2}(\alpha + \beta^\top \bar{x})^2$: forces $\#\{i : \alpha + \beta^\top x_i \approx 1\} \approx n/2$.
  ‣ Moment matching. Extension: imbalanced cases.
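  A sketch of the intercept version with the moment-matching penalty from this slide, again via plain gradient descent; the hyperparameters are illustrative, not the paper's optimizer settings.

```python
import numpy as np

def cure_with_intercept(X, n_iter=500, lr=0.05, seed=0):
    """CURE with intercept: minimize
        (1/n) * sum_i f(alpha + beta^T x_i) + (1/2) * (alpha + beta^T xbar)^2
    with f(t) = (t^2 - 1)^2. The first term pushes |alpha + beta^T x_i| toward 1;
    the second (moment-matching) term rules out the trivial solution (+-1, 0)."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    xbar = X.mean(axis=0)
    beta = rng.standard_normal(d) / np.sqrt(d)
    alpha = 0.0
    for _ in range(n_iter):
        t = alpha + X @ beta
        m = alpha + xbar @ beta                  # mean of the fitted values
        fp = 4 * t * (t**2 - 1)                  # f'(t)
        beta -= lr * (X.T @ fp / n + m * xbar)
        alpha -= lr * (fp.mean() + m)
    return np.sign(alpha + X @ beta), alpha, beta
```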

  16. Loss Function
  Clip $(x^2 - 1)^2 / 4$ to improve
  • concentration and robustness for statistics;
  • growth condition and smoothness for optimization.

  17. Example: Fashion-MNIST
  70000 fashion products, 10 categories (Xiao et al. 2017).
  • T-shirts/tops
  • Pullovers
  [Figure omitted: visualization by PCA]

  18. Example: Fashion-MNIST
  Goal: cluster 1000 T-shirts/tops and 1000 Pullovers.
  Alg.: gradient descent, random initialization from the unit sphere.
  Err.: CURE 5.2%, k-means 44.3%, spectral (vanilla) 41.9%, spectral (Gaussian kernel) 10.5%.
  Also works when the classes are imbalanced.
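  A pipeline sketch for this experiment, assuming the Keras Fashion-MNIST loader and the `vanilla_cure` sketch from slide 11; it follows the recipe on the slide (two classes, 1000 images each, random initialization) but is not tuned to reproduce the quoted error rates, and the preprocessing details are assumptions.

```python
import numpy as np
from tensorflow.keras.datasets import fashion_mnist

# Fashion-MNIST class indices (Xiao et al. 2017): 0 = T-shirt/top, 2 = Pullover.
(x_train, y_train), _ = fashion_mnist.load_data()
idx = np.concatenate([np.where(y_train == 0)[0][:1000],
                      np.where(y_train == 2)[0][:1000]])
X = x_train[idx].reshape(len(idx), -1).astype(float) / 255.0
y = np.where(y_train[idx] == 0, 1, -1)

y_hat, _ = vanilla_cure(X)        # gradient-descent sketch from slide 11
err = min(np.mean(y_hat != y), np.mean(-y_hat != y))
print(f"clustering error: {err:.1%}")
```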

  19. General CURE
  Given $\{x_i\}_{i=1}^n \subseteq \mathcal{X}$, find $f : \mathcal{X} \to \mathcal{Y}$ in $\mathcal{F}$ s.t. $\frac{1}{n}\sum_{i=1}^n \delta_{f(x_i)} \approx \sum_{j=1}^K \pi_j \delta_{y_j}$.
  [Figure omitted]

  20. General CURE
  Given $\{x_i\}_{i=1}^n \subseteq \mathcal{X}$, find $f : \mathcal{X} \to \mathcal{Y}$ in $\mathcal{F}$ s.t. $\frac{1}{n}\sum_{i=1}^n \delta_{f(x_i)} \approx \sum_{j=1}^K \pi_j \delta_{y_j}$.
  CURE: $\min_{f \in \mathcal{F}} D(f_{\#}\hat{\rho}_n, \nu)$.
  • Discrepancy measure: divergence; MMD; $W_p$;
  • Fashion-MNIST (10 classes), CNN + $W_1$: state-of-the-art;
  • Bridle et al. (1992), Krause et al. (2010), Springenberg (2015), Xie et al. (2016), Yang et al. (2017), Hu et al. (2017), Shaham et al. (2018).
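  One concrete instance of the discrepancy $D$ on this slide is the one-dimensional Wasserstein-1 distance between the pushforward of the data under a scalar map and a balanced two-point target; the sketch below assumes equal weights and an even sample size for simplicity. The multi-class image version on the slide would use a CNN for $f$ and a $K$-point target $\nu$ with the same idea.

```python
import numpy as np

def w1_to_two_point_target(t):
    """1-D Wasserstein-1 distance between the empirical distribution of the
    fitted values t_i = f(x_i) and the target (1/2) delta_{-1} + (1/2) delta_{+1}.
    For equal weights this is the mean absolute gap between sorted values and
    the sorted target atoms; n is assumed even for simplicity."""
    t = np.sort(np.asarray(t, dtype=float))
    n = len(t)
    target = np.concatenate([-np.ones(n // 2), np.ones(n - n // 2)])
    return float(np.abs(t - target).mean())
```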

  21. Clustering Algorithms
  • Generative: model (X, Y), then derive (Y | X)
  ‣ Distribution learning (EM, DBSCAN)
  ‣ Analogue: linear discriminant analysis
  • Discriminative: model (Y | X) directly; CURE belongs here.
  ‣ Criterion optimization (projection pursuit, transductive SVM)
  ‣ Analogue: logistic regression

  22. Clustering Algorithms
  Drawbacks of generative approaches:
  • Model dependency
  • Unnecessary parameters
  • Computational challenges
  • Strong conditions

  23. Clustering Algorithms
  Example: $\{x_i\}_{i=1}^n \sim \frac{1}{2}N(\mu, I_d) + \frac{1}{2}N(-\mu, I_d)$ with $d \gtrsim n$.
  • Parameter estimation: $\|\mu\|_2 \gtrsim \sqrt{d/n}$
  • Clustering: $\|\mu\|_2 \gtrsim (d/n)^{1/4}$
  Never ask for more than you need!

  24. Clustering via Uncoupled REgression
  • The CURE methodology
  • Theoretical guarantees

  25. Elliptical Mixture Model
  Main Assumptions:
  $x_i \sim (\mu_1, \Sigma)$ if $y_i = 1$; $x_i \sim (\mu_{-1}, \Sigma)$ if $y_i = -1$.
  • $x_i = \mu_{y_i} + \Sigma^{1/2} z_i$, with $\mathbb{P}(y_i = 1) = \mathbb{P}(y_i = -1) = 1/2$;
  • $z_i$ spherically symmetric, leptokurtic, sub-Gaussian.
  CURE: $\min_{\alpha \in \mathbb{R}, \beta \in \mathbb{R}^d} \left\{ \frac{1}{n}\sum_{i=1}^n f(\alpha + \beta^\top x_i) + \frac{1}{2}(\alpha + \beta^\top \bar{x})^2 \right\}$.
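  A sampler sketch for this data-generating process; the two-scale Gaussian mixture for $z_i$ is just one convenient choice that is spherically symmetric, leptokurtic, and sub-Gaussian, and the scales are illustrative rather than from the paper.

```python
import numpy as np

def sample_elliptical_mixture(n, mu1, mu_neg1, Sigma_half, seed=0):
    """Draw (x_i, y_i) with P(y_i = 1) = P(y_i = -1) = 1/2 and
    x_i = mu_{y_i} + Sigma^{1/2} z_i, where z_i is a bounded two-scale
    Gaussian mixture: spherically symmetric, leptokurtic, sub-Gaussian."""
    rng = np.random.default_rng(seed)
    d = len(mu1)
    y = rng.choice([-1, 1], size=n)
    scales = rng.choice([0.5, 1.5], size=n)              # illustrative scales
    z = scales[:, None] * rng.standard_normal((n, d))
    means = np.where((y == 1)[:, None], mu1, mu_neg1)    # mu_{y_i}
    X = means + z @ Sigma_half.T
    return X, y
```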

  26. Theoretical Guarantees
  Theorem (WYD'20). Suppose $n/d$ is large. The perturbed gradient descent algorithm (Jin et al. 2017) starting from 0 achieves statistical precision within $\tilde{O}\!\left(\frac{n}{d} \vee \frac{d^2}{n}\right)$ iterations (hiding polylog factors).

  27. Theoretical Guarantees
  Theorem (WYD'20). Suppose $n/d$ is large. The perturbed gradient descent algorithm (Jin et al. 2017) starting from 0 achieves statistical precision within $\tilde{O}\!\left(\frac{n}{d} \vee \frac{d^2}{n}\right)$ iterations (hiding polylog factors).
  • Efficient clustering for stretched mixtures without warm start;
  • Two terms: prices for accuracy (stat.) and smoothness (opt.);
  • Angular error: $\tilde{O}(\sqrt{d/n})$; excess risk: $\tilde{O}(d/n)$.
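  A schematic of the perturbed gradient descent routine referenced in the theorem, in the spirit of Jin et al. (2017): plain gradient steps plus a small random perturbation whenever the gradient is small enough to suggest a saddle. The thresholds, radius, and stopping rule below are placeholders, not the constants from the analysis.

```python
import numpy as np

def perturbed_gd(grad, d, n_iter=2000, lr=0.05, g_tol=1e-3, radius=1e-2, seed=0):
    """Schematic perturbed gradient descent: gradient steps from the zero
    initialization, plus a small random perturbation whenever the gradient is
    small (a candidate saddle point). Jin et al. (2017) additionally limit how
    often perturbations happen and check the decrease after each escape."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(d)
    for _ in range(n_iter):
        g = grad(beta)
        if np.linalg.norm(g) < g_tol:
            beta = beta + radius * rng.standard_normal(d)   # escape perturbation
        else:
            beta = beta - lr * g
    return beta
```

  Plugging in the empirical CURE gradient, e.g. `grad = lambda b: X.T @ (4 * (X @ b) * ((X @ b)**2 - 1)) / len(X)`, runs the scheme from the theorem's zero start; the first perturbation immediately moves it off the local maximum at 0 from slide 28.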

  28. Proof Sketch: Population
  Consider the centered case $x_i \sim (\pm\mu, \Sigma)$ and $\min_{\beta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n f(\beta^\top x_i)$.
  Theorem (population landscape). Let $f(x) = (x^2 - 1)^2 / 4$. For the infinite-sample loss:
  • Two minima $\pm\beta^*$, where $\beta^* \propto \Sigma^{-1}\mu$, locally strongly convex;
  • Local maximum at 0; all saddles are strict.

  29. Loss Function
  Clip $(x^2 - 1)^2 / 4$ to improve
  • concentration and robustness for statistics;
  • growth condition and smoothness for optimization.

  30. Proof Sketch: Finite Samples
  Theorem (empirical landscape). Suppose $n/d$ is large and let $\hat{L}(\beta) = \frac{1}{n}\sum_{i=1}^n f(\beta^\top x_i)$. W.h.p.,
  • Approximate second-order stationary points are good;
  • $\nabla\hat{L}$ is $\tilde{O}(1)$-Lipschitz, $\nabla^2\hat{L}$ is $\tilde{O}(1 \vee \frac{d}{\sqrt{n}})$-Lipschitz.
  A nice landscape ensures efficiency and accuracy of optimization.

  31. Proof Sketch: Finite Samples
  Theorem (empirical landscape). Suppose $n/d$ is large and let $\hat{L}(\beta) = \frac{1}{n}\sum_{i=1}^n f(\beta^\top x_i)$. W.h.p.,
  • Approximate second-order stationary points are good: if $\|\nabla\hat{L}(\beta)\|_2 \le \delta$ and $\lambda_{\min}[\nabla^2\hat{L}(\beta)] \ge -\delta$, then
  $\|\beta - \beta^*\|_2 \lesssim \underbrace{\|\nabla\hat{L}(\beta)\|_2}_{\text{opt err.}} + \underbrace{\sqrt{\tfrac{d}{n}\log\tfrac{n}{d}}}_{\text{stat err.}}$;
  • $\nabla\hat{L}$ is $\tilde{O}(1)$-Lipschitz, $\nabla^2\hat{L}$ is $\tilde{O}(1 \vee \frac{d}{\sqrt{n}})$-Lipschitz.
  A nice landscape ensures efficiency and accuracy of optimization.

  32. Summary
  A general CURE for clustering problems.
  Wang, Yan and Díaz. Efficient clustering for stretched mixtures: landscape and optimality. Submitted.
  ‣ Clustering → classification;
  ‣ Flexible choices of transforms, out-of-sample extensions;
  ‣ Statistical and computational guarantees under mixture models.
  Extensions:
  ‣ High dimensions, significance testing, model selection;
  ‣ Representation learning, semi-supervised version.

  33. Q & A

  34. Thank you!
