Clustering via Uncoupled REgression (CURE)
Kaizheng Wang, Department of ORFE, Princeton University
May 8th, 2020
Collaborators: Yuling Yan (Princeton ORFE), Mateo Díaz (Cornell CAM)
Clustering
Spherical Clusters
$\{x_i\}_{i=1}^n \sim \frac{1}{2}\mathcal{N}(\mu, I_d) + \frac{1}{2}\mathcal{N}(-\mu, I_d)$
• PCA: $\max_{\beta \in S^{d-1}} \frac{1}{n}\sum_{i=1}^n (\beta^\top x_i)^2$;
• k-means: $\min_{\mu_1, \mu_2, y} \frac{1}{n}\sum_{i=1}^n \| x_i - \mu_{y_i} \|_2^2$;
• SDP relaxations of k-means, etc.;
• Density-based methods require large samples.
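For concreteness, a minimal NumPy sketch of the spherical baseline (not from the slides; the sample size, dimension, and mean vector are illustrative choices): draw the balanced mixture above and cluster by the sign of the projection onto the top principal component.

import numpy as np

rng = np.random.default_rng(0)
n, d = 2000, 50
mu = np.zeros(d)
mu[0] = 2.0                                         # illustrative mean direction
y = rng.choice([-1, 1], size=n)                     # balanced latent labels
x = y[:, None] * mu + rng.standard_normal((n, d))   # 1/2 N(mu, I_d) + 1/2 N(-mu, I_d)

xc = x - x.mean(axis=0)                             # center the data
_, _, vt = np.linalg.svd(xc, full_matrices=False)
beta = vt[0]                                        # top principal (max-variance) direction

y_hat = np.sign(xc @ beta)
err = min(np.mean(y_hat != y), np.mean(y_hat == y)) # labels are defined up to a sign flip
print(f"PCA misclustering rate: {err:.3f}")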
Finding a Needle in a Haystack
They are powerful but not omnipotent. For the mixture $\frac{1}{2}\mathcal{N}(\mu, \Sigma) + \frac{1}{2}\mathcal{N}(-\mu, \Sigma)$, the covariance is $\mu\mu^\top + \Sigma$.
• The max-variance direction need not be a useful direction;
• PCA works when $\Sigma \approx I$ or $\|\mu\|_2^2 / \|\Sigma\|_2 \gg 1$;
• Reduction to the spherical case? Estimation of $\Sigma$ is difficult!
Headaches
• PCA and many other methods require nice cluster shapes and large separations.
• Learning with non-convex losses typically proceeds in two stages: 1. initialization (e.g. spectral methods); 2. refinement (e.g. gradient descent). Commonly-used assumptions: isotropic, Gaussian, uniform, etc.
Stretched mixtures can be catastrophic for such schemes.
Clustering via Uncoupled REgression
• The CURE methodology
• Theoretical guarantees
Vanilla CURE
Given centered $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, want $\beta \in \mathbb{R}^d$ such that $\beta^\top x_i \approx y_i$, $i \in [n]$.
Clustering via Uncoupled REgression:
$\frac{1}{n}\sum_{i=1}^n \delta_{\beta^\top x_i} \approx \frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}$.
CURE: take $f$ with valleys at $\pm 1$, e.g. $f(x) = (x^2 - 1)^2$; solve $\hat\beta \in \arg\min_{\beta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n f(\beta^\top x_i)$; return $\hat y_i = \mathrm{sgn}(\hat\beta^\top x_i)$.
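A minimal sketch of vanilla CURE under simplifying assumptions: the unclipped quartic loss $f(x) = (x^2 - 1)^2$, plain gradient descent with a fixed step size, and a random starting point (the paper's algorithm uses a clipped loss and perturbed gradient descent).

import numpy as np

def cure_vanilla(x, steps=500, lr=0.05, seed=0):
    """Fit beta so that the projections beta^T x_i concentrate near {-1, +1}."""
    rng = np.random.default_rng(seed)
    n, d = x.shape
    beta = rng.standard_normal(d) / np.sqrt(d)   # small random initialization
    for _ in range(steps):
        t = x @ beta                             # projections beta^T x_i
        grad = (4 * t * (t**2 - 1)) @ x / n      # gradient of (1/n) sum (t_i^2 - 1)^2
        beta -= lr * grad
    return beta

# beta_hat = cure_vanilla(x)
# y_hat = np.sign(x @ beta_hat)                  # decision rule sgn(beta^T x_i)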
Vanilla CURE
$\frac{1}{n}\sum_{i=1}^n f(\beta^\top x_i)$ is non-convex by nature.
• Projection pursuit (Friedman and Tukey, 1974), ICA (Hyvärinen and Oja, 2000):
‣ maximize deviation from the null (Gaussian);
‣ limited algorithmic guarantees.
• Phase retrieval (Candès et al. 2011):
‣ isotropic measurements, spectral initialization.
Vanilla CURE with Intercept
Given $\{x_i\}_{i=1}^n \subseteq \mathbb{R}^d$, find $\beta \in \mathbb{R}^d$ and $\alpha \in \mathbb{R}$ s.t.
$\frac{1}{n}\sum_{i=1}^n \delta_{\alpha + \beta^\top x_i} \approx \frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}$.
The naïve extension $\min_{\alpha \in \mathbb{R}, \beta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n f(\alpha + \beta^\top x_i)$ yields trivial solutions $(\hat\alpha, \hat\beta) = (\pm 1, \mathbf{0})$: it only forces $|\alpha + \beta^\top x_i| \approx 1$ rather than $\#\{i : \alpha + \beta^\top x_i \approx 1\} \approx n/2$.
CURE: $\min_{\alpha \in \mathbb{R}, \beta \in \mathbb{R}^d} \left\{ \frac{1}{n}\sum_{i=1}^n f(\alpha + \beta^\top x_i) + \frac{1}{2}(\alpha + \beta^\top \bar{x})^2 \right\}$.
• The term $\frac{1}{n}\sum_{i=1}^n f(\alpha + \beta^\top x_i)$ forces $|\alpha + \beta^\top x_i| \approx 1$;
• The term $\frac{1}{2}(\alpha + \beta^\top \bar{x})^2$ forces $\#\{i : \alpha + \beta^\top x_i \approx 1\} \approx n/2$ (moment matching).
Extension: imbalanced cases.
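A minimal sketch of the intercept version, again with the unclipped quartic loss and a generic quasi-Newton solver in place of the paper's algorithm; the added quadratic term is exactly the moment-matching penalty above.

import numpy as np
from scipy.optimize import minimize

def cure_with_intercept(x, seed=0):
    n, d = x.shape
    xbar = x.mean(axis=0)

    def objective_and_grad(theta):
        alpha, beta = theta[0], theta[1:]
        t = alpha + x @ beta
        m = alpha + xbar @ beta                  # mean of the fitted values
        value = np.mean((t**2 - 1) ** 2) + 0.5 * m**2
        ft = 4 * t * (t**2 - 1)                  # derivative of (t^2 - 1)^2
        g_alpha = ft.mean() + m
        g_beta = ft @ x / n + m * xbar
        return value, np.concatenate(([g_alpha], g_beta))

    rng = np.random.default_rng(seed)
    theta0 = np.concatenate(([0.0], rng.standard_normal(d) / np.sqrt(d)))
    res = minimize(objective_and_grad, theta0, jac=True, method="BFGS")
    return res.x[0], res.x[1:]

# alpha_hat, beta_hat = cure_with_intercept(x)
# y_hat = np.sign(alpha_hat + x @ beta_hat)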
Loss Function
Clip $(x^2 - 1)^2 / 4$ to improve
• concentration and robustness for statistics;
• growth condition and smoothness for optimization.
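One way to carry out such a clipping, purely as an illustration (the exact construction in the paper may differ): keep the quartic on a bounded interval and continue it linearly, which caps the growth and keeps the loss continuously differentiable.

import numpy as np

def clipped_loss(t, c=2.0):
    """Quartic (t^2 - 1)^2 / 4 on |t| <= c, linear continuation beyond."""
    t = np.asarray(t, dtype=float)
    base = (t**2 - 1) ** 2 / 4
    slope = c * (c**2 - 1)                       # derivative of the quartic at t = c
    value_c = (c**2 - 1) ** 2 / 4
    tail = value_c + slope * (np.abs(t) - c)     # C^1 continuation at |t| = c
    return np.where(np.abs(t) <= c, base, tail)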
Example: Fashion-MNIST
70,000 fashion products, 10 categories (Xiao et al. 2017).
• T-shirts/tops
• Pullovers
Visualization by PCA.
Example: Fashion-MNIST
Goal: cluster 1000 T-shirts/tops and 1000 Pullovers.
Algorithm: gradient descent, random initialization from the unit sphere.
Error rates: CURE 5.2%, k-means 44.3%, spectral (vanilla) 41.9%, spectral (Gaussian kernel) 10.5%.
Also works when the classes are imbalanced.
General CURE
Given $\{x_i\}_{i=1}^n \subseteq \mathcal{X}$, find $f : \mathcal{X} \to \mathcal{Y}$ in $\mathcal{F}$ s.t.
$\frac{1}{n}\sum_{i=1}^n \delta_{f(x_i)} \approx \sum_{j=1}^K \pi_j \delta_{y_j}$.
CURE: $\min_{f \in \mathcal{F}} D(f_\# \hat\rho_n, \nu)$.
• Discrepancy measure $D$: divergence; MMD; $W_p$;
• Fashion-MNIST (10 classes), CNN + $W_1$: state-of-the-art;
• Bridle et al. (1992), Krause et al. (2010), Springenberg (2015), Xie et al. (2016), Yang et al. (2017), Hu et al. (2017), Shaham et al. (2018).
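For intuition about the discrepancy term, a sketch (an illustration, not the paper's implementation) of the one-dimensional $W_1$ distance between the empirical distribution of the outputs $f(x_i)$ and a two-atom target $\frac{1}{2}\delta_{-1} + \frac{1}{2}\delta_{1}$, computed by matching sorted outputs to target quantiles; in the general method a richer map $f$ (e.g. a CNN) is trained to minimize such a discrepancy.

import numpy as np

def w1_to_atoms(outputs, atoms=(-1.0, 1.0), weights=(0.5, 0.5)):
    """1-D W_1 distance from the empirical law of `outputs` to sum_j pi_j delta_{y_j}.
    Atoms are assumed to be sorted in increasing order."""
    out = np.sort(np.asarray(outputs, dtype=float))
    n = out.size
    counts = np.floor(np.asarray(weights) * n).astype(int)
    counts[-1] = n - counts[:-1].sum()           # make the atom counts sum to n
    target = np.repeat(np.asarray(atoms, dtype=float), counts)
    return np.mean(np.abs(out - target))

# Example: w1_to_atoms(x @ beta_hat) measures how far the fitted projections
# are from the two-point target distribution.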
Clustering Algorithms
• Generative: model $(X, Y)$, then derive $(Y \mid X)$
‣ distribution learning (EM, DBSCAN);
‣ supervised analogue: linear discriminant analysis.
• Discriminative: model $(Y \mid X)$ directly (CURE belongs to this family)
‣ criterion optimization (projection pursuit, transductive SVM);
‣ supervised analogue: logistic regression.
Clustering Algorithms
Drawbacks of generative approaches:
• model dependency;
• unnecessary parameters;
• computational challenges;
• strong conditions.
Clustering Algorithms
Example: $\{x_i\}_{i=1}^n \sim \frac{1}{2}\mathcal{N}(\mu, I_d) + \frac{1}{2}\mathcal{N}(-\mu, I_d)$ with $d \gg n$.
• Parameter estimation requires $\|\mu\|_2 \gtrsim \sqrt{d/n}$;
• Clustering only requires $\|\mu\|_2 \gtrsim (d/n)^{1/4}$, a much weaker condition when $d \gg n$ (e.g. $d/n = 100$ gives thresholds $10$ vs. roughly $3.2$).
Never ask for more than you need!
Clustering via Uncoupled REgression
• The CURE methodology
• Theoretical guarantees
Elliptical Mixture Model: Main Assumptions
$x_i \sim (\mu_1, \Sigma)$ if $y_i = 1$; $x_i \sim (\mu_{-1}, \Sigma)$ if $y_i = -1$.
• $x_i = \mu_{y_i} + \Sigma^{1/2} z_i$, $\mathbb{P}(y_i = 1) = \mathbb{P}(y_i = -1) = 1/2$;
• $z_i$ spherically symmetric, leptokurtic, sub-Gaussian.
CURE: $\min_{\alpha \in \mathbb{R}, \beta \in \mathbb{R}^d} \left\{ \frac{1}{n}\sum_{i=1}^n f(\alpha + \beta^\top x_i) + \frac{1}{2}(\alpha + \beta^\top \bar{x})^2 \right\}$.
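A minimal data generator matching these assumptions in the centered case $\mu_{\pm 1} = \pm\mu$, with Gaussian $z_i$ for simplicity (spherically symmetric and sub-Gaussian; the model allows more general elliptical noise). The commented covariance is an illustrative "stretched" choice.

import numpy as np

def sample_elliptical_mixture(n, mu, sigma, seed=0):
    """x_i = mu_{y_i} + Sigma^{1/2} z_i with balanced labels and mu_{+-1} = +-mu."""
    rng = np.random.default_rng(seed)
    y = rng.choice([-1, 1], size=n)              # P(y = 1) = P(y = -1) = 1/2
    z = rng.standard_normal((n, mu.shape[0]))    # Gaussian z_i, a special elliptical case
    sigma_half = np.linalg.cholesky(sigma)       # one valid square root of Sigma
    x = y[:, None] * mu + z @ sigma_half.T
    return x, y

# Illustrative "stretched" mixture:
# d = 20
# mu = np.r_[1.5, np.zeros(d - 1)]
# sigma = np.diag(np.r_[1.0, 9.0, np.ones(d - 2)])
# x, y = sample_elliptical_mixture(2000, mu, sigma)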
Theoretical Guarantees
Theorem (WYD'20). Suppose $n/d$ is large. The perturbed gradient descent algorithm (Jin et al. 2017) starting from 0 achieves statistical precision within $\tilde{O}\left( \frac{n}{d} \vee \frac{d^2}{n} \right)$ iterations (hiding polylog factors).
• Efficient clustering for stretched mixtures without warm start;
• Two terms: prices for accuracy (statistics) and smoothness (optimization);
• Angular error: $\tilde{O}(\sqrt{d/n})$; excess risk: $\tilde{O}(d/n)$.
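A simplified sketch in the spirit of perturbed gradient descent (Jin et al. 2017): run gradient descent and inject a small random perturbation whenever the gradient is tiny, so that strict saddles and the local maximum at 0 can be escaped. The step size, threshold, and perturbation radius are illustrative choices, not the constants from the analysis.

import numpy as np

def perturbed_gd(grad_fn, theta0, steps=2000, lr=0.05, g_thresh=1e-3, radius=1e-2, seed=0):
    rng = np.random.default_rng(seed)
    theta = np.array(theta0, dtype=float)
    for _ in range(steps):
        g = grad_fn(theta)
        if np.linalg.norm(g) < g_thresh:
            # near a first-order stationary point: perturb to escape strict saddles
            theta = theta + radius * rng.standard_normal(theta.shape)
        else:
            theta = theta - lr * g
    return theta

# With the (unclipped) CURE loss and the start from 0 as in the theorem:
# grad = lambda b: (4 * (x @ b) * ((x @ b) ** 2 - 1)) @ x / len(x)
# beta_hat = perturbed_gd(grad, np.zeros(x.shape[1]))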
Proof Sketch: Population
Consider the centered case $x_i \sim (\pm\mu, \Sigma)$:
$\min_{\beta \in \mathbb{R}^d} \frac{1}{n}\sum_{i=1}^n f(\beta^\top x_i)$.
Theorem (population landscape). Let $f(x) = (x^2 - 1)^2 / 4$. For the infinite-sample loss:
• two minima $\pm\beta^*$, where $\beta^* \propto \Sigma^{-1}\mu$, each locally strongly convex;
• local maximum at $0$; all saddle points are strict.
Loss Function (recap)
Clip $(x^2 - 1)^2 / 4$ to improve
• concentration and robustness for statistics;
• growth condition and smoothness for optimization.
Proof Sketch: Finite Samples
Theorem (empirical landscape). Suppose $n/d$ is large and let $\hat{L}(\beta) = \frac{1}{n}\sum_{i=1}^n f(\beta^\top x_i)$. With high probability,
• approximate second-order stationary points are good: if $\|\nabla \hat{L}(\beta)\|_2 \le \delta$ and $\lambda_{\min}[\nabla^2 \hat{L}(\beta)] \ge -\delta$, then
$\|\beta - \beta^*\|_2 \lesssim \underbrace{\|\nabla \hat{L}(\beta)\|_2}_{\text{opt. err.}} + \underbrace{\sqrt{\tfrac{d}{n} \log\left(\tfrac{n}{d}\right)}}_{\text{stat. err.}};$
• $\nabla \hat{L}$ is $\tilde{O}(1)$-Lipschitz, $\nabla^2 \hat{L}$ is $\tilde{O}\left(1 \vee \tfrac{d}{\sqrt{n}}\right)$-Lipschitz.
The nice landscape ensures efficiency and accuracy of optimization.
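A numerical counterpart of the approximate second-order stationarity condition, for the unclipped quartic loss (illustration only): at a candidate point, report the gradient norm and the smallest Hessian eigenvalue; by the theorem, a small gradient plus not-too-negative curvature certifies closeness to $\pm\beta^*$.

import numpy as np

def stationarity_check(x, beta):
    """Gradient norm and smallest Hessian eigenvalue of L(beta) = (1/n) sum ((beta^T x_i)^2 - 1)^2."""
    n = x.shape[0]
    t = x @ beta
    grad = (4 * t * (t**2 - 1)) @ x / n          # gradient of the empirical loss
    hess = (x.T * (12 * t**2 - 4)) @ x / n       # (1/n) sum f''(t_i) x_i x_i^T
    lam_min = np.linalg.eigvalsh(hess)[0]        # smallest eigenvalue
    return np.linalg.norm(grad), lam_min

# grad_norm, lam_min = stationarity_check(x, beta_hat)
# grad_norm <= delta and lam_min >= -delta  certify an approximate SOSP.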
Summary
A general CURE for clustering problems.
Wang, Yan and Díaz. Efficient clustering for stretched mixtures: landscape and optimality. Submitted.
‣ Clustering -> classification;
‣ Flexible choices of transforms, out-of-sample extensions;
‣ Statistical and computational guarantees under mixture models.
Extensions
‣ High dimensions, significance testing, model selection;
‣ Representation learning, semi-supervised version.
Q & A
Thank you!