Large-Scale Sparse Kernel Canonical Correlation Analysis

Viivi Uurtio (1), Sahely Bhadra (2), and Juho Rousu (1)
(1) Department of Computer Science, Aalto University; Helsinki Institute for Information Technology HIIT
(2) Indian Institute of Technology (IIT) Palakkad

ICML 2019, June 11, 2019
From large two-view datasets, it is not straightforward to identify which of the variables are related

    \frac{\langle Xu, Yv \rangle}{\| Xu \|_2 \, \| Yv \|_2}

→ In standard CCA, we identify the related variables from u and v
→ In the non-linear and/or large-scale variants, we cannot access the u and v

              Scalability   u and v
  Kernel CCA      ✗            ✗
  RF KCCA         ✓            ✗
  KNOI            ✓            ✗
  Deep CCA        ✓            ✗
  SCCA-HSIC       ✗            ✓
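To make the objective above concrete, the following is a minimal NumPy sketch (not from the paper) that evaluates the linear CCA correlation ⟨Xu, Yv⟩ / (‖Xu‖₂ ‖Yv‖₂) and shows how the nonzero entries of u and v point to the related variables. The toy data, the planted relationship, and the hand-picked weight vectors are assumptions for illustration only.

```python
# Minimal sketch (not the authors' code): evaluating the linear CCA objective
# <Xu, Yv> / (||Xu||_2 ||Yv||_2) for given weight vectors u and v.
import numpy as np

rng = np.random.default_rng(0)
n = 500
X = rng.standard_normal((n, 5))                    # view 1: 5 variables
Y = rng.standard_normal((n, 4))                    # view 2: 4 variables
Y[:, 0] = X[:, 2] + 0.1 * rng.standard_normal(n)   # plant one related pair of variables

def cca_objective(X, Y, u, v):
    """Correlation of the projections: <Xu, Yv> / (||Xu||_2 ||Yv||_2)."""
    Xu, Yv = X @ u, Y @ v
    return (Xu @ Yv) / (np.linalg.norm(Xu) * np.linalg.norm(Yv))

u = np.array([0.0, 0.0, 1.0, 0.0, 0.0])   # nonzero entry -> variable 2 of X
v = np.array([1.0, 0.0, 0.0, 0.0])        # nonzero entry -> variable 0 of Y
print(cca_objective(X, Y, u, v))           # close to 1: these two variables are related
```

Sparse weight vectors like these are what make the solution interpretable; the table above summarizes which existing methods still expose such u and v once the model becomes non-linear or large-scale.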
gradKCCA is a kernel-matrix-free method that efficiently optimizes u and v

Let k_x(u) = (k_x(x_i, u))_{i=1}^{n} and k_y(v) = (k_y(y_i, v))_{i=1}^{n}

    \max_{u, v} \; \rho_{\mathrm{gradKCCA}}(u, v) = \frac{k_x(u)^\top k_y(v)}{\| k_x(u) \|_2 \, \| k_y(v) \|_2}
    \quad \text{s.t.} \quad \| u \|_{P_x} \le s_u \ \text{and} \ \| v \|_{P_y} \le s_v

Maximum through alternating projected gradient ascent

Optimization steps for u:
→ Compute the gradient \nabla \rho_u = \partial \rho(u, v) / \partial u
→ Step size using line search: \max_{\gamma} \rho(u + \gamma \nabla \rho_u)
→ Gradient step towards the maximum: u_{\mathrm{grad}} = u + \gamma^* \nabla \rho_u
→ Project onto the \ell_{P_x} ball: u = \Pi_{\| \cdot \|_{P_x} \le s_u}(u_{\mathrm{grad}})
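Below is a minimal NumPy sketch of these alternating updates, under illustrative assumptions that are not the authors' implementation: a homogeneous polynomial kernel k_x(x_i, u) = (x_iᵀu)^d, ℓ1-norm constraints (P_x = P_y = 1), a crude grid-based line search, and a fixed iteration count. All function names and hyperparameters are hypothetical.

```python
import numpy as np

def poly_feats(X, w, d):
    """k(w) = ((x_i^T w)^d)_{i=1..n} for a homogeneous polynomial kernel."""
    z = X @ w
    return z ** d, d * z ** (d - 1)          # kernel vector and elementwise derivative of z^d

def rho(a, b):
    """rho = <a, b> / (||a||_2 ||b||_2)."""
    return (a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

def rho_grad(X, w, b, d):
    """Gradient of rho(k(w), b) with respect to w, by the chain rule."""
    a, dz = poly_feats(X, w, d)
    r = rho(a, b)
    grad_a = b / (np.linalg.norm(a) * np.linalg.norm(b)) - r * a / (a @ a)
    return X.T @ (dz * grad_a)

def project_l1(w, s):
    """Euclidean projection of w onto the l1 ball of radius s (Duchi et al., 2008)."""
    if np.abs(w).sum() <= s:
        return w
    mu = np.sort(np.abs(w))[::-1]
    cssv = np.cumsum(mu)
    idx = np.nonzero(mu * (np.arange(len(w)) + 1) > cssv - s)[0][-1]
    theta = (cssv[idx] - s) / (idx + 1.0)
    return np.sign(w) * np.maximum(np.abs(w) - theta, 0.0)

def gradkcca_sketch(X, Y, d=2, s_u=1.0, s_v=1.0, iters=100, seed=0):
    """Alternating projected gradient ascent on rho(k_x(u), k_y(v))."""
    rng = np.random.default_rng(seed)
    u = project_l1(rng.standard_normal(X.shape[1]), s_u)
    v = project_l1(rng.standard_normal(Y.shape[1]), s_v)
    gammas = np.logspace(-3, 1, 20)                        # crude line-search grid
    for _ in range(iters):
        # u-step: gradient, line search for the step size, gradient step, projection
        b = poly_feats(Y, v, d)[0]
        g = rho_grad(X, u, b, d)
        gamma = max(gammas, key=lambda t: rho(poly_feats(X, u + t * g, d)[0], b))
        u = project_l1(u + gamma * g, s_u)
        # v-step: the same steps with the roles of the two views swapped
        a = poly_feats(X, u, d)[0]
        g = rho_grad(Y, v, a, d)
        gamma = max(gammas, key=lambda t: rho(poly_feats(Y, v + t * g, d)[0], a))
        v = project_l1(v + gamma * g, s_v)
    return u, v, rho(poly_feats(X, u, d)[0], poly_feats(Y, v, d)[0])
```

Note that the loop only ever forms the n-dimensional vectors k_x(u) and k_y(v), never an n × n kernel matrix, which is what "kernel-matrix-free" refers to and why the method scales to large sample sizes.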
Experiments demonstrate noise tolerance, scalability, and superior speed of gradKCCA

[Figure: F1 score and AUC (train and test) as the proportion of noise variables grows from 0.6 to 0.98, for DCCA, KNOI, KCCA, gradKCCA, KCCA preimage, and SCCA-HSIC]
[Figure: F1 score and running time (train and test) as the sample size grows from 10^3 to 10^6, for gradKCCA, DCCA, RCCA, KNOI, and SCCA-HSIC; running times span seconds to hours]

  MediaMill     ρ train         ρ test          Time (s)
  gradKCCA      0.666 ± 0.004   0.657 ± 0.007   8 ± 4
  Deep CCA      0.643 ± 0.005   0.633 ± 0.003   1280 ± 112
  RF KCCA       0.633 ± 0.001   0.626 ± 0.005   23 ± 9
  KNOI          0.652 ± 0.001   0.645 ± 0.003   218 ± 73
  SCCA-HSIC     0.627 ± 0.004   0.625 ± 0.002   1804 ± 143