Falkon: optimal and efficient large scale kernel learning
Alessandro Rudi (INRIA - École Normale Supérieure)
joint work with Luigi Carratino (UniGe), Lorenzo Rosasco (MIT - IIT)
July 6th – ISMP 2018
Learning problem

The problem (P): find
$f_{\mathcal H} = \operatorname{argmin}_{f \in \mathcal H} \mathcal E(f)$, where $\mathcal E(f) = \int (y - f(x))^2 \, d\rho(x, y)$,
with $\rho$ unknown but given $(x_i, y_i)_{i=1}^n$ i.i.d. samples.

Basic assumptions:
◮ Tail assumption: $\int |y|^p \, d\rho \le \frac{1}{2}\, p!\, \sigma^2 b^{p-2}$ for all $p \ge 2$
◮ $(\mathcal H, \langle \cdot, \cdot \rangle_{\mathcal H})$ RKHS with bounded kernel $K$
Kernel ridge regression

$\hat f_\lambda = \operatorname{argmin}_{f \in \mathcal H} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal H}^2$

$\hat f_\lambda(x) = \sum_{i=1}^n K(x, x_i)\, c_i$, with $c$ solving $(\hat K + \lambda n I)\, c = \hat y$

Complexity: Space $O(n^2)$, Kernel eval. $O(n^2)$, Time $O(n^3)$
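A minimal sketch of the KRR system above (not from the slides), assuming a Gaussian kernel; the helper names `gaussian_kernel`, `krr_fit`, `krr_predict` are illustrative:

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kernel(X, Z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2)), for all pairs of rows of X and Z.
    return np.exp(-cdist(X, Z, 'sqeuclidean') / (2 * sigma ** 2))

def krr_fit(X, y, lam, sigma=1.0):
    # Solve (K + lambda * n * I) c = y: O(n^2) space/kernel evals, O(n^3) time.
    n = X.shape[0]
    K = gaussian_kernel(X, X, sigma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def krr_predict(X_train, c, X_test, sigma=1.0):
    # f(x) = sum_i K(x, x_i) c_i
    return gaussian_kernel(X_test, X_train, sigma) @ c
```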
Random projections

Solve ($P_n$) on $\mathcal H_M = \operatorname{span}\{ K(\tilde x_1, \cdot), \ldots, K(\tilde x_M, \cdot) \}$:

$\hat f_{\lambda,M} = \operatorname{argmin}_{f \in \mathcal H_M} \frac{1}{n} \sum_{i=1}^n (y_i - f(x_i))^2 + \lambda \|f\|_{\mathcal H}^2$

◮ ... that is, pick $M$ columns at random

$\hat f_{\lambda,M}(x) = \sum_{i=1}^M K(x, \tilde x_i)\, c_i$, with $c$ solving $(\hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM})\, c = \hat K_{nM}^\top \hat y$

- Nyström methods (Smola, Schölkopf '00)
- Gaussian processes: inducing inputs (Quiñonero-Candela et al. '05)
- Galerkin methods and randomized linear algebra (Halko et al. '11)
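Correspondingly, a hedged sketch of the Nyström system with uniform sampling of the $M$ centers (the kernel helper is repeated so the snippet is self-contained; all names are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

def gaussian_kernel(X, Z, sigma=1.0):
    # K(x, z) = exp(-||x - z||^2 / (2 sigma^2)) for all pairs of rows.
    return np.exp(-cdist(X, Z, 'sqeuclidean') / (2 * sigma ** 2))

def nystrom_krr_fit(X, y, M, lam, sigma=1.0, seed=0):
    # Pick M Nystrom centers uniformly at random and solve
    #   (K_nM^T K_nM + lam * n * K_MM) c = K_nM^T y.
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    centers = X[rng.choice(n, size=M, replace=False)]
    K_nM = gaussian_kernel(X, centers, sigma)        # n x M kernel evaluations
    K_MM = gaussian_kernel(centers, centers, sigma)  # M x M
    c = np.linalg.solve(K_nM.T @ K_nM + lam * n * K_MM,
                        K_nM.T @ y)                  # O(n M^2 + M^3) time
    return centers, c

def nystrom_predict(centers, c, X_test, sigma=1.0):
    # f(x) = sum_{i=1}^M K(x, x_tilde_i) c_i
    return gaussian_kernel(X_test, centers, sigma) @ c
```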
Nyström KRR: Statistics (refined)

Let $L f(x') = \mathbb E\, K(x', x) f(x)$ and $\mathcal N(\lambda) = \operatorname{Tr}\big((L + \lambda I)^{-1} L\big)$.

Capacity condition: $\mathcal N(\lambda) = O(\lambda^{-\gamma})$, $\gamma \in [0, 1]$
Source condition: $f_{\mathcal H} \in \operatorname{Range}(L^r)$, $r \ge 1/2$

Theorem [Rudi, Camoriano, Rosasco '15]
Under (basic) and (refined),
$\mathbb E\, \mathcal E(\hat f_{\lambda,M}) - \mathcal E(f_{\mathcal H}) \lesssim \frac{\mathcal N(\lambda)}{n} + \lambda^{2r} + \frac{1}{M}$.
By selecting $\lambda_n = n^{-\frac{1}{2r+\gamma}}$ and $M_n = \frac{1}{\lambda_n}$,
$\mathbb E\, \mathcal E(\hat f_{\lambda_n,M_n}) - \mathcal E(f_{\mathcal H}) \lesssim n^{-\frac{2r}{2r+\gamma}}$.
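As a worked instance (illustrative, not on the slide): plugging the worst-case capacity $\gamma = 1$ and the minimal source condition $r = 1/2$ into the theorem gives exactly the regime quoted in the next slide:

```latex
% Illustrative instance of the parameter choice (gamma = 1, r = 1/2):
\[
  \lambda_n = n^{-\frac{1}{2r+\gamma}} = n^{-1/2}, \qquad
  M_n = \lambda_n^{-1} = \sqrt{n}, \qquad
  \mathbb{E}\,\mathcal{E}(\hat f_{\lambda_n, M_n}) - \mathcal{E}(f_{\mathcal H})
  \;\lesssim\; n^{-\frac{2r}{2r+\gamma}} = n^{-1/2}.
\]
```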
Remarks

◮ $M = O(\sqrt n)$ suffices for $O(1/\sqrt n)$ rates
◮ Previous works: only for fixed design (Bach '13; Alaoui, Mahoney '15; Yang et al. '15; Musco, Musco '16)
◮ Same minimax bound as KRR [Caponnetto, De Vito '05]
◮ Projection regularizes!
Computations required for the $O(1/\sqrt n)$ rate

Space: $O(n)$
Kernel eval.: $O(n\sqrt n)$
Time: $O(n^2)$
Test: $O(\sqrt n)$

Possible improvements:
◮ adaptive sampling
◮ optimization
Optimization to the rescue

$\underbrace{(\hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM})}_{=:\, H}\, c = \underbrace{\hat K_{nM}^\top \hat y}_{=:\, b}$

Idea: first order methods (gradient descent)
$c_t = c_{t-1} - \frac{\tau}{n} \left[ \hat K_{nM}^\top (\hat K_{nM} c_{t-1} - \hat y) + \lambda n \hat K_{MM} c_{t-1} \right]$

Pros: requires $O(nMt)$
Cons: $t \propto \kappa(H)$, which can be arbitrarily large; $\kappa(H) = \sigma_{\max}(H)/\sigma_{\min}(H)$ is the condition number.
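A sketch of this unpreconditioned iteration (illustrative; the step size `tau` and iteration budget `t_max` are assumed to be chosen by the user):

```python
import numpy as np

def nystrom_gd(K_nM, K_MM, y, lam, tau, t_max):
    # c_t = c_{t-1} - (tau/n) [ K_nM^T (K_nM c_{t-1} - y) + lam * n * K_MM c_{t-1} ]
    n, M = K_nM.shape
    c = np.zeros(M)
    for _ in range(t_max):                                    # each step costs O(nM)
        grad = K_nM.T @ (K_nM @ c - y) + lam * n * (K_MM @ c)
        c -= (tau / n) * grad
    return c
```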
Preconditioning

Idea: solve an equivalent linear system with a better condition number.

Preconditioning: $H c = b \;\mapsto\; P^\top H P \beta = P^\top b$, with $c = P\beta$.

Ideally $P P^\top = H^{-1}$, so that $t = O(\kappa(H)) \mapsto t = O(1)$!

Note: preconditioning KRR (Fasshauer et al. '12; Avron et al. '16; Cutajar et al. '16; Ma, Belkin '17), where $H = \hat K + \lambda n I$.

Can we precondition Nyström-KRR?
Preconditioning Nyström-KRR

Consider $H := \hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM}$.

Proposed preconditioning:
$P P^\top = \left( \frac{n}{M} \hat K_{MM}^2 + \lambda n \hat K_{MM} \right)^{-1}$

Compare to naive preconditioning:
$P P^\top = \left( \hat K_{nM}^\top \hat K_{nM} + \lambda n \hat K_{MM} \right)^{-1}$.
Baby FALKON

Proposed preconditioning:
$P P^\top = \left( \frac{n}{M} \hat K_{MM}^2 + \lambda n \hat K_{MM} \right)^{-1}$

Gradient descent:
$\hat f_{\lambda,M,t}(x) = \sum_{i=1}^M K(x, \tilde x_i)\, c_{t,i}, \qquad c_t = P \beta_t$
$\beta_t = \beta_{t-1} - \frac{\tau}{n} P^\top \left[ \hat K_{nM}^\top (\hat K_{nM} P \beta_{t-1} - \hat y) + \lambda n \hat K_{MM} P \beta_{t-1} \right]$
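A possible sketch of Baby FALKON (not the slides' code): one valid way to realize $PP^\top = \big(\frac{n}{M}\hat K_{MM}^2 + \lambda n \hat K_{MM}\big)^{-1}$ is to take $P = L^{-\top}$ from a Cholesky factorization $B = LL^\top$ and apply it via triangular solves; the small jitter term is an added assumption for numerical stability.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular

def baby_falkon(K_nM, K_MM, y, lam, tau, t_max):
    # Preconditioned gradient descent with P P^T = ((n/M) K_MM^2 + lam*n*K_MM)^{-1},
    # realized as P = L^{-T} where B = L L^T.
    n, M = K_nM.shape
    B = (n / M) * (K_MM @ K_MM) + lam * n * K_MM + 1e-10 * np.eye(M)  # jitter
    L = cholesky(B, lower=True)

    def P(v):   # P v   = L^{-T} v
        return solve_triangular(L, v, lower=True, trans='T')

    def Pt(v):  # P^T v = L^{-1} v
        return solve_triangular(L, v, lower=True)

    beta = np.zeros(M)
    for _ in range(t_max):
        c = P(beta)
        grad = K_nM.T @ (K_nM @ c - y) + lam * n * (K_MM @ c)
        beta -= (tau / n) * Pt(grad)
    return P(beta)   # Nystrom coefficients c_t = P beta_t
```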
FALKON

◮ Gradient descent $\mapsto$ conjugate gradient
◮ Computing $P$:
$P = \frac{1}{\sqrt n}\, T^{-1} A^{-1}, \qquad T = \operatorname{chol}(K_{MM}), \qquad A = \operatorname{chol}\!\left( \frac{1}{M} T T^\top + \lambda I \right)$,
where $\operatorname{chol}(\cdot)$ is the Cholesky decomposition.
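Putting the pieces together, a hedged sketch of FALKON that runs SciPy's conjugate gradient on the preconditioned system $P^\top H P \,\beta = P^\top \hat K_{nM}^\top \hat y$ (the jitter and the `maxiter` default are assumptions, not part of the method as stated above; `chol` is taken as the upper-triangular factor, so $K_{MM} = T^\top T$):

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.sparse.linalg import LinearOperator, cg

def falkon(K_nM, K_MM, y, lam, t_max=20):
    # P = (1/sqrt(n)) T^{-1} A^{-1}, T = chol(K_MM), A = chol((1/M) T T^T + lam I).
    n, M = K_nM.shape
    T = cholesky(K_MM + 1e-8 * np.eye(M))           # upper triangular, jitter added
    A = cholesky((T @ T.T) / M + lam * np.eye(M))   # upper triangular

    def P(v):    # P v = (1/sqrt(n)) T^{-1} (A^{-1} v)
        return solve_triangular(T, solve_triangular(A, v)) / np.sqrt(n)

    def Pt(v):   # P^T v = (1/sqrt(n)) A^{-T} (T^{-T} v)
        return solve_triangular(A, solve_triangular(T, v, trans='T'), trans='T') / np.sqrt(n)

    def matvec(beta):   # beta -> P^T (K_nM^T K_nM + lam n K_MM) P beta
        c = P(beta)
        return Pt(K_nM.T @ (K_nM @ c) + lam * n * (K_MM @ c))

    H_prec = LinearOperator((M, M), matvec=matvec, dtype=K_nM.dtype)
    beta, _ = cg(H_prec, Pt(K_nM.T @ y), maxiter=t_max)
    return P(beta)      # Nystrom coefficients c; predict as in the Nystrom sketch
```

Applying $P$ and $P^\top$ through triangular solves avoids forming $P$ explicitly, so each CG iteration still costs $O(nM + M^2)$.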
FALKON statistics

Theorem
Under (basic) and (refined), when $M > \frac{\log n}{\lambda}$,
$\mathbb E\, \mathcal E(\hat f_{\lambda,M,t}) - \mathcal E(f_{\mathcal H}) \lesssim \frac{\mathcal N(\lambda)}{n} + \lambda^{2r} + \frac{1}{M} + \exp\!\left( -t \left( 1 - \frac{\log n}{\lambda M} \right)^{1/2} \right)$.
By selecting $\lambda_n = n^{-\frac{1}{2r+\gamma}}$, $M_n = \frac{2 \log n}{\lambda_n}$, $t_n = \log n$, then
$\mathbb E\, \mathcal E(\hat f_{\lambda_n,M_n,t_n}) - \mathcal E(f_{\mathcal H}) \lesssim n^{-\frac{2r}{2r+\gamma}}$.
Remarks

◮ Same rates and memory as Nyström-KRR, with much smaller time complexity. For the $O(1/\sqrt n)$ rate:
  Model: $O(\sqrt n)$, Space: $O(n)$, Kernel eval.: $O(n\sqrt n)$, Time: $O(n^2) \to O(n\sqrt n)$

Related (worse complexity):
◮ EigenPro (Belkin et al. '16)
◮ SGD (Smale, Yao '05; Tarres, Yao '07; Ying, Pontil '08; Bach et al. '14-...)
◮ RF-KRR (Rahimi, Recht '07; Bach '15; Rudi, Rosasco '17)
◮ Divide and conquer (Zhang et al. '13)
◮ NYTRO (Angles et al. '16)
◮ Nyström SGD (Lin, Rosasco '16)
In practice

Higgs dataset: $n = 10{,}000{,}000$, $M = 50{,}000$.

[Plot omitted; axes ranged 0.75-1 (y) and 0-100 (x).]
Some experiments

                        MillionSongs (n ~ 10^6)             YELP (n ~ 10^6)     TIMIT (n ~ 10^6)
                        MSE     Relative error   Time(s)    RMSE    Time(m)     c-err    Time(h)
FALKON                  80.30   4.51 x 10^-3     55         0.833   20          32.3%    1.5
Prec. KRR               -       4.58 x 10^-3     289†       -       -           -        -
Hierarchical            -       4.56 x 10^-3     293⋆       -       -           -        -
D&C                     80.35   -                737∗       -       -           -        -
Rand. Feat.             80.93   -                772∗       -       -           -        -
Nyström                 80.38   -                876∗       -       -           -        -
ADMM R. F.              -       5.01 x 10^-3     958†       -       -           -        -
BCD R. F.               -       -                -          0.949   42‡         34.0%    1.7‡
BCD Nyström             -       -                -          0.861   60‡         33.7%    1.7‡
KRR                     -       4.55 x 10^-3     -          0.854   500‡        33.5%    8.3‡
EigenPro                -       -                -          -       -           32.6%    3.9≀
Deep NN                 -       -                -          -       -           32.4%    -
Sparse Kernels          -       -                -          -       -           30.9%    -
Ensemble                -       -                -          -       -           33.5%    -

Table: MillionSongs, YELP and TIMIT datasets. Times obtained on: ‡ = cluster of 128 EC2 r3.2xlarge machines, † = cluster of 8 EC2 r3.8xlarge machines, ≀ = single machine with two Intel Xeon E5-2620, one Nvidia GTX Titan X GPU and 128GB of RAM, ⋆ = cluster with 512 GB of RAM and IBM POWER8 12-core processor, ∗ = unknown platform.
Some more experiments

                        SUSY (n ~ 10^6)             HIGGS (n ~ 10^7)    IMAGENET (n ~ 10^6)
                        c-err   AUC     Time(m)     AUC     Time(h)     c-err    Time(h)
FALKON                  19.6%   0.877   4           0.833   3           20.7%    4
EigenPro                19.8%   -       6≀          -       -           -        -
Hierarchical            20.1%   -       40†         -       -           -        -
Boosted Decision Tree   -       0.863   -           0.810   -           -        -
Neural Network          -       0.875   -           0.816   -           -        -
Deep Neural Network     -       0.879   4680‡       0.885   78‡         -        -
Inception-V4            -       -       -           -       -           20.0%    -

Table: Architectures: † = cluster with IBM POWER8 12-core cpu, 512 GB RAM, ≀ = single machine with two Intel Xeon E5-2620, one Nvidia GTX Titan X GPU, 128GB RAM, ‡ = single machine.
Contributions

◮ Best computations so far for optimal statistics: Time $O(n\sqrt n)$, Space $O(n)$
◮ In the pipeline: adaptive sampling, general projections, SGD
◮ TBD: other losses, other regularizers, other problems, other solvers...
Proof: bridging statistics and optimization

Lemma
Let $\delta > 0$, $\kappa_P := \kappa(P^\top H P)$, $c_\delta = c_0 \log\frac{1}{\delta}$. When $\lambda \ge \frac{1}{n}$,
$\mathcal E(\hat f_{\lambda,M,t}) - \mathcal E(f_{\mathcal H}) \le \mathcal E(\hat f_{\lambda,M}) - \mathcal E(f_{\mathcal H}) + c_\delta \exp(-t/\sqrt{\kappa_P})$
with probability $1 - \delta$.

Lemma
Let $\delta \in (0, 1]$, $\lambda > 0$. When $M \ge \frac{2}{\lambda} \log\frac{1}{\delta}$, then
$\kappa(P^\top H P) \le \left( 1 - \frac{\log(1/\delta)}{\lambda M} \right)^{-1} < 4$
with probability $1 - \delta$.
Proving $\kappa(P^\top H P) \approx 1$

Let $K_x = K(x, \cdot) \in \mathcal H$, and
$C = \int K_x \otimes K_x \, d\rho_X(x), \qquad \hat C_n = \frac{1}{n} \sum_{i=1}^n K_{x_i} \otimes K_{x_i}, \qquad \hat C_M = \frac{1}{M} \sum_{j=1}^M K_{\tilde x_j} \otimes K_{\tilde x_j}$.

Recall that $P = \frac{1}{\sqrt n}\, T^{-1} A^{-1}$, $T = \operatorname{chol}(K_{MM})$, $A = \operatorname{chol}\!\left( \frac{1}{M} T T^\top + \lambda I \right)$.

Steps:
1. $P^\top H P = A^{-\top} V^* (\hat C_n + \lambda I) V A^{-1}$
2. $P^\top H P = A^{-\top} V^* (\hat C_M + \lambda I) V A^{-1} + A^{-\top} V^* (\hat C_n - \hat C_M) V A^{-1}$
3. $P^\top H P = I + A^{-\top} V^* (\hat C_n - \hat C_M) V A^{-1} = I + E$, with $E = A^{-\top} V^* (\hat C_n - \hat C_M) V A^{-1}$
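A quick numerical sanity check of this claim (illustrative, not from the slides; the synthetic data, bandwidth, regularization, and jitter are all arbitrary assumptions): the FALKON preconditioner should bring the condition number far below $\kappa(H)$, approaching 1 when $M$ is large enough for $\hat C_M \approx \hat C_n$.

```python
import numpy as np
from scipy.linalg import cholesky, solve_triangular
from scipy.spatial.distance import cdist

rng = np.random.default_rng(0)
n, M, lam, sigma = 5000, 500, 1e-3, 1.0

X = rng.standard_normal((n, 3))
centers = X[rng.choice(n, size=M, replace=False)]
K_nM = np.exp(-cdist(X, centers, 'sqeuclidean') / (2 * sigma ** 2))
K_MM = np.exp(-cdist(centers, centers, 'sqeuclidean') / (2 * sigma ** 2))

H = K_nM.T @ K_nM + lam * n * K_MM

# FALKON preconditioner P = (1/sqrt(n)) T^{-1} A^{-1}, formed densely only for this check
T = cholesky(K_MM + 1e-8 * np.eye(M))            # upper triangular, jitter added
A = cholesky((T @ T.T) / M + lam * np.eye(M))
P = solve_triangular(T, solve_triangular(A, np.eye(M))) / np.sqrt(n)

print("kappa(H)       =", np.linalg.cond(H))
print("kappa(P^T H P) =", np.linalg.cond(P.T @ H @ P))   # typically far smaller than kappa(H)
```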