SLIDE 1
Regularization Methods for Large Scale Machine Learning
Alessandro Rudi
University of Genova - Istituto Italiano di Tecnologia
Massachusetts Institute of Technology
lcsl.mit.edu
In collaboration with: Raffaello Camoriano, Lorenzo Rosasco
May
SLIDE 2
SLIDE 3
Statistics and computations
(Chandrasekaran, Jordan ’12)
SLIDE 4
Supervised Learning Problem: Estimate $f^*$
SLIDE 5
Supervised Learning Problem:
Estimate $f^*$ given $S_n = \{(x_1, y_1), \dots, (x_n, y_n)\}$
[Figure: sample points $(x_1, y_1), \dots, (x_5, y_5)$ and the target function $f^*$]
SLIDE 6
Supervised Learning Problem:
Estimate $f^*$ given $S_n = \{(x_1, y_1), \dots, (x_n, y_n)\}$
[Figure: sample points and the target function $f^*$]
Setting: $y_i = f^*(x_i) + \varepsilon_i$, $i \in \{1, \dots, n\}$
◮ $\varepsilon_i \in \mathbb{R}$, $x_i \in \mathbb{R}^d$ random (unknown distribution)
◮ $f^*$ unknown
SLIDE 7
Recall Kernel Ridge Regression
$$\hat f_\lambda(x) = \Phi(x)^\top \hat w_\lambda = \sum_{i=1}^{n} K(x, x_i)\, c^\lambda_i, \qquad c^\lambda = (K + \lambda n I)^{-1} y$$
◮ $\Phi(x)^\top \Phi(x') = K(x, x')$ kernel, e.g. Gaussian
◮ $K$: the $n \times n$ kernel matrix on the data
◮ $y$: vector of outputs
Computational complexity: Time $O(n^3)$, Memory $O(n^2)$
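For concreteness, a minimal NumPy sketch of this estimator, assuming a Gaussian kernel and a dense direct solve (the function names are illustrative, not from the slides). The $O(n^2)$ kernel matrix and the $O(n^3)$ linear solve are exactly the costs quoted above.

```python
import numpy as np

def gaussian_kernel(X, Z, gamma):
    """K(x, x') = exp(-gamma * ||x - x'||^2), computed for all pairs."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq)

def fit_krr(X, y, lam, gamma):
    """Kernel ridge regression: c = (K + lam*n*I)^{-1} y. O(n^3) time, O(n^2) memory."""
    n = X.shape[0]
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def predict_krr(X_train, c, X_test, gamma):
    """f_lambda(x) = sum_i K(x, x_i) c_i."""
    return gaussian_kernel(X_test, X_train, gamma) @ c
```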
SLIDE 8
Statistical Guarantees
Let $\mathcal{E}(f) = \mathbb{E}\,(y - f(x))^2$.
Theorem (Caponnetto, De Vito ’05)
Assume $\exists w$ such that $f^*(x) = w^\top \Phi(x)$, subexponential noise, and a bounded kernel. Then w.h.p.
$$\mathcal{E}(\hat f_\lambda) - \mathcal{E}(f^*) \lesssim \frac{1}{\lambda n} + \lambda,$$
so that for $\lambda_n = 1/\sqrt{n}$, w.h.p. $\mathcal{E}(\hat f_{\lambda_n}) - \mathcal{E}(f^*) \lesssim \frac{1}{\sqrt{n}}$.
SLIDE 9
Remarks
◮ Bound is minimax optimal (Caponnetto, De Vito ’05)
◮ Adaptivity via cross validation or the Lepskii method (Caponnetto, Yao ’07; De Vito, Pereverzev, R. ’07)
◮ Refined results, e.g. for Sobolev classes the rate is $n^{-\frac{2s}{2s+d}}$ (Caponnetto, De Vito ’05)
Computational complexity kills the method for large problems.
Computational complexity: Time $O(n^3)$, Memory $O(n^2)$
SLIDE 10
BIG DATA
Where it is possible to run kernel ridge regression:
◮ $n \approx 10\,000$: laptop ($\sim$1 gigabyte of memory)
◮ $n \approx 100\,000$: desktop ($\sim$100 gigabytes of memory)
◮ $n \approx 1\,000\,000$: cluster ($\sim$10 terabytes of memory)
◮ $n \approx 10\,000\,000$: supercomputer (TOP10) ($\sim$1 petabyte of memory)
High energy physics experiments: $n \approx 10^7$ per second. . .
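These thresholds are just the cost of storing the $n \times n$ kernel matrix in double precision; a quick check of the arithmetic (an illustrative script, not from the slides):

```python
# Memory needed to store the n-by-n kernel matrix in float64 (8 bytes/entry).
for n in [10_000, 100_000, 1_000_000, 10_000_000]:
    gigabytes = n * n * 8 / 1e9
    print(f"n = {n:>10,}: {gigabytes:,.0f} GB")
# n =     10,000: 1 GB        -> laptop
# n =    100,000: 80 GB       -> desktop
# n =  1,000,000: 8,000 GB    -> cluster
# n = 10,000,000: 800,000 GB  -> supercomputer
```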
SLIDE 11
Outline
◮ Data independent subsampling
◮ Data dependent subsampling
◮ Adaptive subsampling
SLIDE 12
An idea
If $K(x, x') = \Phi_M(x)^\top \Phi_M(x')$ with $\Phi_M(x) \in \mathbb{R}^M$:
$$\hat f_{\lambda,M}(x) = \Phi_M(x)^\top \hat w_{\lambda,M}, \qquad \hat w_{\lambda,M} = (\hat\Phi^\top \hat\Phi + \lambda n I)^{-1} \hat\Phi^\top y$$
SLIDE 13
An idea
If $K(x, x') = \Phi_M(x)^\top \Phi_M(x')$ with $\Phi_M(x) \in \mathbb{R}^M$:
$$\hat f_{\lambda,M}(x) = \Phi_M(x)^\top \hat w_{\lambda,M}, \qquad \hat w_{\lambda,M} = (\hat\Phi^\top \hat\Phi + \lambda n I)^{-1} \hat\Phi^\top y$$
◮ $\hat\Phi = (\Phi_M(x_1), \dots, \Phi_M(x_n))^\top \in \mathbb{R}^{n \times M}$
◮ $\hat w_{\lambda,M}$: vector in $\mathbb{R}^M$
◮ $y$: outputs vector in $\mathbb{R}^n$
Computational complexity: Time $O(n^3) \to O(nM^2)$, Memory $O(n^2) \to O(nM)$
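A minimal sketch of this primal solve, assuming the feature matrix $\hat\Phi$ has already been computed (function names are illustrative): the linear system is now only $M \times M$, which is where the $O(nM^2)$ time and $O(nM)$ memory come from.

```python
import numpy as np

def fit_feature_ridge(Phi, y, lam):
    """Ridge regression in an explicit M-dimensional feature space.

    Phi : (n, M) matrix with rows Phi_M(x_i).
    Solves w = (Phi^T Phi + lam*n*I)^{-1} Phi^T y.
    Time O(n M^2) (dominated by forming Phi^T Phi), memory O(n M).
    """
    n, M = Phi.shape
    A = Phi.T @ Phi + lam * n * np.eye(M)   # M x M system
    b = Phi.T @ y
    return np.linalg.solve(A, b)

def predict_feature_ridge(Phi_test, w):
    """f(x) = Phi_M(x)^T w."""
    return Phi_test @ w
```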
SLIDE 14
Kernel Approximation
Find $\Phi_M(x) \in \mathbb{R}^M$ such that $K(x, x') \approx \Phi_M(x)^\top \Phi_M(x')$, then apply the previous algorithm.
SLIDE 15
Random Fourier Features
Gaussian kernel: $e^{-\gamma\|x - x'\|^2} = \mathbb{E}_\omega\,[e^{i\omega^\top x} e^{-i\omega^\top x'}]$, $\omega \sim \mathcal{N}(0, \gamma)$
SLIDE 16
Random Fourier Features
Gaussian kernel: $e^{-\gamma\|x - x'\|^2} = \mathbb{E}_\omega\,[e^{i\omega^\top x} e^{-i\omega^\top x'}] \approx \frac{1}{M} \sum_{j=1}^{M} e^{i\omega_j^\top x} e^{-i\omega_j^\top x'}$, with $\omega_j \sim \mathcal{N}(0, \gamma)$
SLIDE 17
Random Fourier Features
Gaussian kernel: $e^{-\gamma\|x - x'\|^2} = \mathbb{E}_\omega\,[e^{i\omega^\top x} e^{-i\omega^\top x'}] \approx \frac{1}{M} \sum_{j=1}^{M} e^{i\omega_j^\top x} e^{-i\omega_j^\top x'}$, with $\omega_j \sim \mathcal{N}(0, \gamma)$
$$\Phi_M(x) := \frac{1}{\sqrt{M}}\,(e^{i\omega_1^\top x}, \dots, e^{i\omega_M^\top x})$$
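A sketch of this feature map in NumPy. To keep the features real (so they plug into the real-valued ridge solver above), the complex exponentials are replaced by the equivalent cos/sin pairs, using $\cos a \cos b + \sin a \sin b = \cos(a - b)$. Note that matching $\mathbb{E}_\omega[\cos(\omega^\top(x - x'))] = e^{-\gamma\|x - x'\|^2}$ exactly requires $\omega \sim \mathcal{N}(0, 2\gamma I_d)$, a constant the slide's $\mathcal{N}(0, \gamma)$ leaves implicit.

```python
import numpy as np

def rff_map(X, M, gamma, rng):
    """Random Fourier features for the Gaussian kernel exp(-gamma*||x-x'||^2).

    Returns a real (n, 2M) matrix whose inner products approximate the kernel:
    the cos/sin pairs encode the real and imaginary parts of exp(i w^T x).
    """
    d = X.shape[1]
    # Variance 2*gamma so that E[cos(w^T (x - x'))] = exp(-gamma*||x - x'||^2).
    W = rng.normal(scale=np.sqrt(2 * gamma), size=(d, M))
    Z = X @ W                                 # (n, M) random projections
    return np.hstack([np.cos(Z), np.sin(Z)]) / np.sqrt(M)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
Phi = rff_map(X, M=500, gamma=0.5, rng=rng)
K_approx = Phi @ Phi.T                        # approximates the Gaussian kernel
```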
SLIDE 18
Random Features Expansion
Find $q$ such that $K(x, x') = \mathbb{E}_\omega\,[q(x, \omega)\, q(x', \omega)]$
SLIDE 19
Random Features Expansion
Find $q$ such that $K(x, x') = \mathbb{E}_\omega\,[q(x, \omega)\, q(x', \omega)]$, sample $\omega_1, \dots, \omega_M$ and consider
$$K(x, x') \approx \Phi_M(x)^\top \Phi_M(x') := \frac{1}{M} \sum_{j=1}^{M} q(x, \omega_j)\, q(x', \omega_j), \qquad \Phi_M(x) := \frac{1}{\sqrt{M}}\,(q(x, \omega_1), \dots, q(x, \omega_M))$$
SLIDE 20
Examples of Random Features
◮ translation invariant kernels
◮ dot product kernels
◮ group invariant kernels
◮ infinite neural network kernels
◮ homogeneous additive kernels
◮ sums, products, and compositions of kernels
◮ . . .
SLIDE 21
Computations
If $M \ll n$:
◮ TIME: $O(nM^2) \ll O(n^3)$
◮ SPACE: $O(nM) \ll O(n^2)$
Any loss in ACCURACY?
SLIDE 22
Previous Results
◮ *Many* different random features for different kernels (Rahimi, Recht ’07; Vedaldi, Zisserman; . . . ’10+)
◮ Theoretical guarantees, mainly kernel approximation (Rahimi, Recht ’07, . . . , Sriperumbudur and Szabó ’15):
$$|K(x, x') - \Phi_M(x)^\top \Phi_M(x')| \lesssim \frac{1}{\sqrt{M}}$$
◮ Theoretical guarantees, generalization bounds (Rahimi, Recht ’09; Bach ’15):
$$M = n \;\Rightarrow\; \mathcal{E}(\hat f_{\lambda,M}) - \mathcal{E}(f^*) \lesssim \frac{1}{\sqrt{n}}$$
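The $1/\sqrt{M}$ approximation rate is easy to check empirically; a small illustrative script, reusing the `rff_map` sketch from above:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
gamma = 0.5
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)                  # exact Gaussian kernel matrix

for M in [100, 400, 1600, 6400]:
    Phi = rff_map(X, M, gamma, rng)      # from the earlier sketch
    err = np.abs(K - Phi @ Phi.T).max()
    print(f"M = {M:>5}: max |K - Phi Phi^T| = {err:.3f}")
# The error shrinks roughly like 1/sqrt(M): it about halves when M quadruples.
```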
SLIDE 23
Statistical Guarantees
Let $\mathcal{E}(f) = \mathbb{E}\,(y - f(x))^2$.
Theorem (R., Camoriano, Rosasco ’16)
Assume $\exists w$ such that $f^*(x) = w^\top \Phi(x)$, subexponential noise, and a bounded kernel. Then w.h.p.
$$\mathcal{E}(\hat f_{\lambda,M}) - \mathcal{E}(f^*) \lesssim \frac{1}{\lambda n} + \frac{1}{M} + \lambda,$$
so that for $\lambda_n = \frac{1}{\sqrt{n}}$ and $M_n = \frac{1}{\lambda_n}$, the following holds w.h.p.: $\mathcal{E}(\hat f_{\lambda_n,M_n}) - \mathcal{E}(f^*) \lesssim \frac{1}{\sqrt{n}}$.
SLIDE 24
Remarks
◮ Bound is minimax optimal, same as Tikhonov
◮ Adaptivity via cross validation or the Lepskii method
◮ Refined results, e.g. for Sobolev classes the rate is $n^{-\frac{2s}{2s+d}}$ (R., Camoriano, Rosasco ’16)
SLIDE 25
Remarks
◮ Bound is minimax optimal, same as Tikhonov
◮ Adaptivity via cross validation or the Lepskii method
◮ Refined results, e.g. for Sobolev classes the rate is $n^{-\frac{2s}{2s+d}}$ (R., Camoriano, Rosasco ’16)
$M = \sqrt{n}$ guarantees NO loss in accuracy.
Computational complexity: Time $O(n^3) \to O(n^2)$, Memory $O(n^2) \to O(n\sqrt{n})$
SLIDE 26
M controls space, time, and statistics
Corollary: the same result holds if we set $M_n = \sqrt{n}$, $\lambda_n = \frac{1}{M_n}$.
◮ $M$ can be seen as a regularization parameter
SLIDES 27–31 (incremental build of the previous slide)
New incremental algorithm (a sketch in code follows below):
1. Pick a random feature and compute the solution.
2. Pick another random feature and apply a rank-one update.
3. Pick another random feature . . .
M controls at the same time: Space, Time, Statistics
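A minimal sketch of how such an incremental solver can work, assuming random Fourier features are added one at a time. The bordering (block-inverse) update below is a standard way to grow $(\hat\Phi^\top \hat\Phi + \lambda n I)^{-1}$ by one feature in $O(M^2)$ time; all names here are illustrative, not the authors' implementation.

```python
import numpy as np

def add_feature(A_inv, Phi, u, lam, n):
    """Grow inv(Phi^T Phi + lam*n*I) when one feature column u is appended to Phi.

    Uses the block (bordering) inverse with Schur complement s:
    O(M^2) work per new feature instead of a fresh O(M^3) solve.
    """
    b = Phi.T @ u                       # (M,) correlations with existing features
    c = float(u @ u) + lam * n          # new diagonal entry
    Ab = A_inv @ b
    s = c - b @ Ab                      # Schur complement (positive scalar)
    top = A_inv + np.outer(Ab, Ab) / s
    side = -Ab[:, None] / s
    return np.block([[top, side], [side.T, np.array([[1.0 / s]])]])

# Incremental loop: each new random feature updates the solution;
# stop (i.e. choose M) when the validation error flattens out.
rng = np.random.default_rng(0)
n, d, lam, gamma = 500, 3, 1e-3, 0.5
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)
Phi, A_inv = np.empty((n, 0)), np.empty((0, 0))
for m in range(50):
    w_feat = rng.normal(scale=np.sqrt(2 * gamma), size=d)
    u = np.cos(X @ w_feat)              # one new random feature
    A_inv = add_feature(A_inv, Phi, u, lam, n)
    Phi = np.column_stack([Phi, u])
    w = A_inv @ (Phi.T @ y)             # current ridge solution
```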
SLIDE 32
Outline
◮ Data independent subsampling
◮ Data dependent subsampling
◮ Adaptive subsampling
SLIDE 33
Data dependent subsampling with Nyström
$$\{\tilde x_1, \dots, \tilde x_M\} \subseteq \{x_1, \dots, x_n\}$$
SLIDE 34
Data dependent subsampling with Nyström
$$\{\tilde x_1, \dots, \tilde x_M\} \subseteq \{x_1, \dots, x_n\}$$
$$\hat f_{\lambda,M}(x) = \sum_{j=1}^{M} K(x, \tilde x_j)\, c^{\lambda,M}_j, \qquad c^{\lambda,M} = (K_{nM}^\top K_{nM} + \lambda n\, K_{MM})^{-1} K_{nM}^\top y$$
SLIDE 35
Data dependent subsampling with Nyström
$$\{\tilde x_1, \dots, \tilde x_M\} \subseteq \{x_1, \dots, x_n\}$$
$$\hat f_{\lambda,M}(x) = \sum_{j=1}^{M} K(x, \tilde x_j)\, c^{\lambda,M}_j, \qquad c^{\lambda,M} = (K_{nM}^\top K_{nM} + \lambda n\, K_{MM})^{-1} K_{nM}^\top y$$
◮ $(K_{nM})_{ij} = K(x_i, \tilde x_j)$: $n \times M$ matrix
◮ $(K_{MM})_{ij} = K(\tilde x_i, \tilde x_j)$: $M \times M$ matrix
◮ $y$: outputs vector
Computational complexity: Time $O(nM^2)$, Memory $O(nM)$
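A minimal NumPy sketch of the Nyström estimator above with uniformly subsampled centers, reusing the `gaussian_kernel` helper from the earlier KRR sketch (function names are illustrative):

```python
import numpy as np

def fit_nystrom(X, y, M, lam, gamma, rng):
    """Nystrom kernel ridge regression with uniformly sampled centers.

    Solves c = (K_nM^T K_nM + lam*n*K_MM)^{-1} K_nM^T y.
    Time O(n M^2), memory O(n M).
    """
    n = X.shape[0]
    idx = rng.choice(n, size=M, replace=False)    # uniform subsampling
    centers = X[idx]
    K_nM = gaussian_kernel(X, centers, gamma)     # n x M (from the KRR sketch)
    K_MM = gaussian_kernel(centers, centers, gamma)
    A = K_nM.T @ K_nM + lam * n * K_MM
    c = np.linalg.solve(A, K_nM.T @ y)
    return centers, c

def predict_nystrom(centers, c, X_test, gamma):
    """f(x) = sum_j K(x, x~_j) c_j."""
    return gaussian_kernel(X_test, centers, gamma) @ c
```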
SLIDE 36
Remarks
◮ Plenty of methods for subsampling . . .
◮ Connections to Nyström methods for integral operators
◮ Previous results: kernel approximation, i.e. bounds on $\|K - K_{nM} K_{MM}^\dagger K_{nM}^\top\|$
Any loss in ACCURACY?
SLIDE 37
Statistical Guarantees
Let $\mathcal{E}(f) = \mathbb{E}\,(y - f(x))^2$.
Theorem (R., Camoriano, Rosasco ’15)
Assume $\exists w$ such that $f^*(x) = w^\top \Phi(x)$, subexponential noise, and a bounded kernel. Then w.h.p.
$$\mathcal{E}(\hat f_{\lambda,M}) - \mathcal{E}(f^*) \lesssim \frac{1}{\lambda n} + \lambda + \frac{1}{M},$$
so that for $\lambda_n = \frac{1}{\sqrt{n}}$ and $M_n = \frac{1}{\lambda_n}$, the following holds w.h.p.: $\mathcal{E}(\hat f_{\lambda_n,M_n}) - \mathcal{E}(f^*) \lesssim \frac{1}{\sqrt{n}}$.
SLIDE 38
Remarks
◮ Bound is minimax optimal, same as Tikhonov
◮ Adaptivity via cross validation or the Lepskii method
◮ Refined results, e.g. for Sobolev classes the rate is $n^{-\frac{2s}{2s+d}}$ (R., Camoriano, Rosasco ’15)
$M = \sqrt{n}$ guarantees NO loss in accuracy.
SLIDE 39
Remarks
◮ Bound is minimax optimal, same as Tikhonov
◮ Adaptivity via cross validation or the Lepskii method
◮ Refined results, e.g. for Sobolev classes the rate is $n^{-\frac{2s}{2s+d}}$ (R., Camoriano, Rosasco ’15)
$M = \sqrt{n}$ guarantees NO loss in accuracy.
Computational complexity: Time $O(n^3) \to O(n^2)$, Memory $O(n^2) \to O(n\sqrt{n})$
SLIDE 40
Outline
◮ Data independent subsampling
◮ Data dependent subsampling
◮ Adaptive subsampling
SLIDE 41
Refined Results
Can we do better than uniform sampling?
Non-uniform subsampling: select the point $\tilde x = x_i$ with probability $p_i$.
SLIDE 42
Refined Results
Can we do better than uniform sampling?
Non-uniform subsampling: select the point $\tilde x = x_i$ with probability $p_i$.
◮ Leverage scores:
$$p_i := K_i^\top (K + \lambda n I)^{-1} K_i,$$
with $K_i$ the $i$-th column of the kernel matrix $K$.
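A sketch of leverage-score sampling following the definition above. Computing the exact scores requires a full $n \times n$ solve, so in practice approximate scores are used; the exact version is shown for clarity, and the names are illustrative:

```python
import numpy as np

def leverage_scores(K, lam):
    """p_i = K_i^T (K + lam*n*I)^{-1} K_i for each column K_i of K (exact, O(n^3))."""
    n = K.shape[0]
    B = np.linalg.solve(K + lam * n * np.eye(n), K)   # (K + lam*n*I)^{-1} K
    return np.einsum("ij,ij->j", K, B)                # diagonal of K^T B

def sample_centers(X, K, M, lam, rng):
    """Draw M Nystrom centers with probabilities proportional to leverage scores."""
    scores = leverage_scores(K, lam)
    p = scores / scores.sum()
    idx = rng.choice(X.shape[0], size=M, replace=False, p=p)
    return X[idx]
```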
SLIDE 43
Statistical Guarantees
Let $\mathcal{E}(f) = \mathbb{E}\,(y - f(x))^2$.
Theorem (R., Camoriano, Rosasco ’15)
Assume $\exists w$ such that $f^*(x) = w^\top \Phi(x)$, subexponential noise, a bounded kernel, and that $\Phi$ induces a Sobolev kernel with smoothness $s$. When $\lambda_n = n^{-\frac{2s}{2s+d}}$ and $M_n = n^{\frac{d}{2s+d}}$, the following holds w.h.p.:
$$\mathcal{E}(\hat f_{\lambda_n,M_n}) - \mathcal{E}(f^*) \lesssim n^{-\frac{2s}{2s+d}}.$$
SLIDE 44
Remarks
◮ Bound is minimax, same as Tikhonov
◮ Adaptivity via cross validation or the Lepskii method
◮ $M = n^{\frac{d}{2s+d}} \ll \sqrt{n}$ guarantees NO loss in accuracy
◮ $M = O(1)$ if $s$ is large
SLIDE 45
Remarks
◮ Bound is minimax, same as Tikhonov
◮ Adaptivity via cross validation or the Lepskii method
◮ $M = n^{\frac{d}{2s+d}} \ll \sqrt{n}$ guarantees NO loss in accuracy
◮ $M = O(1)$ if $s$ is large
Computational complexity: Time $O(n^3) \to O(n \cdot n^{\frac{2d}{2s+d}})$, Memory $O(n^2) \to O(n \cdot n^{\frac{d}{2s+d}})$
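To make the gain concrete, a worked instance of these rates (an illustration, not from the slides): for a Sobolev kernel with $s = d$, the exponents become $\frac{2s}{2s+d} = \frac{2}{3}$ and $\frac{d}{2s+d} = \frac{1}{3}$, so $\lambda_n = n^{-2/3}$, $M_n = n^{1/3}$, and the excess risk is $\lesssim n^{-2/3}$, with time $O(n \cdot n^{2/3}) = O(n^{5/3})$ and memory $O(n \cdot n^{1/3}) = O(n^{4/3})$, below the $O(n^2)$ time and $O(n\sqrt{n})$ memory of uniform sampling.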
SLIDE 46