Regularization Methods for Large Scale Machine Learning


SLIDE 1

Regularization Methods for Large Scale Machine Learning

Alessandro Rudi University of Genova - Istituto Italiano di Tecnologia Massachusetts Institute of Technology lcsl.mit.edu

In collaboration with: Raffaello Camoriano, Lorenzo Rosasco

May 6th, 2017 - RegML, Oslo

SLIDE 2

Machine Learning Algorithms

Desiderata

◮ flexible non-linear / non-parametric models
◮ scalable computation
◮ statistical guarantees

SLIDE 3

Statistics and computations

(Chandrasekaran, Jordan ’12)

SLIDE 4

Supervised Learning Problem: Estimate f∗

SLIDE 5

Supervised Learning Problem:

Estimate f∗ given Sn = {(x1, y1), . . . , (xn, yn)}

[Figure: training points (x_1, y_1), . . . , (x_5, y_5) and the target function f*]

SLIDE 6

Supervised Learning Problem:

Estimate f∗ given Sn = {(x1, y1), . . . , (xn, yn)}

[Figure: training points (x_1, y_1), . . . , (x_5, y_5) and the target function f*]

Setting: y_i = f*(x_i) + ε_i,   i ∈ {1, . . . , n}

◮ ε_i ∈ R, x_i ∈ R^d random (unknown distribution)
◮ f* unknown

SLIDE 7

Recall Kernel Ridge Regression

f̂_λ(x) = Φ(x)^⊤ ŵ_λ = ∑_{i=1}^n K(x, x_i) ĉ^λ_i,   ĉ^λ = (K̂ + λnI)^{-1} y

◮ Φ(x)^⊤ Φ(x') = K(x, x') kernel, e.g. Gaussian
◮ K̂ the n by n data matrix, K̂_{ij} = K(x_i, x_j)
◮ y the outputs vector

Computational complexity: Time O(n^3), Memory O(n^2)
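To make the cost concrete, here is a minimal NumPy sketch of kernel ridge regression with a Gaussian kernel; the function names and the `gamma` parameter are illustrative, not from the slides. Forming K̂ is the O(n^2) memory cost and the dense solve is the O(n^3) time cost quoted above.

```python
import numpy as np

def gaussian_kernel(A, B, gamma=1.0):
    # K[i, j] = exp(-gamma * ||A[i] - B[j]||^2)
    sq_dists = ((A[:, None, :] - B[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * sq_dists)

def krr_fit(X, y, lam, gamma=1.0):
    # c = (K + lambda * n * I)^{-1} y : O(n^2) memory for K, O(n^3) time for the solve
    n = X.shape[0]
    K = gaussian_kernel(X, X, gamma)
    return np.linalg.solve(K + lam * n * np.eye(n), y)

def krr_predict(X_train, c, X_test, gamma=1.0):
    # f_lambda(x) = sum_i K(x, x_i) c_i
    return gaussian_kernel(X_test, X_train, gamma) @ c
```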

SLIDE 8

Statistical Guarantees

Let E(f) = E (y − f(x))^2.

Theorem (Caponnetto, De Vito ’05)
Assume ∃ w such that f*(x) = w^⊤ Φ(x), sub-exponential noise, and a bounded kernel. Then w.h.p.

E(f̂_λ) − E(f*) ≲ 1/(λn) + λ,

so that for λ_n = 1/√n, w.h.p.

E(f̂_{λ_n}) − E(f*) ≲ 1/√n.

SLIDE 9

Remarks

◮ Bound is minimax optimal (Caponnetto, De Vito ’05)
◮ Adaptivity via cross validation or the Lepskii method (Caponnetto, Yao ’07; De Vito, Pereverzev, R. ’07)
◮ Refined results, e.g. for Sobolev classes the rate is n^{−2s/(2s+d)} (Caponnetto, De Vito ’05)

Computational complexity kills the method for large problems: Time O(n^3), Memory O(n^2)

SLIDE 10

BIG DATA

Where it is possible to run Kernel Ridge Regression (a dense n × n kernel matrix of doubles takes about 8n^2 bytes):

◮ n ≈ 10 000: laptop (∼1 gigabyte of memory)
◮ n ≈ 100 000: desktop (∼100 gigabytes of memory)
◮ n ≈ 1 000 000: cluster (∼10 terabytes of memory)
◮ n ≈ 10 000 000: supercomputer (TOP10) (∼1 petabyte of memory)

High energy physics experiments: n ≈ 10^7 per second . . .
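A quick back-of-the-envelope check of these figures, assuming the kernel matrix is stored densely in float64 (8 bytes per entry); the script is only an illustration of the arithmetic behind the slide.

```python
# memory needed to store a dense n x n float64 kernel matrix
for n in [10_000, 100_000, 1_000_000, 10_000_000]:
    gigabytes = 8 * n ** 2 / 1e9
    print(f"n = {n:>10,}: ~{gigabytes:,.0f} GB")
# n =     10,000: ~1 GB        (laptop)
# n =    100,000: ~80 GB       (desktop)
# n =  1,000,000: ~8,000 GB    (cluster, ~8 TB)
# n = 10,000,000: ~800,000 GB  (supercomputer, ~0.8 PB)
```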

SLIDE 11

Outline

◮ Data independent subsampling
◮ Data dependent subsampling
◮ Adaptive subsampling

SLIDE 12

An idea

If K(x, x') = Φ_M(x)^⊤ Φ_M(x') with Φ_M(x) ∈ R^M:

f̂_{λ,M}(x) = Φ_M(x)^⊤ ŵ_{λ,M},   ŵ_{λ,M} = (Φ̂^⊤ Φ̂ + λnI)^{-1} Φ̂^⊤ y

SLIDE 13

An idea

If K(x, x') = Φ_M(x)^⊤ Φ_M(x') with Φ_M(x) ∈ R^M:

f̂_{λ,M}(x) = Φ_M(x)^⊤ ŵ_{λ,M},   ŵ_{λ,M} = (Φ̂^⊤ Φ̂ + λnI)^{-1} Φ̂^⊤ y

◮ Φ̂ = (Φ_M(x_1), . . . , Φ_M(x_n))^⊤ ∈ R^{n×M}
◮ ŵ_{λ,M} vector in R^M
◮ y outputs vector in R^n

Computational complexity: Time O(n^3) → O(nM^2), Memory O(n^2) → O(nM)
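A minimal sketch of the linear system above, assuming a feature matrix Φ̂ has already been built (the function names are illustrative). The M × M system is the source of the O(nM^2) time and O(nM) memory costs.

```python
import numpy as np

def ridge_with_features(Phi, y, lam):
    # w = (Phi^T Phi + lambda * n * I)^{-1} Phi^T y,  Phi is n x M
    n, M = Phi.shape
    A = Phi.T @ Phi + lam * n * np.eye(M)      # M x M, built in O(n M^2)
    return np.linalg.solve(A, Phi.T @ y)       # O(M^3) solve

def predict_with_features(feature_map, w, X_test):
    # f_{lambda,M}(x) = Phi_M(x)^T w
    return feature_map(X_test) @ w
```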

SLIDE 14

Kernel Approximation

Find Φ_M(x) ∈ R^M such that K(x, x') ≈ Φ_M(x)^⊤ Φ_M(x'), then apply the previous algorithm.

SLIDE 15

Random Fourier Features

Gaussian kernel: e^{−γ‖x−x'‖^2} = E_ω [e^{iω^⊤x} e^{−iω^⊤x'}],   ω ∼ N(0, γ)

SLIDE 16

Random Fourier Features

Gaussian kernel: e^{−γ‖x−x'‖^2} = E_ω [e^{iω^⊤x} e^{−iω^⊤x'}],   ω ∼ N(0, γ)
≈ (1/M) ∑_{j=1}^M e^{iω_j^⊤x} e^{−iω_j^⊤x'},   ω_j ∼ N(0, γ)

SLIDE 17

Random Fourier Features

Gaussian kernel: e^{−γ‖x−x'‖^2} = E_ω [e^{iω^⊤x} e^{−iω^⊤x'}],   ω ∼ N(0, γ)
≈ (1/M) ∑_{j=1}^M e^{iω_j^⊤x} e^{−iω_j^⊤x'},   ω_j ∼ N(0, γ)

Φ_M(x) := (1/√M) (e^{iω_1^⊤x}, . . . , e^{iω_M^⊤x})
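A sketch of a random Fourier feature map for the Gaussian kernel. It uses the common real-valued variant with cosines and random phases instead of the complex exponentials on the slide, and it draws ω ∼ N(0, 2γI), which is one standard parameterization of the spectral measure of e^{−γ‖x−x'‖^2}; treat the exact constants as an assumption to check against your own convention.

```python
import numpy as np

def rff_features(X, M, gamma, rng=None):
    """Phi_M(X): n x M features with Phi_M(x)^T Phi_M(x') ~ exp(-gamma ||x - x'||^2)."""
    rng = np.random.default_rng(0) if rng is None else rng
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, M))  # spectral samples omega_j
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)                # random phases
    return np.sqrt(2.0 / M) * np.cos(X @ W + b)
```

Plugging this matrix into the ridge sketch after Slide 13 gives the O(nM^2) random-features estimator.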

SLIDE 18

Random Features Expansion

Find q such that K(x, x') = E_ω [q(x, ω) q(x', ω)]

SLIDE 19

Random Features Expansion

Find q such that K(x, x') = E_ω [q(x, ω) q(x', ω)], sample ω_1, . . . , ω_M and consider

K(x, x') ≈ Φ_M(x)^⊤ Φ_M(x') := (1/M) ∑_{j=1}^M q(x, ω_j) q(x', ω_j),

Φ_M(x) := (1/√M) (q(x, ω_1), . . . , q(x, ω_M)).
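The same construction for a generic expansion q(x, ω), written as a small sketch where `q` is any user-supplied function and `omegas` the list of sampled parameters (both names are mine, not from the slides):

```python
import numpy as np

def random_feature_map(q, omegas, X):
    # Phi_M(x) = (1/sqrt(M)) * (q(x, omega_1), ..., q(x, omega_M))
    M = len(omegas)
    columns = [q(X, omega) for omega in omegas]   # each column has length n
    return np.stack(columns, axis=1) / np.sqrt(M)
```

For instance, q(X, (w, b)) = √2 · cos(X @ w + b), with w drawn from the spectral measure and b a uniform phase, recovers the random Fourier features of the previous slides.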

SLIDE 20

Examples of Random Features

◮ translation invariant kernels
◮ dot product kernels
◮ group invariant kernels
◮ infinite neural nets kernels
◮ homogeneous additive kernels
◮ sums, products, compositions of kernels
◮ . . .

SLIDE 21

Computations

If M ≪ n:

◮ TIME: O(nM^2) ≪ O(n^3)
◮ SPACE: O(nM) ≪ O(n^2)

Any loss in ACCURACY?

SLIDE 22

Previous Results

◮ *Many* different random features for different kernels (Rahimi, Recht ’07; Vedaldi, Zisserman; . . . 10+)

◮ Theoretical guarantees: mainly kernel approximation (Rahimi, Recht ’07, . . . , Sriperumbudur and Szabo ’15)

|K(x, x') − Φ_M(x)^⊤ Φ_M(x')| ≲ 1/√M

◮ Theoretical guarantees: generalization bounds (Rahimi, Recht ’09; Bach ’15)

M = n ⇒ E(f̂_{λ,M}) − E(f*) ≲ 1/√n
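A small self-contained experiment (all names and constants are illustrative) that estimates the kernel-approximation error of random Fourier features on a few hundred points; the maximum error should roughly halve each time M is quadrupled, consistent with the 1/√M behaviour quoted above.

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma = 5, 0.5
X = rng.normal(size=(200, d))
K = np.exp(-gamma * ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))   # exact kernel

for M in [100, 400, 1600, 6400]:
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, M))
    b = rng.uniform(0.0, 2.0 * np.pi, size=M)
    Phi = np.sqrt(2.0 / M) * np.cos(X @ W + b)
    print(f"M = {M:5d}   max |K - Phi Phi^T| = {np.abs(K - Phi @ Phi.T).max():.3f}")
```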

SLIDE 23

Statistical Guarantees

Let E(f) = E (y − f(x))^2.

Theorem (R., Camoriano, Rosasco ’16)
Assume ∃ w such that f*(x) = w^⊤ Φ(x), sub-exponential noise, and a bounded kernel. Then w.h.p.

E(f̂_{λ,M}) − E(f*) ≲ 1/(λn) + 1/M + λ,

so that for λ_n = 1/√n and M_n = 1/λ_n = √n the following holds w.h.p.:

E(f̂_{λ_n,M_n}) − E(f*) ≲ 1/√n.

SLIDE 24

Remarks

◮ Bound is minimax optimal, same as Tikhonov
◮ Adaptivity via cross validation or the Lepskii method
◮ Refined results, e.g. for Sobolev classes the rate is n^{−2s/(2s+d)} (R., Camoriano, Rosasco ’16)

SLIDE 25

Remarks

◮ Bound is minimax optimal, same as Tikhonov
◮ Adaptivity via cross validation or the Lepskii method
◮ Refined results, e.g. for Sobolev classes the rate is n^{−2s/(2s+d)} (R., Camoriano, Rosasco ’16)

M = √n guarantees NO loss in accuracy

Computational complexity: Time O(n^3) → O(n^2), Memory O(n^2) → O(n√n)

SLIDE 26

M controls space, time and Statistics

Corollary: the same result holds if we set M_n = √n, λ_n = 1/M_n.

◮ M can be seen as a regularization parameter

SLIDE 27

M controls space, time and Statistics

Corollary: the same result holds if we set M_n = √n, λ_n = 1/M_n.

◮ M can be seen as a regularization parameter

New incremental algorithm

SLIDE 28

M controls space, time and Statistics

Corollary: the same result holds if we set M_n = √n, λ_n = 1/M_n.

◮ M can be seen as a regularization parameter

New incremental algorithm

1. Pick a random feature + compute solution

SLIDE 29

M controls space, time and Statistics

Corollary: the same result holds if we set M_n = √n, λ_n = 1/M_n.

◮ M can be seen as a regularization parameter

New incremental algorithm

1. Pick a random feature + compute solution
2. Pick another random feature + rank one update

SLIDE 30

M controls space, time and Statistics

Corollary: the same result holds if we set M_n = √n, λ_n = 1/M_n.

◮ M can be seen as a regularization parameter

New incremental algorithm

1. Pick a random feature + compute solution
2. Pick another random feature + rank one update
3. Pick another random feature . . .
SLIDE 31

M controls space, time and Statistics

Corollary: the same result holds if we set M_n = √n, λ_n = 1/M_n.

◮ M can be seen as a regularization parameter

New incremental algorithm (see the sketch after this slide)

1. Pick a random feature + compute solution
2. Pick another random feature + rank one update
3. Pick another random feature . . .

M controls at the same time: Space, Time, Statistics
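The slides do not spell out the linear algebra behind the rank one updates, so the following is only a sketch under the assumption that growing a Cholesky factor of Φ̂^⊤Φ̂ + λnI column by column is an acceptable realization of the idea. Each new feature then costs O(nM + M^2), for a total of O(nM^2) after M features, and the current solution can be monitored (e.g. on a validation set) to stop growing M early.

```python
import numpy as np
from scipy.linalg import solve_triangular

def incremental_rf_ridge(feature_columns, y, lam):
    """Yield the ridge solution w after each new random feature.

    feature_columns : iterable of length-n arrays, column j holding (q(x_i, omega_j))_i
    Maintains an upper-triangular R with R^T R = Phi^T Phi + lam*n*I.
    """
    n = y.shape[0]
    Phi, R, Phi_t_y = None, None, []
    for phi in feature_columns:
        if Phi is None:
            Phi = phi[:, None]
            R = np.array([[np.sqrt(phi @ phi + lam * n)]])
        else:
            b = Phi.T @ phi                              # correlations with old features
            r12 = solve_triangular(R, b, trans='T')      # solve R^T r12 = b
            r22 = np.sqrt(phi @ phi + lam * n - r12 @ r12)
            R = np.block([[R, r12[:, None]],
                          [np.zeros((1, R.shape[0])), np.array([[r22]])]])
            Phi = np.column_stack([Phi, phi])
        Phi_t_y.append(phi @ y)
        rhs = np.array(Phi_t_y)
        # w = (Phi^T Phi + lam*n*I)^{-1} Phi^T y via two triangular solves
        yield solve_triangular(R, solve_triangular(R, rhs, trans='T'))
```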

SLIDE 32

Outline

◮ Data independent subsampling
◮ Data dependent subsampling
◮ Adaptive subsampling

SLIDE 33

Data dependent subsampling with Nyström

{x̃_1, . . . , x̃_M} ⊆ {x_1, . . . , x_n}

SLIDE 34

Data dependent subsampling with Nyström

{x̃_1, . . . , x̃_M} ⊆ {x_1, . . . , x_n}

f̂_{λ,M}(x) = ∑_{j=1}^M K(x, x̃_j) ĉ^{λ,M}_j,   ĉ^{λ,M} = (K̂_{nM}^⊤ K̂_{nM} + λn K̂_{MM})^{-1} K̂_{nM}^⊤ y

SLIDE 35

Data dependent subsampling with Nyström

{x̃_1, . . . , x̃_M} ⊆ {x_1, . . . , x_n}

f̂_{λ,M}(x) = ∑_{j=1}^M K(x, x̃_j) ĉ^{λ,M}_j,   ĉ^{λ,M} = (K̂_{nM}^⊤ K̂_{nM} + λn K̂_{MM})^{-1} K̂_{nM}^⊤ y

◮ (K̂_{nM})_{ij} = K(x_i, x̃_j), an n by M matrix
◮ (K̂_{MM})_{ij} = K(x̃_i, x̃_j), an M by M matrix
◮ y outputs vector

Computational complexity: Time O(nM^2), Memory O(nM)
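A minimal NumPy sketch of the Nyström estimator above with uniformly subsampled centers; the function names are illustrative, and in practice one would add a small jitter to the M × M block or use a pseudo-inverse for numerical robustness.

```python
import numpy as np

def nystrom_fit(X, y, M, lam, kernel, rng=None):
    # c = (K_nM^T K_nM + lam * n * K_MM)^{-1} K_nM^T y, centers sampled uniformly
    rng = np.random.default_rng(0) if rng is None else rng
    n = X.shape[0]
    centers = X[rng.choice(n, size=M, replace=False)]   # the points x~_1, ..., x~_M
    K_nM = kernel(X, centers)                            # n x M
    K_MM = kernel(centers, centers)                      # M x M
    A = K_nM.T @ K_nM + lam * n * K_MM                   # O(n M^2) time, O(n M) memory
    c = np.linalg.solve(A, K_nM.T @ y)
    return centers, c

def nystrom_predict(kernel, centers, c, X_test):
    # f_{lam,M}(x) = sum_j K(x, x~_j) c_j
    return kernel(X_test, centers) @ c
```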

SLIDE 36

Remarks

◮ Plenty of methods for subsampling . . .
◮ Connections to Nyström for integral operators
◮ Previous results: kernel approximation, K̂ − K̂_{nM} K̂_{MM}^† K̂_{nM}^⊤

Any loss in ACCURACY?

SLIDE 37

Statistical Guarantees

Let E(f) = E (y − f(x))^2.

Theorem (R., Camoriano, Rosasco ’15)
Assume ∃ w such that f*(x) = w^⊤ Φ(x), sub-exponential noise, and a bounded kernel. Then w.h.p.

E(f̂_{λ,M}) − E(f*) ≲ 1/(λn) + λ + 1/M,

so that for λ_n = 1/√n and M_n = 1/λ_n = √n the following holds w.h.p.:

E(f̂_{λ_n,M_n}) − E(f*) ≲ 1/√n.

SLIDE 38

Remarks

◮ Bound is minimax optimal, same as Tikhonov
◮ Adaptivity via cross validation or the Lepskii method
◮ Refined results, e.g. for Sobolev classes the rate is n^{−2s/(2s+d)} (R., Camoriano, Rosasco ’15)

M = √n guarantees NO loss in accuracy

SLIDE 39

Remarks

◮ Bound is minimax optimal, same as Tikhonov
◮ Adaptivity via cross validation or the Lepskii method
◮ Refined results, e.g. for Sobolev classes the rate is n^{−2s/(2s+d)} (R., Camoriano, Rosasco ’15)

M = √n guarantees NO loss in accuracy

Computational complexity: Time O(n^3) → O(n^2), Memory O(n^2) → O(n√n)

SLIDE 40

Outline

◮ Data independent subsampling
◮ Data dependent subsampling
◮ Adaptive subsampling

SLIDE 41

Refined Results

Can we do better than uniform sampling?

Non-uniform subsampling: select the point x̃ = x_i with probability p_i.

SLIDE 42

Refined Results

Can we do better than uniform sampling?

Non-uniform subsampling: select the point x̃ = x_i with probability p_i.

◮ Leverage scores: p_i := K̂_i^⊤ (K̂ + λnI)^{-1} K̂_i, with K̂_i the i-th column of the data matrix K̂.
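A sketch of the leverage scores as written on the slide, computed exactly with a dense O(n^3) solve (which defeats the purpose at scale; in practice approximate leverage-score schemes are used). Turning the scores into a probability vector is my addition.

```python
import numpy as np

def leverage_score_probabilities(K, lam):
    # p_i proportional to K_i^T (K + lam*n*I)^{-1} K_i, with K_i the i-th column of K
    n = K.shape[0]
    B = np.linalg.solve(K + lam * n * np.eye(n), K)    # column j is (K + lam*n*I)^{-1} K_j
    scores = np.einsum('ij,ij->j', K, B)               # quadratic forms K_j^T B_j
    return scores / scores.sum()
```

The resulting vector can be passed as the `p` argument of `numpy.random.Generator.choice` to draw the Nyström centers non-uniformly.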

SLIDE 43

Statistical Guarantees

Let E(f) = E (y − f(x))^2.

Theorem (R., Camoriano, Rosasco ’15)
Assume ∃ w such that f*(x) = w^⊤ Φ(x), sub-exponential noise, a bounded kernel, and that Φ induces a Sobolev kernel with smoothness s. When λ_n = n^{−2s/(2s+d)} and M_n = n^{d/(2s+d)}, the following holds w.h.p.:

E(f̂_{λ_n,M_n}) − E(f*) ≲ n^{−2s/(2s+d)}.
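To get a feel for how small M_n can be, a tiny illustration of the theorem's parameter choices for a few example values of s, d and n (the specific numbers are mine, not from the slides):

```python
# lambda_n = n^{-2s/(2s+d)},  M_n = n^{d/(2s+d)}
for s, d, n in [(2, 2, 10**6), (4, 2, 10**6), (8, 10, 10**6)]:
    lam = n ** (-2 * s / (2 * s + d))
    M = n ** (d / (2 * s + d))
    print(f"s={s}, d={d}, n={n:.0e}: lambda_n ~ {lam:.1e}, M_n ~ {M:.0f} (vs sqrt(n) = {n ** 0.5:.0f})")
```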

SLIDE 44

Remarks

◮ Bound is minimax, same as Tikhonov
◮ Adaptivity via cross validation or the Lepskii method
◮ M = n^{d/(2s+d)} ≪ √n guarantees NO loss in accuracy
◮ M = O(1) if s is large

SLIDE 45

Remarks

◮ Bound is minimax, same as Tikhonov
◮ Adaptivity via cross validation or the Lepskii method
◮ M = n^{d/(2s+d)} ≪ √n guarantees NO loss in accuracy
◮ M = O(1) if s is large

Computational complexity: Time O(n^3) → O(n · n^{2d/(2s+d)}), Memory O(n^2) → O(n · n^{d/(2s+d)})

SLIDE 46

Contributions & Open Questions

Contributions

◮ M = √n gives optimal bounds
◮ M ≪ √n for adaptive sampling
◮ Fast rates under smoothness conditions

Open Questions

◮ Computational lower bounds?
◮ Efficient adaptive sampling

Back to big data. . .

◮ n ≈ 10 000: Laptop
◮ n ≈ 100 000: Desktop → Laptop
◮ n ≈ 1 000 000: Cluster → Laptop
◮ n ≈ 10 000 000: TOP 10 Supercomputer → Desktop
◮ n ≈ 100 000 000: ??? → Desktop