SLIDE 1

Multiscale Methods: Dictionary Learning, Regression, Measure Estimation for data near low-dimensional sets

Mauro Maggioni

Departments of Mathematics and Applied Mathematics, The Institute for Data Intensive Engineering and Science, Johns Hopkins University

Geometry, Analysis and Probability KIAS, 5/10/17

  • W. Liao
  • S. Vigogna
SLIDE 2

Curse of dimensionality

Data as samples $\{x_i\}_{i=1}^n$ from a probability distribution $\mu$ in $\mathbb{R}^D$.

In 1 dimension, estimating $\mu$ could correspond to building a histogram, where the height of the column over a bin is the probability of seeing a point in that bin. To estimate this histogram with accuracy $\epsilon$, under reasonable conditions we need bins of width $\epsilon$ and at least a constant number of points in each bin, for a total of $O(\epsilon^{-1})$ points. Unfortunately, in $D$ dimensions there are $O(\epsilon^{-D})$ boxes of side $\epsilon$, so we need $O(\epsilon^{-D})$ points. This is far too many: for $\epsilon = 10^{-1}$ and $D = 100$, we would need $10^{100}$ points.

Can we reduce the dimensionality?

SLIDE 3

“In high dimensions there are no functions, only measures”

P.W. Jones

SLIDE 4

Learning Geometry, Measure & Functions

$\mu$ a probability measure in $\mathbb{R}^D$, $D$ large. Assume that $\mu$ is (nearly) low-dimensional, e.g. concentrates around a manifold $\mathcal{M}$ of dimension $d \ll D$.

Geometric Problem. Given $n$ samples $x_1, \dots, x_n$ i.i.d. from $\mu$: construct an efficient encoding for samples from $\mu$, i.e. a map $\mathcal{D} : \mathbb{R}^D \to \mathbb{R}^m$ and an inverse map $\mathcal{D}^{-1} : \mathbb{R}^m \to \mathbb{R}^D$, such that $m = m(\epsilon)$ is small and
$$\sup_{x \sim \mu} \|\mathcal{D}(x)\|_0 \le k, \qquad \sup_{x \sim \mu} \|x - \mathcal{D}^{-1}\mathcal{D}(x)\|_2 < \epsilon.$$

Regression. In addition, given $y_i = f(x_i) + \eta_i$, with the $\eta_i$ independent of each other and of the $x_i$, construct $\hat f : \mathbb{R}^D \to \mathbb{R}$ such that $P_{x \sim \mu}(\|f - \hat f\|_{L^2(\mu)} > t)$ is small.

Measure Estimation. Given just the $x_i$'s, construct $\hat\mu$ close to $\mu$.

Objective: adaptive (no need to know the regularity), with fast algorithms, $\tilde O(n)$ or better, and performance guarantees that depend on $n$ (or $\epsilon$) and $d$, but with no curse of ambient dimensionality ($D$).
SLIDE 5

Principal Component Analysis (K. Pearson, 1901)

[Figure: 2-d point cloud with its principal directions.]

SVD of the (centered) $D \times n$ data matrix: $X = U \Sigma V^T$, where $U$ is orthogonal $D \times D$ (a system of coordinates for the features), $\Sigma$ is diagonal $D \times n$ with diagonal entries $\sigma_1 \ge \sigma_2 \ge \cdots \ge 0$ called the singular values, and $V$ is orthogonal $n \times n$ (a system of coordinates for the points).
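For concreteness, here is a minimal numpy sketch of PCA via the SVD, with the data stored as the columns of a $D \times n$ matrix as above (function and variable names are illustrative, not from any released code):

import numpy as np

def pca(X, d):
    """PCA of a D x n data matrix X via the SVD; keep the top d components."""
    c = X.mean(axis=1, keepdims=True)          # center of the data
    U, S, Vt = np.linalg.svd(X - c, full_matrices=False)
    Phi = U[:, :d]                             # top d principal directions
    coords = Phi.T @ (X - c)                   # d x n low-dimensional coordinates
    X_hat = Phi @ coords + c                   # best rank-d approximation (centered)
    return Phi, S[:d], coords, X_hat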

SLIDE 6

Intrinsic Dimension of Data

SLIDE 7

[Figure: manifold $\mathcal{M}$, noisy data $\mathcal{M} + \eta$, and a ball $B_r(z)$ around a point $z$, with $\|\eta\| \sim \sigma\sqrt{D}$. Green: where the data is; red: where the noisy data is; blue: volume in the ball.]

Model: data $\{x_i\}_{i=1}^n$ is sampled from a manifold $\mathcal{M}$ of dimension $k$, embedded in $\mathbb{R}^D$, with $k \ll D$. We receive $\tilde X_n := \{x_i + \eta_i\}_{i=1}^n$, where the $\eta_i$ are i.i.d. $D$-dimensional noise (e.g. Gaussian). Objective: estimate $k$.

· A. Little, MM, L. Rosasco, A.C.H.A.

SLIDE 8

[Figure: the model of Slide 7 repeated in three configurations of $\mathcal{M}$, $\mathcal{M} + \eta$, and $B_r(z)$ (e.g. different radii $r$); model text and legend as on Slide 7.]

SLIDE 9

Multiscale SVD: sphere+noise

[Figure: multiscale singular values at large scales vs. small scales.]

Example: consider $S^9(100, 1000, 0.1)$: 1000 points uniformly sampled on a 9-dimensional unit sphere, embedded in 100 dimensions, with Gaussian noise $N(0, 0.1\, I_{100})$. Observe that $E[\|\eta\|^2] \sim 0.1^2 \cdot 100 = 1$.
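A minimal numpy sketch of this experiment: sample $S^9(100, 1000, 0.1)$, then look at the singular values of the points in a ball $B_r(z)$ across several radii $r$. The largest-gap rule at the end is a simplified stand-in for the actual multiscale estimator:

import numpy as np

rng = np.random.default_rng(0)
n, D, k_true, sigma = 1000, 100, 9, 0.1

# Uniform samples on S^9 in R^10, embedded in R^100, plus full-dimensional noise.
Y = rng.standard_normal((n, k_true + 1))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)
X = np.hstack([Y, np.zeros((n, D - k_true - 1))])
X += sigma * rng.standard_normal((n, D))           # E[||eta||^2] ~ sigma^2 D = 1

z = X[0]                                           # center of the ball B_r(z)
for r in (0.3, 0.6, 1.0, 1.5):
    nbrs = X[np.linalg.norm(X - z, axis=1) < r]
    if len(nbrs) < 20:
        continue
    sv = np.linalg.svd(nbrs - nbrs.mean(0), compute_uv=False) / np.sqrt(len(nbrs))
    k_hat = int(np.argmax(sv[:-1] - sv[1:])) + 1   # largest gap in the spectrum
    print(f"r={r:.1f}  #points={len(nbrs):4d}  estimated dim={k_hat}")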

SLIDE 10

Example: Molecular Dynamics Data

Joint with C. Clementi, M. Rohrdanz, W. Zheng

The dynamics of a small peptide (12 atoms, with H atoms removed) in a bath of water molecules is approximated by a Langevin system of stochastic equations $\dot x = -\nabla U(x) + \dot w$. The set of configurations is a point cloud in $\mathbb{R}^{12 \times 3}$.

SLIDE 11

Example: Alanine dipeptide

[Figure: MSVD singular values against $\epsilon$ (Å) near the transition state and near the free-energy minimum; free energy in terms of the empirical coordinates $\phi$, $\psi$.]

· M. Rohrdanz, W. Zheng, MM, C. Clementi, J. Chem. Phys. 2011
SLIDE 12

Geometric MultiResolution Analysis

We are developing a multiscale geometric approximation for a point cloud $M$. We proceed in 3 stages:
(i) Construct multiscale partitions $\{\{C_{j,k}\}_{k \in \Gamma_j}\}_{j=0}^J$ of the data: for each $j$, $M = \cup_{k \in \Gamma_j} C_{j,k}$, and $C_{j,k}$ is a nice “cube” at scale $2^{-j}$. We obtain the $C_{j,k}$ using cover trees.
(ii) Compute a low-rank SVD of the local covariance: $\mathrm{cov}_{j,k} = \Phi_{j,k} \Sigma_{j,k} \Phi_{j,k}^T$. Let $P_{j,k}$ be the affine projection $\mathbb{R}^D \to V_{j,k} := \langle \Phi_{j,k} \rangle$ (local approximate tangent space): $P_{j,k}(x) = \Phi_{j,k}\Phi_{j,k}^*(x - c_{j,k}) + c_{j,k}$. These pieces of planes $P_{j,k}(C_{j,k})$ form an approximation $M_j$ to the original data $M$; let $P_{M_j}(x) := P_{j,k}(x)$ for $x \in C_{j,k}$.
(iii) Efficiently encode the difference $Q_{M_{j+1}}$ between $P_{M_{j+1}}(x)$ and $P_{M_j}(x)$ by constructing affine “detail” operators analogous to the wavelet projections in wavelet theory.

We obtain a multiscale nonlinear transform mapping data to a multiscale family of pieces of planes. Fast algorithms and multiscale organization allow fast pruning and optimization algorithms to be run on this multiscale structure.

W.K. Allard, G. Chen, MM, A.C.H.A.
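A schematic numpy sketch of stages (i)-(ii), with recursive 2-means splits standing in for the cover-tree construction (a simplification; names are illustrative, and the stage-(iii) detail operators are omitted here, see the transform sketch later):

import numpy as np

def two_means(X, iters=10, seed=0):
    """Crude 2-means split of the rows of X; returns a boolean mask."""
    rng = np.random.default_rng(seed)
    c = X[rng.choice(len(X), 2, replace=False)]
    for _ in range(iters):
        lab = np.argmin(((X[:, None] - c[None]) ** 2).sum(-1), axis=1)
        c = np.array([X[lab == i].mean(0) if (lab == i).any() else c[i]
                      for i in range(2)])
    return lab == 0

def gmra(X, d, j=0, j_max=6, min_pts=20):
    """Tree of cells C_{j,k}, each with center c_{j,k} and a local d-dim
    basis Phi_{j,k} from the rank-d SVD of the local covariance (stage (ii))."""
    c = X.mean(0)
    U, _, _ = np.linalg.svd((X - c).T, full_matrices=False)
    node = {"j": j, "center": c, "Phi": U[:, :d], "children": []}
    if j < j_max and len(X) >= 2 * min_pts:
        mask = two_means(X)
        if mask.any() and (~mask).any():
            node["children"] = [gmra(X[mask], d, j + 1, j_max, min_pts),
                                gmra(X[~mask], d, j + 1, j_max, min_pts)]
    return node

def project(node, x):
    """P_{j,k}(x) = Phi_{j,k} Phi_{j,k}^T (x - c_{j,k}) + c_{j,k}."""
    return node["Phi"] @ (node["Phi"].T @ (x - node["center"])) + node["center"]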

SLIDE 13

Geometric MultiResolution Analysis

[Figure: a subset of the data, and clusters $C_{j,k}$ at each scale $j$, from coarse to fine.]

SLIDE 14

Geometric MultiResolution Analysis

(Slide text repeated from Slide 12.)

SLIDE 15

Geometric MultiResolution Analysis

[Figure: clusters $C_{j,k}$ at each scale $j$, from coarse to fine; on one piece of data, the local linear low-dimensional approximations $\langle \Phi_{j,x} \rangle$, $\langle \Phi_{j-1,x} \rangle$, with $x \in V_{J,x}$.]

$$M = \bigcup_{k \in \Gamma_j} C_{j,k}, \qquad M_j = \bigcup_{k \in \Gamma_j} \underbrace{P_{j,k}(C_{j,k})}_{\subseteq V_{j,k}}$$

SLIDE 16

Geometric MultiResolution Analysis

(Slide text repeated from Slide 12.)

SLIDE 17

Geometric MultiResolution Analysis

[Figure: as on Slide 15, now also showing the wavelet subspace $\langle \Psi_{j,x} \rangle$.]

$$M = \bigcup_{k \in \Gamma_j} C_{j,k}, \qquad M_j = \bigcup_{k \in \Gamma_j} \underbrace{P_{j,k}(C_{j,k})}_{\subseteq V_{j,k}}$$

SLIDE 18

Multiscale encoding of manifolds

Coefficients of $Q_{M_j}(x)$ onto $\Psi_{j,x}$: this table of coefficients represents the whole manifold. Given this table and the dictionary, the data may be reconstructed up to the requested precision. The matrix form actually obscures the natural data structure, which is hierarchical: each column is the GWT of a point; observe the sparsity/compressibility of the representation.

[Figures: reconstructions at scale 4 (error ≈ 0.089) and scale 7 (error ≈ 0.010); the wavelet coefficient matrix (points × scales); coefficient magnitude against scale in $\log_{10}$ scale, with fitted line $y = -2.2x - 0.14$; the multiscale tree $P_{0,0}; Q_{1,0}, Q_{1,1}; Q_{2,0}, \dots, Q_{2,3}; \dots, Q_{j,k}$ over the cells $C_{j,k}$, coarse to fine.]
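A sketch of the transform for a single point, on top of tree nodes of the kind built in the GMRA sketch above (restating its project()): store the coarsest-scale coefficients plus, at each finer scale, the coefficients of the correction $Q_{M_j}(x) = P_{M_j}(x) - P_{M_{j-1}}(x)$ in the scale-$j$ basis. As written the inverse is only approximate; the actual construction also stores low-dimensional translation/detail terms so that reconstruction at the finest scale is exact.

import numpy as np

def project(nd, x):
    return nd["Phi"] @ (nd["Phi"].T @ (x - nd["center"])) + nd["center"]

def path_to_leaf(node, x):
    """Cells containing x, root to leaf, by nearest-child descent."""
    path = [node]
    while node["children"]:
        node = min(node["children"], key=lambda ch: np.linalg.norm(x - ch["center"]))
        path.append(node)
    return path

def gwt_forward(root, x):
    """Coarse coefficients plus per-scale detail coefficients for one point."""
    path = path_to_leaf(root, x)
    projs = [project(nd, x) for nd in path]              # P_{M_j}(x), j = 0..J
    coarse = path[0]["Phi"].T @ (x - path[0]["center"])
    details = [nd["Phi"].T @ (pj - pm)                   # Q_{M_j}(x) in scale-j basis
               for nd, pj, pm in zip(path[1:], projs[1:], projs[:-1])]
    return coarse, details, path

def gwt_inverse(coarse, details, path):
    """Approximate reconstruction of P_{M_J}(x) from the coefficients."""
    x_hat = path[0]["Phi"] @ coarse + path[0]["center"]
    for nd, q in zip(path[1:], details):
        x_hat = x_hat + nd["Phi"] @ q
    return x_hat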

SLIDE 19

Handwritten digits

Multiscale approximation with the GWT for one data point (digit).

[Figure: original digit and its approximations at scales 1-9; the wavelet coefficient matrix (points × scales); error against scale in $\log_{10}$ scale, with fitted line $y = -1.5028x - 5.4443$; the subset of the dictionary used for the digit on the left. The GWT subspaces have small dimension.]

The GWT of the data is sparse (left), and the approximation error decays fast, even when a manifold structure is missing.

SLIDE 20

Computational costs

Let $D$ be the ambient dimension, $n$ the number of points, $d$ the intrinsic dimension, and $\epsilon$ the accuracy to be achieved in representing the data.

The cost of the construction is
$$\underbrace{O_\epsilon(nD(\log(n) + d^2))}_{\text{GMRA}} + \underbrace{O_\epsilon(2^d D n \log n)}_{C_{j,k}\text{'s}}.$$
We can estimate the dependency on $\epsilon$: for example, it is $\log\frac{1}{\sqrt{\epsilon}}$ for noisy $C^2$ manifolds. The cost of the FGWT and IGWT of one point (projections onto the hierarchical family of low-dimensional planes, at the cost of one inner product in the ambient space each) is $D(d + \log n) + O_\epsilon(d^2 \log(n))$.

SLIDE 21

Learning Theory for GMRA

Given $x_1, \dots, x_n \sim \mu$, we need to construct the GMRA from these points: an empirical tree $\hat T$, planes $\hat\Phi_{j,k}$, and a projection $\hat P_j$ at each scale $j$. How do we measure success?
$$\mathrm{MSE} := E\|X - \hat P_j X\|^2 := E\int \|x - \hat P_j x\|^2 d\mu = \underbrace{\int \|x - P_j x\|^2 d\mu}_{\mathrm{bias}^2} + \underbrace{E\int \|P_j x - \hat P_j x\|^2 d\mu}_{\mathrm{variance}}$$
Regularity assumption on the bias term: $\lesssim 2^{-2js}$.

[Figure: $\log_{10}$(L2 error) versus $\log_{10}$(#cells), uniform vs. adaptive partitions; MSE on test data as a function of scale/#cells for a 3-dimensional S-manifold.]

SLIDE 22

Details: Tree and dyadic cubes construction

Geometric assumptions, $\mu$ approximately $d$-dimensional:
(A1) There exists a multiscale tree $T$ with multiscale partitions $\Lambda_j := \{C_{j,k}\}_k$.
(A2) $\#\Lambda_j \le 2^{jd}/\theta_1$.
(A3) $\mathrm{diam}(C_{j,k}) \le \theta_2 2^{-j}$.
(A4) Let $\lambda_1^{j,k} \ge \lambda_2^{j,k} \ge \cdots \ge \lambda_D^{j,k}$ be the eigenvalues of the covariance matrix of $\mu|_{C_{j,k}}$. Then $\lambda_d^{j,k} \ge \theta_3 2^{-2j}/d$ and $\lambda_{d+1}^{j,k} \le \frac{1}{2}\lambda_d^{j,k}$.

Tree construction. Points $x_1, \dots, x_n$ arrive in a streaming fashion and are added to create a tree $\hat T$ using the cover tree algorithm. Every node of $\hat T$ is a point. Properties:
· at scale $j$, nodes are at distance $\asymp 2^{-j}$;
· children of a node at scale $j$ are at distance $\asymp 2^{-j}$ from the parent;
· the tree grows with each point added, with no re-wiring needed: a new point becomes either a new leaf or a new root.

SLIDE 23

Details: Tree and dyadic cubes construction

Geometric assumptions (A1)-(A4) as on the previous slide.

Construction of $\{C_{j,k}\}$. Assume $\mu$ is doubling, with doubling dimension $d$. We proceed in a coarse-to-fine fashion, using Voronoi cells, and carving out suitable regions around points so that every point at scale $j$ has a ball of radius $\asymp 2^{-j}$ in its corresponding $C_{j,k}$. In practice: we start at the bottom with Voronoi cells, and we union them going up the tree, consistently with the tree structure.
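A simple numpy sketch in this spirit: farthest-point sampling produces a $2^{-j}$-net at each scale, and Voronoi assignment to the net points produces the cells (a simplification: unlike the construction above, it does not enforce consistency with a single tree across scales):

import numpy as np

def farthest_point_net(X, r, seed=0):
    """Indices of an r-net of the rows of X, by farthest-point sampling."""
    rng = np.random.default_rng(seed)
    idx = [int(rng.integers(len(X)))]
    dist = np.linalg.norm(X - X[idx[0]], axis=1)
    while dist.max() > r:
        i = int(dist.argmax())
        idx.append(i)
        dist = np.minimum(dist, np.linalg.norm(X - X[i], axis=1))
    return np.array(idx)

def multiscale_partition(X, j_max):
    """At each scale j, assign every point to the Voronoi cell of a 2^{-j}-net."""
    partitions = []
    for j in range(j_max + 1):
        net = farthest_point_net(X, 2.0 ** (-j))
        d2net = np.linalg.norm(X[:, None] - X[net][None], axis=2)
        partitions.append(d2net.argmin(axis=1))     # cell index for each point
    return partitions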

SLIDE 24

Details: Tree and dyadic cubes construction

Geometric assumptions (A1)-(A4) as above.

Lemma. For the algorithm just described, if the $x_i$ are i.i.d. samples from a doubling measure $\mu$, then with high probability (A1)-(A4) hold at all scales $j$ of $\hat T_n$ such that $n\,2^{-jd} \gtrsim 1$. Note that all this says is that $n_{j,k} \gtrsim 1$, since $n_{j,k} \asymp n\,2^{-jd}$.

SLIDE 25

Details: Local Principal Components

In each $\hat C_{j,k}$ we have $n_{j,k}$ (a random variable) points, and use them to compute a local Principal Component Analysis. If $X_{n,j,k} \in \mathbb{R}^{D \times n_{j,k}}$ represents the points in $\hat C_{j,k}$, factorize $X_{n,j,k} - E[X_{n,j,k}] = U_{n,j,k} \Sigma_{n,j,k} V_{n,j,k}^T$ using the SVD. Note that $X_{n,j,k}$ has numerical rank $d$. Then it is enough to have $n_{j,k} \gtrsim \epsilon^{-2} d \log d$ in order to have $U_{n,j,k}, \Sigma_{n,j,k}, V_{n,j,k}$ close to their expected values (i.e. to what one obtains for $n \to +\infty$). This follows from random matrix theory arguments.

Conclusion: pick the finest scale $j^*$ (so that we make the bias term small) such that with high probability $n_{j,k} \gtrsim \epsilon^{-2} d \log d$.

$$\mathrm{MSE} := E\|X - \hat P_j X\|^2 := E\int \|x - \hat P_j x\|^2 d\mu = \underbrace{\int \|x - P_j x\|^2 d\mu}_{\mathrm{bias}^2} + \underbrace{E\int \|P_j x - \hat P_j x\|^2 d\mu}_{\mathrm{variance}}$$

SLIDE 26

Multiscale Dictionary Learning

Theorem [MM, S. Minsker, N. Strawn, JMLR ’16] In terms of $n$ and of the smoothness $s$, the above becomes, for $j = j(n, s)$:
$$E\|x - P_j^* P_j(x)\|^2 \lesssim n^{-\frac{2s}{2s+d-2}}.$$

SLIDE 27

Simple example

[Figure: approximation using the uniform partition $\Lambda_j$ vs. the adaptive partition $\Lambda_\eta$; $j$ from coarse to fine; $\log_{10} \Delta_{j,k}$ across space.]

SLIDE 28

Key Ideas

[Figure: multiscale tree with cells $C_{0,0}$; $C_{1,0}, C_{1,1}$; $C_{2,0}, \dots, C_{2,3}$; projections $P_{1,1}(C_{1,1})$, $P_{2,3}(C_{2,3})$ and detail $\Delta_{2,3}$; edges with $\Delta_{j+1,k'} > \eta$ kept in $T_\eta$, those with $\Delta_{j'+1,k''} < \eta$ pruned; partitions $\Lambda_\eta$, $\Lambda_j$.]

Multiscale methods. Advantages: they can adapt to unknown regularity, can detect unknown relevant scales in the data, and can lead to fast greedy algorithms.

· Local, low-dimensional, “local-average” approximations $P_{j,k}(X)$ to the object of interest in each $C_{j,k}$.
· On each edge of the tree, a difference operator $\Delta_{j+1,k'}(X) = \int_{C_{j,k}} \|P_{j,k}(x) - P_{j+1,k'}(x)\|^2 d\mu(x)$.
· Let $T_\eta$ be the smallest tree containing all the $\Delta_{j,k}$'s larger than $\eta$.
· The estimator is given by reconstructing the object of interest using $\{P_{T_\eta}(X)\}$.

SLIDE 29

Geometric Approximation

Geometric assumptions (A1)-(A4) as above.

Geometric complexity/regularity of $\mu$. Let $d \ge 3$, $s > 0$. We say that $\mu \in \mathcal{B}_s$ if
$$|\mu|_{\mathcal{B}_s}^p := \sup_T \sup_{\eta > 0} \eta^p \sum_{j \ge 0} 2^{-2j} \#_j T_\eta < \infty, \qquad p = \frac{2(d-2)}{2s+d-2},$$
where $T$ varies over the set, assumed nonempty, of multiscale tree decompositions satisfying Assumptions (A1)-(A4). Roughly speaking: the number of important pieces needed to approximate $\mu$ is under control.

SLIDE 30

Theorem

Theorem [W. Liao, MM, ’16] Assume $\mathrm{supp}(\mu) \subseteq B_0(M)$ and let $\nu > 0$. There exists $\kappa_0(\theta_2, \theta_3, d, \nu)$ such that if $\tau_n = \kappa\sqrt{\log n/n}$ with $\kappa \ge \kappa_0$, the following holds: if $\mu \in \mathcal{B}_s$, then there exist $c_1, c_2$ such that
$$P\left( \|X - \hat P_{\hat\Lambda_{\tau_n}} X\| \ge c_1 \left(\tfrac{\log n}{n}\right)^{\frac{s}{2s+d-2}} \right) \le c_2 n^{-\nu}.$$

Algorithm - Adaptive GMRA
Input: data $X_n \in \mathbb{R}^{D \times n}$, dimension $d$, threshold $\kappa$.
Output: $\hat P_{\hat\Lambda_{\tau_n}}$: piecewise linear projection on an adaptive partition.
1: Construct $T_n$ and $\{C_{j,k}\}$ on half of $X_n$.
2: $\hat P_{j,k} := \mathrm{PCA}_d(X_n \cap C_{j,k})$, $\hat\Delta_{j,k}^2 := \frac{1}{n} \sum_{x \in C_{j,k}} \|\hat P_{j,k}(x) - \hat P_{j+1,k'}(x)\|^2$.
3: Let $\hat T_\eta$ be the tree obtained by thresholding at $\hat\Delta_{j,k} < \eta$. Let $\hat\Lambda_\eta$ be the partition of outer leaves.
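A sketch of steps 2-3 on tree nodes of the kind built in the earlier GMRA sketch. The in_cell membership test is an assumed helper; note also that this prunes greedily top-down, whereas $\hat T_\eta$ in the theorem is the smallest subtree containing every cell with $\hat\Delta_{j,k} \ge \eta$, which need not coincide with greedy pruning:

import numpy as np

def project(nd, x):
    return nd["Phi"] @ (nd["Phi"].T @ (x - nd["center"])) + nd["center"]

def refinement_stats(node, X, n):
    """Attach Delta^2 to each child: normalized squared difference between
    the parent's and the child's local projections over the child's points."""
    for ch in node["children"]:
        pts = [x for x in X if in_cell(ch, x)]      # in_cell: assumed helper
        ch["Delta2"] = sum(np.sum((project(node, x) - project(ch, x)) ** 2)
                           for x in pts) / n
        refinement_stats(ch, X, n)

def truncate(node, tau):
    """Keep a child (and recurse) only while its Delta exceeds the threshold."""
    node["children"] = [ch for ch in node["children"]
                        if ch.get("Delta2", 0.0) > tau ** 2]
    for ch in node["children"]:
        truncate(ch, tau)

def adaptive_partition(node):
    """Outer leaves of the truncated tree = the adaptive partition."""
    if not node["children"]:
        return [node]
    return [lf for ch in node["children"] for lf in adaptive_partition(ch)]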
SLIDE 31

Theorem - observations

· Adaptive: no need to know $s$.
· No curse of ambient dimensionality.
· $O(C^d n \log n)$ algorithm.
· $\hat P_{\hat\Lambda_{\tau_n}}$ is a piecewise linear projector onto an ensemble of $\#\Lambda_{\tau_n}$ planes: it provides a dictionary for the data, compression, denoising, compressive sampling, etc.

(Theorem [W. Liao, MM, ’16] as on the previous slide.)

SLIDE 32

Simple examples

[Figure: (a), (b) S manifold; (c), (d) Z manifold; (e), (f) $\log_{10}$(L2 error) versus $\log_{10}$(partition size) for the S and Z manifolds, uniform vs. adaptive partitions.]

Fitted slopes of log10(error) versus log10(partition size):

                 S manifold (e)           Z manifold (f)
               slope      theory        slope      theory
d=3 Uniform   -0.78825   -0.66667     -0.58149   -0.5
d=3 Adaptive  -0.76836   -0.66667     -0.77297   -0.75
d=4 Uniform   -0.59324   -0.5         -0.4914    -0.375
d=4 Adaptive  -0.61705   -0.5         -0.64198   -0.5
d=5 Uniform   -0.48404   -0.4         -0.40358   -0.3
d=5 Adaptive  -0.49943   -0.4         -0.48847   -0.375

SLIDE 33

Simple examples, cont’d

Top line: 3D shapes; bottom line: adaptive partitions in which every cell is colored by its numeric scale ($-\log_{10}$(radius)). The colors representing coarse to fine scales are ordered as: blue→green→yellow→red. It is noticeable that cells at irregular locations are selected at finer scales than cells at “flat” locations.

SLIDE 34

Measure Estimation

SLIDE 35

Geometric Measure Estimation

[Figure: GMRA tree, scale $j$ from coarse to fine.]

We use the GMRA construction above as a guide: at every “dyadic cube” $C_{j,k}$ in the GMRA tree (corresponding to a location and a scale), we try to approximate $\mu$ from the samples in $C_{j,k}$ by picking an element of some simple class of probability measures $\mathcal{F}$, e.g. Gaussians of dimension $k$. We then use a bias-variance tradeoff, as we go down the tree to finer and finer scales, to run a stopping-time argument.
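A minimal sketch of the local fit for a class $\mathcal{F}$ of low-rank Gaussians, cell by cell (illustrative names; the bias-variance stopping rule itself is not implemented here):

import numpy as np

def fit_low_rank_gaussian(Xc, k):
    """Fit N(mu, Phi diag(lam) Phi^T) with rank-k covariance to the rows of Xc."""
    mu = Xc.mean(0)
    _, S, Vt = np.linalg.svd(Xc - mu, full_matrices=False)
    return {"mu": mu, "Phi": Vt[:k].T, "lam": (S[:k] ** 2) / len(Xc)}

def sample(g, m, seed=0):
    """Draw m points from the fitted rank-k Gaussian."""
    rng = np.random.default_rng(seed)
    z = rng.standard_normal((m, len(g["lam"]))) * np.sqrt(g["lam"])
    return g["mu"] + z @ g["Phi"].T

def cell_mixture(cells, k):
    """Mixture over the cells of a partition, weighted by sample counts:
    a piecewise low-rank-Gaussian estimate of mu."""
    fits = [(len(Xc), fit_low_rank_gaussian(Xc, k)) for Xc in cells]
    total = sum(w for w, _ in fits)
    return [(w / total, g) for w, g in fits]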

SLIDE 36

Geometric Measure Estimation

(Slide text repeated from Slide 35.)

SLIDE 37

Geometric Measure Estimation

$\mathcal{F}$ is required to satisfy the following: there exist $\zeta > 0$ and constants $C_1, C_2 > 0$ such that for all $\nu \in \mathcal{F}$, if $z$ is a sample from $\nu^n$ and $\nu_z$ is the (random) empirical measure $\nu_z := \frac{1}{n}\sum_{i=1}^n \delta_{z_i}$, then for all $t > 1$ and all choices of $x_0 \in X$,
$$P\left( d(P_{\mathcal{F}}\nu_z, \nu) > \sqrt{\tfrac{\zeta}{n}}\, \|\nu\|_{2,x_0}\, t \right) < C_1 e^{-C_2 t^2}.$$
Example: for Gaussians of effective rank $d$, $\zeta = O(d)$.

SLIDE 38

Guarantees on the estimator

SLIDE 39

Generative models for data

We may use the GWT to model probability distributions in high dimensions that are essentially supported on unions of low-dimensional planes, and then use the model to generate new points. Here we take 2000 images of the digit 7 and, for each fixed $j$, learn a simple model by fitting a distribution in each $V_{j,k}$. Left: draws from such a distribution (“GWT”), for $j = 6$ [≈ 1 min.]; draws from a model of comparable complexity in SVD space (“SVD”) [≈ 0.3 min.]; draws from a state-of-the-art Bayesian model (“MFA”) [≈ 15 hrs.]. Right: as a function of the scale $j$, we compare draws from the model to a validation set by measuring the Hausdorff distance between point clouds; we also measure the variability in Hausdorff distance over multiple draws.
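The comparison on the right uses the Hausdorff distance between point clouds; a direct numpy implementation (memory scales with the product of the two cloud sizes):

import numpy as np

def hausdorff(A, B):
    """Hausdorff distance between point clouds A (m x D) and B (n x D)."""
    D2 = (A * A).sum(1)[:, None] + (B * B).sum(1)[None, :] - 2.0 * (A @ B.T)
    D2 = np.maximum(D2, 0.0)                       # guard tiny negatives
    d_ab = np.sqrt(D2.min(axis=1)).max()           # sup_{a in A} dist(a, B)
    d_ba = np.sqrt(D2.min(axis=0)).max()           # sup_{b in B} dist(b, A)
    return max(d_ab, d_ba)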

SLIDE 40

Generative models for data

[Figure: draws from the GWT, SVD, and MFA models; Hausdorff distances to training/validation as a function of the scale of the GWT model, with variability over draws. Accompanying text repeated from Slide 39.]

SLIDE 41

Regression

Algorithm - Adaptive Regression on Manifolds
Input: data $(X_n, Y_n) \in (\mathbb{R}^D \times \mathbb{R})^n$, dimension $d$, threshold $\kappa$.
Output: $\hat f_{\hat\Lambda_{\tau_n}}$: adaptive piecewise linear estimator of $f$ on an adaptive partition.
1: Construct $T_n$, $\{C_{j,k}\}$ and $\{P_{j,k}\}$ as in GMRA, on half of $X_n$.
2: Construct $\hat f_{j,k}$: local least-squares linear fit on $P_{j,k}(C_{j,k})$.
3: $\hat\Delta_{j,k}^2 := \frac{1}{n}\sum_{x \in C_{j,k}} \|\hat f_{j,k}(x) - \hat f_{j+1,k'}(x)\|^2$.
4: Threshold $T_n$ at $\hat\Delta_{j,k} < \eta$ to get $\hat T_\eta$. Let $\hat\Lambda_\eta$ be the partition of outer leaves.

(The encoding and regression problem statements from Slide 4 are repeated here.)

SLIDE 42
Multiscale Regression on Manifolds

· $\rho$: unknown probability measure on $\mathcal{M} \times \mathbb{R}$.
· Regression function $f_\rho(x) := E(y|x) = \int y \, d\rho(y|x)$.
· Random observations: $(x_i, y_i) \sim \rho$, $i = 1, \dots, n$.

1. Run GMRA on $\{x_i\}_{i=1}^n$.
2. Compute local coordinates in $\mathbb{R}^d$ (theoretical / empirical):
$$\pi_{j,k}(x) := \mathrm{Proj}_{V_{j,k}}(x - c_{j,k}), \qquad \hat\pi_{j,k}(x) := \mathrm{Proj}_{\hat V_{j,k}}(x - \hat c_{j,k}).$$
3. Compute a local linear estimator $f_{j,k} : C_{j,k} \to \mathbb{R}$ (least squares):
$$f_{j,k}(x) := [\pi_{j,k}(x) \;\; 2^{-j}] \cdot \beta_{j,k}, \qquad \hat f_{j,k}(x) := [\hat\pi_{j,k}(x) \;\; 2^{-j}] \cdot \hat\beta_{j,k}.$$
4. Construct a global linear estimator $f_\Lambda : \mathcal{M} \to \mathbb{R}$ on a partition $\Lambda$:
$$f_\Lambda := \sum_{C_{j,k} \in \Lambda} f_{j,k} \mathbf{1}_{C_{j,k}}, \qquad \hat f_\Lambda := \sum_{C_{j,k} \in \Lambda} \hat f_{j,k} \mathbf{1}_{C_{j,k}}.$$

Computational cost: GMRA + $O(n \log n)$.
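A numpy sketch of steps 2-3 above: project a cell's points to the local $d$-dimensional coordinates, append the $2^{-j}$ feature, and solve least squares for $\beta_{j,k}$ (the cell's points, center and local basis are assumed given, e.g. from a GMRA):

import numpy as np

def local_regressor(Xc, yc, center, Phi, j):
    """Fit f_{j,k}(x) = [pi_{j,k}(x), 2^{-j}] . beta on one cell C_{j,k}.
    Xc: n_c x D points in the cell; yc: n_c responses;
    center: c_{j,k}; Phi: D x d local basis."""
    pi = (Xc - center) @ Phi                        # local coordinates, n_c x d
    A = np.hstack([pi, np.full((len(Xc), 1), 2.0 ** (-j))])
    beta, *_ = np.linalg.lstsq(A, yc, rcond=None)

    def f(x):                                       # the local estimator
        feats = np.concatenate([Phi.T @ (x - center), [2.0 ** (-j)]])
        return float(feats @ beta)

    return f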

SLIDE 43
· Refinement criterion on $C_{j,k}$ (theoretical / empirical):
$$\Delta_{j,k} = \|(f_{\Lambda_j} - f_{\Lambda_{j+1}})\mathbf{1}_{C_{j,k}}\|, \qquad \hat\Delta_{j,k} = \Big(\tfrac{1}{n}\sum_i \|(\hat f_{\Lambda_j} - \hat f_{\Lambda_{j+1}})\mathbf{1}_{C_{j,k}}(x_i)\|^2\Big)^{1/2}.$$
· Truncate the tree: let $T_\tau$ be the smallest subtree containing $\{C_{j,k} : \Delta_{j,k} \ge \tau\}$.
· The adaptive partition $\Lambda_\tau$ (empirical: $\hat\Lambda_\tau$) consists of the outer leaves of $T_\tau$.
· Model class: $f_\rho \in \mathcal{B}_s$ if $\sup_{\text{tree } T} \sup_{\tau > 0} \tau^{\frac{2d}{2s+d}} \# T_\tau < \infty$.

Theorem 2. Suppose $|y| \le M$. Let $\nu > 0$. There exists $\kappa_0$ such that if $f_\rho \in \mathcal{B}_s$ for some $s > 0$ and $\tau_n = \kappa\sqrt{\log n/n}$ with $\kappa \ge \kappa_0$, there is a $\tilde c$ such that
$$P\left( \|f_\rho - \hat f_{\hat\Lambda_{\tau_n}}\| \ge \tilde c \left(\tfrac{\log n}{n}\right)^{\frac{s}{2s+d}} \right) \lesssim n^{-\nu}.$$
As a result, $\mathrm{MSE} \lesssim (\log n / n)^{\frac{2s}{2s+d}}$.

[Figure: numerical experiments, adaptive approximation.]

SLIDE 44

Regression

(Algorithm - Adaptive Regression on Manifolds, as on Slide 41.)

Theorem [W. Liao, S. Vigogna, MM, ’16] Assume $\mathrm{supp}(\mu)$ is bounded, $|y| < M$, and let $\nu > 0$. There exists $\kappa_0(\vartheta, d, M, \nu)$ such that if $\tau_n = \kappa\sqrt{\log n/n}$ with $\kappa \ge \kappa_0$, the following holds: if $f \in \mathcal{B}_s$, i.e. $\sup_T \sup_{\eta > 0} \eta^p \# T_\eta < \infty$ with $p = 2d/(2s+d)$, then there exist $c_1, c_2$ such that
$$P\left( \|f - \hat f_{\hat\Lambda_{\tau_n}}\|_{L^2(\mu)} \ge c_1 \left(\tfrac{\log n}{n}\right)^{\frac{s}{2s+d}} \right) \le c_2 n^{-\nu}.$$

SLIDE 45

Regression

· Adaptive: no need to know $s$.
· No curse of ambient dimensionality.
· $O(C^d n \log n)$ algorithm.
· $\hat f_{\hat\Lambda_{\tau_n}}$ is a piecewise linear approximation to $f$, defined on piecewise linear approximations onto an ensemble of $\#\Lambda_{\tau_n}$ planes.

(Theorem [W. Liao, S. Vigogna, MM, ’16] as on the previous slide.)

SLIDE 46

Conclusions

(Joint work with W. Liao and S. Vigogna.)

· Introduced estimators for geometric, measure, and function approximation.
· Multiscale, adaptive, greedy, fast; with strong finite-sample guarantees.
· Rates are optimal for regression: they are the same as if $f$ were defined on a known Euclidean domain in $\mathbb{R}^d$, instead of an unknown $d$-dimensional manifold in $\mathbb{R}^D$. Open questions remain in the geometric case.
· Constructions are robust with respect to noise.
· It pays to take advantage of manifold structure in order to do regression.

THANK YOU! www.math.jhu.edu/~mauro