

Slide 1

Large Graph Limits of Learning Algorithms
Matt Dunlop, Xiyang (Michael) Luo

Computing and Mathematical Sciences, Caltech; Department of Mathematics, UCLA

⋆ Andrea Bertozzi (UCLA), Xiyang Luo (UCLA), Andrew Stuart (Caltech) and Kostas Zygalakis (Edinburgh)

JUQ, to appear

⋆ Matt Dunlop (Caltech), Dejan Slepčev (CMU), Andrew Stuart (Caltech) and Matt Thorpe (Cambridge)

In preparation

Slide 2

Talk Overview

Learning and Inverse Problems
Graph Laplacian
Inverse Problem Formulation
Large Graph Limits
Probability
Conclusions

Slide 3

Talk Overview

Learning and Inverse Problems
Graph Laplacian
Inverse Problem Formulation
Large Graph Limits
Probability
Conclusions

Slide 4

Regression

Let $D \subset \mathbb{R}^d$ be a bounded open set. Let $D' \subset D$.

Ill-Posed Inverse Problem

Find u : D → R given y(x) = u(x), x ∈ D′. Strong prior information needed.

Slide 5

Classification

Let $D \subset \mathbb{R}^d$ be a bounded open set. Let $D' \subset D$.

Ill-Posed Inverse Problem

Find $u : D \to \mathbb{R}$ given $y(x) = \mathrm{sign}(u(x))$, $x \in D'$. Even stronger prior information needed.

Slide 6

y = sign(u). Red: +1. Blue: −1. Yellow: no information.

Slide 7

Reconstruction of the function u on D

Slide 8

Talk Overview

Learning and Inverse Problems Graph Laplacian Inverse Problem Formulation Large Graph Limits Probability Conclusions

Slide 9

Graph Laplacian

Similarity graph $G$ with $n$ vertices $Z = \{1, \dots, n\}$. Weighted adjacency matrix $W = \{w_{jk}\}$,
$$w_{jk} = \eta_\varepsilon(x_j - x_k).$$
Diagonal matrix $D = \mathrm{diag}\{d_{jj}\}$, $d_{jj} = \sum_{k \in Z} w_{jk}$.

$L = s_n(D - W)$ (unnormalized).

Spectral properties: $L$ is positive semi-definite:
$$\langle u, Lu \rangle_{\mathbb{R}^n} \propto \sum_{j \sim k} w_{jk}\, |u_j - u_k|^2.$$

Eigenpairs $Lq_j = \lambda_j q_j$; fully connected $\Rightarrow \lambda_1 > \lambda_0 = 0$. Fiedler vector: $q_1$.
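In code, the construction reads roughly as follows; a minimal NumPy sketch in which the Gaussian profile for η, the function name, and the omission of the scaling $s_n$ are illustrative choices, not fixed by the slides.

```python
import numpy as np

def graph_laplacian(X, eps, eta=lambda r: np.exp(-r**2)):
    """Unnormalized graph Laplacian D - W built from feature vectors.

    X is the (n, d) array of feature vectors x_j, eps the kernel
    bandwidth, eta the similarity profile (a Gaussian here; the slides
    do not fix a choice). The scaling s_n is left to the caller.
    """
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    W = eta(dists / eps)           # weights w_jk = eta_eps(x_j - x_k)
    np.fill_diagonal(W, 0.0)       # no self-loops
    D = np.diag(W.sum(axis=1))     # degree matrix, d_jj = sum_k w_jk
    return D - W                   # positive semi-definite by construction
```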

Slide 10

Example: Voting Records

U.S. House of Representatives, 1984, 16 key votes. For each representative we have an associated feature vector $x_j \in \mathbb{R}^{16}$, e.g. $x_j = (1, -1, 0, \dots, 1)^T$; 1 is "yes", −1 is "no", and 0 is abstain/no-show. Here d = 16 and n = 435.

Figure: Strong prior information: Fiedler vector and spectrum (normalized).

Slide 11

Example of Underlying Gaussian (Voting Records)

Figure: Two-point correlation of sign(u) for 3 Democrats.

Slide 12

Talk Overview

Learning and Inverse Problems
Graph Laplacian
Inverse Problem Formulation
Large Graph Limits
Probability
Conclusions

Slide 13

Problem Statement (Optimization)

Semi-Supervised Learning

Input:
Unlabelled data: $x_j \in \mathbb{R}^d$, $j \in Z := \{1, \dots, n\}$.
Labelled data: $y_j \in \{\pm 1\}$, $j \in Z' \subset Z$.

Output:
Labels $y_j \in \{\pm 1\}$, $j \in Z$.

Classification is based on sign(u), with u the minimizer of
$$J(u; y) = \tfrac{1}{2}\langle u, C^{-1}u \rangle_{\mathbb{R}^n} + \Phi(u; y).$$
Here u is an $\mathbb{R}$-valued function on the graph nodes; $C = (L + \tau^2 I)^{-\alpha}$ is built from the unlabelled data via $w_{jk} = \eta_\varepsilon(x_j - x_k)$; and $\Phi(u; y)$ links the real-valued u to the binary-valued labels y.
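The quadratic term can be evaluated through the spectrum of L, since $C^{-1} = (L + \tau^2 I)^\alpha$ shares its eigenvectors. A hedged sketch, assuming a dense eigendecomposition is affordable (for large graphs a low-rank spectral approximation would replace it; the interface is illustrative):

```python
import numpy as np

def quadratic_term(u, L, tau, alpha):
    """Evaluate (1/2) <u, C^{-1} u> with C = (L + tau^2 I)^(-alpha).

    Works in the eigenbasis of L, so C^{-1} never has to be formed
    as a dense matrix power or inverse.
    """
    lam, Q = np.linalg.eigh(L)        # L = Q diag(lam) Q^T
    coeffs = Q.T @ u                  # expand u in the eigenbasis
    return 0.5 * np.sum((lam + tau**2) ** alpha * coeffs**2)
```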

Slide 14

Problem Statement (Bayesian Formulation)

Semi-Supervised Learning

Input:
Unlabelled data: $x_j \in \mathbb{R}^d$, $j \in Z := \{1, \dots, n\}$ (prior).
Labelled data: $y_j \in \{\pm 1\}$, $j \in Z' \subseteq Z$ (likelihood).

Output:
Labels $y_j \in \{\pm 1\}$, $j \in Z$ (posterior).

Connection between probability and optimization:
$$J^{(n)}(u; y) = \tfrac{1}{2}\langle u, C^{-1}u \rangle_{\mathbb{R}^n} + \Phi^{(n)}(u; y),$$
$$\mathbb{P}(u|y) \propto \exp\bigl(-J^{(n)}(u; y)\bigr) \propto \exp\bigl(-\Phi^{(n)}(u; y)\bigr) \times N(0, C) \propto \mathbb{P}(y|u) \times \mathbb{P}(u).$$

Slide 15

Probit

Rasmussen and Williams, 2006 (MIT Press).
Bertozzi, Luo, Stuart and Zygalakis, 2017 (SIAM-JUQ).

Probit Model

$$J^{(n)}_p(u; y) = \tfrac{1}{2}\langle u, C^{-1}u \rangle_{\mathbb{R}^n} + \Phi^{(n)}_p(u; y).$$
Here $C = (L + \tau^2 I)^{-\alpha}$ and
$$\Phi^{(n)}_p(u; y) := -\sum_{j \in Z'} \log \Psi(y_j u_j; \gamma),$$
where $\Psi$ is the smoothed Heaviside function
$$\Psi(v; \gamma) = \frac{1}{\sqrt{2\pi\gamma^2}} \int_{-\infty}^{v} \exp\bigl(-t^2/2\gamma^2\bigr)\, dt.$$
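Since $\Psi(v; \gamma)$ is the $N(0, \gamma^2)$ CDF, the probit potential can be evaluated stably through a log-CDF. A small sketch with an illustrative interface (`labelled` indexes $Z'$):

```python
import numpy as np
from scipy.stats import norm

def probit_potential(u, y, labelled, gamma):
    """Phi_p(u; y) = -sum_{j in Z'} log Psi(y_j u_j; gamma).

    Psi(.; gamma) is the N(0, gamma^2) CDF, so norm.logcdf stays
    numerically stable even when y_j u_j is very negative.
    """
    v = y[labelled] * u[labelled]
    return -np.sum(norm.logcdf(v / gamma))
```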

Slide 16

Level Set

Iglesias, Lu and Stuart, 2016 (IFB).

Level Set Model

$$J^{(n)}_{ls}(u; y) = \tfrac{1}{2}\langle u, C^{-1}u \rangle_{\mathbb{R}^n} + \Phi^{(n)}_{ls}(u; y).$$
Here $C = (L + \tau^2 I)^{-\alpha}$, and
$$\Phi^{(n)}_{ls}(u; y) := \frac{1}{2\gamma^2} \sum_{j \in Z'} |y_j - \mathrm{sign}(u_j)|^2.$$
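For comparison, the level set potential is a one-liner under the same illustrative interface as the probit sketch above:

```python
import numpy as np

def level_set_potential(u, y, labelled, gamma):
    """Phi_ls(u; y) = (1 / (2 gamma^2)) sum_{j in Z'} |y_j - sign(u_j)|^2."""
    misfit = y[labelled] - np.sign(u[labelled])
    return np.sum(misfit**2) / (2.0 * gamma**2)
```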

Slide 17

Sampling Algorithm

Cotter, Roberts, Stuart and White, 2013 (Statist. Sci.).

The preconditioned Crank-Nicolson (pCN) Method

1: Define $\alpha(u, v) = \min\{1, \exp(\Phi(u) - \Phi(v))\}$, $C = (L + \tau^2 I)^{-\alpha}$.
2: while k < M do
3:   $v^{(k)} = \sqrt{1 - \beta^2}\, u^{(k)} + \beta \xi^{(k)}$, where $\xi^{(k)} \sim N(0, C)$.
4:   Calculate the acceptance probability $\alpha(u^{(k)}, v^{(k)})$.
5:   Accept: $u^{(k+1)} = v^{(k)}$ with probability $\alpha(u^{(k)}, v^{(k)})$; otherwise
6:   Reject: $u^{(k+1)} = u^{(k)}$.
7: end while
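A compact Python sketch of this loop, assuming a precomputed square root of C (for instance from the eigendecomposition of L); names and interface are illustrative:

```python
import numpy as np

def pcn(phi, C_sqrt, u0, beta, M, seed=0):
    """pCN sampling of P(du|y) ~ exp(-Phi(u)) N(0, C).

    phi is the potential (probit or level set); C_sqrt is a matrix
    square root of C, so C_sqrt @ z ~ N(0, C) for z ~ N(0, I).
    """
    rng = np.random.default_rng(seed)
    u, n, samples = u0.copy(), len(u0), []
    for _ in range(M):
        xi = C_sqrt @ rng.standard_normal(n)         # xi ~ N(0, C)
        v = np.sqrt(1.0 - beta**2) * u + beta * xi   # pCN proposal
        # alpha(u, v) = min(1, exp(Phi(u) - Phi(v)))
        if rng.random() < min(1.0, np.exp(phi(u) - phi(v))):
            u = v                                    # accept
        samples.append(u.copy())                     # else keep u (reject)
    return np.array(samples)
```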

Bertozzi, Luo, Stuart, 2018 (in preparation):
$$\mathbb{E}(\alpha(u, v)) = O(Z_0^2), \qquad Z_0 = \mu\bigl(\{S(u^{(j)}) = y^{(j)} \mid j \in Z'\}\bigr).$$

Slide 18

Example of UQ (Hyperspectral)

Here d = 129 and N ≈ 3 × 10^5. Use the Nyström approximation.

Figure: Spectral approximation. Uncertain classification in red.

Slide 19

Talk Overview

Learning and Inverse Problems
Graph Laplacian
Inverse Problem Formulation
Large Graph Limits
Probability
Conclusions

Slide 20

Limit Theorem for the Dirichlet Energy

García-Trillos and Slepčev, 2016 (ACHA).

Unlabelled data $\{x_j\}$ sampled i.i.d. from density $\rho$ supported on bounded $D \subset \mathbb{R}^d$. Let
$$\mathcal{L}u = -\frac{1}{\rho} \nabla \cdot \bigl(\rho^2 \nabla u\bigr), \quad x \in D, \qquad \frac{\partial u}{\partial n} = 0, \quad x \in \partial D.$$

Theorem 2
Let $s_n = \frac{2}{C(\eta)\, n \varepsilon^2}$. Then under connectivity conditions on $\varepsilon = \varepsilon(n)$ in $\eta_\varepsilon$, the scaled Dirichlet energy $\Gamma$-converges in the $TL^2$ metric:
$$\frac{1}{n} \langle u, Lu \rangle_{\mathbb{R}^n} \to \langle u, \mathcal{L}u \rangle_{L^2_\rho}$$
as $n \to \infty$.
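A quick numerical sanity check of Theorem 2 in one dimension, under stated assumptions: uniform ρ on D = [0, 1], profile $\eta = 1_{[0,1]}$ with the normalization $C(\eta) = \int \eta(|t|)\, t^2\, dt = 2/3$, and $u(x) = \cos(\pi x)$, whose continuum energy $\int \rho^2 |u'|^2\, dx$ equals $\pi^2/2$. Agreement is only approximate at finite n because of boundary and sampling effects.

```python
import numpy as np

# Sample n points i.i.d. from rho = uniform on D = [0, 1]
rng = np.random.default_rng(0)
n, eps = 2000, 0.05
x = rng.uniform(0.0, 1.0, n)
u = np.cos(np.pi * x)                 # satisfies the Neumann condition

# eta_eps(t) = (1/eps) 1_{|t| <= eps}, with C(eta) = 2/3 in d = 1
W = (np.abs(x[:, None] - x[None, :]) <= eps) / eps
np.fill_diagonal(W, 0.0)
L = np.diag(W.sum(axis=1)) - W
s_n = 2.0 / ((2.0 / 3.0) * n * eps**2)

discrete = s_n * (u @ L @ u) / n      # (1/n) <u, Lu>
exact = np.pi**2 / 2                  # int_0^1 rho^2 |u'|^2 dx
print(discrete, exact)                # close, up to boundary/sampling error
```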

Slide 21

Limit Theorem for Probit

Dunlop, Slepčev, Stuart and Thorpe, in preparation, 2018.

Let $D_\pm$ be two disjoint bounded subsets of $D$; define $D' = D_+ \cup D_-$ and $y(x) = +1$ for $x \in D_+$, $y(x) = -1$ for $x \in D_-$. Assume that $\#D_n/n \to \mathrm{const.}$ as $n \to \infty$. For $\alpha > 0$, define $C = (\mathcal{L} + \tau^2 I)^{-\alpha}$. Recall $\mathcal{L}u = -\frac{1}{\rho}\nabla \cdot (\rho^2 \nabla u)$, with no-flux boundary conditions.

Theorem 3
Let $s_n = \frac{2}{C(\eta)\, n \varepsilon^2}$. Then under connectivity conditions on $\varepsilon = \varepsilon(n)$, the scaled probit objective function $\Gamma$-converges in the $TL^2$ metric:
$$\frac{1}{n} J^{(n)}_p(u; y) \to J_p(u; y)$$
as $n \to \infty$, where
$$J_p(u; y) = \tfrac{1}{2}\langle u, C^{-1}u \rangle_{L^2_\rho} + \Phi_p(u; y), \qquad \Phi_p(u; y) := -\int_{D'} \log \Psi\bigl(y(x)\, u(x); \gamma\bigr)\, \rho(x)\, dx.$$
Slide 22

Limit Theorem for Probit

Dunlop, Slepčev, Stuart and Thorpe, in preparation, 2018.

Assume now that $\#D_n$ is fixed as $n \to \infty$.

Theorem 4
Let $s_n = \frac{2}{C(\eta)\, n \varepsilon^2}$ with $\varepsilon = \varepsilon(n, \alpha)$. Suppose that either
1. $\alpha > d/2$ and $\varepsilon(n, \alpha)\, n^{\frac{1}{2\alpha}} \to \infty$; or
2. $\alpha < d/2$.
Then with probability one, sequences of minimizers of $J^{(n)}_p$ converge to zero in the $TL^2$ metric.

Slide 23

Talk Overview

Learning and Inverse Problems
Graph Laplacian
Inverse Problem Formulation
Large Graph Limits
Probability
Conclusions

Slide 24

Example (PDE Two Moons – Unlabelled Data)

Figure: Sampling density ρ of unlabelled data.

Slide 25

Example (PDE Two Moons – Labelled Data)

Figure: Labelled Data.

Slide 26

Example (PDE Two Moons – Fiedler Vector of L)

Figure: Fiedler Vector.

Slide 27

Example (PDE Two Moons – Posterior Labelling)

Figure: Posterior mode of u and sign(u).

Slide 28

Connecting Probit, Level Set and Regression

Dunlop, Slepčev, Stuart and Thorpe, in preparation, 2018.

Probit and Level Set Probabilistic Models

Prior: Gaussian $\mathbb{P}(du) = N(0, C)$.
Probit posterior: $\mathbb{P}_\gamma(du|y) \propto \exp\bigl(-\Phi_p(u; y)\bigr)\, \mathbb{P}(du)$.
Level set posterior: $\mathbb{P}_\gamma(du|y) \propto \exp\bigl(-\Phi_{ls}(u; y)\bigr)\, \mathbb{P}(du)$.

Theorem 5
Let $\alpha > \frac{d}{2}$. We have $\mathbb{P}_\gamma(u|y) \Rightarrow \mathbb{P}(u|y)$ as $\gamma \to 0$, where
$$\mathbb{P}(du|y) \propto \mathbf{1}_A(u)\, \mathbb{P}(du), \qquad \mathbb{P}(du) = N(0, C), \qquad A = \{u : \mathrm{sign}(u(x)) = y(x),\ x \in D'\}.$$
Compare with regression (Zhu, Ghahramani, Lafferty, ICML 2003), where
$$A \to A_0 = \{u : u(x) = y(x),\ x \in D'\}.$$

Slide 29

Example (MNIST: Human-in-the-loop labelling)


Figure: 100 most uncertain digits, 200 labels. Mean uncertainty: 14.0%

Slide 30

Example (MNIST)


Figure: 100 most uncertain digits, 300 labels. Mean uncertainty: 10.3%

Slide 31

Example (MNIST)


Figure: 100 most uncertain digits, 400 labels. Mean uncertainty: 8.1%

Slide 32

Talk Overview

Learning and Inverse Problems
Graph Laplacian
Inverse Problem Formulation
Large Graph Limits
Probability
Conclusions

Slide 33

Summary: Graph Based Learning

Single optimization framework for classification algorithms.
Single Bayesian framework for classification algorithms.
Large graph limit reveals novel inverse problem structure.
Links between probit, level set and regression.
Gaussian measure conditioned on its sign.
UQ for human-in-the-loop learning.
Efficient MCMC algorithms.

Slide 34

References

X Zhu, Z Ghahramani and J Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, ICML, 2003. [Harmonic functions.]
C Rasmussen and C Williams, Gaussian Processes for Machine Learning, MIT Press, 2006. [Probit.]
AL Bertozzi, X Luo and AM Stuart, Computational cost of sampling methods for semi-supervised learning on large graphs, in preparation, 2018.
MA Iglesias, Y Lu and AM Stuart, Bayesian level set method for geometric inverse problems, Interfaces and Free Boundaries, 2016. [Level set.]
AL Bertozzi, X Luo, AM Stuart and K Zygalakis, Uncertainty quantification in the classification of high dimensional data, https://arxiv.org/abs/1703.08816, 2017. [Probit on a graph.]
N García-Trillos and D Slepčev, A variational approach to the consistency of spectral clustering, ACHA, 2017.
M Dunlop, D Slepčev, AM Stuart and M Thorpe, Large data and zero noise limits of graph based semi-supervised learning algorithms, in preparation, 2018.
N García-Trillos and D Sanz-Alonso, Continuum limit of posteriors in graph Bayesian inverse problems, https://arxiv.org/abs/1706.07193, 2017.

Slide 35

pCN

$$\alpha(u, v) = \min\{1, \exp(\Phi(u) - \Phi(v))\}.$$

The preconditioned Crank-Nicolson (pCN) Method

1: while k < M do
2:   $v^{(k)} = \sqrt{1 - \beta^2}\, u^{(k)} + \beta \xi^{(k)}$, where $\xi^{(k)} \sim N(0, C)$.
3:   Accept: $u^{(k+1)} = v^{(k)}$ with probability $\alpha(u^{(k)}, v^{(k)})$; otherwise
4:   Reject: $u^{(k+1)} = u^{(k)}$.
5: end while

Why pCN? For a given acceptance probability, $\beta$ is independent of $N = |Z|$. Can exploit approximation of the graph Laplacian (Nyström) and · · ·

Slide 36

Example of UQ (Two Moons)

Recall that d = 102, N = 2 × 10^3.

Figure: Average label posterior variance vs. feature-vector noise σ.

Slide 37

Example of UQ (MNIST)

Here d = 784 and N = 4000.

Figure: “Low confidence” vs “High confidence” nodes in MNIST49 graph.

Slide 38

Saturation of Spectra in Applications

Karhunen-Loève: if $Lq_j = \lambda_j q_j$ then $u \sim N(0, C)$ is
$$u = c^{\frac12} \sum_{j=1}^{N-1} (\lambda_j + \tau^2)^{-\frac{\alpha}{2}} q_j z_j, \qquad z_j \sim N(0, 1) \ \text{i.i.d.} \tag{1}$$
The spectrum of the graph Laplacian often saturates as $j \to N - 1$.
Spectral projection $\iff \lambda_k := \infty$ for $k \geq \ell$.
Spectral approximation: set $\lambda_k$ to some $\bar{\lambda} < \infty$ for $k \geq \ell$.

Figure: Two Moons, Hyperspectral, Voting Records.
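A sketch of drawing prior samples via (1), with the constant $c^{1/2}$ absorbed into the caller's scaling; zeroing trailing modes reproduces the spectral projection, while capping the eigenvalues gives the spectral approximation. Names and interface are illustrative.

```python
import numpy as np

def sample_prior(L, tau, alpha, ell=None, seed=0):
    """Draw u ~ N(0, C), C = (L + tau^2 I)^(-alpha), via (1)."""
    rng = np.random.default_rng(seed)
    lam, Q = np.linalg.eigh(L)               # L q_j = lambda_j q_j
    z = rng.standard_normal(len(lam))        # z_j ~ N(0, 1) i.i.d.
    coeff = (lam + tau**2) ** (-alpha / 2.0) * z
    coeff[0] = 0.0                           # sum in (1) starts at j = 1
    if ell is not None:
        coeff[ell:] = 0.0                    # spectral projection, k >= ell
    return Q @ coeff
```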

Slide 39

Example of UQ (Voting)

Recall that d = 16 and N = 435. Mean Absolute Error: Projection: 0.1577, Approximation: 0.0261.

Figure: Mean Label Posterior. Compare Full (black), Spectral Approximation (red) and Spectral Projection (blue).

Slide 40

Example of UQ (Hyperspectral)

Here d = 129 and N ≈ 3 × 10^5. Use the Nyström approximation.

Figure: Spectral approximation. Uncertain classification in red.
