Large Graph Limits of Learning Algorithms, Matt Dunlop




  1. Large Graph Limits of Learning Algorithms.
  Matt Dunlop, Xiyang (Michael) Luo. Computing and Mathematical Sciences, Caltech; Department of Mathematics, UCLA.
  Andrea Bertozzi (UCLA), Xiyang Luo (UCLA), Andrew Stuart (Caltech) and Kostas Zygalakis (Edinburgh). JUQ, to appear.
  ⋆ Matt Dunlop (Caltech), Dejan Slepčev (CMU), Andrew Stuart (Caltech) and Matt Thorpe (Cambridge). In preparation.

  2. Talk Overview: Learning and Inverse Problems; Graph Laplacian; Inverse Problem Formulation; Large Graph Limits; Probability; Conclusions.

  3. Talk Overview: Learning and Inverse Problems; Graph Laplacian; Inverse Problem Formulation; Large Graph Limits; Probability; Conclusions.

  4. Regression. Let D ⊂ R^d be a bounded open set and let D′ ⊂ D. Ill-posed inverse problem: find u : D → R given y(x) = u(x), x ∈ D′. Strong prior information needed.

  5. Classification. Let D ⊂ R^d be a bounded open set and let D′ ⊂ D. Ill-posed inverse problem: find u : D → R given y(x) = sign(u(x)), x ∈ D′. Even stronger prior information needed.

  6. y = sign(u). Red = 1. Blue = −1. Yellow: no information.

  7. Reconstruction of the function u on D.

  8. Talk Overview: Learning and Inverse Problems; Graph Laplacian; Inverse Problem Formulation; Large Graph Limits; Probability; Conclusions.

  9. Graph Laplacian. Similarity graph G with n vertices Z = {1, ..., n}.
  Weighted adjacency matrix W = {w_{j,k}}, w_{j,k} = η_ε(x_j − x_k).
  Diagonal D = diag{d_jj}, d_jj = Σ_{k∈Z} w_{j,k}.
  L = s_n(D − W) (unnormalized).
  Spectral properties: L is positive semi-definite, ⟨u, Lu⟩_{R^n} ∝ Σ_{j∼k} w_{j,k} |u_j − u_k|².
  Eigenpairs Lq_j = λ_j q_j; fully connected ⇒ λ_1 > λ_0 = 0. Fiedler vector: q_1.
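To make the construction concrete, here is a minimal Python sketch of the unnormalized graph Laplacian and its Fiedler vector. The Gaussian choice of η_ε, the synthetic data, and the function names are illustrative assumptions, not the exact settings from the talk.

```python
import numpy as np

def graph_laplacian(X, eps, s_n=1.0):
    """Unnormalized graph Laplacian L = s_n (D - W) with w_{j,k} = eta_eps(x_j - x_k)."""
    # Pairwise squared distances between the feature vectors x_j in R^d.
    dist2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Assumed Gaussian similarity kernel eta_eps (any suitably decaying kernel works).
    W = np.exp(-dist2 / (2 * eps ** 2))
    np.fill_diagonal(W, 0.0)
    D = np.diag(W.sum(axis=1))          # d_jj = sum_{k in Z} w_{j,k}
    return s_n * (D - W)

# Fiedler vector q_1: eigenvector of the second-smallest eigenvalue of L.
X = np.random.default_rng(0).standard_normal((100, 2))   # synthetic unlabelled data
L = graph_laplacian(X, eps=0.5)
lam, Q = np.linalg.eigh(L)              # lam[0] = 0 for a fully connected graph
fiedler = Q[:, 1]
```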

  10. Example: Voting Records. U.S. House of Representatives 1984, 16 key votes. For each congressional representative we have an associated feature vector x_j ∈ R^16, such as x_j = (1, −1, 0, ..., 1)^T; 1 is “yes”, −1 is “no” and 0 is abstain/no-show. Here d = 16 and n = 435. Figure: Strong prior information: Fiedler vector and spectrum (normalized).

  11. Example of Underlying Gaussian (Voting Records). Figure: Two-point correlation of sign(u) for 3 Democrats.

  12. Talk Overview: Learning and Inverse Problems; Graph Laplacian; Inverse Problem Formulation; Large Graph Limits; Probability; Conclusions.

  13. Problem Statement (Optimization). Semi-supervised learning.
  Input: unlabelled data {x_j ∈ R^d, j ∈ Z := {1, ..., n}}; labelled data {y_j ∈ {±1}, j ∈ Z′ ⊂ Z}.
  Output: labels {y_j ∈ {±1}, j ∈ Z}.
  Classification based on sign(u), with u the optimizer of
  J(u; y) = ½⟨u, C^{−1}u⟩_{R^n} + Φ(u; y).
  Here u is an R-valued function on the graph nodes, C = (L + τ²I)^{−α} is built from the unlabelled data via w_{j,k} = η_ε(x_j − x_k), and Φ(u; y) links the real-valued u to the binary-valued labels y.
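As a rough Python sketch, the quadratic (prior) term can be evaluated without ever forming C, by working with the precision C^{-1} = (L + τ²I)^α via an eigendecomposition; the function names and the generic misfit argument below are illustrative assumptions.

```python
import numpy as np

def prior_precision(L, tau, alpha):
    """C^{-1} = (L + tau^2 I)^alpha, computed via an eigendecomposition of L."""
    lam, Q = np.linalg.eigh(L)
    return (Q * (lam + tau ** 2) ** alpha) @ Q.T

def objective(u, y, labelled, C_inv, misfit):
    """J(u; y) = 0.5 <u, C^{-1} u> + Phi(u; y) for a generic misfit Phi."""
    return 0.5 * u @ C_inv @ u + misfit(u, y, labelled)
```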

  14. Problem Statement (Bayesian Formulation). Semi-supervised learning.
  Input: unlabelled data {x_j ∈ R^d, j ∈ Z := {1, ..., n}} (prior); labelled data {y_j ∈ {±1}, j ∈ Z′ ⊆ Z} (likelihood).
  Output: labels {y_j ∈ {±1}, j ∈ Z} (posterior).
  Connection between probability and optimization:
  J^(n)(u; y) = ½⟨u, C^{−1}u⟩_{R^n} + Φ^(n)(u; y),
  P(u | y) ∝ exp(−J^(n)(u; y)) ∝ exp(−Φ^(n)(u; y)) × N(0, C) ∝ P(y | u) × P(u).

  15. Probit. Rasmussen and Williams, 2006 (MIT Press); Bertozzi, Luo, Stuart and Zygalakis, 2017 (SIAM-JUQ).
  Probit model: J_p^(n)(u; y) = ½⟨u, C^{−1}u⟩_{R^n} + Φ_p^(n)(u; y).
  Here C = (L + τ²I)^{−α} and Φ_p^(n)(u; y) := −Σ_{j∈Z′} log Ψ(y_j u_j; γ),
  where Ψ is the smoothed Heaviside function Ψ(v; γ) = ∫_{−∞}^{v} (1/√(2πγ²)) exp(−t²/2γ²) dt.
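Since Ψ(·; γ) is the Gaussian CDF with standard deviation γ, the probit misfit can be sketched with SciPy as follows; the function name and argument layout are illustrative, and it plugs directly into the generic objective sketched earlier.

```python
import numpy as np
from scipy.stats import norm

def probit_misfit(u, y, labelled, gamma):
    """Phi_p(u; y) = -sum_{j in Z'} log Psi(y_j u_j; gamma), with Psi the Gaussian CDF."""
    v = y[labelled] * u[labelled]
    return -np.sum(norm.logcdf(v, scale=gamma))   # logcdf is numerically stable for v << 0
```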

  16. Level Set. Iglesias, Lu and Stuart, 2016 (IFB).
  Level set model: J_ls^(n)(u; y) = ½⟨u, C^{−1}u⟩_{R^n} + Φ_ls^(n)(u; y).
  Here C = (L + τ²I)^{−α} and Φ_ls^(n)(u; y) := (1/2γ²) Σ_{j∈Z′} |y_j − sign(u_j)|².
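The corresponding level-set misfit is a direct translation of the formula above; a minimal sketch (note that NumPy's sign(0) = 0 differs slightly from a strictly ±1-valued sign):

```python
import numpy as np

def level_set_misfit(u, y, labelled, gamma):
    """Phi_ls(u; y) = (1 / 2 gamma^2) sum_{j in Z'} |y_j - sign(u_j)|^2."""
    return np.sum((y[labelled] - np.sign(u[labelled])) ** 2) / (2 * gamma ** 2)
```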

  17. Sampling Algorithm. Cotter, Roberts, Stuart, White, 2013 (Statist. Sci.).
  The preconditioned Crank-Nicolson (pCN) method, with C = (L + τ²I)^{−α} and acceptance probability α(u, v) = min{1, exp(Φ(u) − Φ(v))}:
  while k < M do
    propose v^(k) = √(1 − β²) u^(k) + β ξ^(k), where ξ^(k) ∼ N(0, C);
    calculate the acceptance probability α(u^(k), v^(k));
    accept u^(k+1) = v^(k) with probability α(u^(k), v^(k)), otherwise reject and set u^(k+1) = u^(k);
  end while
  Bertozzi, Luo, Stuart, 2018 (in preparation): E(α(u, v)) = O(Z_0²), Z_0 = µ({S(u(j)) = y(j) | j ∈ Z′}).
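A compact Python sketch of the pCN loop follows; drawing ξ ∼ N(0, C) through an eigendecomposition of L, and the defaults for β and the number of steps, are assumptions made here for illustration rather than choices taken from the talk.

```python
import numpy as np

def pcn(misfit, L, tau, alpha, beta=0.2, n_steps=10_000, seed=0):
    """pCN sampler for P(u|y) proportional to exp(-Phi(u)) N(0, C), C = (L + tau^2 I)^(-alpha)."""
    rng = np.random.default_rng(seed)
    lam, Q = np.linalg.eigh(L)
    sqrt_C = Q * (lam + tau ** 2) ** (-alpha / 2)      # so that sqrt_C @ z ~ N(0, C)
    n = L.shape[0]
    u = np.zeros(n)
    samples = []
    for _ in range(n_steps):
        xi = sqrt_C @ rng.standard_normal(n)           # xi ~ N(0, C)
        v = np.sqrt(1.0 - beta ** 2) * u + beta * xi   # pCN proposal
        log_accept = min(0.0, misfit(u) - misfit(v))   # log of alpha(u, v)
        if np.log(rng.random()) < log_accept:
            u = v                                      # accept, otherwise keep u
        samples.append(u.copy())
    return np.array(samples)
```

For instance, misfit = lambda u: probit_misfit(u, y, labelled, gamma) would reuse the probit sketch above as the likelihood term.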

  18. Example of UQ (Hyperspectral). Here d = 129 and N ≈ 3 × 10^5. Use Nyström. Figure: Spectral approximation; uncertain classification in red.

  19. Talk Overview: Learning and Inverse Problems; Graph Laplacian; Inverse Problem Formulation; Large Graph Limits; Probability; Conclusions.

  20. Limit Theorem for the Dirichlet Energy. Garcia-Trillos and Slepčev, 2016 (ACHA).
  Unlabelled data {x_j} sampled i.i.d. from density ρ supported on bounded D ⊂ R^d. Let
  𝓛u = −(1/ρ) ∇·(ρ² ∇u), x ∈ D;  ∂u/∂n = 0, x ∈ ∂D.
  Theorem 2. Let s_n = 2/(C(η) n ε²). Then under connectivity conditions on ε = ε(n) in η_ε, the scaled Dirichlet energy Γ-converges in the TL² metric:
  (1/n) ⟨u, Lu⟩_{R^n} → ⟨u, 𝓛u⟩_{L²_ρ} as n → ∞.
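For intuition about why the limit is called a Dirichlet energy, integrating by parts with the no-flux boundary condition (a short derivation added here, not taken from the slides) gives a weighted Dirichlet form:

```latex
\langle u, \mathcal{L} u \rangle_{L^2_\rho}
  = \int_D u \,\Bigl(-\tfrac{1}{\rho}\,\nabla\!\cdot(\rho^2 \nabla u)\Bigr)\,\rho \,\mathrm{d}x
  = -\int_D u \,\nabla\!\cdot(\rho^2 \nabla u)\,\mathrm{d}x
  = \int_D \rho^2 \,|\nabla u|^2 \,\mathrm{d}x ,
```

using the divergence theorem and ∂u/∂n = 0 on ∂D.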

  21. Limit Theorem for Probit. Dunlop, Slepčev, Stuart and Thorpe, in preparation, 2018.
  Let D± be two disjoint bounded subsets of D, define D′ = D+ ∪ D−, and set y(x) = +1 for x ∈ D+, y(x) = −1 for x ∈ D−.
  Assume that #D_n / n → const. as n → ∞. For α > 0, define C = (𝓛 + τ²I)^{−α}. Recall 𝓛u = −(1/ρ) ∇·(ρ² ∇u), with no-flux boundary conditions.
  Theorem 3. Let s_n = 2/(C(η) n ε²). Then under connectivity conditions on ε = ε(n), the scaled probit objective function Γ-converges in the TL² metric:
  (1/n) J_p^(n)(u; y) → J_p(u; y) as n → ∞, where
  J_p(u; y) = ½⟨u, C^{−1}u⟩_{L²_ρ} + Φ_p(u; y),  Φ_p(u; y) := −∫_{D′} log Ψ(y(x) u(x); γ) ρ(x) dx.

  22. Limit Theorem for Probit. Dunlop, Slepčev, Stuart and Thorpe, in preparation, 2018.
  Assume now that #D_n is fixed as n → ∞.
  Theorem 4. Let s_n = 2/(C(η) n ε²) with ε = ε(n, α). Suppose that either
  (1) α < d/2; or
  (2) α > d/2 and ε(n, α) n^{1/(2α)} → ∞.
  Then with probability one, sequences of minimizers of J_p^(n) converge to zero in the TL² metric.

  23. Talk Overview: Learning and Inverse Problems; Graph Laplacian; Inverse Problem Formulation; Large Graph Limits; Probability; Conclusions.

  24. Example (PDE Two Moons – Unlabelled Data). Figure: Sampling density ρ of unlabelled data.

  25. Example (PDE Two Moons – Labelled Data). Figure: Labelled data.

  26. Example (PDE Two Moons – Fiedler Vector of 𝓛). Figure: Fiedler vector.

  27. Example (PDE Two Moons – Posterior Labelling). Figure: Posterior mode of u and sign(u).

  28. Connecting Probit, Level Set and Regression. Dunlop, Slepčev, Stuart and Thorpe, in preparation, 2017.
  Probit and level set probabilistic models. Prior: Gaussian P(du) = N(0, C).
  Probit posterior: P^γ(du | y) ∝ exp(−Φ_p(u; y)) P(du).
  Level set posterior: P^γ(du | y) ∝ exp(−Φ_ls(u; y)) P(du).
  Theorem 4. Let α > d/2. We have P^γ(u | y) ⇒ P(u | y) as γ → 0, where
  P(du | y) ∝ 1_A(u) P(du),  P(du) = N(0, C),  A = {u : sign(u(x)) = y(x), x ∈ D′}.
  Compare with regression (Zhu, Ghahramani, Lafferty 2003, ICML): A ↦ A_0 = {u : u(x) = y(x), x ∈ D′}.
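To make the γ → 0 limit tangible, the following sketch draws from the Gaussian prior conditioned on the sign constraint by rejection sampling; this is purely illustrative (pCN, not rejection sampling, is the algorithm used in practice), and the covariance argument and names are assumptions.

```python
import numpy as np

def sample_sign_conditioned_gaussian(C, y, labelled, n_samples, seed=0):
    """Draw u ~ N(0, C) conditioned on the event A = {sign(u_j) = y_j for j in Z'}."""
    rng = np.random.default_rng(seed)
    chol = np.linalg.cholesky(C + 1e-10 * np.eye(C.shape[0]))  # small jitter for stability
    samples = []
    while len(samples) < n_samples:
        u = chol @ rng.standard_normal(C.shape[0])             # u ~ N(0, C)
        if np.all(np.sign(u[labelled]) == y[labelled]):        # keep only u in A
            samples.append(u)
    return np.array(samples)
```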

  29. Example (MNIST: Human-in-the-loop labelling). Figure: 100 most uncertain digits, 200 labels. Mean uncertainty: 14.0%.

  30. Example (MNIST). Figure: 100 most uncertain digits, 300 labels. Mean uncertainty: 10.3%.

  31. Example (MNIST). Figure: 100 most uncertain digits, 400 labels. Mean uncertainty: 8.1%.

  32. Talk Overview: Learning and Inverse Problems; Graph Laplacian; Inverse Problem Formulation; Large Graph Limits; Probability; Conclusions.

  33. Summary: Graph-Based Learning. Single optimization framework for classification algorithms. Single Bayesian framework for classification algorithms. Large graph limit reveals novel inverse problem structure. Links between probit, level set and regression. Gaussian measure conditioned on its sign. UQ for human-in-the-loop learning. Efficient MCMC algorithms.
