

  1. Large Graph Limits of Learning Algorithms. Andrew M Stuart, Computing and Mathematical Sciences, Caltech. With Andrea Bertozzi, Michael Luo (UCLA) and Kostas Zygalakis (Edinburgh); Matt Dunlop (Caltech), Dejan Slepčev (CMU) and Matt Thorpe (CMU).

  2. References
     X Zhu, Z Ghahramani and J Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, ICML, 2003. [Harmonic functions]
     C Rasmussen and C Williams, Gaussian Processes for Machine Learning, MIT Press, 2006. [Probit]
     AL Bertozzi and A Flenner, Diffuse interface models on graphs for classification of high dimensional data, SIAM MMS, 2012. [Ginzburg-Landau]
     MA Iglesias, Y Lu and AM Stuart, Bayesian level set method for geometric inverse problems, Interfaces and Free Boundaries, 2016. [Level set]
     AL Bertozzi, M Luo, AM Stuart and K Zygalakis, Uncertainty quantification in the classification of high dimensional data, https://arxiv.org/abs/1703.08816, 2017. [Probit on a graph]
     N Garcia-Trillos and D Slepčev, A variational approach to the consistency of spectral clustering, ACHA, 2017.
     M Dunlop, D Slepčev, AM Stuart and M Thorpe, Large data and zero noise limits of graph-based semi-supervised learning algorithms, in preparation, 2017.
     N Garcia-Trillos and D Sanz-Alonso, Continuum limit of posteriors in graph Bayesian inverse problems, https://arxiv.org/abs/1706.07193, 2017.

  3. Talk Overview: Learning and Inverse Problems; Optimization; Theoretical Properties; Probability; Conclusions.

  4. Talk Overview: Learning and Inverse Problems; Optimization; Theoretical Properties; Probability; Conclusions.

  5. Regression. Let D ⊂ R^d be a bounded open set and let D′ ⊂ D. Ill-posed inverse problem: find u : D → R given y(x) = u(x), x ∈ D′. Strong prior information is needed.

  6. Classification. Let D ⊂ R^d be a bounded open set and let D′ ⊂ D. Ill-posed inverse problem: find u : D → R given y(x) = sign(u(x)), x ∈ D′. Strong prior information is needed.

  7. Figure: y = sign(u). Red = +1, blue = −1, yellow: no information.

  8. Figure: reconstruction of the function u on D.

  9. Talk Overview: Learning and Inverse Problems; Optimization; Theoretical Properties; Probability; Conclusions.

  10. Graph Laplacian. Similarity graph G with n vertices Z = {1, ..., n}. Weighted adjacency matrix W = {w_jk}, w_jk = η_ε(x_j − x_k). Diagonal D = diag{d_jj}, d_jj = Σ_{k∈Z} w_jk. L = s_n(D − W) (unnormalized); L′ = D^{−1/2} L D^{−1/2} (normalized).
     Spectral properties: L is positive semi-definite, with ⟨u, Lu⟩_{R^n} ∝ Σ_{j∼k} w_jk |u_j − u_k|². Eigenpairs L q_j = λ_j q_j; fully connected ⇒ λ_1 > λ_0 = 0. Fiedler vector: q_1.
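A minimal NumPy/SciPy sketch of this construction (illustrative only: it assumes a Gaussian-type proximity kernel for η_ε, omits the rescaling s_n, and assumes no isolated vertices in the normalized case):

```python
import numpy as np
from scipy.spatial.distance import cdist

def graph_laplacian(X, eps, normalized=False):
    """Graph Laplacian from data points X (n x d) with w_jk = eta_eps(x_j - x_k)."""
    dists = cdist(X, X)                      # pairwise distances |x_j - x_k|
    W = np.exp(-(dists / eps) ** 2)          # Gaussian-type kernel as an illustrative eta_eps
    np.fill_diagonal(W, 0.0)                 # no self-loops
    d = W.sum(axis=1)                        # degrees d_jj
    L = np.diag(d) - W                       # unnormalized Laplacian D - W (s_n omitted)
    if normalized:
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L = D_inv_sqrt @ L @ D_inv_sqrt      # L' = D^{-1/2} L D^{-1/2}
    return L

# Fiedler vector: eigenvector for the second-smallest eigenvalue of L, e.g.
#   eigvals, eigvecs = np.linalg.eigh(L); fiedler = eigvecs[:, 1]
```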

  11. Problem Statement (Optimization). Semi-Supervised Learning.
     Input: unlabelled data x_j ∈ R^d, j ∈ Z := {1, ..., n}; labelled data y_j ∈ {±1}, j ∈ Z′ ⊆ Z.
     Output: labels y_j ∈ {±1}, j ∈ Z.
     Classification is based on sign(u), with u the minimizer of
     J(u; y) = (1/2) ⟨u, C^{−1} u⟩_{R^n} + Φ(u; y).
     Here u is an R-valued function on the graph nodes; C = (L + τ² I)^{−α} is built from the unlabelled data via w_jk = η_ε(x_j − x_k); Φ(u; y) links the real-valued u to the binary-valued labels y.
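A short sketch of how the prior covariance C = (L + τ² I)^{−α} and the generic objective J might be assembled, with illustrative function names and the misfit Φ supplied as a callable:

```python
import numpy as np

def prior_covariance(L, tau, alpha):
    """C = (L + tau^2 I)^{-alpha}, built spectrally from the graph Laplacian L."""
    lam, Q = np.linalg.eigh(L)                              # L = Q diag(lam) Q^T
    return Q @ np.diag((lam + tau**2) ** (-alpha)) @ Q.T

def objective(u, C_inv, Phi, y):
    """J(u; y) = 0.5 <u, C^{-1} u> + Phi(u; y)."""
    return 0.5 * u @ C_inv @ u + Phi(u, y)
```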

  12. Example: Voting Records. U.S. House of Representatives, 1984, 16 key votes. Each representative has an associated feature vector x_j ∈ R^16, e.g. x_j = (1, −1, 0, ..., 1)^T, where 1 is "yes", −1 is "no" and 0 is abstain/no-show. Hence d = 16 and n = 435. Figure: Fiedler vector and spectrum (normalized case).

  13. Probit. Rasmussen and Williams, 2006 (MIT Press); Bertozzi, Luo, Stuart and Zygalakis, 2017 (arXiv).
     Probit Model:
     J_p^(n)(u; y) = (1/2) ⟨u, C^{−1} u⟩_{R^n} + Φ_p^(n)(u; y).
     Here C = (L + τ² I)^{−α},
     Φ_p^(n)(u; y) := − Σ_{j∈Z′} log Ψ(y_j u_j; γ),
     and
     Ψ(v; γ) = ∫_{−∞}^{v} (2πγ²)^{−1/2} exp(−t²/2γ²) dt.
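Since Ψ(v; γ) is the N(0, γ²) CDF evaluated at v, the probit misfit takes only a few lines; the sketch below is illustrative and not code from the cited papers:

```python
import numpy as np
from scipy.stats import norm

def probit_misfit(u, y, labelled_idx, gamma):
    """Phi_p(u; y) = - sum over j in Z' of log Psi(y_j u_j; gamma)."""
    v = y[labelled_idx] * u[labelled_idx]
    return -np.sum(norm.logcdf(v / gamma))   # logcdf avoids underflow when y_j u_j << 0
```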

  14. Level Set. Iglesias, Lu and Stuart, 2016 (IFB).
     Level Set Model:
     J_ls^(n)(u; y) = (1/2) ⟨u, C^{−1} u⟩_{R^n} + Φ_ls^(n)(u; y).
     Here C = (L + τ² I)^{−α} and
     Φ_ls^(n)(u; y) := (1/(2γ²)) Σ_{j∈Z′} |y_j − sign(u_j)|².
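The analogous sketch for the level-set misfit (again illustrative only):

```python
import numpy as np

def level_set_misfit(u, y, labelled_idx, gamma):
    """Phi_ls(u; y) = (1 / (2 gamma^2)) * sum over j in Z' of |y_j - sign(u_j)|^2."""
    r = y[labelled_idx] - np.sign(u[labelled_idx])
    return 0.5 * np.sum(r ** 2) / gamma ** 2
```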

  15. Talk Overview: Learning and Inverse Problems; Optimization; Theoretical Properties; Probability; Conclusions.

  16. Infimization. Recall that both optimization problems have the form
     J^(n)(u; y) = (1/2) ⟨u, C^{−1} u⟩_{R^n} + Φ^(n)(u; y),
     with
     Φ_p^(n)(u; y) := − Σ_{j∈Z′} log Ψ(y_j u_j; γ)
     and
     Φ_ls^(n)(u; y) := (1/(2γ²)) Σ_{j∈Z′} |y_j − sign(u_j)|².
     Theorem 1. Probit: J_p is convex. Level set: J_ls does not attain its infimum (since sign(cu) = sign(u) for every c > 0, shrinking u reduces the quadratic term without changing the misfit; a numerical illustration follows).
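A tiny numerical illustration of this scaling argument (a sketch with a placeholder identity covariance and every node treated as labelled):

```python
import numpy as np

# J_ls(c*u) decreases as c -> 0+ because sign(c*u) = sign(u) keeps the misfit fixed
# while the quadratic term shrinks, so the infimum is approached but never attained.
rng = np.random.default_rng(1)
n, gamma = 5, 1.0
C_inv = np.eye(n)                                  # placeholder for (L + tau^2 I)^alpha
y = np.array([1.0, -1.0, 1.0, -1.0, 1.0])
u = rng.standard_normal(n)

def J_ls(v):
    misfit = 0.5 * np.sum((y - np.sign(v)) ** 2) / gamma**2
    return 0.5 * v @ C_inv @ v + misfit

for c in (1.0, 0.1, 0.01):
    print(c, J_ls(c * u))                          # objective decreases with c
```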

  17. Limit Theorem for the Dirichlet Energy. Garcia-Trillos and Slepčev, 2016 (ACHA).
     Unlabelled data {x_j} sampled i.i.d. from a density ρ supported on bounded D ⊂ R^d. Let
     ℒu = −(1/ρ) ∇·(ρ² ∇u), x ∈ D;  ∂u/∂n = 0, x ∈ ∂D.
     Theorem 2. Let s_n = 2/(C(η) n ε²). Then, under connectivity conditions on ε = ε(n) in η_ε, the scaled Dirichlet energy Γ-converges in the TL² metric:
     (1/n) ⟨u, Lu⟩_{R^n} → ⟨u, ℒu⟩_{L²_ρ} as n → ∞.

  18. Sketch Proof: Quadratic Forms on Graphs. Discrete Dirichlet energy: ⟨u, Lu⟩_{R^n} ∝ Σ_{j∼k} w_jk |u_j − u_k|². Figure: connectivity stencils for the orange node: PDE, data, localized data.

  19. Sketch Proof: Limits of Quadratic Forms on Graphs. Garcia-Trillos and Slepčev, 2016 (ACHA).
     {x_j}_{j=1}^n i.i.d. from density ρ on D ⊂ R^d; w_jk = η_ε(x_j − x_k), η_ε = (1/ε^d) η(|·|/ε).
     Limiting discrete Dirichlet energy:
     ⟨u, Lu⟩_{R^n} ∝ (1/(n²ε²)) Σ_{j∼k} η_ε(x_j − x_k) |u(x_j) − u(x_k)|²
       ≈ ∫_D ∫_D η_ε(x − y) |(u(x) − u(y))/ε|² ρ(x) ρ(y) dx dy   (n → ∞)
       ≈ C(η) ∫_D |∇u(x)|² ρ(x)² dx ∝ ⟨u, ℒu⟩_{L²_ρ}   (ε → 0).
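A small empirical sanity check of this scaling (a sketch under simplifying assumptions: ρ uniform on [0,1]², d = 2, an indicator kernel η, and the constant C(η) left uncomputed):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Check that (1/n) <u, Lu>, with L = s_n (D - W) and s_n ~ 1/(n eps^2), stabilizes
# as n grows; the limit is proportional to int |grad u|^2 rho^2 dx.
rng = np.random.default_rng(0)
d = 2
for n in (500, 1000, 2000):
    eps = n ** (-1.0 / (d + 2))                       # one admissible connectivity scaling
    X = rng.uniform(size=(n, d))                      # i.i.d. samples, rho = Uniform([0,1]^2)
    u = np.sin(np.pi * X[:, 0])                       # illustrative test function
    W = (cdist(X, X) < eps) / eps**d                  # eta_eps = eps^{-d} * indicator kernel
    np.fill_diagonal(W, 0.0)
    L = (np.diag(W.sum(axis=1)) - W) / (n * eps**2)   # L = s_n (D - W), up to C(eta)
    print(n, (u @ L @ u) / n)                         # should settle near a fixed value
```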

  20. Limit Theorem for Probit. M Dunlop, D Slepčev, AM Stuart and M Thorpe, in preparation, 2017.
     Let D± be two disjoint bounded subsets of D, define D′ = D+ ∪ D−, and set y(x) = +1 for x ∈ D+, y(x) = −1 for x ∈ D−. For α > 0, define C = (ℒ + τ² I)^{−α}, mirroring the graph definition C = (L + τ² I)^{−α}.
     Theorem 3. Let s_n = 2/(C(η) n ε²). Then, under connectivity conditions on ε = ε(n), the scaled probit objective function Γ-converges in the TL² metric:
     (1/n) J_p^(n)(u; y) → J_p(u; y) as n → ∞,
     where
     J_p(u; y) = (1/2) ⟨u, C^{−1} u⟩_{L²_ρ} + Φ_p(u; y),
     Φ_p(u; y) := − ∫_{D′} log Ψ(y(x) u(x); γ) ρ(x) dx.

  21. Talk Overview: Learning and Inverse Problems; Optimization; Theoretical Properties; Probability; Conclusions.

  22. Problem Statement (Bayesian Formulation). Semi-Supervised Learning.
     Input: unlabelled data x_j ∈ R^d, j ∈ Z := {1, ..., n} (prior); labelled data y_j ∈ {±1}, j ∈ Z′ ⊆ Z (likelihood).
     Output: labels y_j ∈ {±1}, j ∈ Z (posterior).
     Connection between probability and optimization:
     J^(n)(u; y) = (1/2) ⟨u, C^{−1} u⟩_{R^n} + Φ^(n)(u; y),
     P(u | y) ∝ exp(−J^(n)(u; y)) ∝ exp(−Φ^(n)(u; y)) × N(0, C) ∝ P(y | u) × P(u).
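One standard way to explore posteriors of this form is preconditioned Crank-Nicolson (pCN) MCMC, which needs only draws from the Gaussian prior N(0, C) and evaluations of the misfit Φ. A generic sketch, not tied to the cited implementations:

```python
import numpy as np

def pcn_sample(Phi, y, C_sqrt, n_steps=10_000, beta=0.3, rng=None):
    """pCN MCMC for P(u | y) proportional to exp(-Phi(u; y)) N(0, C).

    C_sqrt is a square root of the prior covariance, so C_sqrt @ xi ~ N(0, C)
    for xi ~ N(0, I). Returns the chain of states (burn-in not removed).
    """
    rng = rng or np.random.default_rng()
    n = C_sqrt.shape[0]
    u = C_sqrt @ rng.standard_normal(n)               # start from a prior draw
    phi_u = Phi(u, y)
    chain = []
    for _ in range(n_steps):
        xi = C_sqrt @ rng.standard_normal(n)
        v = np.sqrt(1.0 - beta**2) * u + beta * xi    # pCN proposal, prior-reversible
        phi_v = Phi(v, y)
        if np.log(rng.uniform()) < phi_u - phi_v:     # accept with prob min(1, exp(phi_u - phi_v))
            u, phi_u = v, phi_v
        chain.append(u.copy())
    return np.array(chain)
```

Here C_sqrt can be built spectrally from the graph Laplacian, e.g. Q diag((λ + τ²)^{−α/2}) Qᵀ, and Φ can be the probit or level-set misfit sketched earlier.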

  23. Example of Underlying Gaussian (Voting Records). Figure: two-point correlation of sign(u) for 3 Democrats.

  24. Probit (Continuum Limit). Let α > d/2.
     Probit Probabilistic Model:
     Prior: Gaussian P(du) = N(0, C).
     Posterior: P^γ(du | y) ∝ exp(−Φ_p(u; y)) P(du), with
     Φ_p(u; y) := − ∫_{D′} log Ψ(y(x) u(x); γ) ρ(x) dx.

  25. Level Set (Continuum Limit). Let α > d/2.
     Level Set Probabilistic Model:
     Prior: Gaussian P(du) = N(0, C).
     Posterior: P^γ(du | y) ∝ exp(−Φ_ls(u; y)) P(du), with
     Φ_ls(u; y) := (1/(2γ²)) ∫_{D′} |y(x) − sign(u(x))|² ρ(x) dx.

  26. Connecting Probit, Level Set and Regression. M Dunlop, D Slepčev, AM Stuart and M Thorpe, in preparation, 2017.
     Theorem 4. Let α > d/2. Then P^γ(u | y) ⇒ P(u | y) as γ → 0, where
     P(du | y) ∝ 1_A(u) P(du), P(du) = N(0, C),
     A = {u : sign(u(x)) = y(x), x ∈ D′}.
     Compare with regression (Zhu, Ghahramani and Lafferty, 2003, ICML):
     A_0 = {u : u(x) = y(x), x ∈ D′}.

  27. Example (PDE Two Moons – Unlabelled Data). Figure: sampling density ρ of unlabelled data.

  28. Example (PDE Two Moons – Label Data). Figure: labelled data.

  29. Example (PDE Two Moons – Fiedler Vector of L). Figure: Fiedler vector.

  30. Example (PDE Two Moons – Posterior Labelling). Figure: posterior mean of u and sign(u).

  31. Example (One Data Point Makes All The Difference). Figure: sampling density, label data 1, label data 2.

  32. Talk Overview: Learning and Inverse Problems; Optimization; Theoretical Properties; Probability; Conclusions.
