Large Graph Limits of Learning Algorithms

Andrew M Stuart
Computing and Mathematical Sciences, Caltech

Andrea Bertozzi, Michael Luo (UCLA) and Kostas Zygalakis (Edinburgh);
Matt Dunlop (Caltech), Dejan Slepčev (CMU) and Matt Thorpe (CMU)
References

X Zhu, Z Ghahramani, J Lafferty, Semi-supervised learning using Gaussian fields and harmonic functions, ICML, 2003. [Harmonic Functions]
C Rasmussen and C Williams, Gaussian Processes for Machine Learning, MIT Press, 2006. [Probit]
AL Bertozzi and A Flenner, Diffuse interface models on graphs for classification of high dimensional data, SIAM MMS, 2012. [Ginzburg-Landau]
MA Iglesias, Y Lu, AM Stuart, Bayesian level set method for geometric inverse problems, Interfaces and Free Boundaries, 2016. [Level Set]
AL Bertozzi, M Luo, AM Stuart and K Zygalakis, Uncertainty quantification in the classification of high dimensional data, https://arxiv.org/abs/1703.08816, 2017. [Probit on a graph]
N Garcia-Trillos and D Slepčev, A variational approach to the consistency of spectral clustering, ACHA, 2017.
M Dunlop, D Slepčev, AM Stuart and M Thorpe, Large data and zero noise limits of graph based semi-supervised learning algorithms, in preparation, 2017.
N Garcia-Trillos and D Sanz-Alonso, Continuum limit of posteriors in graph Bayesian inverse problems, https://arxiv.org/abs/1706.07193, 2017.
Talk Overview
- Learning and Inverse Problems
- Optimization
- Theoretical Properties
- Probability
- Conclusions
Regression

Let $D \subset \mathbb{R}^d$ be a bounded open set, and let $D' \subset D$.

Ill-Posed Inverse Problem: find $u : D \to \mathbb{R}$ given
$$ y(x) = u(x), \quad x \in D'. $$

Strong prior information needed.
Classification

Let $D \subset \mathbb{R}^d$ be a bounded open set, and let $D' \subset D$.

Ill-Posed Inverse Problem: find $u : D \to \mathbb{R}$ given
$$ y(x) = \mathrm{sign}\bigl(u(x)\bigr), \quad x \in D'. $$

Strong prior information needed.
Figure: $y = \mathrm{sign}(u)$. Red = 1, blue = −1, yellow: no information.
Figure: Reconstruction of the function $u$ on $D$.
Talk Overview
- Learning and Inverse Problems
- Optimization
- Theoretical Properties
- Probability
- Conclusions
Graph Laplacian

Construction: similarity graph $G$ with $n$ vertices $Z = \{1, \ldots, n\}$.
- Weighted adjacency matrix $W = \{w_{j,k}\}$, with $w_{j,k} = \eta_\varepsilon(x_j - x_k)$.
- Diagonal degree matrix $D = \mathrm{diag}\{d_{jj}\}$, with $d_{jj} = \sum_{k \in Z} w_{j,k}$.
- $L = s_n (D - W)$ (unnormalized); $L' = D^{-1/2} L D^{-1/2}$ (normalized).

Spectral Properties:
- $L$ is positive semi-definite: $\langle u, L u \rangle_{\mathbb{R}^n} \propto \sum_{j \sim k} w_{j,k} |u_j - u_k|^2$.
- $L q_j = \lambda_j q_j$; fully connected $\Rightarrow \lambda_1 > \lambda_0 = 0$.
- Fiedler vector: $q_1$.
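A minimal sketch of this construction in Python (NumPy), assuming the data points are the rows of an array `X` and taking a Gaussian kernel for $\eta_\varepsilon$; the kernel choice and the default scaling $s_n = 1$ are illustrative assumptions, not prescribed by the slides.

```python
import numpy as np
from scipy.spatial.distance import cdist

def graph_laplacian(X, eps, normalized=False, s_n=1.0):
    """Build the (un)normalized graph Laplacian from data points X (n x d).

    Assumes a Gaussian kernel eta_eps(r) = exp(-|r|^2 / (2 eps^2));
    the slides leave the kernel eta unspecified.
    """
    dists = cdist(X, X)                      # pairwise distances |x_j - x_k|
    W = np.exp(-dists**2 / (2 * eps**2))     # weighted adjacency matrix
    np.fill_diagonal(W, 0.0)                 # no self-loops
    d = W.sum(axis=1)                        # degrees d_jj
    L = s_n * (np.diag(d) - W)               # unnormalized Laplacian
    if normalized:
        D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
        L = D_inv_sqrt @ L @ D_inv_sqrt      # L' = D^{-1/2} L D^{-1/2}
    return L, W

def fiedler_vector(L):
    """Eigenvector q_1 of the second-smallest eigenvalue of L."""
    vals, vecs = np.linalg.eigh(L)           # eigenpairs in ascending order
    return vecs[:, 1], vals
```

Since `np.linalg.eigh` returns eigenvalues in ascending order, column 1 is the Fiedler vector $q_1$ whenever the graph is connected ($\lambda_0 = 0 < \lambda_1$).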
Problem Statement (Optimization)

Semi-Supervised Learning
Input:
- Unlabelled data $\{x_j \in \mathbb{R}^d,\; j \in Z := \{1, \ldots, n\}\}$;
- Labelled data $\{y_j \in \{\pm 1\},\; j \in Z' \subseteq Z\}$.
Output:
- Labels $\{y_j \in \{\pm 1\},\; j \in Z\}$.

Classification based on $\mathrm{sign}(u)$, with $u$ the optimizer of
$$ \mathsf{J}(u; y) = \tfrac{1}{2} \langle u, C^{-1} u \rangle_{\mathbb{R}^n} + \Phi(u; y). $$

- $u$ is an $\mathbb{R}$-valued function on the graph nodes.
- $C = (L + \tau^2 I)^{-\alpha}$, built from the unlabelled data: $w_{j,k} = \eta_\varepsilon(x_j - x_k)$.
- $\Phi(u; y)$ links the real-valued $u$ to the binary-valued labels $y$.
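A sketch of the prior precision $C^{-1} = (L + \tau^2 I)^{\alpha}$ and of the generic objective, assuming the graph Laplacian `L` from the construction above; $\tau$, $\alpha$ and the fidelity term $\Phi$ are left as parameters (the probit and level set choices follow on later slides).

```python
import numpy as np

def prior_precision(L, tau, alpha):
    """C^{-1} = (L + tau^2 I)^alpha, the precision of the Gaussian prior N(0, C),
    computed via the eigendecomposition of the symmetric matrix L."""
    vals, vecs = np.linalg.eigh(L)
    return (vecs * (vals + tau**2)**alpha) @ vecs.T

def objective(u, C_inv, Phi):
    """Generic objective J(u; y) = 0.5 <u, C^{-1} u> + Phi(u), for a given fidelity Phi."""
    return 0.5 * u @ (C_inv @ u) + Phi(u)
```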
Example: Voting Records

U.S. House of Representatives 1984, 16 key votes. For each congress representative we have an associated feature vector $x_j \in \mathbb{R}^{16}$, such as $x_j = (1, -1, 0, \cdots, 1)^T$, where 1 is "yes", −1 is "no" and 0 is abstain/no-show. Hence $d = 16$ and $n = 435$.

Figure: Fiedler vector and spectrum (normalized case).
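A hypothetical sketch of how such feature vectors could be built and fed to the graph construction above; the file name, record format and the choice $\varepsilon = 2$ are assumptions for illustration only, not taken from the slides.

```python
import numpy as np

def encode_votes(records):
    """Map 16-entry vote records ('y', 'n', '?') to feature vectors in R^16,
    using the encoding on the slide: 'y' -> +1, 'n' -> -1, abstain/no-show -> 0.
    The record format is an assumption about the underlying data file."""
    lookup = {'y': 1.0, 'n': -1.0, '?': 0.0}
    return np.array([[lookup[v] for v in rec] for rec in records])

# Hypothetical usage, assuming a local copy of the 1984 voting-records file
# with the party label in column 0 stripped off:
# records = [line.strip().split(',')[1:] for line in open('house-votes-84.data')]
# X = encode_votes(records)                              # shape (435, 16)
# L, W = graph_laplacian(X, eps=2.0, normalized=True)    # sketch from earlier
# q1, spectrum = fiedler_vector(L)
```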
Probit

Rasmussen and Williams, 2006 (MIT Press); Bertozzi, Luo, Stuart and Zygalakis, 2017 (arXiv).

Probit Model
$$ \mathsf{J}^{(n)}_{\mathrm{p}}(u; y) = \tfrac{1}{2} \langle u, C^{-1} u \rangle_{\mathbb{R}^n} + \Phi^{(n)}_{\mathrm{p}}(u; y). $$
Here $C = (L + \tau^2 I)^{-\alpha}$,
$$ \Phi^{(n)}_{\mathrm{p}}(u; y) := - \sum_{j \in Z'} \log \Psi(y_j u_j; \gamma) $$
and
$$ \Psi(v; \gamma) = \int_{-\infty}^{v} \frac{1}{\sqrt{2\pi\gamma^2}} \exp\bigl(-t^2 / 2\gamma^2\bigr)\, dt. $$
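A minimal sketch of the probit fidelity $\Phi^{(n)}_{\mathrm{p}}$, assuming the labelled set $Z'$ is given as an integer index array `labelled_idx` with labels `y_labels`; $\Psi(\cdot;\gamma)$ is the CDF of $N(0, \gamma^2)$, evaluated via SciPy's log-CDF for numerical stability.

```python
import numpy as np
from scipy.stats import norm

def probit_fidelity(u, y_labels, labelled_idx, gamma):
    """Phi_p(u; y) = - sum_{j in Z'} log Psi(y_j u_j; gamma),
    with Psi(.; gamma) the CDF of N(0, gamma^2)."""
    v = y_labels * u[labelled_idx]
    return -np.sum(norm.logcdf(v, scale=gamma))
```

Combined with the quadratic term sketched earlier, e.g. `objective(u, C_inv, lambda u: probit_fidelity(u, y_labels, labelled_idx, gamma))` evaluates $\mathsf{J}^{(n)}_{\mathrm{p}}$.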
Level Set

Iglesias, Lu and Stuart, 2016 (IFB).

Level Set Model
$$ \mathsf{J}^{(n)}_{\mathrm{ls}}(u; y) = \tfrac{1}{2} \langle u, C^{-1} u \rangle_{\mathbb{R}^n} + \Phi^{(n)}_{\mathrm{ls}}(u; y). $$
Here $C = (L + \tau^2 I)^{-\alpha}$, and
$$ \Phi^{(n)}_{\mathrm{ls}}(u; y) := \frac{1}{2\gamma^2} \sum_{j \in Z'} \bigl| y_j - \mathrm{sign}(u_j) \bigr|^2. $$
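The corresponding level set fidelity, a minimal sketch with the same assumed conventions as the probit sketch above. It depends on $u$ only through $\mathrm{sign}(u)$, which is behind the non-attainment result stated on the next slides.

```python
import numpy as np

def level_set_fidelity(u, y_labels, labelled_idx, gamma):
    """Phi_ls(u; y) = (1 / (2 gamma^2)) * sum_{j in Z'} |y_j - sign(u_j)|^2."""
    return np.sum((y_labels - np.sign(u[labelled_idx]))**2) / (2 * gamma**2)
```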
Talk Overview
- Learning and Inverse Problems
- Optimization
- Theoretical Properties
- Probability
- Conclusions
Infimization

Recall that both optimization problems have the form
$$ \mathsf{J}^{(n)}(u; y) = \tfrac{1}{2} \langle u, C^{-1} u \rangle_{\mathbb{R}^n} + \Phi^{(n)}(u; y). $$
Indeed:
$$ \Phi^{(n)}_{\mathrm{p}}(u; y) := - \sum_{j \in Z'} \log \Psi(y_j u_j; \gamma) $$
and
$$ \Phi^{(n)}_{\mathrm{ls}}(u; y) := \frac{1}{2\gamma^2} \sum_{j \in Z'} \bigl| y_j - \mathrm{sign}(u_j) \bigr|^2. $$

Theorem 1
Probit: $\mathsf{J}_{\mathrm{p}}$ is convex. Level Set: $\mathsf{J}_{\mathrm{ls}}$ does not attain its infimum.
Limit Theorem for the Dirichlet Energy

Garcia-Trillos and Slepčev, 2016 (ACHA).

Unlabelled data $\{x_j\}$ sampled i.i.d. from density $\rho$ supported on bounded $D \subset \mathbb{R}^d$. Let
$$ \mathcal{L} u = -\frac{1}{\rho} \nabla \cdot \bigl( \rho^2 \nabla u \bigr), \quad x \in D, \qquad \frac{\partial u}{\partial n} = 0, \quad x \in \partial D. $$

Theorem 2
Let $s_n = \frac{2}{C(\eta)\, n \varepsilon^2}$. Then, under connectivity conditions on $\varepsilon = \varepsilon(n)$ in $\eta_\varepsilon$, the scaled Dirichlet energy $\Gamma$-converges in the $TL^2$ metric:
$$ \frac{1}{n} \langle u, L u \rangle_{\mathbb{R}^n} \to \langle u, \mathcal{L} u \rangle_{L^2_\rho} \quad \text{as } n \to \infty. $$
Sketch Proof: Quadratic Forms on Graphs

Discrete Dirichlet Energy
$$ \langle u, L u \rangle_{\mathbb{R}^n} \propto \sum_{j \sim k} w_{j,k} |u_j - u_k|^2. $$

Figure: Connectivity stencils for the orange node: PDE, data, localized data.
Sketch Proof: Limits of Quadratic Forms on Graphs

Garcia-Trillos and Slepčev, 2016 (ACHA).

$\{x_j\}_{j=1}^{n}$ i.i.d. from density $\rho$ on $D \subset \mathbb{R}^d$.
$$ w_{jk} = \eta_\varepsilon(x_j - x_k), \qquad \eta_\varepsilon = \frac{1}{\varepsilon^d}\, \eta\Bigl(\frac{|\cdot|}{\varepsilon}\Bigr). $$

Limiting Discrete Dirichlet Energy
$$ \langle u, L u \rangle_{\mathbb{R}^n} \;\propto\; \frac{1}{n^2 \varepsilon^2} \sum_{j \sim k} \eta_\varepsilon\bigl(x_j - x_k\bigr)\, \bigl| u(x_j) - u(x_k) \bigr|^2 ; $$
$$ \stackrel{n \to \infty}{\approx} \int_D \int_D \eta_\varepsilon(x - y)\, \Bigl| \frac{u(x) - u(y)}{\varepsilon} \Bigr|^2 \rho(x) \rho(y)\, dx\, dy ; $$
$$ \stackrel{\varepsilon \to 0}{\approx} C(\eta) \int_D |\nabla u(x)|^2 \rho(x)^2\, dx \;\propto\; \langle u, \mathcal{L} u \rangle_{L^2_\rho}. $$
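An illustrative numerical check of this limit in $d = 1$ with uniform density $\rho \equiv 1$ on $D = (0,1)$; the indicator kernel, the test function $u(x) = \sin(\pi x)$ and the parameter values are choices made here for the sketch, not taken from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 2000, 0.05
x = rng.uniform(0.0, 1.0, n)
u = np.sin(np.pi * x)                          # smooth test function

# eta_eps(r) = (1/eps) * 1_{|r| < eps} (indicator kernel in d = 1), for which
# C(eta) = integral over [-1, 1] of eta(|t|) t^2 dt = 2/3.
diff = x[:, None] - x[None, :]
W = (np.abs(diff) < eps) / eps
np.fill_diagonal(W, 0.0)

# Scaled discrete Dirichlet energy (1 / (n^2 eps^2)) * sum_{j,k} w_jk |u_j - u_k|^2
graph_energy = np.sum(W * (u[:, None] - u[None, :])**2) / (n**2 * eps**2)

# Continuum limit C(eta) * int_0^1 |u'(x)|^2 rho(x)^2 dx = (2/3) * (pi^2 / 2)
continuum_energy = (2.0 / 3.0) * np.pi**2 / 2.0

# The two values should agree up to Monte Carlo, discretization and boundary effects.
print(graph_energy, continuum_energy)
```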
Limit Theorem for Probit

M Dunlop, D Slepčev, AM Stuart and M Thorpe, in preparation, 2017.

Let $D^{\pm}$ be two disjoint bounded subsets of $D$, define $D' = D^+ \cup D^-$ and
$$ y(x) = +1, \; x \in D^+; \qquad y(x) = -1, \; x \in D^-. $$
For $\alpha > 0$, define $\mathcal{C} = (\mathcal{L} + \tau^2 I)^{-\alpha}$. Recall that $C = (L + \tau^2 I)^{-\alpha}$.

Theorem 3
Let $s_n = \frac{2}{C(\eta)\, n \varepsilon^2}$. Then, under connectivity conditions on $\varepsilon = \varepsilon(n)$, the scaled probit objective function $\Gamma$-converges in the $TL^2$ metric:
$$ \frac{1}{n} \mathsf{J}^{(n)}_{\mathrm{p}}(u; y) \to \mathsf{J}_{\mathrm{p}}(u; y) \quad \text{as } n \to \infty, $$
where
$$ \mathsf{J}_{\mathrm{p}}(u; y) = \tfrac{1}{2} \langle u, \mathcal{C}^{-1} u \rangle_{L^2_\rho} + \Phi_{\mathrm{p}}(u; y), \qquad \Phi_{\mathrm{p}}(u; y) := - \int_{D'} \log \Psi\bigl(y(x) u(x); \gamma\bigr) \rho(x)\, dx. $$
Talk Overview
- Learning and Inverse Problems
- Optimization
- Theoretical Properties
- Probability
- Conclusions
Problem Statement (Bayesian Formulation)

Semi-Supervised Learning
Input:
- Unlabelled data $\{x_j \in \mathbb{R}^d,\; j \in Z := \{1, \ldots, n\}\}$  (prior);
- Labelled data $\{y_j \in \{\pm 1\},\; j \in Z' \subseteq Z\}$  (likelihood).
Output:
- Labels $\{y_j \in \{\pm 1\},\; j \in Z\}$  (posterior).

Connection between probability and optimization:
$$ \mathsf{J}^{(n)}(u; y) = \tfrac{1}{2} \langle u, C^{-1} u \rangle_{\mathbb{R}^n} + \Phi^{(n)}(u; y), $$
$$ \mathbb{P}(u \mid y) \;\propto\; \exp\bigl( -\mathsf{J}^{(n)}(u; y) \bigr) \;\propto\; \exp\bigl( -\Phi^{(n)}(u; y) \bigr) \times N(0, C) \;\propto\; \mathbb{P}(y \mid u) \times \mathbb{P}(u). $$
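A sketch of one way to sample this posterior, using a preconditioned Crank-Nicolson (pCN) proposal; the slides do not prescribe a sampler, so this is an illustrative choice, with `C_sqrt` any square root of $C = (L + \tau^2 I)^{-\alpha}$ and `neg_log_lik` the chosen fidelity $\Phi^{(n)}(\cdot; y)$.

```python
import numpy as np

def pcn_sample(C_sqrt, neg_log_lik, n_steps=10_000, beta=0.2, rng=None):
    """Preconditioned Crank-Nicolson MCMC for P(u|y) prop. to exp(-Phi(u; y)) N(0, C)."""
    rng = np.random.default_rng() if rng is None else rng
    n = C_sqrt.shape[0]
    u = C_sqrt @ rng.standard_normal(n)           # initial draw u_0 ~ N(0, C)
    phi = neg_log_lik(u)
    samples = []
    for _ in range(n_steps):
        xi = C_sqrt @ rng.standard_normal(n)      # prior draw for the proposal
        v = np.sqrt(1 - beta**2) * u + beta * xi  # pCN proposal (prior-reversible)
        phi_v = neg_log_lik(v)
        if np.log(rng.uniform()) < phi - phi_v:   # accept with prob min(1, exp(phi - phi_v))
            u, phi = v, phi_v
        samples.append(u.copy())
    return np.array(samples)
```

Averaging $\mathrm{sign}(u)$ over the resulting samples gives the posterior classification together with a measure of its uncertainty, as in the examples below.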
Example of Underlying Gaussian (Voting Records)

Figure: Two-point correlation of $\mathrm{sign}(u)$ for 3 Democrats.
Probit (Continuum Limit)

Let $\alpha > \frac{d}{2}$.

Probit Probabilistic Model
Prior: Gaussian $\mathbb{P}(du) = N(0, \mathcal{C})$.
Posterior: $\mathbb{P}^\gamma(du \mid y) \propto \exp\bigl( -\Phi_{\mathrm{p}}(u; y) \bigr)\, \mathbb{P}(du)$, where
$$ \Phi_{\mathrm{p}}(u; y) := - \int_{D'} \log \Psi\bigl(y(x) u(x); \gamma\bigr) \rho(x)\, dx. $$
Level Set (Continuum Limit)

Let $\alpha > \frac{d}{2}$.

Level Set Probabilistic Model
Prior: Gaussian $\mathbb{P}(du) = N(0, \mathcal{C})$.
Posterior: $\mathbb{P}^\gamma(du \mid y) \propto \exp\bigl( -\Phi_{\mathrm{ls}}(u; y) \bigr)\, \mathbb{P}(du)$, where
$$ \Phi_{\mathrm{ls}}(u; y) := \frac{1}{2\gamma^2} \int_{D'} \bigl| y(x) - \mathrm{sign}\bigl(u(x)\bigr) \bigr|^2 \rho(x)\, dx. $$
Connecting Probit, Level Set and Regression

M Dunlop, D Slepčev, AM Stuart and M Thorpe, in preparation, 2017.

Theorem 4
Let $\alpha > \frac{d}{2}$. We have $\mathbb{P}^\gamma(u \mid y) \Rightarrow \mathbb{P}(u \mid y)$ as $\gamma \to 0$, where
$$ \mathbb{P}(du \mid y) \propto \mathbf{1}_A(u)\, \mathbb{P}(du), \qquad \mathbb{P}(du) = N(0, \mathcal{C}), $$
$$ A = \{ u : \mathrm{sign}\bigl(u(x)\bigr) = y(x), \; x \in D' \}. $$

Compare with regression (Zhu, Ghahramani, Lafferty, 2003, ICML):
$$ A_0 = \{ u : u(x) = y(x), \; x \in D' \}. $$
Example (PDE Two Moons – Unlabelled Data)

Figure: Sampling density $\rho$ of unlabelled data.

Example (PDE Two Moons – Label Data)

Figure: Labelled data.

Example (PDE Two Moons – Fiedler Vector of L)

Figure: Fiedler vector.

Example (PDE Two Moons – Posterior Labelling)

Figure: Posterior mean of $u$ and $\mathrm{sign}(u)$.

Example (One Data Point Makes All The Difference)

Figure: Sampling density, label data 1, label data 2.
Talk Overview
- Learning and Inverse Problems
- Optimization
- Theoretical Properties
- Probability
- Conclusions