Poisson Learning: Graph-based semi-supervised learning at very low label rates

Jeff Calder (1), Brendan Cook (1), Matthew Thorpe (2), and Dejan Slepčev (3)

(1) School of Mathematics, University of Minnesota
(2) Department of Mathematics, University of Manchester
(3) Department of Mathematical Sciences, Carnegie Mellon University

International Conference on Machine Learning (ICML), July 12-18, 2020

Research supported by the National Science Foundation, the European Research Council, and a University of Minnesota Grant-in-Aid award.
Outline

1. Introduction
   - Graph-based semi-supervised learning
   - Laplace learning / label propagation
   - Degeneracy in Laplace learning
2. Poisson learning
   - Random walk perspective
   - Variational interpretation
3. Experimental results
   - Algorithmic details
   - Datasets and algorithms
   - Results
4. References
Graph-based semi-supervised learning

Graph: $G = (X, W)$
- $X = \{x_1, \dots, x_n\}$ are the vertices of the graph.
- $W = (w_{ij})_{i,j=1}^n$ are nonnegative, symmetric ($w_{ij} = w_{ji}$) edge weights, with $w_{ij} \approx 1$ if $x_i, x_j$ are similar and $w_{ij} \approx 0$ if they are dissimilar.

Labels: We assume the first $m \ll n$ vertices are given labels $y_1, y_2, \dots, y_m \in \{e_1, e_2, \dots, e_k\} \subset \mathbb{R}^k$.

Task: Extend the labels to the rest of the vertices $x_{m+1}, \dots, x_n$.

Semi-supervised smoothness assumption: Similar points $x_i, x_j \in X$ in high-density regions of the graph should have similar labels.

Laplace learning / label propagation: original work [Zhu et al., 2003]; learning [Zhou et al., 2005], [Ando and Zhang, 2007]; manifold ranking [He et al., 2006], [Zhou et al., 2011], [Xu et al., 2011].
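As a concrete illustration, here is one common way to build such a weight matrix. This is a minimal sketch assuming Gaussian weights on a k-nearest-neighbor graph; the kernel, the bandwidth choice, and the function name are illustrative and not taken from the slides.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_weight_matrix(X, k=10):
    """Symmetric weights w_ij ~ 1 for similar points, ~ 0 for dissimilar ones.

    Gaussian weights on a k-nearest-neighbor graph, with a per-point bandwidth
    set to the distance to the k-th neighbor (one common choice among many).
    """
    nbrs = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nbrs.kneighbors(X)            # column 0 is the point itself
    sigma = dist[:, -1] + 1e-10               # per-point bandwidth
    n = X.shape[0]
    W = np.zeros((n, n))
    for i in range(n):
        W[i, idx[i, 1:]] = np.exp(-dist[i, 1:] ** 2 / sigma[i] ** 2)
    return np.maximum(W, W.T)                 # enforce w_ij = w_ji
```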
Laplace learning / label propagation

Laplacian-regularized semi-supervised learning solves the Laplace equation
$$
\begin{cases}
\mathcal{L}u(x_i) = 0, & \text{if } m+1 \le i \le n,\\
u(x_i) = y_i, & \text{if } 1 \le i \le m,
\end{cases}
$$
where $u : X \to \mathbb{R}^k$ and $\mathcal{L}$ is the graph Laplacian
$$
\mathcal{L}u(x_i) = \sum_{j=1}^n w_{ij}\,\bigl(u(x_i) - u(x_j)\bigr).
$$
The label decision for vertex $x_i$ is determined by the largest component of $u(x_i)$:
$$
\ell(x_i) = \operatorname*{argmax}_{j \in \{1,\dots,k\}} u_j(x_i).
$$
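In matrix form this is a linear solve on the unlabeled block of the graph Laplacian. A minimal sketch (dense matrices for clarity; the function name and the convention that the labeled vertices come first are ours):

```python
import numpy as np

def laplace_learning(W, Y):
    """Harmonic extension: L u = 0 on unlabeled nodes, u = Y on labeled nodes.

    W : (n, n) symmetric weight matrix whose FIRST m nodes are the labeled ones.
    Y : (m, k) one-hot labels. Returns u of shape (n, k).
    """
    n = W.shape[0]
    m, k = Y.shape
    L = np.diag(W.sum(axis=1)) - W            # graph Laplacian L = D - W
    u = np.zeros((n, k))
    u[:m] = Y
    # Solve the unlabeled block: L_uu u_u = -L_ul Y.
    u[m:] = np.linalg.solve(L[m:, m:], -L[m:, :m] @ Y)
    return u

# Label decision: pick the largest component of u(x_i).
# labels = laplace_learning(W, Y).argmax(axis=1)
```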
Label propagation

The solution of Laplace learning satisfies
$$
\mathcal{L}u(x_i) = \sum_{j=1}^n w_{ij}\,\bigl(u(x_i) - u(x_j)\bigr) = 0 \qquad (m+1 \le i \le n).
$$
Rearranging, we see that $u$ satisfies the mean-value property
$$
u(x_i) = \frac{\sum_{j=1}^n w_{ij}\, u(x_j)}{\sum_{j=1}^n w_{ij}}.
$$
Label propagation [Zhu, 2005] iterates
$$
u^{k+1}(x_i) = \frac{\sum_{j=1}^n w_{ij}\, u^k(x_j)}{\sum_{j=1}^n w_{ij}},
$$
and at convergence is equivalent to Laplace learning.
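A minimal sketch of this iteration (a fixed iteration count stands in for a proper convergence check; names are illustrative):

```python
import numpy as np

def label_propagation(W, Y, n_iter=2000):
    """Iterate the mean-value property u(x_i) = sum_j w_ij u(x_j) / sum_j w_ij.

    W : (n, n) symmetric weight matrix, first m nodes labeled; Y : (m, k) one-hot.
    At convergence this agrees with the direct Laplace learning solve.
    """
    n = W.shape[0]
    m, k = Y.shape
    d = W.sum(axis=1, keepdims=True)
    u = np.zeros((n, k))
    u[:m] = Y
    for _ in range(n_iter):
        u = (W @ u) / d                       # weighted neighbor average
        u[:m] = Y                             # keep the labeled values fixed
    return u
```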
Ill-posed with a small amount of labeled data

[Figure: the solution of Laplace learning for this example.]

- The graph consists of $n = 10^5$ i.i.d. points drawn uniformly from $[0,1]^2$.
- $w_{xy} = 1$ if $|x - y| < 0.01$ and $w_{xy} = 0$ otherwise.
- Two labels: $y_1 = 0$ at the red point and $y_2 = 1$ at the green point.
- Over 95% of the values of $u$ lie in $[0.4975, 0.5025]$.

[Nadler et al., 2009], [El Alaoui et al., 2016]
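For concreteness, here is a sketch that reproduces this experiment. The locations of the two labeled points are illustrative (the slides do not specify them), and with $n = 10^5$ points the solve may take a minute.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import cg
from sklearn.neighbors import radius_neighbors_graph

rng = np.random.default_rng(0)
n = 10**5
X = rng.random((n, 2))                               # i.i.d. uniform points in [0,1]^2

# 0/1 weights: w_xy = 1 if |x - y| < 0.01, else 0.
W = radius_neighbors_graph(X, radius=0.01, mode='connectivity', include_self=False)
W = W.maximum(W.T)                                   # symmetrize (kept for safety)

# Two labeled points (locations chosen for illustration only).
i0 = int(np.argmin(np.linalg.norm(X - [0.25, 0.25], axis=1)))   # label 0
i1 = int(np.argmin(np.linalg.norm(X - [0.75, 0.75], axis=1)))   # label 1
labeled, y = np.array([i0, i1]), np.array([0.0, 1.0])
unlabeled = np.setdiff1d(np.arange(n), labeled)

d = np.asarray(W.sum(axis=1)).ravel()
L = (sparse.diags(d) - W).tocsr()                    # graph Laplacian

# Harmonic extension: solve the unlabeled block with conjugate gradients.
A = L[unlabeled][:, unlabeled]
b = -(L[unlabeled][:, labeled] @ y)
u = np.zeros(n)
u[labeled] = y
u[unlabeled], _ = cg(A, b)

print(np.mean(np.abs(u - 0.5) < 0.0025))             # most values concentrate near 0.5
```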
MNIST (70,000 28 × 28 pixel images of handwritten digits 0-9)

[Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. "Gradient-based learning applied to document recognition." Proceedings of the IEEE, 86(11):2278-2324, November 1998.]
Laplace learning on MNIST

# Labels/class    1            2            3            4            5
Laplace           16.1 (6.2)   28.2 (10)    42.0 (12)    57.8 (12)    69.5 (12)
Graph NN          58.8 (5.6)   66.6 (2.8)   70.2 (4)     71.3 (2.6)   73.4 (1.9)

# Labels/class    10           50           100          500          1000
Laplace           93.2 (2.3)   96.9 (0.1)   97.1 (0.1)   97.6 (0.1)   97.7 (0.0)
Graph NN          82.3 (1.0)   89.0 (0.5)   90.6 (0.4)   93.4 (0.1)   93.7 (0.1)

Average accuracy (%) over 10 trials, with standard deviation in brackets.
Graph NN: 1-nearest-neighbor classification using graph geodesic distance.
Recent work

The low label rate problem was originally identified by [Nadler et al., 2009]. A lot of recent work has addressed this issue with new graph-based classification algorithms for low label rates:
- Higher-order regularization: [Zhou and Belkin, 2011], [Dunlop et al., 2019]
- p-Laplace regularization: [El Alaoui et al., 2016], [Calder, 2018, 2019], [Slepčev and Thorpe, 2019]
- Re-weighted Laplacians: [Shi et al., 2017], [Calder and Slepčev, 2019]
- Centered kernel method: [Mai and Couillet, 2018]

While there are now many new models, the degeneracy of Laplace learning at low label rates was still not well understood.

In this talk:
1. We explain the degeneracy in terms of random walks.
2. We propose a new algorithm: Poisson learning.
Outline, part 2: Poisson learning
- Random walk perspective
- Variational interpretation
Poisson learning

We propose to replace Laplace learning
$$
\begin{cases}
\mathcal{L}u(x_i) = 0, & \text{if } m+1 \le i \le n,\\
u(x_i) = y_i, & \text{if } 1 \le i \le m,
\end{cases}
\qquad \text{(Laplace equation)}
$$
with Poisson learning
$$
\mathcal{L}u(x_i) = \sum_{j=1}^m (y_j - c)\,\delta_{ij} \quad \text{for } i = 1, \dots, n,
\qquad \text{(Poisson equation)}
$$
subject to $\sum_{i=1}^n d_i\, u(x_i) = 0$, where $c = \frac{1}{m}\sum_{i=1}^m y_i$.

In both cases the label decision is the same:
$$
\ell(x_i) = \operatorname*{argmax}_{j \in \{1,\dots,k\}} u_j(x_i).
$$
For Poisson learning, unbalanced class sizes can also be incorporated into the decision:
$$
\ell(x_i) = \operatorname*{argmax}_{j \in \{1,\dots,k\}} \frac{p_j}{n_j}\, u_j(x_i),
$$
where $p_j$ is the fraction of data in class $j$ and $n_j$ is the fraction of training data from class $j$.
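A minimal sketch of solving the Poisson equation per label channel with conjugate gradients, followed by the (optionally prior-weighted) label decision. This is our own illustration, not the authors' reference implementation; the function name and the choice of solver are assumptions.

```python
import numpy as np
from scipy import sparse
from scipy.sparse.linalg import cg

def poisson_learning(W, Y, labeled_idx, class_priors=None):
    """Solve L u = sum_j (y_j - c) delta_ij and return label decisions.

    W : (n, n) sparse symmetric weight matrix.
    Y : (m, k) one-hot labels for the vertices listed in labeled_idx.
    class_priors : optional length-k vector p_j of true class fractions.
    """
    n, k = W.shape[0], Y.shape[1]
    d = np.asarray(W.sum(axis=1)).ravel()
    L = (sparse.diags(d) - W).tocsr()

    c = Y.mean(axis=0)                        # c = (1/m) sum_j y_j
    B = np.zeros((n, k))
    B[labeled_idx] = Y - c                    # point sources at the labeled vertices

    # L is singular (constants lie in its null space), but each column of B sums
    # to zero, so the system is consistent and CG converges to a solution.
    u = np.column_stack([cg(L, B[:, j])[0] for j in range(k)])
    u -= (d @ u) / d.sum()                    # enforce sum_i d_i u(x_i) = 0

    if class_priors is not None:
        n_j = Y.mean(axis=0)                  # fraction of training data per class
        u = u * (class_priors / n_j)          # reweight for unbalanced classes
    return np.argmax(u, axis=1)
```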
Random walk perspective

Suppose $u$ solves the Laplace learning equation
$$
\begin{cases}
\mathcal{L}u(x_i) = 0, & \text{if } m+1 \le i \le n,\\
u(x_i) = y_i, & \text{if } 1 \le i \le m.
\end{cases}
$$
Let $x \in X$ and let $X_0, X_1, X_2, \dots$ be a random walk on $X$ with transition probabilities
$$
\mathbb{P}(X_k = x_j \mid X_{k-1} = x_i) = \frac{w_{ij}}{d_i},
\qquad \text{where } d_i = \sum_{j=1}^n w_{ij}.
$$
Define the stopping time $\tau$ to be the first time the walk hits a labeled vertex, that is,
$$
\tau = \inf\{ k \ge 0 : X_k \in \{x_1, x_2, \dots, x_m\} \}.
$$
Let $i_\tau \le m$ be such that $X_\tau = x_{i_\tau}$. Then, by Doob's optional stopping theorem,
$$
u(x) = \mathbb{E}\bigl[\, y_{i_\tau} \mid X_0 = x \,\bigr].
$$
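This identity can be checked numerically by Monte Carlo: simulate the walk from $x$ until it hits a labeled vertex and average the labels that are hit. A purely illustrative sketch (dense $W$ for simplicity; names are ours):

```python
import numpy as np

def estimate_u(W, Y, labeled_idx, start, n_walks=1000, seed=0):
    """Monte Carlo estimate of u(x) = E[ y_{i_tau} | X_0 = x ] for Laplace learning.

    Each walk steps from x_i to x_j with probability w_ij / d_i and stops at the
    first labeled vertex it hits (the stopping time tau).
    """
    rng = np.random.default_rng(seed)
    labels = {int(i): Y[a] for a, i in enumerate(labeled_idx)}
    P = W / W.sum(axis=1, keepdims=True)      # row-stochastic transition matrix
    n = W.shape[0]
    total = np.zeros(Y.shape[1])
    for _ in range(n_walks):
        i = start
        while i not in labels:                # walk until the stopping time tau
            i = rng.choice(n, p=P[i])
        total += labels[i]                    # record the label that was hit
    return total / n_walks
```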
Classification experiment [figure slide]

Random walk experiment [figure slide]

Classification experiment [figure slide]
The random walk perspective

At low label rates, the random walker reaches the mixing time before hitting a label, so the label it eventually hits is largely independent of where the walker started.

After walking for a long time, the probability distribution of the walker approaches the invariant distribution $\pi$ given by
$$
\pi_i = \frac{d_i}{\sum_{j=1}^n d_j}.
$$
Thus, the solution of Laplace learning is approximately constant:
$$
u(x_i) = \mathbb{E}\bigl[\, y_{i_\tau} \mid X_0 = x_i \,\bigr] \approx \frac{\sum_{j=1}^m d_j\, y_j}{\sum_{j=1}^m d_j} =: c \in \mathbb{R}^k.
$$
Bottom line: nearly everything is labeled by the one-hot vector closest to $c$!
The random walk perspective

Let $X_0^{x_j}, X_1^{x_j}, X_2^{x_j}, \dots$ be a random walk on the graph $X$ starting from $x_j \in X$, and define
$$
u_T(x_i) = \mathbb{E}\left[ \sum_{j=1}^m \sum_{k=0}^T y_j\, \mathbb{1}_{\{X_k^{x_j} = x_i\}} \right].
$$
Idea: We release random walkers from the labeled nodes and record how often walkers carrying each label visit $x_i$.

We can write
$$
u_T(x_i) = \sum_{j=1}^m \sum_{k=0}^T y_j\, \mathbb{P}\bigl(X_k^{x_j} = x_i\bigr).
$$
The inner term is a Green's function for the random walk. As $T \to \infty$, $u_T \to \infty$, so we center $u_T$ by its mean value:
$$
\sum_{i=1}^n u_T(x_i) = \sum_{k=0}^T \sum_{j=1}^m y_j = \sum_{k=0}^T m c,
\qquad \text{where } c = \frac{1}{m}\sum_{j=1}^m y_j.
$$
The random walk perspective

Subtracting off the mean of $u_T$ and normalizing by $d_i$, we arrive at
$$
u_T(x_i) := \frac{1}{d_i}\, \mathbb{E}\left[ \sum_{j=1}^m \sum_{k=0}^T (y_j - c)\, \mathbb{1}_{\{X_k^{x_j} = x_i\}} \right],
\qquad \text{where } c = \frac{1}{m}\sum_{j=1}^m y_j.
$$

Theorem. For every $T \ge 0$ we have
$$
u_{T+1}(x_i) = u_T(x_i) + \frac{1}{d_i}\left( \sum_{j=1}^m (y_j - c)\,\delta_{ij} - \mathcal{L}u_T(x_i) \right).
$$
If the graph $G$ is connected and the Markov chain induced by the random walk is aperiodic, then $u_T \to u$ as $T \to \infty$, where $u : X \to \mathbb{R}^k$ is the solution of
$$
\mathcal{L}u(x_i) = \sum_{j=1}^m (y_j - c)\,\delta_{ij} \quad \text{for } i = 1, \dots, n,
$$
satisfying $\sum_{i=1}^n d_i\, u(x_i) = 0$.
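The theorem doubles as an algorithm: iterating the stated identity for $T$ steps computes $u_T$ directly. A minimal dense-matrix sketch of that iteration (names are illustrative):

```python
import numpy as np

def poisson_iteration(W, Y, labeled_idx, T=1000):
    """Run u_{T+1} = u_T + D^{-1} (B - L u_T), the update from the theorem.

    W : (n, n) symmetric weight matrix (dense here for simplicity).
    Y : (m, k) one-hot labels at the vertices in labeled_idx.
    """
    n, k = W.shape[0], Y.shape[1]
    d = W.sum(axis=1)
    L = np.diag(d) - W                        # graph Laplacian
    c = Y.mean(axis=0)                        # c = (1/m) sum_j y_j
    B = np.zeros((n, k))
    B[labeled_idx] = Y - c                    # B_i = sum_j (y_j - c) delta_ij
    u = np.zeros((n, k))
    for _ in range(T):
        u = u + (B - L @ u) / d[:, None]
    return u
```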