Machine Learning from a Continuous Viewpoint Weinan E Princeton University Joint work with: Chao Ma, Lei Wu https://arxiv.org/pdf/1912.12777.pdf August 3, 2020 1 / 31
Examples
Standard ML problems, for which we are given the dataset:
1. Supervised learning: Given $S = \{(x_j, y_j = f^*(x_j)),\ j \in [n]\}$, learn $f^*$. Objective: minimize the population risk over the “hypothesis space”:
$$R(f) = \mathbb{E}_{x \sim \mu} (f(x) - f^*(x))^2$$
2. Dimension reduction: Given $S = \{x_j,\ j \in [n]\} \subset \mathbb{R}^D$ sampled from $\mu$, find a mapping $\Phi: \mathbb{R}^D \to \mathbb{R}^d$ ($d \ll D$) that best preserves all important features of $\mu$. “Auto-encoder”: minimize the reconstruction error $x - \tilde{x}$, where $\tilde{x} = \Psi(z)$ and $z = \Phi(x)$:
$$R(f) = \mathbb{E}_{x \sim \mu} (x - \Psi(z))^2 = \mathbb{E}_{x \sim \mu} (x - \Psi(\Phi(x)))^2$$
Non-standard ML problems, with no dataset given beforehand:
1. Ground state of the quantum many-body problem: Let $H = -\frac{\hbar^2}{2m}\Delta + V$ be the Hamiltonian operator of the system. Minimize
$$I(\phi) = \frac{(\phi, H\phi)}{(\phi, \phi)} = \mathbb{E}_{x \sim \mu_\phi} \frac{\phi(x) H\phi(x)}{\phi(x)^2}, \qquad \mu_\phi(dx) = \frac{1}{(\phi, \phi)} \phi^2(x)\, dx$$
subject to the constraint imposed by the Pauli exclusion principle.
2. Stochastic control problems:
$$s_{t+1} = s_t + b_t(s_t, a_t) + \xi_{t+1}$$
where $s_t$ = state at time $t$, $a_t$ = control at time $t$, $\xi_t$ = i.i.d. noise.
$$L(\{a_t\}_{t=0}^{T-1}) = \mathbb{E}_{\{\xi_t\}} \Big\{ \sum_{t=0}^{T-1} c_t(s_t, a_t(s_t)) + c_T(s_T) \Big\}$$
Look for feedback control: $a_t = F(t, s_t)$, $t = 0, 1, \cdots, T-1$.
Remark: High dimensionality
Benchmark: high-dimensional integration
$$I(g) = \int_{X = [0,1]^d} g(x)\, d\mu, \qquad I_m(g) = \frac{1}{m} \sum_j g(x_j)$$
Grid-based quadrature rules:
$$I(g) - I_m(g) \sim \frac{C(g)}{m^{\alpha/d}}$$
Appearance of $1/d$ in the exponent of $m$: curse of dimensionality (CoD)!
If we want $m^{-\alpha/d} = 0.1$, then $m = 10^{d/\alpha} = 10^d$ if $\alpha = 1$.
Monte Carlo: $\{x_j,\ j \in [m]\}$ is uniformly distributed in $X$.
$$\mathbb{E}(I(g) - I_m(g))^2 = \frac{\mathrm{var}(g)}{m}, \qquad \mathrm{var}(g) = \int_X g^2(x)\, dx - \Big( \int_X g(x)\, dx \Big)^2$$
However, $\mathrm{var}(g)$ can be very large in high dimension. Variance reduction!
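A minimal sketch of the Monte Carlo estimate $I_m(g)$ on $[0,1]^d$, written to illustrate that the error scales like $\sqrt{\mathrm{var}(g)/m}$ regardless of $d$. The integrand `g_mean` and the helper name `mc_integrate` are hypothetical choices for illustration and do not come from the slides.

```python
import numpy as np

def mc_integrate(g, d, m, rng):
    """Monte Carlo estimate I_m(g) = (1/m) * sum_j g(x_j), x_j uniform on [0,1]^d."""
    x = rng.random((m, d))            # m i.i.d. uniform samples in [0,1]^d
    values = g(x)                     # g acts row-wise and returns shape (m,)
    estimate = values.mean()
    # Sample variance of g divided by m estimates E(I(g) - I_m(g))^2 = var(g)/m.
    mse_estimate = values.var(ddof=1) / m
    return estimate, mse_estimate

def g_mean(x):
    # Hypothetical integrand: average of the coordinates, exact integral 1/2.
    return x.mean(axis=1)

rng = np.random.default_rng(0)
d = 10
for m in (100, 10_000):
    est, mse = mc_integrate(g_mean, d, m, rng)
    # Columns: m, estimate, actual error, predicted root-mean-square error
    print(m, est, abs(est - 0.5), np.sqrt(mse))
```

For this integrand $\mathrm{var}(g) = 1/(12d)$, so the variance happens to shrink with $d$; integrands whose variance grows with $d$ are the case where variance reduction becomes necessary.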
Overall strategy: Formulate a “nice” continuous problem, then discretize to get concrete models/algorithms.
For PDEs, “nice” = well-posed.
For calculus of variations problems, “nice” = “convex”, lower semi-continuous.
For machine learning, “nice” = the variational problem has a simple landscape.
How do we represent a function? An illustrative example
Traditional approach for Fourier transform:
$$f(x) = \int_{\mathbb{R}^d} a(\omega) e^{i(\omega, x)}\, d\omega, \qquad f_m(x) = \frac{1}{m} \sum_j a(\omega_j) e^{i(\omega_j, x)}$$
$\{\omega_j\}$ is a fixed grid, e.g. uniform.
$$\| f - f_m \|_{L^2(X)} \le C_0 m^{-\alpha/d} \| f \|_{H^\alpha(X)}$$
“New” approach: Let $\pi$ be a probability distribution and
$$f(x) = \int_{\mathbb{R}^d} a(\omega) e^{i(\omega, x)}\, \pi(d\omega) = \mathbb{E}_{\omega \sim \pi}\, a(\omega) e^{i(\omega, x)}$$
Let $\{\omega_j\}$ be an i.i.d. sample of $\pi$, $f_m(x) = \frac{1}{m} \sum_{j=1}^m a(\omega_j) e^{i(\omega_j, x)}$. Then
$$\mathbb{E} |f(x) - f_m(x)|^2 = m^{-1}\, \mathrm{var}(f)$$
$f_m(x) = \frac{1}{m} \sum_{j=1}^m a_j \sigma(\omega_j^T x)$ = two-layer neural network with activation function $\sigma(z) = e^{iz}$.
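As a hedged illustration of the “new” approach, the sketch below takes $\pi = N(0, I_d)$ and $a(\omega) \equiv 1$, both assumptions made here because they give a closed form: $\mathbb{E}_{\omega \sim \pi} e^{i(\omega, x)} = e^{-|x|^2/2}$. It then compares the sampled $f_m$ with that exact value for two values of $m$.

```python
import numpy as np

def f_exact(x):
    # For pi = N(0, I_d) and a(omega) = 1: E exp(i omega.x) = exp(-|x|^2 / 2).
    return np.exp(-0.5 * np.sum(x**2, axis=-1))

def f_m(x, omegas):
    """f_m(x) = (1/m) * sum_j a(omega_j) * exp(i omega_j.x), with a = 1."""
    phases = x @ omegas.T                  # shape (num_points, m)
    return np.exp(1j * phases).mean(axis=1)

rng = np.random.default_rng(0)
d = 20
x = rng.normal(size=(5, d)) / np.sqrt(d)   # a few evaluation points with |x| of order 1
for m in (100, 10_000):
    omegas = rng.normal(size=(m, d))       # i.i.d. sample of pi
    err = np.abs(f_m(x, omegas) - f_exact(x)).max()
    print(m, err)                          # error shrinks roughly like m**-0.5
```

The dimension `d` enters only through the sampling cost, not through the rate, which is the contrast with the grid-based bound $m^{-\alpha/d}$.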
Integral transform-based representation
Let $\sigma$ be a scalar nonlinear function (activation function), e.g. $\sigma$ = ReLU.
Consider functions represented in the form:
$$f(x; \theta) = \int_{\mathbb{R}^d} a(w) \sigma(w^T x)\, \pi(dw) = \mathbb{E}_{w \sim \pi}\, a(w) \sigma(w^T x) = \mathbb{E}_{(a, w) \sim \rho}\, a \sigma(w^T x)$$
$\theta$ = parameter to be optimized:
$\theta = a(\cdot)$ corresponds to a feature-based model.
$\theta = \rho$ corresponds to a two-layer neural network-like model.
$\theta = (a(\cdot), \pi(\cdot))$ is a new model.
Discretize
Fourier method: $\pi \sim \frac{1}{N} \sum_j \delta_{\omega_j}$, where $\{\omega_j\}$ lie on a uniform lattice. Optimize $a(\cdot)$:
$$f(x; \theta) \sim f_m(x) = \frac{1}{m} \sum_j a(w_j) \sigma(w_j^T x)$$
Neural network-based method: $\rho \sim \frac{1}{N} \sum_j \delta_{(a_j, \omega_j)}$ ($\{\omega_j\}$ is also optimized):
$$f(x; \theta) \sim f_m(x) = \frac{1}{m} \sum_j a_j \sigma(w_j^T x)$$
Then optimize (say, using L-BFGS). This is more in line with traditional numerical analysis (e.g. nonlinear finite element or meshless methods).
For truly large datasets, we need to use stochastic algorithms.
The objective functions are all expressed as expectations:
$$R(\theta) = \mathbb{E}_{x \sim \mu} (f(x; \theta) - f^*(x))^2$$
$$R(\theta_1, \theta_2) = \mathbb{E}_{x \sim \mu} (x - \Psi(\Phi(x; \theta_1); \theta_2))^2$$
$$I(\theta) = \mathbb{E}_{x \sim \mu_\theta} \frac{\phi(x; \theta) H\phi(x; \theta)}{\phi(x; \theta)^2}$$
Gradient descent (GD) can be readily converted to stochastic gradient descent (SGD). Let $F = F(\theta) = \mathbb{E}_{x \sim \mu}\, g(\theta, x)$ be the objective function:
GD: $\theta_{k+1} = \theta_k - \eta \nabla_\theta \mathbb{E}_{x \sim \mu}\, g(\theta_k, x)$
SGD: $\theta_{k+1} = \theta_k - \eta \nabla_\theta g(\theta_k, x_k)$
where $\{x_k\}$ are i.i.d. random samples.
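The GD and SGD updates can be compared side by side on a toy objective. The sketch below assumes $g(\theta, x) = \frac{1}{2}(\theta^T x - \theta_*^T x)^2$ with $x \sim N(0, I)$, a hypothetical choice for which the full gradient is available in closed form ($\nabla_\theta \mathbb{E}\, g = \theta - \theta_*$). The names `theta_star`, `gd` and `sgd` are illustrative only.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 5
theta_star = rng.normal(size=d)            # hypothetical target parameters

def grad_g(theta, x):
    # g(theta, x) = 0.5 * (theta.x - theta_star.x)^2, so grad = (theta.x - theta_star.x) * x
    return (x @ (theta - theta_star)) * x

def gd(theta0, eta, steps):
    theta = theta0.copy()
    for _ in range(steps):
        # For x ~ N(0, I), E[x x^T] = I, so the gradient of E g is theta - theta_star.
        theta -= eta * (theta - theta_star)
    return theta

def sgd(theta0, eta, steps, rng):
    theta = theta0.copy()
    for _ in range(steps):
        x_k = rng.normal(size=d)           # one fresh i.i.d. sample per step
        theta -= eta * grad_g(theta, x_k)
    return theta

theta0 = np.zeros(d)
print(np.linalg.norm(gd(theta0, 0.05, 2000) - theta_star))
print(np.linalg.norm(sgd(theta0, 0.05, 2000, rng) - theta_star))
```

SGD replaces the expectation by a single sample at each step, so its cost per step does not depend on the size of the dataset.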
Optimization: Defining gradient flows
“Free energy” = $R(f) = \mathbb{E}_{x \sim \mu} (f(x) - f^*(x))^2$
$$f(x) = \int a(w) \sigma(w^T x)\, \pi(dw) = \mathbb{E}_{w \sim \pi}\, a(w) \sigma(w^T x)$$
Follow Halperin and Hohenberg (1977):
$a$ = non-conserved, use “model A” dynamics (Allen-Cahn):
$$\frac{\partial a}{\partial t} = -\frac{\delta R}{\delta a}$$
$\pi$ = conserved (probability density), use “model B” (Cahn-Hilliard):
$$\frac{\partial \pi}{\partial t} + \nabla \cdot J = 0, \qquad J = \pi v, \quad v = -\nabla V, \quad V = \frac{\delta R}{\delta \pi}$$
Gradient flow for the feature-based model
Fix $\pi$, optimize $a$.
$$\partial_t a(w, t) = -\frac{\delta R}{\delta a}(w, t) = -\int a(\tilde{w}, t) K(w, \tilde{w})\, \pi(d\tilde{w}) + \tilde{f}(w)$$
$$K(w, \tilde{w}) = \mathbb{E}_x [\sigma(w^T x) \sigma(\tilde{w}^T x)], \qquad \tilde{f}(w) = \mathbb{E}_x [f^*(x) \sigma(w^T x)]$$
This is an integral equation with a symmetric positive definite kernel.
Decay estimates due to convexity: Let $f^*(x) = \mathbb{E}_{w \sim \pi}\, a^*(w) \sigma(w^T x)$ and
$$I(t) = \frac{1}{2} \| a(\cdot, t) - a^*(\cdot) \|^2 + t\, (R(a(t)) - R(a^*))$$
Then we have
$$\frac{dI}{dt} \le 0, \qquad R(a(t)) \le \frac{C_0}{t}$$
Conservative gradient flow
Optimize $\rho$: $f(x) = \mathbb{E}_{u \sim \rho}\, \phi(x, u)$
Example: $u = (a, w)$, $\phi(x, u) = a \sigma(w^T x)$
$$\partial_t \rho = \nabla \cdot (\rho \nabla V)$$
$$V(u) = \frac{\delta R}{\delta \rho}(u) = \mathbb{E}_x [(f(x) - f^*(x)) \phi(x, u)] = \int K(u, \tilde{u})\, \rho(d\tilde{u}) - \tilde{f}(u)$$
This is the mean-field equation derived by Chizat and Bach (2018), Mei, Montanari and Nguyen (2018), Rotskoff and Vanden-Eijnden (2018), and Sirignano and Spiliopoulos (2018) by studying the continuum limit of two-layer neural networks.
It does not satisfy displacement convexity.
Mixture model
Optimize $(a, \pi)$ ($a$ = non-conservative, $\pi$ = conservative):
$$\partial_t a(w, t) = -\frac{\delta R}{\delta a}(w, t) = -\int a(\tilde{w}, t) K(w, \tilde{w})\, \pi(d\tilde{w}, t) + \tilde{f}(w)$$
$$\partial_t \pi = \nabla \cdot (\pi \nabla \tilde{V}), \qquad \tilde{V}(w) = \frac{\delta R}{\delta \pi}(w)$$
Discretizing the gradient flows
Discretizing the population risk (into the empirical risk) using data.
Discretizing the gradient flow:
particle method: the dynamic version of Monte Carlo
smoothed particle method: analog of the vortex blob method
spectral method: very effective in low dimensions
We will see that the gradient descent (GD) algorithms for random feature and neural network models are simply the particle method discretizations of the gradient flows discussed before.
Particle method for the feature-based model
$$\partial_t a(w, t) = -\frac{\delta R}{\delta a}(w) = -\int a(\tilde{w}, t) K(w, \tilde{w})\, \pi(d\tilde{w}) + \tilde{f}(w)$$
$$\pi(dw) \sim \frac{1}{m} \sum_j \delta_{w_j}, \qquad a(w_j, t) \sim a_j(t)$$
Discretized version:
$$\frac{d}{dt} a_j(t) = -\frac{1}{m} \sum_k K(w_j, w_k) a_k(t) + \tilde{f}(w_j)$$
This is exactly the GD for the random feature model.
$$f(x) \sim f_m(x) = \frac{1}{m} \sum_j a_j \sigma(w_j^T x)$$
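The claim that the discretized equation is GD for the random feature model can be checked numerically. The sketch below makes several assumptions not in the slides: expectations over $x$ are replaced by an average over `n` Gaussian samples, $\sigma$ is ReLU, the target is a single neuron, and the risk carries a factor $\frac{1}{2}$. Under those assumptions the right-hand side of the particle system equals $-m$ times the gradient of the empirical risk in $a$, so the two dynamics agree up to a rescaling of time.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 500, 5, 50
x = rng.normal(size=(n, d))                 # samples standing in for x ~ mu
w = rng.normal(size=(m, d)) / np.sqrt(d)    # fixed random features w_j ~ pi
y = np.maximum(x[:, 0], 0.0)                # hypothetical target f*(x) = relu(e_1.x)

features = np.maximum(x @ w.T, 0.0)         # features[i, j] = sigma(w_j.x_i)
K = features.T @ features / n               # K(w_j, w_k) ~ E_x[sigma(w_j.x) sigma(w_k.x)]
f_tilde = features.T @ y / n                # f~(w_j) ~ E_x[f*(x) sigma(w_j.x)]

def particle_rhs(a):
    # Right-hand side of d a_j / dt = -(1/m) sum_k K(w_j, w_k) a_k + f~(w_j)
    return -(K @ a) / m + f_tilde

def risk(a):
    # 0.5 * mean_i (f_m(x_i) - y_i)^2 with f_m(x) = (1/m) sum_j a_j sigma(w_j.x)
    return 0.5 * np.mean((features @ a / m - y) ** 2)

def risk_grad(a):
    residual = features @ a / m - y
    return features.T @ residual / (n * m)

a = np.zeros(m)
dt = 0.5
for _ in range(200):
    a = a + dt * particle_rhs(a)            # forward Euler on the particle system
print(risk(np.zeros(m)), risk(a))
```

Each forward Euler step with step size `dt` is therefore a GD step on `risk` with learning rate `dt * m`.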
Particle method for the conservative flow
$$\partial_t \rho = \nabla \cdot (\rho \nabla V) \qquad (1)$$
Particle method discretization:
$$\rho(t, du) \sim \frac{1}{m} \sum_j \delta_{u_j(t)}$$
Define the loss function
$$I(u_1, \cdots, u_m) = R(f_m), \qquad f_m(x) = \frac{1}{m} \sum_j \phi(x, u_j)$$
Lemma: Given a set of initial data $\{u_j^0,\ j \in [m]\}$, the solution of (1) with initial data $\rho(0) = \frac{1}{m} \sum_{j=1}^m \delta_{u_j^0}$ is given by
$$\rho(t) = \frac{1}{m} \sum_{j=1}^m \delta_{u_j(t)}$$
where the particles $\{u_j(\cdot),\ j \in [m]\}$ solve:
$$\frac{du_j}{dt} = -\nabla_{u_j} I(u_1, \cdots, u_m), \qquad u_j(0) = u_j^0, \quad j \in [m]$$
This is exactly the GD dynamics for two-layer neural networks.
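A minimal sketch of the particle dynamics in the lemma, with each particle $u_j = (a_j, w_j)$ moved along $-\nabla_{u_j} I$ by explicit Euler steps. The choices here are assumptions for illustration: $\sigma$ = ReLU, a single-neuron target, an empirical average over `n` Gaussian samples in place of $\mathbb{E}_x$, a factor $\frac{1}{2}$ in the loss, and a step size multiplied by $m$ to offset the $1/m$ scaling of the gradients.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, m = 400, 5, 100
x = rng.normal(size=(n, d))
y = np.maximum(x[:, 0], 0.0)                # target: single neuron relu(e_1.x)

def f_m(a, w):
    # f_m(x) = (1/m) sum_j a_j sigma(w_j.x), evaluated at all sample points
    return np.maximum(x @ w.T, 0.0) @ a / m

def loss(a, w):
    return 0.5 * np.mean((f_m(a, w) - y) ** 2)

def grads(a, w):
    z = x @ w.T                             # pre-activations, shape (n, m)
    r = f_m(a, w) - y                       # residuals, shape (n,)
    grad_a = np.maximum(z, 0.0).T @ r / (n * m)
    # d/dw_j of a_j * relu(w_j.x) is a_j * 1[w_j.x > 0] * x
    grad_w = ((z > 0) * r[:, None]).T @ x * a[:, None] / (n * m)
    return grad_a, grad_w

a = rng.normal(size=m)                      # initial particles u_j^0 = (a_j, w_j)
w = rng.normal(size=(m, d)) / np.sqrt(d)
initial = loss(a, w)
eta = 0.05 * m                              # step size rescaled by m (an assumption)
for _ in range(300):
    ga, gw = grads(a, w)
    a, w = a - eta * ga, w - eta * gw       # every particle follows -grad_{u_j} I
print(initial, loss(a, w))
```

The empirical measure of the rows of `(a, w)` plays the role of $\rho(t)$, and the loop is plain gradient descent on the two-layer network.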
Comparison with conventional NN models
Continuous viewpoint (in this case the same as mean-field): $f_m(x) = \frac{1}{m} \sum_j a_j \sigma(w_j^T x)$
Conventional NN models: $f_m(x) = \sum_j a_j \sigma(w_j^T x)$
[Figure: two panels, each titled “Test errors”, with axes $\log_{10}(m)$ and $\log_{10}(n)$.]
Figure: (Left) continuous viewpoint; (Right) conventional NN models. Target function is a single neuron $f^*(x) = \sigma(e_1^T x)$.
Flow-based representation
Continuous dynamical system viewpoint (E (2017), Haber and Ruthotto (2017), “Neural ODEs” (Chen et al, 2018)):
$$\frac{dz}{d\tau} = g(\tau, z), \qquad z(0) = x$$
The flow map at time 1: $x \to z(x, 1)$.
Trial functions: $f = \alpha^T z(1)$
Will take $\alpha = 1$ for simplicity.
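One way to make the flow map concrete is a forward Euler discretization in $\tau$, which turns each step into a residual-style update $z \leftarrow z + \Delta\tau\, g(\tau, z)$. This is a sketch under assumptions: the vector field `g_linear` is a hypothetical test case with a known flow, `num_layers` is an illustrative name for the number of Euler steps, and $\alpha = 1$ is read here as the all-ones vector.

```python
import numpy as np

def flow_map(x, g, num_layers):
    """Forward Euler approximation of z(x, 1) for dz/dtau = g(tau, z), z(0) = x."""
    z = np.array(x, dtype=float)
    dt = 1.0 / num_layers
    for layer in range(num_layers):
        tau = layer * dt
        z = z + dt * g(tau, z)        # one residual-style update per step
    return z

def trial_function(x, g, alpha, num_layers=1000):
    # f(x) = alpha^T z(x, 1)
    return alpha @ flow_map(x, g, num_layers)

def g_linear(tau, z):
    # Hypothetical vector field with a known flow: dz/dtau = -z gives z(1) = exp(-1) * x.
    return -z

x = np.array([1.0, 2.0, 3.0])
alpha = np.ones(3)                    # "alpha = 1" read as the all-ones vector (assumption)
print(trial_function(x, g_linear, alpha), np.exp(-1.0) * x.sum())
```

In a learned model, `g` would be a parametrized function such as a small neural network, and its parameters would be the quantities to optimize.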