Approximate Posterior Sampling via Stochastic Optimisation

Connie Trojan
Supervisor: Srshti Putcha
6th September 2019
Background

Large scale machine learning models rely on stochastic optimisation techniques to learn parameters of interest.
It is useful to understand parameter uncertainty using Bayesian inference.
The Bayesian posterior is usually simulated using Markov chain Monte Carlo (MCMC) sampling algorithms.
Stochastic gradient MCMC methods combine stochastic optimisation methods with MCMC to reduce computation time.
Notation

In the Bayesian approach, the unknown parameter θ is treated as a random variable. The Bayesian posterior distribution π(θ | x) has the form:

π(θ | x) ∝ p(θ) ℓ(x | θ) = p(θ) ∏_{i=1}^{N} ℓ(x_i | θ),

where:
p(θ) is the prior distribution,
ℓ(x_i | θ) is the likelihood associated with observation i,
N is the size of the dataset.
Notation

In particular, gradient-based MCMC algorithms use the log posterior f(θ) to propose moves:

f(θ) = k + f_0(θ) + Σ_{i=1}^{N} f_i(θ) ≡ k + log p(θ) + Σ_{i=1}^{N} log ℓ(x_i | θ),

where k is an additive constant.
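To make the notation concrete, here is a minimal sketch (not from the slides) of the terms f_0, f_i and their gradients for a toy model in which x_i ∼ N(θ, 1) with a N(0, 10²) prior on θ; all function names are illustrative assumptions.

```python
import numpy as np

# Toy model (illustrative assumption, not from the slides):
# x_i ~ N(theta, 1) likelihood with a N(0, 10^2) prior on theta.

def f0(theta):
    """Log prior, log p(theta), up to an additive constant."""
    return -theta**2 / (2 * 10**2)

def grad_f0(theta):
    """Gradient of the log prior."""
    return -theta / 10**2

def fi(theta, x_i):
    """Per-observation log likelihood, log l(x_i | theta), up to a constant."""
    return -(x_i - theta)**2 / 2

def grad_fi(theta, x_i):
    """Gradient of the per-observation log likelihood."""
    return x_i - theta

def log_posterior(theta, x):
    """Unnormalised log posterior: f(theta) = f0(theta) + sum_i fi(theta, x_i)."""
    return f0(theta) + np.sum(fi(theta, x))
```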
Stochastic Optimisation

Efficient way of learning model parameters, typically used in machine learning.

Stochastic Gradient Ascent (SGA)
Set starting value θ_0, batch size n ≪ N, and step sizes ε_t. Iterate:
1. Take a subsample S_t of size n from the data.
2. Estimate the gradient at θ_t by
   ∇f̂(θ_t) = ∇f_0(θ_t) + (N/n) Σ_{x_i ∈ S_t} ∇f_i(θ_t)
3. Set θ_{t+1} = θ_t + ε_t ∇f̂(θ_t)

There are many ways of speeding up convergence, such as adding in a momentum term:
θ_{t+1} = θ_t + ε_t ∇f̂(θ_t) + γ(θ_t − θ_{t−1})

A sketch implementation is given below.
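The following is a minimal sketch of SGA with optional momentum, assuming the grad_f0 and grad_fi functions from the toy example above; the function name sga, the schedule parameters, and the default values are illustrative, not from the slides.

```python
import numpy as np

def sga(x, grad_f0, grad_fi, theta0=0.0, n=32, n_iter=10_000,
        alpha=0.01, beta=1.0, gamma_decay=0.6, momentum=0.0):
    """Stochastic gradient ascent on the log posterior (1-d sketch).

    Uses the unbiased gradient estimate
        grad_f0(theta) + (N / n) * sum_{x_i in S_t} grad_fi(theta, x_i)
    and Robbins-Monro step sizes eps_t = (alpha * t + beta)^(-gamma_decay).
    """
    N = len(x)
    theta, theta_prev = theta0, theta0
    for t in range(1, n_iter + 1):
        # 1. Take a subsample S_t of size n from the data.
        batch = np.random.choice(x, size=n, replace=False)
        # 2. Unbiased estimate of the full log-posterior gradient.
        grad_est = grad_f0(theta) + (N / n) * np.sum(grad_fi(theta, batch))
        # 3. Gradient step, with an optional momentum term.
        eps_t = (alpha * t + beta) ** (-gamma_decay)
        theta_new = theta + eps_t * grad_est + momentum * (theta - theta_prev)
        theta_prev, theta = theta, theta_new
    return theta
```

For example, on data drawn from N(1, 1), `sga(np.random.normal(1.0, 1.0, size=10_000), grad_f0, grad_fi)` would return a point estimate close to the posterior mode.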
Stochastic Optimisation

Robbins-Monro criteria for convergence: if Σ_{t=1}^∞ ε_t = ∞ and Σ_{t=1}^∞ ε_t² < ∞, then θ_t will converge to a local maximum.
Usually set ε_t = (αt + β)^{−γ} with γ ∈ (0.5, 1].
These algorithms only converge to a point estimate of the posterior mode.
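As a quick sanity check (not on the slides), comparison with the integral of t^{-γ} shows that this power-law schedule satisfies both Robbins-Monro conditions exactly when γ ∈ (0.5, 1]:

```latex
% First series diverges iff gamma <= 1; second converges iff 2*gamma > 1.
\sum_{t=1}^{\infty} \epsilon_t
  = \sum_{t=1}^{\infty} (\alpha t + \beta)^{-\gamma} = \infty
  \quad \text{for } \gamma \le 1,
\qquad
\sum_{t=1}^{\infty} \epsilon_t^2
  = \sum_{t=1}^{\infty} (\alpha t + \beta)^{-2\gamma} < \infty
  \quad \text{for } \gamma > \tfrac{1}{2}.
```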
MCMC

Many problems for which Bayesian inference would be useful involve non-standard distributions and a large number of parameters, making exact inference challenging. MCMC algorithms aim to generate random samples from the posterior. These samplers construct a Markov chain, often a random walk, which converges to the desired stationary distribution.
Metropolis-Adjusted Langevin Algorithm (MALA)

The Langevin diffusion describes dynamics which converge to π(θ):

dθ(t) = (1/2) ∇f(θ(t)) dt + db(t),

where b(t) denotes Brownian motion. MALA uses the following discretisation to propose samples:

θ_{t+1} = θ_t + (σ²/2) ∇f(θ_t) + σ η_t

A Metropolis-Hastings accept/reject step is then used to correct discretisation errors, ensuring convergence to the desired stationary distribution.
MALA algorithm

Set starting value θ_0 and step size σ². Iterate the following:
1. Set θ* = θ_t + (σ²/2) ∇f(θ_t) + σ η_t, where η_t ∼ N(0, I).
2. Accept and set θ_{t+1} = θ* with probability

   a(θ*, θ_t) = min{ 1, [π(θ*) q(θ_t | θ*)] / [π(θ_t) q(θ* | θ_t)] },

   where q(x | y) = P(θ* = x | θ_t = y) is the proposal density.
3. If rejected, set θ_{t+1} = θ_t.

A sketch implementation is given below.
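A minimal sketch of MALA for a one-dimensional target, assuming functions for the unnormalised log posterior and its gradient are available (e.g. log_posterior from the toy example, with grad_log_post defined analogously); all names are illustrative.

```python
import numpy as np

def mala(log_post, grad_log_post, theta0=0.0, sigma=0.1, n_iter=10_000):
    """Metropolis-adjusted Langevin algorithm (1-d sketch)."""
    def log_q(x, y):
        # Log proposal density of x given current state y, up to an additive
        # constant (which cancels in the acceptance ratio):
        # x ~ N(y + (sigma^2 / 2) * grad f(y), sigma^2).
        mean = y + 0.5 * sigma**2 * grad_log_post(y)
        return -(x - mean) ** 2 / (2 * sigma**2)

    samples = np.empty(n_iter)
    theta = theta0
    for t in range(n_iter):
        # 1. Langevin proposal.
        eta = np.random.normal()
        theta_star = theta + 0.5 * sigma**2 * grad_log_post(theta) + sigma * eta
        # 2. Metropolis-Hastings accept/reject step (on the log scale).
        log_a = (log_post(theta_star) + log_q(theta, theta_star)
                 - log_post(theta) - log_q(theta_star, theta))
        if np.log(np.random.uniform()) < log_a:
            theta = theta_star  # accept
        # 3. If rejected, theta stays the same.
        samples[t] = theta
    return samples
```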
MALA

[Figure: MALA samples for three step sizes, with acceptance rates a: σ = 0.03 (a = 0.99), σ = 0.13 (a = 0.57), σ = 0.20 (a = 0.13).]
Stochastic Gradient Langevin Dynamics (SGLD)

SGLD aims to reduce the computational cost of MALA by replacing the full gradient calculation in the proposal with the stochastic approximation ∇f̂(θ):

θ_{t+1} = θ_t + (ε_t/2) ∇f̂(θ_t) + √ε_t η_t

Here, the ε_t decrease to 0 as in SGA. Since the Metropolis-Hastings acceptance rate tends to 1 as the step size decreases, the costly accept/reject step is omitted.
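A minimal SGLD sketch, reusing the stochastic gradient estimate from the SGA sketch above; the schedule parameters and function names are illustrative assumptions, not from the slides.

```python
import numpy as np

def sgld(x, grad_f0, grad_fi, theta0=0.0, n=32, n_iter=10_000,
         alpha=0.01, beta=1.0, gamma=0.6):
    """Stochastic gradient Langevin dynamics (1-d sketch).

    Same stochastic gradient as SGA, but with injected Gaussian noise of
    variance eps_t and no Metropolis-Hastings correction.
    """
    N = len(x)
    theta = theta0
    samples = np.empty(n_iter)
    for t in range(1, n_iter + 1):
        eps_t = (alpha * t + beta) ** (-gamma)   # Robbins-Monro step sizes
        batch = np.random.choice(x, size=n, replace=False)
        grad_est = grad_f0(theta) + (N / n) * np.sum(grad_fi(theta, batch))
        # Langevin-style update with the stochastic gradient; accept/reject omitted.
        theta = theta + 0.5 * eps_t * grad_est + np.sqrt(eps_t) * np.random.normal()
        samples[t - 1] = theta
    return samples
```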