Non-asymptotic convergence bound for the Unadjusted Langevin Algorithm

Alain Durmus, Eric Moulines, Marcelo Pereyra
Telecom ParisTech, Ecole Polytechnique, Bristol University

LS 3 seminar, November 23, 2016
Outline

1. Motivation
2. Framework
3. Sampling from a strongly log-concave distribution
4. Computable bounds in total variation for super-exponential densities
5. Deviation inequalities
6. Non-smooth potentials
Introduction

Sampling distributions over high-dimensional state spaces has recently attracted a lot of research effort in the computational statistics and machine learning communities.

Applications (non-exhaustive):
1. Bayesian inference for high-dimensional models and Bayesian nonparametrics
2. Bayesian linear inverse problems (typically function-space problems)
3. Aggregation of estimators and experts

Most of the sampling techniques known so far do not scale to high dimension, and the challenges in this area are numerous.
Bayesian setting (I)

In a Bayesian setting, a parameter $\beta \in \mathbb{R}^d$ is endowed with a prior distribution $p$, and the observations are given by a likelihood: $Y \sim \mathrm{L}(\cdot \mid \beta)$.

Inference is then based on the posterior distribution:
$$\pi(\mathrm{d}\beta \mid Y) = \frac{p(\mathrm{d}\beta)\, \mathrm{L}(Y \mid \beta)}{\int \mathrm{L}(Y \mid u)\, p(\mathrm{d}u)}.$$

In most cases the normalizing constant is not tractable:
$$\pi(\mathrm{d}\beta \mid Y) \propto p(\mathrm{d}\beta)\, \mathrm{L}(Y \mid \beta).$$
Bayesian setting (II)

Bayesian decision theory relies on computing expectations:
$$\int_{\mathbb{R}^d} f(\beta)\, \mathrm{L}(Y \mid \beta)\, p(\mathrm{d}\beta).$$

Generic problem: estimation of an expectation $\mathbb{E}_\pi[f]$, where
- $\pi$ is known up to a multiplicative factor;
- sampling directly from $\pi$ is not an option;
- $\pi$ is high-dimensional.
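In practice such expectations are approximated by ergodic averages along a Markov chain targeting $\pi$. A minimal sketch of this estimator, assuming some sampler (any of the algorithms discussed below) has already produced approximately $\pi$-distributed draws; the burn-in convention and names are illustrative:

```python
import numpy as np

def ergodic_average(samples, f, burn_in=0):
    """Estimate E_pi[f] by averaging f over the draws of a chain targeting pi.

    samples: array of shape (n_samples, d), one draw per row
    f: function from R^d to a scalar
    burn_in: number of initial draws discarded to reduce initialization bias
    """
    return np.mean([f(x) for x in samples[burn_in:]])
```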
Logistic and probit regression

Likelihood: binary regression set-up in which the binary observations (responses) $(Y_1, \dots, Y_n)$ are conditionally independent Bernoulli random variables with success probability $F(\beta^T X_i)$, where
1. $X_i$ is a $d$-dimensional vector of known covariates,
2. $\beta$ is a $d$-dimensional vector of unknown regression coefficients,
3. $F$ is a distribution function.

Two important special cases:
1. probit regression: $F$ is the standard normal distribution function,
2. logistic regression: $F$ is the standard logistic distribution function, $F(t) = \mathrm{e}^t/(1+\mathrm{e}^t)$.
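A minimal sketch of the two link functions and the resulting Bernoulli log-likelihood (function names are illustrative, not from the slides):

```python
import numpy as np
from scipy.stats import norm

def probit_link(t):
    # F(t) = Phi(t), the standard normal distribution function
    return norm.cdf(t)

def logistic_link(t):
    # F(t) = e^t / (1 + e^t), written in a numerically stable form
    return 1.0 / (1.0 + np.exp(-t))

def log_likelihood(beta, X, Y, F):
    """sum_i Y_i log F(beta^T X_i) + (1 - Y_i) log(1 - F(beta^T X_i))."""
    p = F(X @ beta)
    return np.sum(Y * np.log(p) + (1.0 - Y) * np.log1p(-p))
```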
A daunting problem?

The posterior density of $\beta$ is given, up to a proportionality constant, by
$$\pi(\beta \mid (Y, X)) \propto \exp(-U(\beta)),$$
where the potential $U(\beta)$ is given by
$$U(\beta) = -\sum_{i=1}^{n} \left\{ Y_i \log F(\beta^T X_i) + (1 - Y_i) \log\left(1 - F(\beta^T X_i)\right) \right\} + g(\beta),$$
where $g$ is the negative log density of the prior distribution.

Two important cases:
- Gaussian prior, $g(\beta) = (1/2)\, \beta^T \Sigma \beta$: ridge regression;
- Laplace prior, $g(\beta) = \lambda \sum_{i=1}^{d} |\beta_i|$: lasso regression.
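As a concrete illustration, a sketch of the potential and its gradient in the ridge (Gaussian prior) logistic case; $\Sigma$ is assumed symmetric positive definite, and all names are illustrative:

```python
import numpy as np

def potential_ridge_logistic(beta, X, Y, Sigma):
    """U(beta) = -sum_i { Y_i log F(x_i^T beta) + (1-Y_i) log(1 - F(x_i^T beta)) }
                 + (1/2) beta^T Sigma beta, with F the logistic function."""
    t = X @ beta
    # For F logistic: log F(t) = -log(1 + e^{-t}), log(1 - F(t)) = -log(1 + e^{t})
    log_lik = np.sum(-Y * np.logaddexp(0.0, -t) - (1.0 - Y) * np.logaddexp(0.0, t))
    return -log_lik + 0.5 * beta @ Sigma @ beta

def grad_potential_ridge_logistic(beta, X, Y, Sigma):
    """grad U(beta) = X^T (F(X beta) - Y) + Sigma beta (Sigma symmetric)."""
    p = 1.0 / (1.0 + np.exp(-(X @ beta)))
    return X.T @ (p - Y) + Sigma @ beta
```

With $\Sigma$ positive definite this potential is smooth and strongly convex, which is the setting exploited later in the talk.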
New challenges

Problem: the number of predictor variables $d$ is large ($10^4$ and up).

Examples: text categorization, genomics and proteomics (gene expression analysis), ...

The most popular algorithms for Bayesian inference in binary regression models are based on data augmentation:
1. probit link: Albert and Chib (1993);
2. logistic link: Polya-Gamma sampler, Polson and Scott (2012), ...
Data Augmentation algorithms (I)

Data augmentation: instead of sampling $\pi(\beta \mid (X, Y))$, sample $\pi(\beta, W \mid (X, Y))$, a probability measure on $\mathbb{R}^{d_1} \times \mathbb{R}^{d_2}$, and take the marginal w.r.t. $\beta$.

Typical application of the Gibbs sampler: sample in turn $\pi(\beta \mid (X, Y, W))$ and $\pi(W \mid (X, Y, \beta))$.

The Gibbs sampler consists in sampling a Markov chain $(\beta_k, W_k)_{k \geq 0}$ defined as follows. Given $(\beta_k, W_k)$:
1. draw $W_{k+1} \sim \pi(\cdot \mid (\beta_k, X, Y))$;
2. draw $\beta_{k+1} \sim \pi(\cdot \mid (W_{k+1}, X, Y))$.

The target density $\pi(\beta, W \mid (X, Y))$ is invariant for the Markov chain $(\beta_k, W_k)_{k \geq 0}$. The choice of the DA should make these two steps reasonably easy; a concrete instance is sketched below.
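For the probit link, Albert and Chib (1993) augment with latent Gaussians $W_i$: $W_i \mid \beta, Y_i$ is a truncated normal and $\beta \mid W$ is Gaussian, so both Gibbs steps are exact draws. A minimal sketch under a $\mathcal{N}(0, \tau^{-1} I_d)$ prior (the prior precision `tau` and all names are assumptions for illustration):

```python
import numpy as np
from scipy.stats import truncnorm

def albert_chib(X, Y, n_iter, tau=1.0, rng=None):
    """Albert-Chib data augmentation Gibbs sampler for probit regression
    with a N(0, tau^{-1} I) prior on beta."""
    rng = np.random.default_rng() if rng is None else rng
    d = X.shape[1]
    # Posterior covariance of beta | W does not depend on W: (tau I + X^T X)^{-1}
    Sigma = np.linalg.inv(tau * np.eye(d) + X.T @ X)
    chol = np.linalg.cholesky(Sigma)
    beta = np.zeros(d)
    chain = np.empty((n_iter, d))
    for k in range(n_iter):
        mu = X @ beta
        # W_i | beta, Y_i ~ N(mu_i, 1) truncated to (0, inf) if Y_i = 1,
        # to (-inf, 0) if Y_i = 0; truncnorm takes standardized bounds (bound - loc)/scale
        a = np.where(Y == 1, -mu, -np.inf)
        b = np.where(Y == 1, np.inf, -mu)
        W = truncnorm.rvs(a, b, loc=mu, scale=1.0, random_state=rng)
        # beta | W ~ N(Sigma X^T W, Sigma)
        beta = Sigma @ (X.T @ W) + chol @ rng.standard_normal(d)
        chain[k] = beta
    return chain
```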
Data Augmentation algorithms (II)

Question: can we control the distance between the law of $(\beta_n, W_n)$ and the stationary distribution $\pi(\beta, W \mid (X, Y))$?

Definition (geometric ergodicity). The Markov kernel $P$ on $(\mathbb{R}^d, \mathcal{B}(\mathbb{R}^d))$ is geometrically ergodic if there exists $\kappa \in (0, 1)$ such that for all $n \geq 0$ and $x \in \mathbb{R}^d$,
$$\| P^n(x, \cdot) - \pi \|_{\mathrm{TV}} \leq C(x)\, \kappa^n,$$
where for two probability measures $\mu, \nu$ on $\mathbb{R}^d$ we define
$$\| \mu - \nu \|_{\mathrm{TV}} = \sup_{|f| \leq 1} | \mu(f) - \nu(f) |.$$
Data Augmentation algorithms (III)

The algorithm of Albert and Chib and the Polya-Gamma sampler have been shown to be uniformly geometrically ergodic, BUT
- the geometric rate of convergence can be exponentially small in the dimension;
- the bounds do not allow one to construct honest confidence intervals or credible regions.

The algorithms are very demanding in terms of computational resources:
- applicable only when $d$ is small ($\approx 10$) to moderate ($\approx 100$), certainly not when $d$ is large ($10^4$ or more);
- convergence time is prohibitive as soon as $d \geq 10^2$.
A daunting problem?

In the case of ridge regression, the potential $U$ is smooth and strongly convex. In the case of lasso regression, the potential $U$ is non-smooth but still convex. A wealth of reasonably fast optimisation algorithms is available to solve this problem in high dimension...
Framework

Denote by $\pi$ a target density w.r.t. the Lebesgue measure on $\mathbb{R}^d$, known up to a normalisation factor:
$$x \mapsto \mathrm{e}^{-U(x)} \Big/ \int_{\mathbb{R}^d} \mathrm{e}^{-U(y)}\, \mathrm{d}y.$$
Implicitly, $d \gg 1$.

Assumption: $U$ is $L$-smooth: twice continuously differentiable and there exists a constant $L$ such that for all $x, y \in \mathbb{R}^d$,
$$\| \nabla U(x) - \nabla U(y) \| \leq L \| x - y \|.$$
This condition can be weakened.
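For the ridge logistic potential above, an explicit constant is available: the logistic log-likelihood has Hessian bounded by $(1/4) X^T X$, so $L \leq \|X\|^2/4 + \|\Sigma\|$ in operator norm. A quick randomized sanity check of a candidate Lipschitz constant, purely illustrative (names are assumptions):

```python
import numpy as np

def check_lipschitz(grad_U, L, d, n_trials=1000, rng=None):
    """Empirically test ||grad U(x) - grad U(y)|| <= L ||x - y|| on random pairs.

    A passing check does not prove L-smoothness; a failure disproves the candidate L.
    """
    rng = np.random.default_rng() if rng is None else rng
    for _ in range(n_trials):
        x, y = rng.standard_normal(d), rng.standard_normal(d)
        if np.linalg.norm(grad_U(x) - grad_U(y)) > L * np.linalg.norm(x - y) + 1e-9:
            return False
    return True
```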
Langevin diffusion

Langevin SDE:
$$\mathrm{d}Y_t = -\nabla U(Y_t)\, \mathrm{d}t + \sqrt{2}\, \mathrm{d}B_t,$$
where $(B_t)_{t \geq 0}$ is a $d$-dimensional Brownian motion.

$(P_t)_{t \geq 0}$ is a Markov semigroup:
- aperiodic, strong Feller (all compact sets are small);
- reversible w.r.t. $\pi$ (admits $\pi$ as its unique invariant distribution).

$\pi \propto \mathrm{e}^{-U}$ is reversible, hence the unique invariant probability measure. For all $x \in \mathbb{R}^d$,
$$\lim_{t \to +\infty} \| \delta_x P_t - \pi \|_{\mathrm{TV}} = 0.$$
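In practice the diffusion is simulated via its Euler-Maruyama discretization, which is the Unadjusted Langevin Algorithm of the title: $X_{k+1} = X_k - \gamma \nabla U(X_k) + \sqrt{2\gamma}\, Z_{k+1}$ with $Z_{k+1} \sim \mathcal{N}(0, I_d)$. A minimal sketch (the constant step size $\gamma$ and all names are illustrative):

```python
import numpy as np

def ula(grad_U, x0, gamma, n_iter, rng=None):
    """Unadjusted Langevin Algorithm: Euler-Maruyama discretization of
    dY_t = -grad U(Y_t) dt + sqrt(2) dB_t, with constant step size gamma."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    chain = np.empty((n_iter, x.size))
    for k in range(n_iter):
        x = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(x.size)
        chain[k] = x
    return chain
```

Since no Metropolis correction is applied, the chain is biased: its stationary distribution differs from $\pi$ by an amount that grows with $\gamma$, and the non-asymptotic total-variation bounds of the talk quantify this trade-off.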