  1. Algorithmes Gradient-Proximaux pour l'inférence statistique
  Gersende Fort, Institut de Mathématiques de Toulouse, CNRS, Toulouse, France

  2. Based on joint works with Yves Atchadé (Univ. Michigan, USA), Jean-François Aujol (IMB, Bordeaux, France), Eric Moulines (Ecole Polytechnique, France), Adeline Samson and Edouard Ollier (Univ. Grenoble Alpes, France), Charles Dossal (IMT), Laurent Risser (IMT).
  → On Perturbed Proximal-Gradient algorithms (JMLR, 2017)
  → Stochastic Proximal Gradient Algorithms for Penalized Mixed Models (Stat & Computing, 2018)
  → Acceleration for perturbed Proximal Gradient algorithms (work in progress)
  → Algorithmes Gradient Proximaux Stochastiques (GRETSI, 2017)

  3. Outline
  Motivations
    Pharmacokinetic
    General case: latent variable models
    Votes in the US Congress
    General case: discrete graphical models
    Conclusion, part I
  Penalized ML through perturbed Stochastic-Gradient algorithms
    Asymptotic behavior of the algorithm
    Numerical illustration

  4. Motivation 1: Pharmacokinetic (1/2)
  N patients. At time 0: dose D of a drug. For patient i, evolution of the concentration at times t_ij, 1 ≤ j ≤ J_i: observations {Y_ij, 1 ≤ j ≤ J_i}.
  Model:
    Y_ij = F(t_ij, X_i) + ε_ij,   ε_ij i.i.d. ~ N(0, σ²)
    X_i = Z_i β + d_i ∈ R^L,      d_i i.i.d. ~ N_L(0, Ω) and independent of ε
    Z_i known matrix s.t. each row of X_i has an intercept (fixed effect) and covariates.

  5. Motivation 1: Pharmacokinetic (1/2), continued
  Same model as on slide 4. Example of model F: one-compartment, with digestive absorption:
    F(t, [ln Cl, ln V, ln A]) = C(Cl, V, A, D) ( exp(-(Cl/V) t) - exp(-A t) )
  For each patient i:
    ln Cl_i = β_{0,Cl} + β_{1,Cl} Z_{i,1,Cl} + ... + β_{K,Cl} Z_{i,K,Cl} + d_{Cl,i}
    ln V_i: idem, with covariates Z_{i,k,V} and coefficients β_{k,V}, plus d_{V,i}
    ln A_i: idem, with covariates Z_{i,k,A} and coefficients β_{k,A}, plus d_{A,i}
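  As a concrete illustration, here is a minimal Python sketch of this one-compartment model with first-order (digestive) absorption and of one simulated patient. The prefactor C(Cl, V, A, D) is left unspecified on the slide; the sketch assumes the standard constant D·A / (V·(A - Cl/V)), and the dose, covariate, time grid and variances are illustrative placeholders.

```python
import numpy as np

def F(t, x, D=4.0):
    """Concentration at time t for individual parameters x = [ln Cl, ln V, ln A]."""
    Cl, V, A = np.exp(x)
    ke = Cl / V                              # elimination rate
    C = D * A / (V * (A - ke))               # assumed prefactor (the slide leaves C(Cl, V, A, D) unspecified)
    return C * (np.exp(-ke * t) - np.exp(-A * t))

rng = np.random.default_rng(0)
# beta stacks, for each of ln Cl, ln V, ln A, an intercept and one covariate coefficient
beta = np.array([np.log(2.0), 0.1,
                 np.log(30.0), 0.0,
                 np.log(1.5), 0.0])
z_i = 0.7                                            # covariate of patient i (e.g. standardized weight)
Z_i = np.kron(np.eye(3), np.array([1.0, z_i]))       # 3 x 6 design: intercept + covariate per row of X_i
d_i = rng.multivariate_normal(np.zeros(3), 0.01 * np.eye(3))   # random effect d_i ~ N_3(0, Omega)
X_i = Z_i @ beta + d_i                               # X_i = Z_i beta + d_i

t_ij = np.array([0.5, 1.0, 2.0, 4.0, 8.0, 12.0])     # observation times of patient i
Y_ij = F(t_ij, X_i) + rng.normal(0.0, 0.1, size=t_ij.size)   # Y_ij = F(t_ij, X_i) + eps_ij
print(Y_ij)
```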

  6. Motivation 1: Pharmacokinetic (1/2), continued
  Same model as on slide 4. Statistical analysis:
    estimation of θ = (β, σ², Ω), under sparsity constraints on β
    selection of the covariates based on the estimate β̂
  → Penalized Maximum Likelihood

  7. Motivation 1: Pharmacokinetic (2/2)
  Model:
    Y_ij = f(t_ij, X_i) + ε_ij,   ε_ij i.i.d. ~ N(0, σ²)
    X_i = Z_i β + d_i ∈ R^L,      d_i i.i.d. ~ N_L(0, Ω) and independent of ε
    Z_i known matrix s.t. each row of X_i has an intercept (fixed effect) and covariates.
  Likelihoods:
    Complete likelihood: the distribution of {Y_ij, X_i; 1 ≤ i ≤ N, 1 ≤ j ≤ J_i} has an explicit expression:
      ∏_{i=1}^N ∏_{j=1}^{J_i} N(f(t_ij, X_i), σ²)[Y_ij]  ×  ∏_{i=1}^N N_L(Z_i β, Ω)[X_i]
    Likelihood: the distribution of {Y_ij; 1 ≤ i ≤ N, 1 ≤ j ≤ J_i} is not explicit.
  ML: here, the likelihood is not concave.
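  A small Python sketch of this complete log-likelihood, only meant to show that every factor above can be evaluated explicitly once the latent X_i are given. The mean function f, the designs Z_i and the values of (β, σ², Ω) below are illustrative placeholders, not the slide's pharmacokinetic model.

```python
import numpy as np
from scipy.stats import norm, multivariate_normal

def f(t, x):
    # placeholder mean function (any explicit f works here); x = individual parameters
    return np.exp(x[0]) * np.exp(-np.exp(x[1]) * t)

def complete_loglik(theta, Y, t, X, Z):
    """Log of the complete likelihood above; theta = (beta, sigma2, Omega),
    Y, t: lists of per-patient arrays, X: (N, L) latent parameters, Z: list of design matrices."""
    beta, sigma2, Omega = theta
    ll = 0.0
    for Y_i, t_i, X_i, Z_i in zip(Y, t, X, Z):
        ll += norm.logpdf(Y_i, loc=f(t_i, X_i), scale=np.sqrt(sigma2)).sum()   # sum_j log N(f(t_ij, X_i), sigma^2)[Y_ij]
        ll += multivariate_normal.logpdf(X_i, mean=Z_i @ beta, cov=Omega)      # log N_L(Z_i beta, Omega)[X_i]
    return ll

rng = np.random.default_rng(1)
L = 2
Z = [np.eye(L) for _ in range(3)]                          # trivial designs: X_i = beta + d_i
beta, sigma2, Omega = np.zeros(L), 0.04, 0.1 * np.eye(L)
X = np.array([rng.multivariate_normal(Z_i @ beta, Omega) for Z_i in Z])
t = [np.array([0.5, 1.0, 2.0])] * 3
Y = [f(t_i, X_i) + rng.normal(0, np.sqrt(sigma2), t_i.size) for t_i, X_i in zip(t, X)]
print(complete_loglik((beta, sigma2, Omega), Y, t, X, Z))
```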

  8. General case: latent variable models
  The log-likelihood of the observations Y is of the form (dependence upon Y is omitted)
    θ ↦ log L(θ),   L(θ) = ∫_X p_θ(x) μ(dx),
  where μ is a σ-finite positive measure on a set X, and x collects the missing/latent data.
  Previous example: x ← (X_1, ..., X_N), μ ← Lebesgue measure on R^{LN}.
  In these models, the complete likelihood p_θ(x) can be evaluated explicitly, but the likelihood has no closed-form expression.
  The exact integral could be replaced by a Monte Carlo approximation, but this is known to be inefficient (see the sketch below). Numerical methods based on the a posteriori distribution of the missing data are preferred (see e.g. Expectation-Maximization approaches).
  → What about the gradient of the (log-)likelihood?
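  For illustration, a minimal sketch of the plain Monte Carlo alternative mentioned above: sample the latent data from their prior and average the conditional likelihood of the observations. The toy Gaussian model (and its closed-form marginal, used as a check) is an assumption for the example, not the slide's model; in realistic latent-variable models this estimator has a very large variance, which is the inefficiency referred to above.

```python
import numpy as np

def mc_marginal_likelihood(cond_lik, sample_prior, n_draws=10_000, seed=0):
    """Plain Monte Carlo estimate of L(theta) = ∫ cond_lik(x) prior(dx)."""
    rng = np.random.default_rng(seed)
    draws = sample_prior(rng, n_draws)
    return np.mean([cond_lik(x) for x in draws])

# Toy example: scalar latent X ~ N(0, 1), one observation Y | X ~ N(X, 0.5^2), observed Y = 1.2.
y, s = 1.2, 0.5
est = mc_marginal_likelihood(
    cond_lik=lambda x: np.exp(-0.5 * (y - x) ** 2 / s**2) / np.sqrt(2 * np.pi * s**2),
    sample_prior=lambda rng, n: rng.standard_normal(n),
)
exact = np.exp(-0.5 * y**2 / (1 + s**2)) / np.sqrt(2 * np.pi * (1 + s**2))   # marginal is N(0, 1 + s^2)
print(est, exact)
```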

  9. Latent variable model: Gradient of the likelihood
    log L(θ) = log ∫ p_θ(x) μ(dx)
  Under regularity conditions, θ ↦ log L(θ) is C¹ and
    ∇ log L(θ) = ∫ ∂_θ p_θ(x) μ(dx) / ∫ p_θ(z) μ(dz)
               = ∫ ∂_θ log p_θ(x) · [ p_θ(x) / ∫ p_θ(z) μ(dz) ] μ(dx),
  where the bracketed ratio is the a posteriori distribution of the latent data.

  10. Latent variable model: Gradient of the likelihood, continued
  With the computation of the previous slide, the gradient of the log-likelihood
    ∇_θ { log L(θ) } = ∫ H_θ(x) π_θ(dx)
  is an intractable expectation w.r.t. π_θ, the conditional distribution of the latent variable given the observations Y (known up to a normalizing constant).
  For all (x, θ), H_θ(x) can be evaluated.
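  Since π_θ is only known up to a constant, a natural way to approximate this expectation is to run a Markov chain targeting π_θ and average H_θ along the chain. Below is a minimal sketch with a random-walk Metropolis-Hastings sampler; the functions log_post_unnorm and H, the step size, the chain length and the Gaussian toy posterior (chosen so the exact expectation is known) are illustrative assumptions, not the specific samplers of the referenced papers.

```python
import numpy as np

def mcmc_gradient_estimate(log_post_unnorm, H, x0, n_iter=5_000, step=0.5, seed=0):
    """Estimate E_{pi_theta}[H_theta(X)] with a random-walk MH chain targeting
    pi_theta(x) proportional to exp(log_post_unnorm(x))."""
    rng = np.random.default_rng(seed)
    x, logp = np.asarray(x0, dtype=float), log_post_unnorm(x0)
    acc = np.zeros_like(H(x0))
    for _ in range(n_iter):
        prop = x + step * rng.standard_normal(x.shape)
        logp_prop = log_post_unnorm(prop)
        if np.log(rng.uniform()) < logp_prop - logp:   # MH acceptance step
            x, logp = prop, logp_prop
        acc += H(x)                                    # running sum of H_theta along the chain
    return acc / n_iter

# Toy check: posterior N(1, 0.5^2) known up to a constant and H(x) = x, so the exact expectation is 1.
est = mcmc_gradient_estimate(
    log_post_unnorm=lambda x: -0.5 * np.sum((x - 1.0) ** 2) / 0.25,
    H=lambda x: np.asarray(x),
    x0=np.zeros(1),
)
print(est)   # close to [1.]
```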

  11. Motivation 2: relationships in a graph (1/2)
  p nodes in a graph (e.g. p senators from the US Congress); each node takes values in {-1, 1} (e.g. each node codes no/yes in a vote); N snapshots of the graph (e.g. N votes).
  Model: each observation Y^(i) ∈ {-1, 1}^p; i.i.d. observations with distribution
    π_θ(y) ∝ exp( Σ_{i=1}^p θ_i y_i + Σ_{i=1}^{p-1} Σ_{j=i+1}^p θ_ij y_i y_j )
  Statistical analysis:
    estimation of θ, under penalty (sparse graph; regularization needed since N ≪ p²/2)
    classification of the nodes
  → Penalized Maximum Likelihood

  12. Motivation 2: relationships in a graph (2/2)
  Model: each observation Y^(n) ∈ {-1, 1}^p; i.i.d. observations with distribution
    π_θ(y) = (1/Z_θ) exp( Σ_{i=1}^p θ_i y_i + Σ_{i=1}^{p-1} Σ_{j=i+1}^p θ_ij y_i y_j )
  Log-likelihood: with Y := (Y^(1), ..., Y^(N)),
    ℓ(θ) = Σ_{i=1}^p θ_i ( Σ_{n=1}^N Y_i^(n) ) + Σ_{i=1}^{p-1} Σ_{j=i+1}^p θ_ij ( Σ_{n=1}^N Y_i^(n) Y_j^(n) ) - N log Z_θ
         = ⟨θ, S(Y)⟩ - N log Z_θ
         = ⟨Ψ(θ), S(Y)⟩ + Φ(θ)
  Likelihood: not explicit, since
    Z_θ := Σ_{y ∈ {-1,1}^p} exp( Σ_{i=1}^p θ_i y_i + Σ_{i=1}^{p-1} Σ_{j=i+1}^p θ_ij y_i y_j )
  ML: here, the likelihood is concave.
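  To make the combinatorial obstruction concrete, here is a sketch that evaluates Z_θ and ℓ(θ) by enumerating all 2^p configurations; this is only feasible for very small p, which is why the likelihood is "not explicit" in realistic settings. The parameter values and data below are simulated placeholders.

```python
import itertools
import numpy as np

def unnorm_logprob(y, theta_main, theta_pair):
    """sum_i theta_i y_i + sum_{i<j} theta_ij y_i y_j; theta_pair is strictly upper triangular."""
    y = np.asarray(y, dtype=float)
    return float(theta_main @ y + y @ theta_pair @ y)

def log_Z(theta_main, theta_pair):
    p = theta_main.size
    vals = np.array([unnorm_logprob(y, theta_main, theta_pair)
                     for y in itertools.product((-1, 1), repeat=p)])
    m = vals.max()
    return m + np.log(np.exp(vals - m).sum())          # log-sum-exp over the 2^p configurations

def loglik(theta_main, theta_pair, Y):
    """ell(theta) = <theta, S(Y)> - N log Z_theta for observations Y of shape (N, p)."""
    stat = sum(unnorm_logprob(y, theta_main, theta_pair) for y in Y)   # <theta, S(Y)>
    return stat - Y.shape[0] * log_Z(theta_main, theta_pair)

rng = np.random.default_rng(0)
p, N = 8, 50                                           # brute force: only 2^8 = 256 terms here
theta_main = 0.2 * rng.standard_normal(p)
theta_pair = np.triu(0.1 * rng.standard_normal((p, p)), k=1)
Y = rng.choice([-1, 1], size=(N, p))
print(loglik(theta_main, theta_pair, Y))
```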

  13. General case: discrete graphical models
  N independent observations of an undirected graph with p nodes; each node takes values in a finite alphabet X. N i.i.d. observations Y^(i) in X^p with distribution
    y = (y_1, ..., y_p) ↦ π_θ(y) := (1/Z_θ) exp( Σ_{k=1}^p θ_kk B(y_k, y_k) + Σ_{1 ≤ j < k ≤ p} θ_kj B(y_k, y_j) )
                                  = (1/Z_θ) exp( ⟨θ, B̄(y)⟩ )
  where B is a symmetric function and θ is a symmetric p × p matrix.
  The normalizing constant (partition function) Z_θ cannot be computed: it is a sum over |X|^p terms.
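  A short sketch of the statistic B̄(y) and of the inner product ⟨θ, B̄(y)⟩ for a finite alphabet. The slide only requires B to be symmetric; the Potts-type choice B(x, y) = 1{x = y} and the numerical values below are illustrative assumptions.

```python
import numpy as np

def B_bar(y, B):
    """p x p matrix with entries B(y_k, y_j), as in the definition of pi_theta."""
    p = len(y)
    return np.array([[B(y[k], y[j]) for j in range(p)] for k in range(p)])

def log_unnorm(theta, y, B):
    """sum_k theta_kk B(y_k, y_k) + sum_{j<k} theta_kj B(y_k, y_j), i.e. <theta, B_bar(y)>
    restricted to the diagonal and one triangle (theta and B_bar are both symmetric)."""
    return float(np.sum(np.triu(theta * B_bar(y, B))))

B = lambda a, b: float(a == b)                        # symmetric B, Potts-type (assumed)
rng = np.random.default_rng(0)
p = 4
theta = 0.3 * rng.standard_normal((p, p)); theta = (theta + theta.T) / 2
y = rng.choice([0, 1, 2], size=p)                     # finite alphabet X = {0, 1, 2}
print(log_unnorm(theta, y, B))
```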

  14. Markov random field: Likelihood
  Likelihood of the form (the scalar product between matrices is the Frobenius inner product)
    (1/N) log L(θ) = ⟨ θ, (1/N) Σ_{i=1}^N B̄(Y_i) ⟩ - log Z_θ
  The likelihood is intractable.

  15. Markov random field: Gradient of the likelihood
  Gradient of the form
    ∇_θ { (1/N) log L(θ) } = (1/N) Σ_{i=1}^N B̄(Y_i) - ∫_{X^p} B̄(y) π_θ(y) μ(dy)
  with
    π_θ(y) := (1/Z_θ) exp( ⟨θ, B̄(y)⟩ ).
  The gradient of the (log-)likelihood is intractable.
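  The second term is an expectation under π_θ and is typically estimated by MCMC. Below is a minimal sketch specialized to the {-1, 1} model of Motivation 2, where B̄(y) collects the y_i and the products y_i y_j: a single-site Gibbs sampler targets π_θ and the moments are averaged along the chain (no burn-in, for brevity). The parameter values and chain length are illustrative assumptions.

```python
import numpy as np

def gibbs_moments(theta_main, theta_pair, n_sweeps=2_000, seed=0):
    """Estimate E[y] and E[y y^T] under pi_theta by single-site Gibbs sampling;
    theta_pair is strictly upper triangular, as on slide 12."""
    rng = np.random.default_rng(seed)
    p = theta_main.size
    J = theta_pair + theta_pair.T                 # symmetric interaction matrix, zero diagonal
    y = rng.choice([-1.0, 1.0], size=p)
    m1, m2 = np.zeros(p), np.zeros((p, p))
    for _ in range(n_sweeps):
        for k in range(p):
            field = theta_main[k] + J[k] @ y      # local field; J_kk = 0 so there is no self-term
            # conditional law: P(y_k = 1 | y_-k) = 1 / (1 + exp(-2 * field))
            y[k] = 1.0 if rng.uniform() < 1.0 / (1.0 + np.exp(-2.0 * field)) else -1.0
        m1 += y
        m2 += np.outer(y, y)
    return m1 / n_sweeps, m2 / n_sweeps

rng = np.random.default_rng(1)
p = 8
theta_main = 0.2 * rng.standard_normal(p)
theta_pair = np.triu(0.1 * rng.standard_normal((p, p)), k=1)
E_y, E_yy = gibbs_moments(theta_main, theta_pair)
# the (i, j) pairwise component of the gradient is then (1/N) sum_n Y_i^(n) Y_j^(n) - E_yy[i, j]
print(E_y)
```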
