Learning step sizes for unfolded sparse coding
Thomas Moreau, INRIA Saclay
Joint work with Pierre Ablin, Mathurin Massias, Alexandre Gramfort
Electrophysiology: magnetoencephalography (MEG) and electroencephalography (EEG).
Inverse problems
Forward model (Maxwell's equations): the electrical activity z produces the observed signal x through x = Dz.
Inverse problem: recover z = f(x) from x (ill-posed).
Optimization with a regularization R encoding prior knowledge:

    z* ∈ argmin_z (1/2)‖x − Dz‖_2² + R(z)

Example: sparsity with R = λ‖·‖_1.
Other inverse problems: ultrasound imaging, fMRI (compressed sensing), astrophysics.
Some challenges for inverse problems
Evaluation: often there is no ground truth.
• In neuroscience, we cannot access the brain's electrical activity directly.
• How do we evaluate how well it is reconstructed? An open problem in unsupervised learning.
Modeling: how to better account for the signal structure.
• An ℓ_2 reconstruction metric does not account for localization.
• Optimal transport could help in this case?
Computational: solving these problems can be too slow.
• Many problems share the same forward operator D.
• Can we use the structure of the problem? Today's talk topic!
Better step sizes for the Iterative Shrinkage-Thresholding Algorithm (ISTA)
The Lasso
For a fixed design matrix D ∈ R^{n×m} and λ > 0, the Lasso for x ∈ R^n is

    z* = argmin_z F_x(z),   F_x(z) = (1/2)‖x − Dz‖_2² + λ‖z‖_1

where f_x(z) = (1/2)‖x − Dz‖_2² denotes the data-fit term.
This is also known as sparse coding or sparse linear regression. We are interested in the over-complete case where m > n.

Properties
◮ The problem is convex in z but not strongly convex in general.
◮ z = 0 is a solution if and only if λ ≥ λ_max := ‖D^⊤x‖_∞.
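A minimal NumPy sketch of the objective and of the λ_max rule above (the function names lasso_objective and lambda_max are mine, not from the slides):

```python
import numpy as np

def lasso_objective(D, x, z, lmbd):
    """Evaluate F_x(z) = 1/2 ||x - D z||_2^2 + lmbd * ||z||_1."""
    residual = x - D @ z
    return 0.5 * residual @ residual + lmbd * np.abs(z).sum()

def lambda_max(D, x):
    """Smallest lmbd for which z = 0 is a Lasso solution: ||D^T x||_inf."""
    return np.abs(D.T @ x).max()
```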
ISTA: Iterative Shrinkage-Thresholding Algorithm [Daubechies et al. 2004]
f_x is an L-smooth function with L = ‖D‖_2² and ∇f_x(z^(t)) = D^⊤(Dz^(t) − x).
The ℓ_1-norm is proximable, with a separable proximal operator:

    prox_{μ‖·‖_1}(x) = sign(x) max(0, |x| − μ) = ST(x, μ)

We can use the proximal gradient descent algorithm (ISTA):

    z^(t+1) = ST(z^(t) − ρ ∇f_x(z^(t)), ρλ),   with ∇f_x(z^(t)) = D^⊤(Dz^(t) − x)

Here, ρ plays the role of a step size (in (0, 2/L)).
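A minimal NumPy sketch of ISTA with the usual step size 1/L (helper names are mine; this is a plain implementation of the update above, not the authors' code):

```python
import numpy as np

def soft_thresholding(x, mu):
    """ST(x, mu) = sign(x) * max(0, |x| - mu), applied entrywise."""
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

def ista(D, x, lmbd, n_iter=100):
    """Proximal gradient descent on the Lasso with fixed step size 1/L."""
    L = np.linalg.norm(D, ord=2) ** 2      # L = ||D||_2^2 (squared spectral norm)
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)           # gradient of the data-fit term f_x
        z = soft_thresholding(z - grad / L, lmbd / L)
    return z
```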
ISTA: Majorization-Minimization
Taylor expansion of f_x at z^(t):

    F_x(z) = f_x(z^(t)) + ∇f_x(z^(t))^⊤(z − z^(t)) + (1/2)‖D(z − z^(t))‖_2² + λ‖z‖_1
           ≤ f_x(z^(t)) + ∇f_x(z^(t))^⊤(z − z^(t)) + (L/2)‖z − z^(t)‖_2² + λ‖z‖_1

⇒ Replace the Hessian D^⊤D by L·Id.
The majorant is separable and can be minimized in closed form:

    argmin_z (L/2)‖z^(t) − (1/L)∇f_x(z^(t)) − z‖_2² + λ‖z‖_1
        = ST(z^(t) − (1/L)∇f_x(z^(t)), λ/L)
        = prox_{(λ/L)‖·‖_1}(z^(t) − (1/L)∇f_x(z^(t)))
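The closed form comes from the fact that the majorant is coordinate-wise separable; for one coordinate the problem is min_z (1/2)(z − u)² + μ|z| with u = [z^(t) − (1/L)∇f_x(z^(t))]_j and μ = λ/L, and the subgradient optimality condition 0 ∈ (z − u) + μ ∂|z| gives z = u − μ sign(u) if |u| > μ and z = 0 otherwise, i.e. z = ST(u, μ). (This is the standard derivation of the ℓ_1 proximal operator, added here for completeness; it is not spelled out on the original slide.)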
ISTA: Majorization for the data-fit
[Figure: level lines of z^⊤D^⊤Dz and of successive quadratic majorants: L‖z‖², z^⊤A^⊤ΛAz [Moreau and Bruna 2017], and L_S‖z‖² for Supp(z) ⊂ S.]
Oracle ISTA: Majorization-Minimization
For all z such that Supp(z) ⊂ S := Supp(z^(t)),

    F_x(z) ≤ f_x(z^(t)) + ∇f_x(z^(t))^⊤(z − z^(t)) + (L_S/2)‖z − z^(t)‖_2² + λ‖z‖_1

with L_S = ‖D_{·,S}‖_2².
[Figure: the cost function F_x and the two majorants Q_{x,L}(·, z^(t)) and Q_{x,L_S}(·, z^(t)) as a function of the step size; the step 1/L_S is larger than 1/L.]
Better step sizes for ISTA
Oracle ISTA (OISTA):
1. Get the Lipschitz constant L_S associated with the support S = Supp(z^(t)).
2. Compute y^(t+1) as a step of ISTA with step size 1/L_S:

    y^(t+1) = ST(z^(t) − (1/L_S) D^⊤(Dz^(t) − x), λ/L_S)

3. If Supp(y^(t+1)) ⊂ S, accept the update: z^(t+1) = y^(t+1).
4. Else, z^(t+1) is computed with step size 1/L.
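A sketch of this oracle procedure in NumPy, reusing the soft_thresholding helper from the ISTA sketch above (the function name oista is mine; this illustrates the accept/fall-back logic, not the authors' implementation):

```python
import numpy as np

def soft_thresholding(x, mu):
    return np.sign(x) * np.maximum(np.abs(x) - mu, 0.0)

def oista(D, x, lmbd, n_iter=100):
    """Oracle ISTA: use the support-restricted Lipschitz constant when possible."""
    L = np.linalg.norm(D, ord=2) ** 2
    z = np.zeros(D.shape[1])
    for _ in range(n_iter):
        grad = D.T @ (D @ z - x)
        S = np.flatnonzero(z)                            # current support
        L_S = np.linalg.norm(D[:, S], ord=2) ** 2 if S.size else L
        y = soft_thresholding(z - grad / L_S, lmbd / L_S)
        if set(np.flatnonzero(y)).issubset(set(S)):      # support did not grow: accept
            z = y
        else:                                            # fall back to the safe step 1/L
            z = soft_thresholding(z - grad / L, lmbd / L)
    return z
```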
OISTA: Performance
[Figure: suboptimality F_x − F_x* versus the number of iterations for ISTA, FISTA, and OISTA (proposed), from 10⁻⁶ down to 10⁻¹².]
OISTA – Step size
[Figure: the oracle step size 1/L_S used by OISTA along the iterations, compared with the fixed step 1/L; the oracle step stays above 1/L.]
OISTA – Improved convergence rates
Let S* = Supp(z*) and μ* = min ‖Dz‖_2² over ‖z‖_2 = 1 with Supp(z) ⊂ S*.
If μ* > 0, OISTA converges with a linear rate: for t ≥ T*,

    F_x(z^(t)) − F_x(z*) ≤ (1 − μ*/L_{S*})^{t − T*} (F_x(z^(T*)) − F_x(z*)).
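A small NumPy helper to compute the rate factor 1 − μ*/L_{S*} from a Lasso solution (the name oista_rate_factor is mine; it assumes |S*| ≤ n, otherwise μ* = 0 and the bound is vacuous):

```python
import numpy as np

def oista_rate_factor(D, z_star):
    """Linear-rate factor 1 - mu*/L_{S*} from the support of the Lasso solution."""
    S = np.flatnonzero(z_star)
    sv = np.linalg.svd(D[:, S], compute_uv=False)
    mu_star = sv[-1] ** 2          # smallest squared singular value on the support
    L_S = sv[0] ** 2               # largest squared singular value on the support
    return 1.0 - mu_star / L_S
```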
OISTA – Gaussian setting
Acceleration quantification with Marchenko-Pastur.
Entries of D ∈ R^{n×m} are sampled from N(0, 1) and S is sampled uniformly with |S| = k. With m/n → γ and k/m → ζ as k, m, n → +∞,

    L_S / L → ((1 + √(ζγ)) / (1 + √γ))²        (1)

[Figure: empirical ratio L_S/L versus ζ for n = 200, m = 600, matching the limit (1).]
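A quick empirical check of the limit (1) with the same sizes as the slide's figure (a sketch; the seed and the choice k = 150 are mine):

```python
import numpy as np

# Compare the empirical ratio L_S / L with ((1 + sqrt(zeta*gamma)) / (1 + sqrt(gamma)))^2.
rng = np.random.default_rng(0)
n, m, k = 200, 600, 150                      # gamma = m/n = 3, zeta = k/m = 0.25
D = rng.standard_normal((n, m))
S = rng.choice(m, size=k, replace=False)
L = np.linalg.norm(D, ord=2) ** 2
L_S = np.linalg.norm(D[:, S], ord=2) ** 2
gamma, zeta = m / n, k / m
predicted = ((1 + np.sqrt(zeta * gamma)) / (1 + np.sqrt(gamma))) ** 2
print(L_S / L, predicted)                    # the two values should be close (~0.47)
```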
OISTA – Limitation
◮ In practice, OISTA is not usable as is: computing L_S at each iteration is costly.
◮ No precomputation is possible: there is an exponential number of possible supports S.
Using deep learning to approximate OISTA
Solving the Lasso many times
Assume that we want to solve the Lasso for many observations {x_1, ..., x_N} with a fixed forward operator D, i.e. for each x compute

    I_D(x) = argmin_z (1/2)‖x − Dz‖_2² + λ‖z‖_1

The goal is thus not to solve one problem but many.
⇒ Can we leverage the problem's structure?
◮ ISTA: worst-case algorithm; its second-order information is L.
◮ OISTA: adaptive algorithm; its second-order information is L_S (NP-hard).
◮ LISTA: adaptive algorithm; use deep learning to adapt to the second-order information?
ISTA is a neural network
One ISTA step:

    z^(t+1) = ST(z^(t) − (1/L) D^⊤(Dz^(t) − x), λ/L)

Let W_z = I_m − (1/L) D^⊤D and W_x = (1/L) D^⊤. Then

    z^(t+1) = ST(W_z z^(t) + W_x x, λ/L)

[Figure: one step of ISTA drawn as a network layer; iterating it gives a recurrent network (RNN) equivalent to ISTA.]
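The same rewrite in NumPy, to make the affine-map-plus-nonlinearity reading explicit (the function name is mine; in practice W_z and W_x would be precomputed once rather than rebuilt at every call):

```python
import numpy as np

def ista_step_as_layer(D, x, z, lmbd):
    """One ISTA iteration written as an affine map followed by soft-thresholding."""
    L = np.linalg.norm(D, ord=2) ** 2
    W_z = np.eye(D.shape[1]) - D.T @ D / L     # "recurrent" weights
    W_x = D.T / L                              # "input" weights
    pre_activation = W_z @ z + W_x @ x
    return np.sign(pre_activation) * np.maximum(np.abs(pre_activation) - lmbd / L, 0.0)
```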
Learned ISTA [Gregor and Le Cun 2010]
The recurrence relation of ISTA defines an RNN:

    z^(t+1) = ST(W_z z^(t) + W_x x, λ/L)

This RNN can be unfolded as a feed-forward network with per-layer parameters:

    z^(i+1) = ST(W_z^(i) z^(i) + W_x^(i) x, θ^(i))

Let Φ_{Θ^(T)} denote a network with T layers parametrized by Θ^(T). If W_x^(i) = W_x, W_z^(i) = W_z and θ^(i) = λ/L for all layers, then Φ_{Θ^(T)}(x) = z^(T).
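A minimal PyTorch sketch of such an unfolded network, initialized at the ISTA weights so that before training it reproduces T iterations of ISTA (a sketch following Gregor and Le Cun 2010, not the authors' code; D is assumed to be an (n, m) tensor):

```python
import torch
import torch.nn as nn

class LISTA(nn.Module):
    """Unfolded ISTA with T layers; W_x, W_z and thresholds are learned per layer."""

    def __init__(self, D, lmbd, n_layers=10):
        super().__init__()
        n, m = D.shape
        L = torch.linalg.matrix_norm(D, ord=2) ** 2          # L = ||D||_2^2
        self.W_x = nn.ParameterList(
            [nn.Parameter(D.t() / L) for _ in range(n_layers)])
        self.W_z = nn.ParameterList(
            [nn.Parameter(torch.eye(m) - D.t() @ D / L) for _ in range(n_layers)])
        self.theta = nn.ParameterList(
            [nn.Parameter(lmbd / L * torch.ones(1)) for _ in range(n_layers)])

    def forward(self, x):
        # x: (batch, n); z: (batch, m), starting from 0 as in ISTA.
        z = torch.zeros(x.shape[0], self.W_z[0].shape[0])
        for W_x, W_z, theta in zip(self.W_x, self.W_z, self.theta):
            pre = z @ W_z.t() + x @ W_x.t()
            z = torch.sign(pre) * torch.relu(pre.abs() - theta)   # soft-thresholding
        return z
```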
LISTA – Training
Empirical risk minimization: we need a training set of samples {x_1, ..., x_N}, and our goal is to accelerate ISTA on unseen data x ∼ p. The training solves

    Θ̃^(T) ∈ argmin_{Θ^(T)} (1/N) Σ_{i=1}^N L_{x_i}(Φ_{Θ^(T)}(x_i))

for a loss L_x.
⇒ Which loss L_x should we choose?
LISTA – Training losses
Supervised: a ground truth z*(x) is known,
    L_x(z) = (1/2)‖z − z*(x)‖²    → solving the inverse problem.
Semi-supervised: the solution of the Lasso z*(x) is known,
    L_x(z) = (1/2)‖z − z*(x)‖²    → accelerating the resolution of the Lasso.
Unsupervised: there is no ground truth,
    L_x(z) = (1/2)‖x − Dz‖_2² + λ‖z‖_1    → solving the Lasso.
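A training-loop sketch for the unsupervised option, using the Lasso cost itself as the loss on the network output (the function name and hyper-parameters are mine; it assumes the LISTA module sketched earlier and X_train of shape (N, n)):

```python
import torch

def train_lista_unsupervised(model, D, X_train, lmbd, n_epochs=100, lr=1e-3):
    """Full-batch training of the unfolded network with L_x(z) = 1/2||x - Dz||^2 + lmbd||z||_1."""
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(n_epochs):
        optimizer.zero_grad()
        z = model(X_train)
        residual = X_train - z @ D.t()
        loss = (0.5 * (residual ** 2).sum(dim=1)
                + lmbd * z.abs().sum(dim=1)).mean()
        loss.backward()
        optimizer.step()
    return model
```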