On the convergence rate of iterative ℓ1 minimization algorithms
Ignace Loris, Applied Inverse Problems, Vienna, 21/7/2009 (PowerPoint presentation)


  1. On the convergence rate of iterative ℓ1 minimization algorithms. Ignace Loris, Applied Inverse Problems, Vienna, 21/7/2009.

  2. Theme: comparison of iterative algorithms for sparse recovery (sparse models, ℓ1 penalties, iterative algorithms); influence of noise and ill-conditioning on effectiveness; influence of noise and ill-conditioning on speed.

  3. Overview: sparsity through an ℓ1 penalty, illustrated on a toy example; when does it work? versus how to make it work?; speed comparison of iterative algorithms; speed comparison of 'continuation' algorithms.

  4. Setting. Data vector: y (noisy). Model vector: x (coefficients in some basis). Linear relationship: Kx = y. Problems: insufficient data, inconsistent data, ill-conditioning of K. Solution: minimize a penalized least-squares functional: $x_{\mathrm{rec}} = \arg\min_x \|Kx - y\|^2 + \mathrm{penalty}(x)$.

  5. Old and new penalties. Traditional: ℓ2-norm penalties, $x_{\mathrm{rec}} = \arg\min_x \|Kx - y\|^2 + \lambda \|x\|^2$ (advantage: linear equations). Sometimes better reconstructions are obtained by using other prior information. Suppose the desired model x is sparse (few nonzero components); in that case an ℓ1 penalty is suitable (Donoho et al. 2006, etc.): $x_{\mathrm{rec}} = \arg\min_x \|Kx - y\|^2 + 2\lambda \|x\|_1$.
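A quick way to make the "linear equations" advantage concrete: the ℓ2-penalized minimizer solves the normal equations $(K^T K + \lambda I)\,x = K^T y$. The numpy sketch below is only illustrative; the matrix sizes, sparsity pattern and λ value are assumptions, not taken from the slides.

```python
import numpy as np

def ridge_solution(K, y, lam):
    """Minimize ||Kx - y||^2 + lam*||x||^2 by solving the normal
    equations (K^T K + lam*I) x = K^T y (a plain linear system)."""
    N = K.shape[1]
    return np.linalg.solve(K.T @ K + lam * np.eye(N), K.T @ y)

# Illustrative toy problem (sizes and values are arbitrary choices):
rng = np.random.default_rng(0)
K = rng.standard_normal((30, 100))
x_input = np.zeros(100)
x_input[[3, 17, 42]] = [0.8, 0.5, 0.3]            # a sparse model
y = K @ x_input + 0.01 * rng.standard_normal(30)  # noisy data
x_l2 = ridge_solution(K, y, lam=0.1)              # generally not sparse
```

The ℓ1-penalized problem has no such closed-form linear solution, which is why the rest of the talk is about direct and iterative algorithms for it.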

  6. Question 1: when does it work? Experiment: input: a sparse $x_{\mathrm{input}}$; data: $y = Kx_{\mathrm{input}} + n$; ℓ1 reconstruction: $x_{\mathrm{rec}}$. Compare $x_{\mathrm{rec}}$ to $x_{\mathrm{input}}$. When does this work?

  7. ℓ1 penalty for sparse recovery: toy example. [Figure: the matrix K and the sparse input $x_{\mathrm{input}}$.] Synthetic noisy data: $y = Kx_{\mathrm{input}} + n$.

  8. ℓ1 penalty for sparse recovery: toy example. [Figure: the synthetic noisy data $y = Kx_{\mathrm{input}} + n$.] Two reconstructions are compared: $\min_x \|Kx - y\|^2 + \lambda \|x\|^2$ and $\min_x \|Kx - y\|^2 + 2\lambda \|x\|_1$.

  9. ℓ1 penalty for sparse recovery: toy example. [Figure: the ℓ2-penalized and ℓ1-penalized reconstructions obtained from the noisy data $y = Kx_{\mathrm{input}} + n$.]

  10. ℓ1 penalty for sparse recovery: toy example. [Figure: summary panel showing the sparse input $x_{\mathrm{input}}$, the data $y = Kx_{\mathrm{input}} + n$, and the two reconstructions $\min_x \|Kx - y\|^2 + \lambda \|x\|^2$ and $\min_x \|Kx - y\|^2 + 2\lambda \|x\|_1$ side by side.]

  11. Success rate of noiseless sparse recovery with ℓ1. No noise, K = random matrix (elements drawn from a Gaussian distribution). Prepare data: $y = Kx_{\mathrm{input}}$. Reconstruct: $x_{\mathrm{rec}} = \arg\min_{Kx = y} \|x\|_1$.

  12. Success rate of noiseless sparse recovery with ℓ1. No noise, K = random matrix (elements drawn from a Gaussian distribution). Prepare data: $y = Kx_{\mathrm{input}}$. Reconstruct: $x_{\mathrm{rec}} = \arg\min_{Kx = y} \|x\|_1$. [Figure: phase diagram with k/n on the vertical axis and n/N on the horizontal axis; the gray level gives the probability of $x_{\mathrm{rec}} = x_{\mathrm{input}}$.] Here N = number of variables, n = number of data, k = number of nonzero components in the model. White: success rate of perfect recovery = 100% ('compressed sensing'). Bruckstein et al. (2009).
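For reference, the noiseless reconstruction $\arg\min_{Kx=y}\|x\|_1$ (basis pursuit) is a linear program after the split $x = u - v$ with $u, v \geq 0$. The sketch below shows one way to run a single trial of such a success-rate experiment with SciPy; the sizes, sparsity level and tolerance are illustrative assumptions, and this is not the code behind the phase diagram.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(K, y):
    """min ||x||_1 subject to Kx = y, as the LP: minimize sum(u) + sum(v)
    subject to [K, -K][u; v] = y and u, v >= 0, with x = u - v."""
    n, N = K.shape
    res = linprog(c=np.ones(2 * N),
                  A_eq=np.hstack([K, -K]), b_eq=y,
                  bounds=(0, None), method="highs")
    return res.x[:N] - res.x[N:]

# One trial of the recovery experiment (N, n, k chosen arbitrarily):
rng = np.random.default_rng(1)
N, n, k = 100, 40, 5
K = rng.standard_normal((n, N))
x_input = np.zeros(N)
x_input[rng.choice(N, size=k, replace=False)] = rng.standard_normal(k)
y = K @ x_input
x_rec = basis_pursuit(K, y)
perfect = np.linalg.norm(x_rec - x_input) <= 1e-6 * np.linalg.norm(x_input)
```

Repeating such trials over many random draws for each (n/N, k/n) pair gives the success probabilities shown in the phase diagram.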

  13. Reconstruction error for noisy sparse recovery with ℓ1. 2% noise, K = random matrix (elements drawn from a Gaussian distribution). Prepare data: $y = Kx_{\mathrm{input}} + n$. Reconstruct: $x_{\mathrm{rec}} = \arg\min_x \|Kx - y\|^2 + 2\lambda \|x\|_1$ with λ chosen such that $\|Kx_{\mathrm{rec}} - y\| = \|n\|$.

  14. Reconstruction error for noisy sparse recovery with ℓ1. 2% noise, K = random matrix (elements drawn from a Gaussian distribution). Prepare data: $y = Kx_{\mathrm{input}} + n$. Reconstruct: $x_{\mathrm{rec}} = \arg\min_x \|Kx - y\|^2 + 2\lambda \|x\|_1$ with $\|Kx_{\mathrm{rec}} - y\| = \|n\|$. [Figure: phase diagram with k/n versus n/N; the gray level gives the average error $\|x_{\mathrm{rec}} - x_{\mathrm{input}}\| / \|x_{\mathrm{input}}\|$; the marked contour corresponds to a reconstruction error of 2%.] Here N = number of variables, n = number of data, k = number of nonzero components in $x_{\mathrm{input}}$. Loris and Verhoeven (2009).

  15. Reconstruction error for noisy sparse recovery with ℓ1. 2% noise, K = submatrix of a square matrix with condition number $10^4$. Prepare data: $y = Kx_{\mathrm{input}} + n$. Reconstruct: $x_{\mathrm{rec}} = \arg\min_x \|Kx - y\|^2 + 2\lambda \|x\|_1$ with $\|Kx_{\mathrm{rec}} - y\| = \|n\|$.

  16. Reconstruction error for noisy sparse recovery with ℓ1. 2% noise, K = submatrix of a square matrix with condition number $10^4$. Prepare data: $y = Kx_{\mathrm{input}} + n$. Reconstruct: $x_{\mathrm{rec}} = \arg\min_x \|Kx - y\|^2 + 2\lambda \|x\|_1$ with $\|Kx_{\mathrm{rec}} - y\| = \|n\|$. [Figure: phase diagram with k/n versus n/N; the gray level gives the average error $\|x_{\mathrm{rec}} - x_{\mathrm{input}}\| / \|x_{\mathrm{input}}\|$; the marked contour corresponds to a reconstruction error of 2%.] Here N = number of variables, n = number of data, k = number of nonzero components in $x_{\mathrm{input}}$. Loris and Verhoeven (2009).
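In these noisy experiments λ is not chosen freely: it is tuned so that the residual matches the noise level, $\|Kx_{\mathrm{rec}} - y\| = \|n\|$ (a discrepancy principle). Below is a hedged sketch of one way to do that by bisection; `solve_l1` is a hypothetical placeholder for any solver of the ℓ1-penalized problem (for instance the IST or FISTA iterations shown later), and the bisection count is an arbitrary choice.

```python
import numpy as np

def discrepancy_lambda(K, y, noise_norm, solve_l1, n_bisect=30):
    """Bisection on lambda so that ||K x_rec(lambda) - y|| ~ noise_norm.
    Relies on the residual norm growing with lambda.
    `solve_l1(K, y, lam)` is a placeholder for any solver of
    min ||Kx - y||^2 + 2*lam*||x||_1."""
    lo, hi = 0.0, np.max(np.abs(K.T @ y))   # above hi, x_rec(lambda) = 0
    for _ in range(n_bisect):
        lam = 0.5 * (lo + hi)
        x = solve_l1(K, y, lam)
        if np.linalg.norm(K @ x - y) > noise_norm:
            hi = lam     # residual too large: lambda must decrease
        else:
            lo = lam     # residual too small: lambda may increase
    lam = 0.5 * (lo + hi)
    return lam, solve_l1(K, y, lam)
```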

  17. Important consequence for the evaluation of speed-up: success of sparse recovery by ℓ1 minimisation depends on noise and condition number. Our goal here: evaluate the convergence rate of iterative ℓ1 minimization algorithms, i.e. does $x^{(n)} \to \bar{x} \equiv \arg\min_x \|Kx - y\|^2 + 2\lambda \|x\|_1$? Hence: compare $x^{(n)}$ with $\bar{x}$, not with $x_{\mathrm{input}}$! Catch-22? (The reference minimizer $\bar{x}$ must itself be computed first.)

  18. Direct method for finding $\bar{x}(\lambda)$. Minimization of $F(x) = \|Kx - y\|^2 + 2\lambda \|x\|_1$. Nonlinear variational equations: $(K^T(y - K\bar{x}))_i = \lambda\,\mathrm{sgn}(\bar{x}_i)$ if $\bar{x}_i \neq 0$, and $|(K^T(y - K\bar{x}))_i| \leq \lambda$ if $\bar{x}_i = 0$. The variational equations are piecewise linear, hence $\bar{x}(\lambda)$ is piecewise linear in λ and a direct method exists (Efron et al., 2004; Osborne et al., 2000).
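These variational (optimality) equations also give a practical verification test for any of the algorithms discussed later: a vector is the minimizer of F exactly when every component satisfies one of the two conditions. A small numpy check (the tolerance is an arbitrary choice):

```python
import numpy as np

def satisfies_variational_eqs(K, y, x, lam, tol=1e-8):
    """Check the optimality conditions of F(x) = ||Kx-y||^2 + 2*lam*||x||_1:
      (K^T (y - Kx))_i = lam * sgn(x_i)   for components with x_i != 0,
      |(K^T (y - Kx))_i| <= lam           for components with x_i == 0."""
    g = K.T @ (y - K @ x)
    nz = x != 0
    on_support = np.all(np.abs(g[nz] - lam * np.sign(x[nz])) <= tol)
    off_support = np.all(np.abs(g[~nz]) <= lam + tol)
    return on_support and off_support
```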

  19. Direct method for finding $\bar{x}(\lambda)$. If $\lambda \geq \lambda_{\max} \equiv \max_i |(K^T y)_i|$, then $\bar{x}(\lambda) = 0$. $\bar{x}(\lambda > 0)$ is found by letting λ decrease from $\lambda_{\max}$; a 'small' linear system has to be solved at every step. Continue until the desired solution $\bar{x}(\lambda_n)$ is reached (e.g. when $\|K\bar{x} - y\|^2 = \|n\|^2$). Result: $\bar{x}(\lambda)$ for $\lambda_{\max} \geq \lambda \geq \lambda_{\min}$. Time complexity: at first linear, later cubic. [Figure: computation time (s) versus the number of nonzero components of $\bar{x}$.] Alternative: iterative methods.

  20. Iterative soft-thresholding (IST): $x^{(0)} = 0$, $x^{(n+1)} = T(x^{(n)})$ with $T(x) \equiv S_\lambda\!\left(x + K^T(y - Kx)\right)$ and $S_\lambda$ = component-wise soft-thresholding. Advantages: (1) very simple; (2) $F(x^{(n)}) - F(\bar{x}) \leq \|x^{(0)} - \bar{x}\|^2 / (2n)$ for all n > 0.
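A minimal sketch of the IST iteration above. It assumes K has been rescaled so that $\|K\| \leq 1$ (the usual condition for this unit-step thresholded Landweber iteration to converge); that normalization, and the fixed iteration count, are assumptions rather than details from the slide.

```python
import numpy as np

def soft_threshold(z, lam):
    """Component-wise soft-thresholding S_lambda."""
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ist(K, y, lam, n_iter=1000):
    """Iterative soft-thresholding for min ||Kx - y||^2 + 2*lam*||x||_1.
    Assumes ||K|| <= 1 (rescale K and y beforehand if necessary)."""
    x = np.zeros(K.shape[1])                            # x^(0) = 0
    for _ in range(n_iter):
        x = soft_threshold(x + K.T @ (y - K @ x), lam)  # x^(n+1) = T(x^(n))
    return x
```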

  21. Iterative soft-thresholding. IST is not very fast: [Figure: relative error $\|x^{(n)} - \bar{x}\| / \|\bar{x}\|$ versus computation time over roughly 20 hours.] Real inversions involve millions of variables, so a speed-up is needed.

  22. Fast Iterative Soft-Thresholding Algorithm, FISTA (Beck and Teboulle, 2008): $x^{(n+1)} = T\!\left(x^{(n)} + \beta_n (x^{(n)} - x^{(n-1)})\right)$ with the same $T(x) \equiv S_\lambda\!\left(x + K^T(y - Kx)\right)$, and where $(\beta_n)_n$ is a fixed sequence of numbers. [Figure: the sequence $\beta_n$.] Advantages: (1) very simple; (2) $F(x^{(n)}) - F(\bar{x}) \leq 2\|x^{(0)} - \bar{x}\|^2 / n^2$ for all n > 0. Goes back to (Nesterov, 1983).
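A sketch of the accelerated iteration. The slide only states that $(\beta_n)_n$ is a fixed sequence; the particular choice below, $\beta_n = (t_n - 1)/t_{n+1}$ with $t_{n+1} = (1 + \sqrt{1 + 4t_n^2})/2$, is the standard Beck and Teboulle choice and is therefore an assumption here. As in the IST sketch, $\|K\| \leq 1$ is assumed.

```python
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def fista(K, y, lam, n_iter=1000):
    """FISTA for min ||Kx - y||^2 + 2*lam*||x||_1 (assumes ||K|| <= 1).
    Applies the same map T(x) = S_lam(x + K^T(y - Kx)) as IST, but at the
    extrapolated point x^(n) + beta_n (x^(n) - x^(n-1))."""
    x_prev = np.zeros(K.shape[1])
    x = np.zeros(K.shape[1])
    t = 1.0
    for _ in range(n_iter):
        t_next = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
        beta = (t - 1.0) / t_next                 # fixed momentum sequence
        z = x + beta * (x - x_prev)
        x_prev, x = x, soft_threshold(z + K.T @ (y - K @ z), lam)
        t = t_next
    return x
```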

  23. Other iterative methods. Projected steepest descent (Daubechies et al. 2008): $x^{(n+1)} = P_R\!\left(x^{(n)} + \beta^{(n)} r^{(n)}\right)$ with $r^{(n)} = K^T(y - Kx^{(n)})$ and $\beta^{(n)} = \|r^{(n)}\|^2 / \|Kr^{(n)}\|^2$; $P_R(\cdot)$ is the orthogonal projection onto an ℓ1 ball of radius R. The 'GPSR' algorithm (Figueiredo et al., 2007) uses auxiliary variables $u, v \geq 0$ with $x = u - v$. The 'ℓ1-ls' algorithm (Kim et al., 2007) is an interior point method using preconditioned conjugate gradient substeps. SpaRSA (Wright et al., 2009), ... How to choose? How to compare?
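The projection $P_R$ onto an ℓ1 ball has a closed form once the entries are sorted by magnitude: it is a soft-thresholding whose level is determined by the radius R. The sketch below uses that standard sort-based projection (not necessarily the implementation of Daubechies et al.) together with the steepest-descent step length from the slide; the iteration count is an arbitrary choice.

```python
import numpy as np

def project_l1_ball(x, R):
    """Orthogonal projection of x onto {z : ||z||_1 <= R} (sort-based)."""
    if np.sum(np.abs(x)) <= R:
        return x.copy()
    u = np.sort(np.abs(x))[::-1]                        # magnitudes, descending
    css = np.cumsum(u)
    j = np.arange(1, len(u) + 1)
    rho = np.nonzero(u - (css - R) / j > 0)[0][-1]      # last active index
    theta = (css[rho] - R) / (rho + 1)                  # threshold level
    return np.sign(x) * np.maximum(np.abs(x) - theta, 0.0)

def projected_steepest_descent(K, y, R, n_iter=500):
    """x^(n+1) = P_R(x^(n) + beta^(n) r^(n)) with r^(n) = K^T(y - K x^(n))
    and beta^(n) = ||r^(n)||^2 / ||K r^(n)||^2."""
    x = np.zeros(K.shape[1])
    for _ in range(n_iter):
        r = K.T @ (y - K @ x)
        if not np.any(r):                               # gradient is zero: stop
            break
        beta = (r @ r) / np.linalg.norm(K @ r) ** 2
        x = project_l1_ball(x + beta * r, R)
    return x
```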

  24. How to compare? Evaluation of speed-up. Convergence: $\|x^{(n)} - \bar{x}\| \to 0$, but how fast? [Figure: relative error versus computation time for several algorithms (a)-(e), on time scales of minutes to hours.] This only gives an idea for one value of the penalization parameter λ; prefer a range of λ: $\lambda_{\max} \geq \lambda \geq \lambda_{\min}$. In one picture: (1) the error $\|x^{(n)} - \bar{x}\| / \|\bar{x}\|$, (2) the penalty λ, (3) the computer time needed t. How?
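One way to generate the data behind such an 'approximation isochrone' picture: for each λ on a logarithmic grid below $\lambda_{\max}$, run the iterative algorithm for a fixed wall-clock budget and record the relative error against a reference minimizer $\bar{x}(\lambda)$. In the sketch below the reference is simply a much longer run of the same iteration (in the talk it comes from the direct method), IST is used as the example algorithm, and all grid sizes and time budgets are illustrative assumptions.

```python
import time
import numpy as np

def soft_threshold(z, lam):
    return np.sign(z) * np.maximum(np.abs(z) - lam, 0.0)

def ist_for(K, y, lam, seconds):
    """Run IST (assuming ||K|| <= 1) for a fixed wall-clock time budget."""
    x = np.zeros(K.shape[1])
    t0 = time.perf_counter()
    while time.perf_counter() - t0 < seconds:
        x = soft_threshold(x + K.T @ (y - K @ x), lam)
    return x

def isochrone(K, y, seconds, n_lambda=14, ref_seconds=600.0):
    """Relative error ||x - xbar|| / ||xbar|| as a function of
    log2(lambda_max / lambda), after a fixed computation time `seconds`.
    xbar is approximated here by a much longer run of the same iteration."""
    lam_max = np.max(np.abs(K.T @ y))
    curve = []
    for j in range(1, n_lambda + 1):          # skip j = 0, where xbar = 0
        lam = lam_max / 2.0 ** j
        xbar = ist_for(K, y, lam, ref_seconds)
        x = ist_for(K, y, lam, seconds)
        curve.append((j, np.linalg.norm(x - xbar) / np.linalg.norm(xbar)))
    return curve
```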

  25. Approximation isochrones: IST. [Figure: relative error $\|x^{(n)} - \bar{x}\| / \|\bar{x}\|$ versus $\log_2(\lambda_{\max}/\lambda)$ from 0 to about 14 (large λ, more sparse, on the left; small λ, less sparse, on the right), with one curve per fixed computation time (1 m to 10 m); the top axis shows the size of $\mathrm{supp}\,\bar{x}(\lambda)$: 20, 100, 200.]

  26. Approximation isochrones: FISTA. [Figure: same axes as the previous slide, relative error $\|x^{(n)} - \bar{x}\| / \|\bar{x}\|$ versus $\log_2(\lambda_{\max}/\lambda)$, with one curve per fixed computation time; top axis: size of $\mathrm{supp}\,\bar{x}(\lambda)$: 20, 100, 200.]
