A Stochastic Convergence Analysis for Tikhonov-Regularization with Sparsity Constraints

Daniel Gerth, Ronny Ramlau
Sparse Tomo Days, Lyngby, Denmark, 28.03.2014
Doctoral Program "Computational Mathematics: Numerical Analysis and Symbolic Computation"
Overview

- Introduction
- Bayesian approach
- Convergence theorem
- Convergence rates
- Numerical examples
Introduction

We study the solution of the linear ill-posed problem Ax = y with A ∈ L(X, Y), where X and Y are Hilbert spaces:
- we seek solutions x which are sparse w.r.t. a given ONB,
- the observed data are assumed to be noisy.

Basic deterministic model:

    ||Ax − y^δ||² + α̂ Φ_{w,p}(x) → min_x    (1)

with penalty Φ_{w,p}(x) = Σ_{λ∈Λ} w_λ |⟨x, ψ_λ⟩|^p for an ONB {ψ_λ}.
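To make the deterministic model concrete, here is a minimal sketch of iterative soft thresholding for the special case p = 1 with constant weights w_λ ≡ 1, where (1) reduces to ||Ax − y^δ||² + α̂||x||₁ over the coefficient vector. The names ista and soft_threshold, and treating A as a plain matrix, are illustrative assumptions, not notation from the talk.

```python
import numpy as np

def soft_threshold(z, t):
    """Componentwise soft shrinkage: the proximal map of t * ||.||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def ista(A, y, alpha_hat, n_iter=500):
    """Minimize ||Ax - y||^2 + alpha_hat * ||x||_1 by iterative soft
    thresholding (the case p = 1, w_lambda = 1 of the penalty Phi_{w,p};
    x holds the coefficients <x, psi_lambda>)."""
    L = np.linalg.norm(A, 2) ** 2           # ||A||^2; step size is 1/L
    x = np.zeros(A.shape[1])
    for _ in range(n_iter):
        residual_grad = A.T @ (A @ x - y)   # half the gradient of ||Ax - y||^2
        x = soft_threshold(x - residual_grad / L, alpha_hat / (2.0 * L))
    return x
```

For 1 < p ≤ 2 the shrinkage step would be replaced by the proximal map of |·|^p; the p = 1 case is shown because its proximal map is explicit.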
Noise modelling: two different approaches

                 deterministic                        stochastic
  error model    worst case error: ||y^δ − y|| ≤ δ    stochastic information: e.g. y^σ ∼ N(y, σ²), E(||y^σ − y||) = f(σ), ...
  analysis       "easy"                               "hard"
  algorithms     "fast"                               "slow"
  parameters     δ hard to get                        σ easy to get

We want to combine the advantages and find links between both branches.

Question: Can we prove convergence (rates) for sparsity regularization if we use an explicit stochastic noise model instead of the worst case error?
Stochastic noise model

The model is based on discretization (computation requires discretization anyway), realized via projections:
- P_m : Y → R^m, y ↦ y, e.g. point evaluation,
- T_n : X → R^n, x = T_n x = {⟨x, ψ_i⟩}_{i=1,...,n}, where {ψ_i}_{i=1}^∞ is an ONB in X.

Each component of y carries stochastic noise: y^σ = y + ε with ε ∼ N(0, σ²I_m).

Define A := P_m A T_n^* (keeping the symbol A for the discretized operator); then we want to find x such that

    Ax = y^σ.    (2)
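As a small illustration of the discretized model (2), the sketch below draws noisy data y^σ = Ax + ε with ε ∼ N(0, σ²I_m). The random matrix standing in for P_m A T_n^* and the particular sparse coefficient vector are assumptions made purely for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical discretization: A stands in for the m x n matrix
# P_m A T_n^*, x_true for the coefficient vector {<x, psi_i>}.
m, n, sigma = 100, 200, 0.05
A = rng.standard_normal((m, n)) / np.sqrt(m)
x_true = np.zeros(n)
x_true[rng.choice(n, size=5, replace=False)] = 1.0   # sparse coefficients

# Each data component carries i.i.d. Gaussian noise of variance sigma^2:
y_sigma = A @ x_true + sigma * rng.standard_normal(m)
```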
Bayesian approach
We use Bayes' formula to characterize the solution. In this framework, every quantity is treated as a random variable in a complete probability space (Ω, F, P):

    π_post(x | y^σ) = π_ε(y^σ | x) π_pr(x) / π_{y^σ}(y^σ)

- π_post(x | y^σ): posterior density
- π_ε(y^σ | x): likelihood function
- π_pr(x): prior distribution
- π_{y^σ}(y^σ): data distribution (irrelevant)

Gaussian error model: π_ε ∝ exp(−||Ax − y^σ||² / (2σ²)). Now we need a prior.
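A minimal sketch of how this decomposition looks numerically: the unnormalized log-posterior is the Gaussian log-likelihood plus a log-prior, with the data distribution dropped as a constant. The function name and the generic log_prior callable are assumptions; the concrete prior is introduced on the next slides.

```python
import numpy as np

def log_posterior(x, A, y_sigma, sigma, log_prior):
    """Unnormalized log-posterior: log-likelihood + log-prior.

    The Gaussian error model gives
    log pi_eps(y_sigma | x) = -||Ax - y_sigma||^2 / (2 sigma^2) + const;
    the data distribution pi_{y_sigma} only shifts this by a constant.
    """
    log_likelihood = -np.sum((A @ x - y_sigma) ** 2) / (2.0 * sigma ** 2)
    return log_likelihood + log_prior(x)
```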
Besov spaces

We are looking for sparse reconstructions w.r.t. a basis in X. Our choice: a Besov-space B^s_{p,p}(R^d) prior. Reasons:
- "easy" characterization via the coefficients of a wavelet expansion,
- sparsity-promoting properties are known, with a connection to TV regularization,
- discretization invariance (Lassas, Saksman, Siltanen '09), avoiding the following phenomena:
  - solutions diverge as m → ∞,
  - solutions diverge as n → ∞,
  - the representation of a-priori knowledge is incompatible with discretization (the case, e.g., for a TV prior).
We consider a wavelet basis suitable for multiresolution analysis. Let {ψ_λ : λ ∈ Λ} denote the set of all wavelets ψ, also including the scaling functions, where Λ is an appropriate (possibly infinite) index set. Write |λ| = j for the scale of ψ_λ. Then x ∈ B^s_{p,p}(R^d) ⊂ L²(R^d), s < s̃, if

    ||x||_{B^s_{p,p}(R^d)} := ( Σ_{λ∈Λ} w_λ |⟨x, ψ_λ⟩|^p )^{1/p} < ∞,  w_λ := 2^{ςp|λ|},

with ς = s + d(1/2 − 1/p) ≥ 0. We focus on 1 ≤ p ≤ 2.
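A short sketch of evaluating this weighted-coefficient norm, assuming the wavelet coefficients and their scales |λ| are already available as arrays; the function name and array layout are illustrative choices, not part of the talk.

```python
import numpy as np

def besov_norm(coeffs, levels, s, p, d=1):
    """B^s_{p,p} norm from wavelet coefficients.

    coeffs : coefficients <x, psi_lambda>
    levels : the scale |lambda| = j of each coefficient
    Uses weights w_lambda = 2^{varsigma * p * |lambda|} with
    varsigma = s + d*(1/2 - 1/p).
    """
    varsigma = s + d * (0.5 - 1.0 / p)
    weights = 2.0 ** (varsigma * p * np.asarray(levels))
    return np.sum(weights * np.abs(coeffs) ** p) ** (1.0 / p)
```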
Besov-space random variables

Definition (adapted from Lassas/Saksman/Siltanen, 2009)
Let 1 ≤ p < ∞ and s ∈ R. Let X be the random function

    X(t) = Σ_{λ∈Λ} 2^{−ς|λ|} X^α_λ ψ_λ(t),  t ∈ R^d,

where the coefficients (X^α_λ)_{λ∈Λ} are independent identically distributed real-valued random variables with probability density function

    p_{X^α_λ}(τ) = c^α_p exp(−(α/2)|τ|^p),  c^α_p = p (α/2)^{1/p} / (2Γ(1/p)),  τ ∈ R.

Then we say X is distributed according to a B^s_{p,p}-prior,

    π_X ∝ exp(−(α/2) ||X||^p_{B^s_{p,p}(R^d)}).
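Sampling from this prior reduces to drawing i.i.d. coefficients with density c^α_p exp(−(α/2)|τ|^p). One standard trick, used in the sketch below (my own assumption, not a method stated in the talk), is that |X|^p is then Gamma(1/p, scale 2/α)-distributed, so a signed power of a Gamma draw has the desired law.

```python
import numpy as np

def sample_besov_coefficients(n, alpha, p, rng=None):
    """Draw n i.i.d. samples X_lambda^alpha with density
    c_p^alpha * exp(-(alpha/2) |tau|^p).

    If G ~ Gamma(shape=1/p, scale=2/alpha), then |X|^p has the law of G,
    so X = S * G^{1/p} with an independent random sign S works.
    """
    rng = rng or np.random.default_rng()
    g = rng.gamma(shape=1.0 / p, scale=2.0 / alpha, size=n)
    signs = rng.choice([-1.0, 1.0], size=n)
    return signs * g ** (1.0 / p)
```

Scaling each coefficient by 2^{−ς|λ|} and attaching it to ψ_λ then gives a realization of the random function X(t).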
"Problem": P(X ∈ B^s_{p,p}(R^d)) = 0.

Theorem (adapted from Lassas/Saksman/Siltanen, 2009)
Let X be as before, 2 < α < ∞, and take r ∈ R. Then the following three conditions are equivalent:
(i) ||X||_{B^r_{p,p}(R^d)} < ∞ almost surely,
(ii) E( exp(||X||^p_{B^r_{p,p}(R^d)}) ) < ∞,
(iii) r < s − d/p.

This is the same result as in [LSS 2009], but here R^d is considered instead of T^d.
How to avoid this phenomenon?

"Finite model" (MI): consider the discretization levels m and n fixed, with finite index set Λ_n. Then

    X_n(t) := Σ_{λ∈Λ_n} 2^{−ς|λ|} X^α_λ ψ_λ(t)  ⇒  ||X_n||^p_{B^s_{p,p}(R^d)} = Σ_{λ∈Λ_n} |X^α_λ|^p < ∞

and

    P(||X_n||_{B^s_{p,p}(R^d)} > ϱ) = Γ(n/p, αϱ^p/2) / Γ(n/p) ≤ 2n / (αp ϱ^p),

with Γ(·,·) the upper incomplete gamma function.

"Infinite model" (MII): define X(t) in B^r_{p,p}(R^d) with s < r − d/p. Then

    E(||X||_{B^s_{p,p}(R^d)}) = ( (2/(αp)) (c_1 Σ_{j=0}^∞ 2^{−j((r−s)p−d)} + c_2) )^{1/p} < ∞

and

    P(||X||_{B^s_{p,p}(R^d)} > ϱ) ≤ (1/ϱ) E(||X||_{B^s_{p,p}(R^d)}).
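The finite-model tail formula can be sanity-checked by Monte Carlo: since |X^α_λ|^p ∼ Gamma(1/p, scale 2/α), the sum Σ_{λ∈Λ_n} |X^α_λ|^p is Gamma(n/p, scale 2/α), so the tail is a regularized incomplete gamma ratio. A hedged sketch, with arbitrary parameter values chosen only for illustration; the last quantity is the Markov-type bound as reconstructed above.

```python
import numpy as np
from scipy.special import gammaincc   # regularized upper incomplete gamma

rng = np.random.default_rng(1)
n, alpha, p, rho = 64, 3.0, 1.5, 12.0

# ||X_n||^p = sum of n i.i.d. Gamma(1/p, scale 2/alpha) variables,
# i.e. a Gamma(n/p, scale 2/alpha) variable.
samples = rng.gamma(1.0 / p, 2.0 / alpha, size=(100_000, n)).sum(axis=1)

empirical = np.mean(samples > rho ** p)           # P(||X_n|| > rho), MC
exact = gammaincc(n / p, alpha * rho ** p / 2.0)  # incomplete gamma ratio
markov = 2.0 * n / (alpha * p * rho ** p)         # Markov-type upper bound
print(empirical, exact, markov)
```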
Recall

    π_post(x | y^σ) = π_ε(y^σ | x) π_pr(x) / π_{y^σ}(y^σ),

with π_ε(y^σ | x) the Gaussian noise model and π_pr(x) the Besov-space prior, so

    π_post(x | y^σ) ∝ exp(−||Ax − y^σ||² / (2σ²)) · exp(−(α/2) ||x||^p_{B^s_{p,p}(R^d)}).

We are interested in the maximum a-posteriori (MAP) solution

    x^map_α = argmax_{x∈R^n} π_post(x | y^σ),

or equivalently

    x^map_α = argmin_{x∈R^n} ||Ax − y^σ||² + ασ² ||x||^p_{B^s_{p,p}(R^d)},    (3)

where ασ² plays the role of α̂: this is the same functional as in the deterministic case, but we only know E(||y − y^σ||) = f(σ).
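For p = 1 the MAP problem (3) coincides with the deterministic functional (1) at the effective parameter α̂ = ασ², so, under the same illustrative assumptions, the earlier ista sketch applies directly; A, y_sigma and ista refer to the hypothetical snippets above.

```python
# MAP estimate for p = 1 via the earlier ista sketch: the stochastic
# model only changes the regularization parameter to alpha * sigma**2.
alpha, sigma = 2.0, 0.05
x_map = ista(A, y_sigma, alpha_hat=alpha * sigma ** 2)
```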
The stochastic setting requires a different measure of convergence; we use the Ky Fan metric.

Definition
Let x_1 and x_2 be random variables in a probability space (Ω, F, P) with values in a metric space (χ, d_χ). The distance between x_1 and x_2 in the Ky Fan metric is defined as

    ρ_K(x_1, x_2) := inf{ ε > 0 : P( d_χ(x_1(ω), x_2(ω)) > ε ) < ε }.

- It allows the combination of deterministic and stochastic quantities.
- It is a metric for convergence in probability.
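The Ky Fan distance can be estimated from Monte Carlo samples of d_χ(x_1, x_2): the infimum in the definition is approximated by the smallest ε whose empirical tail probability drops below ε. The estimator below is my own sketch (name and implementation are assumptions); it uses the fact that the empirical tail is a decreasing step function, so the infimum is min_k max(d_(k), tail_k).

```python
import numpy as np

def ky_fan_distance(distances):
    """Estimate rho_K from i.i.d. samples of d_chi(x1(omega), x2(omega)):
    the smallest eps > 0 with P(d > eps) < eps, via empirical tails."""
    d = np.sort(np.asarray(distances))
    n = d.size
    tails = np.arange(n - 1, -1, -1) / n   # tails[k] ~ P(d > d[k]), no ties
    return float(np.maximum(d, tails).min())

# Toy usage: distance between a noisy quantity and its limit.
rng = np.random.default_rng(0)
print(ky_fan_distance(np.abs(0.1 * rng.standard_normal(10_000))))
```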