Primal-dual algorithms for the sum of two and three functions
Ming Yan
Michigan State University, CMSE/Mathematics
This work is partially supported by NSF.
optimization problems for primal-dual algorithms

    minimize_x  f(x) + g(x) + h(Ax)

• f, g, and h are convex.
• X and Y are two Hilbert spaces (e.g., R^m, R^n).
• f: X → R is differentiable with a 1/β-Lipschitz continuous gradient for some β ∈ (0, +∞).
• A: X → Y is a bounded linear operator.
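For intuition (an illustrative example, not on the slide): if f(x) = (1/2)‖Bx − b‖_2^2 for some matrix B, then ∇f(x) = B^⊤(Bx − b) is Lipschitz continuous with constant ‖B^⊤B‖, so the smoothness assumption holds with β = 1/‖B^⊤B‖.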
applications: statistics

Elastic net regularization (Zou-Hastie '05):

    minimize_x  (µ_2/2)‖x‖_2^2 + µ_1‖x‖_1 + l(Ax, b),

where x ∈ R^p, A ∈ R^{n×p}, b ∈ R^n, and l is the loss function, which may be nondifferentiable.

Fused lasso (Tibshirani et al. '05):

    minimize_x  (1/2)‖Ax − b‖_2^2 + µ_1‖x‖_1 + µ_2‖Dx‖_1,

where x ∈ R^p, A ∈ R^{n×p}, b ∈ R^n, and

    D = [ -1   1
               -1   1
                    ...  ...
                        -1   1 ]

is a matrix in R^{(p-1)×p}.
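To make the fused-lasso pieces concrete, here is a minimal NumPy sketch (not from the slides; A, b, mu1, mu2 are placeholder data) that builds the difference matrix D and evaluates the objective:

    import numpy as np

    def difference_matrix(p):
        # D in R^{(p-1) x p}: each row has -1 and 1 on consecutive entries
        D = np.zeros((p - 1, p))
        idx = np.arange(p - 1)
        D[idx, idx] = -1.0
        D[idx, idx + 1] = 1.0
        return D

    def fused_lasso_objective(x, A, b, mu1, mu2):
        # (1/2)||Ax - b||_2^2 + mu1*||x||_1 + mu2*||Dx||_1
        D = difference_matrix(x.size)
        return (0.5 * np.sum((A @ x - b) ** 2)
                + mu1 * np.abs(x).sum()
                + mu2 * np.abs(D @ x).sum())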
applications: decentralized optimization

    minimize_x  Σ_{i=1}^n  f_i(x) + g_i(x)

• f_i and g_i are known at node i only.
• Nodes 1, ..., n are connected in an undirected graph.
• f_i is differentiable with a Lipschitz continuous gradient.

Introduce a copy x_i at node i:

    minimize_x  f(x) + g(x) := Σ_{i=1}^n  f_i(x_i) + g_i(x_i)    s.t.  Wx = x

• x_i ∈ R^p, x = [x_1 x_2 ··· x_n]^⊤ ∈ R^{n×p}.
• W is a symmetric doubly stochastic mixing matrix.

The sum of three functions:

    minimize_x  f(x) + g(x) + ι_0((I − W)^{1/2} x)
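A small illustrative sketch (not from the slides; the ring graph and the averaging weights are one simple choice of mixing matrix) showing a symmetric doubly stochastic W and that consensus variables satisfy Wx = x:

    import numpy as np

    def ring_mixing_matrix(n):
        # Symmetric doubly stochastic mixing matrix on a ring graph:
        # each node keeps weight 1/2 and gives 1/4 to each neighbor.
        W = np.zeros((n, n))
        for i in range(n):
            W[i, i] = 0.5
            W[i, (i - 1) % n] = 0.25
            W[i, (i + 1) % n] = 0.25
        return W

    W = ring_mixing_matrix(5)
    assert np.allclose(W, W.T) and np.allclose(W.sum(axis=1), 1.0)
    # If all local copies x_i agree (rows of x identical), then W x = x:
    x = np.tile(np.array([1.0, -2.0, 0.5]), (5, 1))
    assert np.allclose(W @ x, x)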
applications: imaging

Image restoration with two regularizations:

    minimize_x  (1/2)‖Ax − b‖_2^2 + ι_C(x) + µ‖Dx‖_1,

where x ∈ R^n is the image to be reconstructed, A ∈ R^{m×n} is the forward projection matrix, b ∈ R^m is the measured data with noise, D is a discrete gradient operator, and ι_C is the indicator function that returns zero if x ∈ C (here, C is the set of nonnegative vectors in R^n) and +∞ otherwise.

Other problems:
• f: data fitting term (infimal convolution for mixed noise)
• h ∘ A: total variation; other transforms
• g: nonnegativity; box constraint
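For the constraint terms above, the corresponding backward steps are simple projections; a short illustrative sketch (standard formulas, not from the slides):

    import numpy as np

    # prox of the indicator of the nonnegative orthant = Euclidean projection:
    def prox_nonneg(x):
        return np.maximum(x, 0.0)

    # prox of the indicator of a box [lo, hi] = componentwise clipping:
    def prox_box(x, lo, hi):
        return np.clip(x, lo, hi)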
primal-dual formulation

Original problem:

    minimize_x  f(x) + g(x) + h(Ax)

Introduce a dual variable s:

    minimize_x  max_s  f(x) + g(x) + ⟨Ax, s⟩ − h*(s)

Here h* is the conjugate function of h, defined as

    h*(s) = max_t  ⟨s, t⟩ − h(t).

It is equivalent to (s* ∈ ∂h(Ax*) ⟺ Ax* ∈ ∂h*(s*)):

    0 ∈ ∇f(x*) + ∂g(x*) + A^⊤ s*
    0 ∈ ∂h*(s*) − Ax*

All primal-dual algorithms try to find (x*, s*).
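A standard example (not stated on the slide): for h = µ‖·‖_1, the conjugate is h*(s) = ι_{‖s‖_∞ ≤ µ}(s), so the resolvent of ∂h* is the projection onto the ℓ_∞ ball of radius µ. More generally, the prox of h* can be obtained from the prox of h via the Moreau identity prox_{σh*}(s) = s − σ prox_{h/σ}(s/σ).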
existing algorithms: Condat-Vu, AFBA, and PDFP

Condat-Vu (Condat '13, Vu '13):
• Convergence condition: λ‖AA^⊤‖ + γ/(2β) ≤ 1
• Per-iteration computations: A, A^⊤, ∇f, one each of (I + γ∂g)^{-1} and (I + (λ/γ)∂h*)^{-1}

AFBA (Latafat-Patrinos '16):
• Convergence condition: λ‖AA^⊤‖/2 + √(λ‖AA^⊤‖)/2 + γ/(2β) ≤ 1
• Per-iteration computations: A, A^⊤, ∇f, one each of (I + γ∂g)^{-1} and (I + (λ/γ)∂h*)^{-1}

PDFP (Chen-Huang-Zhang '16):
• Convergence conditions: λ‖AA^⊤‖ < 1; γ/(2β) < 1
• Per-iteration computations: A, A^⊤, ∇f, two (I + γ∂g)^{-1}, one (I + (λ/γ)∂h*)^{-1}

Here (I + γ∂g)^{-1}(x̃) = argmin_x  γg(x) + (1/2)‖x − x̃‖^2. This is a backward step (or implicit step) because (I + γ∂g)^{-1}(x̃) ∈ x̃ − γ∂g((I + γ∂g)^{-1}(x̃)).
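As a concrete instance of the backward step (a standard example, not specific to these algorithms): for g = µ‖·‖_1 the resolvent (I + γ∂g)^{-1} has the closed form of componentwise soft thresholding, sketched below.

    import numpy as np

    def prox_l1(x_tilde, gamma, mu):
        # (I + gamma * d(mu*||.||_1))^{-1}(x_tilde): soft thresholding at gamma*mu
        return np.sign(x_tilde) * np.maximum(np.abs(x_tilde) - gamma * mu, 0.0)

    x_tilde = np.array([2.0, -0.1, 0.6])
    x_plus = prox_l1(x_tilde, gamma=0.5, mu=1.0)
    # Where x_plus is nonzero, sign(x_plus) is the (unique) subgradient, so the
    # implicit-step relation x_plus = x_tilde - gamma*mu*sign(x_plus) holds there.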
PDHG (Zhu-Chan '08)

When f = 0, we have

    [ ∂g    A^⊤ ] [ x* ]
    [ -A    ∂h* ] [ s* ]  ∋  0

It is equivalent to

    [ ∂g    0  ] [ x* ]      [ 0    -A^⊤ ] [ x* ]
    [ 0    ∂h* ] [ s* ]  ∋   [ A     0   ] [ s* ]

and to

    [ (1/γ)I + ∂g        0         ] [ x* ]      [ (1/γ)I   -A^⊤   ] [ x* ]
    [      0       (γ/λ)I + ∂h*    ] [ s* ]  ∋   [   A      (γ/λ)I ] [ s* ]

Primal-dual hybrid gradient (PDHG):

    x+ = (I + γ∂g)^{-1}(x − γA^⊤ s)
    s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)Ax+)
Chambolle-Pock (Chambolle et al. '09, Esser-Zhang-Chan '10):

    x+ = (I + γ∂g)^{-1}(x − γA^⊤ s)
    x̄+ = x+ + (x+ − x)
    s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A x̄+)
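A minimal runnable sketch of the Chambolle-Pock iteration above on a toy instance (all problem data and step sizes below are made up for illustration): g(x) = (1/2)‖x − b‖^2, h = µ‖·‖_1, and A = D the difference matrix, so the prox of g is an averaging with b and the resolvent of ∂h* is a projection onto the ℓ_∞ ball.

    import numpy as np

    # Toy instance: minimize_x 0.5*||x - b||^2 + mu*||D x||_1   (g + h(Ax), f = 0)
    np.random.seed(0)
    p, mu = 50, 1.0
    b = np.cumsum(0.1 * np.random.randn(p)) + (np.arange(p) > 25)  # noisy piecewise signal
    D = np.diff(np.eye(p), axis=0)                                  # difference matrix, (p-1) x p

    gamma = 0.2
    lam = 0.9 / np.linalg.norm(D @ D.T, 2)   # keep lambda*||A A^T|| <= 1 (assumed step-size rule)
    sigma = lam / gamma                      # dual step size lambda/gamma

    x = np.zeros(p)
    s = np.zeros(p - 1)
    for _ in range(500):
        x_new = (x - gamma * (D.T @ s) + gamma * b) / (1.0 + gamma)  # prox of gamma*g
        x_bar = 2 * x_new - x                                        # extrapolation step
        s = np.clip(s + sigma * (D @ x_bar), -mu, mu)                # resolvent of sigma*dh*: box projection
        x = x_new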
Chambolle-Pock '11 as proximal point

Chambolle-Pock (x-s order):

    x+ = (I + γ∂g)^{-1}(x − γA^⊤ s)
    s+ = (I + (λ/γ)∂h*)^{-1}(s + (λ/γ)A(2x+ − x))

CP is equivalent to the backward operator applied to the KKT system:

    [ (1/γ)I + ∂g         0        ] [ x+ ]      [ (1/γ)I   -A^⊤   ] [ x ]
    [    -2A        (γ/λ)I + ∂h*   ] [ s+ ]  ∋   [   -A     (γ/λ)I ] [ s ]

• CP is 1/2-averaged under the metric induced by the matrix

      [ (1/γ)I   -A^⊤   ]
      [   -A     (γ/λ)I ]

  if λ satisfies the condition λ‖AA^⊤‖ ≤ 1.
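A quick numerical sanity check (illustrative only, with a random A) that the metric matrix above is positive semidefinite when λ‖AA^⊤‖ ≤ 1:

    import numpy as np

    np.random.seed(1)
    m, n = 4, 6
    A = np.random.randn(m, n)
    gamma = 0.7
    lam = 1.0 / np.linalg.norm(A @ A.T, 2)   # largest allowed lambda: lambda*||A A^T|| = 1

    # Metric matrix M = [[(1/gamma) I, -A^T], [-A, (gamma/lam) I]]
    M = np.block([[np.eye(n) / gamma, -A.T],
                  [-A, (gamma / lam) * np.eye(m)]])
    print(np.linalg.eigvalsh(M).min() >= -1e-10)   # True: M is positive semidefinite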