Stochastic Iterative Hard Thresholding for Graph-Structured Sparsity Optimization

Baojian Zhou¹, Feng Chen¹, and Yiming Ying²
¹Department of Computer Science, ²Department of Mathematics and Statistics, University at Albany, NY, USA

06/13/2019 · Poster #92
Motivation

Graph structure information as a prior often gives:
• better classification and regression performance
• stronger interpretability

Current limitations:
• existing methods only focus on a specific loss
• full-gradient calculation is expensive
• complex structures cannot be handled

Our goals are to propose/provide:
• an algorithm for general losses under the stochastic setting
• a convergence analysis
• real-world applications

Structured sparse learning

Given M(M) = {w : supp(w) ∈ M}, structured sparse learning problems can be formulated as

$$\min_{w \in M(M)} F(w) := \frac{1}{n}\sum_{i=1}^{n} f_i(w),$$

where F(w) is a convex loss such as the least-squares or logistic loss, and M(M) models structured sparsity such as connected subgraphs, dense subgraphs, or subgraphs isomorphic to a query graph (a membership-test sketch for the connected-subgraph model follows below).

[Figure: an example graph G on nodes w_1, ..., w_6 illustrating a structured support.]
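To make the constraint concrete, here is a minimal sketch, not from the paper, that tests membership in M(M) for the connected-subgraph model M = {S : |S| ≤ s, S is connected}; the graph, edge list, and function name are illustrative assumptions:

```python
import numpy as np

def in_connected_model(w, edges, s, tol=1e-12):
    """Check membership in M(M) for the connected-subgraph model:
    supp(w) has at most s entries and induces a connected subgraph."""
    supp = {i for i, wi in enumerate(w) if abs(wi) > tol}
    if len(supp) > s:
        return False
    if not supp:
        return True
    # Adjacency restricted to the support.
    adj = {i: [] for i in supp}
    for u, v in edges:
        if u in supp and v in supp:
            adj[u].append(v)
            adj[v].append(u)
    # Depth-first search from an arbitrary support node.
    start = next(iter(supp))
    seen, stack = {start}, [start]
    while stack:
        u = stack.pop()
        for v in adj[u]:
            if v not in seen:
                seen.add(v)
                stack.append(v)
    return seen == supp

# Example on a hypothetical 6-node graph (0-indexed nodes):
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5), (1, 4)]
w = np.array([0.0, 1.3, -0.7, 0.0, 2.1, 0.0])
print(in_connected_model(w, edges, s=3))  # True: supp = {1, 2, 4} is connected
```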
Inspired by two recent works: Hegde et al. (2016) and Nguyen et al. (2017).

Algorithm 1 GraphStoIHT
1: Input: η_t, F(·), M_H, M_T
2: Initialize: w^0 and t = 0
3: for t = 0, 1, 2, ... do
4:   Choose ξ_t from [n] with probability p_{ξ_t}
5:   b^t = P(∇f_{ξ_t}(w^t), M_H)
6:   w^{t+1} = P(w^t − η_t b^t, M_T)
7: end for
8: Return w^{t+1}

Orthogonal projection operator P(·, M): R^p → R^p, defined as

$$P(w, M) = \arg\min_{w' \in M(M)} \|w - w'\|_2.$$

[Figure: the Weighted Graph Model M = {S : |S| ≤ 3, S is connected} (Hegde et al., 2015a).]

Two differences from StoIHT:
• it projects the stochastic gradient ∇f_{ξ_t}(·) onto M(M_H);
• it projects the proxy onto M(M_T).

Why the projection b^t = P(∇f_{ξ_t}(w^t), M_H)?
• both projections solve the same type of projection problem (e.g., onto the s-sparse set or the Weighted Graph Model);
• intuitively, sparsity is present in both the primal and the dual space;
• it removes some noisy directions at the first stage.

A minimal sketch of this loop is given below.
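The sketch below shows the GraphStoIHT loop under simplifying assumptions: uniform sampling p_i = 1/n, a constant step size, and exact top-k hard thresholding as a stand-in for the approximate head/tail projections that the paper instantiates with Hegde et al.'s weighted-graph-model algorithms. The helper names and the `grad_f` callback are hypothetical.

```python
import numpy as np

def top_k_projection(x, k):
    """Exact Euclidean projection onto the k-sparse set; used here as a
    stand-in for the approximate projections P(., M_H) and P(., M_T)."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]   # indices of the k largest magnitudes
    out[idx] = x[idx]
    return out

def graph_sto_iht(grad_f, n, w0, eta, k, num_iters, seed=0):
    """Sketch of Algorithm 1: grad_f(i, w) returns the stochastic gradient
    of f_i at w; sampling over [n] is uniform."""
    rng = np.random.default_rng(seed)
    w = w0.copy()
    for _ in range(num_iters):
        i = rng.integers(n)                      # choose xi_t from [n]
        b = top_k_projection(grad_f(i, w), k)    # b^t = P(grad f_xi(w^t), M_H)
        w = top_k_projection(w - eta * b, k)     # w^{t+1} = P(w^t - eta b^t, M_T)
    return w
```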
Two assumptions on M(M):
• each f_i(w) satisfies β-Restricted Strong Smoothness (RSS): B_{f_i}(w, w') ≤ (β/2)‖w − w'‖²;
• F(w) satisfies α-Restricted Strong Convexity (RSC): B_F(w, w') ≥ (α/2)‖w − w'‖²;
where B_f(w, w') = f(w) − f(w') − ⟨∇f(w'), w − w'⟩ is the Bregman divergence (a numerical check of this sandwich is sketched after the theorem).

Efficient approximate projections:
• P(·, M_H) with approximation factor c_H;
• P(·, M_T) with approximation factor c_T.

Theorem 1 (Linear convergence). Let w^0 be the start point and choose η_t = η. Then w^{t+1} of Algorithm 1 satisfies

$$\mathbb{E}_{\xi_{[t]}}\|w^{t+1} - w^*\| \le \kappa^{t+1}\,\|w^0 - w^*\| + \frac{\sigma}{1-\kappa},$$

where

$$\kappa = (1+c_T)\left(\sqrt{\alpha\beta\eta^2 - 2\alpha\eta + 1} + \frac{\beta_0}{\sqrt{1-\alpha_0^2}}\right),\qquad
\alpha_0 = c_H\sqrt{\alpha\beta\tau^2 - 2\alpha\tau + 1},\qquad
\beta_0 = (1+c_H)\,\tau,$$

$$\sigma = \frac{\beta_0}{\alpha_0\sqrt{1-\alpha_0^2}}\,\mathbb{E}_{\xi_t}\|\nabla_I f_{\xi_t}(w^*)\| + \eta\,\mathbb{E}_{\xi_t}\|\nabla_I f_{\xi_t}(w^*)\|,$$

and η, τ ∈ (0, 2/β).
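To make the RSS/RSC assumptions concrete, here is a minimal numerical check, an illustration rather than anything from the paper, for the least-squares loss, where B_f(w, w') = ½‖X(w − w')‖² exactly; the constants below are taken over all of R^p, a simplification of the restricted constants used in the theorem.

```python
import numpy as np

# Verify (alpha/2)||w - w'||^2 <= B_f(w, w') <= (beta/2)||w - w'||^2
# for f(w) = 0.5 * ||Xw - y||^2, using the unrestricted constants
# alpha = lambda_min(X^T X) and beta = lambda_max(X^T X).
rng = np.random.default_rng(1)
m, p = 50, 10
X = rng.standard_normal((m, p)) / np.sqrt(m)
y = rng.standard_normal(m)

f = lambda w: 0.5 * np.sum((X @ w - y) ** 2)
grad = lambda w: X.T @ (X @ w - y)

w, w2 = rng.standard_normal(p), rng.standard_normal(p)
bregman = f(w) - f(w2) - grad(w2) @ (w - w2)   # B_f(w, w')
evals = np.linalg.eigvalsh(X.T @ X)            # eigenvalues, ascending
alpha, beta = evals[0], evals[-1]
half_gap = 0.5 * np.sum((w - w2) ** 2)
print(alpha * half_gap <= bregman <= beta * half_gap)   # True
```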
Graph Linear Regression

Model: y = Xw* + ε with X ∈ R^{m×p} and ε ~ N(0, I_m). Consider the least-squares loss over n blocks B_1, ..., B_n of samples (a mini-batch sketch follows below):

$$\arg\min_{\mathrm{supp}(w)\in M(M)} F(w) := \frac{1}{n}\sum_{i=1}^{n}\frac{n}{2m}\|X_{B_i}w - y_{B_i}\|^2.$$

Contraction factors (δ is the restricted-isometry constant):

Algorithm     κ
GraphIHT      (1 + c_T) √δ (√δ + 2√(1 − δ)) / (1 + δ)
GraphStoIHT   (1 + c_T) √δ (√δ + 2√(2(1 − δ)²/n)) / (1 + δ)

• For GraphIHT, κ < 1 requires δ ≤ 0.0527.
• For GraphStoIHT, κ < 1 requires δ ≤ 0.0142.

Graph Logistic Regression

Model: x_i ∈ R^p and labels y_i ∈ {+1, −1} drawn from w* via (1 + e^{−y_i·⟨w*, x_i⟩})^{−1}. Consider the ℓ2-regularized logistic loss

$$\arg\min_{\mathrm{supp}(w)\in M(M)} F(w) := \frac{1}{n}\sum_{i=1}^{n}\frac{n}{m}\sum_{j=1}^{m/n} h(w, i_j) + \frac{\lambda}{2}\|w\|^2,$$

where h(w, i_j) = log(1 + exp(−y_{i_j}·⟨x_{i_j}, w⟩)). If the x_i are normalized, then F(w) satisfies λ-RSC and each f_i(w) satisfies (λ + n(1 + ν)θ_max/(4m))-RSS, and the condition κ < 1 holds when

$$\frac{\lambda}{\lambda + n(1+\nu)\theta_{\max}/(4m)} \ge \frac{243}{250}$$

with probability 1 − p·exp(−θ_max ν/4), where θ_max = λ_max(Σ_{j=1}^{m/n} E[x_{i_j} x_{i_j}^T]) and ν ≥ 1.
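Here is a minimal sketch of the block-structured least-squares objective above, on illustrative synthetic data (the block splitting and the names are assumptions); it checks that F(w) = (1/n) Σ_i f_i(w) recovers the full objective (1/2m)‖Xw − y‖²:

```python
import numpy as np

# The m samples are split into n equal blocks B_1, ..., B_n with
# f_i(w) = (n / (2m)) * ||X[B_i] w - y[B_i]||^2.
rng = np.random.default_rng(2)
m, p, n = 60, 20, 6
X = rng.standard_normal((m, p)) / np.sqrt(m)
w_star = np.zeros(p)
w_star[:3] = rng.standard_normal(3)        # sparse ground truth
y = X @ w_star + rng.standard_normal(m)
blocks = np.array_split(np.arange(m), n)

def f_i(w, i):
    r = X[blocks[i]] @ w - y[blocks[i]]
    return (n / (2 * m)) * (r @ r)

def grad_f_i(w, i):                        # stochastic gradient for GraphStoIHT
    r = X[blocks[i]] @ w - y[blocks[i]]
    return (n / m) * (X[blocks[i]].T @ r)

w = rng.standard_normal(p)
F = np.mean([f_i(w, i) for i in range(n)])
print(np.isclose(F, 0.5 / m * np.sum((X @ w - y) ** 2)))   # True
```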
Experiments

Simulated dataset:
• each entry X_ij ~ N(0, 1)/√m;
• supp(w*) is generated by a random walk;
• entries of w* are drawn from N(0, 1);
• structure model: Weighted Graph Model (Hegde et al., 2015b);
• baselines: NIHT, IHT, StoIHT, CoSaMP, GraphIHT, GraphCoSaMP.

[Figures: estimation error ‖x − x̂‖ of GraphStoIHT versus epochs for block sizes b ∈ {1, 2, 4, ..., 180} and versus iterations for learning rates η ∈ {0.1, ..., 1.6}; probability of recovery versus oversampling ratio m/s on the BackGround, Angio, and Text benchmark supports.]

Breast cancer dataset: 295 samples with 78 positives (metastatic) and 217 negatives (non-metastatic), provided in (Van De Vijver et al., 2002); a PPI network with 637 pathways is provided in (Jacob et al., 2009). We restrict our analysis to 3,243 genes (nodes) with 19,938 edges; the cancer-related genes form a connected subgraph.

Algorithm      Cancer-related genes selected             ‖w^t‖_0   AUC
GraphStoIHT    BRCA2, CCND2, CDKN1A, ATM, AR, TOP2A      51.7      0.715
GraphIHT       ATM, CDKN1A, BRCA2, AR, TOP2A             55.2      0.714
ℓ1-Path        BRCA1, CDKN1A, ATM, DSC2                  61.2      0.675
StoIHT         MKI67, NAT1, AR, TOP2A                    59.6      0.708
ℓ1/ℓ2-Edge     CCND3, ATM, CDH3                          51.4      0.705
ℓ1-Edge        CCND3, AR, CDH3                           39.9      0.698
ℓ1/ℓ2-Path     BRCA1, CDKN1A                             147.6     0.705
IHT            NAT1, TOP2A                               67.9      0.707
See you at Poster #92. Thank you!