On the Statistical Rate of Nonlinear Recovery in Generative Models with Heavy-tailed Data
Xiaohan Wei, Zhuoran Yang, and Zhaoran Wang
University of Southern California, Princeton University, and Northwestern University
June 12, 2019
Generative Model vs. Sparsity in Signal Recovery
Classical sparsity: the structure of the signal depends on the choice of basis.
Generative model: an explicit parametrization of the low-dimensional signal manifold.
Previous works: [Bora et al. 2017], [Hand et al. 2018], [Mardani et al. 2017].
Nonlinear Recovery via Generative Models
Given: a generative model G : R^k → R^d and a measurement matrix X ∈ R^{m×d}.
Goal: recover G(θ*) up to scaling from the nonlinear observations y = f(XG(θ*)).
Challenges:
1. High-dimensional recovery: k ≪ d, m ≪ d.
2. Non-Gaussian X and unknown nonlinearity f.
3. The observations y can be heavy-tailed.
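As a concrete illustration of the observation model, the sketch below simulates y = f(XG(θ*)) with a toy zero-bias two-layer ReLU generator, standard Gaussian measurement rows, a monotone link, and additive Student-t noise to make y heavy-tailed. All specific choices here (dimensions, weights, the link f, the noise) are illustrative assumptions, not the paper's experimental setup.

```python
# Illustrative sketch of the observation model y = f(<X_i, G(theta*)>) with
# heavy-tailed observations.  Dimensions, generator weights, the link f, and
# the noise are assumed for illustration only.
import numpy as np

rng = np.random.default_rng(0)
k, d, m = 5, 200, 500                      # latent dim k << d, measurements m << d

# Toy zero-bias two-layer ReLU generator G : R^k -> R^d.
W1 = rng.standard_normal((64, k)) / np.sqrt(k)
W2 = rng.standard_normal((d, 64)) / np.sqrt(64 * d)
def G(theta):
    return W2 @ np.maximum(W1 @ theta, 0.0)

theta_star = rng.standard_normal(k)
signal = G(theta_star)                     # the target G(theta*), recoverable only up to scaling

# Standard Gaussian measurement rows and a monotone link with E f' > 0,
# plus Student-t noise so the observations y are heavy-tailed.
X = rng.standard_normal((m, d))
inner = X @ signal
y = inner + 0.5 * np.sin(inner) + 0.2 * rng.standard_t(df=6, size=m)
```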
Our Method: Stein + Adaptive Thresholding
Suppose the rows of X := [X_1, ..., X_m]^T ∈ R^{m×d} have density p : R^d → R. Define the (row-wise) score transformation:
\[ S_p(X) := \big[S_p(X_1), \dots, S_p(X_m)\big]^\top = \big[\nabla \log p(X_1), \dots, \nabla \log p(X_m)\big]^\top. \]
(First-order) Stein's identity: when E f'(⟨X_i, G(θ*)⟩) > 0,
\[ \mathbb{E}\big[ S_p(X)^\top y \big] \propto G(\theta^*). \]
(Second-order) Stein's identity: when E f''(⟨X_i, G(θ*)⟩) > 0 and δ is a constant,
\[ \mathbb{E}\big[ S_p(X)^\top \mathrm{diag}(y)\, S_p(X) \big] \propto G(\theta^*) G(\theta^*)^\top + \delta \cdot I_{d \times d}. \]
Adaptive thresholding: suppose ‖y_i‖_{L_q} < ∞ for some q > 4, and set τ_m ∝ m^{2/q}; then
\[ \widetilde{y}_i = \mathrm{sign}(y_i) \cdot \big(|y_i| \wedge \tau_m\big), \qquad i \in \{1, 2, \dots, m\}. \]
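Continuing the sketch above, the score transformation and the adaptive truncation are short to write down. For standard Gaussian rows, ∇ log p(x) = −x; the code takes S(x) = x (sign conventions differ across references and only affect the sign of the proportionality constant, and this choice makes the Stein target used later positively proportional to G(θ*)). The moment order q = 5 and the unit constant in τ_m ∝ m^{2/q} are assumptions.

```python
# Continuing the sketch: row-wise score transformation and adaptive
# thresholding.  For standard Gaussian rows, grad log p(x) = -x; we take the
# score to be S(x) = x (sign conventions vary and only flip the sign of the
# proportionality constant in the Stein identities).
def score_gaussian(X):
    return X

S = score_gaussian(X)                      # (m, d) matrix of row-wise scores

q = 5.0                                    # assumed moment order with E|y_i|^q < infinity, q > 4
tau_m = m ** (2.0 / q)                     # tau_m proportional to m^{2/q}; constant set to 1 here
y_trunc = np.sign(y) * np.minimum(np.abs(y), tau_m)
```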
Our Method: Stein + Adaptive Thresholding
Least-squares estimator:
\[ \widehat{\theta} \in \operatorname*{argmin}_{\theta \in \mathbb{R}^k} \Big\| G(\theta) - \tfrac{1}{m}\, S_p(X)^\top \widetilde{y} \Big\|_2^2. \]
Main performance theorem:
Theorem (Wei, Yang, and Wang, 2019). For any accuracy level ε ∈ (0, 1], suppose (1) E f'(⟨X_i, G(θ*)⟩) > 0, (2) the generative model G is a ReLU network with zero bias, and (3) the number of measurements m ∝ k ε^{-2} log d. Then, with high probability,
\[ \bigg\| \frac{G(\widehat{\theta})}{\| G(\widehat{\theta}) \|_2} - \frac{G(\theta^*)}{\| G(\theta^*) \|_2} \bigg\|_2 \le \varepsilon. \]
Similar results hold for more general Lipschitz generators G.
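A minimal sketch of this first-order estimator, continuing the code above: form the empirical Stein target (1/m) S_p(X)^T ỹ and fit θ by plain gradient descent on the squared loss for the toy ReLU generator. The optimizer, step size, and iteration budget are ad hoc illustrative choices; the theorem concerns the minimizer itself, not any particular algorithm.

```python
# Continuing the sketch: first-order (least-squares) estimator.  Gradient
# descent on theta is only one way to attack the nonconvex fit; the step size
# and iteration count are arbitrary illustrative choices.
v = S.T @ y_trunc / m                      # empirical Stein target, approx. c * G(theta*), c > 0

def grad_loss(theta):
    pre = W1 @ theta
    resid = W2 @ np.maximum(pre, 0.0) - v  # residual G(theta) - v
    return 2.0 * W1.T @ ((pre > 0) * (W2.T @ resid))   # chain rule through the ReLU

theta_hat = 0.1 * rng.standard_normal(k)   # random init (theta = 0 is a dead point for ReLU)
for _ in range(2000):
    theta_hat -= 0.1 * grad_loss(theta_hat)

# Recovery is up to scaling, so compare normalized directions.
g_hat, g_star = G(theta_hat), G(theta_star)
err = np.linalg.norm(g_hat / np.linalg.norm(g_hat) - g_star / np.linalg.norm(g_star))
print(f"direction error: {err:.3f}")
```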
Our Method: Stein + Adaptive Thresholding
PCA-type estimator:
\[ \widehat{\theta} \in \operatorname*{argmax}_{\|G(\theta)\|_2 = 1} \; G(\theta)^\top S_p(X)^\top \mathrm{diag}(\widetilde{y})\, S_p(X)\, G(\theta). \]
Main performance theorem:
Theorem (Wei, Yang, and Wang, 2019). For any accuracy level ε ∈ (0, 1], suppose (1) E f''(⟨X_i, G(θ*)⟩) > 0, (2) the generative model G is a ReLU network with zero bias, and (3) the number of measurements m ∝ k ε^{-2} log d. Then, with high probability,
\[ \bigg\| G(\widehat{\theta}) - \frac{G(\theta^*)}{\| G(\theta^*) \|_2} \bigg\|_2 \le \varepsilon. \]
Similar results hold for more general Lipschitz generators G.
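For completeness, here is a sketch of the second-order (PCA-type) route on a separate quadratic observation, since it needs E f'' > 0 (e.g. a phase-retrieval-style link, which is an assumption of this illustration). Because a zero-bias ReLU net is positively homogeneous (G(cθ) = cG(θ) for c > 0), maximizing the Rayleigh quotient over the range of G and normalizing afterwards matches the unit-norm constrained problem; the gradient-ascent solver and its hyperparameters are again illustrative choices only.

```python
# Second-order (PCA-type) estimator, sketched on a quadratic link where
# E f'' > 0 (the first-order identity is uninformative for even links).
y2 = (X @ signal) ** 2                     # illustrative phase-retrieval-style observations
y2_trunc = np.sign(y2) * np.minimum(np.abs(y2), tau_m)

M = (S.T * y2_trunc) @ S / m               # (1/m) S^T diag(y2~) S, a d x d matrix

def grad_rayleigh(theta):
    # Gradient of u^T M u / (u^T u) with u = G(theta), pulled back through the ReLU net.
    pre = W1 @ theta
    u = W2 @ np.maximum(pre, 0.0)
    rq = (u @ M @ u) / (u @ u)
    grad_u = 2.0 * (M @ u - rq * u) / (u @ u)
    return W1.T @ ((pre > 0) * (W2.T @ grad_u))

theta_pca = rng.standard_normal(k)         # random init away from the dead point theta = 0
for _ in range(2000):
    theta_pca += 0.1 * grad_rayleigh(theta_pca)

# Enforce the unit-norm constraint by normalizing the output; the quadratic
# objective is sign-invariant, so compare directions up to sign to be safe.
u_hat = G(theta_pca) / np.linalg.norm(G(theta_pca))
u_star = G(theta_star) / np.linalg.norm(G(theta_star))
err2 = min(np.linalg.norm(u_hat - u_star), np.linalg.norm(u_hat + u_star))
print(f"PCA-type direction error: {err2:.3f}")
```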
Thank you! Poster 198, Pacific Ballroom, 6:30-9:00 pm