
Understanding Statistical-vs-Computational Tradeoffs via the Low-Degree Likelihood Ratio

Alex Wein, Courant Institute, NYU

Joint work with: Afonso Bandeira (ETH Zurich), Yunzi Ding (NYU), Tim Kunisky (NYU)

Motivation


How to Show that a Problem is Hard?

We don't know how to prove that average-case problems are hard, but there are various forms of "rigorous evidence":
◮ Reductions (e.g. from planted clique) [Berthet, Rigollet '13; Brennan, Bresler, ...]
◮ Failure of MCMC [Jerrum '92]
◮ Shattering of the solution space [Achlioptas, Coja-Oghlan '08]
◮ Failure of local algorithms [Gamarnik, Sudan '13]
◮ Statistical physics, belief propagation [Decelle, Krzakala, Moore, Zdeborová '11]
◮ Optimization landscape, Kac-Rice formula [Auffinger, Ben Arous, Černý '10]
◮ Statistical query lower bounds [Feldman, Grigorescu, Reyzin, Vempala, Xiao '12]
◮ Sum-of-squares lower bounds [Barak, Hopkins, Kelner, Kothari, Moitra, Potechin '16]
◮ This talk: the "low-degree method" [Barak, Hopkins, Kelner, Kothari, Moitra, Potechin '16; Hopkins, Steurer '17; Hopkins, Kothari, Potechin, Raghavendra, Schramm, Steurer '17; Hopkins '18 (PhD thesis)]

The Low-Degree Method (e.g. [Hopkins, Steurer '17])

Suppose we want to hypothesis test with error probability o(1) between two distributions:
◮ Null model Y ∼ Q_n, e.g. G(n, 1/2)
◮ Planted model Y ∼ P_n, e.g. G(n, 1/2) ∪ {random k-clique}

Look for a degree-D (multivariate) polynomial \(f : \mathbb{R}^{n \times n} \to \mathbb{R}\) that distinguishes P from Q: want f(Y) to be big when Y ∼ P and small when Y ∼ Q.

Compute
\[
\max_{\deg f \le D} \frac{\mathbb{E}_{Y \sim P}[f(Y)]}{\sqrt{\mathbb{E}_{Y \sim Q}[f(Y)^2]}}
\qquad \text{(mean in } P \text{ over fluctuations in } Q\text{)}.
\]
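To make the objective concrete, here is a minimal numerical sketch for the planted clique example (my own illustration, not part of the talk; the values of n, k, and the number of trials are arbitrary). It uses the simplest possible statistic, the degree-1 centered edge count \(f(Y) = \sum_{i<j}(Y_{ij} - 1/2)\), and estimates the ratio above by Monte Carlo. The ratio scales like \(k^2/n\), so this simple statistic succeeds only when the clique size k is of order \(\sqrt{n}\) or larger.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_graph(n, k=0):
    """Upper-triangular adjacency of G(n,1/2), optionally with a planted k-clique."""
    A = np.triu(rng.integers(0, 2, size=(n, n)), 1)
    if k > 0:
        idx = rng.choice(n, size=k, replace=False)
        B = np.zeros((n, n), dtype=int)
        B[np.ix_(idx, idx)] = 1          # force all edges inside the clique
        A = np.maximum(A, np.triu(B, 1))
    return A

def f_edge_count(A, n):
    """Degree-1 polynomial: centered edge count."""
    return A.sum() - 0.5 * n * (n - 1) / 2

# Illustrative parameters (not from the talk).
n, k, trials = 400, 40, 200
null = np.array([f_edge_count(sample_graph(n), n) for _ in range(trials)])
planted = np.array([f_edge_count(sample_graph(n, k), n) for _ in range(trials)])

# Empirical version of E_P[f] / sqrt(E_Q[f^2]); large values mean f separates P from Q.
ratio = planted.mean() / np.sqrt((null ** 2).mean())
print(f"empirical ratio ~ {ratio:.2f}   (analytic ~ k^2/(sqrt(2) n) = {k**2/(np.sqrt(2)*n):.2f})")
```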

The Low-Degree Method (e.g. [Hopkins, Steurer '17])

Define the inner product \(\langle f, g \rangle = \mathbb{E}_{Y \sim Q}[f(Y)\, g(Y)]\), the norm \(\|f\| = \sqrt{\langle f, f \rangle}\), and the likelihood ratio \(L(Y) = \frac{dP}{dQ}(Y)\). Then

\[
\max_{\deg f \le D} \frac{\mathbb{E}_{Y \sim P}[f(Y)]}{\sqrt{\mathbb{E}_{Y \sim Q}[f(Y)^2]}}
= \max_{\deg f \le D} \frac{\mathbb{E}_{Y \sim Q}[L(Y)\, f(Y)]}{\sqrt{\mathbb{E}_{Y \sim Q}[f(Y)^2]}}
= \max_{\deg f \le D} \frac{\langle L, f \rangle}{\|f\|}
= \|L^{\le D}\|.
\]

Maximizer: \(f = L^{\le D} :=\) the projection of L onto the degree-D subspace. The optimal value \(\|L^{\le D}\|\) is the norm of the low-degree likelihood ratio.
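For completeness (this step is implicit on the slide): for any f of degree at most D, \(\langle L, f \rangle = \langle L^{\le D}, f \rangle\), since the high-degree part of L is orthogonal to f. Hence, by Cauchy–Schwarz,

\[
\frac{\langle L, f \rangle}{\|f\|}
= \frac{\langle L^{\le D}, f \rangle}{\|f\|}
\le \frac{\|L^{\le D}\|\,\|f\|}{\|f\|}
= \|L^{\le D}\|,
\]

with equality at \(f = L^{\le D}\), which itself has degree at most D.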

The Low-Degree Method

Conclusion:
\[
\max_{\deg f \le D} \frac{\mathbb{E}_{Y \sim P}[f(Y)]}{\sqrt{\mathbb{E}_{Y \sim Q}[f(Y)^2]}} = \|L^{\le D}\|.
\]

Heuristically,
\[
\|L^{\le D}\| = \begin{cases}
\omega(1) & \text{a degree-}D\text{ polynomial can distinguish } Q, P \\
O(1) & \text{degree-}D\text{ polynomials fail}
\end{cases}
\]

Conjecture (informal variant of [Hopkins '18]): For "nice" Q, P, if \(\|L^{\le D}\| = O(1)\) for some \(D = \omega(\log n)\), then no polynomial-time algorithm can distinguish Q, P with success probability \(1 - o(1)\).

Degree-\(O(\log n)\) polynomials ⇔ polynomial-time algorithms

Formal Consequences of the Low-Degree Method

The case D = ∞: if \(\|L\| = O(1)\) (as n → ∞), then no test can distinguish Q from P (with success probability 1 − o(1)).
◮ Classical second moment method

If \(\|L^{\le D}\| = O(1)\) for some \(D = \omega(\log n)\), then no spectral method can distinguish Q from P (in a particular sense) [Kunisky, W, Bandeira '19].
◮ Spectral method: threshold the top eigenvalue of a poly-size matrix M = M(Y) whose entries are O(1)-degree polynomials in Y
◮ Proof: consider the polynomial \(f(Y) = \mathrm{Tr}(M^q)\) with \(q = \Theta(\log n)\)
◮ Spectral methods are believed to be as powerful as sum-of-squares for average-case problems [HKPRSS '17]
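To see why \(\mathrm{Tr}(M^q)\) serves as a polynomial proxy for the top eigenvalue, here is a small numerical sketch (my own, with arbitrary choices of n and matrix ensemble): for symmetric M, \(\mathrm{Tr}(M^{2q}) = \sum_i \lambda_i^{2q}\), so \(\mathrm{Tr}(M^{2q})^{1/(2q)}\) is within a factor \(n^{1/(2q)}\) of the spectral radius, and with \(q = \Theta(\log n)\) that factor is O(1).

```python
import numpy as np

rng = np.random.default_rng(1)

n = 500
q = int(np.ceil(np.log(n)))              # q = Theta(log n)

# Symmetric Wigner-like matrix with off-diagonal entries of variance 1/n.
G = rng.normal(scale=1.0 / np.sqrt(n), size=(n, n))
M = (G + G.T) / np.sqrt(2)

# Tr(M^{2q}) is a degree-2q polynomial in the entries of M; here we evaluate it
# via the eigenvalues purely for illustration.
eigs = np.linalg.eigvalsh(M)
trace_poly = np.sum(eigs ** (2 * q))

print("spectral radius               :", np.max(np.abs(eigs)))
print("Tr(M^{2q})^{1/(2q)}           :", trace_poly ** (1 / (2 * q)))
print("worst-case gap factor n^{1/2q}:", n ** (1 / (2 * q)))
```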

Low-Degree Method: Recap

Given a hypothesis testing question Q_n vs P_n:
Take D ≈ log n.
Compute/bound \(\|L^{\le D}\|\) in the limit n → ∞.
◮ If \(\|L^{\le D}\| = \omega(1)\), suggests that the problem is poly-time solvable
◮ If \(\|L^{\le D}\| = O(1)\), suggests that the problem is NOT poly-time solvable (and gives rigorous evidence: spectral methods fail)

Advantages of the Low-Degree Method

◮ Possible to calculate/bound \(\|L^{\le D}\|\) for many problems
◮ Predictions seem "correct"!
  ◮ Planted clique, sparse PCA, stochastic block model, ...
◮ (Relatively) simple
  ◮ Much simpler than sum-of-squares lower bounds
◮ Detection vs certification
◮ General: no assumptions on Q, P
◮ Captures sharp thresholds [Hopkins, Steurer '17]
◮ By varying the degree D, can explore runtimes other than polynomial
  ◮ Conjecture (Hopkins '18): degree-D polynomials ⇔ time-\(n^{\tilde{\Theta}(D)}\) algorithms
◮ No ingenuity required
◮ Interpretable

How to Compute \(\|L^{\le D}\|\)

Additive Gaussian noise: P: Y = X + Z vs Q: Y = Z, where X is drawn from an arbitrary prior over \(\mathbb{R}^N\) and Z is i.i.d. N(0, 1).

\[
L(Y) = \frac{dP}{dQ}(Y)
= \frac{\mathbb{E}_X \exp\!\left(-\tfrac{1}{2}\|Y - X\|^2\right)}{\exp\!\left(-\tfrac{1}{2}\|Y\|^2\right)}
= \mathbb{E}_X \exp\!\left(\langle Y, X \rangle - \tfrac{1}{2}\|X\|^2\right)
\]

Expand \(L = \sum_\alpha c_\alpha h_\alpha\), where \(\{h_\alpha\}\) are the Hermite polynomials (an orthonormal basis w.r.t. Q).

\[
\|L^{\le D}\|^2 = \sum_{|\alpha| \le D} c_\alpha^2,
\qquad c_\alpha = \langle L, h_\alpha \rangle = \mathbb{E}_{Y \sim Q}[L(Y)\, h_\alpha(Y)]
\]

Result:
\[
\|L^{\le D}\|^2 = \sum_{d=0}^{D} \frac{1}{d!}\, \mathbb{E}_{X, X'}\!\left[\langle X, X' \rangle^d\right],
\]
where X' is an independent copy of X.
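As a sanity check on the final formula, here is a minimal Monte Carlo sketch (my own; the sparse Rademacher prior, the signal-strength scaling, and all parameter values are illustrative choices, not from the talk). It estimates \(\|L^{\le D}\|^2 = \sum_{d \le D} \frac{1}{d!}\,\mathbb{E}[\langle X, X'\rangle^d]\) by sampling independent pairs from the prior and shows how the quantity grows as the signal strength increases.

```python
import numpy as np
from math import factorial

rng = np.random.default_rng(2)

def sample_prior(N, k, snr):
    """Hypothetical sparse Rademacher prior: k nonzeros of +-1/sqrt(k), scaled by sqrt(snr)."""
    x = np.zeros(N)
    idx = rng.choice(N, size=k, replace=False)
    x[idx] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
    return np.sqrt(snr) * x

def low_degree_norm_sq(N, k, snr, D, trials=20000):
    """Monte Carlo estimate of ||L^{<=D}||^2 = sum_{d<=D} E[<X,X'>^d] / d!."""
    total = 0.0
    for _ in range(trials):
        ip = sample_prior(N, k, snr) @ sample_prior(N, k, snr)  # <X, X'> for independent copies
        total += sum(ip ** d / factorial(d) for d in range(D + 1))
    return total / trials

# Illustrative parameters (not from the talk).
N, k, D = 200, 10, 6
for snr in [0.5, 2.0, 8.0]:
    print(f"snr = {snr:4.1f}   ||L^<=D||^2 ~ {low_degree_norm_sq(N, k, snr, D):.3f}")
```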

References

For more on the low-degree method:
◮ Samuel B. Hopkins, PhD thesis '18: "Statistical Inference and the Sum of Squares Method"
  ◮ Connection to SoS
◮ Survey article: Kunisky, W, Bandeira, "Notes on Computational Hardness of Hypothesis Testing: Predictions using the Low-Degree Likelihood Ratio", arXiv:1907.11636

Part II: Sparse PCA

Based on: Ding, Kunisky, W., Bandeira, "Subexponential-Time Algorithms for Sparse PCA", arXiv:1907.11635

Spiked Wigner Model

Observe the n × n matrix Y = λxxᵀ + W
Signal: x ∈ ℝⁿ, ‖x‖ = 1
Noise: W ∈ ℝ^{n×n} symmetric with entries W_ij = W_ji ∼ N(0, 1/n) i.i.d.
λ > 0: signal-to-noise ratio

Goal: given Y, estimate the signal x.
Or, even simpler: distinguish (w.h.p.) Y from pure noise W.

Structure: suppose x is drawn from some prior, e.g.
◮ spherical (uniform on the unit sphere)
◮ Rademacher (i.i.d. ±1/√n)
◮ sparse
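For concreteness, a small sketch of how one might sample Y under each of these priors (my own illustration; the parameter values are arbitrary, and the symmetrization below gives diagonal noise variance 2/n, a common convention):

```python
import numpy as np

rng = np.random.default_rng(3)

def spherical_prior(n):
    """Uniform on the unit sphere."""
    x = rng.normal(size=n)
    return x / np.linalg.norm(x)

def rademacher_prior(n):
    """i.i.d. +-1/sqrt(n) entries."""
    return rng.choice([-1.0, 1.0], size=n) / np.sqrt(n)

def sparse_prior(n, k):
    """Hypothetical sparse prior: k nonzero entries of +-1/sqrt(k)."""
    x = np.zeros(n)
    idx = rng.choice(n, size=k, replace=False)
    x[idx] = rng.choice([-1.0, 1.0], size=k) / np.sqrt(k)
    return x

def spiked_wigner(x, lam):
    """Y = lam * x x^T + W, symmetric noise with off-diagonal variance 1/n."""
    n = len(x)
    G = rng.normal(size=(n, n))
    W = (G + G.T) / np.sqrt(2 * n)
    return lam * np.outer(x, x) + W

n, lam = 1000, 1.5
Y = spiked_wigner(sparse_prior(n, k=50), lam)
print(Y.shape, np.linalg.norm(Y))
```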

PCA (Principal Component Analysis)

Y = λxxᵀ + W

PCA: top eigenvalue λ₁(Y) and (unit-norm) eigenvector v₁.

Theorem (BBP '05, FP '06). Almost surely, as n → ∞:
◮ If λ ≤ 1: λ₁(Y) → 2 and ⟨x, v₁⟩ → 0
◮ If λ > 1: λ₁(Y) → λ + 1/λ > 2 and ⟨x, v₁⟩² → 1 − 1/λ² > 0

Sharp threshold: PCA can detect and recover the signal iff λ > 1.

[J. Baik, G. Ben Arous, S. Péché, AoP 2005; D. Féral, S. Péché, CMP 2006]
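A quick numerical check of this threshold (my own sketch; n and the values of λ are arbitrary, and finite-size effects are visible near λ = 1): simulate Y = λxxᵀ + W for λ on both sides of 1 and compare λ₁(Y) and ⟨x, v₁⟩² with the predicted limits.

```python
import numpy as np

rng = np.random.default_rng(4)

def top_eig(Y):
    """Top eigenvalue and unit-norm eigenvector of a symmetric matrix."""
    vals, vecs = np.linalg.eigh(Y)
    return vals[-1], vecs[:, -1]

n = 1000
x = rng.normal(size=n)
x /= np.linalg.norm(x)                      # spherical prior

for lam in [0.5, 0.9, 1.5, 3.0]:
    G = rng.normal(size=(n, n))
    W = (G + G.T) / np.sqrt(2 * n)          # Wigner noise, off-diagonal variance 1/n
    Y = lam * np.outer(x, x) + W
    l1, v1 = top_eig(Y)
    overlap = np.dot(x, v1) ** 2
    pred_l1 = lam + 1 / lam if lam > 1 else 2.0
    pred_ov = 1 - 1 / lam**2 if lam > 1 else 0.0
    print(f"lam={lam:3.1f}  lambda_1={l1:5.3f} (pred {pred_l1:5.3f})  "
          f"overlap={overlap:5.3f} (pred {pred_ov:5.3f})")
```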

Is PCA Optimal?

PCA does not exploit structure of the signal x.

Is the PCA threshold (λ = 1) optimal?
◮ Is it statistically possible to detect/recover when λ < 1?
