Linear regression without correspondence

Daniel Hsu (Columbia University)
October 3, 2017

Joint work with Kevin Shi (Columbia University) and Xiaorui Sun (Microsoft Research).


1. Alternating minimization
Pick an initial $\hat{w} \in \mathbb{R}^d$ (e.g., randomly).
Loop until convergence:
  $\hat{\pi} \leftarrow \arg\min_{\pi \in S_n} \sum_{i=1}^n \big( y_i - \hat{w}^\top x_{\pi(i)} \big)^2$
  $\hat{w} \leftarrow \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^n \big( y_i - w^\top x_{\hat{\pi}(i)} \big)^2$
▶ Each loop iteration is efficiently computable.
▶ But the procedure can get stuck in local minima, so try many initial $\hat{w} \in \mathbb{R}^d$.
(Questions: How many restarts? How many iterations?)
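A minimal Python sketch of this heuristic (not from the talk): the $\hat{\pi}$-step uses the fact that matching sorted responses to sorted predictions minimizes the squared loss, and the $\hat{w}$-step is ordinary least squares. Function and parameter names (fit_alternating, n_restarts, ...) are illustrative.

```python
import numpy as np

def best_perm(y, z):
    # permutation minimizing sum_i (y_i - z_{perm(i)})^2: match sorted order
    perm = np.empty(len(y), dtype=int)
    perm[np.argsort(y)] = np.argsort(z)
    return perm

def fit_alternating(X, y, n_restarts=20, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best_w, best_obj = None, np.inf
    for _ in range(n_restarts):
        w = rng.standard_normal(d)                 # random initial w-hat
        for _ in range(n_iters):
            perm = best_perm(y, X @ w)             # pi-step
            w_new, *_ = np.linalg.lstsq(X[perm], y, rcond=None)  # w-step
            if np.allclose(w_new, w):
                break
            w = w_new
        perm = best_perm(y, X @ w)
        obj = np.sum((y - X[perm] @ w) ** 2)
        if obj < best_obj:
            best_obj, best_w = obj, w
    return best_w, best_obj
```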

2. Approximation result
Theorem. There is an algorithm that, given any inputs $(x_i)_{i=1}^n$, $(y_i)_{i=1}^n$, and $\epsilon \in (0,1)$, returns a $(1+\epsilon)$-approximate solution to the least squares problem in time $(n/\epsilon)^{O(k)} + \mathrm{poly}(n,d)$, where $k = \dim(\mathrm{span}((x_i)_{i=1}^n))$.

3. Beating brute-force search: "realizable" case
"Realizable" case: suppose there exist $w_\star \in \mathbb{R}^d$ and $\pi_\star \in S_n$ such that
  $y_i = w_\star^\top x_{\pi_\star(i)}$, $i \in [n]$.
The solution is determined by the action of $\pi_\star$ on $d$ points (assume $\dim(\mathrm{span}((x_i)_{i=1}^n)) = d$).
Algorithm:
▶ Find a subset of $d$ linearly independent points $x_{i_1}, x_{i_2}, \dots, x_{i_d}$.
▶ "Guess" the values of $\pi_\star^{-1}(i_j) \in [n]$, $j \in [d]$.
▶ Solve the linear system $y_{\pi_\star^{-1}(i_j)} = w^\top x_{i_j}$, $j \in [d]$, for $w \in \mathbb{R}^d$.
▶ To check correctness of $\hat{w}$: compute $\hat{y}_i := \hat{w}^\top x_i$, $i \in [n]$, and check whether $\min_{\pi \in S_n} \sum_{i=1}^n (y_i - \hat{y}_{\pi(i)})^2 = 0$.
"Guess" means "enumerate over $\le n^d$ choices"; the rest is $\mathrm{poly}(n,d)$.
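Below is a brute-force sketch of this procedure for small $d$ (the guessing step enumerates up to $n^d$ assignments), assuming exactly realizable, noise-free data with $\mathrm{rank}(X) = d$; the names are illustrative, not the talk's implementation. The final check uses the fact that the minimum over permutations is zero exactly when the predicted and observed values agree as multi-sets (i.e., after sorting).

```python
import itertools
import numpy as np

def solve_realizable(X, y, tol=1e-8):
    n, d = X.shape
    # greedily pick d linearly independent rows (assumes rank(X) = d)
    idx = []
    for i in range(n):
        if np.linalg.matrix_rank(X[idx + [i]]) == len(idx) + 1:
            idx.append(i)
        if len(idx) == d:
            break
    # guess which response value goes with each of the d chosen covariates
    for guess in itertools.permutations(range(n), d):
        w = np.linalg.solve(X[idx], y[list(guess)])
        # correct iff sorted predictions match sorted observations exactly
        if np.allclose(np.sort(X @ w), np.sort(y), atol=tol):
            return w
    return None

# tiny demo on realizable data
rng = np.random.default_rng(0)
n, d = 8, 2
X, w_true = rng.standard_normal((n, d)), rng.standard_normal(d)
y = (X @ w_true)[rng.permutation(n)]
print(solve_realizable(X, y), w_true)   # should agree
```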

4. Beating brute-force search: general case
General case: the solution may not be determined by only $d$ points.
But, for any right-hand side $b \in \mathbb{R}^n$, there exist $x_{i_1}, x_{i_2}, \dots, x_{i_d}$ such that every $\hat{w} \in \arg\min_{w \in \mathbb{R}^d} \sum_{j=1}^d (b_{i_j} - w^\top x_{i_j})^2$ satisfies
  $\sum_{i=1}^n \big( b_i - \hat{w}^\top x_i \big)^2 \le (d+1) \cdot \min_{w \in \mathbb{R}^d} \sum_{i=1}^n \big( b_i - w^\top x_i \big)^2$.
(Follows from a result of Dereziński and Warmuth (2017) on volume sampling.)
⟹ an $n^{O(d)}$-time algorithm with approximation ratio $d+1$, or an $n^{O(d/\epsilon)}$-time algorithm with approximation ratio $1+\epsilon$.
A better way to get $1+\epsilon$: exploit the first-order optimality conditions (i.e., the "normal equations") and $\epsilon$-nets.
Overall time: $(n/\epsilon)^{O(k)} + \mathrm{poly}(n,d)$ for $k = \dim(\mathrm{span}((x_i)_{i=1}^n))$.
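As a quick numerical illustration of the quoted volume-sampling fact (my addition, not part of the talk), the sketch below enumerates all size-$d$ subsets of one small random instance and checks that the best subset-restricted least-squares fit is within a factor $d+1$ of the optimal full-data cost.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d = 12, 3
X, b = rng.standard_normal((n, d)), rng.standard_normal(n)

def full_loss(w):
    return np.sum((b - X @ w) ** 2)

opt = full_loss(np.linalg.lstsq(X, b, rcond=None)[0])
best = min(
    full_loss(np.linalg.lstsq(X[list(S)], b[list(S)], rcond=None)[0])
    for S in itertools.combinations(range(n), d)
)
print(best <= (d + 1) * opt)   # expected: True
```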

5. Remarks
▶ The algorithm is justified in the statistical setting by results of [PWC'16] for the MLE, but the guarantees also hold when the inputs are worst-case.
▶ The algorithm is poly-time only when $k = O(1)$.
Open problems:
1. A poly-time approximation algorithm when $k = \omega(1)$. (Perhaps in an average-case or smoothed setting.)
2. A (smoothed) analysis of alternating minimization, similar to Lloyd's algorithm for Euclidean $k$-means.
Next: an algorithm for the noise-free average-case setting.

6. Part 2: Exact recovery in the noise-free Gaussian setting

7. Setting
Noise-free Gaussian linear model (with $n+1$ measurements):
  $y_i = \bar{w}^\top x_{\bar{\pi}(i)}$, $i \in \{0, 1, \dots, n\}$
▶ Covariate vectors: $(x_i)_{i=0}^n$ iid from $\mathrm{N}(0, I_d)$
▶ Unknown linear function: $\bar{w} \in \mathbb{R}^d$
▶ Unknown permutation: $\bar{\pi} \in S_{\{0,1,\dots,n\}}$
"Equivalent" problem: we are promised that $\bar{\pi}(0) = 0$, so we can just consider $\bar{\pi}$ as an unknown permutation over $\{1, 2, \dots, n\}$.
Number of measurements: if $n + 1 \ge d$, then recovery of $\bar{\pi}$ gives exact recovery of $\bar{w}$ (a.s.). We'll assume $n + 1 \ge d + 1$ (i.e., $n \ge d$).
Claim: $n \ge d$ suffices to recover $\bar{\pi}$ with high probability.

8. Exact recovery result
Theorem. Fix any $\bar{w} \in \mathbb{R}^d$ and $\bar{\pi} \in S_n$, and assume $n \ge d$. Suppose $(x_i)_{i=0}^n$ are drawn iid from $\mathrm{N}(0, I_d)$, and $(y_i)_{i=0}^n$ satisfy
  $y_0 = \bar{w}^\top x_0$; $\quad y_i = \bar{w}^\top x_{\bar{\pi}(i)}$, $i \in [n]$.
There is an algorithm that, given inputs $(x_i)_{i=0}^n$ and $(y_i)_{i=0}^n$, returns $\bar{\pi}$ and $\bar{w}$ with high probability.

9. Main idea: hidden subset
Measurements:
  $y_0 = \bar{w}^\top x_0$; $\quad y_i = \bar{w}^\top x_{\bar{\pi}(i)}$, $i \in [n]$.
For simplicity: assume $n = d$, and $x_1, x_2, \dots, x_d$ orthonormal. Then
  $y_0 = \bar{w}^\top x_0 = \sum_{j=1}^d (\bar{w}^\top x_j)(x_j^\top x_0) = \sum_{j=1}^d y_{\bar{\pi}^{-1}(j)} \, (x_j^\top x_0) = \sum_{i=1}^d \sum_{j=1}^d \mathbf{1}\{\bar{\pi}(i) = j\} \cdot y_i \, (x_j^\top x_0)$.

10. Reduction to subset sum
  $y_0 = \sum_{i=1}^d \sum_{j=1}^d \mathbf{1}\{\bar{\pi}(i) = j\} \cdot \underbrace{y_i \, (x_j^\top x_0)}_{c_{i,j}}$
▶ $d^2$ "source" numbers $c_{i,j} := y_i (x_j^\top x_0)$, "target" sum $T := y_0$. We are promised that a size-$d$ subset of the $c_{i,j}$ sums to $T$.
▶ The correct subset corresponds to the pairs $(i,j) \in [d]^2$ such that $\bar{\pi}(i) = j$.
Next: How to solve Subset Sum efficiently?
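The snippet below is a toy illustration of this reduction (not the talk's implementation) in the simplified setting $n = d$ with orthonormal covariates: it builds the $d^2$ source numbers $c_{i,j}$ and confirms that the subset selected by the true permutation sums to the target $y_0$. Variable names are illustrative, and indices are 0-based.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # rows of Q: orthonormal x_1, ..., x_d
x0 = rng.standard_normal(d)                       # extra measurement point x_0
w_bar = rng.standard_normal(d)
pi_bar = rng.permutation(d)                       # pi_bar[i] = j means y_i pairs with x_j

y0 = w_bar @ x0                                   # target sum T
y = Q[pi_bar] @ w_bar                             # y_i = w_bar^T x_{pi_bar(i)}

C = np.outer(y, Q @ x0)                           # source numbers C[i, j] = y_i * (x_j^T x_0)
correct_subset_sum = sum(C[i, pi_bar[i]] for i in range(d))
print(np.isclose(correct_subset_sum, y0))         # expected: True
```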

11. Reducing subset sum to the shortest vector problem
Lagarias & Odlyzko (1983): random instances of Subset Sum are efficiently solvable when the $N$ source numbers are chosen independently and u.a.r. from a sufficiently wide interval of $\mathbb{Z}$.
Main idea: (w.h.p.) every incorrect subset will "miss" the target sum $T$ by a noticeable amount.
Reduction: construct a lattice basis in $\mathbb{R}^{N+1}$ such that
▶ the correct subset of basis vectors gives a short lattice vector $v_\star$;
▶ any other lattice vector $\not\propto v_\star$ is more than $2^{N/2}$-times longer.
  $\begin{bmatrix} b_0 & b_1 & \cdots & b_N \end{bmatrix} = \begin{bmatrix} 0 & I_N \\ \beta T & -\beta c_1 \cdots -\beta c_N \end{bmatrix}$ for sufficiently large $\beta > 0$.
Using the Lenstra, Lenstra, & Lovász (1982) algorithm to find an approximately-shortest vector reveals the correct subset.
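Here is a sketch (under my own assumptions, not from the talk) that rebuilds the toy instance from the previous snippet and forms the Lagarias-Odlyzko basis above, verifying only that the correct subset yields a short lattice vector. Actually recovering the subset would require running an LLL implementation (e.g., fpylll or Sage) on a suitably scaled integer version of this basis, which is not done here; the choice of $\beta$ is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # same toy instance as before
x0, w_bar = rng.standard_normal(d), rng.standard_normal(d)
pi_bar = rng.permutation(d)
y0 = w_bar @ x0
y = Q[pi_bar] @ w_bar
c = np.outer(y, Q @ x0).ravel()        # N = d^2 source numbers c_{i,j}, row-major
N, T = c.size, y0
beta = 2.0 ** (N / 2) * 10.0           # "sufficiently large" beta (illustrative)

# columns: b_0 = (0, ..., 0, beta*T), b_j = (e_j, -beta*c_j) for j = 1, ..., N
basis = np.zeros((N + 1, N + 1))
basis[:N, 1:] = np.eye(N)
basis[N, 0] = beta * T
basis[N, 1:] = -beta * c

# the correct subset (flattened pairs (i, pi_bar(i))) gives the short vector
# v = b_0 + sum_{j in subset} b_j, whose last coordinate beta*(T - sum) is ~0
subset = [i * d + pi_bar[i] for i in range(d)]
v = basis[:, 0] + basis[:, [j + 1 for j in subset]].sum(axis=1)
print(np.linalg.norm(v), np.isclose(v[-1], 0.0))   # length ~ sqrt(d), last coord ~ 0
```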

12. Our random subset sum instance
Catch: our source numbers $c_{i,j} = y_i \, x_j^\top x_0$ are not independent, and not uniformly distributed on some wide interval of $\mathbb{Z}$.
▶ Instead, they have some joint density derived from $\mathrm{N}(0,1)$.
▶ To show that the Lagarias & Odlyzko reduction still works, we need Gaussian anti-concentration for quadratic and quartic forms.
Key lemma: (w.h.p.) for every $Z \in \mathbb{Z}^{d \times d}$ that is not an integer multiple of the permutation matrix corresponding to $\bar{\pi}$,
  $\Big| T - \sum_{i,j} Z_{i,j} \cdot c_{i,j} \Big| \ge \frac{1}{2^{\mathrm{poly}(d)}} \cdot \|\bar{w}\|_2$.

13. Some details
▶ In general, $x_1, x_2, \dots, x_n$ are not (exactly) orthonormal, but a similar reduction works via the Moore-Penrose pseudoinverse.
▶ The reduction uses real coefficients in the lattice basis. For LLL to run in poly-time, we need to round the $(x_i)_{i=0}^n$ and $\bar{w}$ coefficients to finite-precision rational numbers. This is similar to drawing $(x_i)_{i=0}^n$ iid from a discretized $\mathrm{N}(0, I_d)$.
▶ The algorithm strongly exploits the assumption of noise-free measurements; it likely fails in the presence of noise.
▶ A similar algorithm was used by Andoni, H., Shi, & Sun (2017) for different problems (phase retrieval / correspondence retrieval).

14. Connections to prior works
▶ Unnikrishnan, Haghighatshoar, & Vetterli (2015)
  Recall: [UHV'15] show that $n \ge 2d$ measurements are necessary to uniquely determine every $\bar{w} \in \mathbb{R}^d$.
  Our result: for a fixed $\bar{w} \in \mathbb{R}^d$, $d+1$ measurements suffice to recover $\bar{w}$; the same covariate vectors may fail for some other $\bar{w}' \in \mathbb{R}^d$.
  (Cf. "for all" vs. "for each" results in compressive sensing.)
▶ Pananjady, Wainwright, & Courtade (2016)
  Noise-free setting: signal-to-noise conditions are trivially satisfied whenever $\bar{w} \ne 0$.
  Noisy setting: recovering $\bar{w}$ may be easier than recovering $\bar{\pi}$.
Next: limits for recovering $\bar{w}$.

15. Part 3: Lower bounds on SNR for approximate recovery

16. Setting
Linear model with Gaussian noise:
  $y_i = \bar{w}^\top x_{\bar{\pi}(i)} + \varepsilon_i$, $i \in [n]$
▶ Covariate vectors: $(x_i)_{i=1}^n$ iid from $P$
▶ Measurement errors: $(\varepsilon_i)_{i=1}^n$ iid from $\mathrm{N}(0, \sigma^2)$
▶ Unknown linear function: $\bar{w} \in \mathbb{R}^d$
▶ Unknown permutation: $\bar{\pi} \in S_n$
Equivalent: ignore $\bar{\pi}$; observe $(x_i)_{i=1}^n$ and $\{y_i\}_{i=1}^n$ (viewed as an unordered multi-set).
We consider $P = \mathrm{N}(0, I_d)$ and $P = \mathrm{Uniform}([-1,1]^d)$.
Note: if the correspondence between $(x_i)_{i=1}^n$ and $\{y_i\}_{i=1}^n$ is known, then $\mathrm{SNR} \gtrsim d/n$ already suffices to approximately recover $\bar{w}$.

17. Uniform case
Theorem. If $(x_i)_{i=1}^n$ are iid draws from $\mathrm{Uniform}([-1,1]^d)$, $(y_i)_{i=1}^n$ follow the linear model with $\mathrm{N}(0, \sigma^2)$ noise, and $\mathrm{SNR} \le (1-2c)^2$ for some $c \in (0, 1/2)$, then for any estimator $\hat{w}$, there exists $\bar{w} \in \mathbb{R}^d$ such that
  $\mathbb{E}\big[ \|\hat{w} - \bar{w}\|_2 \big] \ge c \, \|\bar{w}\|_2$.
Increasing the sample size $n$ does not help, unlike in the "known correspondence" setting (where $\mathrm{SNR} \gtrsim d/n$ suffices).

18. Proof sketch
We show that no estimator can confidently distinguish between $\bar{w} = e_1$ and $\bar{w} = -e_1$, where $e_1 = (1, 0, \dots, 0)^\top$.
Let $P_{\bar{w}}$ be the data distribution with parameter $\bar{w} \in \{e_1, -e_1\}$.
Task: show that $P_{e_1}$ and $P_{-e_1}$ are "close", then appeal to Le Cam's standard "two-point argument":
  $\max_{\bar{w} \in \{e_1, -e_1\}} \mathbb{E}_{P_{\bar{w}}} \|\hat{w} - \bar{w}\|_2 \ge 1 - \|P_{e_1} - P_{-e_1}\|_{\mathrm{tv}}$.
Key idea: the conditional means of $\{y_i\}_{i=1}^n$ given $(x_i)_{i=1}^n$, under $P_{e_1}$ and $P_{-e_1}$, are close as unordered multi-sets.

19. Proof sketch (continued)
Generative process for $P_{\bar{w}}$:
1. Draw $(x_i)_{i=1}^n$ iid from $\mathrm{Uniform}([-1,1]^d)$ and $(\varepsilon_i)_{i=1}^n$ iid from $\mathrm{N}(0, \sigma^2)$.
2. Set $u_i := \bar{w}^\top x_i$ for $i \in [n]$.
3. Set $y_i := u_{(i)} + \varepsilon_i$ for $i \in [n]$, where $u_{(1)} \le u_{(2)} \le \cdots \le u_{(n)}$.
Conditional distribution of $y = (y_1, y_2, \dots, y_n)$ given $(x_i)_{i=1}^n$:
  Under $P_{e_1}$: $\quad y \mid (x_i)_{i=1}^n \sim \mathrm{N}(u^\uparrow, \sigma^2 I_n)$
  Under $P_{-e_1}$: $\quad y \mid (x_i)_{i=1}^n \sim \mathrm{N}(-u^\downarrow, \sigma^2 I_n)$
where $u^\uparrow = (u_{(1)}, u_{(2)}, \dots, u_{(n)})$ and $u^\downarrow = (u_{(n)}, u_{(n-1)}, \dots, u_{(1)})$.
Data processing: we lose information by going from $y$ to the multi-set $\{y_i\}_{i=1}^n$.

20. Proof sketch (continued)
By the data processing inequality,
  $\mathrm{KL}\big( P_{e_1}(\cdot \mid (x_i)_{i=1}^n), \, P_{-e_1}(\cdot \mid (x_i)_{i=1}^n) \big) \le \mathrm{KL}\big( \mathrm{N}(u^\uparrow, \sigma^2 I_n), \, \mathrm{N}(-u^\downarrow, \sigma^2 I_n) \big) = \frac{\|u^\uparrow - (-u^\downarrow)\|_2^2}{2\sigma^2} = \frac{\mathrm{SNR}}{2} \cdot \|u^\uparrow + u^\downarrow\|_2^2$.
Some computations show that $\mathrm{med} \, \|u^\uparrow + u^\downarrow\|_2^2 \le 4$.
By conditioning + Pinsker's inequality,
  $\|P_{e_1} - P_{-e_1}\|_{\mathrm{tv}} \le \frac{1}{2} + \frac{1}{2} \sqrt{\frac{\mathrm{SNR}}{4} \cdot \mathrm{med} \, \|u^\uparrow + u^\downarrow\|_2^2} \le \frac{1}{2} + \frac{1}{2} \sqrt{\mathrm{SNR}}$.
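A quick Monte Carlo sanity check (my addition, not part of the talk) of the median bound above: it simulates $u_i = e_1^\top x_i$ for $x_i \sim \mathrm{Uniform}([-1,1]^d)$, so only the first coordinate matters and $d$ does not appear explicitly; the reported median should come out below 4.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 2000
norms_sq = np.empty(trials)
for t in range(trials):
    u = rng.uniform(-1.0, 1.0, size=n)   # u_i = e_1^T x_i, first coordinate only
    u_up = np.sort(u)                    # ascending order statistics
    u_down = u_up[::-1]                  # descending order statistics
    norms_sq[t] = np.sum((u_up + u_down) ** 2)
print(np.median(norms_sq))               # empirically well below the bound of 4
```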

21. Gaussian case
Theorem. If $(x_i)_{i=1}^n$ are iid draws from $\mathrm{N}(0, I_d)$, $(y_i)_{i=1}^n$ follow the linear model with $\mathrm{N}(0, \sigma^2)$ noise, and
  $\mathrm{SNR} \le C \cdot \frac{d}{\log\log(n)}$
for some absolute constant $C > 0$, then for any estimator $\hat{w}$, there exists $\bar{w} \in \mathbb{R}^d$ such that
  $\mathbb{E}\big[ \|\hat{w} - \bar{w}\|_2 \big] \ge C' \, \|\bar{w}\|_2$
for some other absolute constant $C' > 0$.
Cf. the "known correspondence" setting, where $\mathrm{SNR} \gtrsim d/n$ suffices.
