Linear regression without correspondence

Daniel Hsu (Columbia University)
October 3, 2017

Joint work with Kevin Shi (Columbia University) and Xiaorui Sun (Microsoft Research).


1. Alternating minimization
Pick an initial $\hat{w} \in \mathbb{R}^d$ (e.g., randomly).
Loop until convergence:
  $\hat{\pi} \leftarrow \arg\min_{\pi \in S_n} \sum_{i=1}^n \big( y_i - \hat{w}^\top x_{\pi(i)} \big)^2$
  $\hat{w} \leftarrow \arg\min_{w \in \mathbb{R}^d} \sum_{i=1}^n \big( y_i - w^\top x_{\hat{\pi}(i)} \big)^2$
▶ Each loop iteration is efficiently computable.
▶ But the procedure can get stuck in local minima, so try many initial $\hat{w} \in \mathbb{R}^d$.
(Questions: How many restarts? How many iterations?)
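A minimal Python sketch of this heuristic (not from the talk): the $\hat{\pi}$-step uses the fact that matching sorted responses to sorted predictions minimizes the squared loss, and the $\hat{w}$-step is ordinary least squares. Function and parameter names (fit_alternating, n_restarts, ...) are illustrative.

```python
import numpy as np

def best_perm(y, z):
    # permutation minimizing sum_i (y_i - z_{perm(i)})^2: match sorted order
    perm = np.empty(len(y), dtype=int)
    perm[np.argsort(y)] = np.argsort(z)
    return perm

def fit_alternating(X, y, n_restarts=20, n_iters=100, seed=0):
    rng = np.random.default_rng(seed)
    n, d = X.shape
    best_w, best_obj = None, np.inf
    for _ in range(n_restarts):
        w = rng.standard_normal(d)                 # random initial w-hat
        for _ in range(n_iters):
            perm = best_perm(y, X @ w)             # pi-step
            w_new, *_ = np.linalg.lstsq(X[perm], y, rcond=None)  # w-step
            if np.allclose(w_new, w):
                break
            w = w_new
        perm = best_perm(y, X @ w)
        obj = np.sum((y - X[perm] @ w) ** 2)
        if obj < best_obj:
            best_obj, best_w = obj, w
    return best_w, best_obj
```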

2. Approximation result
Theorem. There is an algorithm that, given any inputs $(x_i)_{i=1}^n$, $(y_i)_{i=1}^n$, and $\epsilon \in (0,1)$, returns a $(1+\epsilon)$-approximate solution to the least squares problem in time $(n/\epsilon)^{O(k)} + \mathrm{poly}(n,d)$, where $k = \dim(\mathrm{span}((x_i)_{i=1}^n))$.

3. Beating brute-force search: "realizable" case
"Realizable" case: suppose there exist $w_\star \in \mathbb{R}^d$ and $\pi_\star \in S_n$ such that
  $y_i = w_\star^\top x_{\pi_\star(i)}$, $i \in [n]$.
The solution is determined by the action of $\pi_\star$ on $d$ points (assume $\dim(\mathrm{span}((x_i)_{i=1}^n)) = d$).
Algorithm:
▶ Find a subset of $d$ linearly independent points $x_{i_1}, x_{i_2}, \dots, x_{i_d}$.
▶ "Guess" the values of $\pi_\star^{-1}(i_j) \in [n]$, $j \in [d]$.
▶ Solve the linear system $y_{\pi_\star^{-1}(i_j)} = w^\top x_{i_j}$, $j \in [d]$, for $w \in \mathbb{R}^d$.
▶ To check correctness of $\hat{w}$: compute $\hat{y}_i := \hat{w}^\top x_i$, $i \in [n]$, and check whether $\min_{\pi \in S_n} \sum_{i=1}^n (y_i - \hat{y}_{\pi(i)})^2 = 0$.
"Guess" means "enumerate over $\le n^d$ choices"; the rest is $\mathrm{poly}(n,d)$.
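Below is a brute-force sketch of this procedure for small $d$ (the guessing step enumerates up to $n^d$ assignments), assuming exactly realizable, noise-free data with $\mathrm{rank}(X) = d$; the names are illustrative, not the talk's implementation. The final check uses the fact that the minimum over permutations is zero exactly when the predicted and observed values agree as multi-sets (i.e., after sorting).

```python
import itertools
import numpy as np

def solve_realizable(X, y, tol=1e-8):
    n, d = X.shape
    # greedily pick d linearly independent rows (assumes rank(X) = d)
    idx = []
    for i in range(n):
        if np.linalg.matrix_rank(X[idx + [i]]) == len(idx) + 1:
            idx.append(i)
        if len(idx) == d:
            break
    # guess which response value goes with each of the d chosen covariates
    for guess in itertools.permutations(range(n), d):
        w = np.linalg.solve(X[idx], y[list(guess)])
        # correct iff sorted predictions match sorted observations exactly
        if np.allclose(np.sort(X @ w), np.sort(y), atol=tol):
            return w
    return None

# tiny demo on realizable data
rng = np.random.default_rng(0)
n, d = 8, 2
X, w_true = rng.standard_normal((n, d)), rng.standard_normal(d)
y = (X @ w_true)[rng.permutation(n)]
print(solve_realizable(X, y), w_true)   # should agree
```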

4. Beating brute-force search: general case
General case: the solution may not be determined by only $d$ points.
But, for any right-hand side $b \in \mathbb{R}^n$, there exist $x_{i_1}, x_{i_2}, \dots, x_{i_d}$ such that every $\hat{w} \in \arg\min_{w \in \mathbb{R}^d} \sum_{j=1}^d (b_{i_j} - w^\top x_{i_j})^2$ satisfies
  $\sum_{i=1}^n \big( b_i - \hat{w}^\top x_i \big)^2 \le (d+1) \cdot \min_{w \in \mathbb{R}^d} \sum_{i=1}^n \big( b_i - w^\top x_i \big)^2$.
(Follows from a result of Dereziński and Warmuth (2017) on volume sampling.)
⟹ an $n^{O(d)}$-time algorithm with approximation ratio $d+1$, or an $n^{O(d/\epsilon)}$-time algorithm with approximation ratio $1+\epsilon$.
A better way to get $1+\epsilon$: exploit the first-order optimality conditions (i.e., the "normal equations") and $\epsilon$-nets.
Overall time: $(n/\epsilon)^{O(k)} + \mathrm{poly}(n,d)$ for $k = \dim(\mathrm{span}((x_i)_{i=1}^n))$.
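As a quick numerical illustration of the quoted volume-sampling fact (my addition, not part of the talk), the sketch below enumerates all size-$d$ subsets of one small random instance and checks that the best subset-restricted least-squares fit is within a factor $d+1$ of the optimal full-data cost.

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)
n, d = 12, 3
X, b = rng.standard_normal((n, d)), rng.standard_normal(n)

def full_loss(w):
    return np.sum((b - X @ w) ** 2)

opt = full_loss(np.linalg.lstsq(X, b, rcond=None)[0])
best = min(
    full_loss(np.linalg.lstsq(X[list(S)], b[list(S)], rcond=None)[0])
    for S in itertools.combinations(range(n), d)
)
print(best <= (d + 1) * opt)   # expected: True
```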

5. Remarks
▶ The algorithm is justified in the statistical setting by results of [PWC'16] for the MLE, but the guarantees also hold when the inputs are worst-case.
▶ The algorithm is poly-time only when $k = O(1)$.
Open problems:
1. A poly-time approximation algorithm when $k = \omega(1)$. (Perhaps in an average-case or smoothed setting.)
2. A (smoothed) analysis of alternating minimization, similar to Lloyd's algorithm for Euclidean $k$-means.
Next: an algorithm for the noise-free average-case setting.

6. Part 2: Exact recovery in the noise-free Gaussian setting

7. Setting
Noise-free Gaussian linear model (with $n+1$ measurements):
  $y_i = \bar{w}^\top x_{\bar{\pi}(i)}$, $i \in \{0, 1, \dots, n\}$
▶ Covariate vectors: $(x_i)_{i=0}^n$ iid from $\mathrm{N}(0, I_d)$
▶ Unknown linear function: $\bar{w} \in \mathbb{R}^d$
▶ Unknown permutation: $\bar{\pi} \in S_{\{0,1,\dots,n\}}$
"Equivalent" problem: we are promised that $\bar{\pi}(0) = 0$, so we can just consider $\bar{\pi}$ as an unknown permutation over $\{1, 2, \dots, n\}$.
Number of measurements: if $n + 1 \ge d$, then recovery of $\bar{\pi}$ gives exact recovery of $\bar{w}$ (a.s.). We'll assume $n + 1 \ge d + 1$ (i.e., $n \ge d$).
Claim: $n \ge d$ suffices to recover $\bar{\pi}$ with high probability.

8. Exact recovery result
Theorem. Fix any $\bar{w} \in \mathbb{R}^d$ and $\bar{\pi} \in S_n$, and assume $n \ge d$. Suppose $(x_i)_{i=0}^n$ are drawn iid from $\mathrm{N}(0, I_d)$, and $(y_i)_{i=0}^n$ satisfy
  $y_0 = \bar{w}^\top x_0$; $\quad y_i = \bar{w}^\top x_{\bar{\pi}(i)}$, $i \in [n]$.
There is an algorithm that, given inputs $(x_i)_{i=0}^n$ and $(y_i)_{i=0}^n$, returns $\bar{\pi}$ and $\bar{w}$ with high probability.

9. Main idea: hidden subset
Measurements:
  $y_0 = \bar{w}^\top x_0$; $\quad y_i = \bar{w}^\top x_{\bar{\pi}(i)}$, $i \in [n]$.
For simplicity: assume $n = d$, and $x_1, x_2, \dots, x_d$ orthonormal. Then
  $y_0 = \bar{w}^\top x_0 = \sum_{j=1}^d (\bar{w}^\top x_j)(x_j^\top x_0) = \sum_{j=1}^d y_{\bar{\pi}^{-1}(j)} \, (x_j^\top x_0) = \sum_{i=1}^d \sum_{j=1}^d \mathbf{1}\{\bar{\pi}(i) = j\} \cdot y_i \, (x_j^\top x_0)$.

10. Reduction to subset sum
  $y_0 = \sum_{i=1}^d \sum_{j=1}^d \mathbf{1}\{\bar{\pi}(i) = j\} \cdot \underbrace{y_i \, (x_j^\top x_0)}_{c_{i,j}}$
▶ $d^2$ "source" numbers $c_{i,j} := y_i (x_j^\top x_0)$, "target" sum $T := y_0$. We are promised that a size-$d$ subset of the $c_{i,j}$ sums to $T$.
▶ The correct subset corresponds to the pairs $(i,j) \in [d]^2$ such that $\bar{\pi}(i) = j$.
Next: How to solve Subset Sum efficiently?
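The snippet below is a toy illustration of this reduction (not the talk's implementation) in the simplified setting $n = d$ with orthonormal covariates: it builds the $d^2$ source numbers $c_{i,j}$ and confirms that the subset selected by the true permutation sums to the target $y_0$. Variable names are illustrative, and indices are 0-based.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # rows of Q: orthonormal x_1, ..., x_d
x0 = rng.standard_normal(d)                       # extra measurement point x_0
w_bar = rng.standard_normal(d)
pi_bar = rng.permutation(d)                       # pi_bar[i] = j means y_i pairs with x_j

y0 = w_bar @ x0                                   # target sum T
y = Q[pi_bar] @ w_bar                             # y_i = w_bar^T x_{pi_bar(i)}

C = np.outer(y, Q @ x0)                           # source numbers C[i, j] = y_i * (x_j^T x_0)
correct_subset_sum = sum(C[i, pi_bar[i]] for i in range(d))
print(np.isclose(correct_subset_sum, y0))         # expected: True
```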

11. Reducing subset sum to the shortest vector problem
Lagarias & Odlyzko (1983): random instances of Subset Sum are efficiently solvable when the $N$ source numbers are chosen independently and u.a.r. from a sufficiently wide interval of $\mathbb{Z}$.
Main idea: (w.h.p.) every incorrect subset will "miss" the target sum $T$ by a noticeable amount.
Reduction: construct a lattice basis in $\mathbb{R}^{N+1}$ such that
▶ the correct subset of basis vectors gives a short lattice vector $v_\star$;
▶ any other lattice vector $\not\propto v_\star$ is more than $2^{N/2}$-times longer.
  $\begin{bmatrix} b_0 & b_1 & \cdots & b_N \end{bmatrix} = \begin{bmatrix} 0 & I_N \\ \beta T & -\beta c_1 \cdots -\beta c_N \end{bmatrix}$ for sufficiently large $\beta > 0$.
Using the Lenstra, Lenstra, & Lovász (1982) algorithm to find an approximately-shortest vector reveals the correct subset.
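Here is a sketch (under my own assumptions, not from the talk) that rebuilds the toy instance from the previous snippet and forms the Lagarias-Odlyzko basis above, verifying only that the correct subset yields a short lattice vector. Actually recovering the subset would require running an LLL implementation (e.g., fpylll or Sage) on a suitably scaled integer version of this basis, which is not done here; the choice of $\beta$ is purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 5
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))  # same toy instance as before
x0, w_bar = rng.standard_normal(d), rng.standard_normal(d)
pi_bar = rng.permutation(d)
y0 = w_bar @ x0
y = Q[pi_bar] @ w_bar
c = np.outer(y, Q @ x0).ravel()        # N = d^2 source numbers c_{i,j}, row-major
N, T = c.size, y0
beta = 2.0 ** (N / 2) * 10.0           # "sufficiently large" beta (illustrative)

# columns: b_0 = (0, ..., 0, beta*T), b_j = (e_j, -beta*c_j) for j = 1, ..., N
basis = np.zeros((N + 1, N + 1))
basis[:N, 1:] = np.eye(N)
basis[N, 0] = beta * T
basis[N, 1:] = -beta * c

# the correct subset (flattened pairs (i, pi_bar(i))) gives the short vector
# v = b_0 + sum_{j in subset} b_j, whose last coordinate beta*(T - sum) is ~0
subset = [i * d + pi_bar[i] for i in range(d)]
v = basis[:, 0] + basis[:, [j + 1 for j in subset]].sum(axis=1)
print(np.linalg.norm(v), np.isclose(v[-1], 0.0))   # length ~ sqrt(d), last coord ~ 0
```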

12. Our random subset sum instance
Catch: our source numbers $c_{i,j} = y_i \, x_j^\top x_0$ are not independent, and not uniformly distributed on some wide interval of $\mathbb{Z}$.
▶ Instead, they have some joint density derived from $\mathrm{N}(0,1)$.
▶ To show that the Lagarias & Odlyzko reduction still works, we need Gaussian anti-concentration for quadratic and quartic forms.
Key lemma: (w.h.p.) for every $Z \in \mathbb{Z}^{d \times d}$ that is not an integer multiple of the permutation matrix corresponding to $\bar{\pi}$,
  $\Big| T - \sum_{i,j} Z_{i,j} \cdot c_{i,j} \Big| \ge \frac{1}{2^{\mathrm{poly}(d)}} \cdot \|\bar{w}\|_2$.

13. Some details
▶ In general, $x_1, x_2, \dots, x_n$ are not (exactly) orthonormal, but a similar reduction works via the Moore-Penrose pseudoinverse.
▶ The reduction uses real coefficients in the lattice basis. For LLL to run in poly-time, we need to round the $(x_i)_{i=0}^n$ and $\bar{w}$ coefficients to finite-precision rational numbers. This is similar to drawing $(x_i)_{i=0}^n$ iid from a discretized $\mathrm{N}(0, I_d)$.
▶ The algorithm strongly exploits the assumption of noise-free measurements; it likely fails in the presence of noise.
▶ A similar algorithm was used by Andoni, H., Shi, & Sun (2017) for different problems (phase retrieval / correspondence retrieval).

14. Connections to prior works
▶ Unnikrishnan, Haghighatshoar, & Vetterli (2015)
  Recall: [UHV'15] show that $n \ge 2d$ measurements are necessary to uniquely determine every $\bar{w} \in \mathbb{R}^d$.
  Our result: for a fixed $\bar{w} \in \mathbb{R}^d$, $d+1$ measurements suffice to recover $\bar{w}$; the same covariate vectors may fail for some other $\bar{w}' \in \mathbb{R}^d$.
  (Cf. "for all" vs. "for each" results in compressive sensing.)
▶ Pananjady, Wainwright, & Courtade (2016)
  Noise-free setting: signal-to-noise conditions are trivially satisfied whenever $\bar{w} \ne 0$.
  Noisy setting: recovering $\bar{w}$ may be easier than recovering $\bar{\pi}$.
Next: limits for recovering $\bar{w}$.

15. Part 3: Lower bounds on SNR for approximate recovery

16. Setting
Linear model with Gaussian noise:
  $y_i = \bar{w}^\top x_{\bar{\pi}(i)} + \varepsilon_i$, $i \in [n]$
▶ Covariate vectors: $(x_i)_{i=1}^n$ iid from $P$
▶ Measurement errors: $(\varepsilon_i)_{i=1}^n$ iid from $\mathrm{N}(0, \sigma^2)$
▶ Unknown linear function: $\bar{w} \in \mathbb{R}^d$
▶ Unknown permutation: $\bar{\pi} \in S_n$
Equivalent: ignore $\bar{\pi}$; observe $(x_i)_{i=1}^n$ and $\{y_i\}_{i=1}^n$ (viewed as an unordered multi-set).
We consider $P = \mathrm{N}(0, I_d)$ and $P = \mathrm{Uniform}([-1,1]^d)$.
Note: if the correspondence between $(x_i)_{i=1}^n$ and $\{y_i\}_{i=1}^n$ is known, then $\mathrm{SNR} \gtrsim d/n$ already suffices to approximately recover $\bar{w}$.

17. Uniform case
Theorem. If $(x_i)_{i=1}^n$ are iid draws from $\mathrm{Uniform}([-1,1]^d)$, $(y_i)_{i=1}^n$ follow the linear model with $\mathrm{N}(0, \sigma^2)$ noise, and $\mathrm{SNR} \le (1-2c)^2$ for some $c \in (0, 1/2)$, then for any estimator $\hat{w}$, there exists $\bar{w} \in \mathbb{R}^d$ such that
  $\mathbb{E}\big[ \|\hat{w} - \bar{w}\|_2 \big] \ge c \, \|\bar{w}\|_2$.
Increasing the sample size $n$ does not help, unlike in the "known correspondence" setting (where $\mathrm{SNR} \gtrsim d/n$ suffices).

18. Proof sketch
We show that no estimator can confidently distinguish between $\bar{w} = e_1$ and $\bar{w} = -e_1$, where $e_1 = (1, 0, \dots, 0)^\top$.
Let $P_{\bar{w}}$ be the data distribution with parameter $\bar{w} \in \{e_1, -e_1\}$.
Task: show that $P_{e_1}$ and $P_{-e_1}$ are "close", then appeal to Le Cam's standard "two-point argument":
  $\max_{\bar{w} \in \{e_1, -e_1\}} \mathbb{E}_{P_{\bar{w}}} \|\hat{w} - \bar{w}\|_2 \ge 1 - \|P_{e_1} - P_{-e_1}\|_{\mathrm{tv}}$.
Key idea: the conditional means of $\{y_i\}_{i=1}^n$ given $(x_i)_{i=1}^n$, under $P_{e_1}$ and $P_{-e_1}$, are close as unordered multi-sets.

19. Proof sketch (continued)
Generative process for $P_{\bar{w}}$:
1. Draw $(x_i)_{i=1}^n$ iid from $\mathrm{Uniform}([-1,1]^d)$ and $(\varepsilon_i)_{i=1}^n$ iid from $\mathrm{N}(0, \sigma^2)$.
2. Set $u_i := \bar{w}^\top x_i$ for $i \in [n]$.
3. Set $y_i := u_{(i)} + \varepsilon_i$ for $i \in [n]$, where $u_{(1)} \le u_{(2)} \le \cdots \le u_{(n)}$.
Conditional distribution of $y = (y_1, y_2, \dots, y_n)$ given $(x_i)_{i=1}^n$:
  Under $P_{e_1}$: $\quad y \mid (x_i)_{i=1}^n \sim \mathrm{N}(u^\uparrow, \sigma^2 I_n)$
  Under $P_{-e_1}$: $\quad y \mid (x_i)_{i=1}^n \sim \mathrm{N}(-u^\downarrow, \sigma^2 I_n)$
where $u^\uparrow = (u_{(1)}, u_{(2)}, \dots, u_{(n)})$ and $u^\downarrow = (u_{(n)}, u_{(n-1)}, \dots, u_{(1)})$.
Data processing: we lose information by going from $y$ to the multi-set $\{y_i\}_{i=1}^n$.

20. Proof sketch (continued)
By the data processing inequality,
  $\mathrm{KL}\big( P_{e_1}(\cdot \mid (x_i)_{i=1}^n), \, P_{-e_1}(\cdot \mid (x_i)_{i=1}^n) \big) \le \mathrm{KL}\big( \mathrm{N}(u^\uparrow, \sigma^2 I_n), \, \mathrm{N}(-u^\downarrow, \sigma^2 I_n) \big) = \frac{\|u^\uparrow - (-u^\downarrow)\|_2^2}{2\sigma^2} = \frac{\mathrm{SNR}}{2} \cdot \|u^\uparrow + u^\downarrow\|_2^2$.
Some computations show that $\mathrm{med} \, \|u^\uparrow + u^\downarrow\|_2^2 \le 4$.
By conditioning + Pinsker's inequality,
  $\|P_{e_1} - P_{-e_1}\|_{\mathrm{tv}} \le \frac{1}{2} + \frac{1}{2} \sqrt{\frac{\mathrm{SNR}}{4} \cdot \mathrm{med} \, \|u^\uparrow + u^\downarrow\|_2^2} \le \frac{1}{2} + \frac{1}{2} \sqrt{\mathrm{SNR}}$.
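A quick Monte Carlo sanity check (my addition, not part of the talk) of the median bound above: it simulates $u_i = e_1^\top x_i$ for $x_i \sim \mathrm{Uniform}([-1,1]^d)$, so only the first coordinate matters and $d$ does not appear explicitly; the reported median should come out below 4.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 1000, 2000
norms_sq = np.empty(trials)
for t in range(trials):
    u = rng.uniform(-1.0, 1.0, size=n)   # u_i = e_1^T x_i, first coordinate only
    u_up = np.sort(u)                    # ascending order statistics
    u_down = u_up[::-1]                  # descending order statistics
    norms_sq[t] = np.sum((u_up + u_down) ** 2)
print(np.median(norms_sq))               # empirically well below the bound of 4
```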

21. Gaussian case
Theorem. If $(x_i)_{i=1}^n$ are iid draws from $\mathrm{N}(0, I_d)$, $(y_i)_{i=1}^n$ follow the linear model with $\mathrm{N}(0, \sigma^2)$ noise, and
  $\mathrm{SNR} \le C \cdot \frac{d}{\log\log(n)}$
for some absolute constant $C > 0$, then for any estimator $\hat{w}$, there exists $\bar{w} \in \mathbb{R}^d$ such that
  $\mathbb{E}\big[ \|\hat{w} - \bar{w}\|_2 \big] \ge C' \, \|\bar{w}\|_2$
for some other absolute constant $C' > 0$.
Cf. the "known correspondence" setting, where $\mathrm{SNR} \gtrsim d/n$ suffices.
