Resilience: A Criterion for Learning in the Presence of Arbitrary Outliers
Jacob Steinhardt, Moses Charikar, Gregory Valiant
ITCS 2018, January 14, 2018
Motivation: Robust Learning
Question: What concepts can be learned robustly, even if some data is arbitrarily corrupted?
Example: Mean Estimation
Problem: Given data x_1, …, x_n ∈ R^d, of which (1 − ε)n come from p* (and the remaining εn are arbitrary outliers), estimate the mean µ of p*.
Issue: high dimensions.
Mean Estimation: Gaussian Example
Suppose the clean data is Gaussian: x_i ∼ N(µ, I) (mean µ, variance 1 in each coordinate).
Then ‖x_i − µ‖_2 ≈ √(1² + ⋯ + 1²) = √d: even clean points are far from µ.
Meanwhile, εn adversarial outliers can shift the empirical mean by ≈ ε√d while looking innocuous in any single coordinate.
Cannot filter outliers coordinate-by-coordinate, even if we know the true density!
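A quick numerical illustration of this effect (a sketch, not from the talk; assumes numpy): outliers shifted by just one standard deviation in every coordinate are innocuous coordinate-wise, yet they move the empirical mean by about ε√d in ℓ2.

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, eps = 1000, 10000, 0.1
mu = np.zeros(d)

n_out = int(eps * n)
clean = rng.normal(0.0, 1.0, size=(n - n_out, d)) + mu  # x_i ~ N(mu, I)
outliers = np.ones((n_out, d))                           # shifted by 1 in every coordinate
data = np.vstack([clean, outliers])

print(np.linalg.norm(clean - mu, axis=1).mean())   # ~ sqrt(d) ~ 31.6: even clean points are far from mu
print(np.abs(outliers - mu).max())                 # 1.0: each outlier coordinate is within one std dev
print(np.linalg.norm(data.mean(axis=0) - mu))      # ~ eps*sqrt(d) ~ 3.2: the naive mean is badly corrupted
```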
History
Progress in high dimensions only recently:
• Tukey median [1975]: robust but NP-hard
• Donoho estimator [1982]: high error
• [DKKLMS16, LRV16]: first dimension-independent error bounds
• large body of work since then [CSV17, DKKLMS17, L17, DBS17]
• many other problems, including PCA [XCM10], regression [NTN11], classification [FHKP09], etc.
This Talk
Question: What general and simple properties enable robust estimation?
New information-theoretic criterion: resilience.
Resilience
Suppose {x_i}_{i ∈ S} is a set of points in R^d.
Definition (Resilience): A set S is (σ, ε)-resilient in a norm ‖·‖ around a point µ if for all subsets T ⊆ S of size at least (1 − ε)|S|,
    ‖ (1/|T|) Σ_{i ∈ T} (x_i − µ) ‖ ≤ σ.
Intuition: all large subsets have similar mean.
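To make the definition concrete, here is a small sketch (assuming numpy; not from the paper) that computes the exact resilience parameter of a one-dimensional point set around a given center µ. In one dimension, the worst large subset is always obtained by deleting points from one end: deleting the smallest points pushes the mean up as far as possible, deleting the largest pushes it down.

```python
import numpy as np

def resilience_1d(x, mu, eps):
    """Exact sigma such that the point set x is (sigma, eps)-resilient around mu (1-D)."""
    x = np.sort(np.asarray(x, dtype=float))
    k = int(np.floor(eps * len(x)))
    if k == 0:
        return abs(x.mean() - mu)          # only T = S is allowed
    hi = x[k:].mean() - mu                 # largest possible mean(T): delete the k smallest points
    lo = mu - x[:-k].mean()                # smallest possible mean(T): delete the k largest points
    return max(hi, lo)

# Standard Gaussian samples are roughly (eps * sqrt(2 log(1/eps)), eps)-resilient around 0.
rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100_000)
print(resilience_1d(x, mu=0.0, eps=0.1))   # roughly 0.2 here
```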
Main Result
Let S ⊆ R^d be a set of (1 − ε)n "good" points. Let S_out be a set of εn arbitrary outliers. We observe S̃ = S ∪ S_out.
Theorem: If S is (σ, ε/(1−ε))-resilient around µ, then it is possible to output µ̂ such that ‖µ̂ − µ‖ ≤ 2σ.
In fact, outputting the center of any resilient subset of S̃ will work!
Pigeonhole Argument
Claim: If S and S′ are (σ, ε/(1−ε))-resilient around µ and µ′ respectively, and each has size (1 − ε)n, then ‖µ − µ′‖ ≤ 2σ.
Proof:
• Let µ_{S∩S′} be the mean of S ∩ S′.
• By pigeonhole, |S ∩ S′| ≥ (1 − ε/(1−ε)) |S′|.
• Then ‖µ′ − µ_{S∩S′}‖ ≤ σ by resilience.
• Similarly, ‖µ − µ_{S∩S′}‖ ≤ σ.
• The result follows by the triangle inequality.
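A quick numerical check of the claim (a sketch, assuming numpy; the sets and parameters are illustrative): two heavily overlapping 1-D point sets, each (σ, ε/(1−ε))-resilient around its own mean, have means within 2σ of each other.

```python
import numpy as np

def resilience_around_own_mean(x, eps):
    """1-D resilience parameter sigma of a point set around its own mean."""
    x = np.sort(np.asarray(x, dtype=float))
    k = max(1, int(np.floor(eps * len(x))))
    mu = x.mean()
    return max(x[k:].mean() - mu, mu - x[:-k].mean())

rng = np.random.default_rng(1)
eps = 0.2
shared = rng.normal(0.0, 1.0, size=1000)                        # points common to S and S'
S  = np.concatenate([shared, rng.normal(0.0, 1.0, size=200)])   # |S| = |S'| = (1 - eps) n with n = 1500
Sp = np.concatenate([shared, rng.normal(0.0, 1.0, size=200)])   # overlap 1000 >= (1 - 2 eps) n = 900

sigma = max(resilience_around_own_mean(S,  eps / (1 - eps)),
            resilience_around_own_mean(Sp, eps / (1 - eps)))
print(abs(S.mean() - Sp.mean()), "<=", 2 * sigma)               # the gap between the means is at most 2*sigma
```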
Implication: Mean Estimation
Lemma: If a dataset has bounded covariance, it is (O(√ε), ε)-resilient in the ℓ2-norm.
Proof sketch: If εn points were ≫ 1/√ε away from the mean, they would make the variance ≫ 1. Therefore, deleting εn points changes the mean by at most ≈ ε · 1/√ε = √ε.
Corollary: If the clean data has bounded covariance, its mean can be estimated to ℓ2-error O(√ε) in the presence of εn outliers.
Corollary: If the clean data has bounded k-th moments, its mean can be estimated to ℓ2-error O(ε^{1 − 1/k}) in the presence of εn outliers.
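The result above is information-theoretic; the efficient algorithms in the paper are more involved. As a rough illustration of how bounded covariance can be exploited algorithmically, here is a simplified spectral-filtering sketch in the spirit of [DKKLMS16] (assuming numpy; the threshold and stopping rule are ad hoc, and this is not the paper's algorithm):

```python
import numpy as np

def filtered_mean(X, eps, max_iter=50):
    """Crude robust mean estimate: repeatedly drop points with the largest projection
    onto the top eigenvector of the empirical covariance, until that covariance is
    close to the clean bound. Simplified sketch only."""
    X = np.asarray(X, dtype=float)
    for _ in range(max_iter):
        mu = X.mean(axis=0)
        cov = np.cov(X, rowvar=False)
        eigvals, eigvecs = np.linalg.eigh(cov)
        if eigvals[-1] <= 2.0:                       # top eigenvalue near the clean value ~1: stop
            return mu
        v = eigvecs[:, -1]                           # direction of largest variance
        scores = ((X - mu) @ v) ** 2
        cutoff = np.quantile(scores, 1 - eps / 2)    # drop the most extreme points along v
        X = X[scores < cutoff]
    return X.mean(axis=0)

# Usage: Gaussian data (true mean 0) plus eps*n outliers clustered along one direction.
rng = np.random.default_rng(0)
d, n, eps = 200, 5000, 0.1
clean = rng.normal(0.0, 1.0, size=(n - int(eps * n), d))
outliers = np.full((int(eps * n), d), 1.0)
X = np.vstack([clean, outliers])
print(np.linalg.norm(X.mean(axis=0)))         # naive mean: error ~ eps * sqrt(d)
print(np.linalg.norm(filtered_mean(X, eps)))  # filtered mean: much smaller error
```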
Implication: Learning Discrete Distributions
Suppose we observe samples from a distribution π on {1, …, m}. Samples come in r-tuples, which are either all good or all outliers.
Corollary: The distribution π can be estimated (in TV distance) to error O(√(ε log(1/ε) / r)) in the presence of εn outliers.
• follows from resilience in the ℓ1-norm
• see also [Qiao & Valiant, 2018] later in this session!
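To see why tuples help, each r-tuple can be encoded as its empirical distribution vector; estimating π in TV distance then becomes robust mean estimation of these vectors in the ℓ1-norm (TV = ℓ1/2). Below is a minimal sketch of this encoding (assuming numpy; the resilience bound itself needs a more careful argument):

```python
import numpy as np

rng = np.random.default_rng(0)
m, r, n_tuples = 10, 20, 5000
pi = rng.dirichlet(np.ones(m))                     # unknown distribution on {1, ..., m}

# Encode each r-tuple of samples as its empirical distribution (a vector in the simplex).
samples = rng.choice(m, size=(n_tuples, r), p=pi)
tuple_vecs = np.stack([np.bincount(row, minlength=m) / r for row in samples])

# Good tuple vectors cluster around pi, and more tightly as r grows; estimating pi in TV
# is robust mean estimation of these vectors in the l1-norm.
print(0.5 * np.abs(tuple_vecs - pi).sum(axis=1).mean())   # average per-tuple TV deviation from pi
print(0.5 * np.abs(tuple_vecs.mean(axis=0) - pi).sum())   # the mean of the good tuples is close to pi
```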
A Majority of Outliers
Can also handle the case where the clean set has size only αn (α < 1/2):
• cover S̃ by resilient sets
• at least one set S′ must have high overlap with S...
• ...and hence ‖µ′ − µ‖ ≤ 2σ as before
• recovery in the list-decodable model [BBV08]
Implication: Stochastic Block Models
Set of αn good and (1 − α)n bad vertices.
• good ↔ good: dense (avg. deg. = a)
• good ↔ bad: sparse (avg. deg. = b)
• bad ↔ bad: arbitrary
Question: when can the good set be recovered (in terms of α, a, b)?
Using resilience in a "truncated ℓ1-norm", one can show:
Corollary: The set of good vertices can be approximately recovered whenever (a − b)²/a ≫ log(2/α)/α².
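The reduction behind this corollary represents each vertex by its adjacency row and applies (list-decodable) robust mean estimation of those rows. The sketch below (assuming numpy; not the paper's algorithm, since it peeks at the true good set) only illustrates why the mean adjacency row of the good vertices separates good from bad coordinates:

```python
import numpy as np

rng = np.random.default_rng(0)
n, alpha, a, b = 2000, 0.3, 40.0, 10.0
k = int(alpha * n)
good = np.arange(k)                                          # first k vertices are "good"

A = np.zeros((n, n))
A[:k, :k] = rng.random((k, k)) < a / (alpha * n)             # good-good: avg degree ~ a
A[:k, k:] = rng.random((k, n - k)) < b / ((1 - alpha) * n)   # good-bad: avg degree ~ b
A[k:, :k] = A[:k, k:].T
A[k:, k:] = rng.random((n - k, n - k)) < 0.5                 # bad-bad: arbitrary (here random)
A = np.triu(A, 1); A = A + A.T                               # symmetric simple graph

# The mean adjacency row over the good vertices is ~a/(alpha*n) on good coordinates and
# ~b/((1-alpha)*n) on bad ones, so recovering the good set reduces to estimating this mean.
mean_row = A[good].mean(axis=0)
recovered = np.argsort(mean_row)[-k:]                        # top-k coordinates
print(len(set(recovered) & set(good)) / k)                   # fraction of good vertices recovered
```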