Restriction Access, Population Recovery & Partial Identification
Avi Wigderson, IAS, Princeton
Joint with Zeev Dvir, Anup Rao, Amir Yehudayoff
Restriction Access: a new model of “gray-box” access
Systems, Models, Observations
From input-output pairs (I_1, O_1), (I_2, O_2), (I_3, O_3), …? Typically more!
Black-box access: Successes & Limits
- Learning: PAC, membership, statistical, … queries. Limit: decision trees, DNFs?
- Cryptography: semantic, CPA, CCA, … security. Limit: cold boot, microwave, … attacks?
- Optimization: membership, separation, … oracles. Limit: strongly polynomial algorithms?
- Pseudorandomness: hardness vs. randomness. Limit: derandomizing specific algorithms?
- Complexity: Σ_2 = NP^NP. Limit: what problems can we solve if P = NP?
The gray scale of access
f: Σ^n → Σ^m; D: “device” computing f (from a family of devices)
- Black box: observations (x_1, f(x_1)), (x_2, f(x_2)), (x_3, f(x_3)), …
- Gray box: how to model? Many specific ideas. Ours: general, clean; a natural starting point and a natural intermediate point.
- Clear box: D itself
Restriction Access (RA)
f: Σ^n → Σ^m; D: “device” computing f
Restriction: ρ = (x, L), x ∈ Σ^n, L ⊆ [n] the set of live variables (pictured as x_1, x_2, *, *, …, *)
Observations: (ρ, D|ρ), where D|ρ (D simplified after fixing the non-live variables) computes f|ρ on L
- Black box: L = ∅, observations (x, f(x))
- Gray box: observations (ρ, D|ρ)
- Clear box: L = [n], observations (x, D)
Example: Decision Tree
ρ = (x, L) with x = (1010) and L = {3,4}
[Figure: a decision tree D on variables x_1, …, x_4, and the restricted tree D|ρ]
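The restriction step in this example can be sketched in code. The tree below is hypothetical (the slide's figure is not recoverable from the text), and the names `Leaf`, `Node`, `restrict` and `evaluate` are illustrative only. The sketch fixes every non-live variable to its value in x and keeps the live variables as queries.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class Leaf:
    value: int


@dataclass(frozen=True)
class Node:
    var: int        # 1-based index of the queried variable
    zero: "Tree"    # subtree taken when the variable is 0
    one: "Tree"     # subtree taken when the variable is 1


Tree = Union[Leaf, Node]


def restrict(tree, x, live):
    """Return D|rho for rho = (x, live): variables outside `live` are fixed to x."""
    if isinstance(tree, Leaf):
        return tree
    if tree.var in live:
        # Live variable: keep the query, restrict both subtrees.
        return Node(tree.var,
                    restrict(tree.zero, x, live),
                    restrict(tree.one, x, live))
    # Fixed variable: follow the branch selected by x and drop the query.
    branch = tree.one if x[tree.var - 1] == 1 else tree.zero
    return restrict(branch, x, live)


def evaluate(tree, x):
    """Evaluate a decision tree on a full assignment x."""
    while isinstance(tree, Node):
        tree = tree.one if x[tree.var - 1] == 1 else tree.zero
    return tree.value


# Hypothetical tree (NOT the one drawn on the slide).
D = Node(1,
         Node(3, Leaf(0), Leaf(1)),
         Node(2,
              Node(3, Leaf(0), Node(4, Leaf(1), Leaf(0))),
              Leaf(1)))
x = (1, 0, 1, 0)
live = {3, 4}
D_rho = restrict(D, x, live)
```

Here `D_rho` queries only x_3 and x_4, and it agrees with `D` on every input that matches x outside L.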
Modeling choices (RA-PAC)
Restriction: ρ = (x, L), L ⊆ [n], x ∈ Σ^n, unknown D
- Input x: friendly, adversarial, or random (unknown distribution, as in PAC)
- Live vars L: friendly, adversarial, or random (µ-independent distribution, as in random restrictions)
Positive RA-PAC Results (in contrast to PAC!)
Probably Approximately Correct (PAC) learning of D from restrictions in which each variable remains alive with probability µ
Thm 1 [DRWY]: A poly(s, µ) algorithm for RA-PAC learning size-s decision trees, for every µ > 0 (reconstruction from pairs of live variables)
Thm 2 [DRWY]: A poly(s, µ) algorithm for RA-PAC learning size-s DNFs, for every µ > .365… (reduction to the “Population Recovery Problem”)
Population Recovery (learning a mixture of binomials)
Population Recovery Problem
k species, n attributes from Σ; µ, ε > 0
Vectors v_1, v_2, …, v_k ∈ Σ^n
Distribution p_1, p_2, …, p_k
(Slide color legend: red = known, blue = unknown)
Example (k = 3 rows, n = 4 columns):
  p_1 = 1/2   v_1 = 0000
  p_2 = 1/3   v_2 = 0110
  p_3 = 1/6   v_3 = 1100
Task: Recover all v_i, p_i (up to ε) from samples
Population Recovery Problem
k species, n attributes from Σ; µ, ε > 0
v_1, v_2, …, v_k ∈ Σ^n; p_1, p_2, …, p_k: fraction in population
  p_1 = 1/2   v_1 = 0000
  p_2 = 1/3   v_2 = 0110
  p_3 = 1/6   v_3 = 1100
Task: Recover all v_i, p_i (up to ε) from samples
Samplers:
(1) u ← v_i with prob. p_i (example: 0110)
µ-Lossy Sampler: (2) u(j) ← ? with prob. 1−µ, ∀ j ∈ [n] (example: ?1?0)
µ-Noisy Sampler: (2) u(j) flipped w.p. 1/2−µ, ∀ j ∈ [n] (example: 1100)
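The two samplers can be simulated directly. The following is a minimal sketch assuming Σ = {0,1}; the function names and the use of "?" as the erasure symbol in a tuple are choices made here for illustration.

```python
import random

# Example population from the slide: three species over four binary attributes.
V = [(0, 0, 0, 0), (0, 1, 1, 0), (1, 1, 0, 0)]
P = [1 / 2, 1 / 3, 1 / 6]


def lossy_sample(vectors, probs, mu, rng):
    """Step (1): draw v_i with prob. p_i. Step (2): erase each coordinate w.p. 1 - mu."""
    v = rng.choices(vectors, weights=probs, k=1)[0]
    return tuple(bit if rng.random() < mu else "?" for bit in v)


def noisy_sample(vectors, probs, mu, rng):
    """Step (1): draw v_i with prob. p_i. Step (2): flip each coordinate w.p. 1/2 - mu."""
    v = rng.choices(vectors, weights=probs, k=1)[0]
    return tuple(1 - bit if rng.random() < 0.5 - mu else bit for bit in v)


rng = random.Random(0)
print(lossy_sample(V, P, 0.5, rng))
print(noisy_sample(V, P, 0.1, rng))
```

With µ = 1 the lossy sampler returns the drawn vector unchanged, and with µ = 1/2 the noisy sampler never flips a bit, which matches the two definitions at their extremes.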
Loss – Paleontology
[Figure: true data, five species making up 26%, 11%, 13%, 30% and 20% of the population]
Loss – Paleontology
From samples: Dig #1, Dig #2, Dig #3, Dig #4, …
Each finding is common to many species! How do they do it?
Noise – Privacy
True data:
  2%   0 1 1 0 1 0 0
  1%   1 1 0 0 0 1 1
  ……
From samples:
  Joe    0 0 0 0 0 1 1
  Jane   0 0 0 0 1 1 1
  …
who flipped every correct answer with probability 49%
Deniability? Recovery?
PRP - applications
Recovering from loss & noise
- Clustering / Learning / Data mining
- Computational biology / Archeology / ……
- Error correction
- Database privacy
- ……
Numerous related papers & books
PRP - Results
Facts:
- µ = 0 obliterates all information.
- No polytime algorithm for µ = o(1)
Thm 3 [DRWY]: A poly(k, n, ε) algorithm, from lossy samples, for every µ > .365…
Thm 4 [WY]: A poly(k^{log k}, n, ε) algorithm, from lossy and/or noisy samples, for every µ > 0
Kearns, Mansour, Ron, Rubinfeld, Schapire, Sellie: exp(k) algorithm for this discrete version
Moitra, Valiant: exp(k) algorithm for Gaussian version (even when noise is unknown)
Proof of Thm 4
Reconstruct v_i, p_i (k rows, n columns):
  p_1 = 1/2   v_1 = 0000
  p_2 = 1/3   v_2 = 0110
  p_3 = 1/6   v_3 = 1100
From samples: ?1?0, 0??0, 1100, …
Lemma 1: Can assume we know the v_i's!
Proof: Exposing one column at a time.
Lemma 2: Easy in exp(n) time!
Proof: Lossy: enough samples without “?”. Noisy: linear algebra on sample probabilities.
Idea: Make n = O(log k) [Dimension Reduction]
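The "linear algebra on sample probabilities" in the noisy case of Lemma 2 can be spelled out as follows. This is a sketch assuming Σ = {0,1} and known vectors v_i (Lemma 1); d(w, v_i) denotes Hamming distance, notation introduced here rather than taken from the slide.

```latex
\Pr[u = w] \;=\; \sum_{i=1}^{k} p_i
  \left(\tfrac12 + \mu\right)^{\,n - d(w, v_i)}
  \left(\tfrac12 - \mu\right)^{\,d(w, v_i)},
\qquad w \in \{0,1\}^n .
```

These are 2^n linear equations in the unknowns p_i. The noise acts independently on each coordinate through the 2×2 matrix with rows (1/2+µ, 1/2−µ) and (1/2−µ, 1/2+µ), whose eigenvalues are 1 and 2µ, so the system is invertible whenever µ > 0. Its smallest eigenvalue is (2µ)^n, which is one way to see why the cost is exponential in n.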
Partial IDs: a new dimension-reduction technique
Dimension Reduction and small IDs
n = 8, k = 9
             1 2 3 4 5 6 7 8    IDs
  p_1  v_1   0 0 0 0 0 1 0 1    S_1 = {1,2}
  p_2  v_2   0 1 1 0 1 0 1 0    S_2 = {8}
  p_3  v_3   0 1 0 0 1 0 1 1    S_3 = {1,5,6}
  p_4  v_4   1 1 1 0 1 0 1 1
  p_5  v_5   1 1 0 0 0 1 1 1
  p_6  v_6   1 1 0 0 1 0 0 1
  p_7  v_7   0 1 0 0 0 1 1 1
  p_8  v_8   1 1 0 1 1 0 1 1
  p_9  v_9   1 1 0 0 0 1 1 1
u: random sample; q_i = Pr[u[S_i] = v_i[S_i]]
Lemma: Can approximate p_i in exp(|S_i|) time!
Does one always have small IDs?
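A minimal sketch of how q_i might be estimated from lossy samples: keep only the samples in which every coordinate of S_i survived, and count how many of them match v_i there. The function name and the sample encoding (tuples with "?" for erased entries, 1-based coordinates in S) are assumptions made for illustration. Since erasures are independent of the species drawn, conditioning on S_i being visible does not bias the estimate; only about a µ^{|S_i|} fraction of samples is usable, which is where the exp(|S_i|) cost comes from.

```python
def estimate_q(samples, v, S):
    """Estimate q = Pr[u[S] = v[S]] from lossy samples.

    samples: iterable of tuples over {0, 1, "?"}
    v:       the species vector whose ID is S
    S:       set of 1-based coordinates
    """
    # Use only samples with no erasure on S.
    usable = [u for u in samples if all(u[j - 1] != "?" for j in S)]
    if not usable:
        raise ValueError("no sample has all coordinates of S visible")
    hits = sum(all(u[j - 1] == v[j - 1] for j in S) for u in usable)
    return hits / len(usable)


# Tiny hand-made example (not data from the slides).
samples = [(0, "?", 1), (1, 1, "?"), ("?", 0, 0), (0, 0, 1)]
v = (0, 0, 1)
print(estimate_q(samples, v, {1}))
print(estimate_q(samples, v, {1, 3}))
```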
Small IDs?
n = 8, k = 9
             1 2 3 4 5 6 7 8    IDs
  p_1  v_1   1 0 0 0 0 0 0 0    S_1 = {1}
  p_2  v_2   0 1 0 0 0 0 0 0    S_2 = {2}
  p_3  v_3   0 0 1 0 0 0 0 0    S_3 = {3}
  p_4  v_4   0 0 0 1 0 0 0 0    …
  p_5  v_5   0 0 0 0 1 0 0 0
  p_6  v_6   0 0 0 0 0 1 0 0
  p_7  v_7   0 0 0 0 0 0 1 0
  p_8  v_8   0 0 0 0 0 0 0 1    S_8 = {8}
  p_9  v_9   0 0 0 0 0 0 0 0    S_9 = {1,2,…,8}
NO! However,…
Linear algebra & Partial IDs
n = 8, k = 9
             1 2 3 4 5 6 7 8    PIDs
  p_1  v_1   1 0 0 0 0 0 0 0    S_1 = {1}
  p_2  v_2   0 1 0 0 0 0 0 0    S_2 = {2}
  p_3  v_3   0 0 1 0 0 0 0 0    S_3 = {3}
  p_4  v_4   0 0 0 1 0 0 0 0    …
  p_5  v_5   0 0 0 0 1 0 0 0
  p_6  v_6   0 0 0 0 0 1 0 0
  p_7  v_7   0 0 0 0 0 0 1 0
  p_8  v_8   0 0 0 0 0 0 0 1    S_8 = {8}
  p_9  v_9   0 0 0 0 0 0 0 0    S_9 = ∅
However, we can compute p_9 = 1 − p_1 − p_2 − … − p_8
Back substitution and Imposters
             1 2 3 4 5 6 7 8    PIDs
  p_1  v_1   0 0 1 0 0 1 0 1    S_1 = {1,2}     q_1 = p_1
  p_2  v_2   0 1 1 0 1 0 1 0    S_2 = {8}       q_2 = p_2
  p_3  v_3   0 1 0 0 1 0 1 1    S_3 = {1,5,6}   q_3 = p_3
  p_4  v_4   1 1 1 0 1 0 1 1    S_4 = {3}       q_4 − p_1 − p_2 = p_4
  p_5  v_5   1 1 0 0 0 1 1 1
  p_6  v_6   1 1 0 0 1 0 0 1
  p_7  v_7   0 1 0 0 0 1 1 1
  p_8  v_8   1 1 0 1 1 0 1 1
  p_9  v_9   1 1 0 0 0 1 1 1
u: random sample; q_i = Pr[u[S_i] = v_i[S_i]], where S_i may be any subset
Can use back substitution if no cycles!
Are there always acyclic small partial IDs?
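Back substitution over an acyclic set of partial IDs can be sketched as below. The identity used is q_i = p_i + (sum of p_j over the imposters j of v_i), so p_i = q_i minus the imposters' weights. The function names are illustrative, the q values are taken as exact (computed from a made-up distribution rather than estimated), and the example is a 3-attribute version of the unit-vector population with S for the all-zero vector set to ∅.

```python
from fractions import Fraction


def imposters(V, S, i):
    """Indices j != i whose vectors agree with V[i] on S[i] (1-based coordinates)."""
    return [j for j in range(len(V))
            if j != i and all(V[j][c - 1] == V[i][c - 1] for c in S[i])]


def back_substitute(V, S, q):
    """Solve p_i = q_i - sum of p_j over imposters j of v_i; requires no cycles."""
    p = {}
    in_progress = set()

    def solve(i):
        if i in p:
            return p[i]
        if i in in_progress:
            raise ValueError("PID graph has a cycle")
        in_progress.add(i)
        value = q[i] - sum(solve(j) for j in imposters(V, S, i))
        in_progress.discard(i)
        p[i] = value
        return value

    return [solve(i) for i in range(len(V))]


# Small version of the unit-vector example; the weights are made up.
V = [(1, 0, 0), (0, 1, 0), (0, 0, 1), (0, 0, 0)]
S = [{1}, {2}, {3}, set()]
true_p = [Fraction(1, 2), Fraction(1, 4), Fraction(1, 8), Fraction(1, 8)]
# Exact q_i = Pr[u[S_i] = v_i[S_i]] under true_p (no sampling error).
q = [true_p[i] + sum(true_p[j] for j in imposters(V, S, i)) for i in range(len(V))]
recovered = back_substitute(V, S, q)
```

With S_4 = ∅ every other vector is an imposter of the all-zero vector, so q_4 = 1 and the last weight comes out as 1 minus the others, as in the linear-algebra slide.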
Acyclic small partial IDs exist
n = 8, k = 9
             1 2 3 4 5 6 7 8    PIDs
  p_1  v_1   0 0 0 0 0 0 0 1
  p_2  v_2   0 1 1 0 1 0 1 0
  p_3  v_3   1 1 0 0 1 0 1 1
  p_4  v_4   1 1 1 0 1 0 1 1
  p_5  v_5   1 1 0 0 0 1 1 1
  p_6  v_6   1 1 0 0 1 0 0 1
  p_7  v_7   1 1 1 1 1 0 1 1
  p_8  v_8   0 1 0 0 0 1 1 1    S_8 = {1,5,6}
  p_9  v_9   0 1 0 0 1 1 1 1
Lemma: There is always an ID of length log k
Idea: Remove and iterate to find more PIDs
Lemma: Acyclic (log k)-PIDs always exist!
Chains of small Partial IDs
n = 8, k = 6
             1 2 3 4 5 6 7 8    PIDs
  p_1  v_1   1 1 1 1 1 1 1 1    S_1 = {1}
  p_2  v_2   0 1 1 1 1 1 1 1    S_2 = {2}
  p_3  v_3   0 0 1 1 1 1 1 1    S_3 = {3}
  p_4  v_4   0 0 0 1 1 1 1 1    …
  p_5  v_5   0 0 0 0 1 1 1 1
  p_6  v_6   0 0 0 0 0 1 1 1    S_6 = {6}
Compute: q_i = Pr[u_i = 1] = Σ_{j ≤ i} p_j from sample u
Back substitution: p_i = q_i − Σ_{j<i} p_j
Problem: Long chains! Error doubles each step, so is exponential in the chain length.
Want: Short chains!
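One way to see the doubling is as a worst-case bound. Assume each estimate \hat q_i is within δ of q_i, and write \hat p_i for the values produced by back substitution (the hats and δ are notation introduced here). By induction on i, using the triangle inequality:

```latex
\hat p_i = \hat q_i - \sum_{j<i} \hat p_j
\quad\Longrightarrow\quad
|\hat p_i - p_i|
\;\le\; |\hat q_i - q_i| + \sum_{j<i} |\hat p_j - p_j|
\;\le\; \delta + \sum_{j<i} 2^{\,j-1}\,\delta
\;=\; 2^{\,i-1}\,\delta .
```

So guaranteeing accuracy ε at the end of a chain of length d through this bound needs δ of order ε / 2^d, which is why short chains are wanted.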
The PID (imposter) graph
Given: V = (v_1, v_2, …, v_k), v_i ∈ Σ^n, and S = (S_1, S_2, …, S_k), S_i ⊆ [n]
Construct G(V;S) by connecting v_j → v_i iff v_i is an imposter of v_j: v_i[S_j] = v_j[S_j]
             1 2 3 4 5 6 7 8    PIDs
  v_1        1 1 1 1 1 1 1 1    S_1 = {1}
  v_2        0 1 1 1 1 1 1 1    S_2 = {2}
  v_3        0 0 1 1 1 1 1 1    S_3 = {3}
  v_4        0 0 0 1 1 1 1 1    …
  v_5        0 0 0 0 1 1 1 1    S_5 = {5}
In this example: v_i → v_j iff i > j
width = max_i |S_i|, depth = depth(G)
Want: PIDs w/ small width and depth for all V
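The graph construction can be sketched as follows. Two points are assumptions made here, since the slide leaves them open: depth is taken to be the number of edges on the longest directed path, and S_4 = {4} is filled in for the elided row of the example. The graph is assumed acyclic; `depth` does not terminate cleanly on a cyclic input.

```python
def pid_graph(V, S):
    """Edges (j, i), 0-based, meaning v_j -> v_i: v_i is an imposter of v_j."""
    k = len(V)
    return {(j, i)
            for j in range(k) for i in range(k)
            if i != j and all(V[i][c - 1] == V[j][c - 1] for c in S[j])}


def width(S):
    """Largest partial ID size."""
    return max(len(s) for s in S)


def depth(k, edges):
    """Number of edges on the longest directed path (graph assumed acyclic)."""
    succ = {j: [i for (a, i) in edges if a == j] for j in range(k)}
    memo = {}

    def longest(j):
        if j not in memo:
            memo[j] = max((1 + longest(i) for i in succ[j]), default=0)
        return memo[j]

    return max(longest(j) for j in range(k))


# Chain example: v_j has j-1 zeros followed by ones, S_j = {j} (1-based).
V = [tuple([0] * j + [1] * (8 - j)) for j in range(5)]
S = [{j + 1} for j in range(5)]
edges = pid_graph(V, S)
print(width(S), depth(len(V), edges))
```

On the chain the width is 1 but the depth grows with the number of vectors, which is the situation the next slides set out to avoid.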
Constructing cheap PID graphs
Theorem: For every V = (v_1, v_2, …, v_k), v_i ∈ Σ^n, we can efficiently find PIDs S = (S_1, S_2, …, S_k), S_i ⊆ [n], of width and depth at most log k
Algorithm: Initialize S_i = ∅ for all i
Invariant: |imposters(v_i; S_i)| ≤ k / 2^{|S_i|}
Repeat:
(1) Make S_i maximal: if not, add minority coordinates to S_i
(2) Make chains monotone: if v_j → v_i then |S_j| < |S_i| (so G is acyclic); if not, set S_i to S_j (and apply (1) to S_i)
Example:
         1 2 3 4
  v_1    0 0 1 0
  v_2    0 0 0 0
  v_3    0 0 0 1
  v_4    1 0 0 1
  v_5    1 1 1 0
  v_6    1 0 1 0
Analysis of the algorithm
Theorem: For every V = (v_1, v_2, …, v_k), v_i ∈ Σ^n, we can efficiently find PIDs S = (S_1, S_2, …, S_k), S_i ⊆ [n], of width and depth at most log k
Algorithm: Initialize S_i = ∅ for all i
Invariant: |imposters(v_i; S_i)| ≤ k / 2^{|S_i|}
Repeat:
(1) Make S_i maximal
(2) Make chains monotone (if v_j → v_i then |S_j| < |S_i|)
Analysis:
- |S_i| ≤ log k throughout, for all i
- ∑_i |S_i| increases each step
- Termination in k log k steps.
- width ≤ log k and so depth ≤ log k
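The two-step loop can be sketched in code. This is one possible reading of the slide, not the authors' implementation, and it rests on several assumptions: the vectors are distinct, imposters(v_i; S_i) counts v_i itself, a "minority coordinate" is a coordinate outside S_i on which at most half of the current imposters share v_i's value, and coordinates are 0-based. The function name `build_pids` is illustrative.

```python
def build_pids(V):
    """Return partial IDs S (sets of 0-based coordinates) for distinct vectors V."""
    k, n = len(V), len(V[0])
    S = [set() for _ in range(k)]

    def group(i):
        # Vectors agreeing with v_i on S_i (v_i itself included).
        return [j for j in range(k) if all(V[j][c] == V[i][c] for c in S[i])]

    def make_maximal(i):
        # Step (1): keep adding minority coordinates while one exists.
        grew = True
        while grew:
            grew = False
            members = group(i)
            for c in range(n):
                if c in S[i]:
                    continue
                same = sum(1 for j in members if V[j][c] == V[i][c])
                if 2 * same <= len(members):
                    S[i].add(c)
                    grew = True
                    break

    def find_violation():
        # An edge v_j -> v_i with |S_j| >= |S_i| breaks monotonicity.
        for j in range(k):
            for i in group(j):
                if i != j and len(S[j]) >= len(S[i]):
                    return j, i
        return None

    for i in range(k):
        make_maximal(i)
    edge = find_violation()
    while edge is not None:
        j, i = edge
        S[i] = set(S[j])   # Step (2): copy S_j, then re-apply step (1).
        make_maximal(i)
        edge = find_violation()
    return S


# The six vectors from the example table on the algorithm slide.
V = [(0, 0, 1, 0), (0, 0, 0, 0), (0, 0, 0, 1),
     (1, 0, 0, 1), (1, 1, 1, 0), (1, 0, 1, 0)]
S = build_pids(V)
width = max(len(s) for s in S)
```

Under these assumptions each added coordinate at least halves the imposter group, so |S_i| never exceeds log_2 k, and every repair in step (2) strictly increases ∑_i |S_i|, mirroring the analysis bullets above.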
Conclusions
- Restriction access: a new, general model of “gray box” access (largely unexplored!)
- A general problem of population recovery
- Efficient reconstruction from loss & noise
- Partial IDs, a new dimension reduction technique for databases.
Open: polynomial time algorithm in k? (currently k^{log k}; PIDs can't beat k^{log log k})
Open: Handle unknown errors?