Homomorphic Sketches: Shrinking Big Data without Sacrificing Structure
Andrew McGregor, University of Massachusetts
Can test (w.h.p.) whether two n-bit files are identical by comparing O(log n)-bit fingerprints of each file.
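One common way to realize such a fingerprint is to evaluate the file, read as a polynomial, at a shared random point modulo a prime. The sketch below is only an illustration under assumptions: the 61-bit modulus `P`, the helper name `fingerprint`, and the sample bit strings are not from the slides, and a modulus polynomial in n would be enough for O(log n) bits.

```python
import random

P = (1 << 61) - 1  # a Mersenne prime; this modulus is an assumption

def fingerprint(bits, r, p=P):
    """Evaluate sum(bits[i] * r**i) mod p using Horner's rule."""
    acc = 0
    for b in reversed(bits):
        acc = (acc * r + b) % p
    return acc

r = random.randrange(1, P)      # random evaluation point shared by both parties
a = [1, 0, 1, 1, 0, 0, 1, 0]
b = list(a)                     # identical copy of a
c = list(a)
c[3] ^= 1                       # differs from a in exactly one bit

same = fingerprint(a, r) == fingerprint(b, r)
diff = fingerprint(a, r) != fingerprint(c, r)
```

Identical files always agree. Different files collide only when r happens to be a root of the difference polynomial, which has at most n-1 roots among the roughly P choices of r.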
More generally, can construct sketches of files to estimate the Hamming distance between the files. Many related results: distinct elements, entropy, frequency moments, quantiles, histograms, linear regression, clustering, shape approximation...
[Figure: a short, wide matrix M applied to a long vector v gives a much shorter vector Mv.]
Basic Idea: Treat the file as a vector v; use a linear projection M to reduce dimension while preserving properties.
Extensive theory with connections to compressed sensing and metric embeddings; widely applicable since parallelizable and suitable for stream processing.
Most existing work concerns numerical statistics of data such as frequency and feature vectors...
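The reason linear projections suit parallel and streaming settings is that sketches add: M(u + v) = Mu + Mv, and a single-coordinate update changes the sketch by a multiple of one column of M. The following minimal sketch illustrates this with a random ±1 matrix; the matrix shape, the seed, and the helper names `make_projection`, `sketch` and `update` are assumptions made for illustration.

```python
import random

def make_projection(rows, cols, seed=0):
    """Random +/-1 matrix; the distribution and seed are illustrative choices."""
    rng = random.Random(seed)
    return [[rng.choice((-1, 1)) for _ in range(cols)] for _ in range(rows)]

def sketch(M, v):
    """Compute the matrix-vector product Mv."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def update(sk, M, j, delta):
    """Stream update v[j] += delta, applied directly to the sketch."""
    for i, row in enumerate(M):
        sk[i] += delta * row[j]

M = make_projection(8, 100)
u = [i % 3 for i in range(100)]
v = [(-1) ** i for i in range(100)]

su, sv = sketch(M, u), sketch(M, v)
combined = [x + y for x, y in zip(su, sv)]           # Mu + Mv
direct = sketch(M, [x + y for x, y in zip(u, v)])    # M(u + v)

stream = [0] * len(M)                                # sketch of the zero vector
for j, x in enumerate(u):
    update(stream, M, j, x)                          # feed u one coordinate at a time
```

Here `combined` equals `direct`, and `stream` equals `su`, so the sketch never needs to see the full vector at once.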
Is it possible to analyze richer combinatorial and group-theoretic structure via linear sketches? Can we make compression “homomorphic” and run algorithms on sketched data?
[Figure: BIG DATA → Compress → small data; running the Algorithm on either one yields the same ANSWER.]
Suppose n files encode rows of an adjacency matrix, e.g., each file is a list of friends in a social network. Theorem: Can check graph connectivity with O(polylog n) bit fingerprints of each file.
[Figure: “The quick brown fox jumped over the lazy dog.” → CYCLIC ROTATION → “quick brown fox jumped over the lazy dog. The”; a matching FINGERPRINT OPERATION maps the fingerprint of the first text to the fingerprint of the second.]
Hamming distance isn’t robust to misalignments.
Theorem: Can check equality of files up to rotation with fingerprints of length D(n) polylog n, where D(n) is the number of divisors of n.
More generally, we have homomorphic fingerprints: given a fingerprint, can compute the fingerprint of any rotation.
I. Connectivity
Outline: I. Connectivity; II. Misalignment
a) Connectivity via O(polylog n) bit Fingerprints
b) Extension to Estimating Cuts and Eigenvalues
Joint work with Kook Jin Ahn and Sudipto Guha
Sketches for Connectivity
• Theorem: Can check graph connectivity w.h.p. using an O(polylog n) bit fingerprint of each adjacency list.
• Corollary: Can monitor connectivity in a dynamic graph stream where edges are both inserted and deleted.
• Note: Previous stream work assumed no edge deletions, e.g.,
  • [Feigenbaum, Kannan, McGregor, Suri, Zhang 2004, 2005], [McGregor 2005]
  • [Jowhari, Ghodsi 2005], [Zelke 2008], [Sarma, Gollapudi, Panigrahy 2008, 2009]
  • [Ahn, Guha 2009, 2011], [Konrad, Magniez, Mathieu 2012], [Goel, Kapralov, Khanna 2012]
This can’t be possible?!
• Suppose there’s a bridge (u,v) in the graph, i.e., Alice and Bob have a friendship that is essential to global connectivity.
• It seems that at least one of their fingerprints needs Ω(n) bits:
  ‣ One of their fingerprints must contain info about the bridge.
  ‣ Alice and Bob don’t know their friendship is special.
  ‣ Alice and Bob may each have Ω(n) friends.
How we do it...
• Template: Exploit homomorphic properties of linear sketches and emulate a classical algorithm in sketch space.
[Figure: Original Graph → Sketch → Sketch Space; the Algorithm run in either space yields the same ANSWER.]
Ingredient 1: Basic Algorithm
Algorithm (Spanning Forest):
1. For each node: pick an incident edge.
2. For each connected component: pick an incident edge leaving the component.
3. Repeat step 2 until there are no edges between connected components.
Lemma: After O(log n) rounds, the selected edges include a spanning forest.
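The rounds above can be sketched directly on an explicit edge list, before any sketching is involved. This is a minimal Borůvka-style illustration; the union-find bookkeeping, the function name `spanning_forest`, and the example graph are assumptions rather than part of the slides.

```python
def spanning_forest(n, edges):
    """Each round, every component picks one edge leaving it; merge and repeat."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]   # path halving
            x = parent[x]
        return x

    forest = []
    while True:
        pick = {}                           # component root -> chosen outgoing edge
        for u, v in edges:
            ru, rv = find(u), find(v)
            if ru != rv:
                pick.setdefault(ru, (u, v))
                pick.setdefault(rv, (u, v))
        if not pick:                        # no edges between components remain
            return forest
        for u, v in pick.values():
            ru, rv = find(u), find(v)
            if ru != rv:                    # skip edges made redundant this round
                parent[ru] = rv
                forest.append((u, v))

# Hypothetical example: a triangle, a separate edge, and an isolated node.
forest = spanning_forest(6, [(0, 1), (1, 2), (2, 0), (3, 4)])
```

Every component with an outgoing edge merges with another in each round, so the number of such components at least halves per round, which is where the O(log n) bound comes from.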
Ingredient 2: Sketching Neighborhoods
For node i, let a_i be a vector indexed by node pairs. Non-zero entries, one per edge {i,j}: a_i[i,j] = 1 if j > i and a_i[i,j] = -1 if j < i.
Example (5 nodes; coordinates in the order {1,2} {1,3} {1,4} {1,5} {2,3} {2,4} {2,5} {3,4} {3,5} {4,5}):
a_1       = (  1  1  0  0  0  0  0  0  0  0 )
a_2       = ( -1  0  0  0  1  0  0  0  0  0 )
a_1 + a_2 = (  0  1  0  0  1  0  0  0  0  0 )
Lemma: For any subset of nodes S ⊂ V, support(Σ_{i ∈ S} a_i) = E(S, V \ S).
Lemma: There exists a random M: ℝ^N → ℝ^{polylog N} such that for any a ∈ ℝ^N, can deduce some e ∈ support(a) from Ma. [Jowhari, Saglam, Tardos 2011]
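The first lemma can be checked by brute force on a small graph: edges inside S appear once with +1 and once with -1 and cancel, so only cut edges survive. In the sketch below, the edges {1,2}, {1,3}, {2,3} match the example vectors, while the edges {3,4} and {4,5} and all helper names are assumptions added for illustration.

```python
from itertools import combinations

N_NODES = 5
EDGES = [(1, 2), (1, 3), (2, 3), (3, 4), (4, 5)]   # last two edges are hypothetical
PAIRS = list(combinations(range(1, N_NODES + 1), 2))

def node_vector(i):
    """a_i as a list over PAIRS: +1 if the neighbour is larger, -1 if smaller."""
    a = [0] * len(PAIRS)
    for u, v in EDGES:
        u, v = min(u, v), max(u, v)
        if i == u:
            a[PAIRS.index((u, v))] = 1
        elif i == v:
            a[PAIRS.index((u, v))] = -1
    return a

def support_of_sum(S):
    """Pairs with a non-zero entry in the sum of a_i over i in S."""
    total = [0] * len(PAIRS)
    for i in S:
        total = [t + x for t, x in zip(total, node_vector(i))]
    return {PAIRS[k] for k, t in enumerate(total) if t != 0}

def cut(S):
    """Edges with exactly one endpoint in S."""
    return {(min(u, v), max(u, v)) for u, v in EDGES if (u in S) != (v in S)}

lemma_holds = all(
    support_of_sum(set(S)) == cut(set(S))
    for size in range(1, N_NODES)
    for S in combinations(range(1, N_NODES + 1), size)
)
```

For S = {1, 2} the internal edge {1,2} cancels and the support is {{1,3}, {2,3}}, as in the a_1 + a_2 example.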
Recipe: Sketch & Compute on Sketches
Sketch for node j: Ma_j
Run Algorithm in Sketch Space:
• Use Ma_j to get an incident edge on each node j.
• For i = 2 to log n: to get an incident edge on component S ⊂ V, use
  Σ_{j ∈ S} Ma_j = M(Σ_{j ∈ S} a_j) → e ∈ support(Σ_{j ∈ S} a_j) = E(S, V \ S)
Detail: Actually each player sends log n independent sketches M_1 a_j, M_2 a_j, ... and the central player uses M_i a_j when emulating the i-th iteration of the algorithm.
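To make the "add the sketches, then decode an edge" step concrete, the sketch below swaps in a much weaker linear sketch than the one cited: a three-number 1-sparse recovery sketch that only decodes when the summed vector has exactly one non-zero entry, such as a component whose only outgoing edge is a bridge. The constants `P` and `R`, the path graph, and all function names are assumptions; the sampler of [Jowhari, Saglam, Tardos 2011] handles arbitrary supports and is not reproduced here.

```python
from itertools import combinations

P = (1 << 61) - 1          # prime modulus (assumption)
R = 1234567                # fixed stand-in for a random evaluation point (assumption)
N_NODES = 4
EDGES = [(1, 2), (2, 3), (3, 4)]                   # hypothetical path graph
PAIRS = list(combinations(range(1, N_NODES + 1), 2))

def node_vector(i):
    a = [0] * len(PAIRS)
    for u, v in EDGES:
        if i == u:
            a[PAIRS.index((u, v))] = 1
        elif i == v:
            a[PAIRS.index((u, v))] = -1
    return a

def sketch(x):
    """Linear sketch: (sum x_k, sum (k+1) x_k, sum x_k R^(k+1) mod P)."""
    s0 = sum(x)
    s1 = sum((k + 1) * xk for k, xk in enumerate(x))
    s2 = sum(xk * pow(R, k + 1, P) for k, xk in enumerate(x)) % P
    return (s0, s1, s2)

def add(sa, sb):
    return (sa[0] + sb[0], sa[1] + sb[1], (sa[2] + sb[2]) % P)

def recover(sk):
    """Return the single supported pair, or None if the 1-sparse test fails."""
    s0, s1, s2 = sk
    if s0 == 0 or s1 % s0 != 0:
        return None
    k = s1 // s0 - 1
    if not 0 <= k < len(PAIRS):
        return None
    if (s0 * pow(R, k + 1, P)) % P != s2:   # fingerprint check against false decodes
        return None
    return PAIRS[k]

sketches = {j: sketch(node_vector(j)) for j in range(1, N_NODES + 1)}
edge_from_12 = recover(add(sketches[1], sketches[2]))   # component S = {1, 2}
edge_from_34 = recover(add(sketches[3], sketches[4]))   # component S = {3, 4}
```

Both components decode the bridge {2,3} from the summed sketches alone, even though neither endpoint's sketch singles that edge out. Node 2's own vector has two non-zero entries, so `recover(sketches[2])` returns None; handling such cases is what the cited sampler adds.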
Extension to Sparsification
• Theorem: Can test k-connectivity using O(k polylog n) bit fingerprints of each adjacency list.
• Theorem: Can (1+ε)-approximate every graph cut using O(ε^-2 polylog n) bit fingerprints of each adjacency list.
• Theorem: Can construct a spectral sparsifier H using O(ε^-2 n^(2/3) polylog n) bit fingerprints of each adjacency list, i.e., in the standard sense that (1-ε) x^T L_G x ≤ x^T L_H x ≤ (1+ε) x^T L_G x for all x ∈ ℝ^n, where L_G and L_H are the Laplacians of G and H.
k-Connectivity
Basic Algorithm: For i = 1 to k:
• Let F_i be a spanning forest of G(V, E - F_1 - ... - F_{i-1}).
Lemma: For every cut in G, F_1 + ... + F_k contains either all the edges across the cut or ≥ k of them. Call such a graph a k-skeleton.
Emulation in Sketch Space
Sketch: Simultaneously construct k independent connectivity sketches M_1(G), M_2(G), ..., M_k(G).
Run Algorithm in Sketch Space:
• Use M_1(G) to find a spanning forest F_1 of G.
• Use M_2(G) - M_2(F_1) = M_2(G - F_1) to find F_2.
• Use M_3(G) - M_3(F_1) - M_3(F_2) = M_3(G - F_1 - F_2) to find F_3.
• ...
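The basic (non-sketched) peeling algorithm is short enough to check by brute force. In this minimal sketch a greedy union-find forest stands in for the sketch-based spanning forest routine, and the complete graph on 5 nodes with k = 2 is a hypothetical test input; the function names are assumptions.

```python
from itertools import combinations

def spanning_forest(n, edges):
    """Greedy spanning forest via union-find (stand-in for the sketch-based step)."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    forest = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            forest.append((u, v))
    return forest

def k_skeleton(n, edges, k):
    """Union of k forests, each computed after removing the previous ones."""
    remaining, skeleton = list(edges), []
    for _ in range(k):
        forest = spanning_forest(n, remaining)
        skeleton.extend(forest)
        chosen = set(forest)
        remaining = [e for e in remaining if e not in chosen]
    return skeleton

def crossing(S, es):
    """Number of edges in es with exactly one endpoint in S."""
    return sum((u in S) != (v in S) for u, v in es)

n, k = 5, 2
edges = list(combinations(range(n), 2))            # complete graph K5
skeleton = k_skeleton(n, edges, k)
lemma_holds = all(
    crossing(set(S), skeleton) >= min(k, crossing(set(S), edges))
    for size in range(1, n)
    for S in combinations(range(n), size)
)
```

The check passes because each forest must cross any cut that still has edges left, so every round contributes at least one new crossing edge until the cut is exhausted.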
(1+ε)-Approx of All Cuts
Theorem (Fung et al.): Sample each edge e w/p p_e and weight it by 1/p_e. If p_e = ε^-2 log^2 n / c_e, where c_e is the size of the min u-v cut for e = (u,v), then all cuts are preserved up to a factor 1+ε.
Algorithm: Let G_i be the graph with edges sampled w/p 2^-i. Construct a k-skeleton H_i for each G_i, where k = 2 ε^-2 log^2 n.
Theorem: e is in some H_i w/p at least p_e.
Proof: Let C be the edges in a min u-v cut in G.
i              1      2      3      4       ...  -log p_e       ...  log n
P[e ∈ G_i]     1/2    1/4    1/8    1/16    ...  p_e            ...  1/n
E[|C ∩ G_i|]   c_e/2  c_e/4  c_e/8  c_e/16  ...  ε^-2 log^2 n   ...  c_e/n
For i = -log p_e, we have |C ∩ G_i| < k by the Chernoff bound. Hence e ∈ H_i iff e ∈ G_i, which happens w/p p_e.
II. Misalignment
Outline: I. Connectivity; II. Misalignment
a) Testing Equality with Rotation
b) Matching Lower Bound
Joint work with Alexandr Andoni, Assaf Goldberger, Ely Porat
Fingerprints for Rotation
[Figure: “The quick brown fox jumped over the lazy dog.” → CYCLIC ROTATION → “quick brown fox jumped over the lazy dog. The”]
• Theorem: There’s a D(n) polylog n bit fingerprint F that is:
  ‣ Useful: F(a) and F(b) determine whether a, b ∈ ℤ^n are rotations of each other w.h.p.
  ‣ Homomorphic: From F(a) can construct F(any rotation of a).
  ‣ Linear: From F(a) and F(b) can compute F(a+b).
• Theorem: Fingerprints with the above properties need D(n) bits.
• Extension: (t + D(n)) polylog n bit fingerprints F(a) and F(b) determine whether a, b are within t substitutions of being rotations.
False Start: Fermat’s Little Theorem
Rabin-Karp: For some p and r, encode a = a_0 a_1 a_2 ... a_{n-1} as
  f(r, a) = a_0 + a_1 r + a_2 r^2 + ... + a_{n-1} r^{n-1} mod p
Fermat’s Little Thm: If p = n+1 is prime, then r^n = 1 mod p (for r ≠ 0 mod p) and so
  r f(r, a_0 a_1 ... a_{n-1}) = a_0 r + a_1 r^2 + a_2 r^3 + ... + a_{n-1} r^n
                              = a_{n-1} + a_0 r + a_1 r^2 + ... + a_{n-2} r^{n-1}
                              = f(r, a_{n-1} a_0 ... a_{n-2})
So, if b is a k-shift of a, then g(r) = r^k f(r, a) - f(r, b) = 0.
Schwartz-Zippel: If r is random and g is non-zero: P[g(r) = 0] ≤ (n-1)/p = 1 - O(1/n)
Conclusion: No false negatives but likely false positives.
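The identity r · f(r, a) = f(r, rotation of a) can be verified exhaustively for a small case. The sketch below uses n = 6 and p = 7; the sample strings and helper names are assumptions. It also shows one way false positives arise: when r is a primitive root mod p, the powers r^k cover every non-zero residue, so any two strings with non-zero fingerprints pass the test for some k.

```python
def f(r, a, p):
    """Rabin-Karp style fingerprint: sum a_i r^i mod p."""
    return sum(ai * pow(r, i, p) for i, ai in enumerate(a)) % p

def rotate(a, k=1):
    """Cyclic shift so that a_{n-1} moves to the front when k = 1."""
    k %= len(a)
    return a[-k:] + a[:-k] if k else list(a)

n, p = 6, 7                       # p = n + 1 is prime
a = [3, 1, 4, 1, 5, 9]            # hypothetical input string

# Homomorphic property: multiplying by r^k matches rotating by k, for every r.
homomorphic = all(
    (pow(r, k, p) * f(r, a, p)) % p == f(r, rotate(a, k), p)
    for r in range(1, p)
    for k in range(n)
)

# False positive: x and y are not rotations, yet some shift k passes at r = 3.
x = [1, 0, 0, 0, 0, 0]
y = [1, 1, 0, 0, 0, 0]
false_positive = any(
    (pow(3, k, p) * f(3, x, p)) % p == f(3, y, p) for k in range(n)
)
```

Both flags evaluate to True, matching the slide's conclusion: rotations are never rejected, but non-rotations are easily accepted.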
Beyond Schwartz-Zippel
Evaluate g on roots of x^n - 1, but work in a larger field.
x^n - 1 factorizes as D(n) irreducible polys over the rationals, e.g.,
  x^10 - 1 = Φ_1(x) Φ_2(x) Φ_5(x) Φ_10(x) = (x - 1)(1 + x)(1 + x + x^2 + x^3 + x^4)(1 - x + x^2 - x^3 + x^4)
At least one Φ_i has no shared roots with g:
• If Φ_i shares one root with g, then Φ_i divides g (Abel’s Irred. Thm).
• The Φ_i can’t all divide g because g is non-zero with degree ≤ n-1.
Suffices to test g on an arbitrary root of each Φ_i.
Bad News: Can’t guarantee g(r) has finite precision.
Good News: Work modulo a random p. Can show Φ_i still doesn’t share roots with g w.h.p. by analyzing the resultant.
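The n = 10 factorization can be confirmed by multiplying the four cyclotomic polynomials and counting divisors. This is a small arithmetic check only; the coefficient-list representation (lowest degree first) and the helper `poly_mul` are assumptions made for illustration.

```python
def poly_mul(a, b):
    """Multiply two polynomials given as coefficient lists, lowest degree first."""
    out = [0] * (len(a) + len(b) - 1)
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            out[i + j] += x * y
    return out

# Cyclotomic polynomials Phi_d for the divisors d of 10.
phi = {
    1: [-1, 1],                # x - 1
    2: [1, 1],                 # 1 + x
    5: [1, 1, 1, 1, 1],        # 1 + x + x^2 + x^3 + x^4
    10: [1, -1, 1, -1, 1],     # 1 - x + x^2 - x^3 + x^4
}

product = [1]
for d in (1, 2, 5, 10):
    product = poly_mul(product, phi[d])

divisors = [d for d in range(1, 11) if 10 % d == 0]
```

The product is x^10 - 1, and there is one factor per divisor, so D(10) = 4. The degrees sum to exactly n = 10, which is why a non-zero g of degree at most n-1 cannot be divisible by all of them.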
Lower Bound: Basic Idea
Can recover D(n) bits about a from F(a) by summing the fingerprints of rotations.
To deduce α = Σ_i a_i from F(a_0 a_1 a_2 a_3 a_4 a_5):
  F(a_0 a_1 a_2 a_3 a_4 a_5) + F(a_1 a_2 a_3 a_4 a_5 a_0) + ... + F(a_5 a_0 a_1 a_2 a_3 a_4) = F(αααααα)
and compare with F(gggggg) for all g until it matches.
To deduce β = a_1 + a_3 + a_5:
  F(a_0 a_1 a_2 a_3 a_4 a_5) + F(a_2 a_3 a_4 a_5 a_0 a_1) + F(a_4 a_5 a_0 a_1 a_2 a_3) = F(γβγβγβ), where γ = a_0 + a_2 + a_4 = α - β
and compare with F(g'gg'gg'g) for all g, g' = α - g until it matches.
And so on for other divisors of n...
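The vector identities behind this argument are easy to check for n = 6. Because F is linear and homomorphic, the fingerprint of each sum below can be computed from F(a) alone; the sketch only verifies the sums themselves, and the sample vector and helper names are assumptions.

```python
def rotate_left(a, k):
    """Cyclic left shift by k: a_k a_{k+1} ... a_{k-1}."""
    k %= len(a)
    return a[k:] + a[:k]

def add(u, v):
    return [x + y for x, y in zip(u, v)]

a = [3, 1, 4, 1, 5, 9]                 # hypothetical input, n = 6
alpha = sum(a)                         # a_0 + ... + a_5
beta = a[1] + a[3] + a[5]
gamma = alpha - beta                   # a_0 + a_2 + a_4

all_rotations = [0] * 6
for k in range(6):                     # every rotation: divisor 1 of n
    all_rotations = add(all_rotations, rotate_left(a, k))

even_rotations = [0] * 6
for k in (0, 2, 4):                    # rotations by multiples of 2: divisor 2 of n
    even_rotations = add(even_rotations, rotate_left(a, k))
```

Summing all rotations gives the constant vector (α, ..., α), and summing the rotations by 0, 2 and 4 gives the period-2 vector (γ, β, γ, β, γ, β). Using rotations by multiples of 3 would similarly expose sums over residue classes mod 3.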
Thanks!
• Homomorphic Sketches: Compress using sketches such that we can run algorithms on compressed data directly. Resulting algorithms are parallelizable and streamable.
• Graphs: Dimensionality reduction for preserving structural properties. Enables dynamic graph streaming.
• Fingerprinting with Misalignments: Tight bounds on size of fingerprint necessary for testing equality up to rotations.