Communication Complexity in the Field: New Questions from Practice
Qin Zhang, Indiana University Bloomington
BIRS Workshop, March 20, 2017
This talk
Not about a particular problem. I will present a few new questions that I have encountered when trying to apply communication complexity in various settings.
Agenda
I will talk about:
1. Number-in-hand CC with input sharing
   – Distributed computation of graph problems
2. Primitive problems overlap; direct-sum does not apply
   – Distributed joins
3. Higher LB in simultaneous comm. than in one-way comm.?
   – Sketching edit distance
Distributed graph computation
Real-world systems: Pregel, Giraph, GPS, GraphLab, etc.
The coordinator model
We have k machines (sites) S_1, ..., S_k and one central server (the coordinator C).
– Each site has a two-way comm. channel with the coordinator.
– Each site S_i has a piece of data x_i.
– Task: compute f(x_1, ..., x_k) together via communication, for some function f. The coordinator outputs the answer.
– Goal: minimize the total communication.
[Figure: coordinator C connected to sites S_1, S_2, S_3, ..., S_k]
Distributed graph computation
Let’s think about the graph connectivity problem: k sites each hold a portion of a graph. Goal: compute whether the graph is connected.
[Figure: coordinator C connected to sites S_1, S_2, S_3, ..., S_k]
A trivial solution: each S_i sends a local spanning forest to C. Cost: O(kn log n) bits, where n is the number of nodes of the graph.
Can we do better, e.g., o(kn) bits of comm. in total?
If the graph is edge partitioned among the k sites, no: Ω(kn) [Woodruff, Z. ’13].
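The trivial solution can be sketched as follows. This is a minimal illustration, not a real system: the names `DSU`, `local_spanning_forest`, and `coordinator_connected` are hypothetical, nodes are assumed to be labeled 0..n-1, and all k "sites" are simulated inside one process. Each forest has at most n - 1 edges and each edge takes O(log n) bits to describe, which is where the O(kn log n) total comes from.

```python
from typing import List, Tuple

Edge = Tuple[int, int]


class DSU:
    """Union-find over nodes 0..n-1 (hypothetical helper)."""

    def __init__(self, n: int) -> None:
        self.parent = list(range(n))
        self.components = n

    def find(self, x: int) -> int:
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: int, b: int) -> bool:
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        self.parent[ra] = rb
        self.components -= 1
        return True


def local_spanning_forest(n: int, edges: List[Edge]) -> List[Edge]:
    """Run at a site: keep only edges that merge two local components."""
    dsu = DSU(n)
    forest = []
    for u, v in edges:
        if dsu.union(u, v):
            forest.append((u, v))
    return forest  # at most n - 1 edges


def coordinator_connected(n: int, site_edges: List[List[Edge]]) -> bool:
    """Run at the coordinator: union the k forests it receives."""
    dsu = DSU(n)
    for edges in site_edges:
        for u, v in local_spanning_forest(n, edges):  # the "message" from one site
            dsu.union(u, v)
    return dsu.components == 1


sites_a = [[(0, 1), (1, 2), (0, 2)], [(2, 3)]]
sites_b = [[(0, 1)], [(2, 3)]]
print(coordinator_connected(4, sites_a), coordinator_connected(4, sites_b))
```

Dropping the cycle-closing edge (0, 2) at the first site does not change connectivity, which is why sending forests instead of full edge sets is safe.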
LB graph for edge partition
For each i ∈ [k], (X_i, Y) ∼ µ, where µ is a hard input distribution for set-disjointness.
Each site S_i, holding X_i = {X_{i,1}, ..., X_{i,n}}, creates an edge (u_i, v_j) for each j with X_{i,j} = 1.
The coordinator, holding Y = {Y_1, ..., Y_n}, creates a path containing {v_j | Y_j = 1} and a path containing {v_j | Y_j = 0}.
[Figure: bottom nodes u_1, u_2, u_3, ..., u_k (holding X_1, X_2, X_3, ..., X_k); top nodes v_j arranged as one path over {v_j | Y_j = 1} and one path over {v_j | Y_j = 0}]
Graph connected ⇔ DISJ(X_1, Y) ∨ ... ∨ DISJ(X_k, Y) = 1 (LB: Ω(kn))
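The construction can be checked on small instances with the sketch below. Several things here are assumptions made for illustration and are not stated on the slide: indices are 0-based, the inputs are passed as Python sets, DISJ(X_i, Y) = 1 is read as "X_i and Y share an element" (the reading under which the stated equivalence holds), and the instance satisfies the side conditions that Y and its complement are both nonempty and every X_i contains at least one element outside Y. The function names are hypothetical.

```python
from collections import defaultdict, deque


def build_lb_graph(X, Y, n):
    """Build the LB graph: X is a list of k sets, Y is a set, all over {0..n-1}."""
    nodes = [("u", i) for i in range(len(X))] + [("v", j) for j in range(n)]
    edges = []
    # Site S_i: one edge (u_i, v_j) per element j of X_i.
    for i, Xi in enumerate(X):
        edges += [(("u", i), ("v", j)) for j in sorted(Xi)]
    # Coordinator: one path over {v_j : j in Y}, one path over the rest.
    in_Y = [j for j in range(n) if j in Y]
    out_Y = [j for j in range(n) if j not in Y]
    for group in (in_Y, out_Y):
        edges += [(("v", a), ("v", b)) for a, b in zip(group, group[1:])]
    return nodes, edges


def is_connected(nodes, edges):
    """Plain BFS connectivity check."""
    adj = defaultdict(list)
    for a, b in edges:
        adj[a].append(b)
        adj[b].append(a)
    seen = {nodes[0]}
    queue = deque([nodes[0]])
    while queue:
        x = queue.popleft()
        for y in adj[x]:
            if y not in seen:
                seen.add(y)
                queue.append(y)
    return len(seen) == len(nodes)


def some_site_intersects(X, Y):
    """OR over sites of 'X_i and Y share an element'."""
    return any(Xi & Y for Xi in X)


Y = {0, 1}
for X in ([{2, 3}, {4}], [{2, 3}, {1, 4}]):
    nodes, edges = build_lb_graph(X, Y, 5)
    print(is_connected(nodes, edges), some_site_intersects(X, Y))
```

In the first instance no site touches the path over Y, so that path stays isolated; in the second, u_1 bridges the two paths.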
What if the graph is node partitioned?
In most practical systems, the graph is node partitioned. Can we prove a similar LB?
[Figure: the LB graph from the edge-partition construction, with bottom nodes u_1, u_2, u_3, ..., u_k and top nodes v_j on two paths]
Graph connected ⇔ DISJ(X_1, Y) ∨ ... ∨ DISJ(X_k, Y) = 1
Basically, only the bottom nodes (and their adjacent edges) are partitioned.
If we also partition the top nodes (and their adjacent edges), then the Ω(kn) LB does not hold.
Not a surprise: if a graph is node partitioned, Õ(n) suffices [Ahn, Guha, McGregor ’12].
Input sharing
To prove an LB in the node partition model, one needs to deal with input sharing: each edge may be stored in two sites. Need new techniques?
A concrete problem: Breadth-First Search tree
Given a node u, the parties want to jointly compute a BFS tree rooted at u. The coordinator outputs the final BFS tree. What is the comm. complexity?
Distributed joins
Set-intersection join
A_1, ..., A_m ⊆ [n] = {1, 2, ..., n}, and B_1, ..., B_m ⊆ [n]
[Figure: A_1, ..., A_m stacked as the matrix A (e.g., skills of applicants); B_1, ..., B_m stacked as the matrix B (e.g., skills required by job positions)]
Set-Intersection Join (cardinality version): SIJ(A, B) = |{(i, j) : C_{i,j} > 0}|, where C = A · B
An important operation in databases
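As a concrete reading of the definition, the sketch below computes SIJ exactly in two ways: directly on the sets, and through the product C = A · B of 0/1 indicator matrices. It assumes (this layout is not spelled out on the slide) that row i of A is the indicator vector of A_i and column j of B is the indicator vector of B_j, so that C_{i,j} = |A_i ∩ B_j|. The function names are hypothetical.

```python
def sij_sets(A, B):
    """Count pairs (i, j) with A_i and B_j sharing at least one element."""
    return sum(1 for Ai in A for Bj in B if Ai & Bj)


def sij_matrix(A, B, n):
    """Same count via C = A * B on 0/1 indicator matrices over [n] = {1..n}."""
    rows = [[1 if e in Ai else 0 for e in range(1, n + 1)] for Ai in A]  # m x n
    cols = [[1 if e in Bj else 0 for e in range(1, n + 1)] for Bj in B]  # columns of B
    count = 0
    for r in rows:
        for c in cols:
            c_ij = sum(x * y for x, y in zip(r, c))  # C[i][j] = |A_i ∩ B_j|
            if c_ij > 0:
                count += 1
    return count


A = [{1, 2}, {3}]      # e.g., skills of two applicants
B = [{2}, {4}, {1, 3}]  # e.g., skills required by three positions
print(sij_sets(A, B), sij_matrix(A, B, 4))
```

Both functions take time proportional to the full input, so they serve only as a reference for what a communication-efficient protocol is supposed to approximate.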
Set-intersection join (cont.)
The problem: estimate SIJ(A, B) up to a (1 + ε) factor. Useful, e.g., in query planning.
Current LB: Ω(n/ε^{2/3}) (Van Gucht, Williams, Woodruff, Z. ’15)
– For each i ∈ [m], choose (A_i, B_i) ∼ µ, where µ is a hard input distribution for set-disjointness.
– Define SUM(A, B) = Σ_{i ∈ [m]} DISJ(A_i, B_i).
– W.h.p. SIJ(A, B) = SUM(A, B) + m(m − 1).
– Using basically a direct-sum argument (Gap-Hamming + DISJ), any randomized algorithm that computes SUM(A, B) w.pr. 0.99 up to an additive error √m/2 needs Ω(mn) comm.
– Set m = 1/ε^{2/3} to get the Ω(n/ε^{2/3}) LB.
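The choice m = 1/ε^{2/3} can be motivated by the following back-of-the-envelope calculation. It is a sketch under two assumptions that are not spelled out on the slide: that the additive error tolerated for SUM is of order √m (the scale at which Gap-Hamming-type bounds apply), and that constants are ignored throughout.

```latex
\begin{align*}
\mathrm{SIJ}(A,B) &= \mathrm{SUM}(A,B) + m(m-1) \;\le\; m + m(m-1) = m^2
  && \text{(since } \mathrm{SUM}(A,B) \le m\text{)} \\
\bigl|\widetilde{\mathrm{SIJ}} - \mathrm{SIJ}(A,B)\bigr| &\le \epsilon \cdot \mathrm{SIJ}(A,B) \;\le\; \epsilon m^2
  && \text{(a } (1+\epsilon)\text{-approximation)} \\
\epsilon m^2 \lesssim \sqrt{m} &\iff m \lesssim \epsilon^{-2/3}
  && \text{(error small enough to recover SUM)} \\
\Omega(mn)\ \text{with}\ m = \epsilon^{-2/3} &\implies \Omega\!\left(n/\epsilon^{2/3}\right)
\end{align*}
```

In words: subtracting the known term m(m − 1) from a (1 + ε)-approximation of SIJ yields an estimate of SUM with additive error at most εm², and m is chosen as large as possible subject to that error staying within the assumed √m scale.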
Set-intersection join (cont.)
The current best UB: Õ(m/ε²), using an F_0-sketch; it is one-way.
Can we prove an Ω(n/ε²) LB?
It is not enough to apply a direct-sum type argument on (A_1, B_1), ..., (A_m, B_m), since each A_i is going to join with each B_j. In other words, the primitive problems overlap. Need new techniques?
Sketching threshold edit distance
Edit Distance
Definition: given two strings s, t ∈ Σ^n, ed(s, t) = the minimum number of character operations (insertion/deletion/substitution) that transform s into t.
Example: ed(banana, ananas) = 2
Applications: numerous. E.g., bioinformatics (measuring similarity between DNA sequences), automatic spelling correction.
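For reference, the definition can be evaluated with the textbook dynamic program below. This is the standard centralized algorithm, included only to make the definition concrete; it is not a sketching or communication protocol, and the function name `edit_distance` is chosen here for illustration.

```python
def edit_distance(s: str, t: str) -> int:
    """Minimum number of insertions, deletions, and substitutions turning s into t."""
    # prev[j] = distance between the processed prefix of s and t[:j]
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        cur = [i]  # distance from s[:i] to the empty prefix of t
        for j, ct in enumerate(t, start=1):
            cur.append(min(
                prev[j] + 1,               # delete cs
                cur[j - 1] + 1,            # insert ct
                prev[j - 1] + (cs != ct),  # match or substitute
            ))
        prev = cur
    return prev[-1]


print(edit_distance("banana", "ananas"))
```

For the slide's example, one optimal sequence deletes the leading "b" and appends "s".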
Problems
The threshold version of ED: given two strings s, t ∈ {0, 1}^n and a threshold K, output all the edits if ed(s, t) ≤ K; output “Error” otherwise.
Document exchange (one-way comm.): the party holding s sends a sketch sk(s) to the party holding t.
Applications: remote file sync; file transmission through a noisy channel
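The input/output behavior of the threshold problem can be illustrated with a centralized reference implementation: plain dynamic programming with a traceback. This is only a sketch of what a correct answer looks like when both strings are available in one place; it says nothing about how small sk(s) can be. The function name, the edit-tuple format, and the convention that positions refer to indices in s are all assumptions made for this example.

```python
def threshold_edits(s: str, t: str, K: int):
    """Return a list of edits turning s into t if ed(s, t) <= K, else "Error"."""
    n, m = len(s), len(t)
    if abs(n - m) > K:  # ed(s, t) is at least the length difference
        return "Error"
    # dp[i][j] = edit distance between s[:i] and t[:j]
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dp[i][0] = i
    for j in range(m + 1):
        dp[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(dp[i - 1][j] + 1,
                           dp[i][j - 1] + 1,
                           dp[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
    if dp[n][m] > K:
        return "Error"
    # Trace back one optimal alignment, collecting the edits.
    edits = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and s[i - 1] == t[j - 1] and dp[i][j] == dp[i - 1][j - 1]:
            i, j = i - 1, j - 1  # characters match, no edit
        elif i > 0 and j > 0 and s[i - 1] != t[j - 1] and dp[i][j] == dp[i - 1][j - 1] + 1:
            edits.append(("substitute", i - 1, t[j - 1]))
            i, j = i - 1, j - 1
        elif i > 0 and dp[i][j] == dp[i - 1][j] + 1:
            edits.append(("delete", i - 1, s[i - 1]))
            i -= 1
        else:
            edits.append(("insert", i, t[j - 1]))  # insert before index i of s
            j -= 1
    edits.reverse()
    return edits


print(threshold_edits("0101", "1010", 2))
print(threshold_edits("0101", "1010", 1))
```

The table costs O(n²) time and space; restricting it to the diagonal band of width 2K + 1 is a common refinement when K is small, at the price of more index bookkeeping.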