Piccolo: Building fast distributed programs with partitioned tables
Russell Power, Jinyang Li
New York University
Motivating Example: PageRank

    Repeat until convergence:
        for each node X in graph:
            for each edge X -> Z:
                next[Z] += curr[X]

[Figure: input graph (A -> B,C,D; B -> E; C -> D; ...) alongside the Curr and Next rank tables across iterations. The tables fit in memory!]
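For reference, the same loop as a runnable single-machine Python sketch (the toy graph and fixed iteration count are illustrative assumptions; the slide's update elides the division by out-degree that the later kernels include):

    # Minimal single-machine PageRank sketch; illustrative, not Piccolo code.
    graph = {"A": ["B", "C", "D"], "B": ["E"], "C": ["D"],
             "D": ["A"], "E": ["A"]}

    curr = {node: 1.0 / len(graph) for node in graph}
    for _ in range(50):                       # "repeat until convergence"
        nxt = {node: 0.0 for node in graph}
        for x, out in graph.items():
            for z in out:
                nxt[z] += curr[x] / len(out)  # contribution along edge X -> Z
        curr = nxt

    print(curr)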
PageRank in MapReduce

Data flow models do not expose global state.

[Figure: three workers exchange a rank stream (A: 0.1, B: 0.2) and a graph stream (A -> B,C; B -> D) through distributed storage on every iteration.]
PageRank with MPI/RPC

The user explicitly programs communication.

[Figure: three workers each hold a graph partition (A -> B,C; B -> D; C -> E,F) and a ranks table, exchanging updates directly; input comes from distributed storage.]
Piccolo's Goal: Distributed Shared State

Distributed in-memory state with read/write access, loaded from distributed storage.

[Figure: graph and ranks tables held in memory, shared by all workers.]
Piccolo's Goal: Distributed Shared State

The Piccolo runtime handles communication.

[Figure: the graph and ranks tables are partitioned across three workers; the runtime routes reads and writes between them.]
[Figure: design space plotted on two axes, ease of use vs. performance.]
Talk outline
- Motivation
- Piccolo's Programming Model
- Runtime Scheduling
- Evaluation
Programming Model

Implemented as a library for C++ and Python.

Table operations: get/put (read/write), update, iterate.

[Figure: kernels on three workers read and write shared graph (A -> B,C; B -> D) and ranks tables.]
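To make the call shapes concrete, here is a toy dict-backed stand-in for a table (an assumption for illustration, not Piccolo's API):

    # Toy in-memory "table"; demonstrates get/put, update, and iterate.
    class ToyTable(dict):
        def get(self, k):
            return dict.get(self, k, 0.0)
        def put(self, k, v):
            self[k] = v
        def update(self, k, v):          # accumulate=sum behavior
            self[k] = self.get(k) + v

    ranks = ToyTable()
    ranks.put("A", 0.25)                 # write
    r = ranks.get("A")                   # read
    ranks.update("A", 0.05)              # accumulated write
    for page, rank in ranks.items():     # iterate
        print(page, rank)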
Naïve PageRank with Piccolo

    curr = Table(key=PageID, value=double)
    next = Table(key=PageID, value=double)

    # Jobs run by many machines
    def pr_kernel(graph, curr, next):
        i = my_instance
        n = len(graph) / NUM_MACHINES
        for s in graph[(i - 1) * n : i * n]:
            for t in s.out:
                next[t] += curr[s.id] / len(s.out)

    # Run by a single controller, which launches jobs in parallel
    def main():
        for i in range(50):
            launch_jobs(NUM_MACHINES, pr_kernel, graph, curr, next)
            swap(curr, next)
            next.clear()
Naïve PageRank is Slow

[Figure: every table access is a remote get or put between workers; each rank update crosses the network.]
PageRank: Exploiting Locality

    # Control table partitioning; co-locate tables
    curr = Table(…, partitions=100, partition_by=site)
    next = Table(…, partitions=100, partition_by=site)
    group_tables(curr, next, graph)

    def pr_kernel(graph, curr, next):
        for s in graph.get_iterator(my_instance):
            for t in s.out:
                next[t] += curr[s.id] / len(s.out)

    def main():
        for i in range(50):
            launch_jobs(curr.num_partitions, pr_kernel,
                        graph, curr, next,
                        locality=curr)   # co-locate execution with table
            swap(curr, next)
            next.clear()
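One plausible reading of partition_by=site, sketched as a standalone helper (assumes URL keys; this helper is not part of Piccolo's API): pages from one host hash to the same partition, so most next[t] updates stay on the local worker.

    from urllib.parse import urlparse

    def site_partition(page_url, num_partitions=100):
        # Note: hash() is salted per process in Python; a real partitioner
        # would use a stable hash so partitioning survives restarts.
        site = urlparse(page_url).netloc
        return hash(site) % num_partitions

    print(site_partition("http://example.com/page1"))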
Exploiting Locality

[Figure: with co-located partitions, gets and puts now hit the worker's own partition; only links that cross sites require remote puts.]
Synchronization

How to handle synchronization?

[Figure: workers 2 and 3 concurrently issue put(a=0.3) and put(a=0.2) to the same key in worker 1's ranks partition — the writes conflict.]
Synchronization Primitives

Avoid write conflicts with accumulation functions:
    NewValue = Accum(OldValue, Update)
    e.g. sum, product, min, max

Global barriers are sufficient. Tables provide release consistency.
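A minimal sketch of the accumulation rule in plain Python (not the Piccolo runtime): because the accumulator commutes, concurrent updates to one key yield the same value in any arrival order, so no locking is needed.

    def apply_update(table, key, update, accum=lambda old, up: old + up):
        # NewValue = Accum(OldValue, Update); 0.0 is sum's identity
        table[key] = accum(table.get(key, 0.0), update)

    ranks = {}
    apply_update(ranks, "a", 0.3)   # e.g. arriving from worker 2
    apply_update(ranks, "a", 0.2)   # e.g. arriving from worker 3
    assert ranks["a"] == 0.5        # same total in either order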
PageRank: Efficient Synchronization

    # Accumulation via sum
    curr = Table(…, partition_by=site, accumulate=sum)
    next = Table(…, partition_by=site, accumulate=sum)
    group_tables(curr, next, graph)

    def pr_kernel(graph, curr, next):
        for s in graph.get_iterator(my_instance):
            for t in s.out:
                # update() invokes the accumulation function
                next.update(t, curr.get(s.id) / len(s.out))

    def main():
        for i in range(50):
            handle = launch_jobs(curr.num_partitions, pr_kernel,
                                 graph, curr, next, locality=curr)
            barrier(handle)   # explicitly wait between iterations
            swap(curr, next)
            next.clear()
Efficient Synchronization

Workers buffer updates locally; the runtime computes the sum. Tables provide release consistency.

[Figure: in place of conflicting puts, workers 2 and 3 send update(a, 0.3) and update(a, 0.2); worker 1's runtime accumulates them into ranks[a].]
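One way to picture the buffering, as a sketch under my own assumptions (not Piccolo's implementation): each worker folds pending updates into a per-key buffer using the table's accumulator, then ships one combined update per key at a release point such as the barrier.

    import collections

    class UpdateBuffer:
        """Pre-combines updates locally with the table's accumulator."""
        def __init__(self, accum, identity=0.0):
            self.accum = accum
            self.pending = collections.defaultdict(lambda: identity)

        def update(self, key, value):
            self.pending[key] = self.accum(self.pending[key], value)

        def flush(self, send):
            # Called at a release point: one message per key, not per update.
            for key, value in self.pending.items():
                send(key, value)
            self.pending.clear()

    buf = UpdateBuffer(accum=lambda old, up: old + up)
    buf.update("a", 0.3)
    buf.update("a", 0.2)
    buf.flush(send=lambda k, v: print(k, v))   # prints: a 0.5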
Table Consistency

[Figure: the same flow as above — workers issue update(a, 0.3) and update(a, 0.2) rather than puts, and the runtime applies them through the accumulation function.]
PageRank with Checkpointing

    curr = Table(…, partition_by=site, accumulate=sum)
    next = Table(…, partition_by=site, accumulate=sum)
    group_tables(curr, next)

    def pr_kernel(graph, curr, next):
        for s in graph.get_iterator(my_instance):
            for t in s.out:
                next.update(t, curr.get(s.id) / len(s.out))

    def main():
        # Restore previous computation, if any
        curr, userdata = restore()
        last = userdata.get('iter', 0)
        for i in range(last, 50):
            handle = launch_jobs(curr.num_partitions, pr_kernel,
                                 graph, curr, next, locality=curr)
            # User decides which tables to checkpoint and when
            cp_barrier(handle, tables=(next,), userdata={'iter': i})
            swap(curr, next)
            next.clear()
Recovery via Checkpointing

The runtime writes checkpoints to distributed storage using the Chandy-Lamport protocol.

[Figure: each worker's graph and ranks partitions are checkpointed to distributed storage.]
Talk Outline
- Motivation
- Piccolo's Programming Model
- Runtime Scheduling
- Evaluation
Load Balancing

The master coordinates work-stealing.

[Figure: kernel instances J1–J6 run over partitions P1–P6 on three workers; an idle worker steals J6, which operates on partition P6. "Other workers are updating P6!" — so updates to P6 are paused during the migration.]
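A hedged sketch of the master's work-stealing step (every name and helper below is illustrative, not Piccolo's scheduler): an idle worker takes an unstarted kernel instance from the most loaded worker, and updates to the migrating partition are paused for the move.

    from dataclasses import dataclass, field

    @dataclass
    class Job:
        partition: str

    @dataclass
    class Worker:
        name: str
        unstarted: list = field(default_factory=list)
        def assign(self, job):
            print(f"{self.name} runs {job.partition}")

    # Stubs standing in for runtime machinery (hypothetical helpers).
    def pause_updates(p):  print(f"pause updates to {p}")
    def migrate_partition(p, src, dst): print(f"move {p}: {src.name} -> {dst.name}")
    def resume_updates(p): print(f"resume updates to {p}")

    def steal_work(idle, workers):
        victim = max(workers, key=lambda w: len(w.unstarted))
        if victim is idle or not victim.unstarted:
            return None
        job = victim.unstarted.pop()   # steal an unstarted kernel instance
        pause_updates(job.partition)   # other workers may be updating it!
        migrate_partition(job.partition, victim, idle)
        resume_updates(job.partition)
        idle.assign(job)
        return job

    w1 = Worker("w1", [Job("P5"), Job("P6")])
    w3 = Worker("w3")
    steal_work(w3, [w1, w3])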
Talk Outline
- Motivation
- Piccolo's Programming Model
- System Design
- Evaluation
Piccolo is Fast

[Figure: PageRank iteration time (seconds) vs. number of workers (8, 16, 32, 64) for Hadoop and Piccolo; Piccolo is substantially faster at every scale.]

Main Hadoop overheads: sorting, HDFS, serialization.

Setup: NYU cluster, 12 nodes, 64 cores, 100M-page graph.
Piccolo Scales Well

[Figure: PageRank iteration time (seconds) vs. workers (12, 24, 48, 100, 200) against the ideal scaling curve; EC2 cluster with the input graph scaled linearly with workers, up to a 1-billion-page graph.]
Other Applications

Iterative applications (no straightforward Hadoop implementation):
- N-body simulation
- Matrix multiply

Asynchronous applications:
- Distributed web crawler
Related Work

Data flow: MapReduce, Dryad
Tuple spaces: Linda, JavaSpaces
Distributed shared memory: CRL, TreadMarks, Munin, Ivy; UPC, Titanium
Conclusion

Distributed shared table model. User-specified policies provide:
- Effective use of locality
- Efficient synchronization
- Robust failure recovery
[Gratuitous cat picture] I can haz kwestions?

Try it out: piccolo.news.cs.nyu.edu