

  1. Piccolo: Building fast distributed programs with partitioned tables. Russell Power, Jinyang Li (New York University)

  2. Motivating Example: PageRank

    Repeat until convergence:
      for each node X in graph:
        for each edge X -> Z:
          next[Z] += curr[X]

    [Figure: input graph (A -> B,C,D; B -> E; C -> D; ...) alongside the curr and next rank tables. Callout: the graph and rank state fit in memory.]
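
    A minimal single-machine sketch of this loop in Python (illustrative data layout; each edge carries an equal share of the source's rank, as in the later Piccolo kernels, and damping/normalization are omitted as on the slide):

    # Minimal serial sketch of the update rule above (illustrative names).
    def pagerank_step(graph, curr):
        next_rank = {x: 0.0 for x in graph}
        for x, out_links in graph.items():
            if not out_links:
                continue
            share = curr[x] / len(out_links)   # each out-edge carries an equal share
            for z in out_links:
                next_rank[z] += share
        return next_rank

    graph = {"A": ["B", "C", "D"], "B": ["E"], "C": ["D"], "D": [], "E": []}
    curr = {x: 1.0 / len(graph) for x in graph}
    for _ in range(50):                        # "repeat until convergence"
        curr = pagerank_step(graph, curr)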

  3. PageRank in MapReduce • Data flow models do not expose global state. [Figure: each iteration streams the rank table and the graph out of distributed storage through workers 1-3 and writes a new rank stream back.]

  4. PageRank in MapReduce • Data flow models do not expose global state. [Figure: same data-flow diagram as the previous slide (animation build).]
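
    To make the slide's point concrete, here is a sketch of one PageRank iteration written in a map/reduce style, with plain Python standing in for the framework (illustrative only). Because there is no global in-memory state, the graph structure and the ranks must be re-read, joined, and re-emitted through distributed storage on every iteration:

    # Sketch of one PageRank iteration in map/reduce style (no real framework).
    from collections import defaultdict

    def map_phase(joined):                      # joined: (page, (out_links, rank)) pairs
        for page, (out_links, rank) in joined:
            yield page, ("graph", out_links)    # re-emit structure for the next round
            for target in out_links:
                yield target, ("rank", rank / len(out_links))

    def reduce_phase(page, values):
        out_links, rank = [], 0.0
        for kind, v in values:
            if kind == "graph":
                out_links = v
            else:
                rank += v
        return page, (out_links, rank)

    def one_iteration(joined):
        grouped = defaultdict(list)
        for key, value in map_phase(joined):
            grouped[key].append(value)
        return [reduce_phase(page, vals) for page, vals in grouped.items()]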

  5. PageRank with MPI/RPC. The user explicitly programs all communication. [Figure: workers 1-3 each hold a slice of the graph and ranks loaded from distributed storage and exchange rank updates directly with one another.]
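
    For contrast, a sketch of the explicit-communication version using mpi4py (an assumption for illustration; the slide does not prescribe this code). The programmer must decide which worker owns each page and hand-route every rank contribution to it:

    # Explicit communication sketch with mpi4py (illustrative only).
    import zlib
    from mpi4py import MPI

    comm = MPI.COMM_WORLD
    size = comm.Get_size()

    def owner(page):
        return zlib.crc32(page.encode()) % size      # which worker holds this page's rank

    def pagerank_step(local_graph, local_curr):
        # The programmer buckets every rank contribution by its destination worker...
        outgoing = [[] for _ in range(size)]
        for page, out_links in local_graph.items():
            if not out_links:
                continue
            share = local_curr[page] / len(out_links)
            for target in out_links:
                outgoing[owner(target)].append((target, share))
        # ...and orchestrates the exchange explicitly.
        incoming = comm.alltoall(outgoing)
        local_next = dict.fromkeys(local_curr, 0.0)
        for bucket in incoming:
            for target, share in bucket:
                local_next[target] = local_next.get(target, 0.0) + share
        return local_next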

  6. Piccolo’s Goal: Distributed Shared State. [Figure: the Graph and Ranks tables become distributed in-memory state that workers 1-3 read and write directly, backed by distributed storage.]

  7. Piccolo’s Goal: Distributed Shared State. The Piccolo runtime handles communication. [Figure: Graph and Ranks partitions are spread across workers 1-3; the runtime routes reads and writes between them.]

  8. Ease of use Performance

  9. Talk outline • Motivation • Piccolo's Programming Model • Runtime Scheduling • Evaluation

  10. Programming Model. Implemented as a library for C++ and Python. Table operations: get/put (read/write), update, iterate. [Figure: kernel instances on workers 1-3 read and write the shared Graph and Ranks tables.]
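
    A single-process stand-in for this table interface, as a sketch of the operations only (the actual Piccolo C++/Python API differs):

    # Local stand-in for the table interface above (illustrative, not Piccolo's API).
    class LocalTable:
        def __init__(self, accumulate=None):
            self._data = {}
            self._accumulate = accumulate        # used by update(); see slide 17

        def get(self, key):
            return self._data[key]

        def put(self, key, value):               # put overwrites the stored value
            self._data[key] = value

        def update(self, key, delta):            # update combines with the old value
            old = self._data.get(key)
            if old is None or self._accumulate is None:
                self._data[key] = delta
            else:
                self._data[key] = self._accumulate(old, delta)

        def __iter__(self):                      # iterate over (key, value) pairs
            return iter(self._data.items())

    ranks = LocalTable(accumulate=lambda old, upd: old + upd)
    ranks.put("A", 0.0)
    ranks.update("A", 0.25)
    print(dict(ranks))                           # {'A': 0.25}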

  11. Naïve PageRank with Piccolo

    curr = Table(key=PageID, value=double)
    next = Table(key=PageID, value=double)

    def pr_kernel(graph, curr, next):            # jobs run by many machines
        i = my_instance
        n = len(graph) / NUM_MACHINES
        for s in graph[(i - 1) * n : i * n]:
            for t in s.out:
                next[t] += curr[s.id] / len(s.out)

    def main():                                  # run by a single controller
        for i in range(50):
            # the controller launches jobs in parallel
            launch_jobs(NUM_MACHINES, pr_kernel, graph, curr, next)
            swap(curr, next)
            next.clear()

  12. Naïve PageRank is Slow. [Figure: kernels on workers 1-3 issue gets and puts against Graph and Ranks partitions held on other workers, so reads and writes cross the network.]

  13. PageRank: Exploiting Locality

    # control table partitioning
    curr = Table(…, partitions=100, partition_by=site)
    next = Table(…, partitions=100, partition_by=site)
    group_tables(curr, next, graph)              # co-locate tables

    def pr_kernel(graph, curr, next):
        for s in graph.get_iterator(my_instance):
            for t in s.out:
                next[t] += curr[s.id] / len(s.out)

    def main():
        for i in range(50):
            launch_jobs(curr.num_partitions, pr_kernel,
                        graph, curr, next,
                        locality=curr)           # co-locate execution with table
            swap(curr, next)
            next.clear()
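
    partition_by=site groups pages from the same web site into the same partition, so most links, which point within a site, stay partition-local. A hypothetical sketch of such a partition function (the real policy hook is whatever the Piccolo API exposes):

    # Hypothetical site-based partition function: pages on the same site
    # map to the same partition, keeping most link traffic partition-local.
    import zlib
    from urllib.parse import urlparse

    def partition_by_site(page_url, num_partitions=100):
        site = urlparse(page_url).netloc          # e.g. "www.example.com"
        return zlib.crc32(site.encode()) % num_partitions

    assert (partition_by_site("http://www.example.com/a") ==
            partition_by_site("http://www.example.com/b"))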

  14. Exploiting Locality. [Figure: with tables grouped and kernels co-located with their partitions, each worker's gets and puts hit its local Graph and Ranks partitions.]

  15. Exploiting Locality. [Figure: animation build of the previous slide.]

  16. Synchronization. How should synchronization be handled? [Figure: workers 2 and 3 concurrently put different values (a=0.3 and a=0.2) for the same key into worker 1's Ranks partition.]

  17. Synchronization Primitives • Avoid write conflicts with accumulation functions: NewValue = Accum(OldValue, Update); e.g. sum, product, min, max • Global barriers are sufficient • Tables provide release consistency
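
    A small sketch of why accumulation avoids write conflicts (plain Python, not the Piccolo API): sum, product, min, and max are commutative and associative, so the runtime can fold in concurrently arriving updates in any order and still reach the same value:

    # Accumulation resolves concurrent updates without ordering constraints.
    import random
    from functools import reduce

    def accum_sum(old_value, update):            # NewValue = Accum(OldValue, Update)
        return old_value + update

    updates = [0.3, 0.2, 0.1]                    # contributions from different kernels
    random.shuffle(updates)                      # arrival order is arbitrary
    value = reduce(accum_sum, updates, 0.0)
    assert abs(value - 0.6) < 1e-9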

  18. PageRank: Efficient Synchronization

    # accumulation via sum
    curr = Table(…, partition_by=site, accumulate=sum)
    next = Table(…, partition_by=site, accumulate=sum)
    group_tables(curr, next, graph)

    def pr_kernel(graph, curr, next):
        for s in graph.get_iterator(my_instance):
            for t in s.out:
                # update() invokes the accumulation function
                next.update(t, curr.get(s.id) / len(s.out))

    def main():
        for i in range(50):
            handle = launch_jobs(curr.num_partitions, pr_kernel,
                                 graph, curr, next, locality=curr)
            barrier(handle)                      # explicitly wait between iterations
            swap(curr, next)
            next.clear()

  19. Efficient Synchronization • Workers buffer updates locally • Release consistency • The runtime computes the sum at the owning partition. [Figure: workers 2 and 3 send update(a, 0.3) and update(a, 0.2) instead of conflicting puts; worker 1's Ranks partition accumulates them.]
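
    A sketch of what locally buffered, accumulated updates could look like (illustrative names, not the Piccolo implementation): each worker combines its own updates to a key locally and ships only one combined delta per key when the buffer is flushed, e.g. at a global barrier:

    # Sketch of locally buffered, accumulated updates (illustrative only).
    class UpdateBuffer:
        def __init__(self, accumulate):
            self._accumulate = accumulate
            self._pending = {}                   # key -> combined delta not yet sent

        def update(self, key, delta):
            if key in self._pending:
                self._pending[key] = self._accumulate(self._pending[key], delta)
            else:
                self._pending[key] = delta

        def flush(self, send):
            # Called when the kernel releases, e.g. at a global barrier.
            for key, delta in self._pending.items():
                send(key, delta)                 # one network message per key
            self._pending.clear()

    buf = UpdateBuffer(accumulate=lambda a, b: a + b)
    buf.update("a", 0.3)
    buf.update("a", 0.2)
    buf.flush(lambda k, d: print(k, d))          # sends one combined update: a 0.5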

  20. Table Consistency. [Figure: workers 2 and 3 send both puts (a=0.3, a=0.2) and updates (a, 0.3), (a, 0.2) for the same key to worker 1's Ranks partition.]

  21. PageRank with Checkpointing

    curr = Table(…, partition_by=site, accumulate=sum)
    next = Table(…, partition_by=site, accumulate=sum)
    group_tables(curr, next)

    def pr_kernel(graph, curr, next):
        for s in graph.get_iterator(my_instance):
            for t in s.out:
                next.update(t, curr.get(s.id) / len(s.out))

    def main():
        # restore previous computation; the user decides which tables
        # to checkpoint and when
        curr, userdata = restore()
        last = userdata.get('iter', 0)
        for i in range(last, 50):
            handle = launch_jobs(curr.num_partitions, pr_kernel,
                                 graph, curr, next, locality=curr)
            cp_barrier(handle, tables=(next), userdata={'iter': i})
            swap(curr, next)
            next.clear()

  22. Recovery via Checkpointing. The runtime uses the Chandy-Lamport snapshot protocol. [Figure: workers 1-3 checkpoint their Graph and Ranks partitions to distributed storage and restore from the checkpoint after a failure.]

  23. Talk Outline • Motivation • Piccolo's Programming Model • Runtime Scheduling • Evaluation

  24. Load Balancing. The master coordinates work-stealing: before a pending job and its partition move to another worker, updates to that partition are paused ("Other workers are updating P6!" / "Pause updates!"). [Figure: jobs J1-J6 and partitions P1-P6 spread across workers 1-3; an idle worker takes over J6/P6.]
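
    An illustrative sketch of the coordination step this slide describes (not Piccolo's scheduler code): before the master reassigns a pending job to an idle worker, it pauses updates to the job's partition, migrates the partition, and then resumes updates:

    # Illustrative sketch of master-coordinated work stealing.
    from collections import namedtuple

    Job = namedtuple("Job", ["name", "partition"])

    class Master:
        def __init__(self, pending):
            self.pending = pending               # worker id -> list of unstarted jobs
            self.log = []

        def _broadcast(self, msg):
            # Stand-in for telling every worker to pause/resume/migrate.
            self.log.append(msg)

        def steal(self, idle_worker, busy_worker):
            if not self.pending[busy_worker]:
                return None
            job = self.pending[busy_worker].pop()
            # Other workers may still be sending updates to this job's partition:
            self._broadcast(("pause_updates", job.partition))
            self._broadcast(("migrate", job.partition, busy_worker, idle_worker))
            self.pending[idle_worker].append(job)
            self._broadcast(("resume_updates", job.partition))
            return job

    master = Master({1: [], 2: [Job("J6", "P6")]})
    master.steal(idle_worker=1, busy_worker=2)   # worker 1 takes J6 after P6 is paused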

  25. Talk Outline • Motivation • Piccolo's Programming Model • System Design • Evaluation

  26. Piccolo is Fast. [Chart: PageRank iteration time (seconds) vs. number of workers (8, 16, 32, 64), Hadoop vs. Piccolo.] Main Hadoop overheads: sorting, HDFS, serialization. • NYU cluster, 12 nodes, 64 cores • 100M-page graph

  27. Piccolo Scales Well. [Chart: PageRank iteration time (seconds) vs. number of workers (12 to 200) on a 1-billion-page graph, compared against ideal scaling.] • EC2 cluster, linearly scaled input graph

  28. Other applications • Iterative applications: N-body simulation, matrix multiply (no straightforward Hadoop implementation) • Asynchronous applications: distributed web crawler

  29. Related Work • Data flow: MapReduce, Dryad • Tuple spaces: Linda, JavaSpaces • Distributed shared memory: CRL, TreadMarks, Munin, Ivy; UPC, Titanium

  30. Conclusion • Distributed shared table model • User-specified policies provide for: effective use of locality, efficient synchronization, robust failure recovery

  31. Gratuitous Cat Picture I can haz kwestions? Try it out: piccolo.news.cs.nyu.edu
