Piccolo: Building fast distributed programs with partitioned tables
Russell Power, Jinyang Li
New York University
Motivating Example: PageRank

    Repeat until convergence:
        for each node X in graph:
            for each edge X -> Z:
                next[Z] += curr[X]

[Figure: input graph (A -> B,C,D; B -> E; C -> D; ...) alongside the Curr and Next rank tables across iterations. The tables fit in memory!]
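For reference, the same loop as a runnable single-machine Python sketch (the toy graph and fixed iteration count are illustrative assumptions; the slide's update elides the division by out-degree that the later kernels include):

    # Minimal single-machine PageRank sketch; illustrative, not Piccolo code.
    graph = {"A": ["B", "C", "D"], "B": ["E"], "C": ["D"],
             "D": ["A"], "E": ["A"]}

    curr = {node: 1.0 / len(graph) for node in graph}
    for _ in range(50):                       # "repeat until convergence"
        nxt = {node: 0.0 for node in graph}
        for x, out in graph.items():
            for z in out:
                nxt[z] += curr[x] / len(out)  # contribution along edge X -> Z
        curr = nxt

    print(curr)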
PageRank in MapReduce

Data flow models do not expose global state.

[Figure: three workers exchange a rank stream (A: 0.1, B: 0.2) and a graph stream (A -> B,C; B -> D) through distributed storage on every iteration.]
PageRank with MPI/RPC

The user explicitly programs communication.

[Figure: three workers each hold a graph partition (A -> B,C; B -> D; C -> E,F) and a ranks table, exchanging updates directly; input comes from distributed storage.]
Piccolo's Goal: Distributed Shared State

Distributed in-memory state with read/write access, loaded from distributed storage.

[Figure: graph and ranks tables held in memory, shared by all workers.]
Piccolo's Goal: Distributed Shared State

The Piccolo runtime handles communication.

[Figure: the graph and ranks tables are partitioned across three workers; the runtime routes reads and writes between them.]
[Figure: design space plotted on two axes, ease of use vs. performance.]
Talk outline
- Motivation
- Piccolo's Programming Model
- Runtime Scheduling
- Evaluation
Programming Model

Implemented as a library for C++ and Python.

Table operations: get/put (read/write), update, iterate.

[Figure: kernels on three workers read and write shared graph (A -> B,C; B -> D) and ranks tables.]
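To make the call shapes concrete, here is a toy dict-backed stand-in for a table (an assumption for illustration, not Piccolo's API):

    # Toy in-memory "table"; demonstrates get/put, update, and iterate.
    class ToyTable(dict):
        def get(self, k):
            return dict.get(self, k, 0.0)
        def put(self, k, v):
            self[k] = v
        def update(self, k, v):          # accumulate=sum behavior
            self[k] = self.get(k) + v

    ranks = ToyTable()
    ranks.put("A", 0.25)                 # write
    r = ranks.get("A")                   # read
    ranks.update("A", 0.05)              # accumulated write
    for page, rank in ranks.items():     # iterate
        print(page, rank)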
Naïve PageRank with Piccolo

    curr = Table(key=PageID, value=double)
    next = Table(key=PageID, value=double)

    # Jobs run by many machines
    def pr_kernel(graph, curr, next):
        i = my_instance
        n = len(graph) / NUM_MACHINES
        for s in graph[(i - 1) * n : i * n]:
            for t in s.out:
                next[t] += curr[s.id] / len(s.out)

    # Run by a single controller, which launches jobs in parallel
    def main():
        for i in range(50):
            launch_jobs(NUM_MACHINES, pr_kernel, graph, curr, next)
            swap(curr, next)
            next.clear()
Naïve PageRank is Slow

[Figure: every table access is a remote get or put between workers; each rank update crosses the network.]
PageRank: Exploiting Locality

    # Control table partitioning; co-locate tables
    curr = Table(…, partitions=100, partition_by=site)
    next = Table(…, partitions=100, partition_by=site)
    group_tables(curr, next, graph)

    def pr_kernel(graph, curr, next):
        for s in graph.get_iterator(my_instance):
            for t in s.out:
                next[t] += curr[s.id] / len(s.out)

    def main():
        for i in range(50):
            launch_jobs(curr.num_partitions, pr_kernel,
                        graph, curr, next,
                        locality=curr)   # co-locate execution with table
            swap(curr, next)
            next.clear()
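One plausible reading of partition_by=site, sketched as a standalone helper (assumes URL keys; this helper is not part of Piccolo's API): pages from one host hash to the same partition, so most next[t] updates stay on the local worker.

    from urllib.parse import urlparse

    def site_partition(page_url, num_partitions=100):
        # Note: hash() is salted per process in Python; a real partitioner
        # would use a stable hash so partitioning survives restarts.
        site = urlparse(page_url).netloc
        return hash(site) % num_partitions

    print(site_partition("http://example.com/page1"))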
Exploiting Locality

[Figure: with co-located partitions, gets and puts now hit the worker's own partition; only links that cross sites require remote puts.]
Synchronization

How to handle synchronization?

[Figure: workers 2 and 3 concurrently issue put(a=0.3) and put(a=0.2) to the same key in worker 1's ranks partition — the writes conflict.]
Synchronization Primitives

Avoid write conflicts with accumulation functions:
    NewValue = Accum(OldValue, Update)
    e.g. sum, product, min, max

Global barriers are sufficient. Tables provide release consistency.
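A minimal sketch of the accumulation rule in plain Python (not the Piccolo runtime): because the accumulator commutes, concurrent updates to one key yield the same value in any arrival order, so no locking is needed.

    def apply_update(table, key, update, accum=lambda old, up: old + up):
        # NewValue = Accum(OldValue, Update); 0.0 is sum's identity
        table[key] = accum(table.get(key, 0.0), update)

    ranks = {}
    apply_update(ranks, "a", 0.3)   # e.g. arriving from worker 2
    apply_update(ranks, "a", 0.2)   # e.g. arriving from worker 3
    assert ranks["a"] == 0.5        # same total in either order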
PageRank: Efficient Synchronization

    # Accumulation via sum
    curr = Table(…, partition_by=site, accumulate=sum)
    next = Table(…, partition_by=site, accumulate=sum)
    group_tables(curr, next, graph)

    def pr_kernel(graph, curr, next):
        for s in graph.get_iterator(my_instance):
            for t in s.out:
                # update() invokes the accumulation function
                next.update(t, curr.get(s.id) / len(s.out))

    def main():
        for i in range(50):
            handle = launch_jobs(curr.num_partitions, pr_kernel,
                                 graph, curr, next, locality=curr)
            barrier(handle)   # explicitly wait between iterations
            swap(curr, next)
            next.clear()
Efficient Synchronization

Workers buffer updates locally; the runtime computes the sum. Tables provide release consistency.

[Figure: in place of conflicting puts, workers 2 and 3 send update(a, 0.3) and update(a, 0.2); worker 1's runtime accumulates them into ranks[a].]
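One way to picture the buffering, as a sketch under my own assumptions (not Piccolo's implementation): each worker folds pending updates into a per-key buffer using the table's accumulator, then ships one combined update per key at a release point such as the barrier.

    import collections

    class UpdateBuffer:
        """Pre-combines updates locally with the table's accumulator."""
        def __init__(self, accum, identity=0.0):
            self.accum = accum
            self.pending = collections.defaultdict(lambda: identity)

        def update(self, key, value):
            self.pending[key] = self.accum(self.pending[key], value)

        def flush(self, send):
            # Called at a release point: one message per key, not per update.
            for key, value in self.pending.items():
                send(key, value)
            self.pending.clear()

    buf = UpdateBuffer(accum=lambda old, up: old + up)
    buf.update("a", 0.3)
    buf.update("a", 0.2)
    buf.flush(send=lambda k, v: print(k, v))   # prints: a 0.5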
Table Consistency

[Figure: the same flow as above — workers issue update(a, 0.3) and update(a, 0.2) rather than puts, and the runtime applies them through the accumulation function.]
PageRank with Checkpointing

    curr = Table(…, partition_by=site, accumulate=sum)
    next = Table(…, partition_by=site, accumulate=sum)
    group_tables(curr, next)

    def pr_kernel(graph, curr, next):
        for s in graph.get_iterator(my_instance):
            for t in s.out:
                next.update(t, curr.get(s.id) / len(s.out))

    def main():
        # Restore previous computation, if any
        curr, userdata = restore()
        last = userdata.get('iter', 0)
        for i in range(last, 50):
            handle = launch_jobs(curr.num_partitions, pr_kernel,
                                 graph, curr, next, locality=curr)
            # User decides which tables to checkpoint and when
            cp_barrier(handle, tables=(next,), userdata={'iter': i})
            swap(curr, next)
            next.clear()
Recovery via Checkpointing

The runtime writes checkpoints to distributed storage using the Chandy-Lamport protocol.

[Figure: each worker's graph and ranks partitions are checkpointed to distributed storage.]
Talk Outline
- Motivation
- Piccolo's Programming Model
- Runtime Scheduling
- Evaluation
Load Balancing

The master coordinates work-stealing.

[Figure: kernel instances J1–J6 run over partitions P1–P6 on three workers; an idle worker steals J6, which operates on partition P6. "Other workers are updating P6!" — so updates to P6 are paused during the migration.]
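A hedged sketch of the master's work-stealing step (every name and helper below is illustrative, not Piccolo's scheduler): an idle worker takes an unstarted kernel instance from the most loaded worker, and updates to the migrating partition are paused for the move.

    from dataclasses import dataclass, field

    @dataclass
    class Job:
        partition: str

    @dataclass
    class Worker:
        name: str
        unstarted: list = field(default_factory=list)
        def assign(self, job):
            print(f"{self.name} runs {job.partition}")

    # Stubs standing in for runtime machinery (hypothetical helpers).
    def pause_updates(p):  print(f"pause updates to {p}")
    def migrate_partition(p, src, dst): print(f"move {p}: {src.name} -> {dst.name}")
    def resume_updates(p): print(f"resume updates to {p}")

    def steal_work(idle, workers):
        victim = max(workers, key=lambda w: len(w.unstarted))
        if victim is idle or not victim.unstarted:
            return None
        job = victim.unstarted.pop()   # steal an unstarted kernel instance
        pause_updates(job.partition)   # other workers may be updating it!
        migrate_partition(job.partition, victim, idle)
        resume_updates(job.partition)
        idle.assign(job)
        return job

    w1 = Worker("w1", [Job("P5"), Job("P6")])
    w3 = Worker("w3")
    steal_work(w3, [w1, w3])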
Talk Outline
- Motivation
- Piccolo's Programming Model
- System Design
- Evaluation
Piccolo is Fast

[Figure: PageRank iteration time (seconds) vs. number of workers (8, 16, 32, 64) for Hadoop and Piccolo; Piccolo is substantially faster at every scale.]

Main Hadoop overheads: sorting, HDFS, serialization.

Setup: NYU cluster, 12 nodes, 64 cores, 100M-page graph.
Piccolo Scales Well

[Figure: PageRank iteration time (seconds) vs. workers (12, 24, 48, 100, 200) against the ideal scaling curve; EC2 cluster with the input graph scaled linearly with workers, up to a 1-billion-page graph.]
Other Applications

Iterative applications (no straightforward Hadoop implementation):
- N-body simulation
- Matrix multiply

Asynchronous applications:
- Distributed web crawler
Related Work

Data flow: MapReduce, Dryad
Tuple spaces: Linda, JavaSpaces
Distributed shared memory: CRL, TreadMarks, Munin, Ivy; UPC, Titanium
Conclusion

Distributed shared table model. User-specified policies provide:
- Effective use of locality
- Efficient synchronization
- Robust failure recovery
[Gratuitous cat picture] I can haz kwestions?

Try it out: piccolo.news.cs.nyu.edu