CS535 Big Data | Computer Science | Colorado State University
4/7/2020, Week 11-A, Sangmi Lee Pallickara
http://www.cs.colostate.edu/~cs535 (Spring 2020)

CS535 BIG DATA
PART B. GEAR SESSIONS
SESSION 3: BIG GRAPH ANALYSIS
Sangmi Lee Pallickara
Computer Science, Colorado State University

FAQs
• Online GEAR presentation will be available on 4/6
• You will have 3 days of discussion period on Piazza
  • 4/6 ~ 4/8

Topics of Today's Class
• GraphX: Graph Processing in a Distributed Dataflow Framework
  • Part 1: Introduction and Graph parallelism
  • Part 2: Distributed Graph Representation
  • Part 3: Implementation of Distributed Graph Processing

GEAR Session 3. Big Graph Analysis
Lecture 2. Distributed Large Graph Analysis-II
GraphX: Graph Processing in a Distributed Dataflow Framework

This material is based on
• Gonzalez, J.E., Xin, R.S., Dave, A., Crankshaw, D., Franklin, M.J. and Stoica, I., 2014. GraphX: Graph processing in a distributed dataflow framework. In 11th USENIX Symposium on Operating Systems Design and Implementation (OSDI 14) (pp. 599-613).
• Karypis, G. and Kumar, V., 1998. Multilevel k-way partitioning scheme for irregular graphs. J. Parallel Distrib. Comput. 48, 1, 96-129.
• GraphX Programming Guide https://spark.apache.org/docs/latest/graphx-programming-guide.html

Introduction
• GraphX is a library built on top of Apache Spark for graphs and graph-parallel computation
• Introduces a Graph abstraction
  • Directed multigraph with properties attached to each vertex and edge
• Provides a set of graph operators
  • E.g. subgraph, joinVertices, and aggregateMessages
• Provides an optimized variant of the Pregel API
• Implements graph algorithms and builders
  • PageRank
  • Connected Components
  • Triangle Counting
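
As a rough illustration of the Graph abstraction introduced above, here is a minimal sketch in plain Scala (standard library only, not the GraphX API). The names Vertex, Edge, PropertyGraph, and PropertyGraphDemo are assumptions made for this example, and the subgraph method only loosely mirrors the idea behind the GraphX operator of the same name.

```scala
// Hypothetical types for illustration; this is not the GraphX API.
final case class Vertex[VD](id: Long, attr: VD)
final case class Edge[ED](src: Long, dst: Long, attr: ED)

// A directed multigraph: parallel edges between the same endpoints are allowed.
final case class PropertyGraph[VD, ED](vertices: Seq[Vertex[VD]], edges: Seq[Edge[ED]]) {
  // Keep vertices that pass vpred, then keep edges that pass epred
  // and whose two endpoints both survived.
  def subgraph(vpred: Vertex[VD] => Boolean, epred: Edge[ED] => Boolean): PropertyGraph[VD, ED] = {
    val keptVertices = vertices.filter(vpred)
    val keptIds = keptVertices.map(_.id).toSet
    val keptEdges = edges.filter(e => epred(e) && keptIds(e.src) && keptIds(e.dst))
    PropertyGraph(keptVertices, keptEdges)
  }
}

object PropertyGraphDemo {
  val graph: PropertyGraph[String, String] = PropertyGraph(
    Seq(Vertex(1L, "alice"), Vertex(2L, "bob"), Vertex(3L, "carol")),
    Seq(
      Edge(1L, 2L, "follows"),
      Edge(1L, 2L, "messages"), // parallel edge: same endpoints, different property
      Edge(2L, 3L, "follows")
    )
  )

  // Drop the vertex "carol"; the edge 2 -> 3 disappears with it.
  val withoutCarol: PropertyGraph[String, String] =
    graph.subgraph(v => v.attr != "carol", _ => true)
}
```

The two parallel edges from vertex 1 to vertex 2 are what make this a multigraph rather than a simple graph.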

Computational Challenges
• Specialized graph processing systems outperform general-purpose distributed dataflow frameworks by using their own optimization schemes
  • E.g. Pregel, PowerGraph, BLAS, Kineograph
• Graphs are often only one part of a larger analytics process
  • Combines graphs with unstructured and tabular data
• Analytics pipelines are forced to compose multiple systems
  • Extra data movement and duplication
  • Fault tolerance
• A design for graph processing on top of general-purpose distributed dataflow systems is needed

Distributed Dataflow Model and Optimization Schemes for Graph Processing

Dataflow Models - Traditional Network Programming
• Message-passing between nodes (e.g. MPI)
• Very difficult to do at scale
  • How to split the problem across nodes?
    • Network communication & data locality
  • How to deal with failures? (inevitable at scale)
  • Stragglers?
    • Node has not failed but is slow
  • Writing programs for each machine
• Rarely used in commodity datacenters!

Dataflow Models - Modern distributed dataflow models
• Restrict the programming interface
  • System can do more automatically
• Express jobs as graphs of high-level operators
  • System picks how to split each operator into tasks and where to run each task
  • Run parts multiple times for fault recovery
• Examples: MapReduce, Spark, Dryad, Storm, Pig, Hive…
• Examples of dataflow operators
  • join, map, groupBy, … most of the operators introduced in the Apache Spark discussion
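
One way to picture "jobs as graphs of high-level operators" is the small sketch below. It uses ordinary Scala collections rather than Spark, so nothing here is distributed; the data and the names DataflowOperatorsDemo, users, and follows are hypothetical.

```scala
object DataflowOperatorsDemo {
  // Hypothetical toy data: (userId, name) pairs and (srcId, dstId) pairs.
  val users: Seq[(Long, String)] = Seq(1L -> "alice", 2L -> "bob", 3L -> "carol")
  val follows: Seq[(Long, Long)] = Seq(1L -> 2L, 1L -> 3L, 2L -> 3L)

  // groupBy + map: count outgoing edges per source id.
  val outDegree: Map[Long, Int] =
    follows.groupBy(_._1).map { case (src, es) => src -> es.size }

  // Join by key (left outer): attach each user's out-degree, defaulting to 0.
  val degreeByName: Seq[(String, Int)] =
    users.map { case (id, name) => name -> outDegree.getOrElse(id, 0) }
}
```

Because the job is written only in terms of groupBy, map, and a keyed lookup, a dataflow system would be free to decide how to split each step into tasks; the programmer never addresses individual machines.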

Why did these graph processing systems evolve separately from distributed dataflow frameworks?
• Early emphasis on single-stage computation and on-disk processing
  • Limited capability to handle iterative graph algorithms
    • These repeatedly and randomly access subsets of the graph
  • E.g. MapReduce
• Early distributed dataflow frameworks did not support fine-grained control over data partitioning
• Recent frameworks (e.g. Spark and Naiad) support in-memory representation and fine-grained control over data partitioning

Optimizations used in GraphX
• Encoding graphs as collections
  • Vertex-cut partitioning
• Executing graph algorithms as common dataflow operators
  • Join optimizations
    • E.g. CSR indexing, join elimination, and join-site specification
  • Materialized view maintenance
    • Vertex mirroring and delta updates
• Applying the above techniques provides a new set of Spark dataflow operators for graph processing
• Reducing memory overhead and improving system performance
  • Immutability: GraphX reuses indices across graph and collection views over multiple iterations
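
To make vertex-cut partitioning and vertex mirroring a little more concrete, the following is a minimal sketch in plain Scala. The placement rule in partitionOf is an assumption chosen for simplicity (it is not the strategy GraphX uses), and all names in VertexCutDemo are hypothetical.

```scala
object VertexCutDemo {
  type Edge = (Long, Long) // (source id, destination id)

  // Assumed placement rule: a simple arithmetic hash of both endpoints.
  def partitionOf(e: Edge, numParts: Int): Int =
    Math.floorMod(e._1 * 31L + e._2, numParts.toLong).toInt

  // Vertex-cut: every edge lives in exactly one partition.
  def partitionEdges(edges: Seq[Edge], numParts: Int): Map[Int, Seq[Edge]] =
    edges.groupBy(e => partitionOf(e, numParts))

  // A vertex is mirrored into every partition that holds one of its edges.
  def mirrors(parts: Map[Int, Seq[Edge]]): Map[Long, Set[Int]] =
    parts.toSeq
      .flatMap { case (p, es) => es.flatMap { case (s, d) => Seq(s -> p, d -> p) } }
      .groupBy(_._1)
      .map { case (v, ps) => v -> ps.map(_._2).toSet }

  val edges: Seq[Edge] = Seq((1L, 2L), (1L, 3L), (1L, 4L), (2L, 3L))
  val parts: Map[Int, Seq[Edge]] = partitionEdges(edges, 2)
  val vertexMirrors: Map[Long, Set[Int]] = mirrors(parts)
}
```

In this toy graph the high-degree vertex 1 has edges in both partitions, so it is mirrored twice, while no edge is ever split or duplicated. That is the trade-off of cutting vertices instead of edges.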

The Property Graphs as Collections and Executing Graph Algorithms

Property Graph
• User-defined properties are associated with each vertex and edge
  • Meta-data
    • E.g. user profiles and time stamps
  • Program state
    • E.g. the PageRank of vertices or inferred affinities
• Applicable to graphs derived from natural phenomena, such as social networks and web graphs
  • Often highly skewed
    • Power-law degree distributions
  • Orders of magnitude more edges than vertices

Transforming a Property Graph to a Pair of Collections
• Vertex collection
  • Vertex properties (with a unique key: Vertex Identifier)
  • Vertex identifiers are 64-bit integers
  • Derived externally (e.g. using userID) or by applying a hash function to the vertex property (e.g. URL)
• Edge collection
  • Edge properties (with source and destination vertex identifiers)
• Having a pair of collections enables the system to compute graph algorithms with existing dataflow operations
  • Join: adding additional vertex properties
  • Creating new collections: creating a new graph
    • E.g. maintaining a graph for PageRanks and another graph for membership information while sharing the same edge collection

The Graph-Parallel Abstraction (Discussed in W10-A)
• Iterative local transformations
  • E.g. PageRank algorithm
• Vertex program
  • Launches the vertex program for each vertex and interacts with adjacent vertex programs through messages (e.g. Pregel) or shared state (e.g. PowerGraph)
• Example with the PageRank algorithm

PageRank in Pregel (pseudocode)

    def PageRank(v: Id, msgs: List[Double]) {
      // Compute the message sum
      var msgSum = 0.0
      for (m <- msgs) { msgSum += m }
      // Update the PageRank
      PR(v) = 0.15 + 0.85 * msgSum
      // Broadcast messages with new PR
      for (j <- OutNbrs(v)) {
        msg = PR(v) / NumLinks(v)
        send_msg(to=j, msg)
      }
      // Check for termination
      if (converged(PR(v))) voteToHalt(v)
    }
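
The two slides above can be connected with a small sketch: the same PageRank update rule, written against a vertex collection and an edge collection using only map and groupBy. This is plain Scala on local collections, not GraphX or Pregel; CollectionPageRank and its parameters are hypothetical names, and a fixed iteration count stands in for the convergence check.

```scala
object CollectionPageRank {
  // vertexIds is the vertex collection (keys only);
  // edges is the edge collection as (source id, destination id) pairs.
  def pageRank(
      vertexIds: Seq[Long],
      edges: Seq[(Long, Long)],
      iterations: Int
  ): Map[Long, Double] = {
    // Number of outgoing links per source vertex.
    val outDegree: Map[Long, Int] =
      edges.groupBy(_._1).map { case (v, es) => v -> es.size }

    // Initial rank of 1.0 for every vertex (an assumption of this sketch).
    var ranks: Map[Long, Double] = vertexIds.map(v => v -> 1.0).toMap

    for (_ <- 1 to iterations) {
      // "Join" each edge with its source rank to form a message,
      // then group the messages by destination and sum them.
      val msgSums: Map[Long, Double] = edges
        .map { case (src, dst) => dst -> (ranks(src) / outDegree(src)) }
        .groupBy(_._1)
        .map { case (dst, ms) => dst -> ms.map(_._2).sum }

      // Same update rule as the pseudocode: PR(v) = 0.15 + 0.85 * msgSum.
      ranks = vertexIds.map(v => v -> (0.15 + 0.85 * msgSums.getOrElse(v, 0.0))).toMap
    }
    ranks
  }
}
```

For example, CollectionPageRank.pageRank(Seq(1L, 2L, 3L), Seq((1L, 2L), (2L, 3L), (3L, 1L)), 10) runs ten rounds on a three-vertex cycle. Vertices with no incoming edges simply receive an empty message sum.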

The Graph-Parallel Abstraction (Discussed in W10-A)
• Advantage
  • Well suited to iterative graph algorithms that operate on the static neighborhood structure of the graph
• Disadvantage
  • It cannot express computation where disconnected vertices interact
  • It cannot express computation that changes the graph structure in the course of the computation

The GAS Decomposition
• Gonzalez et al. [1] observed that most vertex programs interact with neighboring vertices by collecting messages in the form of a generalized commutative associative sum and then broadcasting new messages in an inherently parallel loop

[1] Gonzalez, J.E., Low, Y., Gu, H., Bickson, D. and Guestrin, C., 2012. PowerGraph: Distributed graph-parallel computation on natural graphs. In OSDI '12, USENIX Association, pp. 17-30.
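
The observation above is usually summarized as Gather, Apply, Scatter. The following is a minimal sketch of one synchronous step in plain Scala, under simplifying assumptions: gathering happens over in-edges only, and scatter just reports which out-neighbors should be signaled. The names GasDemo, gasStep, and minStep are hypothetical, and this is not the PowerGraph or GraphX API.

```scala
object GasDemo {
  // One synchronous Gather-Apply-Scatter step over (source, destination) edges.
  def gasStep[V, A](
      values: Map[Long, V],
      edges: Seq[(Long, Long)],
      gather: (V, V) => A,          // (source value, destination value) => contribution
      sum: (A, A) => A,             // must be commutative and associative
      applyFn: (V, Option[A]) => V  // new vertex value from old value and gathered sum
  ): (Map[Long, V], Set[Long]) = {
    // Gather: combine the contributions arriving at each destination with `sum`.
    val gathered: Map[Long, A] = edges
      .map { case (src, dst) => dst -> gather(values(src), values(dst)) }
      .groupBy(_._1)
      .map { case (dst, contribs) => dst -> contribs.map(_._2).reduce(sum) }

    // Apply: update every vertex from its old value and its gathered sum, if any.
    val updated: Map[Long, V] =
      values.map { case (v, old) => v -> applyFn(old, gathered.get(v)) }

    // Scatter: signal the out-neighbors of every vertex whose value changed.
    val signaled: Set[Long] =
      edges.collect { case (src, dst) if updated(src) != values(src) => dst }.toSet

    (updated, signaled)
  }

  // Example vertex program: propagate the minimum label along a directed chain.
  val chain: Seq[(Long, Long)] = Seq((1L, 2L), (2L, 3L))
  val labels: Map[Long, Long] = Map(1L -> 1L, 2L -> 2L, 3L -> 3L)

  def minStep(values: Map[Long, Long]): (Map[Long, Long], Set[Long]) =
    gasStep[Long, Long](
      values,
      chain,
      (srcValue, _) => srcValue,
      (a, b) => math.min(a, b),
      (old, acc) => acc.map(a => math.min(old, a)).getOrElse(old)
    )

  val (afterOne, signaledAfterOne) = minStep(labels)
}
```

Because sum is commutative and associative, the contributions for one vertex can be combined in any order and in partial groups, which is what lets the gather phase run in parallel across partitions.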