Big Data I: Graph Processing, Distributed Machine Learning
CS 240: Computing Systems and Concurrency – Lecture 21
Marco Canini
Credits: Michael Freedman and Kyle Jamieson developed much of the original material. Selected content adapted from J. Gonzalez.
Motivating example: a patient presents with abdominal pain. Diagnosis?
(Figure: the evidence forms a graph – what the patient ate, where it was purchased, what else is sold there, and whether it contains E. coli – leading to a diagnosis of E. coli infection.)
Big Data is Everywhere
6 billion Flickr photos, 900 million Facebook users, 72 hours of YouTube video a minute, 28 million Wikipedia pages
• Machine learning is a reality
• How will we design and implement “Big Learning” systems?
We could use… Threads, Locks, & Messages – “low-level parallel primitives”
Shift Towards Use of Parallelism in ML
GPUs, multicore, clusters, clouds, supercomputers
• Programmers repeatedly solve the same parallel design challenges:
  – Race conditions, distributed state, communication…
• Resulting code is very specialized:
  – Difficult to maintain, extend, debug…
Idea: Avoid these problems by using high-level abstractions
… a better answer: MapReduce / Hadoop
Build learning algorithms on top of high-level parallel abstractions
MapReduce – Map Phase
(Figure: four CPUs, each computing image features over its own slice of the data.)
Embarrassingly parallel, independent computation – no communication needed
MapReduce – Reduce Phase
(Figure: reducers aggregate the mapped image features by class – CPU 1 collects statistics over outdoor pictures, CPU 2 over indoor pictures.)
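To make the two phases concrete, here is a minimal single-process sketch in Python (my own illustration, not lecture code): the map phase turns each image record into a (class label, feature vector) pair independently, and the reduce phase groups by label and aggregates per-class statistics. The record format and the choice of “mean feature vector” as the statistic are assumptions for the example.

    from collections import defaultdict

    def map_phase(records):
        # Embarrassingly parallel: each record is handled independently,
        # emitting (class_label, feature_vector) with no communication.
        for label, features in records:
            yield label, features

    def reduce_phase(mapped):
        # Group by class label and aggregate simple statistics
        # (count and per-feature sums), then report the per-class mean.
        sums, counts = {}, defaultdict(int)
        for label, features in mapped:
            counts[label] += 1
            if label not in sums:
                sums[label] = list(features)
            else:
                sums[label] = [s + f for s, f in zip(sums[label], features)]
        return {label: [s / counts[label] for s in vec] for label, vec in sums.items()}

    records = [("outdoor", [1.0, 4.0]), ("indoor", [2.0, 2.0]), ("outdoor", [3.0, 8.0])]
    print(reduce_phase(map_phase(records)))
    # {'outdoor': [2.0, 6.0], 'indoor': [2.0, 2.0]}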
Map-Reduce for Data-Parallel ML
• Excellent for large data-parallel tasks!
• Data-parallel (MapReduce): feature extraction, algorithm tuning, basic data processing
• Graph-parallel (?): label propagation, lasso, belief propagation, kernel methods, tensor factorization, PageRank, neural networks, deep belief networks
Is there more to machine learning?
Exploiting Dependencies
Graphs are Everywhere
• Collaborative filtering: users × movies (Netflix)
• Social networks
• Probabilistic analysis
• Text analysis: docs × words (Wikipedia)
Concrete Example: Label Propagation
Label Propagation Algorithm
• Social arithmetic (estimating what “me” likes):
  – 50% × (what I list on my profile: 50% cameras, 50% biking)
  – + 40% × (what Sue Ann likes: 80% cameras, 20% biking)
  – + 10% × (what Carlos likes: 30% cameras, 70% biking)
  – = I like: 60% cameras, 40% biking
• Recurrence algorithm: Likes[i] = Σ_{j ∈ Friends[i]} W_ij × Likes[j]
  – iterate until convergence
• Parallelism:
  – Compute all Likes[i] in parallel
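A tiny Python sketch of one propagation step (illustrative only; the dictionary layout and names are mine). Running this step repeatedly over every vertex is the “iterate until convergence” loop, and all Likes[i] within one step can be computed in parallel.

    def propagate(weights, likes):
        # One step of Likes[i] = sum over j in Friends[i] of W_ij * Likes[j].
        new_likes = {}
        for i, friends in weights.items():
            est = {}
            for j, w in friends.items():
                for label, p in likes[j].items():
                    est[label] = est.get(label, 0.0) + w * p
            new_likes[i] = est
        return new_likes

    # The slide's social-arithmetic example: 50% own profile, 40% Sue Ann, 10% Carlos.
    likes = {"my_profile": {"cameras": 0.5, "biking": 0.5},
             "sue_ann":    {"cameras": 0.8, "biking": 0.2},
             "carlos":     {"cameras": 0.3, "biking": 0.7}}
    weights = {"me": {"my_profile": 0.5, "sue_ann": 0.4, "carlos": 0.1}}
    print(propagate(weights, likes))   # ~60% cameras, ~40% biking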
Properties of Graph-Parallel Algorithms
• Dependency graph
• Factored computation: what I like depends on what my friends like
• Iterative computation
Map-Reduce for Data-Parallel ML
• Excellent for large data-parallel tasks!
• Data-parallel: MapReduce – feature extraction, algorithm tuning, basic data processing
• Graph-parallel: MapReduce? – label propagation, lasso, belief propagation, kernel methods, tensor factorization, PageRank, neural networks, deep belief networks
Problem: Data Dependencies
• MapReduce doesn’t efficiently express data dependencies
  – User must code substantial data transformations
  – Costly data replication
(Figure: MapReduce assumes independent data rows.)
Iterative Algorithms
• MapReduce doesn’t efficiently express iterative algorithms
(Figure: every iteration re-processes all data on all CPUs; a slow processor stalls each iteration at the barrier that separates iterations.)
MapAbuse: Iterative MapReduce
• Only a subset of data needs computation – yet each MapReduce iteration processes everything
(Figure: the same all-data, barrier-per-iteration structure as before.)
MapAbuse: Iterative MapReduce
• System is not optimized for iteration
(Figure: every iteration pays a startup penalty and a disk penalty.)
ML Tasks Beyond Data-Parallelism
• Data-parallel (MapReduce): feature extraction, cross validation, computing sufficient statistics
• Graph-parallel (?):
  – Graphical models: Gibbs sampling, belief propagation, variational optimization
  – Semi-supervised learning: label propagation, CoEM
  – Collaborative filtering: tensor factorization
  – Graph analysis: PageRank, triangle counting
ML Tasks Beyond Data-Parallelism
• Data-parallel: MapReduce – feature extraction, cross validation, computing sufficient statistics
• Graph-parallel: Pregel
• Limited CPU power
• Limited memory
• Limited scalability
Distributed Cloud
Scale up computational resources!
Challenges:
- Distribute state
- Keep data consistent
- Provide fault tolerance
The GraphLab Framework
• Graph-based data representation
• Update functions (user computation)
• Consistency model
Data Graph
Data is associated with both vertices and edges.
• Graph: social network
• Vertex data: user profile, current interests estimates
• Edge data: relationship (friend, classmate, relative)
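As a concrete (hypothetical) illustration of that layout in Python, the data graph can be held as per-vertex and per-edge records; the field names here are invented:

    # Vertex data: one record per user (profile plus current interest estimates).
    vertex_data = {
        "alice": {"profile": {"age": 30}, "interests": {"cameras": 0.6, "biking": 0.4}},
        "bob":   {"profile": {"age": 25}, "interests": {"cameras": 0.2, "biking": 0.8}},
    }
    # Edge data: one record per relationship in the social network.
    edge_data = {
        ("alice", "bob"): {"relationship": "friend"},
    }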
Distributed Data Graph
Partition the graph across multiple machines.
Distributed Data Graph
• “Ghost” vertices maintain adjacency structure and replicate remote data.
Distributed Data Graph
• Cut efficiently using HPC graph-partitioning tools (ParMetis / Scotch / …)
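A toy sketch (not GraphLab code) of how an edge-cut partitioning with ghost vertices might be materialized: each machine stores the vertices it owns plus read-only ghost copies of the remote endpoints of its local edges. The `owner` assignment here stands in for whatever ParMetis/Scotch would compute.

    def partition_with_ghosts(edges, owner):
        # edges: list of (u, v) pairs; owner: vertex -> machine id (e.g. from ParMetis).
        # Each machine keeps its owned vertices plus "ghost" copies of remote
        # neighbors, so every locally stored edge has both endpoints available.
        machines = {}
        for u, v in edges:
            for m in {owner[u], owner[v]}:
                part = machines.setdefault(m, {"owned": set(), "ghosts": set(), "edges": []})
                part["edges"].append((u, v))
                for x in (u, v):
                    (part["owned"] if owner[x] == m else part["ghosts"]).add(x)
        return machines

    edges = [("a", "b"), ("b", "c"), ("c", "d")]
    owner = {"a": 0, "b": 0, "c": 1, "d": 1}
    print(partition_with_ghosts(edges, owner))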
The GraphLab Framework
• Graph-based data representation
• Update functions (user computation)
• Consistency model
Update Function
A user-defined program, applied to a vertex; transforms data in the scope of the vertex.

  Pagerank(scope) {
    // Update the current vertex data
    vertex.PageRank = α
    ForEach inPage:
      vertex.PageRank += (1 − α) × inPage.PageRank
    // Reschedule neighbors if needed
    if vertex.PageRank changes then
      reschedule_all_neighbors
  }

• Update function applied (asynchronously) in parallel until convergence
• Many schedulers available to prioritize computation
• Rescheduling selectively triggers computation at neighbors
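Below is a sequential Python sketch of the same idea (my own code, not the GraphLab API): an update function recomputes one vertex's PageRank and reports whether it changed, and a simple FIFO scheduler reschedules the out-neighbors of changed vertices. I divide each in-neighbor's rank by its out-degree, which the slide's pseudocode leaves implicit; ALPHA and TOL are assumed constants.

    from collections import deque

    ALPHA, TOL = 0.15, 1e-4

    def pagerank_update(v, graph, rank):
        # Recompute this vertex's rank from its in-neighbors; return True if it
        # changed enough that its out-neighbors should be rescheduled.
        new_rank = ALPHA + (1 - ALPHA) * sum(rank[u] / len(graph["out"][u])
                                             for u in graph["in"][v])
        changed = abs(new_rank - rank[v]) > TOL
        rank[v] = new_rank
        return changed

    def run(graph, rank):
        # FIFO scheduler: start with every vertex; selectively reschedule only
        # the out-neighbors of vertices whose value actually changed.
        queue, queued = deque(rank), set(rank)
        while queue:
            v = queue.popleft()
            queued.discard(v)
            if pagerank_update(v, graph, rank):
                for w in graph["out"][v]:
                    if w not in queued:
                        queue.append(w)
                        queued.add(w)
        return rank

    graph = {"in":  {"a": ["c"], "b": ["a"], "c": ["a", "b"]},
             "out": {"a": ["b", "c"], "b": ["c"], "c": ["a"]}}
    print(run(graph, {v: 1.0 for v in graph["in"]}))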
Distributed Scheduling
• Each machine maintains a schedule over the vertices it owns
• Distributed consensus used to identify completion
Ensuring Race-Free Code
• How much can computation overlap?
The GraphLab Framework
• Graph-based data representation
• Update functions (user computation)
• Consistency model
PageRank Revisited

  Pagerank(scope) {
    vertex.PageRank = α
    ForEach inPage:
      vertex.PageRank += (1 − α) × inPage.PageRank
    …
  }
PageRank data races confound convergence.
Racing PageRank: Bug

  Pagerank(scope) {
    vertex.PageRank = α
    ForEach inPage:
      vertex.PageRank += (1 − α) × inPage.PageRank
    …
  }

Intermediate values of vertex.PageRank are visible to concurrent readers while the sum is still being accumulated.
Racing PageRank: Bug Fix

  Pagerank(scope) {
    tmp = α
    ForEach inPage:
      tmp += (1 − α) × inPage.PageRank
    vertex.PageRank = tmp
    …
  }
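In runnable form (reusing the graph/rank layout from the earlier sketch), the fix is simply to accumulate into a local temporary and publish the result with a single write, so a concurrent reader of rank[v] never observes a partially accumulated sum:

    def pagerank_update_fixed(v, graph, rank, alpha=0.15):
        # Accumulate into a local temporary ...
        tmp = alpha
        for u in graph["in"][v]:
            tmp += (1 - alpha) * rank[u] / len(graph["out"][u])
        # ... and publish with one final write; readers never see a partial sum.
        rank[v] = tmp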
Throughput != Performance
• No consistency gives higher throughput (#updates/sec), but potentially slower convergence of the ML algorithm
Serializability
For every parallel execution, there exists a sequential execution of update functions which produces the same result.
(Figure: updates running in parallel on CPU 1 and CPU 2 over time are equivalent to some single sequential execution on one CPU.)
Serializability Example
• Edge consistency: the update function may write its vertex and adjacent edges, but only read adjacent vertices
• Overlapping regions are only read, so update functions one vertex apart can be run in parallel
• Stronger / weaker consistency levels are available; user-tunable consistency levels trade off parallelism & consistency
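One way to picture enforcing a consistency level is with per-vertex locks acquired in a global order, as in the Python sketch below (my illustration; real GraphLab distinguishes read locks on neighbors from write locks on the center vertex and uses a distributed locking protocol, whereas this sketch conservatively takes exclusive locks over the whole scope):

    import threading

    vertices = ["a", "b", "c", "d"]
    locks = {v: threading.Lock() for v in vertices}   # one lock per vertex

    def run_update(v, neighbors, update_fn):
        # Lock the vertex and its neighbors in a fixed global order (sorted ids)
        # so overlapping updates serialize instead of deadlocking or racing.
        scope = sorted([v, *neighbors])
        for u in scope:
            locks[u].acquire()
        try:
            update_fn(v)
        finally:
            for u in reversed(scope):
                locks[u].release()

Because it uses exclusive locks everywhere, this sketch also serializes updates whose scopes overlap only in read regions, so it admits less parallelism than the edge-consistency model above while still producing a serializable outcome.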
Distributed Consistency
• Solution 1: Chromatic Engine – edge consistency via graph coloring
• Solution 2: Distributed locking
Chromatic Distributed Engine
• Each machine executes tasks on all of its vertices of color 0
• Ghost synchronization completion + barrier
• Each machine executes tasks on all of its vertices of color 1
• Ghost synchronization completion + barrier, and so on for each color
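A minimal sketch of the idea in Python (illustrative only; a real engine parallelizes within a color and synchronizes ghosts across machines): first greedily color the graph so adjacent vertices get different colors, then sweep the colors one at a time with a barrier between colors.

    def greedy_color(adj):
        # Adjacent vertices receive different colors, so all vertices of a single
        # color have disjoint write scopes under edge consistency.
        color = {}
        for v in adj:
            taken = {color[u] for u in adj[v] if u in color}
            c = 0
            while c in taken:
                c += 1
            color[v] = c
        return color

    def chromatic_sweep(adj, color, update_fn):
        # For each color in turn: update every vertex of that color (safe to run
        # in parallel), then a ghost synchronization + barrier would follow.
        for c in sorted(set(color.values())):
            for v in adj:
                if color[v] == c:
                    update_fn(v)
            # ... ghost synchronization + barrier here in a distributed run

    adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
    print(greedy_color(adj))   # e.g. {'a': 0, 'b': 1, 'c': 2, 'd': 0}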
Matrix Factorization
• Netflix collaborative filtering
  – Alternating least squares matrix factorization
• Model: 0.5 million nodes, 99 million edges
(Figure: the Netflix users × movies rating matrix is factored into D-dimensional user and movie factor matrices.)
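A compact NumPy sketch of alternating least squares for such a ratings matrix (my own illustration; the latent dimension D, regularization lam, and iteration count are assumed hyperparameters). Each user row and each movie row is an independent regularized least-squares solve, which is what makes the computation graph-parallel over the user/movie bipartite graph.

    import numpy as np

    def als(R, mask, D=10, lam=0.1, iters=20):
        # R: users x movies ratings; mask: 1 where a rating is observed.
        # Alternate: fix movie factors M and solve for each user's factors U[i],
        # then fix U and solve for each movie's factors M[j].
        n_users, n_movies = R.shape
        rng = np.random.default_rng(0)
        U = rng.normal(size=(n_users, D))
        M = rng.normal(size=(n_movies, D))
        for _ in range(iters):
            for i in range(n_users):                   # independent per-user solves
                obs = mask[i] == 1
                A = M[obs].T @ M[obs] + lam * np.eye(D)
                U[i] = np.linalg.solve(A, M[obs].T @ R[i, obs])
            for j in range(n_movies):                  # independent per-movie solves
                obs = mask[:, j] == 1
                A = U[obs].T @ U[obs] + lam * np.eye(D)
                M[j] = np.linalg.solve(A, U[obs].T @ R[obs, j])
        return U, M

    R = np.array([[5., 0., 3.], [0., 4., 1.]]); mask = (R > 0).astype(int)
    U, M = als(R, mask, D=2, iters=10)
    print(U @ M.T)   # reconstruction approximates the observed entries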