End-to-End In-memory Graph Analytics Jure Leskovec (@jure) Including joint work with Rok Sosic, Deepak Narayanan, Yonathan Perez, et al. Jure Leskovec, Stanford 1
Background & Motivation My research at Stanford: § Mining large social and information networks § We work with data from Facebook, Twitter, LinkedIn, Wikipedia, StackOverflow There is much research on graph processing systems, but we don’t find it that useful… Why is that? What tools do we use? What do we see as the big challenges?
Some Observations § We do not develop experimental systems to compete on benchmarks § BFS, PageRank, Triangle counting, etc. § Our work is § Knowledge discovery: Working on new problems using novel datasets to extract new knowledge § And as a side effect developing (graph) algorithms and software systems
End-to-End Graph Analytics [Diagram: Data -> Graph analytics -> New knowledge and insights] We need an end-to-end graph analytics system that is flexible, scalable, and allows for easy implementation of new algorithms.
Typical Workload § Finding experts on StackOverflow: [Workflow diagram: Posts -> Select (Python questions) and Select (answers) -> Join (Q&A) -> Construct Graph -> PageRank Algorithm -> Scores -> Join with Users -> Experts]
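The workflow above can be sketched in plain Python. This is an illustration only and does not use the SNAP API: the toy Posts and Users tables, their column names, and the `pagerank` helper are all assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical toy tables; column names are assumptions for illustration.
posts = [
    {"id": 1, "type": "question", "tag": "python", "user": "ann", "accepted": 3},
    {"id": 2, "type": "question", "tag": "python", "user": "bob", "accepted": 4},
    {"id": 3, "type": "answer", "tag": "python", "user": "cat", "accepted": None},
    {"id": 4, "type": "answer", "tag": "python", "user": "cat", "accepted": None},
    {"id": 5, "type": "question", "tag": "java", "user": "dan", "accepted": 6},
    {"id": 6, "type": "answer", "tag": "java", "user": "ann", "accepted": None},
]
users = {"ann": "Ann", "bob": "Bob", "cat": "Cat", "dan": "Dan"}


def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank; dangling nodes spread their rank uniformly."""
    nodes = sorted({n for edge in edges for n in edge})
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iterations):
        dangling = sum(rank[n] for n in nodes if not out[n])
        base = (1.0 - damping + damping * dangling) / len(nodes)
        nxt = {n: base for n in nodes}
        for src in nodes:
            for dst in out[src]:
                nxt[dst] += damping * rank[src] / len(out[src])
        rank = nxt
    return rank


# Select: Python questions, and all answers indexed by post id.
questions = [p for p in posts if p["type"] == "question" and p["tag"] == "python"]
answers = {p["id"]: p for p in posts if p["type"] == "answer"}

# Join questions with their accepted answers, then construct the graph:
# one edge from the asker to the user who gave the accepted answer.
edges = [(q["user"], answers[q["accepted"]]["user"])
         for q in questions if q["accepted"] in answers]

# PageRank scores, joined with the Users table and sorted to rank experts.
scores = pagerank(edges)
experts = sorted(((users[u], s) for u, s in scores.items()),
                 key=lambda row: row[1], reverse=True)
print(experts)
```

In this sketch the graph is derived entirely from the tables, so the edge definition (asker to accepted answerer) is itself a modelling choice.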
Observation Graphs are never given! Graphs have to be constructed from input data! (Graph construction is part of the knowledge discovery process.) Examples: § Facebook graphs: Friend, Communication, Poke, Co-tag, Co-location, Co-event § Cellphone/Email graphs: How many calls? § Biology: P2P, Gene interaction networks
Graph Analytics Workflow [Diagram: Raw data (video, text, sound, events, sensor data, gene sequences, documents, …) -> Hadoop MapReduce -> Structured data (relational tables) -> Graph analytics] § Input: Structured data § Output: Results of network analyses § Node, edge, network properties § Expanded relational tables § Networks
Plan for the Talk: Three Topics § SNAP: an in-memory system for end-to-end graph analytics § Constructing graphs from data § Multimodal networks § Representing richer types of graphs § New graph algorithms § Higher-order network partitioning § Feature learning in networks
SNAP: Stanford Network Analysis Platform
SNAP: A General Purpose Network Analysis and Graph Mining Library. R. Sosic, J. Leskovec. ACM TIST 2016.
RINGO: Interactive Graph Analytics on Big-Memory Machines. Y. Perez, R. Sosic, A. Banerjee, R. Puttagunta, M. Raison, P. Shah, J. Leskovec. SIGMOD 2015.
End-to-End Graph Analytics [Diagram: Data -> Graph analytics -> New knowledge and insights] § Stanford Network Analysis Platform (SNAP): General-purpose, high-performance system for analysis and manipulation of networks § C++, Python (BSD, open source) § http://snap.stanford.edu § Scales to networks with hundreds of millions of nodes and billions of edges
Desiderata for Graph Analytics § Easy to use front-end § Common high-level programming language § Fast execution times § Interactive use (as opposed to batch use) § Ability to process large graphs § Billions of edges § Support for several data representations § Transformations between tables and graphs § Large number of graph algorithms § Straightforward to use § Workflow management and reproducibility § Provenance
Data Sizes in Network Analytics
Networks in Stanford Large Network Collection:
Number of Edges    Number of Graphs
<0.1M              16
0.1M – 1M          25
1M – 10M           17
10M – 100M         7
100M – 1B          5
> 1B               1
§ http://snap.stanford.edu
§ Common benchmark: the Twitter2010 graph has 1.5B edges and requires 13.2GB RAM in SNAP
Network of All Published Research
Entity          #Items    Size
Papers          122.7M    32.4GB
Authors         123.1M    3.1GB
References      757.5M    14.4GB
Affiliations    325.4M    15.3GB
Keywords        176.8M    5.9GB
Total           1.9B      104.1GB
§ Microsoft Academic Graph
All Biomedical Research
Dataset                     #Items    Raw Size
DisGeNet                    30K       10MB
STRING                      10M       1TB
OMIM                        25K       100MB
CTD                         55K       1.2GB
HPRD                        30K       30MB
BioGRID                     64K       100MB
DrugBank                    7K        60MB
Disease Ontology            10K       5MB
Protein Ontology            200K      130MB
Mesh Hierarchy              30K       40MB
PubChem                     90M       1GB
DGIdb                       5K        30MB
Gene Ontology               45K       10MB
MSigDB                      14K       70MB
Reactome                    20K       100MB
GEO                         1.7M      80GB
ICGC (66 cancer projects)   40M       1TB
GTEx                        50M       100GB
Total: 250M entities, 2.2TB raw data
Availability of Hardware Could all these datasets fit into the RAM of a single machine? Single machine prices: § Server 1TB RAM, 80 cores, $25K § Server 6TB RAM, 144 cores, $200K § Server 12TB RAM, 288 cores, $400K In my group we have had 1TB RAM machines since 2012 and just got a 12TB RAM machine
Dataset vs. RAM Sizes § KDNuggets survey, run since 2006: “What is the largest dataset you analyzed/mined?” § Big RAM is eating big data: § Yearly increase of dataset sizes: 20% § Yearly increase of RAM sizes: 50% Bottom line: Want to do graph analytics? Get a BIG machine!
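To see what the two growth rates imply, the compounding can be worked out directly. The sketch below assumes both rates stay constant from year to year, which the slide does not claim; the function name is made up for the example.

```python
# Yearly growth factors taken from the slide.
DATASET_GROWTH = 1.20  # dataset sizes grow 20% per year
RAM_GROWTH = 1.50      # RAM sizes grow 50% per year


def ram_to_data_ratio(years):
    """Factor by which RAM outgrows datasets after `years` years,
    assuming both rates stay constant (an assumption of this sketch)."""
    return (RAM_GROWTH / DATASET_GROWTH) ** years


for years in (1, 5, 10):
    print(years, round(ram_to_data_ratio(years), 2))
```

Under that assumption RAM gains a factor of 1.25 on dataset sizes every year, so the gap widens geometrically rather than linearly.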
Trade-offs
Option 1                                         Option 2
Standard SQL database                            Custom representations
Separate systems for tables and graphs           Integrated system for tables and graphs
Single representation for tables and graphs      Separate table and graph representations
Distributed system                               Single machine system
Disk-based structures                            In-memory structures
[The slide highlights the option chosen in each row as SNAP]
Graph Analytics: SNAP [Workflow diagram. Steps: Specify entities, Specify relationships, Optimize representation, Perform graph analytics, Integrate results. Data: Unstructured data, Relational tables, Tabular networks, Network representation, Results. System: SNAP]
Experts on StackOverflow
Graph Construction in SNAP § SNAP (Python) code for the StackOverflow expert-finding example
RINGO: Interactive Graph Analytics on Big-Memory Machines. Y. Perez, R. Sosic, A. Banerjee, R. Puttagunta, M. Raison, P. Shah, J. Leskovec. SIGMOD 2015.
SNAP Overview [Architecture diagram: High-Level Language User Front-End; Interface with Graph Processing Engine; Metadata (Provenance); Provenance Script; SNAP: In-memory Graph Processing Engine with Graph Methods, Graph Containers, Graph/Table Conversions, Table Objects, and Filters; Secondary Storage]
Graph Construction
Input data must be manipulated and transformed into graphs
Table data structure:
Src    Dst    …
v1     v2     …
v2     v3     …
v3     v4     …
v1     v3     …
v1     v4     …
Graph data structure: [Figure: graph on nodes v1, v2, v3, v4 with one edge per table row]
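A minimal sketch of the table-to-graph step, in plain Python rather than the SNAP API. The five rows are an assumed reading of the slide's Src/Dst table, and `table_to_graph` is a hypothetical helper name.

```python
# Edge table as (Src, Dst) rows; an assumed reading of the slide's table.
rows = [("v1", "v2"), ("v2", "v3"), ("v3", "v4"), ("v1", "v3"), ("v1", "v4")]


def table_to_graph(rows):
    """Convert (src, dst) rows into a directed adjacency structure.

    Returns {node: sorted list of out-neighbors}. Every endpoint becomes
    a node, including nodes that have no outgoing edges.
    """
    adjacency = {}
    for src, dst in rows:
        adjacency.setdefault(src, set()).add(dst)
        adjacency.setdefault(dst, set())
    return {node: sorted(nbrs) for node, nbrs in sorted(adjacency.items())}


graph = table_to_graph(rows)
print(graph)
```

Duplicate rows collapse into a single edge here; a multigraph would need to keep a list of edges instead of a set.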
Creating a Graph in SNAP Four ways to create a graph: Nodes connected based on (1) Pairwise node similarity (2) Temporal order of nodes (3) Grouping and aggregation of nodes (4) The data already contains edges as source and destination pairs
Creating Graphs in SNAP (1) Similarity-based: In a forum, connect users that post to similar topics § Distance metrics § Euclidean, Haversine, Jaccard distance § Connect similar nodes § SimJoin, connect if data points are closer than some threshold § How to get around quadratic complexity – Locality Sensitive Hashing
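The similarity-based construction can be illustrated with a brute-force join on Jaccard distance. This is a plain-Python sketch, not SNAP's SimJoin: the function names and the toy topic sets are assumptions, and it compares all pairs, so it has exactly the quadratic cost that Locality Sensitive Hashing is meant to avoid.

```python
from itertools import combinations


def jaccard_distance(a, b):
    """1 - |a ∩ b| / |a ∪ b|; two empty sets are treated as identical."""
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def sim_join(items, threshold):
    """Connect every pair whose Jaccard distance is below `threshold`.

    Brute force over all pairs: O(n^2) distance computations.
    """
    pairs = combinations(sorted(items.items()), 2)
    return [(u, v) for (u, a), (v, b) in pairs
            if jaccard_distance(a, b) < threshold]


# Hypothetical forum data: the topics each user has posted to.
topics = {
    "ann": {"python", "pandas", "numpy"},
    "bob": {"python", "pandas"},
    "cat": {"java", "spring"},
}
print(sim_join(topics, threshold=0.5))
```

The threshold controls graph density: a larger value connects more loosely related users.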
Creating Graphs in SNAP (2) Sequence-based: In a Web log, connect pages in an order clicked by the users (click-trail) § Connect a node with its K successors § Events selected per user, ordered by timestamps § NextK, connect K successors
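A sketch of the sequence-based construction in plain Python. It follows the description above (group events per user, order by timestamp, connect each event to its K successors) but is not SNAP's NextK implementation; the function name and the toy click log are assumptions.

```python
from collections import defaultdict


def next_k(events, k):
    """Connect each page to the next `k` pages in the same user's click trail.

    `events` is an iterable of (user, timestamp, page) tuples.
    Returns a list of directed (page, successor_page) edges.
    """
    by_user = defaultdict(list)
    for user, timestamp, page in events:
        by_user[user].append((timestamp, page))

    edges = []
    for user in sorted(by_user):
        # Order this user's events by timestamp to recover the click trail.
        trail = [page for _, page in sorted(by_user[user])]
        for i, src in enumerate(trail):
            for dst in trail[i + 1:i + 1 + k]:
                edges.append((src, dst))
    return edges


# Hypothetical web log: (user, timestamp, page), deliberately unsorted.
log = [("u1", 3, "c"), ("u1", 1, "a"), ("u1", 2, "b"),
       ("u2", 1, "a"), ("u2", 2, "c")]
print(next_k(log, k=1))
```

With k=1 the result is the plain click trail; larger k also links pages that were visited a few clicks apart.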
Creating Graphs in SNAP (3) § Aggregation: Measure the activity level of different user groups § Edge creation § Partition users to groups § Identify interactions within each group § Compute a score for each group based on interactions § Treat groups as super-nodes in a graph
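The aggregation steps can be sketched as follows, in plain Python rather than the SNAP API. Using the count of within-group interactions as the group score, and weighting super-node edges by the number of cross-group interactions, are assumptions of this example; the slide does not specify either.

```python
from collections import Counter


def aggregate(group_of, interactions):
    """Collapse users into group super-nodes.

    `group_of` maps user -> group; `interactions` is a list of (user, user).
    Returns (scores, super_edges):
      scores:      group -> number of interactions inside the group
      super_edges: (group, group) -> number of interactions across groups
    Both scoring rules are assumptions made for this sketch.
    """
    scores = {group: 0 for group in set(group_of.values())}
    super_edges = Counter()
    for u, v in interactions:
        gu, gv = group_of[u], group_of[v]
        if gu == gv:
            scores[gu] += 1
        else:
            super_edges[tuple(sorted((gu, gv)))] += 1
    return scores, dict(super_edges)


# Hypothetical partition of users and a list of their interactions.
group_of = {"ann": "students", "bob": "students", "cat": "staff", "dan": "staff"}
interactions = [("ann", "bob"), ("bob", "ann"), ("cat", "dan"), ("ann", "cat")]
scores, super_edges = aggregate(group_of, interactions)
print(scores, super_edges)
```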
Graphs and Methods [Diagram: Graph methods (generation, manipulation, analytics) operate on Graph containers (graphs, networks)] § SNAP supports several graph types § Directed, Undirected, Multigraph § >200 graph algorithms § Any algorithm works on any container
SNAP Implementation § High-level front end § Python module § Uses SWIG for C++ interface § High-performance graph engine § C++ based on SNAP § Multi-core support § OpenMP to parallelize loops § Fast, concurrent hash table, vector operations
Graphs in SNAP
§ Directed graphs in SNAP: Nodes table, each node with sorted vectors of in- and out-neighbors
§ Directed multigraphs in SNAP: Nodes table, each node with sorted vectors of in- and out-edges, plus an Edges table
[Figure: example node and edge tables]
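The directed-graph layout described above (a node table whose entries hold sorted vectors of in- and out-neighbors) can be mimicked in a few lines of Python. This is an illustrative sketch, not SNAP's C++ implementation; the class and method names are assumptions.

```python
from bisect import bisect_left, insort


class DirectedGraph:
    """Node table mapping node id -> (sorted in-neighbors, sorted out-neighbors)."""

    def __init__(self):
        self.nodes = {}

    def add_node(self, node):
        self.nodes.setdefault(node, ([], []))

    def has_edge(self, src, dst):
        if src not in self.nodes:
            return False
        out_nbrs = self.nodes[src][1]
        # Binary search works because the neighbor vector is kept sorted.
        i = bisect_left(out_nbrs, dst)
        return i < len(out_nbrs) and out_nbrs[i] == dst

    def add_edge(self, src, dst):
        self.add_node(src)
        self.add_node(dst)
        if not self.has_edge(src, dst):
            insort(self.nodes[src][1], dst)  # out-neighbors of src
            insort(self.nodes[dst][0], src)  # in-neighbors of dst

    def out_neighbors(self, node):
        return self.nodes[node][1]

    def in_neighbors(self, node):
        return self.nodes[node][0]


g = DirectedGraph()
for s, d in [(1, 3), (1, 2), (3, 1), (1, 2)]:
    g.add_edge(s, d)
print(g.out_neighbors(1), g.in_neighbors(1))
```

Keeping the vectors sorted makes edge lookups logarithmic in the node degree, at the price of slower insertions; a multigraph variant would store edge ids in these vectors and keep a separate edge table.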
Experiments: Datasets
Dataset             LiveJournal    Twitter2010
Nodes               4.8M           42M
Edges               69M            1.5B
Text Size (disk)    1.1GB          26.2GB
Graph Size (RAM)    0.7GB          13.2GB
Table Size (RAM)    1.1GB          23.5GB
Benchmarks, One Computer
Algorithm     PageRank       PageRank       Triangles      Triangles
Graph         LiveJournal    Twitter2010    LiveJournal    Twitter2010
Giraph        45.6s          439.3s         N/A            N/A
GraphX        56.0s          -              67.6s          -
GraphChi      54.0s          595.3s         66.5s          -
PowerGraph    27.5s          251.7s         5.4s           706.8s
SNAP          2.6s           72.0s          13.7s          284.1s
Hardware: 4x Intel CPU, 64 cores, 1TB RAM, $35K