Massive-scale analysis of streaming social networks

Massive-scale analysis of streaming social networks David A. Bader - PowerPoint PPT Presentation



  1. Massive-scale analysis of streaming social networks David A. Bader

  2. Exascale Streaming Data Analytics: Real-world challenges
  All involve analyzing massive streaming complex networks:
  • Health care: disease spread, detection and prevention of epidemics/pandemics (e.g. SARS, Avian flu, H1N1 "swine" flu)
  • Massive social networks: understanding communities, intentions, population dynamics, pandemic spread, transportation and evacuation
  • Intelligence: business analytics, anomaly detection, security, knowledge discovery from massive data sets
  • Systems Biology: understanding complex life systems, drug design, microbial research, unraveling the mysteries of the HIV virus; understanding life and disease
  • Electric Power Grid: communication, transportation, energy, water, food supply
  • Modeling and Simulation: perform full-scale economic-social-political simulations
  [Chart: Facebook active users, Dec-04 through Dec-09, exponential growth to more than 400 million active users]
  Sample queries:
  • Allegiance switching: identify entities that switch communities.
  • Community structure: identify the genesis and dissipation of communities
  • Phase change: identify significant change in the network structure
  Ex: discovered minimal changes in O(billions)-size complex networks that could hide or reveal top influencers in the community
  REQUIRES PREDICTING / INFLUENCING CHANGE IN REAL-TIME AT SCALE
  David A. Bader, DARPA Edge Finding Idea Summit 3

  3. Ubiquitous High Performance Computing (UHPC)
  Goal: develop highly parallel, security-enabled, power-efficient processing systems, supporting ease of programming, with resilient execution through all failure modes and intrusion attacks.
  Architectural Drivers: Energy Efficiency; Security and Dependability; Programmability
  Program Objectives:
  • One PFLOPS, single cabinet including self-contained cooling
  • 50 GFLOPS/W (equivalent to 20 pJ/FLOP)
  • Total cabinet power budget 57 kW, including processing resources, storage and cooling
  • Security embedded at all system levels
  • Parallel, efficient execution models
  • Highly programmable parallel systems
  • Scalable systems: from terascale to petascale
  David A. Bader (CSE), Echelon Leadership Team
  Echelon: Extreme-scale Compute Hierarchies with Efficient Locality-Optimized Nodes
  "NVIDIA-Led Team Receives $25 Million Contract From DARPA to Develop High-Performance GPU Computing Systems" - MarketWatch
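  A quick back-of-the-envelope check of how these targets fit together; the split of the 57 kW cabinet budget between compute and everything else is an illustrative assumption, not a figure from the slide.

```python
# Sketch: sanity-check the UHPC efficiency targets quoted above.
# The cabinet breakdown (compute vs. storage/cooling) is an illustrative
# assumption, not a number from the slide.

PFLOPS = 1e15            # target sustained performance, FLOP/s
GFLOPS_PER_WATT = 50e9   # target efficiency, FLOP/s per watt

energy_per_flop = 1.0 / GFLOPS_PER_WATT            # joules per FLOP
print(f"Energy per FLOP: {energy_per_flop * 1e12:.0f} pJ")          # -> 20 pJ

compute_power = PFLOPS / GFLOPS_PER_WATT           # watts needed for 1 PFLOPS
print(f"Compute power at 1 PFLOPS: {compute_power / 1e3:.0f} kW")   # -> 20 kW

cabinet_budget = 57e3                              # total cabinet budget, watts
print(f"Left for memory, storage, cooling: {(cabinet_budget - compute_power) / 1e3:.0f} kW")
```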

  4. Center for Adaptive Supercomputing Software (CASS-MT)
  • CASS-MT, launched July 2008
  • Pacific Northwest National Lab with Georgia Tech, Sandia, WA State, Delaware
  • The newest breed of supercomputers has hardware set up not just for speed, but also to better tackle large networks of seemingly random data. And now, a multi-institutional group of researchers has been awarded over $14 million to develop software for these supercomputers. Applications include anywhere complex webs of information can be found: from internet security and power grid stability to complex biological networks.
  David A. Bader 6

  5. CASS-MT Task 7: Analysis of Massive Social Networks
  Objective: design software for the analysis of massive-scale spatio-temporal interaction networks using multithreaded architectures such as the Cray XMT. The Center launched in July 2008 and is led by Pacific Northwest National Laboratory.
  Description: we are designing and implementing advanced, scalable algorithms for static and dynamic graph analysis, including generalized k-betweenness centrality and dynamic clustering coefficients.
  Highlights: on a 64-processor Cray XMT, k-betweenness centrality scales nearly linearly (58.4x) on a graph with 16M vertices and 134M edges. Initial streaming clustering coefficients handle around 200k updates/sec on a similarly sized graph. Our research is focusing on temporal analysis, answering questions about changes in global properties (e.g. diameter) as well as local structures (communities, paths).
  Image courtesy of Cray, Inc.
  David A. Bader (CASS-MT Task 7 lead), David Ediger, Karl Jiang, Jason Riedy
  David A. Bader 7
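  The streaming clustering-coefficient figure above refers to incremental updates applied as edges arrive, rather than recomputation from scratch. A minimal single-threaded sketch of that idea, assuming set-based adjacency (not the actual GraphCT/XMT code):

```python
# Minimal sketch: update local clustering coefficients incrementally as
# streamed edges arrive, instead of recomputing triangle counts each time.
from collections import defaultdict

adj = defaultdict(set)        # vertex -> set of neighbors
triangles = defaultdict(int)  # vertex -> number of triangles through it

def insert_edge(u, v):
    """Apply one streamed edge insertion and update triangle counts."""
    if u == v or v in adj[u]:
        return  # ignore self-loops and duplicate edges
    common = adj[u] & adj[v]          # new triangles closed by (u, v)
    adj[u].add(v)
    adj[v].add(u)
    triangles[u] += len(common)
    triangles[v] += len(common)
    for w in common:                  # each common neighbor gains one triangle
        triangles[w] += 1

def clustering_coefficient(v):
    """Local clustering coefficient from the maintained counts."""
    d = len(adj[v])
    return 0.0 if d < 2 else 2.0 * triangles[v] / (d * (d - 1))

for edge in [(0, 1), (1, 2), (0, 2), (2, 3)]:
    insert_edge(*edge)
print(clustering_coefficient(2))  # 0.333...: one triangle, degree 3
```

  The per-insertion cost is dominated by the common-neighbor intersection, which is why update rates in the hundreds of thousands per second are plausible on a machine built for irregular memory access.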

  6. Driving Forces in Social Network Analysis
  • Facebook has more than 500 million active users: 3 orders of magnitude growth in 3 years! [Chart: active users, Dec-04 through Jun-10]
  • Note the graph is changing as well as growing.
  • What are this graph's properties? How do they change?
  • Traditional graph partitioning often fails:
    – Topology: the interaction graph is low-diameter and has no good separators
    – Irregularity: communities are not uniform in size
    – Overlap: individuals are members of one or more communities
  • Sample queries:
    – Allegiance switching: identify entities that switch communities (a sketch of this query appears below)
    – Community structure: identify the genesis and dissipation of communities
    – Phase change: identify significant change in the network structure
  David A. Bader 8
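  As a concrete illustration of the allegiance-switching query, here is a minimal sketch that compares community labels between two graph snapshots; the vertex names and labels are toy data, and in practice they would come from a community-detection pass on each snapshot.

```python
# Illustrative sketch of the "allegiance switching" query: given community
# labels from two snapshots of the graph, report vertices whose community
# assignment changed between them.

def allegiance_switchers(labels_t0, labels_t1):
    """Vertices present in both snapshots whose community label changed."""
    return [v for v, c in labels_t1.items()
            if v in labels_t0 and labels_t0[v] != c]

labels_t0 = {"alice": "A", "bob": "A", "carol": "B", "dave": "B"}
labels_t1 = {"alice": "A", "bob": "B", "carol": "B", "dave": "B"}
print(allegiance_switchers(labels_t0, labels_t1))  # ['bob']
```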

  7. Example: Mining Twitter for Social Good (ICPP 2010)
  Image credit: bioethicsinstitute.org
  David A. Bader 9

  8. Massive Data Analytics: Protecting our Nation
  Public Health:
  • CDC / nation-scale surveillance of public health
  • Cancer genomics and drug design – computed betweenness centrality of Human Genome core protein interactions [Plot: Human Proteome, Degree vs. Betweenness Centrality; ENSG00000145332.2, Kelch-like protein 8, implicated in breast cancer]
  US High Voltage Transmission Grid (>150,000 miles of line)
  David A. Bader 10
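  Betweenness centrality, plotted against degree above, counts how many shortest paths pass through each vertex. For reference, a compact serial sketch of Brandes' algorithm on an unweighted graph (not the multithreaded XMT implementation used for the proteome computation):

```python
# Sketch of exact betweenness centrality for an unweighted graph
# (Brandes' algorithm): one BFS per source, then dependency accumulation.
from collections import deque

def betweenness(adj):
    """adj: dict vertex -> iterable of neighbors. Returns vertex -> score."""
    bc = {v: 0.0 for v in adj}
    for s in adj:
        # BFS from s, recording shortest-path counts and predecessors
        dist = {s: 0}
        sigma = {v: 0.0 for v in adj}
        sigma[s] = 1.0
        preds = {v: [] for v in adj}
        order, queue = [], deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in adj[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Accumulate dependencies in reverse BFS order
        delta = {v: 0.0 for v in adj}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += (sigma[v] / sigma[w]) * (1.0 + delta[w])
            if w != s:
                bc[w] += delta[w]
    return bc

graph = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}   # a path: 0-1-2-3
print(betweenness(graph))  # the middle vertices 1 and 2 carry all shortest paths
```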

  9. Network Analysis for Intelligence and Surveillance
  • [Krebs '04] Post-9/11 terrorist network analysis from public-domain information
  • Plot masterminds correctly identified from interaction patterns: centrality
  • A global view of entities is often more insightful
  • Detect anomalous activities by exact/approximate graph matching (see the toy sketch below)
  Image source: http://www.orgnet.com/hijackers.html
  Image source: T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies for intelligence analysis, CACM 47(3), March 2004, pp. 45-47
  David A. Bader 11
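  A toy sketch of the exact-matching idea, brute-force and only viable for very small patterns; the data graph and query pattern below are invented for illustration, and real anomaly-detection systems use far more scalable exact/approximate matchers.

```python
# Toy sketch: brute-force search for one occurrence of a small query pattern
# (as a subgraph) in a larger data graph.
from itertools import permutations

def find_pattern(data_adj, pattern_edges, pattern_nodes):
    """Return one mapping pattern vertex -> data vertex that preserves all
    pattern edges, or None if the pattern does not occur."""
    data_nodes = list(data_adj)
    for candidate in permutations(data_nodes, len(pattern_nodes)):
        mapping = dict(zip(pattern_nodes, candidate))
        if all(mapping[v] in data_adj[mapping[u]] for u, v in pattern_edges):
            return mapping
    return None

data = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2}}     # data graph
triangle = [("a", "b"), ("b", "c"), ("c", "a")]          # query: a triangle
print(find_pattern(data, triangle, ["a", "b", "c"]))     # {'a': 0, 'b': 1, 'c': 2}
```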

  10. Massive data analytics in informatics networks
  • Graphs arising in informatics are very different from topologies in scientific computing: scientific computing deals with static networks and Euclidean topologies, while emerging applications involve dynamic, high-dimensional data.
  • We need new data representations and parallel algorithms that exploit topological characteristics of informatics networks.
  David A. Bader 12
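  As one concrete example of what a data representation for a large sparse graph can look like, here is a minimal compressed sparse row (CSR) construction; the slide does not name a specific format, so this is purely illustrative.

```python
# Illustrative sketch: build a compressed sparse row (CSR) representation
# (offset array + target array) from an undirected edge list. Two flat
# arrays keep adjacency compact and cache/streaming friendly.

def to_csr(num_vertices, edges):
    """Build CSR arrays (offsets, targets) from an undirected edge list."""
    degree = [0] * num_vertices
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    offsets = [0] * (num_vertices + 1)
    for v in range(num_vertices):
        offsets[v + 1] = offsets[v] + degree[v]
    targets = [0] * offsets[num_vertices]
    cursor = offsets[:-1].copy()          # next free slot per vertex
    for u, v in edges:
        targets[cursor[u]] = v; cursor[u] += 1
        targets[cursor[v]] = u; cursor[v] += 1
    return offsets, targets

offsets, targets = to_csr(4, [(0, 1), (1, 2), (2, 3), (0, 2)])
v = 2
print(targets[offsets[v]:offsets[v + 1]])   # neighbors of vertex 2: [1, 3, 0]
```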

  11. The Reality
  • This image is a visualization of my personal Friendster network (circa February 2004) to 3 hops out. The network consists of 47,471 people connected by 432,430 edges.
  Credit: Jeffrey Heer, UC Berkeley
  David A. Bader 13

  12. Limitations of Current Tools
  Graphs with millions of vertices are well beyond simple comprehension or visualization: we need tools to summarize the graphs.
  Existing tools: UCINet, Pajek, SocNetV, tnet
  Limitations: they target workstations and are limited in memory; they have no parallelism and are limited in performance; they scale only to low-density graphs with a few million vertices.
  We need a package that will easily accommodate graphs with several billion vertices and deliver results in a timely manner. We need parallelism both for computational speed and for memory! The Cray XMT is a natural fit...
  David A. Bader 14

  13. The Cray XMT
  • Tolerates latency by massive multithreading
    – Hardware support for 128 threads on each processor
    – Globally hashed address space
    – No data cache
    – Single-cycle context switch
    – Multiple outstanding memory requests
  • Support for fine-grained, word-level synchronization
    – Full/empty bit associated with every memory word
  • Flexibly supports dynamic load balancing
  • GraphCT currently tested on a 128-processor XMT: 16K threads, 1 TB of globally shared memory
  Image source: cray.com
  David A. Bader 15
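  The full/empty bit is what makes the fine-grained synchronization practical: every memory word carries a hardware flag, and loads/stores can wait on it. As a software analogy only (the XMT does this in hardware, exposed through compiler intrinsics such as readfe/writeef in Cray's C environment), the pattern looks roughly like this:

```python
# Analogy-only sketch of full/empty-bit synchronization in plain Python
# threads: each "word" carries a full/empty flag; a read blocks until the
# word is full and leaves it empty, a write blocks until it is empty and
# leaves it full. This illustrates the pattern, it is not XMT code.
import threading

class SyncWord:
    def __init__(self):
        self._cond = threading.Condition()
        self._full = False
        self._value = None

    def writeef(self, value):          # write when empty, leave full
        with self._cond:
            while self._full:
                self._cond.wait()
            self._value, self._full = value, True
            self._cond.notify_all()

    def readfe(self):                  # read when full, leave empty
        with self._cond:
            while not self._full:
                self._cond.wait()
            self._full = False
            self._cond.notify_all()
            return self._value

word = SyncWord()
threading.Thread(target=lambda: word.writeef(42)).start()
print(word.readfe())  # 42: the reader blocks until the writer fills the word
```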

  14. Graph Analysis Performance: Multithreaded (Cray XMT) vs. Cache-based Multicore
  • SSCA#2 network, SCALE 24 (16.77 million vertices and 134.21 million edges)
  • [Chart: Betweenness TEPS rate (millions of edges per second) vs. number of processors/cores (1 to 16), for the Cray XMT, Sun UltraSPARC T2, and a 2.0 GHz quad-core Xeon]
  David A. Bader 16
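  TEPS here means traversed edges per second. A hedged sketch of how such a rate is derived from a run; the traversal count and timing below are illustrative assumptions, not the measurements behind the plot.

```python
# Hedged sketch: deriving a TEPS (traversed edges per second) rate from a
# run. The number of BFS roots and the elapsed time are illustrative
# assumptions, not the slide's measured values.

num_edges = 134_210_000     # SCALE-24 SSCA#2 graph from the slide
num_bfs_roots = 256         # assumed number of sources in an approximate run
elapsed_seconds = 215.0     # assumed wall-clock time for the computation

traversed_edges = num_edges * num_bfs_roots   # roughly one sweep per source
teps = traversed_edges / elapsed_seconds
print(f"{teps / 1e6:.1f} million TEPS")       # ~159.8 M TEPS under these assumptions
```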
