
Kineograph - Raymond Cheng (University of Washington, Microsoft Research) et al.



  1. Kineograph - Raymond Cheng (University of Washington, Microsoft Research) et al.

  2. The challenge
     ● Social networks (Facebook, Twitter) generate a lot of information
     ● Let's analyze it!
     ● Simple data-mining won't do:
       ○ too much data
       ○ constant influx of new data
       ○ long computation time

  3. A solution
     ● Process a live stream of data (e.g. tweets)
     ● Aggregate it into a dynamic graph
     ● Take snapshots regularly
     ● Run distributed graph-mining on the snapshots
       ○ support incremental computation

  4. Kineograph architecture

  5. Data influx (ingest node)
     [Diagram: the tweet "@Alice: @Bob, check out these #kittens!" reaches an ingest node, which packs the resulting graph updates into a transaction T and sends it to node(Alice), node(Bob) and node(kittens). Edges shown: @Alice -> #kittens, @Alice -> @Bob, @Bob -> @Alice, #kittens -> @Alice. After receiving ACKs, the ingest node reports T to the progress table.]

  6. Data influx (ingest nodes)
     ● Parse the data and convert it to graph updates (i.e. sets of edges)
     ● Send the transaction to the affected graph nodes
       ○ at this point, it is only stored in a queue
     ● Report the submitted transaction to the global progress table (a vector clock)
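The ingest steps above can be sketched in a few lines of Python. This is only an illustration, not Kineograph's actual code (the system itself is written in C#): the names `tweet_to_transaction` and `route`, the regular expressions, and the choice to deliver each edge to both of its endpoints are all assumptions made for the example.

```python
import re
from collections import defaultdict
from itertools import count

# Hypothetical per-ingest-node sequence numbers for transactions.
_sequence = count(1)

def tweet_to_transaction(author, text):
    """Turn one tweet into a transaction: a sequence number plus a set of edges."""
    mentions = re.findall(r"@(\w+)", text)
    hashtags = re.findall(r"#(\w+)", text)
    edges = {("@" + author, "@" + m) for m in mentions}
    edges |= {("@" + author, "#" + h) for h in hashtags}
    return {"seq": next(_sequence), "edges": edges}

def route(transaction):
    """Group edges by the vertices they touch, i.e. by affected graph node."""
    per_vertex = defaultdict(set)
    for src, dst in transaction["edges"]:
        per_vertex[src].add((src, dst))
        per_vertex[dst].add((src, dst))
    return dict(per_vertex)

tx = tweet_to_transaction("Alice", "@Bob , check out these #kittens !")
updates = route(tx)
```

Here `updates` has one entry per affected vertex (`@Alice`, `@Bob`, `#kittens`), which mirrors the three graph nodes that receive transaction T on the previous slide.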

  7. Snapshot creation

  8. Snapshot creation
     ● The snapshooter initiates the process
       ○ in practice, every 10 seconds
     ● The snapshooter copies the current progress table and sends it to the graph nodes
     ● Graph nodes commit transactions up to the times specified in the progress table
       ○ new updates keep arriving in parallel
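A minimal sketch of the commit rule, assuming the progress table maps each ingest node to the sequence number of its last reported transaction. The `GraphNode` class and the `ingest-0` style identifiers are hypothetical; the real system's data structures are not described on the slides.

```python
class GraphNode:
    def __init__(self):
        self.pending = []       # queued (ingest_id, seq, edge) updates, not yet visible
        self.committed = set()  # edges visible in the latest snapshot

    def enqueue(self, ingest_id, seq, edge):
        self.pending.append((ingest_id, seq, edge))

    def commit(self, snapshot_clock):
        """Commit every queued update covered by the snapshot's vector clock."""
        still_pending = []
        for ingest_id, seq, edge in self.pending:
            if seq <= snapshot_clock.get(ingest_id, 0):
                self.committed.add(edge)
            else:
                still_pending.append((ingest_id, seq, edge))
        self.pending = still_pending

progress_table = {"ingest-0": 2, "ingest-1": 1}

node = GraphNode()
node.enqueue("ingest-0", 1, ("@Alice", "@Bob"))
node.enqueue("ingest-0", 3, ("@Alice", "#kittens"))  # not yet reported
node.enqueue("ingest-1", 1, ("@Bob", "@Alice"))

# The snapshooter works on a copy, so later reports do not affect this snapshot.
snapshot_clock = dict(progress_table)
node.commit(snapshot_clock)
```

The update with sequence number 3 stays in the queue and becomes part of a later snapshot, which is how new updates can keep arriving while a snapshot is being produced.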

  9. Computation overview
     ● Runs on snapshots
     ● Algorithm-specific data is stored in vertices
     ● Alternating phases of computation and propagation

  10. Example: TunkRank
     ● similar to PageRank
     ● vertex value - single real number
     ● add ranks received from neighbours
     ● when rank increases by ε, push update to neighbours
     ● repeat until stable
     Bonus: it's incremental between snapshots!
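The push-until-stable loop can be illustrated with a small single-machine sketch. It assumes the commonly cited TunkRank definition, in which a user's influence is the sum over followers Y of (1 + p * influence(Y)) / |following(Y)|; the slide does not spell the formula out, so treat this definition, the function name `tunkrank`, and the parameters `p`, `eps` and `dirty` as assumptions. Passing the previous snapshot's ranks together with the set of vertices whose inputs changed gives the incremental variant.

```python
from collections import defaultdict, deque

def tunkrank(follows, ranks=None, dirty=None, p=0.5, eps=1e-9):
    """follows: dict user -> set of users they follow.
    ranks: values from the previous snapshot (optional).
    dirty: vertices to recompute first (default: all of them)."""
    followers = defaultdict(set)
    for user, followed in follows.items():
        for target in followed:
            followers[target].add(user)
    vertices = set(follows) | set(followers)

    current = {v: 0.0 for v in vertices}
    current.update(ranks or {})

    queue = deque(vertices if dirty is None else dirty)
    while queue:
        x = queue.popleft()
        value = sum((1 + p * current[y]) / len(follows[y]) for y in followers[x])
        if abs(value - current[x]) > eps:
            current[x] = value
            # Push the change to everyone x follows; they depend on x's rank.
            queue.extend(follows.get(x, ()))
    return current

follows = {"bob": {"alice"}, "carol": {"alice", "bob"}}
first = tunkrank(follows)

# Next snapshot: dave starts following carol, so only carol is dirty.
follows2 = {"bob": {"alice"}, "carol": {"alice", "bob"}, "dave": {"carol"}}
incremental = tunkrank(follows2, ranks=first, dirty={"carol"})
full = tunkrank(follows2)
```

In general the dirty set should contain every user followed by someone whose follow list changed, because a changed out-degree alters what that user contributes to each of them.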

  11. Example: Shortest Paths
     ● Bellman-Ford with landmarks
       ○ landmarks - top vertices from TunkRank
       ○ calculate only paths passing through landmarks
     ● vertex data - distances to landmarks
     ● shorten distances by relaxing edges
     ● push new distances to neighbours
     ● repeat until stable
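A rough sketch of the landmark idea, assuming an undirected graph with unit edge lengths (the slides do not say how edges are weighted). Each vertex keeps its distances to the landmarks, improves them by relaxing edges, and notifies neighbours when something got shorter. The function names `landmark_distances` and `estimate` are made up for this example, and every landmark is assumed to appear in the graph.

```python
from collections import deque

def landmark_distances(edges, landmarks):
    """edges: iterable of undirected (u, v) pairs, each assumed to have length 1."""
    neighbours = {}
    for u, v in edges:
        neighbours.setdefault(u, set()).add(v)
        neighbours.setdefault(v, set()).add(u)

    dist = {v: {} for v in neighbours}  # vertex data: landmark -> distance
    queue = deque()
    for lm in landmarks:
        dist[lm][lm] = 0
        queue.append(lm)

    while queue:
        u = queue.popleft()
        for v in neighbours[u]:
            improved = False
            for lm, d in dist[u].items():
                if d + 1 < dist[v].get(lm, float("inf")):
                    dist[v][lm] = d + 1  # relax edge (u, v)
                    improved = True
            if improved:
                queue.append(v)  # v must push its new distances onward
    return dist

def estimate(dist, u, v):
    """Length of the best u-v path that passes through some landmark."""
    common = dist[u].keys() & dist[v].keys()
    return min((dist[u][lm] + dist[v][lm] for lm in common), default=float("inf"))

edges = [("a", "b"), ("b", "c"), ("c", "d"), ("b", "e")]
dist = landmark_distances(edges, landmarks=["b"])
```

Because only paths through landmarks are considered, the result is an upper bound: with landmark `b`, the estimate for `c` to `d` is 3 even though the two vertices are adjacent. Picking well-connected landmarks, such as the top TunkRank vertices, is what keeps such overestimates rare.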

  12. Evaluation
     ● 17,000 lines of C# code
     ● 50 Windows servers
       ○ Intel Xeon (quad-core, 2.8 GHz) with 8 GB RAM
     ● 100k tweets per second (10 times peak Twitter rate)

  13. Degree distribution

  14. Graph growth
     ● Decaying can help

  15. Throughput & timeliness

  16. Throughput

  17. Timeliness

  18. Incrementality helps! (TunkRank)

  19. Incrementality helps!

  20. Scalability (TunkRank)

  21. Fault tolerance
     ● Centralized services (progress table & snapshooter):
       ○ simple replication
       ○ Paxos-based consensus
     ● Ingest nodes:
       ○ input data is cached until it is committed to a snapshot
       ○ if an ingest node fails, all its transactions are discarded
       ○ another machine processes the data from the cache

  22. Replication of graph nodes
     ● Quorum-based: 3 replicas of each node
     ● An update must be acknowledged by 2 replicas
     ● If a replica misses an update, it retrieves it from the other replicas
     ● If a replica fails and is replaced, it waits for the next snapshot and starts working normally from there
     ● For computation failures: rollback and redo
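The quorum rule can be pictured with a toy model. The `Replica` class, the `replicate` and `catch_up` helpers and the in-memory log are illustrative assumptions; the slides give only the numbers (3 replicas, 2 acknowledgements), not the protocol details.

```python
class Replica:
    def __init__(self):
        self.alive = True
        self.log = {}  # seq -> update

    def apply(self, seq, update):
        """Store the update and acknowledge it, unless this replica is down."""
        if not self.alive:
            return False
        self.log[seq] = update
        return True

def replicate(replicas, seq, update, quorum=2):
    """An update is accepted once at least `quorum` replicas acknowledge it."""
    acks = sum(r.apply(seq, update) for r in replicas)
    return acks >= quorum

def catch_up(replica, peers):
    """Fetch updates this replica missed from the other replicas."""
    for peer in peers:
        for seq, update in peer.log.items():
            replica.log.setdefault(seq, update)

replicas = [Replica(), Replica(), Replica()]
replicas[2].alive = False
accepted = replicate(replicas, 1, ("@Alice", "@Bob"))  # 2 of 3 acknowledge

replicas[2].alive = True
catch_up(replicas[2], replicas[:2])
```

With one replica down the update is still accepted, and the lagging replica later copies it from its peers. With two replicas down, `replicate` would return `False`.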

  23. Incremental expansion
     ● Ingest nodes - trivial, just add a node
     ● Storage nodes:
       ○ maintain more logical partitions than nodes
       ○ to add a node, migrate some logical partitions to it
       ○ splitting logical partitions is possible too
       ○ the new node starts working from the next snapshot - just as in failure recovery
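One way to picture the logical-partition scheme: vertices hash to a fixed number of logical partitions, and only the partition-to-node assignment changes when a machine is added. The partition count, the CRC32 hash and the rebalancing policy below are assumptions chosen for the sketch, not details taken from the talk.

```python
import zlib
from collections import Counter

NUM_LOGICAL = 8  # hypothetical; chosen to exceed the number of storage nodes

def logical_partition(vertex, num_logical=NUM_LOGICAL):
    """Stable vertex -> logical partition mapping; never changes on expansion."""
    return zlib.crc32(vertex.encode("utf-8")) % num_logical

def add_node(assignment, new_node):
    """Return a new partition -> node map with some partitions migrated."""
    assignment = dict(assignment)
    nodes = set(assignment.values()) | {new_node}
    target = len(assignment) // len(nodes)  # fair share for the new node
    load = Counter(assignment.values())
    moved = []
    for part in sorted(assignment):
        if len(moved) >= target:
            break
        owner = assignment[part]
        if load[owner] > target:
            assignment[part] = new_node
            load[owner] -= 1
            moved.append(part)
    return assignment, moved

before = {p: ("A" if p % 2 == 0 else "B") for p in range(NUM_LOGICAL)}
after, moved = add_node(before, "C")
```

Only the migrated partitions (here two of eight) change owner, so most vertices stay where they are and the vertex-to-partition hash is untouched.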

  24. Failure recovery

  25. Thank you!
