
Apache Giraph: Large-scale Graph Processing on Hadoop


  1. Apache Giraph: Large-scale Graph Processing on Hadoop. Claudio Martella <claudio@apache.org>, @claudiomartella

  3. Graphs are simple

  4. A computer network

  5. A social network

  6. A semantic network

  7. A map

  8. Predicting break-ups
      • Graph approach
      • Aggregation approach

  9. Graphs are nasty.

  10. Each vertex depends on its neighbours, recursively.

  11. Recursive problems are nicely solved iteratively.

  13. PageRank in MapReduce
      • Record: < v_i, pr, [ v_j, ..., v_k ] >
      • Mapper: emits < v_j, pr / #neighbours >
      • Reducer: sums the partial values
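
The record, mapper and reducer on this slide can be sketched as a single-process simulation. This is only an illustration under stated assumptions: the function names (`mapper`, `reducer`, `run_iteration`), the tagged `("structure", ...)` / `("rank", ...)` values and the three-vertex graph are hypothetical, and no damping factor is applied because the slide does not mention one.

```python
from collections import defaultdict

def mapper(record):
    """Emit one partial rank per out-neighbour, plus the adjacency list."""
    vertex, pr, neighbours = record
    # The graph structure must travel through the shuffle as well,
    # otherwise the next job would have nothing to iterate over.
    yield vertex, ("structure", neighbours)
    for target in neighbours:
        yield target, ("rank", pr / len(neighbours))

def reducer(vertex, values):
    """Sum the partial ranks and rebuild the record for the next job."""
    neighbours, total = [], 0.0
    for kind, payload in values:
        if kind == "structure":
            neighbours = payload
        else:
            total += payload
    return (vertex, total, neighbours)

def run_iteration(records):
    """Simulate one MapReduce job: map, shuffle by key, reduce."""
    shuffled = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            shuffled[key].append(value)
    return [reducer(v, vals) for v, vals in sorted(shuffled.items())]

# Hypothetical toy graph with a uniform starting rank.
records = [("a", 1 / 3, ["b", "c"]), ("b", 1 / 3, ["c"]), ("c", 1 / 3, ["a"])]
for _ in range(10):  # one full job per PageRank iteration
    records = run_iteration(records)
```

Every pass through the loop stands for a complete job, and the adjacency lists are re-emitted each time alongside the rank values.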

  14. MapReduce dataflow

  15. Drawbacks
      • Each job is executed N times
      • Job bootstrap
      • Mappers send PR values and structure
      • Extensive IO at input, shuffle & sort, output

  17. Timeline
      • Inspired by Google Pregel (2010)
      • Donated to ASF by Yahoo! in 2011
      • Top-level project in 2012
      • 1.0 release in January 2013
      • 1.1 release in November 2014

  18. Plays well with Hadoop

  19. Vertex-centric API

  20. Shortest Paths

  21. Shortest Paths

  22. Shortest Paths

  23. Shortest Paths

  24. Shortest Paths

  25. Code

      def compute(vertex, messages):
          # Smallest distance offered by any incoming message.
          minValue = float('inf')
          for m in messages:
              minValue = min(minValue, m)
          # Only propagate when the vertex's distance improves.
          if minValue < vertex.getValue():
              vertex.setValue(minValue)
              for edge in vertex.getEdges():
                  message = minValue + edge.getValue()
                  sendMessage(edge.getTargetId(), message)
          # Sleep until a new message arrives.
          vertex.voteToHalt()
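
To see the vertex program run end to end, here is a minimal single-process sketch of the superstep loop around it. It is not Giraph's runtime: the `run` driver, the dictionary-based vertices, the explicit `send` callback and the four-vertex graph are all assumptions made for illustration. It also assumes that the source is seeded with a message carrying distance 0 and that every other vertex starts at infinity, which the slide does not spell out.

```python
import math

def compute(vertex, messages, send):
    """Vertex program from the slide, with `send` passed in explicitly."""
    min_value = min(messages, default=math.inf)
    if min_value < vertex["value"]:
        vertex["value"] = min_value
        for target, weight in vertex["edges"].items():
            send(target, min_value + weight)
    # Voting to halt is implicit: a vertex only runs when it has messages.

def run(graph, source):
    """Toy BSP driver (hypothetical): loops until no messages are in flight."""
    vertices = {v: {"value": math.inf, "edges": edges}
                for v, edges in graph.items()}
    inbox = {source: [0.0]}  # assumption: seed the source with distance 0
    superstep = 0
    while inbox:
        outbox = {}

        def send(target, message):
            outbox.setdefault(target, []).append(message)

        for vertex_id, messages in inbox.items():
            compute(vertices[vertex_id], messages, send)
        inbox = outbox  # barrier: messages are delivered in the next superstep
        superstep += 1
    return {v: data["value"] for v, data in vertices.items()}, superstep

# Hypothetical weighted graph: vertex -> {neighbour: edge weight}.
graph = {
    "a": {"b": 1.0, "c": 4.0},
    "b": {"c": 2.0, "d": 6.0},
    "c": {"d": 3.0},
    "d": {},
}
distances, supersteps = run(graph, "a")
```

Messages written during one superstep are only read in the next, so vertices within a superstep never observe each other's partial updates.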

  30. BSP & Giraph

  31. Advantages
      • No locks: message-based communication
      • No semaphores: global synchronization
      • Iteration isolation: massively parallelizable

  32. Designed for iterations
      • Stateful (in-memory)
      • Only intermediate values (messages) sent
      • Hits the disk at input, output, checkpoint
      • Can go out-of-core

  33. Giraph job lifetime

  34. Architecture

  35. Composable API

  36. Checkpointing

  37. No SPoFs

  38. Giraph scales
      ref: https://www.facebook.com/notes/facebook-engineering/scaling-apache-giraph-to-a-trillion-edges/10151617006153920

  39. Giraph is fast
      • 100x over MR (PageRank)
      • Jobs run within minutes
      • Given you have resources ;-)

  40. Serialised objects

  41. Primitive types
      • Autoboxing is expensive
      • Object overhead (JVM)
      • Use primitive types on your own
      • Use primitive-type-based libraries (e.g. fastutil)

  42. Sharded aggregators

  43. Okapi
      • Apache Mahout for graphs
      • Graph-based recommenders: ALS, SGD, SVD++, etc.
      • Graph analytics: graph partitioning, community detection, k-core, etc.

  44. Thank you http://giraph.apache.org <claudio@apache.org> @claudiomartella
