final project presentation
play

Final Project Presentation Dec. 2 nd , 2013 Qing Zheng & Atreyee - PowerPoint PPT Presentation

15799 Final Project Presentation Dec. 2 nd , 2013 Qing Zheng & Atreyee Maiti Goals Graph Queries How different DBs handle large graph? Whats the differences in performance? What DB to which for a specific use-case? 2


  1. 15799 Final Project Presentation Dec. 2 nd , 2013 Qing Zheng & Atreyee Maiti

  2. Goals Graph Queries • How different DBs handle large graph? • What’s the differences in performance? • What DB to which for a specific use-case? 2

  3. Datasets/ System Neo4j v.s. MyS QL • the most popular open-source DB for each community 3

  4. Datasets/ System Neo4j v.s. MyS QL • the most popular open-source DB for each community Wikipedia Datasets • Reasonably big, easily accessible, and people are familiar with it 4

  5. Experimental Settings Amazon EC2 • Neo4j 1.9.5 • MyS QL 5.5.34 • Ubuntu 12.04.3 • 1 CPU, 4GB RAM, 410GB Disk (m1.medium) 5

  6. Benchmarks Queries • S ix-Degree <6 <6

  7. Benchmarks Queries • S ix-Degree • S hortest Path min <6 <6 min 7

  8. Benchmarks Queries • S ix-Degree • S hortest Path • Most Cited Page min <6 <6 min 8

  9. API Client Interface • S QL for MyS QL • Java API for Neo4j 9

  10. Results S ix Degree 10

  11. Results S ix Degree 2 2 3 1 2 2 2 3 11

  12. Results Most Cited Page 12

  13. Results Most Cited Page 4 ,673,396 13

  14. Performance MyS QL Neo4j 18 3000 10800 15 2500 9000 12 2000 7200 seconds 9 1500 5400 6 1000 3600 3 500 1800 0 0 0 cold warm cold warm cold warm S ix-Degree Most-Cited-Page S hortest-Path 14

  15. MYS QL ANAL YSIS SYST EM WISE 15

  16. Storage Engine INNODB • Reliable, high-performance transactional engine MYIS AM • Read-optimized, data-warehouse class engine • Dedicated in-memory buffer for index blocks • Uses OS page cache for buffering data blocks 16

  17. Bulk L oading Best Practices • Convert S QL inserts into raw CVS Files • Build indices after data is fully loaded • Increase “ MyS IAM_S ort_Buffer_S ize” • Add more memory 17

  18. T uning Optimizing for workloads • Compression (total table size after compression: 26G) • Resign table schemas • Add/ remove Indices • S et index cache to 25% of the RAM • Disable query cache (not for optimization) 18

  19. Schema Profile Wiki Datasets • 31 ,293,738 pages • 709 ,804,739 links 19

  20. Six Degree Query Breath-First S earch s 1 2 1 1 2 2 2 d 20

  21. Six Degree Query Breath-First S earch s 1 2 1 1 2 2 2 d 21

  22. Six Degree Query Breath-First S earch s 1 2 1 1 2 2 2 d 22

  23. Six Degree Query Breath-First S earch s 1 2 1 1 2 2 Group By / S ubquery 2 d 23

  24. Six Degree Query Breath-First S earch s 1 2 1 1 2 Insert Ignore Into … 2 Group By / S ubquery 2 d 24

  25. Six Degree Query Ignoring Breath-First S earch • 1/ 44 th index block read requests • No additional sorting • 5x more rows in temporary tables >>> 20x performance boost 25

  26. Six Degree Query Ignoring Breath-First S earch • 1/ 44 th index block read requests • No additional sorting • 5x more rows in temporary tables >>> 20x performance boost Need to keep temp table short! 26

  27. Six Degree Query Ignoring Breath-First S earch 0 0 1 19/ 19 1,210/ 1,211 1 9,270/ 11,743 2 85,829/ 340,632 2 619,132/ 2,594,398 3 Adolf-Hitler Walk-to-the-S ky 27

  28. Six Degree Query Bidirectional Breath-First S earch • 1/ 34 th rows in temporary tables • 1/ 386 th index block read requests • 1/ 5 th index block write requests >>> 720x additional performance boost 28

  29. Shortest Path Query Bidirectional Batched S hortest Path >>> 318x performance boost 29

  30. 2,786 secs Shortest Path Query 1.6 1.4 1.2 seconds 1 0.8 0.6 0.4 0.2 0 A 42 G Pt T M N AH WttS S JG Ah R W 30

  31. Most Cited Page Count(*) & Group-BY & Order-By & Limit 31

  32. Most Cited Path Count(*) & Group-BY & Order-By & Limit +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ | 1 | SIMPLE | links | index | NULL | REVERSE | 8 | NULL | 709804739 | Using index; Using temporary; Using filesort | +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ 32

  33. Most Cited Path Count(*) & Group-BY & Order-By & Limit +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ | 1 | SIMPLE | links | index | NULL | REVERSE | 8 | NULL | 709804739 | Using index; Using temporary; Using filesort | +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ 33

  34. Most Cited Path Count(*) & Group-BY & Order-By & Limit +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ | 1 | SIMPLE | links | index | NULL | REVERSE | 8 | NULL | 709804739 | Using index; Using temporary; Using filesort | +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ 34

  35. Most Cited Path Count(*) & Group-BY & Order-By & Limit +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ | 1 | SIMPLE | links | index | NULL | REVERSE | 8 | NULL | 709804739 | Using index; Using temporary; Using filesort | +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ 35

  36. Most Cited Path S ort Buffer • 2MB: 33 merge passes • 8MB: 8 merge passes • 64MB: 1 merge pass 36

  37. Most Cited Path S ort Buffer • 2MB: 33 merge passes • 8MB: 8 merge passes • 64MB: 1 merge pass >>> 0x performance improvements 37

  38. Most Cited Path S ort Buffer • 2MB: 33 merge passes • 8MB: 8 merge passes • 64MB: 1 merge pass >>> 0x performance improvements 45x more rows scanned than sorted 38

  39. Quick Summary MyS QL-MyS IAM • Loading takes time • Pay attention to query algorithms • Limited performance for large joins • Nice documentation with good out-of-box performance for analysis 39

  40. NEO4J ANAL YSIS SYST EM WISE 40

  41. Data cleaning/ importing Importing tool • use of graphipedia to import compressed dataset � LinkExtractor to transform xml to a links xml � Import graph which uses the links to create nodes and then relationships. Also indexes the data Graph Structure Node: pages with property "title" Relationship: "Link" Lucene index

  42. Algorithm implementation Neo4j GraphAlgoFactory Benchmark Implementation six degree findS inglePath with max depth shortest path shortestPath most cited node get all relationships, maintain count

  43. Internals Name:Qing A KNOWS B Age:24 KNOWS � records basically linked list of nodes, relations - suffers when need to traverse a lot of linked lists - most cited page � major win is joins - and then it becomes dependent on configuration and resource availability

  44. Caching T wo types • file buffer caching - use of native i/ o to cache data in memory - storage file data similar representation as disk for fast traversal • object caching - using the allocated area for the heap - caches individual nodes and relationships and their properties in a form that is optimized for fast traversal of the graph - relies on garbage collection for eviction from the cache in an LRU manner. cache levels • in heap • in file buffer cache • disk

Recommend


More recommend