15799 Final Project Presentation Dec. 2 nd , 2013 Qing Zheng & Atreyee Maiti
Goals Graph Queries • How different DBs handle large graph? • What’s the differences in performance? • What DB to which for a specific use-case? 2
Datasets/ System Neo4j v.s. MyS QL • the most popular open-source DB for each community 3
Datasets/ System Neo4j v.s. MyS QL • the most popular open-source DB for each community Wikipedia Datasets • Reasonably big, easily accessible, and people are familiar with it 4
Experimental Settings Amazon EC2 • Neo4j 1.9.5 • MyS QL 5.5.34 • Ubuntu 12.04.3 • 1 CPU, 4GB RAM, 410GB Disk (m1.medium) 5
Benchmarks Queries • S ix-Degree <6 <6
Benchmarks Queries • S ix-Degree • S hortest Path min <6 <6 min 7
Benchmarks Queries • S ix-Degree • S hortest Path • Most Cited Page min <6 <6 min 8
API Client Interface • S QL for MyS QL • Java API for Neo4j 9
Results S ix Degree 10
Results S ix Degree 2 2 3 1 2 2 2 3 11
Results Most Cited Page 12
Results Most Cited Page 4 ,673,396 13
Performance MyS QL Neo4j 18 3000 10800 15 2500 9000 12 2000 7200 seconds 9 1500 5400 6 1000 3600 3 500 1800 0 0 0 cold warm cold warm cold warm S ix-Degree Most-Cited-Page S hortest-Path 14
MYS QL ANAL YSIS SYST EM WISE 15
Storage Engine INNODB • Reliable, high-performance transactional engine MYIS AM • Read-optimized, data-warehouse class engine • Dedicated in-memory buffer for index blocks • Uses OS page cache for buffering data blocks 16
Bulk L oading Best Practices • Convert S QL inserts into raw CVS Files • Build indices after data is fully loaded • Increase “ MyS IAM_S ort_Buffer_S ize” • Add more memory 17
T uning Optimizing for workloads • Compression (total table size after compression: 26G) • Resign table schemas • Add/ remove Indices • S et index cache to 25% of the RAM • Disable query cache (not for optimization) 18
Schema Profile Wiki Datasets • 31 ,293,738 pages • 709 ,804,739 links 19
Six Degree Query Breath-First S earch s 1 2 1 1 2 2 2 d 20
Six Degree Query Breath-First S earch s 1 2 1 1 2 2 2 d 21
Six Degree Query Breath-First S earch s 1 2 1 1 2 2 2 d 22
Six Degree Query Breath-First S earch s 1 2 1 1 2 2 Group By / S ubquery 2 d 23
Six Degree Query Breath-First S earch s 1 2 1 1 2 Insert Ignore Into … 2 Group By / S ubquery 2 d 24
Six Degree Query Ignoring Breath-First S earch • 1/ 44 th index block read requests • No additional sorting • 5x more rows in temporary tables >>> 20x performance boost 25
Six Degree Query Ignoring Breath-First S earch • 1/ 44 th index block read requests • No additional sorting • 5x more rows in temporary tables >>> 20x performance boost Need to keep temp table short! 26
Six Degree Query Ignoring Breath-First S earch 0 0 1 19/ 19 1,210/ 1,211 1 9,270/ 11,743 2 85,829/ 340,632 2 619,132/ 2,594,398 3 Adolf-Hitler Walk-to-the-S ky 27
Six Degree Query Bidirectional Breath-First S earch • 1/ 34 th rows in temporary tables • 1/ 386 th index block read requests • 1/ 5 th index block write requests >>> 720x additional performance boost 28
Shortest Path Query Bidirectional Batched S hortest Path >>> 318x performance boost 29
2,786 secs Shortest Path Query 1.6 1.4 1.2 seconds 1 0.8 0.6 0.4 0.2 0 A 42 G Pt T M N AH WttS S JG Ah R W 30
Most Cited Page Count(*) & Group-BY & Order-By & Limit 31
Most Cited Path Count(*) & Group-BY & Order-By & Limit +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ | 1 | SIMPLE | links | index | NULL | REVERSE | 8 | NULL | 709804739 | Using index; Using temporary; Using filesort | +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ 32
Most Cited Path Count(*) & Group-BY & Order-By & Limit +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ | 1 | SIMPLE | links | index | NULL | REVERSE | 8 | NULL | 709804739 | Using index; Using temporary; Using filesort | +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ 33
Most Cited Path Count(*) & Group-BY & Order-By & Limit +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ | 1 | SIMPLE | links | index | NULL | REVERSE | 8 | NULL | 709804739 | Using index; Using temporary; Using filesort | +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ 34
Most Cited Path Count(*) & Group-BY & Order-By & Limit +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ | id | select_type | table | type | possible_keys | key | key_len | ref | rows | Extra | +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ | 1 | SIMPLE | links | index | NULL | REVERSE | 8 | NULL | 709804739 | Using index; Using temporary; Using filesort | +----+-------------+-------+-------+---------------+---------+---------+------+-----------+----------------------------------------------+ 35
Most Cited Path S ort Buffer • 2MB: 33 merge passes • 8MB: 8 merge passes • 64MB: 1 merge pass 36
Most Cited Path S ort Buffer • 2MB: 33 merge passes • 8MB: 8 merge passes • 64MB: 1 merge pass >>> 0x performance improvements 37
Most Cited Path S ort Buffer • 2MB: 33 merge passes • 8MB: 8 merge passes • 64MB: 1 merge pass >>> 0x performance improvements 45x more rows scanned than sorted 38
Quick Summary MyS QL-MyS IAM • Loading takes time • Pay attention to query algorithms • Limited performance for large joins • Nice documentation with good out-of-box performance for analysis 39
NEO4J ANAL YSIS SYST EM WISE 40
Data cleaning/ importing Importing tool • use of graphipedia to import compressed dataset � LinkExtractor to transform xml to a links xml � Import graph which uses the links to create nodes and then relationships. Also indexes the data Graph Structure Node: pages with property "title" Relationship: "Link" Lucene index
Algorithm implementation Neo4j GraphAlgoFactory Benchmark Implementation six degree findS inglePath with max depth shortest path shortestPath most cited node get all relationships, maintain count
Internals Name:Qing A KNOWS B Age:24 KNOWS � records basically linked list of nodes, relations - suffers when need to traverse a lot of linked lists - most cited page � major win is joins - and then it becomes dependent on configuration and resource availability
Caching T wo types • file buffer caching - use of native i/ o to cache data in memory - storage file data similar representation as disk for fast traversal • object caching - using the allocated area for the heap - caches individual nodes and relationships and their properties in a form that is optimized for fast traversal of the graph - relies on garbage collection for eviction from the cache in an LRU manner. cache levels • in heap • in file buffer cache • disk
Recommend
More recommend