Clause-Iteration with Map-Reduce to Scalably Query Data Graphs: The SHARD Triple-Store
Rick Schantz (schantz@bbn.com), Kurt Rohloff (krohloff@bbn.com, @avometric)
Many thanks to: Prakash Manghwani, Mike Dean, Ian Emmons, Gail Mitchell, Doug Reid, Chris Kappler from BBN; Hanspeter Pfister from Harvard SEAS; Phil Zeyliger from Cloudera
Outline
• Challenge Problem: Scalably Query Graph Data
• Large-Scale Computing and MapReduce
• SHARD
• Design Insights
A Preface
SHARD is a cloud-based graph store.
• High-performance, scalable query processing.
SHARD is released open-source.
• BSD license.
More information and code at:
– My webpage
– SourceForge (SHARD-3store)
• Use svn to get the code:
  svn co https://shard-3store.svn.sourceforge.net/svnroot/shard-3store shard-3store
– Don't worry - this command is on SourceForge!
Scalable Graph Data Querying
• Emerging commercially
– Used by the NYTimes, the BBC, Pharma, …
– Numerous startups.
– Oracle and MySQL have SemWeb support.
• Government use …
• See the SemWeb.
SPARQL-like Queries
SPARQL query to find all people who own a car made in Detroit:

  SELECT ?person WHERE {
    ?person :owns ?car .
    ?car a :Car .
    ?car :madeIn :Detroit .
  }

[Figure: the query pattern drawn as a graph - ?person owns ?car; ?car a Car; ?car madeIn Detroit]
Answering Queries
[Figure: source data graph - Kurt owns car0; Kurt livesIn Cambridge; car0 a Car; car0 madeBy Ford; car0 madeIn Detroit; Cambridge a City; Detroit a City - matched against the query pattern ?person owns ?car; ?car a Car; ?car madeIn Detroit]
Variable bindings:
• ?person to Kurt
• ?car to car0
Design Considerations
• Scalable – web-scale?
• High assurance.
• Cost effective – commodity hardware?
• Modular inferred-data separation.
• Robustness.
• Considerations as endless as applications.
Scale Limitations!
• Triple-store study:
– "An Evaluation of Triple-Store Technologies for Large Data Stores", SSWS '07 (part of OTM).
• What about cloud computing?
– Economic scalability…
General Programming for Scalable Cloud Computing
From experience:
• Inherently multi-threaded.
• Toolsets still young.
– Not many debugging tools.
• Mental models are different...
– Learn an algorithm, adapt it to the chosen framework.
– Ex: try to fit the problem into the PageRank design pattern.
• (This isn't what we do, but this approach seems common.)
Scalable Distributed System (Cloud) Design Concept
Abstracting away parallelization enables much easier scaling.
• We use the maturing MapReduce framework in Hadoop to bulk-process graph edges.
• This provides a services layer on which to scale our graph query processing techniques.
• Innovation:
– Iterative clause-based construction of query responses.
– Join partial query responses over multiple MapReduce jobs using flagged keys.
SHARD Triple-Store Built on Hadoop
Prioritized design considerations:
• Commodity hardware, ONLY
• Web scalable
• Robust
What it is good at:
• Large query responses
• Complex queries
Clause Iteration Query Response Construction
[Figure: source data (subject, predicate, object) triples are matched against the 1st clause (?person owns ?car); those partial results are joined with matches of the next clause (?car a Car), and those in turn with matches of the final clause (?car madeIn Detroit).]
1st Partial Query Match by Clause
In the first map step, the first query clause (?person :owns ?car .) is used to find partial query matches that satisfy it.
• Keys are variable bindings.
• Values are set to null.

Source data: John owns dog0; Kurt livesIn Cambridge; Kurt owns car0; dog0 a Dog; car0 a Car; …
1st map key-value output: {John dog0} - null; {Kurt car0} - null; …

In the first reduce step, repeated partial matches are removed.
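Below is a minimal, self-contained Python sketch of this first pass, run locally rather than on Hadoop (SHARD itself is a Hadoop application, so its real mappers and reducers look different); the triple tuples and function names are illustrative assumptions, not SHARD's API.

```python
# Illustrative local stand-in for the first map/reduce pass (not SHARD's Hadoop code).
# The mapper emits one key-value pair per triple matching "?person :owns ?car":
# key = tuple of variable bindings, value = None (null).
from collections import defaultdict

def first_clause_map(triple):
    subject, predicate, obj = triple
    if predicate == "owns":                  # clause: ?person :owns ?car
        yield ((subject, obj), None)         # key = {?person, ?car}, value = null
    # non-matching triples emit nothing

def dedup_reduce(key, values):
    # The first reduce step only removes repeated partial matches:
    # each distinct binding (key) is emitted once.
    yield (key, None)

source_data = [
    ("John", "owns", "dog0"),
    ("Kurt", "livesIn", "Cambridge"),
    ("Kurt", "owns", "car0"),
    ("dog0", "a", "Dog"),
    ("car0", "a", "Car"),
]

if __name__ == "__main__":
    mapped = [kv for triple in source_data for kv in first_clause_map(triple)]
    print(mapped)    # [(('John', 'dog0'), None), (('Kurt', 'car0'), None)]

    groups = defaultdict(list)               # simulate the shuffle phase
    for key, val in mapped:
        groups[key].append(val)
    deduped = [kv for key, vals in groups.items() for kv in dedup_reduce(key, vals)]
    print(deduped)   # same bindings, duplicates removed
```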
2nd Clause Map – New Bindings
Map partial query matches from the 2nd query clause (?car a Car .):
• Keys are variable bindings previously observed.
• Values are set to new variable bindings.
Map matches from the previous clause for reordering:
• Keys are variable bindings common with the current clause.
• Values are the previous non-common bindings.

Source data: John owns dog0; Kurt livesIn Cambridge; Kurt owns car0; dog0 a Dog; car0 a Car; …
1st map key-value output: {John dog0} - null; {Kurt car0} - null; …
2nd map key-value output: {car0} - null; {dog0} - {John}; {car0} - {Kurt}; …
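Continuing the same local sketch, the second map pass below handles both kinds of input in one function; tagging records as "triple" or "partial" is an assumption made for illustration - a real Hadoop job would typically distinguish its inputs differently.

```python
# Second map step, sketched locally in Python (illustrative, not SHARD's actual code).
# Two kinds of records are mapped in the same pass:
#  1. source triples, matched against the new clause "?car a :Car"
#  2. partial results from the previous pass, re-keyed on the variable shared with
#     the current clause (?car), with the remaining bindings carried as the value.

def second_clause_map(record):
    kind, payload = record
    if kind == "triple":
        subject, predicate, obj = payload
        if predicate == "a" and obj == "Car":   # clause: ?car a :Car
            yield ((subject,), None)            # key = {?car}, value = null
    elif kind == "partial":
        person, car = payload                   # previous binding {?person, ?car}
        yield ((car,), (person,))               # key = shared var {?car},
                                                # value = non-common binding {?person}

source_data = [
    ("John", "owns", "dog0"), ("Kurt", "livesIn", "Cambridge"),
    ("Kurt", "owns", "car0"), ("dog0", "a", "Dog"), ("car0", "a", "Car"),
]
previous_results = [("John", "dog0"), ("Kurt", "car0")]   # from the first pass

if __name__ == "__main__":
    records = [("triple", t) for t in source_data] + \
              [("partial", p) for p in previous_results]
    print([kv for r in records for kv in second_clause_map(r)])
    # [(('car0',), None), (('dog0',), ('John',)), (('car0',), ('Kurt',))]
```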
2nd Clause Reduce – Join
The reduce step joins partial mappings on common variable bindings using the flagged keys.

2nd map key-value output: {car0} - null; {dog0} - {John}; {car0} - {Kurt}; …
2nd reduce key-value output: {car0} - {Kurt}; …

The process continues over all query clauses.
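A matching sketch of the join reduce step follows. Here the null value serves as the flag marking matches of the current clause; that is one plausible reading of the flagged-key scheme, not a statement of SHARD's exact encoding.

```python
# Join reduce step, sketched locally in Python (illustrative stand-in for the Hadoop reducer).
# For each key (shared variable binding), a joined result is emitted only when the key
# was seen both from the current clause (flagged with a null value) and from earlier
# partial results (values carrying the remaining bindings).
from collections import defaultdict

def join_reduce(key, values):
    matched_current_clause = any(v is None for v in values)   # flag from the new clause
    carried_bindings = [v for v in values if v is not None]   # from earlier passes
    if matched_current_clause:
        for binding in carried_bindings:
            yield (key, binding)                               # e.g. (('car0',), ('Kurt',))

second_map_output = [
    (("car0",), None),          # ?car a :Car matched car0
    (("dog0",), ("John",)),     # earlier partial: John owns dog0
    (("car0",), ("Kurt",)),     # earlier partial: Kurt owns car0
]

if __name__ == "__main__":
    groups = defaultdict(list)                 # simulate the shuffle phase
    for key, val in second_map_output:
        groups[key].append(val)
    joined = [kv for key, vals in groups.items() for kv in join_reduce(key, vals)]
    print(joined)                              # [(('car0',), ('Kurt',))]
    # Repeating map + reduce for the remaining clause (?car :madeIn :Detroit) and
    # projecting ?person at the end yields the query answer: Kurt.
```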
HDFS Graph Storage
[Figure: example graph - Kurt owns car0; Kurt livesIn Cambridge; car0 a Car; car0 madeBy Ford; car0 madeIn Detroit; Cambridge a City; Detroit a City]
Graphs are saved as flat files in HDFS (portions of the file are stored on each data node):

  Kurt owns car0 livesIn Cambridge
  car0 a Car madeBy Ford madeIn Detroit
  Cambridge a City
  Detroit a City
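A small sketch of how a triple set could be flattened into this one-line-per-subject layout; the whitespace tokenization is an assumption for illustration, and SHARD's actual file encoding may differ.

```python
# Illustrative serializer for the flat-file layout shown above:
# one line per subject, followed by its (predicate, object) pairs.
from collections import defaultdict

def triples_to_lines(triples):
    by_subject = defaultdict(list)
    for subject, predicate, obj in triples:
        by_subject[subject].append((predicate, obj))
    for subject, pairs in by_subject.items():
        yield " ".join([subject] + [token for pair in pairs for token in pair])

graph = [
    ("Kurt", "owns", "car0"), ("Kurt", "livesIn", "Cambridge"),
    ("car0", "a", "Car"), ("car0", "madeBy", "Ford"), ("car0", "madeIn", "Detroit"),
    ("Cambridge", "a", "City"), ("Detroit", "a", "City"),
]

if __name__ == "__main__":
    for line in triples_to_lines(graph):
        print(line)
    # Kurt owns car0 livesIn Cambridge
    # car0 a Car madeBy Ford madeIn Detroit
    # Cambridge a City
    # Detroit a City
```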
HDFS Data Partitioning
[Figure: a local client and an HDFS name node in front of four cloud data nodes, with records such as "Cannon Right", "Cannon Left", "Cannon Behind" replicated across the nodes.]
• Hash partitioning by default.
• Neighborhood partitioning would probably provide better performance.
• R&D opportunity!
Query Processing Implementation
• BBN-developed query processor.
– Starting integration with "standard" interfaces: Jena, Sesame.
• SHARD supports "most" of SPARQL.
– Like most commercial triple-stores.
• Large performance improvements possible with improved query reordering.
Data Persistence Advice from SHARD
• Go down to "bare metal" in HDFS for large-scale efficiency.
– No Berkeley DB, no C-stores, … nothing.
• Simple data storage as flat files.
– Lists of (predicate, object) pairs for every subject, one subject per line.
– Ex: Kurt owns car0 livesIn Cambridge
• Simple often really is better…
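For completeness, the reverse direction - turning one such flat-file line back into triples - under the same whitespace-token assumption as the serializer sketch above.

```python
# Reading one flat-file line back into triples, assuming whitespace-separated tokens
# as in the serializer sketch (not SHARD's actual parser).

def line_to_triples(line):
    tokens = line.split()
    subject, pairs = tokens[0], tokens[1:]
    return [(subject, pairs[i], pairs[i + 1]) for i in range(0, len(pairs), 2)]

if __name__ == "__main__":
    print(line_to_triples("Kurt owns car0 livesIn Cambridge"))
    # [('Kurt', 'owns', 'car0'), ('Kurt', 'livesIn', 'Cambridge')]
```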
Test Data
• Deployed code on the Amazon EC2 cloud.
– 19 XL nodes.
• LUBM (Lehigh University Benchmark)
– Artificial data on students, professors, courses, etc. at universities.
• 800-million-edge graph.
– 6000-university LUBM dataset.
• In general, SHARD performed comparably to "industrial" monolithic triple-stores.
Performance Comparison

Query type                                          SHARD    Parliament+Sesame   Parliament+Jena
Simple query, small response: triple lookup (Q1)    0.1 hr   0.001 hr            404 sec. (approx 0.1 hr.)
Triangular query (Q9)                               1 hr     1 hr                740 sec. (approx 0.2 hr.)
Simple query, large response (Q14)                  1 hr     5 hr                118 sec. (approx 0.03 hr.)
Insight from Query Performance
• SHARD is not optimal for edge look-ups.
– This is to be expected: SHARD (and MapReduce implementations generally) have no real indexing support.
• SHARD does well where large portions of the dataset need to be processed.
– Ex: multiple join operations, returning large result sets.
– This behavior is an artifact of the parallel searching and joining operations native to Clause-Iteration.
Design Insights
• Abstraction is a big win.
– Surprisingly economical for development.
• Lack of indexing limits look-up capabilities.
– This may not be so bad for some applications.
– An index would also need to be continually updated as data is added.
Design Insights – Data Partitioning
• Data linking may be a big win to reduce join overhead and the need for iterations over clauses.
– A first step would be advanced data partitioning.
– Some of this is done in Cloud9, but it is still wide open for even basic R&D implementations.
• Advanced data partitioning would also minimize the overhead of moving intermediate results between compute nodes.
– This seemed to be the biggest bottleneck.
Design Insights – Query Processing
• Query pre-processing may also be a big win.
– It could greatly reduce the amount of data carried between nodes during join operations.
• Subject-Iteration may be an alternative approach for queries with strongly connected source nodes.
– Iterate over query subjects rather than clauses.
Thanks! Questions? Kurt Rohloff krohloff@bbn.com @avometric