Clause-Iteration with Map-Reduce to Scalably Query Data Graphs: The SHARD Triple-Store
Rick Schantz (schantz@bbn.com), Kurt Rohloff (krohloff@bbn.com, @avometric)
Many thanks to: Prakash Manghwani, Mike Dean, Ian Emmons, Gail Mitchell, Doug Reid, Chris Kappler from BBN; Hanspeter Pfister from Harvard SEAS; Phil Zeyliger from Cloudera
Outline
• Challenge Problem: Scalably Query Graph Data
• Large-Scale Computing and MapReduce
• SHARD
• Design Insights
A Preface
SHARD is a cloud-based graph store.
• High-performance, scalable query processing.
SHARD is released open-source.
• BSD license.
More information and code at:
– My webpage
– SourceForge (SHARD-3store)
• Use svn to get the code:
  svn co https://shard-3store.svn.sourceforge.net/svnroot/shard-3store shard-3store
– Don't worry - this command is on SourceForge!
Scalable Graph Data Querying
• Emerging commercially
– Used by the NYTimes, the BBC, Pharma, …
– Numerous startups.
– Oracle and MySQL have SemWeb support.
• Government use …
• See the SemWeb.
SPARQL-like Queries
SPARQL query to find all people who own a car made in Detroit:

  SELECT ?person WHERE {
    ?person :owns ?car .
    ?car a :Car .
    ?car :madeIn :Detroit .
  }

[Figure: the query pattern drawn as a graph - ?person owns ?car; ?car a Car; ?car madeIn Detroit]
Answering Queries
[Figure: source data graph - Kurt owns car0; Kurt livesIn Cambridge; car0 a Car; car0 madeBy Ford; car0 madeIn Detroit; Cambridge a City; Detroit a City - matched against the query pattern ?person owns ?car; ?car a Car; ?car madeIn Detroit]
Variable bindings:
• ?person to Kurt
• ?car to car0
Design Considerations
• Scalable – web-scale?
• High assurance.
• Cost effective – commodity hardware?
• Modular inferred-data separation.
• Robustness.
• Considerations as endless as applications.
Scale Limitations!
• Triple-store study:
– "An Evaluation of Triple-Store Technologies for Large Data Stores", SSWS '07 (part of OTM).
• What about cloud computing?
– Economic scalability…
General Programming for Scalable Cloud Computing
From experience:
• Inherently multi-threaded.
• Toolsets still young.
– Not many debugging tools.
• Mental models are different...
– Learn an algorithm, adapt it to the chosen framework.
– Ex: try to fit the problem into the PageRank design pattern.
• (This isn't what we do, but this approach seems common.)
Scalable Distributed System (Cloud) Design Concept
Abstracting away parallelization enables much easier scaling.
• We use the maturing MapReduce framework in Hadoop to bulk-process graph edges.
• This provides a services layer on which to scale our graph query processing techniques.
• Innovation:
– Iterative clause-based construction of query responses.
– Join partial query responses over multiple MapReduce jobs using flagged keys.
SHARD Triple-Store Built on Hadoop
Prioritized design considerations:
• Commodity hardware, ONLY
• Web scalable
• Robust
What it is good at:
• Large query responses
• Complex queries
Clause Iteration Query Response Construction
[Figure: source data (subject, predicate, object) triples are matched against the 1st clause (?person owns ?car); those partial results are joined with matches of the next clause (?car a Car), and those in turn with matches of the final clause (?car madeIn Detroit).]
1st Partial Query Match by Clause
In the first map step, the first query clause (?person :owns ?car .) is used to find partial query matches that satisfy it.
• Keys are variable bindings.
• Values are set to null.

Source data: John owns dog0; Kurt livesIn Cambridge; Kurt owns car0; dog0 a Dog; car0 a Car; …
1st map key-value output: {John dog0} - null; {Kurt car0} - null; …

In the first reduce step, repeated partial matches are removed.
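Below is a minimal, self-contained Python sketch of this first pass, run locally rather than on Hadoop (SHARD itself is a Hadoop application, so its real mappers and reducers look different); the triple tuples and function names are illustrative assumptions, not SHARD's API.

```python
# Illustrative local stand-in for the first map/reduce pass (not SHARD's Hadoop code).
# The mapper emits one key-value pair per triple matching "?person :owns ?car":
# key = tuple of variable bindings, value = None (null).
from collections import defaultdict

def first_clause_map(triple):
    subject, predicate, obj = triple
    if predicate == "owns":                  # clause: ?person :owns ?car
        yield ((subject, obj), None)         # key = {?person, ?car}, value = null
    # non-matching triples emit nothing

def dedup_reduce(key, values):
    # The first reduce step only removes repeated partial matches:
    # each distinct binding (key) is emitted once.
    yield (key, None)

source_data = [
    ("John", "owns", "dog0"),
    ("Kurt", "livesIn", "Cambridge"),
    ("Kurt", "owns", "car0"),
    ("dog0", "a", "Dog"),
    ("car0", "a", "Car"),
]

if __name__ == "__main__":
    mapped = [kv for triple in source_data for kv in first_clause_map(triple)]
    print(mapped)    # [(('John', 'dog0'), None), (('Kurt', 'car0'), None)]

    groups = defaultdict(list)               # simulate the shuffle phase
    for key, val in mapped:
        groups[key].append(val)
    deduped = [kv for key, vals in groups.items() for kv in dedup_reduce(key, vals)]
    print(deduped)   # same bindings, duplicates removed
```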
2nd Clause Map – New Bindings
Map partial query matches from the 2nd query clause (?car a Car .):
• Keys are variable bindings previously observed.
• Values are set to new variable bindings.
Map matches from the previous clause for reordering:
• Keys are variable bindings common with the current clause.
• Values are the previous non-common bindings.

Source data: John owns dog0; Kurt livesIn Cambridge; Kurt owns car0; dog0 a Dog; car0 a Car; …
1st map key-value output: {John dog0} - null; {Kurt car0} - null; …
2nd map key-value output: {car0} - null; {dog0} - {John}; {car0} - {Kurt}; …
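Continuing the same local sketch, the second map pass below handles both kinds of input in one function; tagging records as "triple" or "partial" is an assumption made for illustration - a real Hadoop job would typically distinguish its inputs differently.

```python
# Second map step, sketched locally in Python (illustrative, not SHARD's actual code).
# Two kinds of records are mapped in the same pass:
#  1. source triples, matched against the new clause "?car a :Car"
#  2. partial results from the previous pass, re-keyed on the variable shared with
#     the current clause (?car), with the remaining bindings carried as the value.

def second_clause_map(record):
    kind, payload = record
    if kind == "triple":
        subject, predicate, obj = payload
        if predicate == "a" and obj == "Car":   # clause: ?car a :Car
            yield ((subject,), None)            # key = {?car}, value = null
    elif kind == "partial":
        person, car = payload                   # previous binding {?person, ?car}
        yield ((car,), (person,))               # key = shared var {?car},
                                                # value = non-common binding {?person}

source_data = [
    ("John", "owns", "dog0"), ("Kurt", "livesIn", "Cambridge"),
    ("Kurt", "owns", "car0"), ("dog0", "a", "Dog"), ("car0", "a", "Car"),
]
previous_results = [("John", "dog0"), ("Kurt", "car0")]   # from the first pass

if __name__ == "__main__":
    records = [("triple", t) for t in source_data] + \
              [("partial", p) for p in previous_results]
    print([kv for r in records for kv in second_clause_map(r)])
    # [(('car0',), None), (('dog0',), ('John',)), (('car0',), ('Kurt',))]
```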
2nd Clause Reduce – Join
The reduce step joins partial mappings on common variable bindings using the flagged keys.

2nd map key-value output: {car0} - null; {dog0} - {John}; {car0} - {Kurt}; …
2nd reduce key-value output: {car0} - {Kurt}; …

The process continues over all query clauses.
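A matching sketch of the join reduce step follows. Here the null value serves as the flag marking matches of the current clause; that is one plausible reading of the flagged-key scheme, not a statement of SHARD's exact encoding.

```python
# Join reduce step, sketched locally in Python (illustrative stand-in for the Hadoop reducer).
# For each key (shared variable binding), a joined result is emitted only when the key
# was seen both from the current clause (flagged with a null value) and from earlier
# partial results (values carrying the remaining bindings).
from collections import defaultdict

def join_reduce(key, values):
    matched_current_clause = any(v is None for v in values)   # flag from the new clause
    carried_bindings = [v for v in values if v is not None]   # from earlier passes
    if matched_current_clause:
        for binding in carried_bindings:
            yield (key, binding)                               # e.g. (('car0',), ('Kurt',))

second_map_output = [
    (("car0",), None),          # ?car a :Car matched car0
    (("dog0",), ("John",)),     # earlier partial: John owns dog0
    (("car0",), ("Kurt",)),     # earlier partial: Kurt owns car0
]

if __name__ == "__main__":
    groups = defaultdict(list)                 # simulate the shuffle phase
    for key, val in second_map_output:
        groups[key].append(val)
    joined = [kv for key, vals in groups.items() for kv in join_reduce(key, vals)]
    print(joined)                              # [(('car0',), ('Kurt',))]
    # Repeating map + reduce for the remaining clause (?car :madeIn :Detroit) and
    # projecting ?person at the end yields the query answer: Kurt.
```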
HDFS Graph Storage
[Figure: example graph - Kurt owns car0; Kurt livesIn Cambridge; car0 a Car; car0 madeBy Ford; car0 madeIn Detroit; Cambridge a City; Detroit a City]
Graphs are saved as flat files in HDFS (portions of the file are stored on each data node):

  Kurt owns car0 livesIn Cambridge
  car0 a Car madeBy Ford madeIn Detroit
  Cambridge a City
  Detroit a City
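A small sketch of how a triple set could be flattened into this one-line-per-subject layout; the whitespace tokenization is an assumption for illustration, and SHARD's actual file encoding may differ.

```python
# Illustrative serializer for the flat-file layout shown above:
# one line per subject, followed by its (predicate, object) pairs.
from collections import defaultdict

def triples_to_lines(triples):
    by_subject = defaultdict(list)
    for subject, predicate, obj in triples:
        by_subject[subject].append((predicate, obj))
    for subject, pairs in by_subject.items():
        yield " ".join([subject] + [token for pair in pairs for token in pair])

graph = [
    ("Kurt", "owns", "car0"), ("Kurt", "livesIn", "Cambridge"),
    ("car0", "a", "Car"), ("car0", "madeBy", "Ford"), ("car0", "madeIn", "Detroit"),
    ("Cambridge", "a", "City"), ("Detroit", "a", "City"),
]

if __name__ == "__main__":
    for line in triples_to_lines(graph):
        print(line)
    # Kurt owns car0 livesIn Cambridge
    # car0 a Car madeBy Ford madeIn Detroit
    # Cambridge a City
    # Detroit a City
```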
HDFS Data Partitioning
[Figure: a local client and an HDFS name node in front of four cloud data nodes, with records such as "Cannon Right", "Cannon Left", "Cannon Behind" replicated across the nodes.]
• Hash partitioning by default.
• Neighborhood partitioning would probably provide better performance.
• R&D opportunity!
Query Processing Implementation
• BBN-developed query processor.
– Starting integration with "standard" interfaces: Jena, Sesame.
• SHARD supports "most" of SPARQL.
– Like most commercial triple-stores.
• Large performance improvements possible with improved query reordering.
Data Persistence Advice from SHARD
• Go down to "bare metal" in HDFS for large-scale efficiency.
– No Berkeley DB, no C-stores, … nothing.
• Simple data storage as flat files.
– Lists of (predicate, object) pairs for every subject, one subject per line.
– Ex: Kurt owns car0 livesIn Cambridge
• Simple often really is better…
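For completeness, the reverse direction - turning one such flat-file line back into triples - under the same whitespace-token assumption as the serializer sketch above.

```python
# Reading one flat-file line back into triples, assuming whitespace-separated tokens
# as in the serializer sketch (not SHARD's actual parser).

def line_to_triples(line):
    tokens = line.split()
    subject, pairs = tokens[0], tokens[1:]
    return [(subject, pairs[i], pairs[i + 1]) for i in range(0, len(pairs), 2)]

if __name__ == "__main__":
    print(line_to_triples("Kurt owns car0 livesIn Cambridge"))
    # [('Kurt', 'owns', 'car0'), ('Kurt', 'livesIn', 'Cambridge')]
```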
Test Data
• Deployed code on the Amazon EC2 cloud.
– 19 XL nodes.
• LUBM (Lehigh University Benchmark)
– Artificial data on students, professors, courses, etc. at universities.
• 800-million-edge graph.
– 6000-university LUBM dataset.
• In general, SHARD performed comparably to "industrial" monolithic triple-stores.
Performance Comparison

Query type                                          SHARD    Parliament+Sesame   Parliament+Jena
Simple query, small response: triple lookup (Q1)    0.1 hr   0.001 hr            404 sec. (approx 0.1 hr.)
Triangular query (Q9)                               1 hr     1 hr                740 sec. (approx 0.2 hr.)
Simple query, large response (Q14)                  1 hr     5 hr                118 sec. (approx 0.03 hr.)
Insight from Query Performance
• SHARD is not optimal for edge look-ups.
– This is to be expected: SHARD (and MapReduce implementations generally) have no real indexing support.
• SHARD does well where large portions of the dataset need to be processed.
– Ex: multiple join operations, returning large result sets.
– This behavior is an artifact of the parallel searching and joining operations native to Clause-Iteration.
Design Insights
• Abstraction is a big win.
– Surprisingly economical for development.
• Lack of indexing limits look-up capabilities.
– This may not be so bad for some applications.
– An index would also need to be continually updated as data is added.
Design Insights – Data Partitioning
• Data linking may be a big win to reduce join overhead and the need for iterations over clauses.
– A first step would be advanced data partitioning.
– Some of this is done in Cloud9, but it is still wide open for even basic R&D implementations.
• Advanced data partitioning would also minimize the overhead of moving intermediate results between compute nodes.
– This seemed to be the biggest bottleneck.
Design Insights – Query Processing
• Query pre-processing may also be a big win.
– It could greatly reduce the amount of data carried between nodes during join operations.
• Subject-Iteration may be an alternative approach for queries with strongly connected source nodes.
– Iterate over query subjects rather than clauses.
Thanks! Questions? Kurt Rohloff krohloff@bbn.com @avometric