
Popularity and Challenges of Graph Cypher Queries

Sheik Shameer, Shivasurya Sankarapandian (CS846 Fall 2019). Presentation layout: Introduction, Motivation, Dataset Details, Methodology for Data Extraction, Results and Implications.


  1. Presentation Layout
     Popularity and Challenges of Graph Cypher Queries
     Sheik Shameer, Shivasurya Sankarapandian (CS846 Fall 2019)
     • Introduction
     • Motivation
     • Dataset Details
     • Methodology for Data Extraction
     • Results and Implications
     • Threats to Validity

     Introduction
     • A graph database is a NoSQL database that uses graph structures, with nodes and edges, for semantic queries.
     • Graph databases allow fast retrieval of complex hierarchical structures that are difficult to model in relational database systems.
     • They are commonly used in fraud detection analysis, network and database infrastructure monitoring, recommendation engines, social networks, knowledge graphs, and privacy and risk management.

     Social Network Experiment: Finding Friends of Friends
     • Database of 1,000,000 users, searching for 1,000 users.

  2. Introduction (continued)
     • Some of the most popular graph database management systems are Neo4j, Microsoft Azure Cosmos DB, OrientDB, ArangoDB and Virtuoso.
     • For this study we look into Neo4j.
     • Some of the popular graph query languages are Cypher, SPARQL, GraphQL and Gremlin.
     • For this study we look into the Cypher query language.

     Cypher Queries
         MATCH (:Movie { title: 'Wall Street' })<-[:ACTED_IN|:DIRECTED]-(person)
         RETURN person.name
     Result: Oliver Stone, Michael Douglas, Charlie Sheen, Martin Sheen

     Motivation
     • Can version control systems be used to provide relevant information on the problems faced by developers in open source repositories?
     • Can we use abstract syntax trees to mine the Cypher queries from the repositories?
     • Can we start building a corpus of graph Cypher queries that can be further used for analysis by others?
     • Can the information that we gained help others make useful contributions to the open source community?

     Research Questions
     • RQ1: "What type of graph Cypher queries are popular among the developers now?"
     • RQ2: "What type of graph Cypher queries do the developers have trouble with?"
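The Wall Street query from the Cypher Queries slide can be read as "find every person connected to the movie by an ACTED_IN or DIRECTED relationship". The following is a minimal sketch of that reading over an in-memory edge list, not how Neo4j evaluates the pattern. The `EDGES` list and the `people_related_to` helper are illustrative assumptions.

```python
# Each edge is (person, relationship type, movie title).
# The data is a small hand-written sample, assumed for illustration.
EDGES = [
    ("Oliver Stone", "DIRECTED", "Wall Street"),
    ("Michael Douglas", "ACTED_IN", "Wall Street"),
    ("Charlie Sheen", "ACTED_IN", "Wall Street"),
    ("Martin Sheen", "ACTED_IN", "Wall Street"),
    ("Michael Douglas", "ACTED_IN", "The American President"),
]


def people_related_to(title, rel_types, edges=EDGES):
    """Return the people linked to `title` by any relationship in `rel_types`."""
    return [person for person, rel, movie in edges
            if movie == title and rel in rel_types]


names = people_related_to("Wall Street", {"ACTED_IN", "DIRECTED"})
print(names)
```

The relationship-type set plays the role of `[:ACTED_IN|:DIRECTED]`, and the title filter plays the role of the `{ title: 'Wall Street' }` property match.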

  3. Data Set
     Java and JavaScript GitHub repositories

     Dataset       Repositories Count   Cypher Queries Mined
     Java          2579                 4159
     JavaScript    1212                 832
     Total         3791                 4991

     Methodology for Data Extraction
     Extracting graph database queries from source code:
     • Regular expression pattern matching approach
     • Abstract syntax tree (AST) parsing approach

     AST Approach
     • Follows the Visitor pattern
     • Parse source code and modules to represent them as a tree
     • Traverse for Identifiers and CallExpressions that use official driver method calls
     • Extract parameters and variables within the block
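The two extraction approaches listed above can be contrasted with a small sketch. It uses Python's standard `re` and `ast` modules on a Python snippet as a stand-in; the study itself parsed Java and JavaScript with other tools. The `session.run` call, the `DRIVER_METHODS` set and the sample source are assumptions made for illustration.

```python
import ast
import re

# Hypothetical sample source; `session.run` stands in for an official driver call.
SAMPLE_SOURCE = '''
query = "MATCH (p:Person) RETURN p.name"
session.run(query)
session.run("CREATE (n:Movie {title: $title})", title="Wall Street")
logger.info("MATCH day is on Friday")
'''

# Assumed set of driver method names to look for.
DRIVER_METHODS = {"run"}

# Regex approach: any double-quoted literal that starts with a Cypher keyword.
CYPHER_LITERAL = re.compile(r'"((?:MATCH|CREATE|MERGE|CALL)\b[^"]*)"')


def extract_with_regex(source):
    """Return every string literal that looks like a Cypher query."""
    return CYPHER_LITERAL.findall(source)


class QueryVisitor(ast.NodeVisitor):
    """Visitor that collects string arguments passed to driver method calls."""

    def __init__(self):
        self.assignments = {}  # variable name -> string literal assigned to it
        self.queries = []

    def visit_Assign(self, node):
        # Remember simple `name = "literal"` assignments so variables can be resolved.
        if isinstance(node.value, ast.Constant) and isinstance(node.value.value, str):
            for target in node.targets:
                if isinstance(target, ast.Name):
                    self.assignments[target.id] = node.value.value
        self.generic_visit(node)

    def visit_Call(self, node):
        func = node.func
        if isinstance(func, ast.Attribute) and func.attr in DRIVER_METHODS and node.args:
            first = node.args[0]
            if isinstance(first, ast.Constant) and isinstance(first.value, str):
                self.queries.append(first.value)
            elif isinstance(first, ast.Name) and first.id in self.assignments:
                self.queries.append(self.assignments[first.id])
        self.generic_visit(node)


def extract_with_ast(source):
    """Return the query strings passed to driver calls, in source order."""
    visitor = QueryVisitor()
    visitor.visit(ast.parse(source))
    return visitor.queries


print(extract_with_regex(SAMPLE_SOURCE))
print(extract_with_ast(SAMPLE_SOURCE))
```

On this sample the regex also picks up the log message that merely starts with the word MATCH, while the visitor only reports strings that reach a driver call, including the one passed through the `query` variable.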

  4. Mining and Tools (JavaScript)
     • 1212 JavaScript repositories from GitHub which use Neo4j
     • Verify the existence of Neo4j-driver in the repository
     • ESPrima: source code parser that constructs the AST
     • ESPrima-Walk: efficiently traverse the AST and filter queries
     • Node-git to fetch commit logs and code changes of the extracted graph database queries
     • Shell scripts to automate commands

     Mining and Tools (Java)
     • 2579 Java repositories mined from GitHub
     • Javalang Python module for the AST
     • Javaparser library: we were able to mine the queries with this library
     • We looked for official driver methods and the variables used in them
     • Able to mine queries defined within the same file, totaling around 4159 queries
     • Commit messages for 836 Java queries were mined using a combination of git log and grep

     RQ1: "What type of graph Cypher queries are popular among the developers now?"
     • We also used tokenization and stemming concepts from NLP to search for the most used words in the messages of the commits that created the queries.
     • The word "procedure" had significant usage.

     Word Tokens   Count
     Procedure     654
     Initial       626
     Commit        611
     Fixes         388
     Neo4j         295
     Annotation    259
     Change        226
     Sparkles      224
     Branch        216

     Inferences RQ1
     • Call procedures were very popular.
     • Call, Match and Create were the popular types of queries.
     • So Cypher queries are predominantly used for creating and fetching data and for calling procedures.

     Figure: Types of Cypher queries (Java)
     Figure: Types of Cypher queries (JavaScript)
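The word-token counts for RQ1 could be produced along these lines. This is a minimal sketch that only tokenizes and counts; the stemming step mentioned on the slide is left out. The stop-word list and the sample commit messages are hypothetical and not taken from the mined dataset.

```python
import re
from collections import Counter

# Small assumed stop-word list; a real run would likely use a fuller one.
STOP_WORDS = {"a", "an", "the", "for", "to", "of", "and", "in", "on", "with"}


def tokenize(message):
    """Lowercase a commit message and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", message.lower())


def count_tokens(messages):
    """Count word tokens across commit messages, skipping stop words."""
    counts = Counter()
    for message in messages:
        counts.update(t for t in tokenize(message) if t not in STOP_WORDS)
    return counts


# Hypothetical commit messages, used only to exercise the functions.
messages = [
    "Initial commit",
    "Add procedure for node lookup",
    "Fixes procedure annotation",
]
counts = count_tokens(messages)
print(counts.most_common(3))
```

A `Counter` keeps the tally per token, so `most_common` gives a ranking of the same shape as the Word Tokens table.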

  5. Inferences RQ1
     • Neo4j default procedures were used: 516
     • Other procedures worth mentioning came from APOC repositories and machine learning procedures.
     • After tokenizing the CALL queries, we also found that users were writing their own procedures.
     • Neo4jversioner is a repository that deals with network and database infrastructure; its procedures can be used by other users in related domains as well.

     Procedures                                   Count
     org.neo4j.procedure.simpleArgument           42
     org.neo4j.procedure.writingProcedure         40
     org.neo4j.procedure.defaultValues            30
     org.neo4j.procedure.node                     28
     org.neo4j.procedure.integrationTestMe        24
     org.neo4j.procedure.schemaProcedure          20
     org.neo4j.procedure.genericListWithDefault   18
     org.neo4j.procedure.recursiveSum             18
     org.neo4j.procedure.sideEffect               16
     org.neo4j.procedure.createNode               12

     Procedures                                   Count
     graph.versioner.diff                         4
     graph.versioner.diff.from.current            3
     graph.versioner.diff.from.previous           3
     graph.versioner.get.all                      2
     graph.versioner.get.by.date                  1
     graph.versioner.get.by.label                 2
     graph.versioner.get.current.path             1
     graph.versioner.get.current.state            1
     graph.versioner.get.nth.state                2
     graph.versioner.init                         6
     graph.versioner.patch                        6
     graph.versioner.patch.from                   4
     graph.versioner.rollback                     4
     graph.versioner.rollback.nth                 2
     graph.versioner.rollback.to                  4
     graph.versioner.update                       4

     Procedures                                   Count
     regression.linear.addM                       2
     regression.linear.create                     8
     regression.linear.delete                     2
     regression.linear.info                       3
     regression.linear.load                       3
     regression.linear.test                       1
     regression.linear.train                      3
     regression.logistic.add                      1
     regression.logistic.delete                   1
     regression.linear.add                        2

     RQ2: What type of graph Cypher queries do the developers have trouble with?
     • The extracted 832 queries from JavaScript and 4159 from Java were verified for false positives.
     • Git log with the corresponding line numbers and file names produced the commit information.
     • 100 random queries were sampled from JavaScript and Java.
     • The code changes and commit information were manually verified.

     Figure: Refactored Neo4j types of queries, JavaScript repository commits for the sample of 100 queries
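Procedure counts like those in the tables above can be obtained by tokenizing the CALL queries. Here is a minimal sketch, assuming a regular expression is enough to pick out the dotted name after `CALL`. The sample queries reuse procedure names from the tables, but their argument lists are placeholders rather than the real signatures.

```python
import re
from collections import Counter

# Matches the dotted procedure name that follows a CALL keyword.
CALL_PATTERN = re.compile(r"\bCALL\s+([A-Za-z_][\w.]*)", re.IGNORECASE)


def procedure_names(query):
    """Return the procedure names invoked by CALL clauses in one query."""
    return CALL_PATTERN.findall(query)


def count_procedures(queries):
    """Count how often each procedure name appears across all queries."""
    counts = Counter()
    for query in queries:
        counts.update(procedure_names(query))
    return counts


def count_namespaces(queries, depth=2):
    """Group procedure counts by the first `depth` segments of the name."""
    counts = Counter()
    for name, n in count_procedures(queries).items():
        counts[".".join(name.split(".")[:depth])] += n
    return counts


# Hypothetical queries; the arguments are placeholders.
QUERIES = [
    "CALL graph.versioner.init($label, $props)",
    "CALL graph.versioner.rollback($node)",
    "CALL regression.linear.create($model)",
    "call db.labels()",
    "MATCH (n) RETURN n",
]
print(count_procedures(QUERIES))
print(count_namespaces(QUERIES))
```

Grouping by name prefix is one way to separate Neo4j's own procedures from user-written ones such as the `graph.versioner` family.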

  6. RQ2 Results
     • Transaction, Merge and Match queries have a large number of refactoring changes, whereas other types of queries change infrequently.
     • Rare and common query edits in the MATCH, CALL and CREATE queries include:
       • Adding and renaming aliases
       • Adding and removing attributes
       • Adding and removing conditions
       • Adopting new versions of procedures and libraries

     Figure: Refactored Neo4j types of queries, Java repository commits for the sample of 100 queries

     Threats to Validity
     • We collected JavaScript and Java source code from open source repositories, which may not represent the general population of projects.
     • Developers may use object-relational mapping or runtime query generation, which static techniques such as AST parsing can miss.
     • We generalize our results from the Java and JavaScript repositories we mined; repositories in other programming languages, such as Python, may provide further insights.
