Data Integration for Neo4j using Kettle Matt Casters, matt.casters@neo4j.com mattcasters Neo4j Chief Solutions Architect
Topics ➢ What is Kettle? ➢ Kettle plugins for Neo4j ➢ Kettle using Neo4j ➢ Examples ➢ The Hunger Games ➢ Q&A
What is Kettle?
Kettle: Introduction ➢ A visual programming tool for data orchestration ➢ A.k.a. Pentaho Data Integration from Hitachi Vantara ➢ Over 15 years old ➢ Open source under the Apache License 2.0 ➢ Large community, marketplace, ... ➢ Easy to embed, install, package, rebrand ➢ Download your Neo4j remix at www.kettle.be
Kettle: Where is it used? ➢ On tiny and enormous systems, real or virtual ➢ Very small computers, Raspberry Pi sized ➢ Your laptop or browser ➢ Locally or in the cloud ➢ On Hadoop clusters, VMs, Docker, Serverless, ... ➢ At large and small companies ➢ In government ➢ In education ➢ In the Neo4j Solutions Reference Architecture
Kettle: Why is it used? ➢ Reduce costs, reach goals faster ➢ Answers the “build or buy?” question [Chart: accumulated cost over time for build, buy and Kettle]
Kettle: Architecture ➢ Metadata driven, engine based : ○ No code generation ○ Define what you need to happen → GUI, Web, code, rules, … ○ Clear and transparent, self documenting ➢ Types of work: ○ Jobs for workflows ○ Transformations for parallel data streaming
Kettle: Design ➢ 100% Exposure of our engine through UI elements ➢ Everyone should be able to play along: plugins! ➢ We built integration points for others: run everywhere! ➢ Allow the user to avoid programming anything ➢ Allow the user to program anything: JavaScript, Java, Groovy, RegEx, Rules, Python, Ruby, R, … ➢ Transparency wins: best in class logging, data lineage, execution lineage, debugging / breakpoints, data previewing, row sniff testing, …
Other Kettle options available to you... ➢ SpoonGit: UI integration with git ➢ WebSpoon: web interface to the full Spoon UI ➢ Data Sets: build transformation unit tests ➢ Native file system protocols: hdfs://, s3://, gs:// … ➢ Hadoop support through a compatibility layer ➢ Kettle Beam: execute transformations on Apache Spark, Apache Flink and GCP DataFlow
Kettle: The Toolset ➢ Spoon: GUI ➢ Scripts ➢ Server(s) ➢ Java API & SDK ➢ Standard file format ➢ Plugin ecosystem ➢ Docker image(s) ➢ Documentation, books, ...
Architecture [Diagram: Version Control System (git) → checkout version → Deploy System (VM, docker, ...) → configure, setup, initialize, run → artifacts, graphs, configurations]
Kettle plugins for Neo4j
Plugins: Neo4j Cypher ➢ For reading and writing ➢ Dynamic Cypher ➢ Batching and UNWIND ➢ Parallel execution ➢ High performance ➢ Call procedures
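The "Batching and UNWIND" bullet can be illustrated with a minimal sketch. It is not the plugin's own code: the Person label, the property names and the batch_rows helper are assumptions made for this example. The idea is to send one statement per batch of rows, with the rows passed as a single list parameter that Cypher expands with UNWIND.

```python
from typing import Any, Dict, Iterator, List, Tuple

# Hypothetical statement: the label and property names are illustrative only.
UNWIND_MERGE = (
    "UNWIND $rows AS row "
    "MERGE (p:Person {id: row.id}) "
    "SET p.name = row.name"
)


def batch_rows(
    rows: List[Dict[str, Any]], batch_size: int
) -> Iterator[Tuple[str, Dict[str, Any]]]:
    """Yield one (cypher, parameters) pair per batch of input rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(rows), batch_size):
        # Each batch travels as a single list parameter named "rows".
        yield UNWIND_MERGE, {"rows": rows[start:start + batch_size]}


rows = [{"id": i, "name": f"person-{i}"} for i in range(5)]
batches = list(batch_rows(rows, batch_size=2))
```

Each pair could then be handed to whatever Neo4j client is in use. Five rows with a batch size of two give three statements instead of five.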
Plugins: Neo4j Output ➢ Easy node creation ➢ Create/Merge of ()-[]-() ➢ Batching and UNWIND ➢ Parallel execution ➢ Dynamic labels
Plugins: Neo4j Graph Output ➢ Update parts of a graph ➢ Auto-generate Cypher ➢ Using a logical model ➢ Using field mapping
Plugins: Check Neo4j Connection ➢ Job Entry ➢ Validate DBs are up ➢ Used in error diagnostic ➢ Defensive setup
Plugins: Neo4j Cypher Script ➢ Job Entry ➢ Executes series of Cypher statements
Neo4j Generate CSVs ➢ Generate CSV files for Neo4j Import ➢ Generates appropriate header ➢ Handles escaping, quoting, … ➢ Outputs file names
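As a rough illustration of what "appropriate header" and "escaping, quoting" involve, here is a sketch that writes a node file in the header style used by the Neo4j import tooling (an :ID column and a :LABEL column). The nodes_csv function and its arguments are hypothetical and are not the step's actual options.

```python
import csv
import io
from typing import Dict, List


def nodes_csv(rows: List[Dict[str, str]], id_field: str, label: str) -> str:
    """Render rows as a node CSV with an import-style header line."""
    if not rows:
        raise ValueError("at least one row is needed to derive the header")
    fields = list(rows[0].keys())
    # Mark the identifier column with :ID and append a :LABEL column.
    header = [f"{f}:ID" if f == id_field else f for f in fields] + [":LABEL"]
    out = io.StringIO()
    # The csv module quotes fields containing commas or quotes and doubles
    # embedded quote characters.
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow(header)
    for row in rows:
        writer.writerow([row[f] for f in fields] + [label])
    return out.getvalue()


text = nodes_csv([{"id": "1", "name": 'Smith, "Jo"'}], id_field="id", label="Person")
```

The value containing a comma and quotes is written as a single quoted field, so the importer reads it back unchanged.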
Neo4j Split Graph ➢ Splits a graph field into nodes and relationships ➢ Used for unique value calculation
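One way to picture the split, as a sketch only: each incoming row carries a small graph, and the unique nodes and relationships are collected across all rows. The dictionary layout with "nodes" and "relationships" keys and the tuple encoding are assumptions for this example, not the step's real data type.

```python
from typing import Dict, Hashable, List, Tuple


def split_graph(graphs: List[Dict[str, list]]) -> Tuple[list, list]:
    """Collect the unique nodes and relationships found in a list of graphs."""
    # Dictionaries keep insertion order, so the first occurrence wins.
    nodes: Dict[Hashable, None] = {}
    rels: Dict[Hashable, None] = {}
    for graph in graphs:
        for node in graph.get("nodes", []):
            nodes.setdefault(node, None)
        for rel in graph.get("relationships", []):
            rels.setdefault(rel, None)
    return list(nodes), list(rels)


# Hypothetical encoding: a node is (label, key), a relationship is
# (source node, type, target node).
ann = ("Person", "ann")
bob = ("Person", "bob")
acme = ("Company", "acme")
graphs = [
    {"nodes": [ann, acme], "relationships": [(ann, "WORKS_AT", acme)]},
    {"nodes": [bob, acme], "relationships": [(bob, "WORKS_AT", acme)]},
]
unique_nodes, unique_rels = split_graph(graphs)
```

The company node appears in both input graphs but only once in the result, which is the property needed before generating one CSV line per node.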
Neo4j Importer ➢ Runs a neo4j-import command ➢ Accepts the filenames of CSV files
Kettle using Neo4j
Using Neo4j in Kettle : Logging ➢ Write logging to Neo4j ➢ Builds an execution lineage graph ➢ Updates a metadata graph ➢ Execution details are stored on Job, Job entry, Transformation, Steps, Database levels ➢ Stores graph updates ○ Node creation or update ○ Relationship creation or update
Using Neo4j in Kettle : Logging ➢ Documents the execution process ○ Log text, times, lineage
Using Neo4j in Kettle : Logging ➢ Examine past executions ○ See what went wrong over the weekend ○ Click on a step to see how long it took ○ Examine log texts ○ Generate Cypher queries to examine further ➢ Calculate delta window ○ Take the last execution without error into account
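The delta window idea can be sketched as follows. This assumes the window starts at the start time of the most recent execution that finished without errors. The Execution record and the choice of timestamp are assumptions for illustration, since the slides do not say how the logging graph stores them.

```python
from datetime import datetime
from typing import List, NamedTuple, Optional, Tuple


class Execution(NamedTuple):
    """Hypothetical summary of one logged execution."""
    started: datetime
    errors: int


def delta_window(
    history: List[Execution], now: datetime
) -> Tuple[Optional[datetime], datetime]:
    """Return (window start, window end) for the next incremental load.

    The start is None when no error-free execution exists, which a caller
    could treat as "load everything".
    """
    ok_starts = [e.started for e in history if e.errors == 0]
    return (max(ok_starts) if ok_starts else None, now)


history = [
    Execution(datetime(2019, 11, 1, 2, 0), errors=0),
    Execution(datetime(2019, 11, 2, 2, 0), errors=3),
    Execution(datetime(2019, 11, 3, 2, 0), errors=1),
]
window = delta_window(history, now=datetime(2019, 11, 4, 2, 0))
```

Because the two most recent runs failed, the window reaches back to the run of November 1st, so data missed by the failed runs is picked up again.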
Using Neo4j in Kettle : Logging ➢ Top-to-bottom : find an error ○ Large jobs are hard to debug ○ Sub-jobs and sub-transformations obfuscate ○ Going through logging takes time ○ We know the loaded job or transformation ○ Neo4j can find the shortest path to the lowest execution node with errors>0 that has no children with errors>0 ○ We can show these shortest paths to the error ○ The user knows in seconds where the error happened and can go straight to it to fix it.
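The top-to-bottom search can be sketched outside the database as a breadth-first search over the execution tree. In Neo4j this would be a Cypher shortest-path query; the in-memory dictionaries and the path_to_error function below are stand-ins invented for this example.

```python
from collections import deque
from typing import Dict, List, Optional


def path_to_error(
    children: Dict[str, List[str]], errors: Dict[str, int], root: str
) -> Optional[List[str]]:
    """Return the shortest path from root to a failing node whose children
    all ran without errors, or None when nothing failed."""
    queue = deque([[root]])
    seen = {root}
    while queue:
        path = queue.popleft()
        node = path[-1]
        kids = children.get(node, [])
        has_failing_child = any(errors.get(k, 0) > 0 for k in kids)
        # A failing node without failing children is where the error started.
        if errors.get(node, 0) > 0 and not has_failing_child:
            return path
        for kid in kids:
            if kid not in seen:
                seen.add(kid)
                queue.append(path + [kid])
    return None


# Hypothetical execution tree: a job, a sub-transformation and its steps.
children = {"job": ["load", "report"], "load": ["read csv", "neo4j output"]}
errors = {"job": 1, "load": 1, "neo4j output": 1}
error_path = path_to_error(children, errors, root="job")
```

Breadth-first order guarantees that the first match is reached over the fewest hops, which matches the "shortest path" wording on the slide.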
Using Neo4j in Kettle : Logging ➢ Bottom-up : how was a component executed ○ We know the step or job entry selected ○ Neo4j can find the shortest path to the root execution node without parents ○ We can show these execution paths ○ The user knows how something was executed ○ Very useful in highly dynamic conditional executions
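The bottom-up direction is simpler to sketch: starting from the selected step or job entry, follow the parent links until a node without a parent is reached. The sketch assumes each execution node has at most one parent. The parent mapping and the path_to_root function are hypothetical.

```python
from typing import Dict, List


def path_to_root(parent: Dict[str, str], node: str) -> List[str]:
    """Return the execution path from the root down to the given node."""
    path = [node]
    seen = {node}
    while path[-1] in parent:
        up = parent[path[-1]]
        if up in seen:
            # Guard against malformed data that would loop forever.
            raise ValueError(f"cycle detected at {up!r}")
        seen.add(up)
        path.append(up)
    return list(reversed(path))


# Hypothetical lineage: which execution started which.
parent = {"neo4j output": "load", "load": "job"}
execution_path = path_to_root(parent, "neo4j output")
```

The result reads top-down, which is how an execution path is usually shown to a user.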
Using Neo4j in Kettle : Logging ➢ Examining executions with browser or Bloom ➢ What exactly executed what, how, when, …? ➢ We generate Cypher for Neo4j beginners ➢ Fun Neo4j learning path for Kettle users
Other data for this audit graph... ➢ Data profiling ➢ Git branches and commit history graph ➢ Transformation unit testing results ➢ Transformation data lineage information ➢ … ➢ Coming soon
Examples
Kettle: Quick Spoon intro
Loading Neo4j: loading nodes ➢ Demonstrates the Neo4j Output step ➢ Read a CSV file in parallel ➢ Load the data into nodes in parallel
Loading Neo4j: update graphs ➢ Demonstrates the Neo4j Graph Output step ➢ Updates multiple nodes and relationships at once ➢ Takes key values into account to ignore nodes ➢ Automatically generates MERGE statements
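To make "automatically generates MERGE statements" more concrete, here is a sketch of how such generation could work. It is not the step's real output: node_merge, graph_merge, the parameter naming scheme and the rule "skip a node whose key value is missing" are all assumptions, the last one being one possible reading of "takes key values into account to ignore nodes".

```python
from typing import Any, Dict, List, Optional, Tuple


def node_merge(
    var: str, label: str, key: str, row: Dict[str, Any]
) -> Optional[Tuple[str, Dict[str, Any]]]:
    """Return (cypher, params) for one node, or None if its key is missing."""
    if row.get(key) is None:
        return None
    # Labels and field names are interpolated unescaped: trusted input only.
    params = {f"{var}_{k}": v for k, v in row.items()}
    cypher = f"MERGE ({var}:{label} {{{key}: ${var}_{key}}})"
    sets = [f"{var}.{k} = ${var}_{k}" for k in row if k != key]
    if sets:
        cypher += " SET " + ", ".join(sets)
    return cypher, params


def graph_merge(
    nodes: List[Tuple[str, str, str, Dict[str, Any]]], rel: Tuple[str, str, str]
) -> Tuple[str, Dict[str, Any]]:
    """Build one statement for several nodes and one relationship."""
    parts: List[str] = []
    params: Dict[str, Any] = {}
    present = set()
    for var, label, key, row in nodes:
        merged = node_merge(var, label, key, row)
        if merged is None:
            continue  # ignore nodes without a key value
        parts.append(merged[0])
        params.update(merged[1])
        present.add(var)
    src, rel_type, dst = rel
    # The relationship only makes sense when both end nodes exist.
    if src in present and dst in present:
        parts.append(f"MERGE ({src})-[:{rel_type}]->({dst})")
    return "\n".join(parts), params


cypher, params = graph_merge(
    [
        ("p", "Person", "id", {"id": 1, "name": "Ann"}),
        ("c", "Company", "id", {"id": None, "name": "unknown"}),
    ],
    rel=("p", "WORKS_AT", "c"),
)
```

In this example the company row has no key value, so only the person node is merged and the relationship is left out.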
Sourcing Neo4j: simple reading ➢ Read using a Cypher query ➢ Write to an Excel file
To wrap up...
Take-aways Data Integration for Neo4j using Kettle : ➢ Work faster, tackle harder problems ➢ Reduce risk by showing results faster ➢ Govern your Neo4j solutions using Neo4j
Upcoming Kettle Community Meetup ➢ → kcm19.be ➢ Antwerp ➢ Saturday November 23rd
Join our slack kettle-community.slack.com ➢ Mail me for an invite: matt.casters@neo4j.com
The Hunger Games
Hunger Games Questions for "Data Integration for Neo4j using Kettle"
1. Easy : Can you extract information from relational databases using Kettle?
a. No
b. Yes, but only a few
c. Yes, almost all of them
2. Medium : Can I script harder parts of my data orchestration work?
a. No, Kettle is a visual programming tool
b. Yes, you can use all popular scripting languages
c. Yes, you can use JavaScript
3. Hard : Can Kettle work with big data resources?
a. Yes, Kettle has native support for protocols like S3, HDFS, GS and others.
b. Yes, a) + Kettle also supports visual Map/Reduce development
c. Yes, b) + Kettle also supports execution on the Spark, Flink and DataFlow engines
Answer here: r.neo4j.com/hunger-games
Kettle & Neo4j Q&A