Data Integration for Neo4j using Kettle

Data Integration for Neo4j using Kettle Matt Casters, matt.casters@neo4j.com mattcasters Neo4j Chief Solutions Architect

Topics ➢ What is Kettle? ➢ Kettle plugins for Neo4j ➢ Kettle using Neo4j ➢ Examples ➢ The Hunger Games ➢ Q&A

What is Kettle?

Kettle: Introduction ➢ A visual programming tool for data orchestration ➢ A.k.a. Pentaho Data Integration from Hitachi Vantara ➢ Over 15 years old ➢ Open source under the Apache License 2.0 ➢ Large community, marketplace, ... ➢ Easy to embed, install, package, rebrand ➢ Download your Neo4j remix at www.kettle.be

Kettle: Where is it used? ➢ On tiny and enormous systems, real or virtual ➢ Very small computers, Raspberry Pi sized ➢ Your laptop or browser ➢ Locally or in the cloud ➢ On Hadoop clusters, VMs, Docker, Serverless, … ➢ At large and small companies ➢ In government ➢ In education ➢ In the Neo4j Solutions Reference Architecture

Kettle: Why is it used? ➢ Reduce costs, reach goals faster ➢ Answers the “build or buy?” question [Chart: accumulated cost over time for build, buy and Kettle]

Kettle: Architecture ➢ Metadata driven, engine based : ○ No code generation ○ Define what you need to happen → GUI, Web, code, rules, … ○ Clear and transparent, self documenting ➢ Types of work: ○ Jobs for workflows ○ Transformations for parallel data streaming
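To make "metadata driven, engine based" concrete, here is a minimal sketch in plain Python. It is not Kettle code and does not use the Kettle API: the step names, the metadata layout and the `run_transformation` function are all hypothetical. The point it illustrates is that the work is described as data and interpreted by a generic engine, so no code is generated per transformation.

```python
from typing import Callable, Dict, Iterable, Iterator, List

Row = Dict[str, object]


# Hypothetical step implementations: each one consumes and produces a row stream.
def filter_rows(rows: Iterable[Row], field: str, minimum: int) -> Iterator[Row]:
    for row in rows:
        if row[field] >= minimum:
            yield row


def add_constant(rows: Iterable[Row], field: str, value: object) -> Iterator[Row]:
    for row in rows:
        yield {**row, field: value}


# Hypothetical registry mapping a step type name to its implementation.
STEP_TYPES: Dict[str, Callable[..., Iterator[Row]]] = {
    "filter_rows": filter_rows,
    "add_constant": add_constant,
}


def run_transformation(metadata: dict, rows: Iterable[Row]) -> List[Row]:
    """Interpret the metadata: chain the configured steps into one row stream."""
    stream: Iterable[Row] = iter(rows)
    for step in metadata["steps"]:
        stream = STEP_TYPES[step["type"]](stream, **step["options"])
    return list(stream)


# The "transformation" is only a description of what should happen.
metadata = {
    "name": "demo",
    "steps": [
        {"type": "filter_rows", "options": {"field": "age", "minimum": 18}},
        {"type": "add_constant", "options": {"field": "source", "value": "csv"}},
    ],
}

result = run_transformation(
    metadata,
    [{"name": "Ann", "age": 30}, {"name": "Bob", "age": 12}],
)
```

Changing the behaviour means editing the metadata, not the engine, which is also what makes such a description easy to version, inspect and document.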

Kettle: Design ➢ 100% Exposure of our engine through UI elements ➢ Everyone should be able to play along: plugins! ➢ We built integration points for others: run everywhere! ➢ Allow the user to avoid programming anything ➢ Allow the user to program anything: JavaScript, Java, Groovy, RegEx, Rules, Python, Ruby, R, … ➢ Transparency wins: best in class logging, data lineage, execution lineage, debugging / breakpoints, data previewing, row sniff testing, …

Other Kettle options available to you... ➢ SpoonGit: UI integration with git ➢ WebSpoon: web interface to the full Spoon UI ➢ Data Sets: build transformation unit tests ➢ Native file system protocols: hdfs://, s3://, gs:// … ➢ Hadoop support through a compatibility layer ➢ Kettle Beam: execute transformations on Apache Spark, Apache Flink and Google Cloud Dataflow

Kettle: The Toolset ➢ Spoon: GUI ➢ Scripts ➢ Server(s) ➢ Java API & SDK ➢ Standard file format ➢ Plugin ecosystem ➢ Docker image(s) ➢ Documentation, books, ...

Architecture ➢ Version control system: git, check out a version ➢ Deploy system: VM, docker, ... ➢ Configure: setup, initialize, run ➢ Artifacts, graphs, configurations

Kettle plugins for Neo4j

Plugins: Neo4j Cypher ➢ For reading and writing ➢ Dynamic Cypher ➢ Batching and UNWIND ➢ Parallel execution ➢ High performance ➢ Call procedures
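The "Batching and UNWIND" bullet refers to a common Cypher pattern: instead of one statement per row, a list of rows is passed as a single parameter and unrolled with UNWIND. The sketch below only builds the statement and parameter pairs in plain Python; it is not the plugin's code, and the `Person` label, the property names and the batch size are assumptions chosen for illustration.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Row = Dict[str, object]

# Hypothetical statement: the label and properties are illustrative only.
CYPHER = "UNWIND $rows AS row MERGE (p:Person {id: row.id}) SET p.name = row.name"


def batches(rows: Iterable[Row], size: int) -> Iterator[List[Row]]:
    """Group an incoming row stream into lists of at most `size` rows."""
    batch: List[Row] = []
    for row in rows:
        batch.append(row)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:  # flush the last, partially filled batch
        yield batch


def unwind_calls(rows: Iterable[Row], size: int) -> List[Tuple[str, Dict[str, List[Row]]]]:
    """One (statement, parameters) pair per batch instead of one per row."""
    return [(CYPHER, {"rows": batch}) for batch in batches(rows, size)]


rows = [{"id": i, "name": f"person-{i}"} for i in range(5)]
calls = unwind_calls(rows, 2)
```

Each pair could then be handed to a Neo4j driver session as one statement execution, so five rows with a batch size of two cost three round trips rather than five.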

Plugins: Neo4j Output ➢ Easy node creation ➢ Create/Merge of ()-[]-() ➢ Batching and UNWIND ➢ Parallel execution ➢ Dynamic labels

Plugins: Neo4j Graph Output ➢ Update parts of a graph ➢ Auto-generate Cypher ➢ Using a logical model ➢ Using field mapping

Plugins: Check Neo4j Connection ➢ Job Entry ➢ Validate DBs are up ➢ Used in error diagnostic ➢ Defensive setup

Plugins: Neo4j Cypher Script ➢ Job Entry ➢ Executes series of Cypher statements

Neo4j Generate CSVs ➢ Generate CSV files for Neo4j Import ➢ Generates appropriate header ➢ Handles escaping, quoting, … ➢ Outputs file names
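As a rough illustration of what such generated files look like, the sketch below writes a node file and a relationship file with the header conventions used by the Neo4j import tool (`:ID`, `:LABEL`, `:START_ID`, `:END_ID`, `:TYPE`). It uses only Python's standard `csv` module and is not the step's implementation; the function names and the sample fields are hypothetical.

```python
import csv
import io
from typing import Dict, List


def nodes_csv(nodes: List[Dict], id_field: str, properties: List[str]) -> str:
    """Render nodes with an import header such as `id:ID,name,:LABEL`."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow([f"{id_field}:ID", *properties, ":LABEL"])
    for node in nodes:
        writer.writerow([
            node[id_field],
            *(node[p] for p in properties),
            ";".join(node["labels"]),  # multiple labels are separated by ';'
        ])
    return buf.getvalue()


def relationships_csv(relationships: List[Dict]) -> str:
    """Render relationships with the `:START_ID,:END_ID,:TYPE` header."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerow([":START_ID", ":END_ID", ":TYPE"])
    for rel in relationships:
        writer.writerow([rel["start"], rel["end"], rel["type"]])
    return buf.getvalue()


# The csv module takes care of quoting and escaping embedded quotes and commas.
node_file = nodes_csv(
    [{"id": 1, "name": 'Ann "A", Jr.', "labels": ["Person"]}],
    id_field="id",
    properties=["name"],
)
rel_file = relationships_csv([{"start": 1, "end": 2, "type": "KNOWS"}])
```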

Neo4j Split Graph ➢ Splits a graph field into nodes and relationships ➢ Used for unique value calculation
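One way to picture this step: every incoming row carries a small graph, and splitting it into separate node and relationship rows makes it possible to remove duplicates before loading. The sketch below is an assumption about that idea, not the step's code; the dictionary layout of a "graph" value and the `id` key used for de-duplication are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

Element = Tuple[str, Dict]


def split_graph(graph: Dict) -> Iterator[Element]:
    """Turn one graph value into separate node and relationship rows."""
    for node in graph["nodes"]:
        yield ("node", node)
    for rel in graph["relationships"]:
        yield ("relationship", rel)


def unique_elements(graphs: Iterable[Dict]) -> List[Element]:
    """Keep the first occurrence of every node and relationship id."""
    seen = set()
    unique: List[Element] = []
    for graph in graphs:
        for kind, element in split_graph(graph):
            key = (kind, element["id"])
            if key not in seen:
                seen.add(key)
                unique.append((kind, element))
    return unique


# Two rows that both mention the node "acme".
graphs = [
    {"nodes": [{"id": "ann"}, {"id": "acme"}],
     "relationships": [{"id": "ann-WORKS_AT-acme"}]},
    {"nodes": [{"id": "bob"}, {"id": "acme"}],
     "relationships": [{"id": "bob-WORKS_AT-acme"}]},
]
elements = unique_elements(graphs)
```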

Neo4j Importer ➢ Runs a neo4j-import command ➢ Accepts the filenames of CSV files

Kettle using Neo4j

Using Neo4j in Kettle : Logging ➢ Write logging to Neo4j ➢ Builds an execution lineage graph ➢ Updates a metadata graph ➢ Execution details are stored on Job, Job entry, Transformation, Steps, Database levels ➢ Stores graph updates ○ Node creation or update ○ Relationship creation or update

Using Neo4j in Kettle : Logging ➢ Documents the execution process ○ Log text, times, lineage

Using Neo4j in Kettle : Logging ➢ Examine past executions ○ See what went wrong over the weekend ○ Click on a step to see how long it took ○ Examine log texts ○ Generate Cypher queries to examine further ➢ Calculate the delta window ○ Take the last execution without errors into account
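The delta window idea can be sketched as follows: an incremental load should pick up everything since the last execution that finished without errors, so failed runs are skipped when choosing the start of the window. This is a plain Python illustration, not the logging plugin's code. The record layout is hypothetical, and using the start time of the last good run as the window start is an assumption.

```python
from datetime import datetime
from typing import Dict, List, Optional, Tuple


def delta_window(
    executions: List[Dict], now: datetime
) -> Tuple[Optional[datetime], datetime]:
    """Return (window_start, window_end) for the next incremental load.

    window_start is the start time of the most recent execution with zero
    errors (an assumption), or None when no such execution exists and a
    full load is needed.
    """
    successful = [e["started"] for e in executions if e["errors"] == 0]
    window_start = max(successful) if successful else None
    return window_start, now


# Hypothetical execution history: the weekend runs failed.
executions = [
    {"started": datetime(2019, 10, 4, 2, 0), "errors": 0},
    {"started": datetime(2019, 10, 5, 2, 0), "errors": 3},
    {"started": datetime(2019, 10, 6, 2, 0), "errors": 1},
]
window = delta_window(executions, now=datetime(2019, 10, 7, 2, 0))
```

Because the two failed runs are ignored, the window reaches back to the Friday run and the data they missed is loaded again.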

Using Neo4j in Kettle : Logging ➢ Top-to-bottom : find an error ○ Large jobs are hard to debug ○ Sub-jobs and sub-transformations obfuscate ○ Going through logging takes time ○ We know the loaded job or transformation ○ Neo4j can find the shortest path to the lowest execution node without children with errors>0 ○ We can show these shortest paths to the error ○ The user knows in seconds where the error happened and can go straight to it to fix it.
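The top-to-bottom search can be sketched without a database as a breadth-first walk over the execution tree: start at the loaded job, follow only children with errors>0, and stop at the first node that has no failing children. This is an illustration in plain Python rather than the Cypher the tool generates; the node names and the two dictionaries standing in for the execution lineage graph are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional


def path_to_error(
    children: Dict[str, List[str]], errors: Dict[str, int], root: str
) -> Optional[List[str]]:
    """Shortest path from `root` to a failing node with no failing children."""
    if errors.get(root, 0) == 0:
        return None  # nothing went wrong below this root
    queue = deque([[root]])
    while queue:
        path = queue.popleft()
        failing = [c for c in children.get(path[-1], []) if errors.get(c, 0) > 0]
        if not failing:
            return path  # lowest node on this branch that reported errors
        for child in failing:
            queue.append(path + [child])
    return None


# Hypothetical execution tree: a job runs a transformation and a sub-job.
children = {
    "job": ["trans_a", "subjob"],
    "subjob": ["trans_b"],
    "trans_b": ["step_x", "step_y"],
}
errors = {"job": 1, "trans_a": 0, "subjob": 1, "trans_b": 1, "step_x": 0, "step_y": 1}
error_path = path_to_error(children, errors, "job")
```

Breadth-first order means the first path returned is also a shortest one, which matches the "shortest path to the error" wording on the slide.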

Using Neo4j in Kettle : Logging ➢ Bottom-up : how was a component executed ○ We know the step or job entry selected ○ Neo4j can find the shortest path to the root execution node without parents ○ We can show these execution paths ○ The user knows how something was executed ○ Very useful in highly dynamic conditional executions
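The bottom-up direction is the mirror image: from the selected step or job entry, follow the "executed by" links until a node without a parent is reached. A minimal sketch in plain Python, assuming each execution node has at most one parent; the `parent` mapping and the node names are hypothetical stand-ins for the execution lineage graph.

```python
from typing import Dict, List


def path_to_root(parent: Dict[str, str], node: str) -> List[str]:
    """Return the execution path from the root down to `node`."""
    path = [node]
    seen = {node}
    while path[-1] in parent:
        up = parent[path[-1]]
        if up in seen:  # guard against a malformed, cyclic lineage
            raise ValueError(f"cycle detected at {up!r}")
        seen.add(up)
        path.append(up)
    return list(reversed(path))


# Hypothetical lineage: which execution started which.
parent = {"step_y": "trans_b", "trans_b": "subjob", "subjob": "job"}
execution_path = path_to_root(parent, "step_y")
```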

Using Neo4j in Kettle : Logging ➢ Examining executions with browser or Bloom ➢ What exactly executed what, how, when, …? ➢ We generate Cypher for Neo4j beginners ➢ Fun Neo4j learning path for Kettle users

Other data for this audit graph... ➢ Data profiling ➢ Git branches and commit history graph ➢ Transformation unit testing results ➢ Transformation data lineage information ➢ … ➢ Coming soon

Examples

Kettle: Quick Spoon intro

Loading Neo4j: loading nodes ➢ Demonstrates the Neo4j Output step ➢ Read a CSV file in parallel ➢ Load the data into nodes in parallel

Loading Neo4j: update graphs ➢ Demonstrates the Neo4j Graph Output step ➢ Updates multiple nodes and relationships at once ➢ Takes key values into account to ignore nodes ➢ Automatically generates MERGE statements
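To give a feel for "automatically generates MERGE statements", the sketch below builds a parameterized MERGE for one node from a field mapping. It is not the Neo4j Graph Output step's code. The function name and the mapping layout are hypothetical, and reading "takes key values into account to ignore nodes" as "skip a node when one of its key values is missing" is an assumption.

```python
from typing import Dict, List, Optional, Tuple


def merge_node(
    var: str, label: str, keys: List[str], props: List[str], row: Dict
) -> Optional[Tuple[str, Dict]]:
    """Build a MERGE statement and its parameters for one node of one row.

    Returns None when a key value is missing, so the node is ignored
    (assumed behaviour for this sketch).
    """
    if any(row.get(k) is None for k in keys):
        return None
    # Labels cannot be passed as Cypher parameters, so the label is part of the text.
    key_part = ", ".join(f"{k}: ${var}_{k}" for k in keys)
    cypher = f"MERGE ({var}:{label} {{{key_part}}})"
    if props:
        cypher += " SET " + ", ".join(f"{var}.{p} = ${var}_{p}" for p in props)
    params = {f"{var}_{f}": row.get(f) for f in [*keys, *props]}
    return cypher, params


statement = merge_node("p", "Person", ["id"], ["name"], {"id": 7, "name": "Ann"})
skipped = merge_node("p", "Person", ["id"], ["name"], {"id": None, "name": "Bob"})
```

Merging on the key properties only and setting the remaining properties afterwards keeps the MERGE match narrow, so an existing node is updated rather than duplicated when a non-key value changes.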

Sourcing Neo4j: simple reading ➢ Read using a Cypher query ➢ Write to an Excel file

To wrap up...

Take-aways Data Integration for Neo4j using Kettle : ➢ Work faster, tackle harder problems ➢ Reduce risk by showing results faster ➢ Govern your Neo4j solutions using Neo4j

Upcoming Kettle Community Meetup ➢ → kcm19.be ➢ Antwerp ➢ Saturday November 23rd

Join our Slack: kettle-community.slack.com ➢ Mail me for an invite: matt.casters@neo4j.com

The Hunger Games

Hunger Games Questions for "Data Integration for Neo4j using Kettle"
1. Easy: Can you extract information from relational databases using Kettle?
   a. No
   b. Yes, but only a few
   c. Yes, almost all of them
2. Medium: Can I script harder parts of my data orchestration work?
   a. No, Kettle is a visual programming tool
   b. Yes, you can use all popular scripting languages
   c. Yes, you can use JavaScript
3. Hard: Can Kettle work with big data resources?
   a. Yes, Kettle has native support for protocols like S3, HDFS, GS and others.
   b. Yes, a) + Kettle also supports visual Map/Reduce development
   c. Yes, b) + Kettle also supports execution on the Spark, Flink and DataFlow engines
Answer here: r.neo4j.com/hunger-games

Kettle & Neo4j Q&A
