Data Integration for Neo4j using Kettle, by Matt Casters


  1. Data Integration for Neo4j using Kettle ➢ Matt Casters, Neo4j Chief Solutions Architect ➢ matt.casters@neo4j.com ➢ mattcasters

  2. Topics ➢ What is Kettle? ➢ Kettle plugins for Neo4j ➢ Kettle using Neo4j ➢ Examples ➢ The Hunger Games ➢ Q&A

  3. What is Kettle?

  4. Kettle: Introduction ➢ A visual programming tool for data orchestration ➢ A.k.a. Pentaho Data Integration from Hitachi Vantara ➢ Over 15 years old ➢ Open source under the Apache License 2.0 ➢ Large community, marketplace, ... ➢ Easy to embed, install, package, rebrand ➢ Download your Neo4j remix at www.kettle.be

  5. Kettle: Where is it used? ➢ On tiny and enormous systems, real or virtual ➢ Very small computers, Raspberry Pi sized ➢ Your laptop or browser ➢ Locally or in the cloud ➢ On Hadoop clusters, VMs, Docker, serverless, … ➢ At large and small companies ➢ In government ➢ In education ➢ In the Neo4j Solutions Reference Architecture

  6. Kettle: Why is it used? ➢ Reduce costs, reach goals faster ➢ Answers the “build or buy?” question [Figure: accumulated cost over time, with curves labeled build, buy and Kettle]

  7. Kettle: Architecture ➢ Metadata-driven, engine-based: ○ No code generation ○ Define what you need to happen → GUI, Web, code, rules, … ○ Clear and transparent, self-documenting ➢ Types of work: ○ Jobs for workflows ○ Transformations for parallel data streaming

  8. Kettle: Design ➢ 100% Exposure of our engine through UI elements ➢ Everyone should be able to play along: plugins! ➢ We built integration points for others: run everywhere! ➢ Allow the user to avoid programming anything ➢ Allow the user to program anything: JavaScript, Java, Groovy, RegEx, Rules, Python, Ruby, R, … ➢ Transparency wins: best in class logging, data lineage, execution lineage, debugging / breakpoints, data previewing, row sniff testing, …

  9. Other Kettle options available to you... ➢ SpoonGit: UI integration with git ➢ WebSpoon: web interface to the full Spoon UI ➢ Data Sets: build transformation unit tests ➢ Native file system protocols: hdfs://, s3://, gs://, … ➢ Hadoop support through a compatibility layer ➢ Kettle Beam: execute transformations on Apache Spark, Apache Flink and GCP Dataflow

  10. Kettle: The Toolset ➢ Spoon: GUI ➢ Scripts ➢ Server(s) ➢ Java API & SDK ➢ Standard file format ➢ Plugin ecosystem ➢ Docker image(s) ➢ Documentation, books, ...

  11. Architecture [Diagram labels: Version Control System (git); Checkout version; Deploy System (VM, Docker, ...); Configure, setup, initialize, run; Artifacts, graphs, configurations]

  12. Kettle plugins for Neo4j

  13. Plugins: Neo4j Cypher ➢ For reading and writing ➢ Dynamic Cypher ➢ Batching and UNWIND ➢ Parallel execution ➢ High performance ➢ Call procedures
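The "Batching and UNWIND" bullet can be illustrated with a minimal Python sketch. It is not the plugin's implementation: it only shows the general idea of grouping input rows into batches and sending each batch as a single `$rows` list parameter to one `UNWIND` statement. The `Person` label, the `id` and `name` properties, and the helper names are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical statement: label and property names are illustrative only.
CYPHER = (
    "UNWIND $rows AS row "
    "MERGE (p:Person {id: row.id}) "
    "SET p.name = row.name"
)


def batches(rows: Iterable[Dict], batch_size: int) -> Iterator[List[Dict]]:
    """Group rows into lists of at most batch_size elements."""
    batch: List[Dict] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, partially filled batch
        yield batch


def unwind_calls(rows: Iterable[Dict], batch_size: int = 1000) -> Iterator[Tuple[str, Dict]]:
    """Yield one (statement, parameters) pair per batch.

    Each pair could be handed to a Neo4j driver session; the statement stays
    constant and only the $rows parameter changes.
    """
    for batch in batches(rows, batch_size):
        yield CYPHER, {"rows": batch}
```

With 2,500 input rows and a batch size of 1,000, this produces three statement executions instead of 2,500, which is where most of the speed-up of this technique comes from.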

  14. Plugins: Neo4j Output ➢ Easy node creation ➢ Create/Merge of ()-[]-() ➢ Batching and UNWIND ➢ Parallel execution ➢ Dynamic labels

  15. Plugins: Neo4j Graph Output ➢ Update parts of a graph ➢ Auto-generate Cypher ➢ Using a logical model ➢ Using field mapping

  16. Plugins: Check Neo4j Connection ➢ Job Entry ➢ Validate DBs are up ➢ Used in error diagnostics ➢ Defensive setup

  17. Plugins: Neo4j Cypher Script ➢ Job Entry ➢ Executes a series of Cypher statements

  18. Neo4j Generate CSVs ➢ Generate CSV files for Neo4j Import ➢ Generates appropriate header ➢ Handles escaping, quoting, … ➢ Outputs file names
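As a rough illustration of what generating an import-ready node file involves, the sketch below writes a header in the Neo4j import tool's `:ID` / `:LABEL` convention and lets Python's `csv` module handle quoting and escaping. The function name, field names and label are assumptions made for this example; the actual step's options and output may differ.

```python
import csv
import io
from typing import Dict, Iterable, List


def nodes_csv(rows: Iterable[Dict], id_field: str, label: str, properties: List[str]) -> str:
    """Render node rows as CSV text with a Neo4j-import-style header."""
    out = io.StringIO()
    writer = csv.writer(out, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    # Header: the ID column is tagged with :ID, the label column is :LABEL.
    writer.writerow([f"{id_field}:ID"] + list(properties) + [":LABEL"])
    for row in rows:
        writer.writerow([row[id_field]] + [row[p] for p in properties] + [label])
    return out.getvalue()


# Hypothetical sample data containing a comma and double quotes.
sample = nodes_csv([{"id": 1, "name": 'Smith, "Jo"'}], "id", "Person", ["name"])
```

Values containing the delimiter or quote characters are wrapped in double quotes, with embedded quotes doubled, so the file can be parsed back unambiguously.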

  19. Neo4j Split Graph ➢ Splits a graph field into nodes and relationships ➢ Used for unique value calculation

  20. Neo4j Importer ➢ Runs a neo4j-import command ➢ Accepts the filenames of CSV files

  21. Kettle using Neo4j

  22. Using Neo4j in Kettle : Logging ➢ Write logging to Neo4j ➢ Builds an execution lineage graph ➢ Updates a metadata graph ➢ Execution details are stored on Job, Job entry, Transformation, Steps, Database levels ➢ Stores graph updates ○ Node creation or update ○ Relationship creation or update

  23. Using Neo4j in Kettle : Logging ➢ Documents the execution process ○ Log text, times, lineage

  24. Using Neo4j in Kettle : Logging ➢ Examine past executions ○ See what went wrong over the weekend ○ Click on a step to see how long it took ○ Examine log texts ○ Generate Cypher queries to examine further ➢ Calculate delta window ○ Take the last execution without errors into account
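The delta-window idea can be sketched in a few lines of Python. This is an assumption about the general technique, not the logging plugin's code: the next incremental load starts where the last error-free execution ended, so that data covered only by failed runs is picked up again. The tuple layout and function name are hypothetical.

```python
from datetime import datetime
from typing import List, Optional, Tuple


def delta_window(
    executions: List[Tuple[datetime, int]], now: datetime
) -> Tuple[Optional[datetime], datetime]:
    """Return (start, end) for the next incremental load.

    executions holds (end_time, error_count) pairs from the execution log.
    start is the end time of the most recent run with zero errors, or None
    when no successful run exists (meaning: load everything).
    """
    successful = [end for end, errors in executions if errors == 0]
    start = max(successful) if successful else None
    return start, now


# Hypothetical history: Friday succeeded, Saturday and Sunday failed.
history = [
    (datetime(2019, 11, 1, 2, 0), 0),
    (datetime(2019, 11, 2, 2, 0), 3),
    (datetime(2019, 11, 3, 2, 0), 1),
]
window = delta_window(history, datetime(2019, 11, 4, 2, 0))
```

Here the window starts at the Friday run, because the two later runs ended with errors.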

  25. Using Neo4j in Kettle : Logging ➢ Top-to-bottom : find an error ○ Large jobs are hard to debug ○ Sub-jobs and sub-transformations obfuscate ○ Going through logging takes time ○ We know the loaded job or transformation ○ Neo4j can find the shortest path to the lowest execution node with errors > 0 that has no children with errors > 0 ○ We can show these shortest paths to the error ○ The user knows in seconds where the error happened and can go straight to it to fix it.
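The top-to-bottom search can be mimicked outside the database to make the logic concrete. The following sketch uses a breadth-first search over an in-memory execution tree; in the setup described on the slide, Neo4j does this with a shortest-path query over the logging graph. The dictionaries and node names are hypothetical.

```python
from collections import deque
from typing import Dict, List, Optional


def path_to_error(
    children: Dict[str, List[str]], errors: Dict[str, int], root: str
) -> Optional[List[str]]:
    """Shortest path from root to a failed node that has no failed children."""
    queue = deque([[root]])
    seen = {root}
    while queue:
        path = queue.popleft()
        node = path[-1]
        kids = children.get(node, [])
        failed = errors.get(node, 0) > 0
        if failed and not any(errors.get(k, 0) > 0 for k in kids):
            return path  # lowest node where the error originated
        for kid in kids:
            if kid not in seen:
                seen.add(kid)
                queue.append(path + [kid])
    return None  # no errors anywhere below root


# Hypothetical execution: errors bubble up from a step to its parents.
children = {"job": ["load", "report"], "load": ["read csv", "write nodes"]}
errors = {"job": 1, "load": 1, "write nodes": 1}
error_path = path_to_error(children, errors, "job")
```

Because errors propagate upward, parents of a failing step also report errors; the search skips them and stops at the deepest failing node.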

  26. Using Neo4j in Kettle : Logging ➢ Bottom-up : how was a component executed ○ We know the step or job entry selected ○ Neo4j can find the shortest path to the root execution node without parents ○ We can show these execution paths ○ The user knows how something was executed ○ Very useful in highly dynamic conditional executions
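The bottom-up direction is the mirror image: start at the selected step or job entry and follow parent links until a node without a parent is reached. A minimal sketch, assuming each execution node has at most one parent and using hypothetical names:

```python
from typing import Dict, List


def execution_path(parent: Dict[str, str], node: str) -> List[str]:
    """Return the path from the root execution node down to the given node."""
    path = [node]
    visited = {node}
    while path[-1] in parent:
        nxt = parent[path[-1]]
        if nxt in visited:  # guard against malformed, cyclic input
            raise ValueError("cycle in parent links")
        visited.add(nxt)
        path.append(nxt)
    path.reverse()  # root first, selected node last
    return path


# Hypothetical parent links from the execution log.
parent = {"write nodes": "load", "load": "job"}
how_executed = execution_path(parent, "write nodes")
```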

  27. Using Neo4j in Kettle : Logging ➢ Examining executions with browser or Bloom ➢ What exactly executed what, how, when, …? ➢ We generate Cypher for Neo4j beginners ➢ Fun Neo4j learning path for Kettle users

  28. Other data for this audit graph... ➢ Data profiling ➢ Git branches and commit history graph ➢ Transformation unit testing results ➢ Transformation data lineage information ➢ … ➢ Coming soon

  29. Examples

  30. Kettle: Quick Spoon intro

  31. Loading Neo4j: loading nodes ➢ Demonstrates the Neo4j Output step ➢ Read a CSV file in parallel ➢ Load the data into nodes in parallel

  32. Loading Neo4j: update graphs ➢ Demonstrates the Neo4j Graph Output step ➢ Updates multiple nodes and relationships at once ➢ Takes key values into account to ignore nodes ➢ Automatically generates MERGE statements
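To give a feel for automatic `MERGE` generation, here is a small sketch that builds a parameterized statement from a field mapping: key fields go into the `MERGE` pattern, the remaining fields into a `SET` clause. The exact Cypher produced by the Neo4j Graph Output step is not shown in the slides, so the statement shape, function name and field names here are assumptions.

```python
from typing import List


def merge_statement(label: str, key_fields: List[str], other_fields: List[str]) -> str:
    """Build a parameterized MERGE statement for one node type."""
    if not key_fields:
        raise ValueError("at least one key field is required for MERGE")
    # Key fields identify the node, so they belong inside the MERGE pattern.
    keys = ", ".join(f"{k}: ${k}" for k in key_fields)
    statement = f"MERGE (n:{label} {{{keys}}})"
    # Non-key fields are updated after the node is matched or created.
    if other_fields:
        sets = ", ".join(f"n.{f} = ${f}" for f in other_fields)
        statement += f" SET {sets}"
    return statement


# Hypothetical mapping: id is the key, name and age are plain properties.
generated = merge_statement("Person", ["id"], ["name", "age"])
```

Keeping only the key fields in the `MERGE` pattern avoids creating a duplicate node when a non-key property changes between loads.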

  33. Sourcing Neo4j: simple reading ➢ Read using a Cypher query ➢ Write to an Excel file

  34. To wrap up...

  35. Take-aways Data Integration for Neo4j using Kettle : ➢ Work faster, tackle harder problems ➢ Reduce risk by showing results faster ➢ Govern your Neo4j solutions using Neo4j

  36. Upcoming Kettle Community Meetup ➢ → kcm19.be ➢ Antwerp ➢ Saturday November 23rd

  37. Join our Slack: kettle-community.slack.com ➢ Mail me for an invite: matt.casters@neo4j.com

  38. The Hunger Games

  39. Hunger Games Questions for "Data Integration for Neo4j using Kettle" 1. Easy: Can you extract information from relational databases using Kettle? a. No b. Yes, but only from a few c. Yes, from almost all of them 2. Medium: Can I script harder parts of my data orchestration work? a. No, Kettle is a visual programming tool b. Yes, you can use all popular scripting languages c. Yes, you can use JavaScript 3. Hard: Can Kettle work with big data resources? a. Yes, Kettle has native support for protocols like S3, HDFS, GS and others b. Yes, a) + Kettle also supports visual Map/Reduce development c. Yes, b) + Kettle also supports execution on the Spark, Flink and DataFlow engines Answer here: r.neo4j.com/hunger-games

  40. Kettle & Neo4j Q&A