Dataiku Flow and dctc: Data pipelines made easy
Berlin Buzzwords 2013
About me
Clément Stenac <clement.stenac@dataiku.com> @ClementStenac
◦ CTO @ Dataiku
◦ Head of product R&D @ Exalead (Search Engine Technology)
◦ OSS developer @ VLC, Debian and OpenStreetMap
Agenda
◦ The hard life of a Data Scientist
◦ Dataiku Flow
◦ DCTC
◦ Lunch!
Follow the Flow
[Diagram: a real-life pipeline with Sync In / Sync Out stages spanning Hive, Pig, Python, MongoDB, MySQL, FTP, ElasticSearch and S3; nodes include Tracker Log, Session, Customer Profile, Product Recommender, Product Catalog, Category Affinity, Order, Category Targeting, Apache Logs, Partner FTP, Syslog, (External) Search Engine Optimization from Search Logs, and (Internal) Search Ranking]
Zooming in
[Diagram: the recommendation subgraph; Page Views are filtered for bots and special users, feeding User Affinity and User Similarity (per category and per brand); the Catalog feeds Product Popularity; Orders feed an Order Summary; everything joins into a Recommendation Graph used by Machine Learning to produce the Recommendation]
Real-life data pipelines
◦ Many tasks and tools
◦ Dozens of stages, evolving daily
◦ Exceptional situations are the norm
Many pains:
◦ Shared schemas
◦ Efficient incremental synchronization and computation
◦ Data is bad
An evolution similar to build systems
Build tools: 1970 shell scripts, 1977 Makefile, 1980 Makedeps, 1999 SCons/CMake, 2001 Maven, …
Data pipelines: shell scripts, 2008 HaMake (better dependencies), 2009 Oozie (higher-level tasks), ETLs, … Next?
Agenda
◦ The hard life of a Data Scientist
◦ Dataiku Flow
◦ DCTC
◦ Lunch!
Introduction to Flow
Dataiku Flow is a data-driven orchestration framework for complex data pipelines.
◦ Manage data, not steps and tasks
Simplifies common maintenance situations:
◦ Data rebuilds
◦ Processing step updates
Handles real day-to-day pains:
◦ Data validity checks
◦ Transfers between systems
Concepts: Dataset
Like a table: contains records, with a schema
Can be partitioned:
◦ Time partitioning (by day, by hour, …)
◦ "Value" partitioning (by country, by partner, …)
Various backends:
◦ SQL
◦ Filesystem
◦ NoSQL (MongoDB, …)
◦ HDFS
◦ Cloud storage
◦ ElasticSearch
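As a rough sketch of what time partitioning can look like on a filesystem backend (the path and directory layout below are assumptions for illustration, not Flow's documented convention), a day-partitioned dataset is simply one directory per partition:

# Illustrative layout for a day-partitioned dataset on HDFS
# (path and naming are assumptions, not Flow's documented convention)
% hadoop fs -ls /data/tracker_logs
# /data/tracker_logs/2013-01-23
# /data/tracker_logs/2013-01-24
# /data/tracker_logs/2013-01-25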
Concepts: Task
Has input datasets and output datasets
[Diagram: an "Aggregate Visits" task producing "Weekly Visits aggregation" and "Daily Customers aggregation" outputs]
Declares dependencies from input to output
Built-in tasks with strong integration:
◦ Pig
◦ Hive
◦ Python Pandas & SciKit
◦ Data transfers
Customizable tasks:
◦ Shell script, Java, …
Introduction to Flow: A sample Flow
[Diagram: web tracker logs (browsers, referers) are cleaned by a Shaker task ("cleanlogs") and aggregated by Pig ("aggr_visits") into clean visits; a CRM table is cleaned by another Shaker task ("enrich_cust") and joined with the visits by Hive ("customer_visits"); Pig then derives each customer's last products ("customer_last_product")]
Data-oriented
Flow is data-oriented:
◦ Don't ask "Run task A and then task B"
◦ Don't even ask "Run all tasks that depend on task A"
◦ Ask "Do what's needed so that my aggregated customers data for 2013/01/25 is up to date"
Flow manages dependencies between datasets, through tasks. You don't execute tasks; you compute or refresh datasets.
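As a hypothetical sketch of such a request (the "flow build" command name and flags are invented here for illustration; the deck does not show Flow's actual syntax):

# Hypothetical syntax: ask for a dataset partition to be up to date,
# not for specific tasks to run
% flow build aggregated_customers --partition 2013/01/25
# Flow walks the dependency graph, finds stale upstream partitions,
# and runs only the tasks needed to refresh them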
Partition-level dependencies
[Diagram: wtlogs → Shaker ("cleantask1") → cleanlog → Pig ("aggr_visits") → weekly_aggr, with a sliding_days(7) dependency]
◦ "wtlogs" and "cleanlog" are day-partitioned
◦ "weekly_aggr" needs the previous 7 days of clean logs: a "sliding days" partition-level dependency
"Compute weekly_aggr for 2012-01-25":
◦ Automatically computes the required 7 partitions
◦ For each partition, checks whether cleanlog is up to date wrt. the wtlogs partition
◦ Performs cleantask1 in parallel for all missing / stale days
◦ Performs aggr_visits with the 7 partitions as input
Automatic parallelism
◦ Flow computes the global DAG of required activities
◦ It determines which activities can take place in parallel
Previous example: 8 activities
◦ 7 can be parallelized
◦ 1 requires the other 7 first
◦ Manages running activities
◦ Starts new activities based on available resources
Schema and data validity checks
Datasets have a schema, available in all tools
Advanced verification of computed data (sketched below):
◦ "Check that output is not empty"
◦ "Check that this custom query returns between X and Y records"
◦ "Check that this specific record is found in output"
◦ "Check that the number of computed records for day B differs from day A by no more than 40%"
Automatic tests for data pipelines
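As a sketch of what the day-over-day volume check amounts to (illustrative shell; the paths and one-record-per-line layout are assumptions, and this is not Flow's actual check syntax):

# Illustrative: fail if day B's record count differs from day A's by >40%
a=$(wc -l < out/2013-01-24/records.csv)   # records for day A
b=$(wc -l < out/2013-01-25/records.csv)   # records for day B
awk -v a="$a" -v b="$b" 'BEGIN { d = (b > a ? b - a : a - b); exit !(d <= 0.4 * a) }' \
  && echo "volume check passed" || echo "volume check FAILED"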
Integrated in Hadoop, open beyond
◦ Native knowledge of Pig and Hive formats
◦ Schema-aware loaders and storages
A great ecosystem, but not omnipotent:
◦ Not everything requires Hadoop's strong points
Hadoop is a first-class citizen of Flow, but not the only one:
◦ Native integration of SQL capabilities
◦ Automatic incremental synchronization to/from MongoDB, Vertica, ElasticSearch, …
◦ Custom tasks
What about Oozie and HCatalog?
Are we there yet?
◦ Engine and core tasks are working
◦ Under active development towards the first betas
Get more info and stay informed: http://flowbeta.dataiku.com
And while you wait, another thing: ever been annoyed by data transfers?
Feel the pain
Agenda
◦ The hard life of a Data Scientist
◦ Dataiku Flow
◦ DCTC
◦ Lunch!
DCTC: Cloud data manipulation
◦ Extracted from the core of Flow
◦ Manipulate files across filesystems

# List the files and folders in an S3 bucket
% dctc ls s3://my-bucket

# Synchronize incrementally from GCS to a local folder
% dctc sync gs://my-bucket/my-path target-directory

# Copy from GCS to HDFS, compressing to .gz on the fly
# (decompression is handled too)
% dctc cp -R -c gs://my-bucket/my-path hdfs:///data/input

# Dispatch the lines of a file to 8 files on S3, gzip-compressed
% dctc dispatch input s3://bucket/target -f random -nf 8 -c
DCTC: More examples

# cat from anywhere
% dctc cat ftp://account@:/pub/data/data.csv

# Multi-account aware
% dctc sync s3://account1@path s3://account2@other_path

# Edit a remote file (with $EDITOR)
% dctc edit ssh://account@:myfile.txt

# Transparently unzip
% dctc

# Head / tail from the cloud
% dctc tail s3://bucket/huge-log.csv
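Since dctc sync is incremental, it slots naturally into scheduled jobs; a hypothetical crontab entry (the paths and schedule are invented for illustration):

# Hypothetical cron entry: nightly incremental backup of HDFS output to S3
0 2 * * * dctc sync hdfs:///data/output s3://my-bucket/backups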
Try it now: http://dctc.io
Self-contained binary for Linux, OS X, Windows
Supported backends:
◦ Amazon S3
◦ Google Cloud Storage
◦ HDFS (through local install)
◦ HTTP
◦ SSH
◦ FTP
Questions?
Florian Douetteau, Chief Executive Officer: florian.douetteau@dataiku.com, +33 6 70 56 88 97, @fdouetteau
Marc Batty, Chief Customer Officer: marc.batty@dataiku.com, +33 6 45 65 67 04, @battymarc
Thomas Cabrol, Chief Data Scientist: thomas.cabrol@dataiku.com, +33 7 86 42 62 81, @ThomasCabrol
Clément Stenac, Chief Technical Officer: clement.stenac@dataiku.com, +33 6 28 06 79 04, @ClementStenac