Dataiku Flow and dctc: Data pipelines made easy
Berlin Buzzwords 2013
About me
Clément Stenac <clement.stenac@dataiku.com> @ClementStenac
◦ CTO @ Dataiku
◦ Head of product R&D @ Exalead (Search Engine Technology)
◦ OSS developer @ VLC, Debian and OpenStreetMap
Agenda
◦ The hard life of a Data Scientist
◦ Dataiku Flow
◦ DCTC
◦ Lunch!
Follow the Flow
[Diagram: a real-life pipeline with Sync In / Sync Out stages spanning Hive, Pig, Python, MongoDB, MySQL, FTP, ElasticSearch and S3; nodes include Tracker Log, Session, Customer Profile, Product Recommender, Product Catalog, Category Affinity, Order, Category Targeting, Apache Logs, Partner FTP, Syslog, (External) Search Engine Optimization from Search Logs, and (Internal) Search Ranking]
Zooming in
[Diagram: the recommendation subgraph; Page Views are filtered for bots and special users, feeding User Affinity and User Similarity (per category and per brand); the Catalog feeds Product Popularity; Orders feed an Order Summary; everything joins into a Recommendation Graph used by Machine Learning to produce the Recommendation]
Real-life data pipelines
◦ Many tasks and tools
◦ Dozens of stages, evolving daily
◦ Exceptional situations are the norm
Many pains:
◦ Shared schemas
◦ Efficient incremental synchronization and computation
◦ Data is bad
An evolution similar to build systems
Build tools: 1970 shell scripts, 1977 Makefile, 1980 Makedeps, 1999 SCons/CMake, 2001 Maven, …
Data pipelines: shell scripts, 2008 HaMake (better dependencies), 2009 Oozie (higher-level tasks), ETLs, … Next?
Agenda
◦ The hard life of a Data Scientist
◦ Dataiku Flow
◦ DCTC
◦ Lunch!
Introduction to Flow
Dataiku Flow is a data-driven orchestration framework for complex data pipelines.
◦ Manage data, not steps and tasks
Simplifies common maintenance situations:
◦ Data rebuilds
◦ Processing step updates
Handles real day-to-day pains:
◦ Data validity checks
◦ Transfers between systems
Concepts: Dataset
Like a table: contains records, with a schema
Can be partitioned:
◦ Time partitioning (by day, by hour, …)
◦ "Value" partitioning (by country, by partner, …)
Various backends:
◦ SQL
◦ Filesystem
◦ NoSQL (MongoDB, …)
◦ HDFS
◦ Cloud storage
◦ ElasticSearch
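As a rough sketch of what time partitioning can look like on a filesystem backend (the path and directory layout below are assumptions for illustration, not Flow's documented convention), a day-partitioned dataset is simply one directory per partition:

# Illustrative layout for a day-partitioned dataset on HDFS
# (path and naming are assumptions, not Flow's documented convention)
% hadoop fs -ls /data/tracker_logs
# /data/tracker_logs/2013-01-23
# /data/tracker_logs/2013-01-24
# /data/tracker_logs/2013-01-25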
Concepts: Task
Has input datasets and output datasets
[Diagram: an "Aggregate Visits" task producing "Weekly Visits aggregation" and "Daily Customers aggregation" outputs]
Declares dependencies from input to output
Built-in tasks with strong integration:
◦ Pig
◦ Hive
◦ Python Pandas & SciKit
◦ Data transfers
Customizable tasks:
◦ Shell script, Java, …
Introduction to Flow: A sample Flow
[Diagram: web tracker logs (browsers, referers) are cleaned by a Shaker task ("cleanlogs") and aggregated by Pig ("aggr_visits") into clean visits; a CRM table is cleaned by another Shaker task ("enrich_cust") and joined with the visits by Hive ("customer_visits"); Pig then derives each customer's last products ("customer_last_product")]
Data-oriented
Flow is data-oriented:
◦ Don't ask "Run task A and then task B"
◦ Don't even ask "Run all tasks that depend on task A"
◦ Ask "Do what's needed so that my aggregated customers data for 2013/01/25 is up to date"
Flow manages dependencies between datasets, through tasks. You don't execute tasks; you compute or refresh datasets.
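As a hypothetical sketch of such a request (the "flow build" command name and flags are invented here for illustration; the deck does not show Flow's actual syntax):

# Hypothetical syntax: ask for a dataset partition to be up to date,
# not for specific tasks to run
% flow build aggregated_customers --partition 2013/01/25
# Flow walks the dependency graph, finds stale upstream partitions,
# and runs only the tasks needed to refresh them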
Partition-level dependencies
[Diagram: wtlogs → Shaker ("cleantask1") → cleanlog → Pig ("aggr_visits") → weekly_aggr, with a sliding_days(7) dependency]
◦ "wtlogs" and "cleanlog" are day-partitioned
◦ "weekly_aggr" needs the previous 7 days of clean logs: a "sliding days" partition-level dependency
"Compute weekly_aggr for 2012-01-25":
◦ Automatically computes the required 7 partitions
◦ For each partition, checks whether cleanlog is up to date wrt. the wtlogs partition
◦ Performs cleantask1 in parallel for all missing / stale days
◦ Performs aggr_visits with the 7 partitions as input
Automatic parallelism
◦ Flow computes the global DAG of required activities
◦ It determines which activities can take place in parallel
Previous example: 8 activities
◦ 7 can be parallelized
◦ 1 requires the other 7 first
◦ Manages running activities
◦ Starts new activities based on available resources
Schema and data validity checks
Datasets have a schema, available in all tools
Advanced verification of computed data (sketched below):
◦ "Check that output is not empty"
◦ "Check that this custom query returns between X and Y records"
◦ "Check that this specific record is found in output"
◦ "Check that the number of computed records for day B differs from day A by no more than 40%"
Automatic tests for data pipelines
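As a sketch of what the day-over-day volume check amounts to (illustrative shell; the paths and one-record-per-line layout are assumptions, and this is not Flow's actual check syntax):

# Illustrative: fail if day B's record count differs from day A's by >40%
a=$(wc -l < out/2013-01-24/records.csv)   # records for day A
b=$(wc -l < out/2013-01-25/records.csv)   # records for day B
awk -v a="$a" -v b="$b" 'BEGIN { d = (b > a ? b - a : a - b); exit !(d <= 0.4 * a) }' \
  && echo "volume check passed" || echo "volume check FAILED"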
Integrated in Hadoop, open beyond
◦ Native knowledge of Pig and Hive formats
◦ Schema-aware loaders and storages
A great ecosystem, but not omnipotent:
◦ Not everything requires Hadoop's strong points
Hadoop is a first-class citizen of Flow, but not the only one:
◦ Native integration of SQL capabilities
◦ Automatic incremental synchronization to/from MongoDB, Vertica, ElasticSearch, …
◦ Custom tasks
What about Oozie and HCatalog?
Are we there yet?
◦ Engine and core tasks are working
◦ Under active development towards the first betas
Get more info and stay informed: http://flowbeta.dataiku.com
And while you wait, another thing: ever been annoyed by data transfers?
Feel the pain
Agenda
◦ The hard life of a Data Scientist
◦ Dataiku Flow
◦ DCTC
◦ Lunch!
DCTC: Cloud data manipulation
◦ Extracted from the core of Flow
◦ Manipulate files across filesystems

# List the files and folders in an S3 bucket
% dctc ls s3://my-bucket

# Synchronize incrementally from GCS to a local folder
% dctc sync gs://my-bucket/my-path target-directory

# Copy from GCS to HDFS, compressing to .gz on the fly
# (decompression is handled too)
% dctc cp -R -c gs://my-bucket/my-path hdfs:///data/input

# Dispatch the lines of a file to 8 files on S3, gzip-compressed
% dctc dispatch input s3://bucket/target -f random -nf 8 -c
DCTC: More examples

# cat from anywhere
% dctc cat ftp://account@:/pub/data/data.csv

# Multi-account aware
% dctc sync s3://account1@path s3://account2@other_path

# Edit a remote file (with $EDITOR)
% dctc edit ssh://account@:myfile.txt

# Transparently unzip
% dctc

# Head / tail from the cloud
% dctc tail s3://bucket/huge-log.csv
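Since dctc sync is incremental, it slots naturally into scheduled jobs; a hypothetical crontab entry (the paths and schedule are invented for illustration):

# Hypothetical cron entry: nightly incremental backup of HDFS output to S3
0 2 * * * dctc sync hdfs:///data/output s3://my-bucket/backups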
Try it now: http://dctc.io
Self-contained binary for Linux, OS X, Windows
Supported backends:
◦ Amazon S3
◦ Google Cloud Storage
◦ HDFS (through local install)
◦ HTTP
◦ SSH
◦ FTP
Questions?
Florian Douetteau, Chief Executive Officer: florian.douetteau@dataiku.com, +33 6 70 56 88 97, @fdouetteau
Marc Batty, Chief Customer Officer: marc.batty@dataiku.com, +33 6 45 65 67 04, @battymarc
Thomas Cabrol, Chief Data Scientist: thomas.cabrol@dataiku.com, +33 7 86 42 62 81, @ThomasCabrol
Clément Stenac, Chief Technical Officer: clement.stenac@dataiku.com, +33 6 28 06 79 04, @ClementStenac