  1. Dataiku Flow and dctc: Data pipelines made easy. Berlin Buzzwords 2013

  2. About me Clément Stenac <clement.stenac@dataiku.com> @ClementStenac  CTO @ Dataiku  Head of product R&D @ Exalead (Search Engine Technology)  OSS developer @ VLC, Debian and OpenStreetMap

  3.  The hard life of a Data Scientist  Dataiku Flow  DCTC  Lunch!

  4. Follow the Flow [Flow diagram: data synced in and out across many systems. Tracker logs are built into sessions (Hive) and customer profiles (Pig), feeding a product recommender and product transformation synced with MongoDB; a catalog in MySQL, together with orders, feeds category affinity (Hive) and category targeting (Python, MySQL); Apache logs arrive by syslog from a partner FTP and are processed with Pig; external search logs drive search engine optimization; internal search ranking is synced between ElasticSearch and S3]

  5. Zooming in [Flow diagram: page views are filtered for bots and special users; filtered page views feed user affinity and user similarity (per category and per brand); together with the catalog, product popularity and an order summary derived from orders, these build a recommendation graph from which a machine learning step produces recommendations]

  6. Real-life data pipelines  Many tasks and tools  Dozens of stages, evolving daily  Exceptional situations are the norm  Many pains ◦ Shared schemas ◦ Efficient incremental synchronization and computation ◦ Data is bad

  7. An evolution similar to build tools  Build: 1970 Shell scripts  1977 Makefile  1980 Makedeps  1999 SCons/CMake  2001 Maven  …  Data pipelines: Shell scripts  2008 HaMake (better dependencies)  2009 Oozie (higher-level tasks)  ETLs, …  Next?

  8.  The hard life of a Data Scientist  Dataiku Flow  DCTC  Lunch!

  9. Introduction to Flow Dataiku Flow is a data-driven orchestration framework for complex data pipelines  Manage data, not steps and tasks  Simplify common maintenance situations ◦ Data rebuilds ◦ Processing step updates  Handle real day-to-day pains ◦ Data validity checks ◦ Transfers between systems

  10. Concepts: Dataset  Like a table: contains records, with a schema  Can be partitioned ◦ Time partitioning (by day, by hour, …) ◦ « Value » partitioning (by country, by partner, …)  Various backends ◦ SQL ◦ Filesystem ◦ NoSQL (MongoDB, …) ◦ HDFS ◦ Cloud storage ◦ ElasticSearch
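The deck doesn't show Flow's configuration syntax, so here is a purely illustrative Python sketch of the Dataset concept; the Dataset class and every field name below are invented for this note, not Flow's real API:

    # Hypothetical sketch only: these names are invented, not Flow's real API.
    from dataclasses import dataclass

    @dataclass
    class Dataset:
        name: str
        backend: str            # "sql", "filesystem", "mongodb", "hdfs", "s3", ...
        schema: list            # ordered (column, type) pairs, shared by all tools
        partitioning: str = ""  # "day", "hour", or a value dimension like "country"

    # A day-partitioned dataset of raw tracker logs stored on HDFS.
    tracker_logs = Dataset(
        name="tracker_logs",
        backend="hdfs",
        schema=[("timestamp", "string"), ("user_id", "string"), ("url", "string")],
        partitioning="day",
    )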

  11. Concepts: Task  Has input datasets and output datasets [diagram: an « Aggregate visits » task reads visits and produces a weekly visits aggregation and a daily customers aggregation]  Declares dependencies from inputs to outputs  Built-in tasks with strong integration ◦ Pig ◦ Python Pandas & SciKit ◦ Hive ◦ Data transfers  Customizable tasks ◦ Shell script, Java, …
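Continuing in the same purely illustrative style (Task and its fields are invented names, not Flow's real API), a task declares which datasets it reads and writes, rather than which tasks precede it:

    # Hypothetical sketch only: not Flow's real API.
    from dataclasses import dataclass

    @dataclass
    class Task:
        name: str
        kind: str      # "pig", "hive", "python", "transfer", "shell", ...
        inputs: list   # names of the input datasets
        outputs: list  # names of the output datasets it keeps up to date

    aggregate_visits = Task(
        name="aggregate_visits",
        kind="pig",
        inputs=["visits"],               # day-partitioned input dataset
        outputs=["weekly_visits_aggr"],  # weekly aggregation it produces
    )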

  12. Introduction to Flow: a sample Flow [diagram: web tracker logs (browsers, referrers) are cleaned by a « cleanlogs » task (Shaker/Pig) into clean visit logs and aggregated by an « aggr_visits » task (Hive); an « enrich_cust » task (Shaker) cleans the last CRM table and joins customers with their visits into « customer_visits »; a Pig task derives « customer_last_product » from the last CRM table and products]

  13. Data-oriented Flow is data-oriented  Don't ask « Run task A and then task B »  Don't even ask « Run all tasks that depend on task A »  Ask « Do what's needed so that my aggregated customers data for 2013/01/25 is up to date »  Flow manages dependencies between datasets, through tasks  You don't execute tasks, you compute or refresh datasets

  14. Partition-level dependencies [diagram: day-partitioned wtlogs → cleantask1 (Shaker) → cleanlog → « aggr_visits » (Pig) with a sliding_days(7) dependency → weekly_aggr]  "wtlogs" and "cleanlog" are day-partitioned  "weekly_aggr" needs the previous 7 days of clean logs  "sliding days" partition-level dependency  "Compute weekly_aggr for 2013-01-25" ◦ Automatically computes the required 7 partitions ◦ For each partition, checks whether cleanlog is up to date wrt. the wtlogs partition ◦ Performs cleantask1 in parallel for all missing / stale days ◦ Performs aggr_visits with the 7 partitions as input
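To make the sliding-window dependency concrete, here is a minimal sketch of how such a partition-level dependency could be resolved; all names (sliding_days, plan_weekly_aggr, is_stale) are invented for illustration and do not reflect Flow's implementation:

    # Hypothetical sketch of partition-level dependency resolution.
    from datetime import date, timedelta

    def sliding_days(n):
        """Map an output partition (a day) to the n input partitions it needs."""
        def deps(day):
            return [day - timedelta(days=i) for i in range(n)]
        return deps

    def plan_weekly_aggr(target_day, is_stale):
        """Return the needed clean-log partitions and those to (re)build first."""
        needed = sliding_days(7)(target_day)
        to_rebuild = [d for d in needed if is_stale(d)]  # missing, or older than wtlogs
        return needed, to_rebuild

    # Build weekly_aggr for 2013-01-25, pretending two days are stale:
    stale = {date(2013, 1, 21), date(2013, 1, 24)}
    needed, rebuild = plan_weekly_aggr(date(2013, 1, 25), lambda d: d in stale)
    # 'rebuild' runs in parallel; the aggregation then reads all 7 partitions.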

  15. Automatic parallelism  Flow computes the global DAG of required activities  Determines which activities can run in parallel  Previous example: 8 activities ◦ 7 can be parallelized ◦ 1 requires the other 7 first  Manages running activities  Starts new activities based on available resources
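The scheduling idea maps directly onto a topological sort of the DAG. A minimal, self-contained sketch using Python's standard graphlib module (the DAG below re-creates the previous example; this is not Flow's code):

    # Run every activity whose dependencies are all done, wave by wave.
    from graphlib import TopologicalSorter  # standard library, Python 3.9+

    dag = {f"clean_day_{i}": set() for i in range(7)}  # 7 independent cleanings
    dag["aggr_visits"] = set(dag)                      # needs all 7 of them

    ts = TopologicalSorter(dag)
    ts.prepare()
    while ts.is_active():
        ready = list(ts.get_ready())       # everything runnable in parallel now
        print("run in parallel:", ready)   # wave 1: 7 cleanings; wave 2: the aggregation
        for activity in ready:             # a real scheduler would also track resources
            ts.done(activity)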

  16. Schema and data validity checks  Datasets have a schema, available in all tools  Advanced verification of computed data ◦ "Check that output is not empty" ◦ "Check that this custom query returns between X and Y records" ◦ "Check that this specific record is found in the output" ◦ "Check that the number of computed records for day B differs from day A by no more than 40%"  Automatic tests for data pipelines
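The checks quoted above are easy to picture as assertions over the computed output. A hypothetical sketch (the slide describes the checks but not their API, so these helper names are invented):

    # Invented check helpers; only the rules themselves come from the slide.
    def check_not_empty(records):
        assert len(records) > 0, "output is empty"

    def check_count_between(records, lo, hi):
        assert lo <= len(records) <= hi, f"expected between {lo} and {hi} records"

    def check_day_over_day(count_b, count_a, max_rel_diff=0.40):
        # Day B must not differ from day A by more than 40%.
        assert abs(count_b - count_a) <= max_rel_diff * count_a, \
            "day-over-day record count drifted by more than 40%"

    check_day_over_day(count_b=10_500, count_a=10_000)  # passes: 5% drift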

  17. Integrated in Hadoop, open beyond  Native knowledge of Pig and Hive formats  Schema-aware loaders and storers  A great ecosystem, but not omnipotent ◦ Not everything requires Hadoop's strong points  Hadoop is a first-class citizen of Flow, but not the only one  Native integration of SQL capabilities  Automatic incremental synchronization to/from MongoDB, Vertica, ElasticSearch, …  Custom tasks

  18. What about Oozie and HCatalog?

  19. Are we there yet?  Engine and core tasks are working  Under active development towards the first betas  Get more info and stay informed: http://flowbeta.dataiku.com  And while you wait, another thing: ever been annoyed by data transfers?

  20. Feel the pain

  21.  The hard life of a Data Scientist  Dataiku Flow  DCTC  Lunch!

  22. DCTC: Cloud data manipulation  Extracted from the core of Flow  Manipulate files across filesystems

    # List the files and folders in an S3 bucket
    % dctc ls s3://my-bucket

    # Synchronize incrementally from GCS to a local folder
    % dctc sync gs://my-bucket/my-path target-directory

    # Copy from GCS to HDFS, compress to .gz on the fly
    # (decompression handled too)
    % dctc cp -R -c gs://my-bucket/my-path hdfs:///data/input

    # Dispatch the lines of a file to 8 files on S3, gzip-compressed
    % dctc dispatch input s3://bucket/target -f random -nf 8 -c

  23. DCTC: More examples

    # cat from anywhere
    % dctc cat ftp://account@:/pub/data/data.csv

    # Multi-account aware
    % dctc sync s3://account1@path s3://account2@other_path

    # Edit a remote file (with $EDITOR)
    % dctc edit ssh://account@:myfile.txt

    # Transparently unzip
    % dctc

    # Head / tail from the cloud
    % dctc tail s3://bucket/huge-log.csv

  24. Try it now http://dctc.io  Self-contained binary for Linux, OS X, Windows  Amazon S3  HTTP  Google Cloud Storage  SSH  FTP  HDFS (through local install)

  25. Questions?
  Florian Douetteau, Chief Executive Officer, florian.douetteau@dataiku.com, +33 6 70 56 88 97, @fdouetteau
  Marc Batty, Chief Customer Officer, marc.batty@dataiku.com, +33 6 45 65 67 04, @battymarc
  Thomas Cabrol, Chief Data Scientist, thomas.cabrol@dataiku.com, +33 7 86 42 62 81, @ThomasCabrol
  Clément Stenac, Chief Technical Officer, clement.stenac@dataiku.com, +33 6 28 06 79 04, @ClementStenac
