
Datalab: A Version Data Management And Analytics System



  1. Datalab: A Version Data Management And Analytics System
     Yang Zhang, Fangzhou Xu, Erwin Frise, Siqi Wu, Bin Yu, Wei Xu

  2. Overview
     - Problem: how do we manage code and data with versions?
     - Code version control, e.g. GitHub
     - Data version control, e.g. DataHub [1]
     - But how to combine them in a coherent system?

     [1] Anant Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J. Elmore, Samuel Madden, and Aditya G. Parameswaran. DataHub: Collaborative Data Science & Dataset Version Management at Scale. arXiv preprint arXiv:1409.0798, 2014.

  3. Our Solution
     - Version control that combines code and datasets.
     - Datasets are generated by executing code.
     - Two data versions are connected by a code version.
     - Figure: Dataset version0001 and Dataset version0002, connected by an Experiment with commit_id = c41f29.
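
The link between two data versions and one code version can be pictured as a small record. The following is only a minimal sketch, not Datalab's actual data model: the names `Experiment`, `VersionLog`, `record`, and `producer_of` are hypothetical, and it assumes that version0001 is the input and version0002 the output of the experiment shown on the slide.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Experiment:
    """Hypothetical record: one code version connecting two data versions."""
    commit_id: str       # code version, e.g. a commit hash
    input_version: str   # data version the code read (assumed direction)
    output_version: str  # data version the code produced (assumed direction)


class VersionLog:
    """Hypothetical index from each data version to the experiment that produced it."""

    def __init__(self) -> None:
        self._by_output: Dict[str, Experiment] = {}

    def record(self, experiment: Experiment) -> None:
        # A data version is produced by exactly one experiment in this sketch.
        if experiment.output_version in self._by_output:
            raise ValueError(f"{experiment.output_version} is already recorded")
        self._by_output[experiment.output_version] = experiment

    def producer_of(self, data_version: str) -> Optional[Experiment]:
        # Returns None for data versions that were not generated by recorded code.
        return self._by_output.get(data_version)


log = VersionLog()
log.record(Experiment("c41f29", "version0001", "version0002"))
producer = log.producer_of("version0002")
print(producer.commit_id, producer.input_version)
```

Keeping the commit id on the edge rather than on either dataset is one way to make "which code produced this data" a single lookup.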

  4. Data Work Flow (DWF)
     - Pairs of data versions make up a data work flow (DWF)
     - Reconstruct a dataset by re-executing the version of code that generates it
     - Figure: data work flow graph over Dataset 1 through Dataset 7.
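
Reconstruction by re-execution can be sketched as a walk back along the work flow until a stored dataset is reached, followed by re-running each code version on the way forward. This is an illustration under stated assumptions, not the system's implementation: `reconstruct`, `edges`, `code_versions`, and `stored` are hypothetical names, each dataset is assumed to have a single parent, and code versions are stood in for by plain Python callables keyed by made-up commit ids.

```python
from typing import Any, Callable, Dict, Tuple

# Hypothetical encoding of a work flow: output dataset -> (input dataset, commit id).
Edges = Dict[str, Tuple[str, str]]


def reconstruct(
    target: str,
    edges: Edges,
    code_versions: Dict[str, Callable[[Any], Any]],
    stored: Dict[str, Any],
) -> Any:
    """Rebuild `target` by re-executing the code versions along its lineage."""
    if target in stored:
        return stored[target]  # a materialized copy exists, nothing to re-run
    if target not in edges:
        raise KeyError(f"no stored copy and no producing code for {target!r}")
    parent, commit_id = edges[target]
    parent_data = reconstruct(parent, edges, code_versions, stored)
    return code_versions[commit_id](parent_data)


# Toy work flow: dataset1 -> dataset2 -> dataset3, with only dataset1 stored.
stored = {"dataset1": [3, 1, 2]}
edges = {"dataset2": ("dataset1", "a1"), "dataset3": ("dataset2", "b2")}
code_versions = {
    "a1": sorted,                             # stand-in for code at commit a1
    "b2": lambda rows: [r * 10 for r in rows],  # stand-in for code at commit b2
}

result = reconstruct("dataset3", edges, code_versions, stored)
print(result)  # [10, 20, 30]
```

Only the root dataset is kept here; every derived dataset is recomputed on demand, which trades storage for re-execution time.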

  5. System Architecture

  6. Case Study: A Biological Data Application
     - Goal: find the best K principal patterns
     - Procedure:
       - Data preprocessing
       - Feature extraction
       - Non-negative matrix factorization
       - Evaluate K by a stability function
       - Repeat until the best parameter is found
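
The procedure above amounts to a parameter sweep over K. The sketch below shows only the outer loop: preprocessing, feature extraction, and the factorization are folded into a caller-supplied `run_pipeline`, and `stability` is likewise supplied by the caller, because the slides do not define either. The name `select_best_k` is hypothetical, and the sketch assumes that a higher stability score is better.

```python
from typing import Any, Callable, Iterable, Tuple


def select_best_k(
    candidates: Iterable[int],
    run_pipeline: Callable[[int], Any],
    stability: Callable[[Any], float],
) -> Tuple[int, float]:
    """Run the pipeline once per candidate K and keep the most stable one."""
    scores = {}
    for k in candidates:
        result = run_pipeline(k)       # preprocessing through factorization for this K
        scores[k] = stability(result)  # assumed: higher means more stable
    if not scores:
        raise ValueError("no candidate values of K were given")
    best_k = max(scores, key=scores.get)  # ties go to the earliest candidate
    return best_k, scores[best_k]


# Toy stand-ins: the "pipeline" returns K itself and stability peaks at K = 4.
best_k, best_score = select_best_k(
    range(2, 8),
    run_pipeline=lambda k: k,
    stability=lambda result: -abs(result - 4),
)
print(best_k, best_score)  # 4 0
```

Each iteration of this loop is a natural place to commit a code version and record the dataset it produces, so that any K can be revisited later.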

  7. Core APIs

  8. Future Work
     - Dataset caching
     - Online development environment
     - Multiple levels of interfaces
     - Figure: data work flow graph over Dataset 1 through Dataset 7.

  9. Conclusions
     - We combine data and code version control
     - We propose data work flow
     - We improve the efficiency of a data science procedure
