Datalab: A Version Data Management and Analytics System
Yang Zhang, Fangzhou Xu, Erwin Frise, Siqi Wu, Bin Yu, Wei Xu
Overview
• Problem: how do we manage code and data with versions?
• Code version control, e.g. GitHub
• Data version control, e.g. DataHub [1]
• But how do we combine them in a coherent system?
[1] Anant Bhardwaj, Souvik Bhattacherjee, Amit Chavan, Amol Deshpande, Aaron J. Elmore, Samuel Madden, and Aditya G. Parameswaran. DataHub: Collaborative Data Science & Dataset Version Management at Scale. arXiv preprint arXiv:1409.0798, 2014.
Our Solution
• Version control that combines code and datasets.
• Datasets are generated by executing code.
• Two data versions are connected by a code version.
[Figure: Dataset version0001 and Dataset version0002, connected by Experiment commit_id = c41f29]
Data Work Flow (DWF)
• Pairs of data versions make up a data work flow (DWF).
• Reconstruct a dataset by re-executing the version of code that generates it.
[Figure: data work flow graph over Dataset 1 through Dataset 7]
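The reconstruction idea on this slide can be sketched in a few lines of Python. This is a minimal in-memory model under stated assumptions, not Datalab's actual API: the names `Edge`, `DataWorkFlow`, `add_root`, `register_code`, `record`, and `reconstruct` are all hypothetical, and each code version is assumed to be a function of a single parent dataset. Only the version labels and the commit id are taken from the slides.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Edge:
    """A DWF edge: the code version that turned one dataset version into another."""
    parent: str      # input dataset version
    commit_id: str   # code version that produced the child dataset


class DataWorkFlow:
    """Hypothetical in-memory model of a data work flow (illustration only)."""

    def __init__(self) -> None:
        self.roots: Dict[str, Any] = {}                   # stored source datasets
        self.produced_by: Dict[str, Edge] = {}            # child version -> edge
        self.code: Dict[str, Callable[[Any], Any]] = {}   # commit_id -> transformation

    def add_root(self, version: str, data: Any) -> None:
        """Store a source dataset that no code version produced."""
        self.roots[version] = data

    def register_code(self, commit_id: str, fn: Callable[[Any], Any]) -> None:
        """Associate a commit id with the transformation it contains."""
        self.code[commit_id] = fn

    def record(self, parent: str, commit_id: str, child: str) -> None:
        """Record that running `commit_id` on `parent` generated `child`."""
        self.produced_by[child] = Edge(parent, commit_id)

    def reconstruct(self, version: str) -> Any:
        """Rebuild a dataset by re-executing the code versions on its lineage."""
        if version in self.roots:
            return self.roots[version]
        edge = self.produced_by[version]
        return self.code[edge.commit_id](self.reconstruct(edge.parent))


dwf = DataWorkFlow()
dwf.add_root("version0001", [3, 1, 2])
dwf.register_code("c41f29", sorted)   # stand-in for the experiment code
dwf.record("version0001", "c41f29", "version0002")
print(dwf.reconstruct("version0002"))  # [1, 2, 3]
```

In this sketch only root datasets are stored; every derived version is recomputed on demand by walking back along the edges that produced it.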
System Architecture
Case Study: A Biological Data Application
• Goal: find the best K principal patterns
• Procedure:
  • Data preprocessing
  • Feature extraction
  • Non-negative matrix factorization
  • Evaluate K by a stability function
  • Repeat until the best parameter is found
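The repeat-and-evaluate loop in this procedure can be sketched as a plain parameter sweep. The slides do not specify the stability function or how many runs are compared, so everything here is an assumption: `select_best_k`, `run_pipeline`, `stability`, and `n_repeats` are hypothetical names, a higher stability score is assumed to be better, and the toy pipeline below merely stands in for preprocessing, feature extraction, and factorization.

```python
from typing import Any, Callable, Dict, Iterable, List, Tuple


def select_best_k(
    candidates: Iterable[int],
    run_pipeline: Callable[[int, int], Any],
    stability: Callable[[List[Any]], float],
    n_repeats: int = 3,
) -> Tuple[int, Dict[int, float]]:
    """Run the pipeline several times per K and keep the K with the highest score."""
    scores: Dict[int, float] = {}
    for k in candidates:
        # One run per seed; the seed stands in for a random initialization.
        runs = [run_pipeline(k, seed) for seed in range(n_repeats)]
        scores[k] = stability(runs)
    best_k = max(scores, key=lambda k: scores[k])
    return best_k, scores


def toy_pipeline(k: int, seed: int) -> float:
    """Toy stand-in: results vary with the seed unless K equals 3."""
    return float(seed * abs(k - 3))


def toy_stability(runs: List[float]) -> float:
    """Toy score: the smaller the spread across runs, the higher the score."""
    return -(max(runs) - min(runs))


best_k, scores = select_best_k([2, 3, 4], toy_pipeline, toy_stability)
print(best_k)  # 3
```

In a system that versions code and data together, each `run_pipeline` call would correspond to one experiment whose output dataset is tracked, so earlier runs for a given K would not need to be repeated by hand.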
Core APIs
Future Work
• Dataset caching
• Online development environment
• Multiple levels of interfaces
[Figure: data work flow graph over Dataset 1 through Dataset 7]
Conclusions
• We combine data and code version control.
• We propose the data work flow (DWF).
• We improve the efficiency of a data science procedure.