
Data Pipeline Selection and Optimization, DOLAP 2019, Alexandre Quemy



  1. Data Pipeline Selection and Optimization. DOLAP 2019. Alexandre Quemy, IBM, Data and AI; Politechnika Poznańska

  2. The usual workflow
     Stages: Data collection → Data pipeline → Model selection
     [Diagram: Raw data → Operation 1 → Operation 2 → Operation 3 → algorithm → models 1 to 3 → metric → metaoptimizer (labelled p) → best model. Annotation: potentially tool-assisted.]

  3. The hard life of data scientists
     → Dealing with missing values:
        → Discarding? Row? Column?
        → Imputation? What imputation? Mean? Median? Model-based? What model?
     → Imbalanced datasets:
        → Downsampling? Oversampling?
        → Nothing? What bias does it imply?
     → Data too large:
        → Dimensionality reduction: what algorithm? PCA? Normalization or not?
        → Subsampling: what technique? What bias?
     → Outlier detection and curation:
        → What threshold? What deviation measure?
        → Trimming? Truncating? Censoring? Winsorizing?
     → Encoding for method domain requirements:
        → Discretization? Grid? What step? Cluster? What method? What hyperparameter?
        → Categorical encoder? Binary? One-hot? Helmert? Backward difference?
     → NLP:
        → How many tokens?
        → Size of n-grams?

  4. The usual workflow
     Stages: Data collection → Data pipeline → Model selection
     [Diagram: Raw data → Operation 1 → Operation 2 → Operation 3 → algorithm → models 1 to 3 → metric → metaoptimizer (labelled p) → best model. Annotation: potentially tool-assisted.]

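  5. The workflow proposed in the paper
     Stages: Data collection → Data pipeline → Model selection
     [Diagram: Raw data → candidate operators (O11, O12, O21, O22, O23) → best pipeline → algorithm → models 1 to 3 → metric → best model. Two metaoptimizers, each labelled p: one over the data pipeline, whose metric is marked "metric?", and one over model selection.]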

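  6. The workflow proposed in the paper
     Stages: Data collection → Data pipeline → Model selection
     [Diagram: Raw data → candidate operators (O11, O12, O21, O22, O23) → best pipeline → algorithm → models 1 to 3 → metric → best model. Two metaoptimizers, each labelled p: one over the data pipeline, driven by a metric, and one over model selection. A feedback arrow is added.]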

  7. Pipeline prototype
     Baseline: (Id, Id, Id)
     Rebalance: 4 operators
     Normalize: 5 operators
     Features: 4 operators
     Configuration space: 4750 configurations
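The configuration space of a three-step pipeline like the one above can be sketched as a Cartesian product of per-step choices. The slide gives only operator counts, so every operator name and hyperparameter grid below is a hypothetical placeholder; this toy space has 378 configurations and does not reproduce the 4750 of the prototype.

```python
from itertools import product

# Hypothetical operators and hyperparameter grids (not taken from the slides).
# "id" is the identity operator, so (id, id, id) is the baseline pipeline.
STEPS = {
    "rebalance": {
        "id": [{}],
        "near_miss": [{"n_neighbors": k} for k in (1, 2, 3)],
        "smote": [{"k_neighbors": k} for k in (5, 6, 7)],
    },
    "normalize": {
        "id": [{}],
        "standard": [
            {"with_mean": m, "with_std": s}
            for m in (True, False)
            for s in (True, False)
        ],
        "min_max": [{}],
    },
    "features": {
        "id": [{}],
        "pca": [{"n_components": n} for n in (1, 2, 3, 4)],
        "select_k_best": [{"k": k} for k in (1, 2, 3, 4)],
    },
}


def step_choices(operators):
    """Flatten one step into a list of (operator name, parameters) pairs."""
    return [(name, params) for name, grid in operators.items() for params in grid]


def enumerate_configurations(steps):
    """Return every pipeline configuration: one choice per step, in step order."""
    per_step = [step_choices(operators) for operators in steps.values()]
    return list(product(*per_step))


configs = enumerate_configurations(STEPS)
baseline = (("id", {}), ("id", {}), ("id", {}))
print(len(configs))          # 7 * 6 * 9 choices per step
print(baseline in configs)
```

Each configuration is a tuple of three (operator, parameters) pairs, which makes it easy to hand to a metaoptimizer as a single search point.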

  8. Protocol
     • Datasets: Breast, Iris, Wine.
     • Methods: SVM, Random Forest, Neural Network, Decision Tree.
     • Dataset split: 60% for the training set, 40% for the test set.
     • Pipeline configuration space size: 4750 configurations.
     • Performance metric: cross-validation accuracy.
     • Metaoptimizer: Tree-structured Parzen Estimator (hyperopt).
     • Budget: 100 configurations (~2% of the space).
     No algorithm hyperparameter tuning! We want to quantify the influence of the data pipeline.
     An exhaustive search provides the baseline and maximum scores for comparison.
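The comparison in this protocol, a budgeted search against an exhaustive one, can be illustrated with a minimal sketch. It makes two loud assumptions: uniform random sampling stands in for the Tree-structured Parzen Estimator, and `toy_accuracy` is a synthetic score, not a cross-validated model. Only the space size (4750) and the budget (100) come from the slide.

```python
import random


def budgeted_search(configs, score, budget, seed=0):
    """Score `budget` distinct configurations drawn uniformly; return the best.

    Stand-in for TPE: a real metaoptimizer would use past scores to pick
    the next configuration instead of sampling blindly.
    """
    rng = random.Random(seed)
    sampled = rng.sample(configs, budget)
    best = max(sampled, key=score)
    return best, score(best)


def exhaustive_search(configs, score):
    """Score every configuration; gives the maximum reachable score."""
    best = max(configs, key=score)
    return best, score(best)


def toy_accuracy(config):
    """Synthetic accuracy in [0.60, 0.95]; purely illustrative."""
    return 0.60 + 0.35 * ((config * 37) % 4750) / 4749


configs = list(range(4750))  # integer labels for 4750 configurations
found, found_score = budgeted_search(configs, toy_accuracy, budget=100)
_, max_score = exhaustive_search(configs, toy_accuracy)
print(f"budgeted best: {found_score:.4f}, exhaustive best: {max_score:.4f}")
```

The exhaustive pass is only affordable because the space is small; its role here is to give a ground truth against which the budgeted result can be judged.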

  9. Results

  10. On average, with 20 iterations (0.42% of the search space):
     1. a decrease of error by 58% compared to the baseline
     2. 98.92% in the normalized score space
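The quantities on this slide can be made concrete with a few helper functions. The budget fraction follows directly from the slide (20 of 4750 configurations). The relative error reduction uses the usual definition; the normalized score is assumed here to rescale accuracy between the baseline and the maximum found by exhaustive search, which is one plausible reading of the slides rather than a confirmed definition. The accuracies in the example are hypothetical.

```python
def budget_fraction(evaluated, space_size):
    """Share of the configuration space that was evaluated."""
    return evaluated / space_size


def relative_error_reduction(baseline_acc, acc):
    """Fraction of the baseline error removed by the selected pipeline."""
    baseline_err = 1.0 - baseline_acc
    return (baseline_err - (1.0 - acc)) / baseline_err


def normalized_score(acc, baseline_acc, max_acc):
    """Assumed normalization: 0 at the baseline, 1 at the exhaustive maximum."""
    return (acc - baseline_acc) / (max_acc - baseline_acc)


# From the slides: 20 iterations over 4750 configurations.
print(round(100 * budget_fraction(20, 4750), 2))   # percent of the space

# Hypothetical accuracies, for illustration only.
print(relative_error_reduction(0.80, 0.90))
print(normalized_score(0.90, 0.80, 0.92))
```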

  11. How close are we to the optimal pipeline?

  12. A solution for Euclidean space
     N: number of algorithms
     K: dimension of the configuration space
     r: a reference point
     p*: sample of optimal configurations
     For each optimal configuration r:
     1. Build the sample w.r.t. the algorithms:
        → For each algorithm, select the optimal point that is closest to the reference point.
     2. Express the sample in the normalized configuration space.
     3. Calculate the NMAD on the sample.
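The three steps above can be sketched as follows. The slide does not spell out the NMAD formula, so this sketch assumes it is the mean absolute deviation of the normalized sample from the normalized reference point, averaged over the K dimensions and the N algorithms. The use of Euclidean distance in step 1, the per-dimension bounds `lower` and `upper`, and the example points are likewise assumptions.

```python
import math


def normalize(point, lower, upper):
    """Rescale each coordinate to [0, 1] using per-dimension bounds."""
    return [(x - lo) / (hi - lo) for x, lo, hi in zip(point, lower, upper)]


def build_sample(reference, optima_per_algorithm):
    """Step 1: for each algorithm, keep its optimal point closest to r."""
    return [
        min(optima, key=lambda p: math.dist(p, reference))
        for optima in optima_per_algorithm
    ]


def nmad(reference, optima_per_algorithm, lower, upper):
    """Steps 2 and 3 under the assumed definition of NMAD."""
    sample = build_sample(reference, optima_per_algorithm)
    r = normalize(reference, lower, upper)
    k = len(reference)
    deviations = [
        sum(abs(a - b) for a, b in zip(normalize(p, lower, upper), r)) / k
        for p in sample
    ]
    return sum(deviations) / len(deviations)


# Hypothetical example with N = 3 algorithms and K = 2 dimensions.
lower, upper = [0, 0], [10, 4]
optima = [
    [(2, 1)],           # algorithm A: the reference is one of its optima
    [(4, 1), (9, 3)],   # algorithm B: (4, 1) is closest to the reference
    [(2, 3)],           # algorithm C
]
value = nmad((2, 1), optima, lower, upper)
print(round(value, 4))
```

A value near 0 means the optimal pipelines of the different algorithms sit close to the reference configuration; larger values mean the optima are spread across the configuration space.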

  13. Results on two datasets for text classification

  14. Future work
     Work in progress:
     → Tests on larger configuration spaces.
     → Online architecture.
     → Metric between pipelines.

  15. Thank you Don’t forget the poster session!
