Data Model Predictions ( x ) Kim Hammar (Logical Clocks) - PowerPoint PPT Presentation


  1. Feature Store: the missing data layer in ML pipelines? [1]
     Spotify ML Guild Fika. Kim Hammar, kim@logicalclocks.com. February 26, 2019.
     [1] Kim Hammar and Jim Dowling. Feature Store: the missing data layer in ML pipelines? https://www.logicalclocks.com/feature-store/. 2018.
     Footer on every slide: Kim Hammar (Logical Clocks), Hopsworks Feature Store, February 26, 2019.

  2. Data → Model ϕ(x) → Predictions

  3. Figure: the core Data → Model ϕ(x) → Predictions, surrounded by the supporting components of an ML system: A/B Testing, Distributed Training, Data Validation, Model Serving, Data Collection, HyperParameter Tuning, Monitoring, Hardware Management, Feature Engineering, Pipeline Management. [2]
     [2] Image inspired by Sculley et al. (Google), Hidden Technical Debt in Machine Learning Systems.

  4. Outline
     1. Hopsworks: Quick background of the platform
     2. What is a Feature Store
     3. Why You Need a Feature Store, Things to Consider:
        - How to encourage feature reusage?
        - How to store large-scale datasets for deep learning?
        - How to serve features for inference?
     4. How to Build a Feature Store (Hopsworks Feature Store Case Study)
     5. Demo

  5. Figure: the Hopsworks platform.
     - HopsML: Orchestration; Data Ingestion, Data Prep, Training, Serving
     - Services: Feature Store, REST API, Kafka, TF Serving
     - Resources: CPUs, GPUs
     - Foundations: HopsYARN (fork of YARN), HopsFS (fork of HDFS)

  7. Training data, with features x_{i,j} and labels y_i, is fed to the model ϕ(x), which outputs predictions ŷ:

     $$
     \begin{bmatrix}
     x_{1,1} & \cdots & x_{1,n} & y_1 \\
     \vdots & \ddots & \vdots & \vdots \\
     x_{n,1} & \cdots & x_{n,n} & y_n
     \end{bmatrix}
     \;\longrightarrow\; \phi(x) \;\longrightarrow\; \hat{y}
     $$

  8. The same diagram, with a shrug (¯\_(ツ)_/¯) between the data and the model ϕ(x). [3]

  9. The same diagram, with a quote from Uber: “Data is the hardest part of ML and the most important piece to get right. Modelers spend most of their time selecting and transforming features at training time and then building the pipelines to deliver those features to production models.” [3]

  10. The same diagram and quote, with the Feature Store in place of the shrug: data → Feature Store → ϕ(x) → ŷ. [3]
      [3] Jeremy Hermann and Mike Del Balso. Scaling Machine Learning at Uber with Michelangelo. https://eng.uber.com/scaling-michelangelo/. 2018.

  11. Disentangle ML Pipelines with a Feature Store
      Figure: Raw/Structured Data → Feature Engineering → Feature Store → Training → Models.
      A feature store is a central vault for storing documented, curated, and access-controlled features. The feature store is the interface between data engineering and data model development.

  12. Figure: Dataset 1, Dataset 2, ..., Dataset n → Feature Engineering → models (drawn as a neural network, a decision tree, and a scatter plot with axes X and Y).

  13. Figure: Dataset 1, Dataset 2, ..., Dataset n → Feature Store → models (drawn as a neural network, a decision tree, and a scatter plot with axes X and Y).

  14. The same figure, with the label Backfilling added to the Feature Store.

  15. The same figure, with the label Analysis added.

  16. The same figure, with the label Versioning added.

  17. The same figure, with the label Documentation added. The Feature Store is now labeled with Backfilling, Analysis, Versioning, and Documentation.

  18. What is a Feature?
      A feature is a measurable property of some data sample. A feature could be:
      - An aggregate value (min, max, mean, sum)
      - A raw value (a pixel, a word from a piece of text)
      - A value from a database table (the age of a customer)
      - A derived representation, e.g., an embedding or a cluster
      Features are the fuel for AI systems.
      Figure: Features (x_1, ..., x_n) → Model θ → Prediction ŷ → Loss L(y, ŷ), with the gradient ∇_θ L(y, ŷ) flowing back to the model.
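
The loop in the figure (features → model θ → prediction ŷ → loss L(y, ŷ) → gradient ∇_θ L) can be sketched in a few lines. This is only an illustration: the slide does not specify the model or the loss, so the linear model, the squared-error loss, the learning rate, and all function names below are assumptions.

```python
def predict(theta, x):
    """Assumed linear model: y_hat = sum_j theta_j * x_j."""
    return sum(t * xj for t, xj in zip(theta, x))


def loss(y, y_hat):
    """Assumed squared-error loss: L(y, y_hat) = 0.5 * (y - y_hat)^2."""
    return 0.5 * (y - y_hat) ** 2


def gradient(theta, x, y):
    """Gradient of the loss w.r.t. theta: dL/dtheta_j = (y_hat - y) * x_j."""
    err = predict(theta, x) - y
    return [err * xj for xj in x]


def sgd_step(theta, x, y, lr=0.1):
    """One gradient-descent update: theta <- theta - lr * gradient."""
    grad = gradient(theta, x, y)
    return [t - lr * g for t, g in zip(theta, grad)]


# One feature vector and its label (illustrative values only).
features = [1.0, 2.0]
label = 3.0

theta = [0.0, 0.0]
loss_before = loss(label, predict(theta, features))
theta = sgd_step(theta, features, label)
loss_after = loss(label, predict(theta, features))
```

With these values, a single update moves the prediction toward the label, so `loss_after` is smaller than `loss_before`.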

  19. Figure: a text feature pipeline from Raw text to Model. Steps shown: lower-case & remove noise, tokenization, lemmatization, words_post.csv, group by post, words.txt, word2vec, TF-IDF, LDA, ontology-matching, annotation with weak supervision.
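
Three of the steps named in this figure (lower-casing, tokenization, and TF-IDF) can be sketched with the standard library alone. This is a minimal sketch, not the pipeline used in the talk: the regular expression, the unsmoothed TF-IDF variant (tf = count / document length, idf = log(N / document frequency)), the function names, and the sample posts are all assumptions.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lower-case the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf  = term count / number of tokens in the document
    idf = log(number of documents / number of documents containing the term)
    """
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append(
            {t: (c / total) * math.log(n_docs / df[t]) for t, c in counts.items()}
        )
    return weights


# Illustrative posts, not data from the talk.
posts = ["The match was great", "The players were young"]
weights = tf_idf(posts)
```

A term that occurs in every post, such as "the" here, gets weight zero under this variant, while terms unique to one post get a positive weight.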

  20. Figure: Raw text → normalization, weak annotation, TF-IDF, LDA, word2vec → Feature Store → Model.

  21. How to Encourage Feature Reusage?

  22. Feature Marketplace: Search Features, Publish Features, Download Features.

  23. Feature Store API Service
      Include features in ML pipelines:
      `from hops import featurestore`
      `features_df = featurestore.get_features(["average_attendance", "average_player_age"])`
      Figure: Feature Store API Service, on top of Feature Metadata, Shared Storage, Feature Relationships, and Feature Groups.
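
To make the idea behind this slide concrete, here is a toy in-memory stand-in showing how a call such as `get_features` might resolve feature names to feature groups through metadata and join them on a shared key. It is not the `hops` library and says nothing about how Hopsworks implements the service; the class, its methods, the group names, and the sample values are all hypothetical.

```python
class InMemoryFeatureStore:
    """Hypothetical stand-in for a feature store service (not the hops API)."""

    def __init__(self):
        # Shared storage: feature group name -> {entity key -> {feature -> value}}
        self.feature_groups = {}
        # Metadata: feature name -> name of the feature group that holds it
        self.feature_index = {}

    def publish(self, group_name, rows):
        """Register a feature group; rows maps entity key -> feature dict."""
        self.feature_groups[group_name] = rows
        for features in rows.values():
            for name in features:
                self.feature_index[name] = group_name

    def get_features(self, names):
        """Look up each feature's group and inner-join the groups on the key.

        Assumes names is non-empty and every row of a group has all of that
        group's features.
        """
        groups = [self.feature_groups[self.feature_index[n]] for n in names]
        keys = set(groups[0])
        for group in groups[1:]:
            keys &= set(group)
        return [
            {"key": k, **{n: g[k][n] for n, g in zip(names, groups)}}
            for k in sorted(keys)
        ]


# Illustrative values only.
store = InMemoryFeatureStore()
store.publish(
    "attendance_features",
    {"team_1": {"average_attendance": 12000.0},
     "team_2": {"average_attendance": 8500.0}},
)
store.publish(
    "player_features",
    {"team_1": {"average_player_age": 26.4},
     "team_3": {"average_player_age": 24.1}},
)
rows = store.get_features(["average_attendance", "average_player_age"])
```

The caller names only the features; which feature group each one lives in is resolved from the metadata, which is the property that makes features easy to reuse across pipelines.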

  24. How to Store Datasets for Deep Learning?

  25. How to Store Datasets for Deep Learning?
      - Should be framework agnostic
      - Need to be able to store tensor datasets
      - Should support sharding for distributed training
      - Advanced features: row-predicate filtering, SQL interface, columnar selection
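
Three of these requirements (sharding, row-predicate filtering, and columnar selection) can be illustrated on plain Python rows. This is a sketch of the concepts only, assuming round-robin assignment of rows to shards; real dataset formats implement these features at the file and row-group level, and the `read_shard` function and sample dataset are hypothetical.

```python
def read_shard(rows, shard_index, num_shards, columns=None, predicate=None):
    """Yield the rows that belong to one worker's shard.

    Row i is assigned to shard i % num_shards (round-robin), so the shards
    are disjoint and together cover the dataset. Rows are then filtered by
    the optional predicate and projected onto the optional column list.
    """
    for i, row in enumerate(rows):
        if i % num_shards != shard_index:
            continue  # row belongs to another worker
        if predicate is not None and not predicate(row):
            continue  # row-predicate filtering
        if columns is None:
            yield dict(row)
        else:
            yield {c: row[c] for c in columns}  # columnar selection


# Illustrative dataset of six rows.
dataset = [{"id": i, "age": 20 + i, "label": i % 2} for i in range(6)]

# Worker 0 of 2 reads every other row, all columns.
shard0 = list(read_shard(dataset, 0, 2))

# Worker 1 of 2 keeps rows with age >= 23 and only two columns.
shard1 = list(
    read_shard(
        dataset, 1, 2,
        columns=["id", "label"],
        predicate=lambda r: r["age"] >= 23,
    )
)
```

Assigning rows to shards before filtering keeps the shards disjoint regardless of the predicate, at the cost of possibly uneven shard sizes after filtering.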

  26. How to Store Datasets for Deep Learning? Formats considered: HDF5, TFRecords.

  27. How to Store Datasets for Deep Learning? Petastorm [5]
      - Petastorm is a dataset format designed for deep learning.
      - Petastorm stores data as Parquet files with extra metadata to handle multi-dimensional tensors.
      - Petastorm contains readers for popular machine learning frameworks such as SparkML, TensorFlow, and PyTorch.
      [5] Robbie Gruener, Owen Cheng, and Yevgeni Litvin. Introducing Petastorm: Uber ATG’s Data Access Library for Deep Learning. https://eng.uber.com/petastorm/. 2018.

  28. How to Serve Features for Inference?
