  1. Bighead: Airbnb’s End-to-End Machine Learning Infrastructure
     Andrew Hoh, ML Infra @ Airbnb

  2. Agenda: Background, Design Goals, Architecture Deep Dive, Open Source

  3. Background

  4. Airbnb’s Product A global travel community that offers magical end-to-end trips, including where you stay, what you do and the people you meet.

  5. Airbnb is already driven by Machine Learning: Search Ranking, Smart Pricing, Fraud Detection

  6. But there are *many* more opportunities for ML
     ● Paid Growth - Hosts
     ● Classifying / Categorizing Listings
     ● Experience Ranking + Personalization
     ● Room Type Categorizations
     ● Customer Service Ticket Routing
     ● Airbnb Plus
     ● Listing Photo Quality
     ● Object Detection - Amenities
     ● ...

  7. Intrinsic Complexities with Machine Learning
     ● Understanding the business domain
     ● Selecting the appropriate model
     ● Selecting the appropriate features
     ● Fine tuning

  8. Incidental Complexities with Machine Learning
     ● Integrating with Airbnb’s Data Warehouse
     ● Scaling model training & serving
     ● Keeping consistency between: Prototyping vs Production, Training vs Inference
     ● Keeping track of multiple models, versions, experiments
     ● Supporting iteration on ML models
     → ML models take on average 8 to 12 weeks to build
     → ML workflows tended to be slow, fragmented, and brittle

  9. The ML Infrastructure Team addresses these challenges
     Vision: Airbnb routinely ships ML-powered features throughout the product.
     Mission: Equip Airbnb with shared technology to build production-ready ML applications with no incidental complexity.

  10. Supporting the Full ML Lifecycle

  11. Bighead: Design Goals

  12. Seamless Versatile Consistent Scalable

  13. Seamless ● Easy to prototype, easy to productionize ● Same workflow across different frameworks

  14. Versatile
     ● Supports all major ML frameworks
     ● Meets various requirements:
     ○ Online and Offline
     ○ Data size
     ○ SLA
     ○ GPU training
     ○ Scheduled and Ad hoc

  15. Consistent
     ● Consistent environment across the stack
     ● Consistent data transformation:
     ○ Prototyping and Production
     ○ Online and Offline

  16. Scalable
     ● Horizontal
     ● Elastic

  17. Bighead: Architecture Deep Dive

  18. Lifecycle (architecture overview)
     Prototyping: Redspot
     Production: Deep Thought (Real Time Inference); ML Automator + Airflow (Batch Training + Inference)
     Management: Bighead Service / UI
     Environment Management: Docker Image Service
     Execution Management: Bighead Library
     Feature Data Management: Zipline

  19. Lifecycle (architecture overview, repeated; see slide 18)

  20. Redspot Prototyping with Jupyter Notebooks

  21. Jupyter Notebooks? What are those? “Creators need an immediate connection to what they are creating.” - Bret Victor

  22. The ideal Machine Learning development environment?
     ● Interactivity and Feedback
     ● Access to Powerful Hardware
     ● Access to Data

  23. Redspot: a Supercharged Jupyter Notebook Service
     ● A fork of the JupyterHub project
     ● Integrated with our Data Warehouse
     ● Access to specialized hardware (e.g. GPUs)
     ● File sharing between users via AWS EFS
     ● Packaged in a familiar JupyterHub UI
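The per-user customization Redspot offers could be sketched with an illustrative JupyterHub configuration. Redspot is an internal fork, so its actual spawner and image names are not public; the sketch below uses the open-source DockerSpawner and hypothetical image names to show the same ideas (selectable runtime images, shared EFS home directories):

```python
# jupyterhub_config.py -- illustrative sketch only; Redspot's real spawner
# and image names are internal to Airbnb.
c = get_config()  # provided by JupyterHub when it loads this file

# Launch each user's notebook server in a Docker container
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"

# Let users pick a runtime environment (image names are hypothetical)
c.DockerSpawner.allowed_images = {
    "Py2.7": "ml-infra/py27:latest",
    "Py3.6 + TensorFlow": "ml-infra/py36-tf:latest",
}

# Mount each user's EFS directory as their home, enabling file sharing
c.DockerSpawner.volumes = {"/mnt/efs/home/{username}": "/home/jovyan"}
```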

  24. Redspot

  25. Redspot: a Supercharged Jupyter Notebook Service
     Seamless: Integrated with the Bighead Service & Docker Image Service via APIs & UI widgets
     Versatile: Customized hardware (AWS EC2 instance types, e.g. P3, X1); customized dependencies (Docker images, e.g. Py2.7, Py3.6+TensorFlow)
     Consistent: Promotes prototyping in the exact environment that your model will use in production

  26. Lifecycle (architecture overview, repeated; see slide 18)

  27. Docker Image Service Environment Customization

  28. Docker Image Service - Why
     ● ML users have a diverse, heterogeneous set of dependencies
     ● Need an easy way to bootstrap their own runtime environments
     ● Need to be consistent with the rest of Airbnb’s infrastructure

  29. Docker Image Service - Dependency Customization
     ● Our configuration management solution
     ● A composition layer on top of Docker
     ● Includes a customization service that faces our users
     ● Promotes Consistency and Versatility
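The "composition layer on top of Docker" can be illustrated with a small sketch: a service that assembles a Dockerfile from a base image plus user-selected dependency groups. The layer names and package lists here are hypothetical, not the Docker Image Service's real catalog:

```python
# Sketch of a Docker "composition layer": assemble a Dockerfile from a
# base image plus user-chosen dependency groups. Names are hypothetical.
DEPENDENCY_LAYERS = {
    "py36": ["RUN pip install --no-cache-dir numpy pandas"],
    "tensorflow": ["RUN pip install --no-cache-dir tensorflow"],
    "xgboost": ["RUN pip install --no-cache-dir xgboost"],
}

def compose_dockerfile(base_image: str, layers: list) -> str:
    """Stack the chosen dependency layers on top of the base image."""
    lines = [f"FROM {base_image}"]
    for name in layers:
        lines.extend(DEPENDENCY_LAYERS[name])
    return "\n".join(lines)

print(compose_dockerfile("ubuntu:18.04", ["py36", "tensorflow"]))
```

A user picks layers in a UI; the service renders and builds the image, so every environment is reproducible from the same declarative spec.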

  30. Lifecycle Prototyping Production Management Deep Thought Real Time Inference Redspot Bighead Service / UI ML Automator Batch Airflow Training + Inference Environment Management: Docker Image Service Execution Management: Bighead Library Feature Data Management: Zipline

  31. Bighead Service Model Lifecycle Management

  32. Model Lifecycle Management - why? ● Tracking ML model changes is just as important as tracking code changes ● ML model work needs to be reproducible to be sustainable ● Comparing experiments before you launch models into production is critical

  33. Bighead Service
     Consistent: Central model management service; single source of truth about the state of a model, its dependencies, and what’s deployed
     Seamless: Context-aware visualizations that carry over from the prototyping experience
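A "single source of truth" for model state can be sketched as a tiny registry that records each version's environment and pipeline fingerprint and tracks which version is deployed. This is a toy stand-in, not Bighead Service's actual API:

```python
# Toy model registry: tracks versions, their environments, and deployment
# state. Illustrative only -- not the real Bighead Service API.
import hashlib
from dataclasses import dataclass

@dataclass
class ModelVersion:
    name: str
    version: int
    docker_image: str   # environment the model was built in
    pipeline_hash: str  # fingerprint of the serialized pipeline
    deployed: bool = False

class ModelRegistry:
    def __init__(self):
        self._versions = {}

    def register(self, name, docker_image, pipeline_bytes):
        versions = self._versions.setdefault(name, [])
        mv = ModelVersion(name, len(versions) + 1, docker_image,
                          hashlib.sha256(pipeline_bytes).hexdigest())
        versions.append(mv)
        return mv

    def deploy(self, name, version):
        # Exactly one version of a model is live at a time
        for mv in self._versions[name]:
            mv.deployed = (mv.version == version)

    def deployed_version(self, name):
        return next((mv for mv in self._versions[name] if mv.deployed), None)
```

Because every version carries its image and pipeline hash, experiments are reproducible and comparable before anything is promoted to production.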

  34. Lifecycle (architecture overview, repeated; see slide 18)

  35. Bighead Library

  36. ML Models are highly heterogeneous in:
     ● Frameworks
     ● Training data: data quality; structured vs unstructured (image, text)
     ● Environment: GPU vs CPU; dependencies

  37. ML Models are hard to keep consistent ● Data in production is different from data in training ● Offline pipeline is different from online pipeline ● Everyone does everything in a different way

  38. Bighead Library
     Versatile:
     ● Pipeline on steroids - compute graph for preprocessing / inference / training / evaluation / visualization
     ● Composable, Reusable, Shareable
     ● Support popular frameworks
     ● Fast primitives for preprocessing
     Consistent:
     ● Uniform API
     ● Serializable - same pipeline used in training, offline inference, online inference
     ● Metadata for trained models
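The "pipeline with a uniform API" idea can be sketched in a few lines: every stage exposes the same fit/transform interface, and a pipeline is itself a stage, so the identical object runs in training and inference. The class names here are illustrative, not Bighead Library's real API:

```python
# Minimal sketch of a uniform-API pipeline (illustrative names, not the
# real Bighead Library interface).
class Transformer:
    """Every stage exposes the same fit/transform contract."""
    def fit(self, data):
        return self
    def transform(self, data):
        raise NotImplementedError

class Lowercase(Transformer):
    def transform(self, data):
        return [s.lower() for s in data]

class TokenCount(Transformer):
    def transform(self, data):
        return [len(s.split()) for s in data]

class Pipeline(Transformer):
    """A linear compute graph: each stage feeds the next. Because a
    Pipeline is itself a Transformer, pipelines compose."""
    def __init__(self, stages):
        self.stages = stages
    def fit(self, data):
        for stage in self.stages:
            stage.fit(data)
            data = stage.transform(data)
        return self
    def transform(self, data):
        for stage in self.stages:
            data = stage.transform(data)
        return data

pipe = Pipeline([Lowercase(), TokenCount()]).fit(["Hello World", "Airbnb"])
print(pipe.transform(["A B C"]))  # → [3]
```

The same fitted `pipe` object can then be handed unchanged to batch scoring or an online service, which is what keeps training and inference consistent.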

  39. Bighead Library: ML Pipeline

  40. Visualization - Pipeline

  41. Easy to Serialize/Deserialize
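"Easy to serialize/deserialize" means a fitted pipeline, including its learned state, round-trips through bytes so the serving side loads exactly what training produced. A minimal sketch using Python's standard `pickle` (the source doesn't specify Bighead's actual serialization format, so this is an assumption for illustration):

```python
# Sketch: a fitted transformer whose learned state survives serialization.
# pickle is used for illustration; the real serialization format may differ.
import pickle

class Scaler:
    def fit(self, xs):
        self.max = max(xs)  # learned state
        return self
    def transform(self, xs):
        return [x / self.max for x in xs]

fitted = Scaler().fit([1.0, 2.0, 4.0])
blob = pickle.dumps(fitted)       # store in the model registry / ship to serving
restored = pickle.loads(blob)     # same pipeline, same learned state
print(restored.transform([2.0]))  # → [0.5]
```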

  42. Visualization - Training Data

  43. Visualization - Transformer

  44. Lifecycle (architecture overview, repeated; see slide 18)

  45. Deep Thought Online Inference

  46. Hard to make online model serving...
     Consistent with training: different data; different pipeline; different dependencies
     Easy to do: data scientists can’t launch models without an engineering team; engineers often need to rebuild models
     Scalable: resource requirements vary across models; throughput fluctuates across time

  47. Deep Thought
     Consistent: Docker + Bighead Library: same data source, pipeline, and environment as training
     Seamless: Integration with event logging and dashboards; integration with Zipline
     Scalable: Kubernetes: model pods can easily scale; resource segregation across models
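The serving pattern described — load the trained pipeline once, then score each request through the exact same transform used in training — can be sketched as a small request handler. The class and payload shape are hypothetical, not Deep Thought's real interface:

```python
# Sketch of an online inference handler (hypothetical interface): the
# deserialized training pipeline is loaded once and reused per request,
# so online scoring runs the same code path as training.
import json

class InferenceService:
    def __init__(self, model):
        self.model = model  # e.g. a pipeline deserialized from the registry

    def handle(self, request_body: str) -> str:
        features = json.loads(request_body)["features"]
        scores = self.model.transform(features)
        return json.dumps({"scores": scores})

class DoubleModel:
    """Stand-in model for the example."""
    def transform(self, xs):
        return [2 * x for x in xs]

svc = InferenceService(DoubleModel())
print(svc.handle('{"features": [1, 2]}'))  # → {"scores": [2, 4]}
```

In a Kubernetes deployment, each model would get its own pod running a handler like this, which is what gives per-model resource segregation and independent scaling.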

  48. Lifecycle (architecture overview, repeated; see slide 18)

  49. ML Automator Offline Training and Batch Inference

  50. ML Automator - Why
     Automated training, inference, and evaluation are necessary:
     ● Scheduling
     ● Resource allocation
     ● Saving results
     ● Dashboards and alerts
     ● Orchestration

  51. ML Automator
     Consistent: Docker + Bighead Library: same data source, pipeline, environment across the stack; integration with Zipline for training and scoring data
     Seamless: Automate tasks via Airflow: generate DAGs for training, inference, etc. with appropriate resources
     Scalable: Spark: distributed computing for large datasets
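"Generate DAGs for training, inference, etc." can be sketched as a function that emits an Airflow-style workflow spec per model; in a real deployment the spec would be rendered into actual Airflow `DAG`/operator objects. The task names and spec shape here are hypothetical:

```python
# Sketch: per-model DAG generation in the Airflow style. Task names and
# the spec format are hypothetical; a real system would build Airflow
# DAG objects from a spec like this.
def generate_training_dag(model_name: str, schedule: str) -> dict:
    tasks = ["fetch_features", "train", "evaluate", "publish_model"]
    return {
        "dag_id": f"ml_automator__{model_name}",
        "schedule": schedule,
        "tasks": tasks,
        # Linear ordering: each task depends on the previous one
        "dependencies": list(zip(tasks, tasks[1:])),
    }

dag = generate_training_dag("price_model", "@daily")
print(dag["dag_id"])  # → ml_automator__price_model
```

Generating the workflow from the registered model definition, rather than hand-writing it, is what keeps scheduling, retries, and alerting uniform across every model.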

  52. ML Automator

  53. Lifecycle (architecture overview, repeated; see slide 18)

  54. Zipline ML Data Management Framework
