training deep learning models at scale using kubernetes




  1. TRAINING DEEP LEARNING MODELS AT SCALE USING KUBERNETES Mitul Tiwari and Deepak Bobbarjung

  2. Introductions

  3. Outline • Conversational AI and Deep Learning • Need for a Jobs framework on Kubernetes • Our Jobs architecture

  4. Our Conversational AI Platform • AI/NLP: Entities, Intents, Attributes, Language Translation, Speech to text, Knowledge Base, Sentiment • Bot Builder: No Coding Required. • Bot Training: #1 AI/NLP Model. • Bot Deployment: Build Once, Deploy Everywhere.

  5. How to Make a Bot Intelligent? • Natural Language Understanding • Information Extraction • Entities • Intents • Actions • Natural Language Generation • Generating Response

  6. Deep Learning • Traditional Machine Learning • Human designed features and representations • Optimize weights to combine • Deep Learning • Deep Neural Network • Learn good features and multiple levels of representations

  7. Deep Learning for NLP • Language Translation • Image Captioning • Text Summarization • Parts-of-speech Tagging • Named Entity Recognition • Natural Language Generation • Question-Answering • Optical Character Recognition • Speech Recognition • Machine Reading Comprehension

  8. Neural Network for Word Embedding • Word Embedding: Word2Vec • Embed words in a continuous vector space • Semantically similar words are mapped to nearby points • Enables powerful operations • “King” - “Man” + “Woman” -> “Queen”
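The vector arithmetic on this slide can be sketched with a few hand-picked toy vectors. This is only an illustration: the 3-dimensional values below are invented for the example and are not Word2Vec output, and the `analogy` helper is a hypothetical name.

```python
import math

# Toy 3-dimensional vectors, hand-picked for illustration only.
# Real Word2Vec embeddings are learned from text and have hundreds of dimensions.
vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.9, 0.1, 0.8],
    "man":   [0.1, 0.9, 0.1],
    "woman": [0.1, 0.1, 0.9],
    "apple": [0.0, 0.2, 0.2],
}


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def analogy(a, b, c):
    """Return the word closest to vec(a) - vec(b) + vec(c), excluding the inputs."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    candidates = [w for w in vectors if w not in (a, b, c)]
    return max(candidates, key=lambda w: cosine(vectors[w], target))


result = analogy("king", "man", "woman")
print(result)
```

With these toy values the nearest remaining word to "king" - "man" + "woman" is "queen"; with learned embeddings the same lookup is done over the whole vocabulary.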

  9. Bag of Words - Curse of Dimensionality • Before word embeddings - Bag of words • Dictionary of words & counts in the text • Easy feature generation technique • Limitations • Hard to capture order of words • Curse of dimensionality - limited vocabulary - similar words don’t match
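A minimal sketch of the two limitations listed above, using only the Python standard library. The sentences and the `overlap` helper are made up for illustration.

```python
from collections import Counter


def bag_of_words(text):
    """Dictionary of words and their counts in the text."""
    return Counter(text.lower().split())


def overlap(a, b):
    """Number of word occurrences shared by two bags."""
    return sum((a & b).values())


a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")
c = bag_of_words("great movie")
d = bag_of_words("excellent film")

# Limitation 1: word order is lost, so opposite meanings give identical bags.
same_bag = a == b

# Limitation 2: similar words do not match, so near-synonymous texts share nothing.
shared = overlap(c, d)

print(same_bag, shared)
```

Word embeddings address the second limitation by placing "great" and "excellent" near each other instead of treating them as unrelated dictionary entries.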

  10. Word Embeddings Cont’d • Word Embedding: mapping words to a high-dimensional vector space, typically 200-500 dimensions, e.g., • W(‘King’) = (0.2, -0.4, 0.9, …) • W(‘Queen’) = (0.1, -0.3, 0.8, …) • Learn representations of words • How: two-layer NN that learns word representations by predicting the validity of phrases [Figure: example of similar word vectors]

  11. Sequence Learning: Response Generation • Automated Response Generation • Sequence-to-Sequence Model • Recurrent Neural Network (RNN) • Long Short Term Memory Network (LSTM) • Example: Gmail Smart Reply • Automated Response Suggestions

  12. Sequence Learning: RNN And LSTM • Recurrent Neural Network • Output of a module goes into a module of the same type (recurrent) • Good for capturing a sequence • Long Short Term Memory Network • Long running cell state: forget & add new values • Output: combination of cell state, previous output, and new input
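The LSTM bullets above can be made concrete with a single scalar cell step. This is a sketch of the standard LSTM equations with one-dimensional inputs and state; the weight values and the `lstm_step` name are assumptions for illustration, not taken from the talk.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One step of a scalar LSTM cell.

    w maps a gate name to (input weight, recurrent weight, bias).
    Returns the new output h and the new cell state c.
    """
    def gate(name, activation):
        w_x, w_h, bias = w[name]
        return activation(w_x * x + w_h * h_prev + bias)

    f = gate("forget", sigmoid)        # how much of the old cell state to keep
    i = gate("input", sigmoid)         # how much of the new candidate to add
    g = gate("candidate", math.tanh)   # candidate values to add
    o = gate("output", sigmoid)        # how much of the cell state to expose

    c = f * c_prev + i * g             # long-running cell state: forget & add
    h = o * math.tanh(c)               # output from cell state, previous output, new input
    return h, c


# Illustrative weights only; a trained network learns these.
weights = {name: (0.5, 0.5, 0.0)
           for name in ("forget", "input", "candidate", "output")}

h, c = 0.0, 0.0
for x in [1.0, 0.5, -1.0]:             # the recurrence: each step feeds the next
    h, c = lstm_step(x, h, c, weights)
print(h, c)
```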

  13. Training Deep Learning Models for NLP • Intent Classification • Deep Learning (LSTM) • Information Extraction • Named Entity Recognition (NER) • Slot attributes • Sentiment and Complaint Classifier • Knowledge Base & Semantic Search • Machine Reading Comprehension [Diagram: Text (Natural Language Understanding & Generation), Sentiment (Analysis for Complaints), Notifications (Targeted, Personalized, Timely Notification), Speech (Automatic Speech Recognition & Generation), built on Deep Learning and an Entity Graph & Knowledge Base]

  14. Scaling Training Deep Learning Models For NLP • Offline: Started with a script for training models • Run Time: A service for prediction during runtime (for example, Intent classification …) • However, the number of models is reaching into the thousands • Hard to manage a model training script for each bot [Diagram: Control Plane (Offline): Training Data, Train Models, Store Models; Conversation Plane (Run Time): Load Models, NLP Prediction, Users Interfaces]

  15. Outline • Conversational AI and Deep Learning • Need for a Jobs framework on Kubernetes • Our Jobs architecture

  16. Passage AI Architecture [Diagram: Conversation plane (Orchestration and AI/NLP; the user says “Hi”, Classify (Hi) returns the Greetings intent, and the bot replies “Welcome to..”); Control plane (API Service; Configure Bots, Train); Training plane (Bot training job); stores for Config Data, User Data, and the model]

  17. Passage AI Architecture (Jobs) Training plane

  18. When do we train a new model for a bot? • When a new bot is created • When a bot is changed (utterances are added or modified) • When new training data is available

  19. [Charts: # of Bots (axis 0 to 500) and # of Bot changes per day (axis 0 to 100), by month, August through November]

  20. Why do we need a Jobs framework? • Run jobs at scale • Eliminate out-of-band scripts that tend to become ‘tribal’. • APIs and UI for exposing jobs to our customers in our Bot Builder UI. • Reporting and auditing around jobs. [Diagram: Control plane API Service: Create job specification, Run a job, Show last 10 job runs]

  21. Why Kubernetes (K8S) For Our Microservices? • Scale and availability of our microservices [Diagram: Conversation plane Orchestration Service: Nginx, K8s service, K8s deployment of pods, K8s hpa]

  22. Why Kubernetes? (Contd) • Cloud Agnostic and On-prem ready [Diagram: the Conversation plane and Control plane installed in each of the Integration, Staging, Production, Standby, and On-Prem environments with $ helm install passage-ai]

  23. Why Create The Jobs Framework In Kubernetes? • Jobs should also be cloud-agnostic and on-prem ready • Handle scale and availability in the same way as our microservices • Same set of tools for monitoring, logging and auditing. [Diagram: Jobs Plane made up of pods]

  24. Outline • Conversational AI and Deep Learning • Need for a Jobs framework on Kubernetes • Our Jobs architecture

  25. Example Job types in our system • Training deep learning models • Extracting and indexing knowledge base articles • Nightly testing of our bots

  26. Job Specification and Job Object • Job Specification (Control plane API: Create a job spec) • Job type: “Training” • Schedule: “Every Monday at 1 AM” • Training specific params: bot ID, training data, priority • Job Object (Control plane API: Trigger a job from a job spec) • job_spec_id: <id> • progress: 25 • state: in_progress • data: <confusion matrix> • description: performing training
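One way to picture the two objects on this slide is as a pair of Python dataclasses. This is a sketch under assumptions: the field names follow the slide where it gives them (job_spec_id, progress, state, data, description), while the class names, the `trigger` helper, the initial "pending" state, and the sample parameter values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional
import uuid


@dataclass
class JobSpec:
    """A reusable description of a job: what to run, when, and with which params."""
    job_type: str                      # e.g. "Training"
    schedule: Optional[str]            # e.g. "Every Monday at 1 AM"; None for on-demand
    params: Dict[str, Any]             # type-specific params (bot ID, training data, priority)
    id: str = field(default_factory=lambda: uuid.uuid4().hex)


@dataclass
class Job:
    """One run of a job spec, with its progress and result."""
    job_spec_id: str
    state: str = "pending"             # assumed initial state, not from the slides
    progress: int = 0                  # percent complete
    description: str = ""
    data: Optional[Dict[str, Any]] = None   # e.g. a confusion matrix once training ends


def trigger(spec: JobSpec) -> Job:
    """Create a new job run from a job spec."""
    return Job(job_spec_id=spec.id)


spec = JobSpec(
    job_type="Training",
    schedule="Every Monday at 1 AM",
    params={"bot_id": "bot-123", "training_data": "utterances.csv", "priority": 1},
)
job = trigger(spec)
job.state = "in_progress"
job.progress = 25
job.description = "performing training"
print(job)
```

Keeping the spec separate from the run is what makes "show last 10 job runs" a simple query: many Job records point back to one JobSpec.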

  27. Jobs Architecture [Diagram: Control plane: API Service with jobs and job_specs stores; Trigger training job from job spec; Create a Job (params). Jobs plane: Jobs Service adds an item on a queue (KB Index Q, training Q, Bot Testing Q); pods pick up items from their queue and update job progress; the training pods (Pod1 to Pod4, some with gpu) run as a K8s deployment of training pods]
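The flow in this diagram (create a job, add an item to a per-type queue, a pod picks it up, progress is written back) can be sketched in-process with the standard library. This is an analogy, not the authors' implementation: threads stand in for pods, `queue.Queue` stands in for whatever queueing system they used, and the dictionary stands in for the jobs store. All names are hypothetical.

```python
import queue
import threading

# One queue per job type, mirroring the KB Index, training, and Bot Testing queues.
queues = {
    "training": queue.Queue(),
    "kb_index": queue.Queue(),
    "bot_testing": queue.Queue(),
}
jobs = {}                      # stand-in for the jobs store
lock = threading.Lock()


def create_job(job_id, job_type, params):
    """Jobs Service: record the job, then add an item on the matching queue."""
    with lock:
        jobs[job_id] = {"state": "queued", "progress": 0, "params": params}
    queues[job_type].put(job_id)


def update_progress(job_id, state, progress):
    """Worker-to-service callback: update job progress."""
    with lock:
        jobs[job_id].update(state=state, progress=progress)


def worker(job_type):
    """A 'pod': pick up items from its queue until it receives a None sentinel."""
    q = queues[job_type]
    while True:
        job_id = q.get()
        if job_id is None:
            q.task_done()
            break
        update_progress(job_id, "in_progress", 25)
        # The actual training work would run here.
        update_progress(job_id, "done", 100)
        q.task_done()


# Two training "pods" working through four jobs.
threads = [threading.Thread(target=worker, args=("training",)) for _ in range(2)]
for t in threads:
    t.start()
for n in range(4):
    create_job(f"job-{n}", "training", {"bot_id": f"bot-{n}"})
queues["training"].join()      # wait until every queued job has been processed
for _ in threads:
    queues["training"].put(None)
for t in threads:
    t.join()
print(jobs)
```

Because workers only pull from a queue, adding capacity means adding more workers, which is the property a Kubernetes deployment of training pods provides.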

  28. Jobs Architecture (scheduled jobs) [Diagram: Control plane: API Service with jobs and job_specs stores; Get job status. Jobs plane: a Scheduler (CronJob) triggers a job from a job spec through the Jobs Service; items go on the KB Index Q, training Q, and Bot Testing Q; pods pick them up and update job progress; the training pods run as a K8s deployment of training pods]
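As a rough illustration of the CronJob in this diagram, the sketch below builds a Kubernetes CronJob manifest for the "Every Monday at 1 AM" schedule from the job specification slide, which is `0 1 * * 1` in cron syntax. The manifest fields follow the current `batch/v1` CronJob API; the resource name, container image, and arguments are hypothetical placeholders, and how the authors' scheduler actually calls the Jobs Service is not stated in the slides.

```python
import json


def cronjob_manifest(name, schedule, job_spec_id):
    """Build a batch/v1 CronJob manifest that runs a trigger container on a schedule."""
    return {
        "apiVersion": "batch/v1",
        "kind": "CronJob",
        "metadata": {"name": name},
        "spec": {
            "schedule": schedule,
            "jobTemplate": {
                "spec": {
                    "template": {
                        "spec": {
                            "restartPolicy": "OnFailure",
                            "containers": [
                                {
                                    "name": "trigger",
                                    # Hypothetical image that asks the Jobs Service
                                    # to trigger a job from the given job spec.
                                    "image": "example.com/job-trigger:latest",
                                    "args": ["--job-spec-id", job_spec_id],
                                }
                            ],
                        }
                    }
                }
            },
        },
    }


# "Every Monday at 1 AM": minute 0, hour 1, any day of month, any month, weekday 1.
manifest = cronjob_manifest("weekly-training", "0 1 * * 1", "spec-123")
print(json.dumps(manifest, indent=2))
```

The printed JSON can be saved to a file and applied with kubectl, which accepts JSON as well as YAML manifests.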

  29. Alternatives that we considered • Apache Airflow • Azkaban

  30. Thank You Mitul Tiwari Deepak Bobbarjung mitul@passage.ai deepak@passage.ai @Mitultiwari @Bobbarjung
