jarvis and nemo
Jarvis and NeMo, GTC China


  1. Jarvis and NeMo GTC China

  2. Jarvis

  3. JARVIS
      Platform to develop and deploy conversational AI applications
      Designed for sensor fusion: gaze and speech
      https://www.youtube.com/watch?v=r264lBi1nMU

  4. USE CASES ACROSS ALL VERTICALS
      Verticals: Online Store, Industrial, Finance, Energy / Oil & Gas, Consumer Internet, In-car experience
      • Provide conversational interface for shopping
      • Collaborative robots: robots and humans collaborate in close proximity
      • Call center: sentiment of customers calling
      • Use camera and ask, “What are the safety guidelines for this chemical?”
      • Video diarization: meeting/conversation transcription per person with timestamps
      • Autonomous driving: enhanced in-car experience combining visual inputs with speech
      • Insurance chatbot: “Add a wedding ring to an insurance policy via an image and receive policy price quote”
      • Engineer troubleshooting with the help of AI assistant
      • Loud environment: virtual assistant using lip reading
      • Content tagging with image, text, audio: recommendation, ads

  5. CHALLENGES OF CONVERSATIONAL AI
      • Deployment: existing software not designed for modern production environments
      • Multiple sensors: difficult to use multiple sensors efficiently
      • Real time: requires low latency for natural interaction
      • Custom models: cloud services not customizable; high costs; data sovereignty
      • High accuracy: need state-of-the-art algorithms and models

  6. JARVIS BENEFITS
      • Deployment: micro-service approach; designed for K8s; simple APIs, easy to integrate
      • Multiple sensors: framework for training and deploying models across modalities; tools to simplify fusion
      • Real time: end-to-end inference on GPUs optimized to reduce latency
      • Custom models: start from base model, train with your data on your infrastructure
      • High accuracy: best-in-breed algorithms; direct access to cutting-edge research

  7. JARVIS WORKFLOW OVERVIEW
      Diagram labels:
      • JARVIS AI Services: gaze detection, pose estimation, wake word, speech recognition, intent classification, speech synthesis, object detection, lip activity
      • Pretrained models; data for customizing; fine-tuning
      • Jarvis Core, reached via gRPC and a Python client library
      • Client Application; (client); (optional)
      • End users; multiple sensor input
      • Sensor Fusion, Dialog Manager, Backend fulfillment

  8. JARVIS WORKFLOW OVERVIEW (same diagram labels as slide 7)

  9. Visual Diarization
      Multiple speaker transcription based on video and audio streams
      Interaction: Jupyter notebook with live video stream overlaying gaze detection and lip activity detection and producing a text transcript per person from the audio stream
      Technology of sensor fusion:
      ● Video stream
        ○ Gaze detection to engage the system
        ○ Lip activity to determine who is speaking
      ● Audio stream
        ○ Transcribe the audio
        ○ Label transcriptions per individual speaker
      Implementation:
      ● Fusion graph via JSON to combine the multiple inference models
      ● gRPC end points for direct interaction with the inference models
      ● Jupyter notebook demonstrates Python APIs for interaction
      Transcription:
        Driver: Where is a good sushi restaurant?
        Passenger: What’s the weather in Chicago
      Model Developer: Improve the conversational model accuracy via fine-tuning with NeMo
      Developer Operations: Deploy via docker containers from NGC into Kubernetes (EGX)
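
The per-speaker labelling step on the Visual Diarization slide can be illustrated with a small sketch. This is not the Jarvis fusion graph or any Jarvis API. It assumes the video models have already produced lip-activity time intervals per person and that ASR has produced timestamped text segments. The names `Interval`, `overlap`, and `label_segments` are hypothetical, and the timings are made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Interval:
    """A time span in seconds."""
    start: float
    end: float


def overlap(a: Interval, b: Interval) -> float:
    """Length in seconds of the intersection of two intervals (0 if disjoint)."""
    return max(0.0, min(a.end, b.end) - max(a.start, b.start))


def label_segments(segments, lip_activity):
    """Attach a speaker to each transcribed segment.

    segments: list of (Interval, text) pairs from the audio stream.
    lip_activity: dict mapping speaker name to a list of Intervals
                  during which that person's lips were moving.
    A segment goes to the speaker whose lip activity overlaps it most;
    segments with no overlap at all are labelled "unknown".
    """
    labelled = []
    for span, text in segments:
        best_speaker, best_overlap = "unknown", 0.0
        for speaker, intervals in lip_activity.items():
            total = sum(overlap(span, iv) for iv in intervals)
            if total > best_overlap:
                best_speaker, best_overlap = speaker, total
        labelled.append((best_speaker, text))
    return labelled


# Hypothetical timings for the two utterances shown on the slide.
lip_activity = {
    "Driver": [Interval(0.0, 2.1)],
    "Passenger": [Interval(2.5, 4.4)],
}
segments = [
    (Interval(0.1, 2.0), "Where is a good sushi restaurant?"),
    (Interval(2.6, 4.3), "What's the weather in Chicago"),
]
transcript = label_segments(segments, lip_activity)
for speaker, text in transcript:
    print(f"{speaker}: {text}")
```

Maximum overlap is only one possible fusion rule. A real system would also have to handle overlapping speech and detection noise.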

  10. Jarvis ASR Service
      Jarvis ASR TRTIS pipeline:
      Audio → Feature Extractor (pre-processing) → Jasper (acoustic model) → End of Sentence Detector (post-processing) → Greedy or Beam Decoder (post-processing) → BERT-based Punctuator (post-processing) → Text
      Backend labels: TRTIS custom backend on GPU; TRT on GPU; TRTIS custom backend, N-gram language model; TRT on GPU
      Jarvis ASR API
      Method Name          Description
      Recognize            Given audio file as input, return transcript
      StreamingRecognize   Process audio from a file or a microphone as it’s being captured, returning partial transcripts

  12. Jarvis – Weather Bot Architecture
      Deployment of Jarvis components with simple dialog manager
      Diagram labels:
      • Chat Application: spoken input, audio response
      • Jarvis Service ASR: spoken input to text
      • Dialog Manager: state of conversation; route text to services; pass commands to fulfillment engine
      • Jarvis Service Intent & Entity: text to intent and slots
      • Fulfillment Engine (weather query, etc.): action, result
      • Jarvis Service TTS: text response to audio
      • NeMo (offline): trained model weights for domain-specific Intent & Entity
      • Dialog Authoring (offline): description of dialog states, transitions, response templates
      • Legend: NVIDIA; Chat Application (e.g. iFlyTek)
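
The dialog manager's three duties on this slide (track conversation state, route text to services, pass commands to the fulfillment engine) can be sketched as below. Everything here is a stand-in: `classify` is a hypothetical regex replacement for the Intent & Entity service, `fulfil` reads a hard-coded `FAKE_WEATHER` table instead of calling a weather backend, and none of these names come from Jarvis.

```python
import re

# Hypothetical canned data standing in for a real weather backend.
FAKE_WEATHER = {"chicago": "sunny, 20 C"}


def classify(text):
    """Stand-in for the intent and entity service: returns (intent, slots)."""
    match = re.search(r"weather in ([A-Za-z ]+)", text, re.IGNORECASE)
    if match:
        return "weather.query", {"location": match.group(1).strip()}
    return "unknown", {}


def fulfil(intent, slots):
    """Stand-in for the fulfillment engine: returns a result or None."""
    if intent == "weather.query":
        return FAKE_WEATHER.get(slots["location"].lower())
    return None


class DialogManager:
    """Keeps conversation state and routes text between the services."""

    def __init__(self):
        self.history = []  # state of conversation: (text, intent, reply)

    def handle(self, text):
        intent, slots = classify(text)   # route text to intent service
        result = fulfil(intent, slots)   # pass command to fulfillment
        if intent == "weather.query" and result is not None:
            reply = f"The weather in {slots['location']} is {result}."
        elif intent == "weather.query":
            reply = f"Sorry, I have no weather data for {slots['location']}."
        else:
            reply = "Sorry, I did not understand that."
        self.history.append((text, intent, reply))
        return reply


dm = DialogManager()
reply_known = dm.handle("What's the weather in Chicago?")
reply_unknown = dm.handle("Tell me a joke")
print(reply_known)
print(reply_unknown)
```

In the architecture on the slide, the text passed to `handle` would come from the ASR service and the reply would be sent to the TTS service. Both are omitted here.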

  13. Neural Modules (NeMo)

  14. CONVERSATIONAL AI WORKFLOW (same diagram labels as slide 7)

  15. CONVERSATIONAL AI WORKFLOW (same diagram labels as slide 7, with NeMo added)

  16. NEMO: TRAINING CONVERSATIONAL AI MODELS
      • Open source deep learning Python toolkit for training speech and language models
      • High performance training on NVIDIA GPUs
        • Uses TensorCores
        • Multi-GPU
        • Multi-Node
      • Based on concept of Neural Module – reusable high level building block for defining deep learning models
      • PyTorch backend (TensorFlow on Roadmap)
      Stack diagram, top to bottom:
      • Pretrained Models per module
      • Neural Modules Collection Libraries: Voice Recognition, Natural Language, Speech Synthesis
      • Neural Modules Core: Mixed Precision, Distributed training, Semantic checks
      • Optimized Framework
      • Accelerated Libraries: CUDA, cuBLAS, cuDNN etc.

  17. NEMO COLLECTIONS
      nemo_asr (Speech Recognition), installed with: pip install nemo_asr
      • Jasper acoustic model
      • QuartzNet acoustic model
      • RNN with attention
      • Transformer-based
      • English and Mandarin tokenizers and dataset importers
      nemo_nlp (Natural Lang Processing), installed with: pip install nemo_nlp
      • BERT pre-training & finetuning
      • GLUE tasks
      • Language modeling
      • Neural Machine Translation
      • Intent classification & slot filling
      • ASR spell correction
      • Punctuation
      • English and Mandarin dataset importers
      nemo_tts (Speech Synthesis), installed with: pip install nemo_tts
      • Tacotron 2
      • WaveGlow
      • English and Mandarin output and datasets importers

  18. NEMO EXAMPLE: JASPER ASR
      Diagram of neural modules and the ports that connect them:
      • Audio To Text Data Layer: outputs AUDIO, AUDIO LEN, TEXT, TEXT LEN
      • Audio Preprocessing: AUDIO, AUDIO LEN → SPECT, SPECT LEN
      • Jasper Encoder: SPECT, SPECT LEN → ENC, ENC LEN
      • Jasper Decoder For CTC: ENC → LOG PROB
      • Greedy CTC Decoder: LOG PROB → PREDICT
      • CTC Loss: LOG PROB, LOG PROB LEN, TEXT, TEXT LEN → LOSS
      • Logging Callback
      • Train Action (invokes)
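
The Greedy CTC Decoder box in this diagram turns per-frame log probabilities into text. A minimal plain-Python sketch of greedy CTC decoding follows; it is not the NeMo module. The function `greedy_ctc_decode`, the toy vocabulary, and the helper `frame` are hypothetical, and placing the blank symbol at the last index is an assumption made for the example.

```python
import math


def greedy_ctc_decode(log_probs, labels, blank_id):
    """Greedy CTC decoding.

    log_probs: list of frames, each a list of log probabilities over
               len(labels) + 1 classes (the extra class is the blank).
    Steps: take the most likely class per frame, collapse consecutive
    repeats, then drop blanks.
    """
    best_path = [max(range(len(f)), key=f.__getitem__) for f in log_probs]
    decoded = []
    previous = None
    for idx in best_path:
        if idx != previous and idx != blank_id:
            decoded.append(labels[idx])
        previous = idx
    return "".join(decoded)


# Toy vocabulary; the blank is assumed to be the last class index.
labels = [" ", "a", "c", "t"]
BLANK = len(labels)


def frame(best, num_classes=len(labels) + 1):
    """Build one frame of log probabilities that peaks at class `best`."""
    probs = [0.05] * num_classes
    probs[best] = 0.8
    return [math.log(p) for p in probs]


# Frame-wise best path: c c _ a _ t t
cat_frames = [frame(i) for i in (2, 2, BLANK, 1, BLANK, 3, 3)]
decoded_cat = greedy_ctc_decode(cat_frames, labels, BLANK)

# A blank between two identical symbols keeps both: a _ a
aa_frames = [frame(i) for i in (1, BLANK, 1)]
decoded_aa = greedy_ctc_decode(aa_frames, labels, BLANK)

print(decoded_cat, decoded_aa)
```

Beam search with an N-gram language model, the alternative named on the Jarvis ASR slide, keeps several candidate paths instead of only the per-frame maximum.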

  19. NEMO EXAMPLE: JASPER ASR
      • Create modules
      • Connect them
      • Define training and evaluation actions
      Reference: “Jasper: An End-to-End Convolutional Neural Acoustic Model” by Li et al., INTERSPEECH 2019, https://arxiv.org/pdf/1904.03288.pdf

  20. ASR COMPARISONS
      English LibriSpeech dataset, %WER
      Model                                    Language Model   Test-Clean   Test-Other   Params, M
      DeepSpeech 2                             5-gram           5.33         13.25        >70
      wav2letter++                             ConvLM           3.26         10.47        208
      Listen-Attend-Spell (with SpecAugment)   RNN              2.5          5.8          360
      Jasper 10x5                              -                3.77         11.08        333
      Jasper 10x5                              6-gram           3.19         9.03         333
      Jasper 10x5                              Transformer-XL   2.86         8.17         333
      QuartzNet 15x5                           -                3.90         11.28        19
      QuartzNet 15x5                           6-gram           2.96         8.07         19
      QuartzNet 15x5                           Transformer-XL   2.69         7.25         19
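
The %WER columns above report word error rate. As a reminder of how that metric is commonly computed, here is a small sketch: word-level edit distance divided by the number of reference words. The function name `word_error_rate` is hypothetical, and the slide does not say which scoring tool or text normalization produced its numbers.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words.

    Computed with the standard Levenshtein dynamic programme over words,
    keeping only the previous row of the distance table.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distance from empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]  # distance from this reference prefix to empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


# One deleted word ("a") and one substituted word: 2 errors over 6 words.
wer = word_error_rate("where is a good sushi restaurant",
                      "where is good sushi restaurants")
print(f"{100 * wer:.1f}% WER")
```

Because insertions count as errors, WER can exceed 100% when the hypothesis is much longer than the reference.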

  21. DOMAIN SPECIFIC ASR
      Jupyter notebook transfer learning tutorial
      • Start with pretrained base QuartzNet model
      • Fine tune with WSJ data (newspaper read aloud)
      • Add custom language model to base model
      • Add custom language model to fine-tuned model for best performance
      • Achieves Word Error Rate of < 2.5!
      Chart labels: Pretrained base model; Fine-tuned acoustic model; Pretrained acoustic model + custom language model; Fine-tuned acoustic model + custom language model
      Tutorial here: https://ngc.nvidia.com/catalog/containers/nvidia:nemo_asr_app_img

  22. TRANSFER LEARNING CUSTOMER STORY
      An S&P Global Company
      ● S&P Global produces transcriptions of earnings calls – 10,000 hours of high quality data
      ● Scribe application works with ASR models
      ● Recognizes domain specific financial jargon
      ● Additional language models provide meta tags, punctuation
      GTC Talk: https://events.rainfocus.com/widget/nvidia/gtcdc19/catalog-short?search=nemo

  23. KENSHO ASR RESULTS
      ● QuartzNet trained on domain specific financial data outperformed all leading ASR models
      ● Fine tuning was faster and had higher accuracy than training from scratch

  24. JARVIS AND NEMO TOGETHER
