
End-to-End AI Speech in DiDi - From Algorithm to Application



  1. End-to-End AI Speech in DiDi - From Algorithm to Application (DiDi End-to-End Speech AI in Practice: From Algorithm to Implementation) lixiangang@didiglobal.com pengyiping@didiglobal.com

  2. MORE THAN A JOURNEY

  3. Speech Processing & NLP Layout • Speech Recognition (Understand What You Say): Speech can be converted into text. Voices of different people can be identified. Different voices can be categorized. • Language Understanding (Think What You Think): Natural language processing, including text categorization, syntax parsing, intention recognition and semantic comprehension, etc. • Speech Synthesis (Say What You Say): Text can be converted into speech. Music notes can be converted into songs.

  4. Driver Care Assistant Intelligent Bot

  5. Driver Care Assistant Intelligent Bot

  6. Intelligent Assistant • ASR • NLU • Recognize Questions • Provide Solutions • Generate Abstracts

  7. Voice Interactive • ASR • NLU • TTS • Japan & Australia: Accept Orders • China: Cancel Orders

  8. Voice Interactive

  9. MORE THAN A JOURNEY

  10. Contents • Algorithm and computation capability are key to deploying speech AI in production. • From algorithm to implementation, we will briefly walk through our most important speech AI applications. ◦ Speech & NLP applications: smart customer service assistant, responsibility judgement, voice interaction ◦ Speech technology: ASR, signal processing, speaker identification, emotion recognition ◦ How GPU enables our applications: graph optimization, online serving, DELTA

  11. Attentional ASR • Dictionary ◦ The modeling units for Mandarin Chinese ASR: Word (北京); Character (北 京); Syllable (bei jing); Initial-final/phones (b ei j ing) ◦ Characters are usually selected as the basic modeling units • Language Model ◦ How to benefit from a large text corpus without an N-gram model? ◦ We pre-train an RNN-LM and then merge it into the acoustic neural network

  12. End-to-end speech recognition • End-to-end is a relative concept; modeling units range from phoneme to syllable/character • DNN-HMM: we need decision-tree based state clustering, a dictionary and a language model • RNN-CTC: we need a dictionary and a language model (if we use the cd-phone as modeling units, we still need decision-tree based state clustering); N-gram based language models would improve the performance • RNN-Attention: we do not need extra models

  13. Attentional ASR • Sequence-to-sequence model from translation

  14. Listen-Attend-Spell • Encoder • Listen, map the input feature sequence to embedding • Decoder • Spell, map the embedding based on the attention information to the output symbols
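
The encoder/decoder split above can be illustrated with a single attention step. This is a minimal sketch, not the model from the talk: the vectors are toy values, the names `attend`, `encoder_outputs` and `decoder_state` are hypothetical, and plain dot-product scoring is an assumption made for brevity.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(decoder_state, encoder_outputs):
    """One attention step: weight encoder frames by similarity to the decoder state."""
    # Assumption: dot-product scoring between the decoder state and each frame.
    scores = [sum(d * h for d, h in zip(decoder_state, frame))
              for frame in encoder_outputs]
    weights = softmax(scores)
    dim = len(encoder_outputs[0])
    # Context vector: attention-weighted sum of the encoder frames.
    context = [sum(w * frame[i] for w, frame in zip(weights, encoder_outputs))
               for i in range(dim)]
    return weights, context

# Toy "listener" output: three encoded frames of dimension 2.
encoder_outputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Toy "speller" state for the current output step.
decoder_state = [2.0, 0.0]

weights, context = attend(decoder_state, encoder_outputs)
print(weights, context)
```

In a full model the context vector would be fed to the decoder together with the previous output symbol to predict the next one.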

  15. Attention vs. CTC • Advantages ◦ No conditional independence assumption ◦ Joint learning of acoustic and language information ◦ The speech recognition system is simpler • Disadvantages ◦ Hard to converge; more tricks are needed to train an attention model ◦ Cannot be used for "streaming" speech recognition: during inference, the model can produce the first output token only after all input speech frames have been consumed
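
The conditional independence point can be written as a pair of factorizations. This is a sketch in standard textbook notation, not taken from the slides: x is the input feature sequence with T frames, y the output sequence with U symbols, π a frame-level CTC path, and B the CTC collapsing function that removes repeats and blanks.

```latex
\begin{align}
P_{\mathrm{CTC}}(y \mid x) &= \sum_{\pi \in \mathcal{B}^{-1}(y)} \prod_{t=1}^{T} P(\pi_t \mid x) \\
P_{\mathrm{Att}}(y \mid x) &= \prod_{u=1}^{U} P(y_u \mid y_{<u},\, x)
\end{align}
```

In the CTC form each frame-level term depends only on x, so outputs are independent of one another given the input. In the attention form each symbol is conditioned on all previous symbols, which is what lets the model learn language information jointly with the acoustics.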

  16. Listen-Attend-Spell • Hard to train – many "tricks" • Scheduled sampling • Label smoothing (2016) • Multi-Task Learning (2017) • Multi-headed Attention (2018) • SpecAugment (2019) ◦ Data augmentation applied to LAS ◦ Achieved state-of-the-art results on LibriSpeech and Switchboard (SWBD)
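
As an illustration of the last trick in the list, the masking part of SpecAugment can be sketched in a few lines. This is a simplified sketch under stated assumptions: it applies one time mask and one frequency mask and omits time warping, and the function name `spec_augment` and the mask-width parameters are hypothetical values chosen for the toy input.

```python
import random

def spec_augment(features, max_time_mask=2, max_freq_mask=2, rng=None):
    """Return a copy of `features` (frames x bins) with one time mask
    and one frequency mask set to 0.0."""
    rng = rng or random.Random()
    num_frames, num_bins = len(features), len(features[0])
    out = [row[:] for row in features]  # do not modify the input

    # Time mask: zero out t consecutive frames starting at t0.
    t = rng.randint(0, min(max_time_mask, num_frames))
    t0 = rng.randint(0, num_frames - t)
    for i in range(t0, t0 + t):
        out[i] = [0.0] * num_bins

    # Frequency mask: zero out f consecutive bins starting at f0 in every frame.
    f = rng.randint(0, min(max_freq_mask, num_bins))
    f0 = rng.randint(0, num_bins - f)
    for row in out:
        for j in range(f0, f0 + f):
            row[j] = 0.0
    return out

# Toy spectrogram: 10 frames, 8 frequency bins, all ones.
features = [[1.0] * 8 for _ in range(10)]
augmented = spec_augment(features, rng=random.Random(0))
for row in augmented:
    print(row)
```

Because the masks are drawn anew on every call, the same utterance looks different each time it is presented to the model during training.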

  17. Speech-Transformer • Speech Transformer • Transformer applied to ASR • With Conv layers as inputs

  18. Speech-Transformer • Speech Transformer • Transformer applied to ASR • With Conv layers as inputs

  19. Speech-Transformer • Speech Transformer • Transformer applied to ASR • With Conv layers as inputs • Time-restricted self-attention • Left & Right Contexts restricting the attention mechanism
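
Time-restricted self-attention can be pictured as a band-shaped mask over the attention scores. The sketch below is illustrative only: the function names and the context sizes (`left=1`, `right=1`) are hypothetical, and a real model would apply the mask inside each attention head.

```python
import math

def time_restricted_mask(num_frames, left, right):
    """mask[i][j] is True when frame i may attend to frame j."""
    return [[(i - left) <= j <= (i + right) for j in range(num_frames)]
            for i in range(num_frames)]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores, ignoring disallowed positions."""
    masked = [s if ok else float("-inf") for s, ok in zip(scores, allowed)]
    m = max(masked)
    exps = [math.exp(s - m) for s in masked]  # exp(-inf) evaluates to 0.0
    total = sum(exps)
    return [e / total for e in exps]

mask = time_restricted_mask(num_frames=5, left=1, right=1)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
# xx...
# xxx..
# .xxx.
# ..xxx
# ...xx

# With equal scores, frame 2 spreads its attention over frames 1, 2 and 3 only.
weights = masked_softmax([0.0] * 5, mask[2])
print(weights)
```

Restricting the right context bounds how many future frames each position needs, which limits the look-ahead of the encoder.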

  20. Unsupervised pre-training for speech-transformer • Pre-training: • Like BERT in NLP, e.g. Mask Predictive Coding • Fine-tuning: • Plug in a decoder

  21. Unsupervised pre-training for speech-transformer • Mask Predictive Coding: • mask 15% of all frames in each sequence at random, and only predict the masked frames rather than reconstructing the entire input • Dynamic Masking: • Like RoBERTa, masking strategies are not decided in advance • Down-sampling: • Local smoothness of speech makes learning too easy without down-sampling. Eight-fold down-sampling is used, like LAS.
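
The three ingredients above can be combined in a small data-preparation sketch. Several details are assumptions rather than statements from the slides: down-sampling is done here by stacking eight consecutive frames, masked frames are replaced by zeros, and the loss is an L1 distance; all function names are hypothetical.

```python
import random

def downsample_by_stacking(frames, factor=8):
    """Concatenate every `factor` consecutive frames into one longer frame.
    Leftover frames at the end are dropped."""
    usable = len(frames) - len(frames) % factor
    return [sum(frames[i:i + factor], []) for i in range(0, usable, factor)]

def dynamic_mask(frames, mask_prob=0.15, rng=None):
    """Pick about `mask_prob` of the frames at random on every call
    (dynamic masking) and replace them with zeros."""
    rng = rng or random.Random()
    num_masked = max(1, round(mask_prob * len(frames)))
    masked_idx = sorted(rng.sample(range(len(frames)), num_masked))
    masked_set = set(masked_idx)
    masked_input = [[0.0] * len(f) if i in masked_set else f[:]
                    for i, f in enumerate(frames)]
    return masked_input, masked_idx

def masked_l1_loss(predicted, target, masked_idx):
    """Average L1 distance, computed on the masked positions only."""
    total = sum(abs(p - t)
                for i in masked_idx
                for p, t in zip(predicted[i], target[i]))
    count = sum(len(target[i]) for i in masked_idx)
    return total / count

# Toy input: 160 frames of dimension 2.
raw = [[float(t), float(t) + 0.5] for t in range(160)]
stacked = downsample_by_stacking(raw)            # 20 frames of dimension 16
masked_input, masked_idx = dynamic_mask(stacked, rng=random.Random(0))

# A model that just echoes its (zeroed) input is penalised on masked frames.
print(masked_idx, masked_l1_loss(masked_input, stacked, masked_idx))
```

Calling `dynamic_mask` again with a fresh random state yields a different set of masked frames for the same utterance, which is the point of deciding the masks at training time rather than in advance.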

  22. Unsupervised pre-training for speech-transformer

  23. Related topics: signal processing for noise and far-field • AEC • De-reverb • BSS • AGC • NS • Beamforming [Figure: beamformer block diagram with inputs x(k), fixed filters 0 to N-1, a delay Z^-L, a BM block, adaptive filters on u_0(k) to u_M(k), and output y(k)]

  24. • Acoustic Echo Cancellation • Noise suppression • Beamforming / Blind source separation • Auto Gain Control • Original speech: "Note: The reasons for this dive seemed foolish now. His captain was thin and haggard and his beautiful boots were worn and shabby." • Processed speech: "Production may fall far below expectations."

  25. Multimodality: speech + text • Speech → ASR model → Text → NLP model → Intent/slot… • Can we utilize speech information or even build an end-to-end model?

  26. Multimodal speech emotion recognition • Aims to automatically identify the emotional state of a human being from his or her voice. • Audio signal features → speech emotion recognition • Transcribed text → sentiment analysis • Combination of audio and text → multimodal methods https://www.cs.cmu.edu/~morency/MMML-Tutorial-ACL2017.pdf

  27. Multimodal emotion recognition • Motivation (Xu et al., Interspeech, 2019) • Existing methods ignore the temporal relationship between speech and text at a fine-grained level • A multimodal system will benefit from using the alignment information, since speech and text inherently co-exist in the temporal dimension • Utilize an attention network to learn the alignment between speech and text

  28. Learning alignment between speech and text • Speech encoder and text encoder ◦ Speech + ASR-recognized text ◦ BLSTM for each • Alignment ◦ Use attention to learn the alignment weights between speech frames and text words ◦ Concatenate the aligned features into a multimodal feature for classification
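
The alignment step can be sketched as attention from each text word over the speech frames, followed by concatenation. This is a toy illustration, not the published model: fixed vectors stand in for the BLSTM encoder outputs, dot-product scoring is an assumption, and the name `align_speech_to_text` is hypothetical.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def align_speech_to_text(speech_frames, word_vectors):
    """For each word, attend over all speech frames and concatenate the
    attention-weighted speech summary to the word vector."""
    dim = len(speech_frames[0])
    fused = []
    for word in word_vectors:
        # Alignment weights between this word and every speech frame.
        scores = [sum(w * s for w, s in zip(word, frame))
                  for frame in speech_frames]
        weights = softmax(scores)
        speech_summary = [sum(a * frame[k] for a, frame in zip(weights, speech_frames))
                          for k in range(dim)]
        fused.append(word + speech_summary)  # list concatenation
    return fused

# Stand-ins for encoder outputs: 4 speech frames and 2 words, dimension 2 each.
speech_frames = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
word_vectors = [[5.0, 0.0], [0.0, 5.0]]

fused = align_speech_to_text(speech_frames, word_vectors)
for vector in fused:
    print(vector)
```

In this toy input the first word attends mostly to the first two frames and the second word to the last two, so each fused vector pairs a word with the stretch of speech it aligns to. The fused sequence would then go to a classifier.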

  29. Speaker Identification Algorithm • Classical i-vector ◦ GMM-UBM ◦ A generative model • d-vector ◦ Sum over all hidden outputs ◦ Somewhat text-dependent • x-vector based models ◦ Statistical pooling ◦ Faster to compute ◦ More robust to noise ◦ Better on short utterances ◦ Scales better on large data [Figure: d-vector network: filterbank features → stacked FC layers → XENT → speaker ID. x-vector network: filterbank features → frame-level TDNN layers → stats pooling → utterance-level FC layers → XENT → speaker ID]
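
The statistical pooling step that turns frame-level outputs into one utterance-level vector can be sketched as follows. This is a minimal illustration: it pools the mean and the population standard deviation per dimension, and the function name `stats_pooling` and the toy inputs are hypothetical.

```python
import math

def stats_pooling(frame_features):
    """Map a variable-length list of frame vectors to one fixed-length
    vector: per-dimension mean followed by per-dimension standard deviation."""
    n = len(frame_features)
    dim = len(frame_features[0])
    means = [sum(f[k] for f in frame_features) / n for k in range(dim)]
    stds = [math.sqrt(sum((f[k] - means[k]) ** 2 for f in frame_features) / n)
            for k in range(dim)]
    return means + stds

short_utt = [[1.0, 2.0], [3.0, 6.0]]   # 2 frames
long_utt = short_utt * 50              # 100 frames

print(stats_pooling(short_utt))  # [2.0, 4.0, 1.0, 2.0]
print(len(stats_pooling(long_utt)))
```

The output length depends only on the feature dimension, not on the number of frames, so utterances of any duration feed the same utterance-level layers.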

  30. Speaker Identification Algorithm • State-of-the-art performance ◦ Based on additive angular margin loss ◦ Significantly lower error rate • Development: 3000k utterances, 200k speakers • Training: 12.5k utterances, 2k speakers • Testing: 12.5k utterances, same 2k speakers [Figure: EER comparison for different models (ivector, xvector, GE2E) against data size from 10k to 3000k; labeled EER values lie between 2.542 and 10.762]
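
The additive angular margin loss named above can be sketched for a single training example. This is an illustrative sketch, not the production loss: the margin and scale defaults are hypothetical, and the handling of angles that exceed π after the margin is added, which practical implementations need, is left out.

```python
import math

def aam_softmax_loss(embedding, class_weights, target, margin=0.2, scale=30.0):
    """Cross-entropy over scaled cosine logits, with an angular margin
    added to the target speaker's angle."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    e = normalize(embedding)
    logits = []
    for idx, w in enumerate(class_weights):
        cos = sum(a * b for a, b in zip(e, normalize(w)))
        cos = max(-1.0, min(1.0, cos))  # guard acos against rounding error
        if idx == target:
            # Penalise the target class: cos(theta + m) instead of cos(theta).
            cos = math.cos(math.acos(cos) + margin)
        logits.append(scale * cos)

    # Numerically stable log-sum-exp, then negative log-likelihood of the target.
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_sum - logits[target]

embedding = [1.0, 0.2]                  # toy speaker embedding
class_weights = [[1.0, 0.0], [0.0, 1.0]]  # one weight vector per speaker

loss_plain = aam_softmax_loss(embedding, class_weights, target=0, margin=0.0, scale=10.0)
loss_margin = aam_softmax_loss(embedding, class_weights, target=0, margin=0.2, scale=10.0)
print(loss_plain, loss_margin)
```

With the margin the same embedding incurs a higher loss, so training has to pull embeddings closer to their own speaker's weight vector than plain softmax would require.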

  31. MORE THAN A JOURNEY

  32. Online inference acceleration • Huge amount of speech data ◦ Huge amount of data per day ◦ Requires strong processing power • Strategy ◦ Algorithm ◦ Computation [Figure: average inference performance per node, with two series (per-node performance; performance per unit cost), relative to CPU (dual-socket 4114, X86) = 100.0%. Data labels: P4 float32 317.8% and 250.0%; P4 int8 572.0% and 450.0%; P4 int8 + model compression 1144.1% and 900.0%]

  33. Significant speed up with GPU deployment • X86 ◦ AVX-512 brings no significant speed up (~20%, TDP constraint) ◦ Some SKUs have only one AVX unit, which limits the speed up • GPU ◦ CNN takes up to 90% of computation time ◦ Uses elastic Tesla P4 instances on DiDiYun • Model quantization ◦ Int8 quantization brings +80% speed up over float32 ◦ Negligible accuracy loss • Model compression ◦ Distill-based model compression ◦ Acceptable accuracy loss
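
To make the int8 step concrete, here is a sketch of symmetric per-tensor quantization. The slides do not say which quantization scheme was used, so this scheme, the function names and the toy weights are all assumptions.

```python
def quantize_int8(values):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale

def dequantize(quantized, scale):
    """Map int8 codes back to approximate float values."""
    return [q * scale for q in quantized]

weights = [0.5, -1.27, 0.003, 1.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(weights, restored))

print(q)          # [50, -127, 0, 100]
print(max_error)  # bounded by scale / 2
```

Each value is stored in one byte instead of four, and the rounding error per value is at most half the scale, which is consistent with a small accuracy loss when the weight range is well behaved.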

  34. Inference optimization • Overall strategy • Bypass-based graph optimization • Custom graph optimization strategy based on TF Grappler • More platform-specific ops based on TF Custom Op (X86/ARM/GPU) • Decoupling from TF source • Hand-crafted high-performance ops

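  35. Graph Optimization • Op fusion ◦ Higher computation density ◦ Fewer kernel invocations ◦ Less memory access ◦ Utilizes registers better • Complex subgraph fusion: as in the red box in the figure, the computation graph of a complex activation function (Gelu) is fused into a single Gelu operator • Simple graph fusion: common fusion strategies include Conv+BatchNorm, Conv+BatchNorm+Relu, Matmul+[Matmul]+Relu ……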
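
The Conv+BatchNorm style of fusion can be checked numerically on a small example. This sketch uses a fully connected layer in place of a convolution to stay short; the same per-output-channel folding applies to convolution kernels. It assumes inference-time batch normalization with fixed statistics, and all function names are hypothetical rather than TensorFlow APIs.

```python
import math

def linear(x, weight, bias):
    """y = W x + b for one input vector."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def batch_norm(y, gamma, beta, mean, var, eps=1e-5):
    """Inference-time batch normalization with fixed statistics."""
    return [g * (v - m) / math.sqrt(s + eps) + b
            for v, g, b, m, s in zip(y, gamma, beta, mean, var)]

def fold_batch_norm(weight, bias, gamma, beta, mean, var, eps=1e-5):
    """Fold the batch-norm scale and shift into the layer's weight and bias."""
    folded_w, folded_b = [], []
    for row, b, g, bt, m, s in zip(weight, bias, gamma, beta, mean, var):
        k = g / math.sqrt(s + eps)          # per-output-channel scale
        folded_w.append([k * w for w in row])
        folded_b.append(k * (b - m) + bt)
    return folded_w, folded_b

x = [1.0, -2.0, 0.5]
weight = [[0.2, -0.1, 0.4], [0.7, 0.3, -0.5]]
bias = [0.1, -0.2]
gamma, beta = [1.5, 0.8], [0.05, -0.1]
mean, var = [0.3, -0.4], [0.9, 1.6]

unfused = batch_norm(linear(x, weight, bias), gamma, beta, mean, var)
fused_w, fused_b = fold_batch_norm(weight, bias, gamma, beta, mean, var)
fused = linear(x, fused_w, fused_b)

print(unfused)
print(fused)
```

The fused layer produces the same output with one operation instead of two, which is where the fewer kernel invocations and reduced memory traffic in the list above come from.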
