Spoken Language Understanding on the Edge

Alaa Saade, Alice Coucke, Alexandre Caulier, Joseph Dureau, Adrien Ball, Théodore Bluche, David Leroy, Clément Doumouro, Thibault Gisselbrecht, Francesco Caltagirone, Thibaut Lavril, Mael Primet
Snips, Paris

EMC2 Workshop @ NeurIPS 2019
November 13
Alexandre Caulier
Spoken language understanding system

Pipeline (diagram)
Automatic Speech Recognition Engine (Acoustic model, then Language model), followed by the Natural Language Understanding Engine
Acoustic model output (phones): t ɜ r n ɑ n ð ə l a ɪ t s ɪ n ð ə ˈ l ɪ v ɪ ŋ r u m
Language modeling output (text): Turn on the lights in the living room
Natural Language Understanding output: Intent: SwitchLightOn; Slots: room: living room

Features
Tested and certified to run on 1 GB RAM and a 1.4 GHz CPU
• Cloud independent - no remote processing
• Private by Design - no user data can be collected
• Accurate - on par with cloud-based solutions
Acoustic modeling

Deep neural network: outputs a probability over phones (/a/, /b/, /c/, /d/, /e/) over time

Challenges
• Large deep learning models: computationally & memory intensive
• Training data: 10K+ hours of in-domain audio with transcript per language

Trade-off between accuracy & computational efficiency
• Reduced model size (~10MB)
• A few thousand hours of training data
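The "probability over phones over time" output can be illustrated with a small sketch. This is not the Snips acoustic model: the logits below are made-up numbers, and the per-frame argmax is only for illustration (the deck describes decoding through a language model graph instead). It only shows how raw network scores for one audio frame become a distribution over the phones /a/ to /e/.

```python
import math

PHONES = ["/a/", "/b/", "/c/", "/d/", "/e/"]  # phone labels shown on the slide


def softmax(logits):
    """Turn one frame's raw network scores into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical network outputs for three consecutive audio frames.
frame_logits = [
    [2.0, 0.1, 0.3, 0.0, 0.5],
    [0.2, 1.5, 0.1, 0.4, 0.3],
    [0.1, 0.2, 0.3, 2.2, 0.1],
]

# One probability distribution over phones per frame.
posteriors = [softmax(frame) for frame in frame_logits]

# Most likely phone per frame (illustration only, not the real decoder).
best_phones = [PHONES[max(range(len(p)), key=p.__getitem__)] for p in posteriors]
print(best_phones)
```

Each row of `posteriors` sums to 1, which is what lets a downstream decoder combine these scores with language model probabilities.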
Assistant Contextualization

Approach: LM and NLU are consistent and contextualized

Language Model
Probability over phones (/a/, /b/, /c/, /d/) over time, then decoding graph, then text: Turn on the lights in the living room

Natural Language Understanding
Sentence, then logistic regression for the Intent and a Conditional Random Field for the Slots

• Lightweight models
• Out of vocabulary management
• On-device personalization
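The two NLU models named on this slide can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the Snips implementation: all weights, the `ROOM_WORDS` list, the BIO tag names, and the scoring functions are hypothetical and hand-set, whereas a real system would learn them from training data. The sketch shows a binary logistic regression scoring the SwitchLightOn intent, and Viterbi decoding of a linear-chain CRF for the room slot.

```python
import math

# Intent: binary logistic regression over bag-of-words features.
# Hypothetical hand-set weights; a trained model would learn them from data.
INTENT_WEIGHTS = {"turn": 1.0, "on": 1.5, "lights": 2.0, "off": -3.0}
INTENT_BIAS = -2.0


def switch_light_on_probability(tokens):
    """P(intent = SwitchLightOn | sentence) under the toy model."""
    score = INTENT_BIAS + sum(INTENT_WEIGHTS.get(t, 0.0) for t in tokens)
    return 1.0 / (1.0 + math.exp(-score))


# Slots: Viterbi decoding of a linear-chain CRF with BIO tags.
TAGS = ["O", "B-room", "I-room"]
ROOM_WORDS = {"living", "room", "kitchen", "bedroom"}  # hypothetical gazetteer


def emission(token, tag):
    """Toy per-token score: room words prefer room tags, others prefer O."""
    if token in ROOM_WORDS:
        return 0.0 if tag == "O" else 1.0
    return 1.0 if tag == "O" else -1.0


def transition(prev_tag, tag):
    """Toy tag-to-tag score that rewards well-formed BIO sequences."""
    if tag == "I-room":
        return 1.0 if prev_tag in ("B-room", "I-room") else -5.0
    if tag == "B-room" and prev_tag in ("B-room", "I-room"):
        return -1.0
    return 0.0


def start_score(tag):
    """A sequence should not start inside a slot."""
    return -5.0 if tag == "I-room" else 0.0


def viterbi(tokens):
    """Return the highest-scoring tag sequence for the tokens."""
    # best[tag] = (score of the best path ending in tag, that path)
    best = {t: (start_score(t) + emission(tokens[0], t), [t]) for t in TAGS}
    for token in tokens[1:]:
        new_best = {}
        for tag in TAGS:
            prev = max(TAGS, key=lambda p: best[p][0] + transition(p, tag))
            score = best[prev][0] + transition(prev, tag) + emission(token, tag)
            new_best[tag] = (score, best[prev][1] + [tag])
        best = new_best
    return max(best.values(), key=lambda item: item[0])[1]


sentence = "turn on the lights in the living room".split()
intent_probability = switch_light_on_probability(sentence)
slot_tags = viterbi(sentence)
room_value = " ".join(
    tok for tok, tag in zip(sentence, slot_tags) if tag.endswith("room")
)
print(intent_probability, slot_tags, room_value)
```

Both models are linear in their features, which is in line with the "lightweight models" point on the slide: inference is a handful of dictionary lookups and additions per token.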
Benchmarks - Datasets

Open Sourcing Experimental setting

Method
• Specialized for 💢 & 🎶: <100MB, real time on a Raspberry Pi 3
• Google Speech-to-Text cloud services: one-size-fits-all engine

Datasets
• Audio utterances with transcripts & supervision
• Recorded in close and far-field
• 💢 Smart Lights Assistant: 1.8K utterances, 400 word pronunciations (example: Intent: SwitchLightOn; Slots: room: living room)
• 🎶 Music Assistant: 3K utterances, 178K word pronunciations

Metrics
• End-to-end score: % of perfectly parsed queries
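The end-to-end metric can be sketched as follows. The slide only names the metric ("% of perfectly parsed queries"); the exact matching rule is an assumption here: a query counts as perfectly parsed when the predicted intent and the full set of slots both equal the reference. The function names and the example data are hypothetical.

```python
def is_perfect_parse(predicted, expected):
    """Assumed rule: intent and every slot must match the reference exactly."""
    return (
        predicted["intent"] == expected["intent"]
        and predicted["slots"] == expected["slots"]
    )


def end_to_end_score(predictions, references):
    """Percentage of queries whose parse matches the reference exactly."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("at least one query is required")
    perfect = sum(is_perfect_parse(p, r) for p, r in zip(predictions, references))
    return 100.0 * perfect / len(references)


# Made-up example: four queries, one with a wrong slot value.
references = [
    {"intent": "SwitchLightOn", "slots": {"room": "living room"}},
    {"intent": "SwitchLightOn", "slots": {"room": "kitchen"}},
    {"intent": "SwitchLightOn", "slots": {"room": "bedroom"}},
    {"intent": "SwitchLightOn", "slots": {}},
]
predictions = [
    {"intent": "SwitchLightOn", "slots": {"room": "living room"}},
    {"intent": "SwitchLightOn", "slots": {"room": "living room"}},  # wrong room
    {"intent": "SwitchLightOn", "slots": {"room": "bedroom"}},
    {"intent": "SwitchLightOn", "slots": {}},
]

score = end_to_end_score(predictions, references)
print(score)  # 75.0
```

Because the score is computed on the final parse, a single misrecognized word in the speech recognition step can turn a query into a failure even when the intent is right, which is why the metric is described as end-to-end.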
Benchmarks

End-to-End performance (% of perfectly parsed queries)

Methods compared
• Contextualized for 💢 & 🎶: <100MB, real time on a Raspberry Pi 3
• STT cloud service: one-size-fits-all engine

[Bar chart: % of perfectly parsed queries (axis 0% to 100%) for the Smart Lights Assistant (400 word pronunciations) and the Music Assistant (178K word pronunciations); bar values as extracted: 84, 79, 69, 48]

🎶 Music, by artist tier
          Tier 1 Artists   Tier 2 Artists   Tier 3 Artists
          (1-1k)           (4.5k-5.5k)      (9k-10k)
Snips     71 %             68 %             67 %
Google    69 %             38 %             37 %

Questions?