Developing Your Own Wake Word Engine Just Like “Alexa” and “OK Google”
Xuchen Yao, CEO, KITT.AI
Guoguo Chen, CTO, KITT.AI
What’s a “wake word”?
Examples: “Alexa, what’s the weather today?”, “OK Google”, “Hey Siri”
Wake word / hot word
• Offline
• Code runs on CPU/DSP/MCU
• 7x24
• Always listening
One-shot understanding
• Online
• Code runs on cloud
• On demand
• Explicit permission
Conversational UI Pipeline
wake up device → voice → speech to text → text understanding → dialogue management → text to speech
Snowboy: a customizable hotword detection engine
a.k.a. a deep neural network in 2MB of RAM
hotword.io
Who’s using it (released 5/2016)
• 10,000+ developers
• 7,000+ unique hotwords
• Dominating the developer community for hotword detection
Use Cases
#1 Hotword: Smart Mirror (credits to Evan Cohen) https://github.com/evancohen/smart-mirror
Command & Control: GoPiGo (credits to Paul Matz)
Project RePL (credits to Chris Burns)
Conversational UI Pipeline (Speech Pipeline highlighted)
wake up device → voice → speech to text → text understanding → dialogue management → text to speech
Speech Pipeline
Microphone Array → Voice Detection → Wake Word Detection (local) → Speech Recognition (cloud/local)
Microphone Array
• Close talking
• Far field (3-9 feet)
• 2, 4, or 6 microphones
• Linear/circular
Voice Detection
• Voice Activity Detection
• Auto Gain Control
• Adaptive Echo Cancellation
• Beam forming
Wake Word Detection (local)
• Fast response (0.1 second)
• High accuracy
Speech Recognition (cloud/local)
• IBM/Microsoft/Nuance/Google
• Alexa Voice Service
• Kaldi
• PocketSphinx
• HTK
• Command & Control
• Language Understanding
• Telephone (8 kHz sampling)
• Others (16 kHz)
• Noises: TV, radio, street, café, car, music
• Pitch: children, adults, senior
• Accent: US/UK/Europe/Asian …
Supported Platforms and Wrappers
• Raspberry Pi
• Mac OS X
• iPhone/iPad/iPod
• x86/64bit Ubuntu
• Android
• Pine 64
• Intel Edison
• Samsung Artik
• Allwinner R-series
• Ingenic X1000
• Rockchip
Personal vs. Universal models

                        Personal       Universal
Voice samples needed    3              At least 1,500
Speaker-independent     No             Yes
Speaker-specific        Sort of        No
Robust against noise    No             Yes
Free                    Yes            No
Time needed             Immediately    2 weeks
Customizing a universal model
define hotword → collect voice (web API) → train a model → deliver & evaluate → deploy to beta users → collect voice from device
• Iterate & improve until the desired performance is reached, then ship & success
• Desired performance: >90% detection rate, <= 3 false alarms in 24 hours
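The desired performance on this slide (>90% detection rate, <= 3 false alarms in 24 hours) can be checked mechanically during the evaluate step. The following is a minimal sketch, not the KITT.AI evaluation code: the names `EvalResult`, `evaluate`, and `meets_target` are hypothetical, and it assumes that detections are counted on a set of hotword utterances and false alarms on a separate stretch of hotword-free audio.

```python
from dataclasses import dataclass


@dataclass
class EvalResult:
    detection_rate: float        # fraction of hotword utterances that fired
    false_alarms_per_24h: float  # false alarms normalized to a 24-hour span


def evaluate(hits: int, positives: int,
             false_alarms: int, negative_hours: float) -> EvalResult:
    """Summarize one evaluation run (hypothetical helper)."""
    if positives <= 0 or negative_hours <= 0:
        raise ValueError("need at least one positive and some negative audio")
    return EvalResult(
        detection_rate=hits / positives,
        false_alarms_per_24h=false_alarms * 24.0 / negative_hours,
    )


def meets_target(result: EvalResult,
                 min_detection_rate: float = 0.90,
                 max_false_alarms_per_24h: float = 3.0) -> bool:
    """Apply the slide's thresholds: >90% detection, <= 3 false alarms / 24 h."""
    return (result.detection_rate > min_detection_rate
            and result.false_alarms_per_24h <= max_false_alarms_per_24h)
```

For example, 95 detections out of 100 hotword utterances with 5 false alarms over 48 hours of other audio gives a 95% detection rate and 2.5 false alarms per 24 hours, which passes; a model at exactly 90% would go back into the iterate & improve loop.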
Science behind wake word
Challenges
Is this “Alexa”? (figure: a short window vs. a longer window)
• High detection rate
• Low false alarm rate
• Efficient: detect every 0.1 second
• Small RAM: <2MB
• Too much ambiguity, not much context
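To make "detect every 0.1 second" concrete, here is a minimal sliding-window sketch. It is an illustration under stated assumptions, not Snowboy's decoder: the 16 kHz sample rate, the 1-second window length, the threshold, and the `score_fn` callable (a stand-in for whatever model scores a window) are all assumptions.

```python
from typing import Callable, Iterator, List, Sequence, Tuple

SAMPLE_RATE = 16000    # assumption: 16 kHz mono audio
HOP_SECONDS = 0.1      # from the slide: one decision every 0.1 second
WINDOW_SECONDS = 1.0   # assumption: long enough to cover the whole hotword


def sliding_windows(samples: Sequence[float], window: int,
                    hop: int) -> Iterator[Tuple[int, Sequence[float]]]:
    """Yield (start_index, chunk) for every full window, advancing by `hop`."""
    for start in range(0, len(samples) - window + 1, hop):
        yield start, samples[start:start + window]


def detect(samples: Sequence[float],
           score_fn: Callable[[Sequence[float]], float],
           threshold: float = 0.5) -> List[float]:
    """Return the start times (seconds) of windows whose score passes the threshold.

    `score_fn` is a hypothetical stand-in for the hotword model.
    """
    window = round(WINDOW_SECONDS * SAMPLE_RATE)
    hop = round(HOP_SECONDS * SAMPLE_RATE)
    return [start / SAMPLE_RATE
            for start, chunk in sliding_windows(samples, window, hop)
            if score_fn(chunk) >= threshold]
```

The trade-off in the figure shows up in `WINDOW_SECONDS`: a short window is cheap but sees little context, while a longer window sees more of the utterance at a higher cost per decision.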
Existing Algorithm
Existing Algorithm
• Advantage:
  – Simplified pipeline
  – Simplified decoder
• Disadvantage:
  – Requires massive hotword-specific training data
Possible Ways to Improve
• Data augmentation
  – Adding noise
  – Adding reverberation
  – And so on …
(figure panels: original; add noise; add noise and reverberation)
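The two augmentations named on this slide can be sketched in a few lines. This is an illustrative sketch rather than the training pipeline used for Snowboy: it assumes audio is a list of float samples, that noise is mixed at a chosen signal-to-noise ratio, and that reverberation is simulated by convolving with a room impulse response. The function names are hypothetical.

```python
import math
import random
from typing import List, Sequence


def rms(x: Sequence[float]) -> float:
    """Root-mean-square level of a signal."""
    return math.sqrt(sum(v * v for v in x) / len(x))


def add_noise(signal: Sequence[float], noise: Sequence[float],
              snr_db: float) -> List[float]:
    """Mix `noise` into `signal` at the requested signal-to-noise ratio (dB)."""
    if len(noise) < len(signal):
        raise ValueError("noise must be at least as long as the signal")
    noise = noise[:len(signal)]
    noise_level = rms(noise)
    if noise_level == 0.0:
        return list(signal)
    # Scale the noise so that rms(signal) / rms(scaled noise) == 10^(snr_db / 20).
    gain = rms(signal) / (noise_level * 10 ** (snr_db / 20.0))
    return [s + gain * n for s, n in zip(signal, noise)]


def add_reverb(signal: Sequence[float],
               impulse_response: Sequence[float]) -> List[float]:
    """Convolve with a room impulse response, truncated to the input length."""
    out = [0.0] * len(signal)
    for i, s in enumerate(signal):
        if s == 0.0:
            continue
        for j, h in enumerate(impulse_response):
            if i + j >= len(signal):
                break
            out[i + j] += s * h
    return out
```

Applying `add_noise` alone corresponds to the "add noise" panel, and `add_reverb(add_noise(...), impulse_response)` to the "add noise and reverberation" panel; varying the noise recording, the SNR, and the impulse response turns each recorded sample into many training samples.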
Possible Ways to Improve
• Network models
  – Model selection
    • Feedforward models? Recurrent models?
  – Model compression
    • 32-bit float → 16-bit float → 8-bit integer
    • Pruning parameters with small absolute value
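The two compression ideas on this slide, lower-precision weights and dropping parameters with small absolute value, can be illustrated as follows. This is a minimal sketch assuming symmetric linear quantization with a single scale per weight list and a fixed pruning threshold; it is not a description of how Snowboy compresses its models, and the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def quantize_int8(weights: Sequence[float]) -> Tuple[List[int], float]:
    """Map float weights to integers in [-127, 127] plus one float scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale


def dequantize(quantized: Sequence[int], scale: float) -> List[float]:
    """Recover approximate float weights from the 8-bit representation."""
    return [q * scale for q in quantized]


def prune_small(weights: Sequence[float], threshold: float) -> List[float]:
    """Zero out parameters whose absolute value is below `threshold`."""
    return [0.0 if abs(w) < threshold else w for w in weights]
```

Storing each weight in 8 bits instead of 32 cuts the memory for weights to roughly a quarter, at the cost of a rounding error of at most half a quantization step per weight; pruned zeros can additionally be stored sparsely.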
Possible Ways to Improve
• Decoder redesigning
  – Modeling smaller units
    • Syllables, phones, etc.
  – False alarm suppression
    • Additional classifier?
Training with Tesla K20/K80
• Positive data
  – 1,500 hotword samples
• Negative data
  – Thousands of hours of speech
• Training time
  – Half a day with 4 K80 GPUs
Software Architecture Backend Frontend
(architecture diagram labels)
• KITT.AI Scientific Computing Cloud: Deep Learning (Data → Training → Model → Deploy)
• Production Cloud: ELB, Message Queue
• Devices: Websocket (audio, msg), HTTPS (content traffic)
Running Your First Snowboy Demo