Description of yourself, your team/lab, your topic area, and who funds it
• Director of the Center for Language and Speech Processing
  – more than 15 faculty members in language, speech, machine translation, machine learning, cognitive science, and neuroscience
  – more than 40 graduate students
  – the usual funding sources
• Collaborations with the CoE HLT
• Three-student team working directly with me
  – acoustic processing for ASR
    • techniques based on temporal cues in the signal and on artificial neural net post-processing
    • biology-inspired auditory processing
  – funded by IARPA, DARPA, and Google
How does your area impact current speech technology (if at all) right now?
• Temporal features (perception of modulations)
  – longer (syllable and beyond) temporal context
  – RASTA, LDA filters, TRAPS, MRASTA, modulation spectrum, … (see the filtering sketch below)
• Data-guided features
  – LDA, convolutive DNNs, …
• Parallel processing streams (physiology of hearing)
  – different frequency ranges, different spectro-temporal properties, different expertise (training), different degrees of prior constraints, …
• Hierarchical processing (deep learning?)
  – frequency-localized to full spectrum, short context to longer context, …
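Since the slide only names the temporal-filtering techniques, here is a minimal sketch of the best-known one, the RASTA band-pass filter of Hermansky & Morgan (1994), applied to log spectral trajectories. The input array `log_spectrum` (time × bands of log critical-band energies) and the 100 Hz frame rate are illustrative assumptions, not part of the slide.

```python
# Minimal RASTA-style temporal filtering sketch (assumed front-end: log
# critical-band energies at a 100 Hz frame rate, shape = time x bands).
import numpy as np
from scipy.signal import lfilter

def rasta_filter(log_spectrum: np.ndarray) -> np.ndarray:
    """Band-pass filter each spectral band's trajectory over time.

    Implements H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1):
    it passes modulations around the syllable rate (~4 Hz) and suppresses
    slowly varying channel effects and very fast frame-to-frame noise.
    """
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # FIR part: ramp-shaped differentiator
    a = np.array([1.0, -0.98])                       # IIR part: leaky integrator
    return lfilter(b, a, log_spectrum, axis=0)       # filter along the time axis

# Usage (band_energies is a hypothetical time x bands array):
# filtered = rasta_filter(np.log(band_energies + 1e-10))
```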
Challenges
• Human-like processing is not always appreciated by hard-core engineers
• Communication between engineering and the life sciences
  – different goals, different vocabularies, different reward systems, …
• Researchers trained in both the life sciences and engineering are rare
What Is The Problem?
• ML (DNNs) – train over all sources of unwanted variability
• How to deal with previously unseen data?
• Knowledge from the life sciences?
  – emphasis on higher processing levels (beyond the periphery)
    • hierarchical processing in the auditory system
    • generalization
    • performance monitoring
    • attention (what to ignore)
Dealing with Unknown Unknowns: biologically-inspired multi-stream processing of sensory information
• Two open questions: 1. how to create the processing streams? 2. how to do “smart” fusion of the streams?
[Figure: schematic of the auditory pathway from the periphery (~100K neurons, modulations up to ~1000 Hz) to the cortex (~10M neurons, modulations around ~10 Hz), mapped onto parallel bottom-up processing streams feeding a “smart” fusion stage.]
• How do we know which combination of processing streams yields “correct” information? Preserving information in a system: the information must “make sense” given prior knowledge and learning
  – priors: typical sound occurrences, typical confusions, and typical temporal patterns of speech sounds
• Bottom-up streams (within modalities, across modalities, projections)
  – conflicts indicate localized corruptions – leave out the affected streams (a fusion sketch follows after this slide)
• Top-down and bottom-up streams (different strengths of prior constraints on the inputs)
  – conflicts between weak (bottom-up dominated) and strong (top-down influenced) streams indicate unexpected input – an opportunity for learning
[Figure: phone posteriors over time for the digits “five”, “three”, “zero” (/f/ /ay/ /v/, /th/ /r/ /iy/, /z/ /iy/ /r/ /oh/); divergence between the weakly and strongly constrained streams marks the unexpected regions.]

Word error rates:
  environment       conventional   proposed   best by hand
  clean             31 %           28 %       25 %
  car at 0 dB SNR   54 %           38 %       35 %
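The slide leaves “smart” fusion open, so below is one minimal, hypothetical way to realize the “conflicts indicate localized corruptions – leave out the affected streams” idea: compare each stream's phone posteriors to a provisional consensus and drop outliers before fusing. The function names, the symmetric-KL criterion, and the threshold are illustrative assumptions, not the system behind the numbers above.

```python
# A minimal sketch of "smart" multi-stream fusion: streams whose phone
# posteriors diverge from the consensus (suggesting localized corruption)
# are left out, and the survivors are fused. Threshold and names are
# illustrative assumptions.
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-10) -> float:
    """Mean symmetric KL divergence between two (time x phones) posterior tracks."""
    p, q = p + eps, q + eps
    return float(np.mean(np.sum(p * np.log(p / q) + q * np.log(q / p), axis=-1)))

def smart_fusion(stream_posteriors: list[np.ndarray], threshold: float = 2.0) -> np.ndarray:
    """Fuse per-stream posteriors, leaving out streams that conflict.

    stream_posteriors: list of (time x phones) arrays, one per processing stream.
    """
    consensus = np.mean(stream_posteriors, axis=0)                 # provisional fusion
    divergences = [kl_divergence(s, consensus) for s in stream_posteriors]
    keep = [s for s, d in zip(stream_posteriors, divergences) if d < threshold]
    if not keep:                                                   # all streams conflict:
        keep = stream_posteriors                                   # fall back to using all
    # Geometric mean (log-domain average) of the surviving streams' posteriors.
    fused = np.exp(np.mean([np.log(s + 1e-10) for s in keep], axis=0))
    return fused / fused.sum(axis=-1, keepdims=True)               # renormalize per frame
```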
TRAINING
• Condition-specific training: a “clean” DNN, a “10 dB” DNN, and a “5 dB” DNN, each trained on data from its own condition; at test time the pipeline is signal → DNN → decoder.

Word error rates, Aurora 4 (a sketch of assembling multi-condition data follows below):
  training \ test            Clean   10 dB SNR   5 dB SNR
  Clean                       3.10    15.65       36.60
  10 dB SNR                   5.06     4.35       14.70
  5 dB SNR                    9.04     4.73        7.73
  multi-condition training    4.28     5.17       11.86
  multi-band                  3.06     3.12       10.29
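For the “multi-condition training” row, here is a minimal sketch of how such a training set could be assembled: mix each clean utterance with noise at the same SNRs used in testing, so a single DNN sees all conditions. `clean` and `noise` are hypothetical 1-D sample arrays; this illustrates the idea, not the actual Aurora 4 recipe.

```python
# Sketch: build a multi-condition training set by mixing clean utterances
# with noise at several SNRs (here 10 dB and 5 dB, matching the table).
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the mixture has the requested signal-to-noise ratio."""
    noise = np.resize(noise, len(clean))               # tile/truncate to utterance length
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12
    gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10.0)))
    return clean + gain * noise

def multi_condition_set(utterances, noise):
    """One clean copy plus noisy copies at 10 dB and 5 dB SNR per utterance."""
    out = []
    for u in utterances:
        out.append(u)                                  # clean condition
        for snr in (10.0, 5.0):                        # matches the table's conditions
            out.append(mix_at_snr(u, noise, snr))
    return out
```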
Where Are We Now?
• Signal processing, information theory, machine learning, …
• The conventional pipeline: signal → signal processing → pattern classification → decoder → message
And Where Are We Heading?
• Repetitions, fillers, hesitations, interruptions, unfinished and non-grammatical sentences, new words, dialects, emotions, …
• Current DARPA and IARPA programs, the research agenda of the JHU CoE HLT, industrial efforts (Google, Microsoft, IBM, Amazon, …)
• Signal processing, information theory, machine learning, … together with neural information processing, psychophysics, physiology, cognitive science, phonetics and linguistics, …
• Engineering and the Life Sciences together!