Micropower Electro-Magnetic (EM) Sensors for Speech Characterization: Recognition, Verification, and Other Applications
Presented to IBM Watson Research Laboratory, Yorktown, New York
J.F. Holzrichter
Lawrence Livermore National Laboratory
holzrichter1@llnl.gov
February 4, 1999
Work performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract W-7405-Eng-48.
We conjectured, a few years ago, that micropower EM sensors could provide useful additional information for many speech applications. This appears to be true.
• EM sensors measure generalized positions versus time of speech articulator interfaces
  – They are very effective as articulator gesture detectors
  – They work well where the articulator being detected is isolated in space and characterized in frequency
• They are useful for many signal processing applications
  – Enable a good voiced excitation function
  – Enable noise reduction & many “housekeeping” activities
• These sensors are compatible with most acoustic technologies
  – Very low, human-compatible output levels, < 0.2 mW
  – Low cost, low power use, FCC & FDA ok, and small
JFH:2/4/99.tlb 2
Micropower EM sensors have been used primarily in the homodyne “field disturbance” mode
This image shows the present EM sensor, which costs about $15 in parts and about $200 to replicate (marginal cost). In production as chips, the sensors are expected to cost a few dollars each.
Miniature micropower EM sensors can measure articulator positions and motions in real time
[Figure: head cross-section showing transmitted and reflected EM waves and the acoustic microphone. Labeled anatomy: lips, teeth, tongue (tip, back), palate, velum (closed position), nasal tract, oral tract, pharynx, glottis, vocal folds. Sensors: EM sensor 1, EM sensor 2, EM sensor 3.]
There are many ways to include the EM sensors with normal speech transducers: e.g., hand-held microphones, mike booms, monitors, telephones, etc.
Micropower EM sensors have many applicable modalities when used for acoustics and speech applications
• EM sensor modalities used for speech experiments are
  – Impulse transmit and impulse range gate
  – Wave packet transmit and impulse range gate
  – Homodyne transmit and receive
  – High pass filtering (i.e., field disturbance) or DC coupling
• The homodyne sensors generate 10-ns EM wave trains at 2 GHz frequency (λ = 15 cm), with a 2-MHz pulse repetition frequency (prf); 1/e tissue penetration is 3 to 5 cm
• The power levels are <1.0 nJ/pulse and <0.2 mW of average radiated power (can be <0.02 mW)
• The power and energy densities on human tissues are one order of magnitude below international continuous exposure levels of 1 mW/cm^2, and can be made 100-fold lower
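The figures quoted for the homodyne sensor can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the variable names are ours, the speed of light is rounded, and the energy per pulse is simply the value implied by dividing the quoted average-power bound by the prf, not a measured quantity.

```python
# Back-of-envelope check of the homodyne sensor figures quoted on the slide.
C = 3.0e8               # speed of light in free space, m/s (rounded)

carrier_hz = 2.0e9      # 2 GHz carrier (from the slide)
pulse_s = 10e-9         # 10-ns wave train (from the slide)
prf_hz = 2.0e6          # 2-MHz pulse repetition frequency (from the slide)
avg_power_w = 0.2e-3    # upper bound on average radiated power (from the slide)

wavelength_m = C / carrier_hz               # free-space wavelength
cycles_per_pulse = carrier_hz * pulse_s     # carrier cycles in one wave train
duty_cycle = pulse_s * prf_hz               # fraction of time spent transmitting
energy_per_pulse_j = avg_power_w / prf_hz   # energy implied by the average-power bound

print(f"wavelength       = {wavelength_m * 100:.1f} cm")
print(f"cycles per pulse = {cycles_per_pulse:.0f}")
print(f"duty cycle       = {duty_cycle:.1%}")
print(f"energy per pulse = {energy_per_pulse_j * 1e9:.2f} nJ")
```

With these inputs the free-space wavelength comes out at 15 cm, matching the slide. Each wave train holds 20 carrier cycles, the transmitter is on 2% of the time, and the 0.2 mW bound corresponds to about 0.1 nJ per pulse, which sits inside the quoted <1.0 nJ/pulse figure.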
The interpretation of EM sensor information depends upon its use in near, intermediate, or far field modes
[Figure: two geometries, each showing an antenna at range r from an air/material interface of effective area A_eff.]
Near Field Mode*: r < λ < A^(1/2)
  Signal* = const. × Δr × A_eff × (d/dr envelope) = 0.02 V/mm/cm^2 at r = 6 cm
Far Field Mode* (i.e., Radar): r >> λ > A^(1/2)
  Signal# = const. × A_eff × 1/r^4
* Homodyne approximation
# Multiple pulse radar approximation
Good filtering makes small “relative motions” of interfaces measurable, e.g., >50 dB signal-to-noise
[Figure: sensitivity envelope (E-field amplitude versus position, 0 to 6 cm) for reflecting interface motions, with magnified views of position versus time. The magnified scales are marked 10^-6 V for E and 10 µm for x.]
Multiple EM-wave cycles, homodyne detection, and AC filtering are used for glottal experiments
[Figure: block diagram. A wave packet drives the transmit antenna; EM waves enter the tissue (ε = 50) and return to the receive antenna; a mixer multiplies the local and received wave trains; an integrator and high- and low-pass filters (70 Hz, 3 kHz) produce the glottal signal. An inset shows the sensitivity function versus distance of the object from the sensor.]
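The mixer-plus-integrator chain can be illustrated numerically. This is a minimal sketch under simplifying assumptions that are ours, not the slide's: the received wave is an ideal copy of the local wave delayed by the round trip to a single reflecting interface, the free-space wavelength is used (inside tissue it would be shorter), and the function name `homodyne_output` and the 10 µm displacement are hypothetical.

```python
import math

def homodyne_output(range_m, wavelength_m, amplitude=1.0,
                    cycles=20, samples_per_cycle=32):
    """Mean of (local wave x received wave) over one wave train.

    The received wave is modeled as the local wave delayed by the round
    trip to an interface at range_m, i.e. a phase lag of 4*pi*r/lambda.
    """
    phase = 4.0 * math.pi * range_m / wavelength_m
    n = cycles * samples_per_cycle
    total = 0.0
    for k in range(n):
        theta = 2.0 * math.pi * k / samples_per_cycle   # local oscillator phase
        local = math.cos(theta)
        received = amplitude * math.cos(theta - phase)
        total += local * received                       # mixer: multiply the two waves
    return total / n                                    # integrator: average over the train

wavelength = 0.15   # 2 GHz in free space (from the slides)
r0 = 0.06           # 6 cm stand-off (the range quoted in the slides)
dr = 10e-6          # 10 micrometre interface motion (illustrative assumption)

v0 = homodyne_output(r0, wavelength)
v1 = homodyne_output(r0 + dr, wavelength)
print(f"output at r0           = {v0:.6f}")
print(f"change for 10 um shift = {v1 - v0:.3e}")
```

Averaging over whole carrier cycles removes the double-frequency term, leaving an output proportional to the cosine of the round-trip phase. A small interface motion therefore appears as a small change in a slowly varying level, which is what the subsequent AC filtering isolates.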
An EM sensor signal reflected from the glottal area, calibrated against high-speed video of the glottal cycle, shows good agreement in signal timing
Glottal EM Sensor (GEMS) tissue measurements show strong correlation with the vocal-fold electroglottography (EGG) signal*
* Experiments conducted at the U. of Iowa, National Center for Voice and Speech, in collaboration with Prof. I. Titze & Dr. B. Story, and W. Lea of Speech Sciences.
EM-sensed glottal tissue data and pitch rate show distinct individuality for all users tested. Generalized signal structure agrees with EGG data.*
* Experiments conducted at the U. of Iowa, National Center for Voice and Speech, in collaboration with Prof. I. Titze & Dr. B. Story, and W. Lea of Speech Sciences.
Micropower EM sensors vastly increase the amount of information available to characterize an individual’s articulator conditions during speech
• Provides generalized locations of vocal articulator interfaces during human speech at >1 kHz rates
  – Measures vocal folds, tongue, lips, jaw, velum
  – Measures articulators not influencing the acoustics
• Enables the measurement of the glottal cycle, definition of synchronous frames, and an estimate of a voiced excitation function, with frequencies from 70 Hz to 7 kHz
• Obtains physiological values of each individual’s speech organs and their EM wave reflection coefficients
• Enables these measurements non-invasively, safely, and economically in the presence of acoustic noise
EM sensors enable much improved pitch measurements, especially in noisy environments
[Figure: two panels, “Three Acoustic Approaches” and “EM Sensor-Based Pitch Measurement: measure zero crossing time interval (no noise)”.]
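Measuring the time interval between zero crossings can be sketched as follows. This is an illustration on a clean synthetic sine, not the processing used in the experiments; the function name, the 10 kHz sample rate, the 120 Hz test tone, and the linear interpolation step are all assumptions.

```python
import math

def zero_crossing_pitch(samples, sample_rate_hz):
    """Estimate pitch from the mean interval between upward zero crossings."""
    crossings = []
    for n in range(1, len(samples)):
        prev, cur = samples[n - 1], samples[n]
        if prev < 0.0 <= cur:
            frac = -prev / (cur - prev)   # linear interpolation between the two samples
            crossings.append((n - 1 + frac) / sample_rate_hz)
    if len(crossings) < 2:
        return None                       # not enough cycles to measure an interval
    mean_period = (crossings[-1] - crossings[0]) / (len(crossings) - 1)
    return 1.0 / mean_period

fs = 10_000.0   # assumed sample rate, Hz
f0 = 120.0      # assumed test pitch, Hz
signal = [math.sin(2.0 * math.pi * f0 * n / fs + 0.3) for n in range(1000)]
estimate = zero_crossing_pitch(signal, fs)
print(f"estimated pitch = {estimate:.2f} Hz")
```

The estimate needs only one pass over the samples and a comparison per sample, which is why the approach is attractive when the input is a clean, noise-free glottal signal. On a noisy acoustic waveform the same code would count spurious crossings.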
EM sensors enable accurate (< ±1 Hz) and automatic glottal cycle measurements (e.g., onset)
[Figure: glottal & pharynx motions for unvoiced speech.]
The EM sensor yields smoother, more accurate pitch contours relative to traditional methods
[Figure: audio waveform and EM sensor (radar) signal versus time for the phrase “They were all good men”, with pitch contours (Hz versus time) from the EM sensor and from audio cepstral (CEP) processing.]
Unlike acoustic algorithms, EM sensor pitch algorithms work well in the presence of acoustic noise
[Figure: pitch tracks from three methods (EM sensor, cepstral, autocorrelation), annotated with “Onset of Second Speaker” and “Problems”.]
From tuning fork experiments, we determine the “pitch-estimating” performance of EM sensors
• Using known signals, we compute pitch using: Cepstral, Autocor, & EM signal zero-crossings
• EM sensor pitch estimates:
  – 100 times faster in computational efficiency
  – 5 to 20 times smaller errors
  – Insensitive to acoustic noise
  – Provide real-time pitch synchronous processing
[Figure: two bar charts comparing Cepstral, Autocor, and GEMS EM sensor: Kflops (log scale, 1 to 10000) and Error Rate (%) (0 to 0.3).]
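For a sense of where the computational cost of an acoustic method comes from, here is a minimal autocorrelation pitch estimator run on a known tone. It is a generic textbook formulation, not the implementation benchmarked on the slide; the function name, the 50 to 400 Hz search range, the 8 kHz sample rate, and the 200 Hz tone are assumptions, and the multiply-add counter is only a rough proxy for the Kflops figure.

```python
import math

def autocorrelation_pitch(samples, sample_rate_hz, f_min=50.0, f_max=400.0):
    """Return (pitch_hz, multiply_adds) using the lag with the largest autocorrelation."""
    lag_min = int(sample_rate_hz / f_max)
    lag_max = int(sample_rate_hz / f_min)
    best_lag, best_value, multiply_adds = None, float("-inf"), 0
    for lag in range(lag_min, lag_max + 1):
        value = 0.0
        for n in range(len(samples) - lag):
            value += samples[n] * samples[n + lag]   # one multiply-add per sample pair
        multiply_adds += len(samples) - lag
        if value > best_value:                       # keep the first, largest peak
            best_lag, best_value = lag, value
    return sample_rate_hz / best_lag, multiply_adds

fs = 8000.0     # assumed sample rate, Hz
tone = [math.sin(2.0 * math.pi * 200.0 * n / fs) for n in range(800)]
pitch, ops = autocorrelation_pitch(tone, fs)
print(f"estimated pitch = {pitch:.1f} Hz using {ops} multiply-adds")
```

The work grows with the window length times the number of candidate lags, whereas counting zero crossings on a clean signal needs a single pass. The pitch resolution here is also limited to whole-sample lags, one reason practical implementations add interpolation.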
The human vocal tract has excitation source(s), E, followed by a sequence of tubes and resonators that can be described as H, using linear equations
[Figure: horizontal vocal tract with 4 resonator chambers (numbered 1 to 4), running from the pressure reservoir (lungs) and subglottis through the vocal folds, pharynx, soft palate, tongue, jaw, and lips to the microphone. EM sensors are placed at the vocal folds and at the jaw, tongue, and palate. E marks the excitation, H the tract, and A the acoustic output.]
Transfer function, or spectral output: H(ω) = A(ω) / E(ω)
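The relation H(ω) = A(ω) / E(ω) can be checked on synthetic data. The sketch below is illustrative: the excitation and the three-tap tract response are made-up values, circular convolution is used so that the division is exact, and a plain O(N^2) DFT keeps the example dependency-free.

```python
import cmath

def dft(x):
    """Direct discrete Fourier transform (adequate for short sequences)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * m * k / n) for k in range(n))
            for m in range(n)]

def circular_convolve(x, h):
    """Circular convolution of x with a short impulse response h."""
    n = len(x)
    return [sum(h[j] * x[(i - j) % n] for j in range(len(h))) for i in range(n)]

n_points = 32
excitation = [0.5 ** k for k in range(n_points)]   # hypothetical E: decaying pulse
tract = [1.0, -0.9, 0.4]                           # hypothetical tract impulse response
audio = circular_convolve(excitation, tract)       # A: excitation filtered by the tract

E = dft(excitation)
A = dft(audio)
H_est = [a / e for a, e in zip(A, E)]              # H(w) = A(w) / E(w)
H_true = dft(tract + [0.0] * (n_points - len(tract)))
max_error = max(abs(he - ht) for he, ht in zip(H_est, H_true))
print(f"largest deviation from the true transfer function = {max_error:.2e}")
```

The division only works where E(ω) is not close to zero, which is one reason a directly measured excitation is valuable: without it, E has to be assumed or estimated from the audio itself.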
The EM Sensor gives access to the real-time excitation function to use in an ARMA model
x(t): glottal excitation function
h(t): vocal tract impulse response
y(t): audio
Transfer function: H(z) = Y(z) / X(z) = (B_0 + B_1 z^-1 + ...) / (A_0 + A_1 z^-1 + A_2 z^-2 + ...)
Zeros of the transfer function (roots of the numerator) are the anti-resonances of the vocal tract
Poles of the transfer function (roots of the denominator) are the resonances of the vocal tract
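To make the pole/zero picture concrete, the sketch below implements the ARMA difference equation that corresponds to H(z) and evaluates the response on the unit circle. The coefficient values are arbitrary illustrations (one pole pair and one zero pair), and the function names are ours.

```python
import cmath
import math

def arma_filter(b, a, x):
    """y[n] = (sum_k b[k] x[n-k] - sum_{k>=1} a[k] y[n-k]) / a[0]."""
    y = []
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(len(b)) if n - k >= 0)
        acc -= sum(a[k] * y[n - k] for k in range(1, len(a)) if n - k >= 0)
        y.append(acc / a[0])
    return y

def frequency_response(b, a, omega):
    """H(e^{j omega}) = B(z) / A(z) evaluated with z^-1 = e^{-j omega}."""
    z_inv = cmath.exp(-1j * omega)
    num = sum(bk * z_inv ** k for k, bk in enumerate(b))
    den = sum(ak * z_inv ** k for k, ak in enumerate(a))
    return num / den

radius, angle = 0.95, math.pi / 4                         # illustrative pole location
a = [1.0, -2.0 * radius * math.cos(angle), radius ** 2]   # pole pair at radius*e^{+/-j*angle}
b = [1.0, 0.0, 1.0]                                       # zero pair at z = +/- j

impulse = [1.0] + [0.0] * 63
h = arma_filter(b, a, impulse)                            # impulse response h[n]
gain_at_pole = abs(frequency_response(b, a, angle))
gain_at_zero = abs(frequency_response(b, a, math.pi / 2))
print(f"gain near the pole = {gain_at_pole:.1f}")
print(f"gain at the zero   = {gain_at_zero:.1e}")
```

The response is large near the pole angle (a resonance) and essentially vanishes at the zero angle (an anti-resonance). An all-pole model has no numerator terms with which to represent the second behavior.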
Using a synthetic transfer function, we see improved characterization of “zeros” in transfer functions when using pole/zero (e.g., ARMA) modeling
[Figure: the EM excitation X(z) is passed through a synthetic 4 pole/2 zero transfer function H(z) to produce synthetic audio Y(z). Calculated transfer functions are plotted versus frequency (Hz): 4 pole/2 zero ARMA H(z), 4 pole LPC H(z), and 16 coefficient cepstral H(z).]
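When both the excitation and the output are known, pole/zero coefficients can be recovered by ordinary least squares on the difference equation. The sketch below shows that idea on noise-free synthetic data. It is an assumed equation-error formulation, not necessarily the fitting procedure behind the slide, and it uses a smaller 2 pole/1 zero system than the 4 pole/2 zero one shown. All names and coefficient values are hypothetical.

```python
def solve_linear(m, v):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(v)
    aug = [row[:] + [v[i]] for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = aug[r][n] - sum(aug[r][c] * sol[c] for c in range(r + 1, n))
        sol[r] = s / aug[r][r]
    return sol

def fit_arma(x, y, n_poles, n_zeros):
    """Least-squares fit of y[n] = -sum_k a[k] y[n-k] + sum_k b[k] x[n-k], a[0] = 1."""
    start = max(n_poles, n_zeros)
    rows = [[-y[n - k] for k in range(1, n_poles + 1)] +
            [x[n - k] for k in range(n_zeros + 1)]
            for n in range(start, len(y))]
    targets = y[start:]
    dim = n_poles + n_zeros + 1
    normal = [[sum(r[i] * r[j] for r in rows) for j in range(dim)] for i in range(dim)]
    rhs = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(dim)]
    coeffs = solve_linear(normal, rhs)               # solve the normal equations
    return [1.0] + coeffs[:n_poles], coeffs[n_poles:]

# Synthetic system (hypothetical values) driven by a pulse-train excitation.
true_a = [1.0, -1.5, 0.8]
true_b = [1.0, 0.5]
x = [1.0 if n % 37 == 0 else 0.0 for n in range(400)]
y = []
for n in range(len(x)):
    acc = sum(true_b[k] * x[n - k] for k in range(len(true_b)) if n - k >= 0)
    acc -= sum(true_a[k] * y[n - k] for k in range(1, len(true_a)) if n - k >= 0)
    y.append(acc)

est_a, est_b = fit_arma(x, y, n_poles=2, n_zeros=1)
print("denominator:", [round(c, 6) for c in est_a])
print("numerator:  ", [round(c, 6) for c in est_b])
```

The numerator terms can be fitted only because x is available as a separate measurement. With audio alone, the excitation columns of the regression would be unknown, which is the situation all-pole LPC works around.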
The ARMA model yields accurate and robust transfer functions, which compare well to traditional models
[Figure: transfer functions for the vowel /i/, with formants F1 and F2 marked: 20 coefficient cepstral model (black), 15 pole/15 zero ARMA model (blue), 15 pole LPC model (red).]
The EM sensor glottal information enables pitch synchronous signal processing of speech
The sample below illustrates a commonly confused segment of the phrase “recognize speech”
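Pitch synchronous processing means that analysis frames start and end on glottal cycle boundaries instead of at fixed intervals. A minimal sketch of the framing step follows, assuming the cycle onsets have already been extracted from the EM sensor signal as sample indices; the function name and the sample values are hypothetical.

```python
def pitch_synchronous_frames(audio, cycle_onsets, cycles_per_frame=1):
    """Cut audio into frames whose edges fall on glottal cycle onsets.

    cycle_onsets holds increasing sample indices, one per glottal cycle.
    """
    frames = []
    for i in range(0, len(cycle_onsets) - cycles_per_frame, cycles_per_frame):
        start = cycle_onsets[i]
        end = cycle_onsets[i + cycles_per_frame]
        frames.append(audio[start:end])
    return frames

audio = list(range(100))        # stand-in for audio samples
onsets = [5, 25, 44, 66, 85]    # hypothetical onset indices from the EM sensor

frames = pitch_synchronous_frames(audio, onsets)
two_cycle = pitch_synchronous_frames(audio, onsets, cycles_per_frame=2)
print([len(f) for f in frames])
print([len(f) for f in two_cycle])
```

Frame lengths follow the speaker's pitch period, so each frame holds a whole number of glottal cycles and a transfer function can be estimated for each cycle or group of cycles.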
Another example of pitch synchronous ARMA processing
The sample below illustrates a commonly confused segment of the phrase “wreck a nice beach”
Three separate EM radar sensors have been used to measure several articulator motions as the word “print” is spoken
Lower-frequency (0.2 to 10 Hz) EM sensor channels can provide generalized articulator timing information
[Figure: EM sensor trace and spectrogram for the sequence /a/ /ng/ /a/, beside a vocal tract diagram labeled with velum (closed position), nasal tract, palate, oral tract, pharynx, acoustic microphone, and EM sensors.]