Voice Capture and Analysis
Cody Narber
Computer and Information Science Department
Kansas State University
Frequency

Frequency is a measure of repeating events per unit time. In audio it is the number of air-pressure pulses per second. The unit of measurement is the hertz (Hz), one cycle per second. A wave's frequency is f = 1/T, where T is the period of the wave in seconds (illustrated in the accompanying figure). A periodic signal can be expressed as a sum of sine and cosine terms. This is known as the Fourier theorem and is the basis for the Fourier transform, which decomposes a signal into these parts. For sampled signals the decomposition is the discrete Fourier transform (DFT), and efficient algorithms exist to compute it (namely the FFT). Thus we can apply the FFT to an audio signal to extract the frequency terms that make up the signal.
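As a rough illustration of applying the FFT to audio, the sketch below implements a textbook recursive radix-2 FFT and uses it to estimate the strongest frequency in a synthetic tone. The function names, the 8,000 Hz sample rate, the 1,024-sample length, and the 440 Hz test tone are assumptions chosen for the example, not values from the slides. In practice a library FFT would normally be used instead.

```python
import cmath
import math


def fft(samples):
    """Recursive radix-2 Cooley-Tukey FFT; the length must be a power of two."""
    n = len(samples)
    if n == 0:
        raise ValueError("input must not be empty")
    if n == 1:
        return [complex(samples[0])]
    if n % 2:
        raise ValueError("length must be a power of two")
    even = fft(samples[0::2])   # transform of even-indexed samples
    odd = fft(samples[1::2])    # transform of odd-indexed samples
    half = n // 2
    out = [0j] * n
    for k in range(half):
        twiddle = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + twiddle
        out[k + half] = even[k] - twiddle
    return out


def dominant_frequency(samples, sample_rate):
    """Return the frequency (Hz) of the largest non-DC bin below Nyquist."""
    spectrum = fft(samples)
    half = len(samples) // 2
    magnitudes = [abs(c) for c in spectrum[:half]]
    peak_bin = max(range(1, half), key=lambda k: magnitudes[k])
    return peak_bin * sample_rate / len(samples)


# Assumed example values: a pure 440 Hz sine sampled at 8 kHz.
SAMPLE_RATE = 8000
N = 1024
tone = [math.sin(2 * math.pi * 440 * i / SAMPLE_RATE) for i in range(N)]
estimate = dominant_frequency(tone, SAMPLE_RATE)
print(f"estimated frequency: {estimate:.1f} Hz")
```

The estimate can only land on a multiple of SAMPLE_RATE / N (about 7.8 Hz here), so it should fall within one bin of the true tone rather than exactly on it.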
Spectrum

The frequency spectrum is a plot of the frequencies present in a signal against their amplitudes. The amplitude is the height of the peaks of the sinusoidal waves that compose the signal, that is, the strength of each frequency present. A spectrogram plots the frequency spectrum at each moment in time, with time on the x-axis, frequency on the y-axis, and darker areas marking higher amplitudes.
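The idea of a spectrogram can be sketched as a list of spectra, one per short frame of the signal. The example below is a minimal version under stated assumptions: it uses a naive DFT, non-overlapping frames, and no window function, whereas real spectrograms usually use overlapping, windowed frames and an FFT. The names, the 8,000 Hz sample rate, the 64-sample frame, and the 250 Hz and 1,000 Hz test tones are all illustrative choices.

```python
import cmath
import math


def magnitude_spectrum(frame):
    """Naive DFT magnitudes for bins 0 .. n/2 - 1 of one frame."""
    n = len(frame)
    mags = []
    for k in range(n // 2):
        total = sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n)
                    for t in range(n))
        mags.append(abs(total) / n)
    return mags


def spectrogram(samples, frame_size):
    """One magnitude spectrum per non-overlapping frame (a column per frame)."""
    return [magnitude_spectrum(samples[start:start + frame_size])
            for start in range(0, len(samples) - frame_size + 1, frame_size)]


# Assumed test signal: a 250 Hz tone followed by a 1,000 Hz tone.
SAMPLE_RATE = 8000
FRAME_SIZE = 64
low = [math.sin(2 * math.pi * 250 * i / SAMPLE_RATE) for i in range(256)]
high = [math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE) for i in range(256)]
columns = spectrogram(low + high, FRAME_SIZE)

# Strongest frequency in each time slice, converted from bin index to Hz.
peak_hz = [max(range(len(col)), key=lambda k: col[k]) * SAMPLE_RATE / FRAME_SIZE
           for col in columns]
print(peak_hz)
```

Each entry of columns corresponds to one vertical slice of the spectrogram image; drawing larger magnitudes as darker pixels would give the picture described above.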
Formants

Formants are peaks in the frequency spectrum, that is, the frequencies that are most prevalent in the signal. Several formants exist in spoken samples and are used for voice recognition (the accompanying table shows the average formant frequencies associated with vowels). These peaks correspond to resonances in sound sources such as musical instruments, or anything with sound chambers (for humans these are the nasal and oral cavities). The fundamental frequency (F0) is the rate at which the vocal cords vibrate and is the pitch that humans detect. It is not itself a formant; the formants are numbered F1, F2, and so on.

Vowel formant data from Peterson and Barney, 1952
Special Frequencies

There are certain frequencies of sounds that are of special note. The hearing statistics are for healthy young adults. As people age, their ability to hear the far-end sounds decreases.

Average Human Hearing Frequency
Lower    High
20 Hz    20,000 Hz

Average Human Spoken Frequency
Male     Female
120 Hz   210 Hz

Musical Notes using Equal-Tempered tuning [A4 = 440 Hz] (frequencies in Hz)
Note    Octave=1  Octave=2  Octave=3  Octave=4  Octave=5  Octave=6
A       55        110       220       440       880       1,760
A#/Bb   58        117       233       466       932       1,865
B       62        123       247       494       988       1,976
C       65        131       262       523       1,047     2,093
C#/Db   69        139       277       554       1,109     2,217
D       73        147       294       587       1,175     2,349
D#/Eb   78        156       311       622       1,245     2,489
E       82        165       330       659       1,319     2,637
F       87        175       349       698       1,397     2,794
F#/Gb   92        185       370       740       1,480     2,960
G       98        196       392       784       1,568     3,136
G#/Ab   104       208       415       831       1,661     3,322
A       110       220       440       880       1,760     3,520
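The musical note values follow from the equal-tempered rule that each semitone multiplies the frequency by the twelfth root of two, anchored at A4 = 440 Hz. The sketch below recomputes a table entry from that rule. The function name and the column-numbering convention (each column runs from A up to G#/Ab, as in the table above) are assumptions made for this example.

```python
# Rows of the table, in order, starting from A.
NOTE_NAMES = ["A", "A#/Bb", "B", "C", "C#/Db", "D",
              "D#/Eb", "E", "F", "F#/Gb", "G", "G#/Ab"]


def note_frequency(name, column, reference_hz=440.0):
    """Frequency in Hz of a note in the given table column.

    Column 4 starts at the reference A (440 Hz); each row down is one
    semitone higher, and each column to the right is one octave higher.
    """
    semitones = NOTE_NAMES.index(name) + 12 * (column - 4)
    return reference_hz * 2 ** (semitones / 12)


# Print the full grid, rounded to whole hertz.
for name in NOTE_NAMES:
    row = [round(note_frequency(name, column)) for column in range(1, 7)]
    print(f"{name:<6}", *row)
```

For instance, note_frequency("C", 4) is 440 multiplied by 2 to the power 3/12, roughly 523 Hz.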
Voice, Hearing, and Microphones

When we speak, the vocal cords vibrate, repeatedly closing and opening the airway, which stops and starts the air flow. The air then resonates in the oral and nasal cavities. It is this stopping and starting of airflow that creates what are known as voiced sounds (ones that use the vocal cords, such as vowels). Longitudinal waves are created by this stopping and starting of airflow. The faster the cords vibrate, the closer together the waves are, and thus the higher the frequency of the sound produced. Our eardrums pick up these compression/decompression waves by moving back and forth, triggering neurons that send impulses to be deciphered by our brain. Dynamic microphones work in a similar way: a plate (the diaphragm) attached to a coil of wire moves in and out along a magnet. The movement of the coil along the magnet creates an electrical signal, which is digitized and saved on the computer.

image from http://www.mediacollege.com/audio/microphones/dynamic.html
Applications

The purpose of studying voice and its constituent parts (frequency, energy, formants, etc.) is the variety of applications that can be explored. Some of these topics have not had much research done on them, and they are gaining interest as the technology improves.

● Voice Recognition (has improved a lot in the past couple of years)
● Voice Synthesis (using emotion and inflections to make it more realistic)
● Voice Emotional Analysis (clinical and wellness applications)
● Voice Stress Detection (lie detection and operator state)
● Etc.

Voice analysis is becoming more popular because of its non-invasive data capture (much like vision-based analysis of facial expressions).