A Spectral-Temporal Method for Pitch Tracking
Stephen A. Zahorian*, Princy Dikshit, Hongbing Hu*
Department of Electrical and Computer Engineering
Old Dominion University, Norfolk, VA 23529, USA
* Currently at Binghamton University
09/17/2006

Outline
• Introduction
• Algorithm
  - Algorithm overview
  - The use of nonlinear processing
  - Pitch tracking from the spectrum
• Experimental evaluation
• Conclusion

Introduction
• Pitch (the fundamental frequency) applications
  - Automatic speech recognition (ASR), speech synthesis, speech articulation training aids, etc.
• Pitch detection algorithms
  - “Robust and accurate fundamental frequency estimation based on dominant harmonic components,” Nakatani et al.
    => High accuracy for noisy speech reported using the harmonic dominance spectrum
  - “Yet another algorithm for pitch tracking (YAAPT),” Zahorian et al.
    => Hybrid spectral-temporal processing for pitch tracking

Algorithm Overview
[Block diagram. Blocks shown, in processing order:]
• Nonlinear processing: original speech and squared value of speech
• FFT spectrum, followed by pitch tracking, giving the spectral F0 track
• F0 candidates estimation, giving F0 candidates (original speech) and F0 candidates (squared value)
• Candidates refinement, giving refined F0 candidates
• Final F0 determination using dynamic programming, giving the final F0

The Use of Nonlinear Processing
• Restoration of the missing fundamental in telephone speech
• A periodic sound is characterized by the spectrum of its harmonics
• A signal with the fundamental missing can be approximated by setting b_1 = 0 in
  \[ y(t) = b_1 \cos(\omega t) + b_2 \cos(2\omega t) + b_3 \cos(3\omega t) \]
  where the three terms are the fundamental, the 1st harmonic, and the 2nd harmonic
• After squaring and applying trigonometric identities
  \[ [y(t)]^2 = \frac{b_2^2 + b_3^2}{2} + b_2 b_3 \cos(\omega t) + \frac{b_2^2}{2} \cos(4\omega t) + b_2 b_3 \cos(5\omega t) + \frac{b_3^2}{2} \cos(6\omega t) \]
• The fundamental reappears as the term b_2 b_3 cos(ωt)

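The squaring identity above can be checked numerically. The following is a minimal sketch, not the authors' implementation: the sample rate, F0, and amplitudes are illustrative assumptions, and `dft_magnitude` is a hypothetical helper that evaluates a single DFT bin.

```python
import math


def dft_magnitude(signal, freq_hz, sample_rate):
    """Normalized DFT magnitude of `signal` at one frequency (in Hz)."""
    re = 0.0
    im = 0.0
    for n, x in enumerate(signal):
        phase = 2.0 * math.pi * freq_hz * n / sample_rate
        re += x * math.cos(phase)
        im -= x * math.sin(phase)
    return math.hypot(re, im) / len(signal)


# Illustrative values (assumptions, not taken from the slides).
SAMPLE_RATE = 8000   # samples per second
F0 = 100.0           # fundamental frequency in Hz
B2, B3 = 1.0, 0.5    # amplitudes of the 1st and 2nd harmonics
NUM_SAMPLES = 800    # exactly 10 periods of F0

# Signal with the fundamental missing: only 2*F0 and 3*F0 are present.
y = [
    B2 * math.cos(2.0 * math.pi * 2 * F0 * n / SAMPLE_RATE)
    + B3 * math.cos(2.0 * math.pi * 3 * F0 * n / SAMPLE_RATE)
    for n in range(NUM_SAMPLES)
]
y_squared = [v * v for v in y]

mag_before = dft_magnitude(y, F0, SAMPLE_RATE)
mag_after = dft_magnitude(y_squared, F0, SAMPLE_RATE)
print(f"|Y(F0)| before squaring: {mag_before:.6f}")
print(f"|Y(F0)| after squaring:  {mag_after:.6f}")
```

With this normalization a cosine of amplitude A contributes A/2 at its own frequency, so the squared signal should show b_2 b_3 / 2 at F0 while the original shows essentially zero.
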
Illustration of Nonlinear Processing
• The telephone speech signal (top panel) and squared telephone signal (bottom panel) for one frame
[Figure: waveforms for one frame]

Illustration of Nonlinear Processing
• The magnitude spectrum for the telephone signal (top panel) and the nonlinear processed signal (bottom panel)
[Figure: magnitude spectra]

Spectral Effects from Nonlinear Processing
• The missing fundamental in the telephone speech (top panel) is restored in the squared signal (bottom panel)
[Figure: top panel "Spectrum of the telephone speech", bottom panel "Spectrum of the nonlinear processed signal"; axes: Time (Seconds), 18 to 23, versus Frequency (Hz), 100 to 400]

Pitch Tracking From the Spectrum
• The pitch track from the spectrum refines the pitch candidates estimated from the temporal method
• To achieve a noise-robust pitch track from the spectrum, an autocorrelation type of function is proposed

Autocorrelation Type of Function
• The function takes into account multiple harmonics
[Figure: panels "Autocorrelation type of function" and "Spectrum"; windows of width WL centered at k, 2k, 3k, and 4k in the spectrum are multiplied together; x-axes: Frequency (Hz)]
• Equation
  \[ y(k) = \sum_{i=-WL/2}^{WL/2} \; \prod_{n=1}^{N+1} f(nk + i), \qquad F0\_min < k < F0\_max \]
  k: frequency index
  f(i): the spectrum
  WL: window length (20 Hz)
  N: the number of harmonics (3)

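The equation above can be sketched directly in code. This is an illustration under stated assumptions rather than the authors' implementation: the function name `autocorrelation_type_function` is hypothetical, the spectrum is assumed to be sampled at 1 Hz per bin so that WL = 20 bins, and products that would index past the end of the spectrum are treated as zero.

```python
def autocorrelation_type_function(spectrum, k_min, k_max, window_len, num_harmonics=3):
    """Evaluate y(k) = sum_i prod_n f(n*k + i) for k_min < k < k_max.

    `spectrum` is a list of magnitudes indexed by frequency bin.
    The product runs over n = 1 .. num_harmonics + 1, as in the equation.
    """
    half = window_len // 2
    scores = {}
    for k in range(k_min + 1, k_max):
        total = 0.0
        for i in range(-half, half + 1):
            product = 1.0
            for n in range(1, num_harmonics + 2):
                idx = n * k + i
                if 0 <= idx < len(spectrum):
                    product *= spectrum[idx]
                else:
                    # Assumption: bins outside the spectrum contribute nothing.
                    product = 0.0
                    break
            total += product
        scores[k] = total
    return scores


# Toy spectrum (made-up data): small floor with peaks at 120, 240, 360, 480 Hz.
spectrum = [0.01] * 2000
for harmonic in range(1, 5):
    spectrum[120 * harmonic] = 1.0

scores = autocorrelation_type_function(spectrum, k_min=60, k_max=400, window_len=20)
best_k = max(scores, key=scores.get)
print("Strongest candidate (Hz):", best_k)
```

Because the product requires energy at k, 2k, 3k, and 4k simultaneously, only the true fundamental of the toy spectrum lines up all four peaks.
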
Peaks in Autocorrelation Type of Function
[Figure: top panel "Spectrum", Amplitude versus Frequency (Hz), 0 to 1200 Hz; bottom panel "Peaks in autocorrelation type of function", Amplitude versus Frequency (Hz), 0 to 450 Hz]
• A very prominent peak is observed in the proposed function

Candidate Insertion to Reduce Pitch Doubling/Halving
• If all candidates are larger than a threshold (typically 150 Hz), an additional candidate is inserted at half the frequency of the highest-ranking candidate
• Similar logic is used to reduce pitch halving
[Figure: "Peaks in autocorrelation type of function", Amplitude versus Frequency (Hz), showing peak P1 and the inserted candidate P2 (Hz) = P1 (Hz) / 2]

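The insertion rule for pitch doubling can be sketched as follows. This is a minimal illustration, assuming the candidates for a frame are given as a list ordered from highest-ranking to lowest; the function name and the choice to append the new candidate at the end of the list are assumptions, not details from the slides.

```python
def insert_half_frequency_candidate(candidates_hz, threshold_hz=150.0):
    """Return a new candidate list with P1 / 2 added when all candidates are high.

    `candidates_hz` is assumed to be ordered from highest-ranking to lowest.
    """
    if candidates_hz and all(c > threshold_hz for c in candidates_hz):
        # All candidates exceed the threshold: possible pitch doubling,
        # so offer half of the highest-ranking candidate as an alternative.
        return list(candidates_hz) + [candidates_hz[0] / 2.0]
    return list(candidates_hz)


doubled = insert_half_frequency_candidate([240.0, 360.0])
unchanged = insert_half_frequency_candidate([240.0, 120.0])
print(doubled)
print(unchanged)
```

The later dynamic programming stage can then choose between P1 and P1 / 2 based on continuity with neighboring frames.
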
Experimental Evaluation
• Database
  - Keele pitch extraction database
  - 5 male and 5 female speakers, about 35 seconds per speaker
  - High quality speech and telephone speech
  - Additive Gaussian noise
• Controls (reference pitch)
  - Control C1: supplied in the Keele database
  - Control C2: computed from the laryngograph signal with the proposed algorithm

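One way to produce a noisy test condition is sketched below. It assumes that the "5 dB Noise" condition in the results refers to a 5 dB signal-to-noise ratio, which the slides do not state explicitly; the function name, seed, and test tone are illustrative only.

```python
import math
import random


def add_gaussian_noise(signal, snr_db, seed=0):
    """Add white Gaussian noise scaled so that the SNR equals `snr_db`."""
    rng = random.Random(seed)
    noise = [rng.gauss(0.0, 1.0) for _ in signal]
    signal_power = sum(x * x for x in signal) / len(signal)
    noise_power = sum(v * v for v in noise) / len(noise)
    # Scale the noise so that signal_power / scaled_noise_power = 10^(snr_db / 10).
    target_noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    scale = math.sqrt(target_noise_power / noise_power)
    return [x + scale * v for x, v in zip(signal, noise)]


# Stand-in for a speech frame: a 120 Hz tone sampled at 8 kHz (made-up data).
clean = [math.sin(2.0 * math.pi * 120.0 * n / 8000.0) for n in range(4000)]
noisy = add_gaussian_noise(clean, snr_db=5.0)

noise_only = [b - a for a, b in zip(clean, noisy)]
measured_snr_db = 10.0 * math.log10(
    sum(x * x for x in clean) / sum(v * v for v in noise_only)
)
print(f"Measured SNR: {measured_snr_db:.3f} dB")
```
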
Definition of Error Measures
• Gross error
  - The percentage of frames in which the pitch estimate of the tracker deviates significantly (typically by 20%) from the reference pitch (control)
  - Only evaluated in the voiced sections of the reference

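The gross error measure can be sketched as below. This is a minimal illustration, assuming that unvoiced frames are marked with a reference pitch of 0 Hz; that encoding, the function name, and the sample values are assumptions made for the example.

```python
def gross_error_percent(estimated_hz, reference_hz, threshold=0.20):
    """Percentage of voiced reference frames whose estimate is off by more than `threshold`."""
    # Assumption: a reference value of 0 marks an unvoiced frame, which is skipped.
    voiced = [(e, r) for e, r in zip(estimated_hz, reference_hz) if r > 0]
    if not voiced:
        return 0.0
    errors = sum(1 for e, r in voiced if abs(e - r) > threshold * r)
    return 100.0 * errors / len(voiced)


# Made-up frames: the first and last reference frames are unvoiced.
reference = [0.0, 100.0, 100.0, 200.0, 0.0]
estimate = [90.0, 105.0, 130.0, 100.0, 0.0]

error_20 = gross_error_percent(estimate, reference, threshold=0.20)
error_40 = gross_error_percent(estimate, reference, threshold=0.40)
print(f"Gross error at 20%: {error_20:.2f}%")
print(f"Gross error at 40%: {error_40:.2f}%")
```

Raising the threshold from 20% to 40% forgives the 30% deviation in the third frame but still counts the halved estimate in the fourth frame.
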
Experiment 1 Results
• Individual performance of the proposed algorithm

Method           Control  Studio,    Studio,         Telephone,  Telephone,
                          Clean (%)  5 dB Noise (%)  Clean (%)   5 dB Noise (%)
YAAPT            C1       4.26       7.62            8.14        17.85
YAAPT*           C1       1.59       1.99            2.69        4.48
Spectral method  C1       4.23       4.45            6.52        6.95
NCCF             C1       3.58       4.52            8.00        16.61

YAAPT*: Using control C1 for the spectral pitch track
NCCF: Normalized cross correlation function, used as the temporal method in YAAPT

Experiment 2 Results
• The results of the new method with various error thresholds

Error      Control  Studio,    Studio,         Telephone,  Telephone,
Threshold           Clean (%)  5 dB Noise (%)  Clean (%)   5 dB Noise (%)
10%        C1       5.46       7.31            9.39        16.14
10%        C2       4.18       6.06            7.77        14.78
20%        C1       2.90       3.65            4.86        7.45
20%        C2       1.56       2.16            3.27        5.85
40%        C1       2.25       2.44            2.75        3.63
40%        C2       0.91       1.06            0.99        2.05

Comparisons

Method           Control  Studio,    Studio,         Telephone,    Telephone,
                          Clean (%)  5 dB Noise (%)  Clean (%)     5 dB Noise (%)
Proposed Method  C1       2.90       3.65            4.86 (4.52*)  7.45 (5.90*)
DASH             C1       2.81       2.32            3.73*         4.15*
REPS             C1       2.68       2.98            6.91*         8.49*
YIN              C1       2.57       7.22            7.55*         14.6*

• DASH, REPS, YIN: the results are reported in “Robust and accurate fundamental frequency estimation ... ,” Nakatani et al.
• *: SRAEN filter simulated telephone speech

Conclusion
• A new pitch-tracking algorithm has been developed which combines multiple information sources to enable accurate, robust F0 tracking
• An analysis of errors indicates better performance for both high quality and telephone speech than previously reported performance for pitch tracking
• Acknowledgements
  - This work was partially supported by JWFC 900