A Spectral-Temporal Method for Pitch Tracking
Stephen A. Zahorian*, Princy Dikshit, Hongbing Hu*
Department of Electrical and Computer Engineering
Old Dominion University, Norfolk, VA 23529, USA
* Currently at Binghamton University
09/17/2006

Outline
  Introduction
  Algorithm
    Algorithm overview
    The use of nonlinear processing
    Pitch tracking from the spectrum
  Experimental evaluation
  Conclusion

Introduction
  Pitch (the fundamental frequency) applications
    Automatic speech recognition (ASR), speech synthesis, speech articulation training aids, etc.
  Pitch detection algorithms
    “Robust and accurate fundamental frequency estimation based on dominant harmonic components,” Nakatani et al.
      => High accuracy for noisy speech reported using the harmonic dominance spectrum
    “Yet another algorithm for pitch tracking (YAAPT),” Zahorian et al.
      => Hybrid spectral-temporal processing for pitch tracking

Algorithm Overview

The Use of Nonlinear Processing
  Restoration of the missing fundamental in telephone speech
  A periodic sound is characterized by the spectrum of its harmonics
  The signal can be approximated by its first three components
    \[ y(t) = b_1 \cos(\omega t) + b_2 \cos(2\omega t) + b_3 \cos(3\omega t) \]
    (fundamental, 1st harmonic, 2nd harmonic)
  With the fundamental missing (b_1 = 0), squaring and applying trigonometric identities gives
    \[ y^2(t) = \frac{b_2^2 + b_3^2}{2} + b_2 b_3 \cos(\omega t) + \frac{b_2^2}{2} \cos(4\omega t) + b_2 b_3 \cos(5\omega t) + \frac{b_3^2}{2} \cos(6\omega t) \]
  The fundamental reappears in the b_2 b_3 \cos(\omega t) term

Illustration of Nonlinear Processing
  [Figure: the telephone speech signal (top panel) and squared telephone signal (bottom panel) for one frame]

Illustration of Nonlinear Processing
  [Figure: the magnitude spectrum for the telephone (top panel) and nonlinear processed signal (bottom panel)]

Spectral Effects from Nonlinear Processing
  The missing fundamental in the telephone speech (top panel) is restored in the squared signal (bottom panel)
  [Figure: panels "Spectrum of the telephone speech" and "Spectrum of the nonlinear processed signal"; x-axis: Time (Seconds); y-axis: Frequency (Hz)]
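
The squaring step from "The Use of Nonlinear Processing" can be checked numerically. The sketch below is only an illustration under assumed values: the sample rate, the 100 Hz fundamental, the amplitudes b2 and b3, and the helper name tone_amplitude are hypothetical choices, not taken from the slides. It builds a signal with only the 2nd and 3rd multiples of the fundamental, squares it, and measures the component at the fundamental before and after.

```python
import math

def tone_amplitude(signal, freq_hz, sample_rate_hz):
    """Amplitude of the sinusoidal component at freq_hz (single-bin DFT projection)."""
    n = len(signal)
    re = sum(x * math.cos(2 * math.pi * freq_hz * i / sample_rate_hz)
             for i, x in enumerate(signal))
    im = sum(x * math.sin(2 * math.pi * freq_hz * i / sample_rate_hz)
             for i, x in enumerate(signal))
    return 2.0 * math.hypot(re, im) / n

# Assumed illustration values (not from the slides).
sample_rate_hz = 8000
f0_hz = 100.0
b2, b3 = 0.8, 0.5
n_samples = 800  # 0.1 s, i.e. exactly 10 periods of f0

# Signal with the fundamental missing: only the 2*f0 and 3*f0 components.
y = [b2 * math.cos(2 * math.pi * 2 * f0_hz * i / sample_rate_hz)
     + b3 * math.cos(2 * math.pi * 3 * f0_hz * i / sample_rate_hz)
     for i in range(n_samples)]

# Nonlinear processing: square each sample.
y_squared = [v * v for v in y]

before = tone_amplitude(y, f0_hz, sample_rate_hz)
after = tone_amplitude(y_squared, f0_hz, sample_rate_hz)
print(f"amplitude at f0 before squaring: {before:.6f}")
print(f"amplitude at f0 after squaring:  {after:.6f}")
```

Under these assumptions the component at f0 is essentially zero before squaring and close to b2*b3 afterwards, which matches the b_2 b_3 cos(ωt) term in the squared-signal expansion.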

Pitch Tracking From the Spectrum
  The pitch track from the spectrum refines the pitch candidates estimated from the temporal method
  To achieve a noise-robust pitch track from the spectrum, an autocorrelation type of function is proposed

Autocorrelation Type of Function
  The function takes into account multiple harmonics
  [Figure: panels "Spectrum" (harmonic positions k, 2k, 3k, 4k and window WL marked) and "Autocorrelation type of function"; x-axis: Frequency (Hz)]
  Equation
    \[ y(k) = \sum_{i=-WL/2}^{WL/2} \prod_{n=1}^{N+1} f(nk + i), \qquad k_{F0\_min} \le k \le k_{F0\_max} \]
    f(i): the spectrum
    k: frequency index
    N: the number of harmonics (3)
    WL: window length (20 Hz)

Peaks in Autocorrelation Type of Function
  [Figure: panels "Spectrum" and "Peaks in autocorrelation type of function"; x-axis: Frequency (Hz); y-axis: Amplitude]
  A very prominent peak is observed in the proposed function

Candidate Insertion to Reduce Pitch Doubling/Halving
  If all candidates are larger than a threshold (typically 150 Hz), an additional candidate is inserted at half the frequency of the highest-ranking candidate
  Similar logic is used to reduce pitch halving
  [Figure: "Peaks in autocorrelation type of function" with peak P1 and inserted candidate P2(Hz) = P1(Hz)/2; x-axis: Frequency (Hz); y-axis: Amplitude]

Experimental Evaluation
  Database
    Keele pitch extraction database
    5 male and 5 female speakers, about 35 seconds per speaker
    High-quality speech and telephone speech
    Additive Gaussian noise
  Controls (reference pitch)
    Control C1: supplied in Keele database
    Control C2: computed from the laryngograph signal with the proposed algorithm

Definition of Error Measures
  Gross error
    The percentage of frames in which the pitch estimate of the
    tracker deviates from the reference pitch (control) by more than a threshold (typically 20%)
    Only evaluated in the voiced sections of the reference

Experiment 1 Results
  Individual performance of the proposed algorithm

  Control  Studio, Clean (%)  Studio, 5 dB Noise (%)  Telephone, Clean (%)  Telephone, 5 dB Noise (%)
  C1       4.26               7.62                    8.14                  17.85                      YAAPT*
  C1       1.59               1.99                    2.69                  4.48                       Spectral method
  C1       4.23               4.45                    6.52                  6.95                       NCCF
  C1       3.58               4.52                    8.00                  16.61                      YAAPT

  YAAPT*: Using control C1 for the spectral pitch track
  NCCF: Normalized cross correlation function, used as the temporal method in YAAPT

Experiment 2 Results
  The results of the new method with various error thresholds

  Error Threshold  Control  Studio, Clean (%)  Studio, 5 dB Noise (%)  Telephone, Clean (%)  Telephone, 5 dB Noise (%)
  10%              C1       5.46               7.31                    9.39                  16.14
  10%              C2       4.18               6.06                    7.77                  14.78
  20%              C1       2.90               3.65                    4.86                  7.45
  20%              C2       1.56               2.16                    3.27                  5.85
  40%              C1       2.25               2.44                    2.75                  3.63
  40%              C2       0.91               1.06                    0.99                  2.05

Comparisons

  Method           Control  Studio, Clean (%)  Studio, 5 dB Noise (%)  Telephone, Clean (%)  Telephone, 5 dB Noise (%)
  Proposed Method  C1       2.90               3.65                    4.86 (4.52*)          7.45 (5.90*)
  DASH             C1       2.81               2.32                    3.73*                 4.15*
  REPS             C1       2.68               2.98                    6.91*                 8.49*
  YIN              C1       2.57               7.22                    7.55*                 14.6*

  DASH, REPS, YIN: the results are reported in “Robust and accurate fundamental frequency estimation ... ,” Nakatani et al.
  *: SRAEN filter simulated telephone speech

Conclusion
  A new pitch-tracking algorithm has been developed which combines multiple information sources to enable accurate, robust F0 tracking
  Error analysis indicates better performance on both high-quality and telephone speech than previously reported for pitch tracking

Acknowledgements
  This work was partially supported by JWFC 900
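
The autocorrelation type of function and the candidate-insertion rule described in the algorithm slides can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes the product runs over n = 1 to N+1 (as the k, 2k, 3k, 4k markers in the spectrum figure suggest), a spectrum sampled at 1 Hz per bin so that WL = 20 Hz is 20 bins, zero contribution from bins outside the spectrum, and that the inserted candidate is simply appended after the existing ones. The function names and the synthetic 200 Hz spectrum are hypothetical.

```python
import math

def harmonic_product_sum(spectrum, k, num_harmonics=3, window_len=20):
    """y(k) = sum over i in [-WL/2, WL/2] of prod over n in [1, N+1] of f(n*k + i)."""
    half = window_len // 2
    total = 0.0
    for i in range(-half, half + 1):
        product = 1.0
        for n in range(1, num_harmonics + 2):
            idx = n * k + i
            # Assumption: bins outside the spectrum contribute zero.
            product *= spectrum[idx] if 0 <= idx < len(spectrum) else 0.0
        total += product
    return total

def insert_half_candidate(candidates_hz, threshold_hz=150.0):
    """If every candidate exceeds the threshold, add one at half the top-ranked candidate."""
    if candidates_hz and all(c > threshold_hz for c in candidates_hz):
        # Assumption: the extra candidate is appended with the lowest rank.
        return list(candidates_hz) + [candidates_hz[0] / 2.0]
    return list(candidates_hz)

# Synthetic magnitude spectrum, 1 Hz per bin, with peaks at multiples of 200 Hz.
true_f0_hz = 200
amplitudes = [1.0, 0.8, 0.6, 0.4]
spectrum = [
    sum(a * math.exp(-0.5 * ((j - (h + 1) * true_f0_hz) / 5.0) ** 2)
        for h, a in enumerate(amplitudes))
    for j in range(1001)
]

# Evaluate y(k) over an assumed F0 search range and pick the strongest peak.
k_min, k_max = 50, 400
y = {k: harmonic_product_sum(spectrum, k) for k in range(k_min, k_max + 1)}
best_k = max(y, key=y.get)

# Ranked candidate list (highest-ranking first), then the half-frequency insertion.
candidates = insert_half_candidate([float(best_k)])
print("peak of y(k) at", best_k, "Hz; candidates:", candidates)
```

Because the product requires energy at every multiple of k, sub-multiples such as 100 Hz score near zero on this synthetic spectrum, which is why the explicit half-frequency candidate is needed if a later stage is to consider it at all.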
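
The gross error measure from "Definition of Error Measures" can be sketched as below. This is an illustrative sketch under assumptions: unvoiced reference frames are marked with a value of 0, the deviation is measured relative to the reference pitch, and the function name and sample values are hypothetical.

```python
def gross_error_percent(estimates_hz, reference_hz, threshold=0.20):
    """Percentage of voiced reference frames whose estimate deviates by more than threshold."""
    # Assumption: a reference value of 0 marks an unvoiced frame, which is skipped.
    voiced = [(est, ref) for est, ref in zip(estimates_hz, reference_hz) if ref > 0]
    if not voiced:
        return 0.0
    errors = sum(1 for est, ref in voiced if abs(est - ref) > threshold * ref)
    return 100.0 * errors / len(voiced)

# Made-up frames: two unvoiced (0 Hz reference) and four voiced.
reference = [0.0, 100.0, 100.0, 200.0, 200.0, 0.0]
estimate = [50.0, 100.0, 125.0, 100.0, 210.0, 120.0]

for threshold in (0.10, 0.20, 0.40):
    rate = gross_error_percent(estimate, reference, threshold)
    print(f"threshold {threshold:.0%}: gross error {rate:.1f}%")
```

Loosening the threshold can only keep or lower the error rate, which is the pattern seen across the 10%, 20% and 40% rows of the Experiment 2 table.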