SPEECH LAB NTHU EE
A Preliminary Study on Landmark Detection
Advisor: Hsiao-Chuan Wang
Speaker: Jyh-Min CHENG
Date: Aug. 18, 2006

Introduction
• A knowledge-based speech recognition system is dedicated to processing speech (versus signals in general) and therefore is efficient
• Rather than explicitly specifying speech knowledge in a recognition system, a statistical approach builds models by training on speech data, thereby implicitly acquiring knowledge on its own

Introduction (cont.)
• Statistical methods have been successful for large-vocabulary, speaker-independent speech recognition
  - Lee, K.-F. (1989). Automatic speech recognition: the development of the SPHINX system
• Because they rely heavily on data, statistical methods do not generalize easily to tasks for which they are not explicitly trained
  - Retraining, adaptation, etc.

Introduction (cont.)
• Performance degrades when there is an environment mismatch
  - Das, S., Bakis, R., Nadas, A., Nahamoo, D., and Picheny, M. (1993). Influence of background noise and microphone on the performance of the IBM Tangora speech recognition system
• A combination of both knowledge-based and statistical approaches
  - Knowledge sources are added, such as phone duration, an auditory front-end, the mel-frequency scale, etc.

Introduction (cont.)
• A knowledge-based speech recognition system was proposed
  - Stevens, K. N., Manuel, S. Y., Shattuck-Hufnagel, S., and Liu, S. (1992). “Implementation of a model for lexical access based on features”, ICSLP

Introduction (cont.) – Distinctive features
• Distinctive features concisely describe the sounds of a language at a sub-segmental level
• They have a relatively direct relation to acoustics and articulation
  - Jakobson, R., Fant, G., and Halle, M. (1952). “Preliminaries to speech analysis”
• They can concisely describe many of the contextual variations of a segment
  - Speaking styles, phonological assimilation across word boundaries, etc.

Introduction (cont.) – Landmark detection
• Landmarks are a guide to the presence of underlying segments, which organize distinctive features into bundles
• They define regions in an utterance where the acoustic correlates of distinctive features are most salient
• They mark perceptual foci and articulatory targets

Introduction (cont.) – Landmark detection
• For some phonetic contrasts, a listener focuses on landmarks to get the acoustic cues necessary for deciphering the underlying distinctive features
  - Stevens, K. N. (1985). “Evidence for the role of acoustic boundaries in the perception of speech sounds”
  - Furui, S. (1986). “On the role of spectral transition for speech perception”
  - Ohde, R. M. (1994). “The developmental role of acoustic boundaries in speech perception”

Introduction (cont.) – Landmark detection
• Once the landmarks have been found, subsequent processing can focus on the relevant portions of speech instead of treating every part of the signal as equally important
• Minimizes the amount of processing necessary
• Independent of timing factors such as speaking rate and segmental duration
• Gives timing information to aid in later processing

Introduction (cont.) – Landmark detection
Landmark detection, Frame-based processing, and Segmentation
• Landmark detection is just one way to organize the speech waveform
• Frame-based processing and segmentation are two other possibilities

Introduction (cont.) – Landmark detection
Landmark detection, Frame-based processing, and Segmentation
• Frame-based processing is the most popular way of dividing up the speech waveform
• Segmentation is more structured than frame-based processing
  - Finds boundaries in the speech waveform
  - Delimits unequal-length, semi-steady-state, abutting regions, with each region corresponding to a phone or sub-phone unit
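
The contrast between the two ways of dividing up a waveform can be sketched in a few lines of Python. This is only an illustrative sketch: the function names, the frame length, the hop size, and the boundary positions are hypothetical and do not come from the presentation, and a list of integers stands in for real speech samples.

```python
def frame_signal(samples, frame_len, hop):
    """Frame-based processing: equal-length frames taken every `hop` samples."""
    frames = []
    start = 0
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += hop
    return frames


def segment_signal(samples, boundaries):
    """Segmentation: unequal-length, abutting regions delimited by boundaries."""
    edges = [0] + sorted(boundaries) + [len(samples)]
    return [samples[a:b] for a, b in zip(edges, edges[1:]) if b > a]


signal = list(range(20))  # stand-in for a speech waveform (assumption)
frames = frame_signal(signal, frame_len=8, hop=4)
segments = segment_signal(signal, boundaries=[5, 12])  # made-up boundaries
```

Here every frame has the same length regardless of content, whereas the regions returned by `segment_signal` differ in length and cover the signal without gaps or overlap.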

Introduction (cont.) – Landmark detection
Landmark detection, Frame-based processing, and Segmentation
• Subsequent processing focuses on these regions, typically acquiring averages across a region and sometimes measuring attributes near the boundaries
  - Gish, H., and Ng, K. (1993). “A segmental speech model with applications to word spotting”, ICASSP
  - Zue, V. W., Glass, J. R., Goodine, D., Leung, H., Phillips, M., Polifroni, J., and Seneff, S. (1990b). “Recent progress on the SUMMIT system”

Introduction (cont.) – Landmark detection
Landmark detection, Frame-based processing, and Segmentation
• A segmentation approach performs better than or comparably to a frame-based approach while significantly reducing the computational load in training and testing
  - Flammia, G., Dalsgaard, P., Andersen, O., and Lindberg, B. (1992). “Segment based variable frame rate speech analysis and recognition using spectral variation function”, ICSLP
  - Marcus, J. (1993). “Phonetic recognition in a segmental-based HMM”, ICASSP

Introduction (cont.) – Landmark detection
Landmark detection, Frame-based processing, and Segmentation
• Segmentation was a popular method of organizing the speech waveform from the 1970s through the mid-1980s
• Compatible with acoustic-phonetic processing
  - Weinstein, C. J., McCandless, S. S., Mondshein, L. F., and Zue, V. W. (1975). “A system for acoustic-phonetic analysis of continuous speech”, IEEE ASSP

Introduction (cont.) – Landmark detection
Landmark detection, Frame-based processing, and Segmentation
• Segmentation fails when parts of the waveform do not have sharp boundaries, such as those corresponding to diphthongs and semivowels
• Over-segmentation
  - Andre-Obrecht, R. (1988). “A new statistical approach for the automatic segmentation of continuous speech signals”, IEEE ASSP
• Multi-level representation
  - Glass, J. R. (1988). “Finding acoustic regularities in speech: applications to phonetic recognition”

Introduction (cont.) – Landmark detection
Landmark detection, Frame-based processing, and Segmentation
• Landmark detection is different from frame-based processing and segmentation
  - Landmarks are foci, so speech processing is done around a landmark rather than in between two landmarks
  - Not all boundaries are landmarks, and not all landmarks are boundaries
  - The problem of semivowels and diphthongs is avoided altogether
  - Typically more hierarchical
  - Associated with distinctive features, rather than with phones as in segmentation
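
The idea that processing is done around a landmark, rather than between two boundaries, can be illustrated with a small Python sketch. The function name, the window half-width, and the landmark positions below are hypothetical choices made for illustration only; the presentation does not specify how analysis windows are placed.

```python
def windows_around_landmarks(samples, landmarks, half_width):
    """Return one analysis window centered on each landmark, clipped at the edges."""
    windows = []
    for t in landmarks:
        lo = max(0, t - half_width)
        hi = min(len(samples), t + half_width + 1)
        windows.append(samples[lo:hi])
    return windows


signal = list(range(20))  # stand-in for a speech waveform (assumption)
landmark_windows = windows_around_landmarks(
    signal, landmarks=[5, 12], half_width=2  # made-up landmark positions
)
```

With these made-up positions, the samples between the two windows are never examined, which is the sense in which landmarks act as foci. A segmentation approach given the same two positions as boundaries would instead analyze the stretch lying between them.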

Objective
• The most numerous types of landmarks are acoustically abrupt
  - Zue, V., Seneff, S., and Glass, J. (1990a). “Speech database development at MIT: TIMIT and beyond”, Speech Commun.
  - An estimate based on a phonetically balanced subset of sentences in the TIMIT corpus shows that acoustically abrupt landmarks comprise approximately 68% of the total number of landmarks in speech
  - Often associated with consonantal segments, like a stop closure or release

I. LANDMARKS
• Categorized into four groups
  - Abrupt-consonantal (AC)
  - Abrupt (A)
  - Nonabrupt (N)
  - Vocalic (V)

I. LANDMARKS (cont.)
• Phonologically, segments can be classified as [+consonantal] or [-consonantal]
  - Sagey, E. (1986). “The representation of features and relations in nonlinear phonology”
• A [+consonantal] segment involves a primary articulator forming a tight constriction in the midline of the vocal tract (lips, tongue blade, tongue body)
• A [-consonantal] segment does not involve a primary articulator forming a tight constriction (soft palate, glottis)
• Speech is formed by a series of articulator narrowings and releases

I. LANDMARKS (cont.)
• The most salient of these narrowings and releases are acoustically abrupt
• An acoustically abrupt constriction involving a primary articulator is typically tight and is a consequence of implementing a [+consonantal] segment
• One abrupt-consonantal (AC) landmark marks the closure, and another marks the release, of one of these constrictions

I. LANDMARKS (cont.)
• The clearest manifestation of an AC landmark occurs when the constriction is adjacent to a [-consonantal] segment
• A pair of these landmarks, one on either side of the constriction, will be referred to as the outer AC landmarks
  - Ex: [b] closure and release in “able”
• Other landmarks can occur within or outside of the pair of outer AC landmarks

I. LANDMARKS (cont.)
• A common sequence of landmarks is one in which the outer AC landmarks are governed by the same underlying segment and thus are implemented by the same articulator
  - Ex: [b] closure and release in “able”
• Some outer AC landmarks are not governed by the same articulator
  - Ex: [p] closure and [d] release in “tap dance”
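
One possible way to represent the landmark types and the two examples above is sketched below in Python. The class names, field names, and the `same_articulator` helper are hypothetical and are not part of the presentation. The articulator labels (lips for [b] and [p], tongue blade for [d]) are assumptions based on standard articulatory phonetics rather than on the slides.

```python
from dataclasses import dataclass
from enum import Enum


class LandmarkType(Enum):
    """The four landmark groups listed in Section I."""
    ABRUPT_CONSONANTAL = "AC"
    ABRUPT = "A"
    NONABRUPT = "N"
    VOCALIC = "V"


@dataclass(frozen=True)
class Landmark:
    kind: LandmarkType
    event: str        # "closure" or "release"
    segment: str      # underlying segment, e.g. "b"
    articulator: str  # primary articulator (assumed labels)


def same_articulator(closure, release):
    """True when a pair of outer AC landmarks shares one articulator."""
    return closure.articulator == release.articulator


AC = LandmarkType.ABRUPT_CONSONANTAL

# "able": [b] closure and [b] release, both made with the lips
able = (Landmark(AC, "closure", "b", "lips"),
        Landmark(AC, "release", "b", "lips"))

# "tap dance": [p] closure (lips) followed by [d] release (tongue blade)
tap_dance = (Landmark(AC, "closure", "p", "lips"),
             Landmark(AC, "release", "d", "tongue blade"))
```

Calling `same_articulator(*able)` and `same_articulator(*tap_dance)` distinguishes the two cases described on the slide: a pair governed by one segment, and a pair whose closure and release belong to different segments.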