Algorithms for Dysfluency Detection in Symbolic Sequences using Suffix Arrays


1. Algorithms for Dysfluency Detection in Symbolic Sequences using Suffix Arrays
J. Pálfy 1,2, J. Pospíchal 1
1 Slovak University of Technology, Faculty of Informatics and Information Technologies, Bratislava, Slovakia
2 Slovak Academy of Sciences, Institute of Informatics, Bratislava, Slovakia
Text, Speech and Dialogue, September 3, 2013

2. Overview
◮ Introduction to Dysfluencies
◮ Motivation in Dysfluent Speech Recognition
◮ Common Approach & Problem with “Complex” Dysfluencies
◮ Methodology
◮ Results
◮ Conclusion

3. Introduction to Dysfluencies
◮ Dysfluencies are disruptions or breaks in the smooth flow of speech. (Shipley & McAfee, 1998)
◮ Unlike read speech, spontaneous speech contains high rates of disfluencies. (Shriberg, 1994)

4. Understanding Different Types of Speech Disfluencies
“Normal” disfluencies:
◮ Hesitations (pauses)
◮ Interjections (um, uh, er)
◮ Revisions (“I want- I need that”)
◮ Repetitions of phrases (“I want- I want that”)
◮ Repetitions of multisyllabic whole words (“mommy- mommy- mommy let’s go.”)
◮ Repetitions of monosyllabic whole words (“I-I-I want to go.”)
“Stuttered” disfluencies:
◮ Repetitions of sounds or syllables (“li-li-like this”)
◮ Prolongations (“llllllike this”)
◮ Blocks (“l---ike this”)
Reactions along the continuum from “normal” to “stuttered”:
◮ Disfluencies occur more frequently
◮ Tension or struggle increases
◮ Duration (length) of disfluencies increases
◮ Tension during “normal” disfluencies
NOTE: “Normal” disfluencies can be used to avoid or postpone stuttering (e.g., “I um, you know, uh, I want to um, g-g-g-o with you.”)
From Yaruss & Reardon (2006), Young Children Who Stutter: Information and Support for Parents. New York: National Stuttering Association (NSA).

5. Motivation in Dysfluent Speech Recognition
Dysfluent speech recognition:
◮ Speech Language Pathology (SLP) - automatic & objective evaluation, e.g. an analysis tool
◮ Automatic Speech Recognition (ASR) - improve the accuracy, e.g. a dysfluency module

6. Problem with Dysfluencies
◮ ASR systems are built on the statistical distribution of atomic parts of speech
◮ the sparse regularity of dysfluencies makes it hard to design ASR models (like Hidden Markov Models, HMM) for them
◮ ASR complexity: every transition between states that can occur during dysfluent events would have to be defined

7. Conventions
In our work we used the following convention:
◮ “simple” dysfluencies - e.g. part-word/syllable repetitions (R1), prolongations (P); already studied in many works
  e.g. P: rrrun; R1: re re research
◮ “complex” dysfluencies - a chaotic mixture of dysfluent events (e.g. repetition of a phrase, prolongation combined with hesitation & repetition); frequent in stutterers’ speech
  e.g. I do my, I do my work; j j j jer j j jer ja just

8. Common Approach & Problem with “Complex” Dysfluencies
Common approach:
◮ fix a window (e.g. 200 - 800 ms)
◮ build a dysfluency recognition system (e.g. Artificial Neural Networks, Support Vector Machines)
◮ recognize the “simple” dysfluent events in the fixed interval
Problem:
◮ dysfluencies frequently do not fit the fixed window, but are dynamically distributed throughout much longer 2 - 4 s intervals
◮ how to choose the right window size for “complex” dysfluencies?

9. Our Methodology
◮ our solution: combine & apply methods from other fields of science
◮ Speech Language Pathology - knowledge of dysfluencies
◮ Data Mining - mining time series, Symbolic Aggregate Approximation (SAX)
◮ Bioinformatics - sequence (DNA) analysis, Suffix Arrays
[Figure: speech waveform → feature vector → Alg. 1-2 → sequence analysis, linking SLP, data mining, and bioinformatics]

10. Methodology: Corpus
◮ University College London Archive of Stuttered Speech (UCLASS)
◮ Howell, Huckvale, 2004: ~500 recordings, 16 - 44.1 kHz, 2 - 15 min playing time, age 8 - 47 years, male/female
◮ Howell, Davis, Bartrip, 2009: 12 selected recordings, working set from UCLASS
◮ we annotated & used a subset of this working set, 22.05 kHz, 19:32 min playing time

11. Methodology: Feature Extraction (PAA, SAX)
◮ speech, 22.05 kHz
◮ short-time energy: X = x_1, …, x_N (1)
◮ Piecewise Aggregate Approximation (PAA): X̄ = x̄_1, …, x̄_n, where
  x̄_i = (n/N) Σ_{j=(N/n)(i−1)+1}^{(N/n)i} x_j (2)
◮ breakpoints: B = β_1, …, β_{a−1} (3)
◮ Symbolic Aggregate Approximation (SAX): Ŵ = ŵ_1, …, ŵ_m (4)
◮ mapping X̄ → Ŵ: ŵ_i = a_j iff β_{j−1} < x̄_i ≤ β_j (5)
[Figure: speech → short-time energy → PAA → SAX; lexical content of the example: “c can c c can”]
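A minimal sketch of this symbolization step, assuming a SAX alphabet of size 4 with the standard Gaussian breakpoints and that n divides N; the function names are illustrative, not the authors' code:

```python
# Standard SAX breakpoints for alphabet size 4 (equiprobable regions of N(0,1)).
BREAKPOINTS = [-0.6745, 0.0, 0.6745]

def paa(x, n):
    """Piecewise Aggregate Approximation: average N samples down to n frame means.
    Assumes n divides len(x) for simplicity."""
    w = len(x) // n
    return [sum(x[i * w:(i + 1) * w]) / w for i in range(n)]

def sax(x, n, alphabet="abcd"):
    """Map a numeric series to a SAX word of length n: z-normalize, PAA,
    then pick the symbol of the region beta_{j-1} < x_bar <= beta_j."""
    mu = sum(x) / len(x)
    sd = (sum((v - mu) ** 2 for v in x) / len(x)) ** 0.5 or 1.0  # guard flat input
    bars = paa([(v - mu) / sd for v in x], n)
    # Count how many breakpoints each PAA mean exceeds -> alphabet index j.
    return "".join(alphabet[sum(bar > b for b in BREAKPOINTS)] for bar in bars)

print(sax([0, 0, 0, 0, 10, 10, 10, 10], 2))  # → "ad"
```

The low first half maps to the lowest symbol and the high second half to the highest, which is the behavior Eq. (5) describes.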

12. Methodology: Data Structure, Suffix Arrays
◮ large sequence C = c_0 c_1 … c_{N−1}
◮ suffix of C: C_i = c_i c_{i+1} … c_{N−1}
◮ lexicographically sorted array Pos
◮ Pos[k] is the k-th smallest suffix in the set C_0, C_1, …, C_{N−1}
◮ assuming Pos is given, then C_{Pos[0]} < C_{Pos[1]} < … < C_{Pos[N−1]}, where ‘<’ denotes the lexicographic order
Example, C = processing$ (positions 1 … 11):
   i   Pos[i]   C[Pos[i] … n]
   1   11       $
   2    4       cessing$
   3    5       essing$
   4   10       g$
   5    8       ing$
   6    9       ng$
   7    3       ocessing$
   8    1       processing$
   9    2       rocessing$
  10    7       sing$
  11    6       ssing$
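The slide's table can be reproduced with a naive construction (1-based positions to match the table; a production implementation would use an O(N log N) algorithm instead of sorting full suffixes):

```python
def suffix_array(c):
    """Build Pos for sequence c: Pos[k] is the 1-based start position of the
    k-th lexicographically smallest suffix. Naive O(N^2 log N) sketch."""
    return sorted(range(1, len(c) + 1), key=lambda i: c[i - 1:])

C = "processing$"
pos = suffix_array(C)
print(pos)  # → [11, 4, 5, 10, 8, 9, 3, 1, 2, 7, 6], the table above
```

The sentinel `$` sorts before every letter, so the empty-ish suffix `$` comes first, exactly as in row i = 1 of the table.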

13. Methodology: Our Derived Functions
◮ prolongations are characterized by a minimal difference between neighboring frames
◮ functions from video segmentation were adapted for speech
x = x_1, …, x_N,  y = y_1, …, y_N (6)
D(x, y) = (1/N) Σ_{i=1}^{N} |x_i − y_i| (7)
D_b(x) = Σ_{i=1}^{b} D(x_i, x_{i+l}) (8)
D_h(H_x) = Σ_{i=1}^{h} D(H_x(i), H_x(i+l)) (9)
[Figure: speech waveform, wideband spectrogram, and prolongation detection functions D, D_b, D_h, D_g over time; lexical content: “personal s:eedee player”]
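In plain Python, Eqs. (7)-(8) amount to the following (0-based frame indexing here, and passing b and l as explicit parameters is an assumption; the slide fixes them implicitly). A prolongation shows up as D values near zero between neighboring frames:

```python
def D(x, y):
    """Eq. (7): mean absolute difference between two equal-length frames."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y)) / len(x)

def Db(frames, b, l):
    """Eq. (8): sum of D over b frame pairs separated by a lag of l frames.
    `frames` is a list of feature vectors (e.g. short-time energy frames)."""
    return sum(D(frames[i], frames[i + l]) for i in range(b))

# Identical neighboring frames (a prolongation-like stretch) give D == 0.
print(D([1, 2, 3], [1, 2, 3]))  # → 0.0
```
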

14. Methodology: Our Developed Algorithms
◮ Alg. 1 - for speech pattern searching
◮ Alg. 2 - for searching repeated patterns (repetitions) in speech
◮ P is a short sequence, C is a long sequence, s is a shift, l is the length of C
1: while i < n do                    ⊲ Begin Alg. 2: in the i-th window, the 1st block is set to P, the remaining blocks are put to C
2:   Compute Pos for P.              ⊲ Pos is a suffix array
3:   With Pos construct Tab for P.   ⊲ Tab is a look-up table
4:   while s < l do                  ⊲ Begin Alg. 1
5:     Use Tab to query C in P.
6:     Save pattern’s position and pattern’s length.
7:   end while                       ⊲ End Alg. 1
8: end while                         ⊲ End Alg. 2
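The look-up table Tab and the exact windowing are not fully recoverable from the slide, so the sketch below only illustrates the inner loop: binary search over P's suffix array in place of Tab, shifting s along C and saving (position, length) of chunks of C found in P. The names `occurrences`, `alg1`, `min_len`, and the greedy growing rule are assumptions:

```python
from bisect import bisect_left, bisect_right

def occurrences(p, pos, q):
    """Start positions of pattern q inside p, by binary search on p's sorted
    suffixes (the suffix-array query behind line 5 of the pseudocode)."""
    suffixes = [p[i:] for i in pos]            # materialized here; O(n^2), sketch only
    lo = bisect_left(suffixes, q)
    hi = bisect_right(suffixes, q + "\uffff")  # upper bound: everything prefixed by q
    return sorted(pos[lo:hi])

def alg1(p, c, min_len=2):
    """Shift s along the long sequence C and record (position, length) of
    chunks of C that occur in the short sequence P."""
    pos = sorted(range(len(p)), key=lambda i: p[i:])  # Pos, suffix array of P
    hits, s = [], 0
    while s < len(c):
        length = 0
        # Greedily grow the longest substring of C starting at s found in P.
        while s + length < len(c) and occurrences(p, pos, c[s:s + length + 1]):
            length += 1
        if length >= min_len:
            hits.append((s, length))
        s += max(length, 1)
    return hits

print(alg1("abc", "xxabcx"))  # → [(2, 3)]: "abc" found at shift 2, length 3
```

This inverts the usual database setting noted later in the deck: the index is built over the short query-side sequence P, and the long sequence C is streamed against it.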

15. Methodology: Our Features for “Complex” Dysfluencies
For every 5 s long interval, 3 features of 100 ms blocks were computed:
◮ patterns’ average redundancy
◮ patterns’ relative frequency
◮ patterns’ redundancies sum
[Figure: iterative output of the algorithms - overlapping windows of numbered blocks (1-10), evaluated by columns and by rows]
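The slide names the three features but does not define them, so the formulas below are assumptions sketched for illustration only: `counts` maps each repeated pattern found in a 5 s interval to its number of occurrences among the interval's 100 ms blocks, and redundancy is taken as the number of extra copies beyond the first.

```python
def pattern_features(counts, n_blocks):
    """Hedged sketch of the three per-interval features; the exact
    definitions in the paper may differ."""
    redundancies = [c - 1 for c in counts.values() if c > 1]   # extra copies
    red_sum = sum(redundancies)                                # redundancies sum
    avg_red = red_sum / len(redundancies) if redundancies else 0.0
    rel_freq = sum(counts.values()) / n_blocks if n_blocks else 0.0
    return avg_red, rel_freq, red_sum

# A "c can c c can"-style interval: one pattern seen 3x, one 2x, one 1x.
print(pattern_features({"ab": 3, "cd": 2, "ef": 1}, 10))  # → (1.5, 0.6, 3)
```
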

16. Methodology: Main Steps in Running Algorithms
◮ Alg. 1-2 based on SAX - Symrep
◮ in relational DBs, a short query is executed over a large set of data
◮ Alg. 1 - opposite to relational DBs: query a long sequence C in a short sequence P
◮ Alg. 2 - capability to adapt to an unknown repeated speech pattern length
◮ DTW based on MFCC - Specrep

17. Results: Statistical Analysis
Process of classifier design:
◮ measurement of data class separability - correlation
◮ study of data characteristics - Mann-Whitney U-test
Compared features:
◮ Specrep - DTW on the basis of MFCC features
◮ Symrep - our developed algorithms on the basis of SAX
◮ r - correlation coefficients
◮ h - accepted hypotheses (p-values < 0.05 level)

18. Results: Objective Assessment
◮ SVMs to perform objective assessment of MFCC, Specrep, Symrep
◮ training (80 %) & testing (20 %) sets
◮ we trained individual SVMs with a sigmoidal kernel function

19. Conclusion
◮ derived functions for prolongation detection
◮ developed algorithms Alg. 1-2 for detection of “complex” dysfluencies
◮ newly designed features - statistically analyzed
◮ objective assessment of the new features & MFCC by SVM, 47.4 %
◮ symbolic sequences are competitive with the spectral domain

20. Bibliography 1/2
◮ Camastra, F. and Vinciarelli, A., Machine Learning for Audio, Image and Video Analysis: Theory and Applications. Springer-Verlag London Limited, 2008.
◮ Hamel, L., Knowledge Discovery with Support Vector Machines. John Wiley & Sons, Inc., Hoboken, NJ, USA, July 2009.
◮ Howell, P., Davis, S., Bartrip, J., The UCLASS archive of stuttered speech. Journal of Speech, Language, and Hearing Research, 52, pp. 556-569, 2009.
◮ Keogh, E., Chakrabarti, K., Pazzani, M., and Mehrotra, S., Dimensionality Reduction for Fast Similarity Search in Large Time Series Databases. Knowledge and Information Systems 3, pp. 263-286, 2001.
