Unsupervised Vocabulary Induction
1. Infant Language Acquisition (Saffran et al., 1997)
• 8-month-old babies exposed to a stream of syllables
• Stream composed of synthetic words (pabikumalikiwabufa)
• After only 2 minutes of exposure, infants can distinguish words from non-words (e.g., pabiku vs. kumali)

Today: Unsupervised Vocabulary Induction
• Vocabulary Induction from Unsegmented Text
• Vocabulary Induction from Speech Signal
  – Sequence Alignment Algorithms

Vocabulary Induction
Task: Unsupervised learning of word boundary segmentation
• Simple: Ourenemiesareinnovativeandresourceful,andsoarewe. Theyneverstopthinkingaboutnewwaystoharmourcountryandourpeople,andneitherdowe.
• More ambitious: learning the vocabulary directly from the speech signal

2. Word Segmentation (Ando & Lee, 2000)
Key idea: for each candidate boundary, compare the frequency of the n-grams adjacent to the proposed boundary with the frequency of the n-grams that straddle it.
• Example string: T I N G | E V I D, with the candidate boundary between G and E
• For N = 4, consider the 6 questions of the form "Is #(s_i) ≥ #(t_j)?", where #(x) is the number of occurrences of x
• Example: Is "TING" more frequent in the corpus than "INGE"?

Algorithm for Word Segmentation
Notation:
• s_1^n, s_2^n: the non-straddling n-grams to the left and to the right of location k
• t_j^n: the straddling n-gram with j characters to the right of location k
• I_≥(y, z): indicator function that is 1 when y ≥ z, and 0 otherwise
1. Calculate the fraction of affirmative answers for each n ≤ N:
   v_n(k) = 1 / (2(n − 1)) · Σ_{i=1}^{2} Σ_{j=1}^{n−1} I_≥(#(s_i^n), #(t_j^n))
2. Average the contributions of each n-gram order:
   v_N(k) = (1 / |N|) · Σ_{n ∈ N} v_n(k)

Algorithm for Word Segmentation (Cont.)
Place a boundary at all locations l such that either:
• l is a local maximum: v_N(l) > v_N(l − 1) and v_N(l) > v_N(l + 1)
• v_N(l) ≥ t, a threshold parameter

Experimental Framework
• Corpus: 150 megabytes of 1993 Nikkei newswire
• Manual annotations: 50 sequences for the development set (parameter tuning) and 50 sequences for the test set
• Baseline algorithms: Chasen and Juman morphological analyzers (115,000 and 231,000 words)
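To make the statistic concrete, here is a minimal Python sketch of v_n(k) and v_N(k), assuming the corpus is a single unsegmented Python string and ignoring edge effects near the ends of the string; the function and variable names are mine, not Ando & Lee's.

```python
from collections import Counter

def ngram_counts(text, orders):
    """Count every character n-gram of each requested order in the unsegmented text."""
    return {n: Counter(text[i:i + n] for i in range(len(text) - n + 1))
            for n in orders}

def v_n(text, counts, k, n):
    """Fraction of affirmative answers to "is #(s_i) >= #(t_j)?" at location k
    for one n-gram order n: s_1, s_2 are the non-straddling n-grams on either
    side of k, and t_1 .. t_{n-1} are the n-grams straddling k."""
    s_grams = [text[k - n:k], text[k:k + n]]
    yes = 0
    for s in s_grams:
        for j in range(1, n):                       # j characters to the right of k
            t = text[k - (n - j):k + j]
            yes += counts[n][s] >= counts[n][t]
    return yes / (2 * (n - 1))

def v_N(text, k, orders=(2, 3, 4)):
    """Average v_n(k) over the chosen n-gram orders.
    (In practice, compute the counts once and reuse them for every k.)"""
    counts = ngram_counts(text, orders)
    return sum(v_n(text, counts, k, n) for n in orders) / len(orders)

# Boundaries are then placed wherever v_N is a local maximum or exceeds the threshold t.
```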

3. Evaluation
• Precision (P): the percentage of proposed brackets that exactly match word-level brackets in the annotation
• Recall (R): the percentage of word-level annotation brackets that are proposed by the algorithm
• F = 2PR / (P + R)
• F = 82% (improvement of 1.38% over Juman and of 5.39% over Chasen)

Performance on other datasets
  Orwell (English): 79.8
  Song lyrics (Romaji): 67.6 — Cheng & Mitzenmacher
  Goethe (German): 75.2
  Verne (French): 72.9
  Arrighi (Italian): 73.1

Today: Unsupervised Vocabulary Induction
• Vocabulary Induction from Unsegmented Text
• Vocabulary Induction from Speech Signal
  – Sequence Alignment Algorithms

Aligning Two Sequences
Given two possibly related strings S1 and S2, find the longest common subsequence.
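As a warm-up for the alignment algorithms that follow, a small sketch of the standard dynamic program for the longest common subsequence the slide asks for; this is illustrative code with names of my choosing, not from the slides.

```python
def lcs(s1, s2):
    """Length of the longest common subsequence of s1 and s2 (plus one such
    subsequence), via the standard O(len(s1) * len(s2)) dynamic program."""
    n, m = len(s1), len(s2)
    L = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if s1[i - 1] == s2[j - 1]:
                L[i][j] = L[i - 1][j - 1] + 1
            else:
                L[i][j] = max(L[i - 1][j], L[i][j - 1])
    # Traceback to recover one longest common subsequence.
    out, i, j = [], n, m
    while i and j:
        if s1[i - 1] == s2[j - 1]:
            out.append(s1[i - 1]); i -= 1; j -= 1
        elif L[i - 1][j] >= L[i][j - 1]:
            i -= 1
        else:
            j -= 1
    return L[n][m], ''.join(reversed(out))

# e.g. lcs("pabiku", "kumali")[0] == 2
```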

4. How Can We Compute the Best Alignment?
• We need a scoring system for ranking alignments:
  – Substitution cost, e.g. for the DNA alphabet:

          A      G      T      C
    A     1     0.5    -1     -1
    G   -0.5     1     -1     -1
    T    -1     -1      1    -0.5
    C    -1     -1    -0.5     1

  – Gap (insertion & deletion) cost

Key Insight: Score is Additive
Compute the best alignment recursively.
• For a given aligned pair (i, j), the best alignment is:
  Best alignment of S1[1...i] and S2[1...j] + Best alignment of S1[i...n] and S2[j...m]

Can We Simply Enumerate All Possible Alignments?
• Naive enumeration is prohibitively expensive: the number of alignments of sequences of lengths n and m is roughly C(n + m, m) = (n + m)! / (n!·m!); for n = m this is (2m)! / (m!)^2 ≈ 2^(2m) / √(πm)

    n = m    Number of alignments
    10       184,756
    20       ≈ 1.4 × 10^11
    100      ≈ 9.0 × 10^58

• Alignment using dynamic programming can be done in O(n·m)

Alignment Matrix
Alignment of two sequences can be modeled as the task of finding the path with the highest weight in a matrix.
Example alignment of HEAGAWG with PAW:

    H E A G A W G
    - - P - A W -

The corresponding path steps through the marked cells of the matrix whose rows are P, A, W and whose columns are H, E, A, G, A, W, G (shown as a figure on the slide).
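The blow-up in the enumeration table above can be checked directly (math.comb requires Python 3.8+):

```python
import math

# Number of global alignments of two length-m sequences: C(2m, m) = (2m)! / (m!)^2
for m in (10, 20, 100):
    print(m, math.comb(2 * m, m))
# 10 -> 184756, 20 -> 137846528820 (about 1.4e11), 100 -> about 9.05e58
```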

5. Global Alignment: Needleman-Wunsch Algorithm
• To align two strings x, y, we construct a matrix F
  – F(i, j): the score of the best alignment between the initial segment x_1...i of x up to x_i and the initial segment y_1...j of y up to y_j
• We compute F recursively, starting from F(0, 0) = 0: each cell F(i, j) is reached from F(i−1, j−1) with score s(x_i, y_j), or from F(i−1, j) or F(i, j−1) with gap penalty −d

Dynamic Programming Formulation
s(x_i, y_j): similarity between x_i and y_j; d: gap penalty

  F(i, j) = max { F(i−1, j−1) + s(x_i, y_j),  F(i−1, j) − d,  F(i, j−1) − d }

Boundary conditions:
• The top row: F(i, 0) = −i·d; F(i, 0) represents alignments of a prefix of x to all gaps in y
• The left column: F(0, j) = −j·d

Traceback:
• We know how to compute the best score
  – It is the number at the bottom-right entry (i.e., F(n, m))
• But we need to remember where it came from
  – Keep a pointer to the choice we made at each step
• Retrace the path through the matrix
  – Need to remember all the pointers
• Time: O(m·n)

Local Alignment: Smith-Waterman Algorithm
• Global alignment: find the best match between sequences from one end to the other
• Local alignment: find the best match between subsequences of two sequences
  – Useful for comparing highly divergent sequences when only local similarity is expected
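A minimal implementation sketch of the Needleman-Wunsch recurrence and traceback above, assuming a caller-supplied similarity function s(a, b) and a linear gap penalty d; helper names are mine.

```python
def needleman_wunsch(x, y, s, d):
    """Global alignment of strings x and y.
    s(a, b): substitution score; d: linear gap penalty."""
    n, m = len(x), len(y)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    ptr = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):                 # first column: prefix of x aligned to gaps
        F[i][0], ptr[i][0] = -i * d, 'up'
    for j in range(1, m + 1):                 # first row: prefix of y aligned to gaps
        F[0][j], ptr[0][j] = -j * d, 'left'
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            choices = {'diag': F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                       'up':   F[i - 1][j] - d,
                       'left': F[i][j - 1] - d}
            ptr[i][j] = max(choices, key=choices.get)
            F[i][j] = choices[ptr[i][j]]
    # Traceback from the bottom-right entry F(n, m) using the stored pointers.
    ax, ay, i, j = [], [], n, m
    while i or j:
        move = ptr[i][j]
        if move == 'diag':
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == 'up':
            ax.append(x[i - 1]); ay.append('-'); i -= 1
        else:
            ax.append('-'); ay.append(y[j - 1]); j -= 1
    return F[n][m], ''.join(reversed(ax)), ''.join(reversed(ay))
```

Calling it with a simple score such as `lambda a, b: 1 if a == b else -1` and a small gap penalty aligns two short strings end to end.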

6. Dynamic Programming Formulation (Local Alignment)

  F(i, j) = max { 0,  F(i−1, j−1) + s(x_i, y_j),  F(i−1, j) − d,  F(i, j−1) − d }

Boundary conditions: F(i, 0) = F(0, j) = 0
Finding the best local alignment:
• Find the highest value of F(i, j), and start the traceback from there
• The traceback ends when a cell with value 0 is found

Local vs. Global Alignment
Similarity matrix s(x_i, y_j):

           H    E    A    G    A    W    G
     P    -2   -1   -1   -2   -1   -4   -2
     A    -2   -1    5    0    5   -3    0
     W    -3   -3   -3   -3   -3   15   -3

Global alignment matrix (gap penalty d = 8):

                H    E    A    G    A    W    G
           0   -8  -16  -24  -32  -40  -48  -56
     P    -8   -2   -9  -17  -25  -33  -42  -49
     A   -16  -10   -3   -4  -12  -20  -28  -36
     W   -24  -18  -11   -6   -7  -15   -5  -13

Local alignment matrix:

                H    E    A    G    A    W    G
           0    0    0    0    0    0    0    0
     P     0    0    0    0    0    0    0    0
     A     0    0    0    5    0    5    0    0
     W     0    0    0    0    2    0   20   12

Today: Unsupervised Vocabulary Induction
• Vocabulary Induction from Unsegmented Text
• Vocabulary Induction from Speech Signal
  – Sequence Alignment Algorithms

Finding Words in Speech
• Traditional approaches to speech recognition are supervised:
  – Recognizers are trained using a large corpus of speech with corresponding transcripts
  – During the training process, a recognizer is provided with a vocabulary
• Is it possible to learn the vocabulary directly from the speech signal?
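Returning to the alignment machinery for a moment: the local (Smith-Waterman) variant, which the speech pipeline below reuses, differs from the global sketch only in the 0 inside the max, the all-zero boundary, and a traceback that starts at the best cell and stops at a zero cell. Same assumptions as above (caller-supplied s and d, names mine).

```python
def smith_waterman(x, y, s, d):
    """Local alignment: cell values are clamped at 0, the boundary row and
    column are 0, and the traceback starts at the highest-scoring cell and
    stops at a cell with value 0."""
    n, m = len(x), len(y)
    F = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_ij = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            F[i][j] = max(0.0,
                          F[i - 1][j - 1] + s(x[i - 1], y[j - 1]),
                          F[i - 1][j] - d,
                          F[i][j - 1] - d)
            if F[i][j] > best:
                best, best_ij = F[i][j], (i, j)
    # Traceback, re-deriving which choice produced each cell.
    ax, ay = [], []
    i, j = best_ij
    while F[i][j] > 0:
        if F[i][j] == F[i - 1][j - 1] + s(x[i - 1], y[j - 1]):
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif F[i][j] == F[i - 1][j] - d:
            ax.append(x[i - 1]); ay.append('-'); i -= 1
        else:
            ax.append('-'); ay.append(y[j - 1]); j -= 1
    return best, ''.join(reversed(ax)), ''.join(reversed(ay))
```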

7. Vocabulary Induction: Outline

Spectral Vectors
• A spectral vector is a vector in which each component is a measure of energy in a particular frequency band
• We divide the acoustic signal (a one-dimensional waveform) into short overlapping intervals (25 msec with 15 msec overlap)
• We convert each overlapping window using the Fourier transform

Comparing Acoustic Signals

Example of Spectral Vectors
[Spectrograms (frequency in Hz vs. time in sec) of two utterances: "he too was diagnosed with paranoid schizophrenia" and "were willing to put nash's schizophrenia on record"]
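A rough sketch of the spectral-vector computation described above, assuming a mono waveform sampled at 16 kHz held in a NumPy array; 25 ms windows with 15 ms overlap correspond to a 10 ms hop. Real front ends typically add mel filterbanks and log compression on top of this.

```python
import numpy as np

def spectral_vectors(signal, sr=16000, win_ms=25, hop_ms=10):
    """Short-time Fourier magnitude spectra: one vector of frequency-band
    energies per 25 ms window, with windows starting every 10 ms
    (i.e. 15 ms of overlap between consecutive windows)."""
    win = int(sr * win_ms / 1000)      # samples per window
    hop = int(sr * hop_ms / 1000)      # samples between window starts
    frames = [signal[i:i + win] * np.hanning(win)
              for i in range(0, len(signal) - win + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1))   # shape: (num_frames, win // 2 + 1)
```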

8. Comparing Spectral Vectors
• Divide the acoustic signal into "word segments" based on pauses
• Compute spectral vectors for each segment
• Build a distance matrix for each pair of "word segments"
  – use the Euclidean distance to compare spectral vectors

Computing Local Alignment

Example of Distance Matrix

Clustering Similar Utterances
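A hedged sketch of that pairwise distance matrix for two "word segments", each represented as a frames-by-bands array of spectral vectors (e.g. from the spectral_vectors sketch above); flipping the sign of the distances gives a similarity that can be fed to a local-alignment routine like the smith_waterman sketch.

```python
import numpy as np

def distance_matrix(seg_a, seg_b):
    """Euclidean distance between every spectral vector of segment A
    (shape: frames_a x bands) and every spectral vector of segment B
    (shape: frames_b x bands); result has shape (frames_a, frames_b)."""
    diff = seg_a[:, None, :] - seg_b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# For alignment, turn distances into similarities, e.g. sim = offset - distance_matrix(a, b),
# so that acoustically close frame pairs get high (positive) scores.
```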

9. Examples of Computed Clusters
