Direct sequence modeling
In speech and language processing, we usually want to operate over sequences, not single classifications.
What happens if we "change the direction" of the arrows of an HMM? We get a direct model of P(S|O):
P(S|O) = P(S_1|O_1) \prod_{i>1} P(S_i|S_{i-1}, O_i)
[Figure: chain S1-S2-S3 with observations O1, O2, O3 feeding each state via P(S_i|S_{i-1},O_i)]
MEMMs
If a log-linear term is used for P(S_i|S_{i-1},O_i), then this is a Maximum Entropy Markov Model (MEMM) (Ratnaparkhi 1996; McCallum, Freitag & Pereira 2000).
As in MaxEnt, we take features of the observations and learn a weighted model:
P(S|O) = P(S_1|O_1) \prod_{i>1} P(S_i|S_{i-1}, O_i), where
P(S_i|S_{i-1}, O_i) \propto \exp\left( \sum_j \lambda_j f_j(S_{i-1}, S_i, O, i) \right)
[Figure: chain S1-S2-S3 with observations O1, O2, O3]
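To make the locally normalized form concrete, here is a minimal sketch of the MEMM transition distribution, assuming hand-written indicator features and made-up weights (all names and values are illustrative, not a particular published model):

```python
import math

def memm_local_distribution(prev_state, obs, states, features, weights):
    """P(S_i | S_{i-1}, O_i) as a locally normalized log-linear model.

    features: list of functions f_j(prev_state, state, obs) -> float
    weights:  list of lambda_j, one per feature
    """
    scores = {}
    for s in states:
        scores[s] = math.exp(sum(w * f(prev_state, s, obs)
                                 for w, f in zip(weights, features)))
    z = sum(scores.values())          # local normalizer over next states only
    return {s: v / z for s, v in scores.items()}

# Toy example with two states and two hand-written indicator features.
states = ["A", "B"]
features = [
    lambda prev, s, o: 1.0 if s == "A" and o > 0 else 0.0,
    lambda prev, s, o: 1.0 if s == prev else 0.0,
]
weights = [1.5, 0.5]
print(memm_local_distribution("B", obs=0.7, states=states,
                              features=features, weights=weights))
```

Because the normalization is only over the next state, each step is a small MaxEnt classifier, which is what makes MEMMs easy to train but also what gives rise to the label bias issue discussed next.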
MEMMs
Unlike HMMs, transitions between states in MEMMs can now depend on the acoustics.
However, unlike HMMs, MEMMs can ignore observations: if P(S_i=x|S_{i-1}=y) = 1, then P(S_i=x|S_{i-1}=y, O_i) = 1 for all O_i (label bias).
Is this a problem in practice?
[Figure: chain S1-S2-S3 with observations O1, O2, O3]
MEMMs in language processing
One prominent example in part-of-speech tagging is the Ratnaparkhi "MaxEnt" tagger (1996).
It produces POS tags based on word-history features.
It is really an MEMM because it includes the previously assigned tags as part of its history.
Kuo and Gao (2003-2006) developed "Maximum Entropy Direct Models" for ASR.
Again an MEMM, this time over speech frames.
Features: what are the IDs of the Gaussians closest to this point?
Joint sequence models
Label bias problem: previous "decisions" may restrict the influence of future observations, making it harder for the system to know that it was following a bad path.
Idea: what if we had one big maximum entropy model in which we compute the joint probability of the hidden variables given the observations?
Many-diplomat problem: P(Dmat_1 ... Dmat_N | Flag_1 ... Flag_N, Lights_1 ... Lights_N)
Problem: the state space is exponential in the sequence length; for the diplomat problem it is O(2^N).
Factorization of joint sequences
What we want is a factorization that will allow us to decrease the size of the state space.
Define a Markov graph to describe the factorization: a Markov Random Field (MRF).
Neighbors in the graph contribute to the probability distribution.
More formally: the probability distribution is factored by the cliques of the graph.
Markov Random Fields (MRFs)
MRFs are undirected (joint) graphical models; cliques define the probability distribution.
The configuration size of each clique is the effective state space.
Consider a 5-diplomat series (nodes D1 ... D5):
One 5-clique (fully connected): effective state space is 2^5 (MaxEnt).
Three 3-cliques (1-2-3, 2-3-4, 3-4-5): effective state space is 2^3.
Four 2-cliques (1-2, 2-3, 3-4, 4-5): effective state space is 2^2.
Hammersley-Clifford Theorem
The Hammersley-Clifford theorem relates MRFs to Gibbs probability distributions:
if you can express the probability of a graph configuration as a product of potentials on the cliques (a Gibbs distribution), then the graph is an MRF.
P(D) = \prod_{c \in cliques(D)} \phi(c)
For the 5-diplomat chain D1-D2-D3-D4-D5: P(D) = \phi(D_1,D_2)\,\phi(D_2,D_3)\,\phi(D_3,D_4)\,\phi(D_4,D_5)
The potentials, however, must be positive. This is true if \phi(c) = \exp\left( \sum_i \lambda_i f_i(c) \right) (log-linear form).
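As a small illustration of the Gibbs form on a chain, here is a sketch that builds the joint distribution of five binary diplomat variables from log-linear pairwise potentials; the single "agreement" feature and its weight are invented for the example:

```python
import itertools, math

# Pairwise log-linear potential: phi(a, b) = exp(lambda * f(a, b)),
# with a single "agreement" feature f(a, b) = 1 iff a == b.
LAMBDA = 0.8
def phi(a, b):
    return math.exp(LAMBDA * (1.0 if a == b else 0.0))

def chain_gibbs_distribution(n=5):
    """Joint P(D_1..D_n) for a chain MRF built from pairwise potentials."""
    configs = list(itertools.product([0, 1], repeat=n))
    unnorm = {d: math.prod(phi(d[i], d[i + 1]) for i in range(n - 1))
              for d in configs}
    z = sum(unnorm.values())
    return {d: v / z for d, v in unnorm.items()}

p = chain_gibbs_distribution()
print(p[(1, 1, 1, 1, 1)], p[(0, 1, 0, 1, 0)])  # agreeing configs are more likely
```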
Conditional Random Fields (CRFs)
When the MRF is conditioned on observations, it is known as a Conditional Random Field (CRF) (Lafferty, McCallum & Pereira, 2001).
Assuming log-linear form (true of almost all CRFs), the probability is determined by weighted feature functions f_i of the cliques c and the observations O:
P(D|O) = \frac{1}{Z} \exp\left( \sum_{c \in cliques(D)} \sum_i \lambda_i f_i(c, O) \right)
\log P(D|O) = \sum_{c \in cliques(D)} \sum_i \lambda_i f_i(c, O) - \log Z
Conditional Random Fields (CRFs)
For general graphs, computing this quantity (in particular Z) is #P-hard, requiring approximate inference.
However, for special graphs the complexity is lower; for example, linear-chain CRFs have polynomial-time algorithms.
Log-linear Linear-Chain CRFs
Linear-chain CRFs have a 1st-order Markov backbone.
Feature templates for an HMM-like CRF structure for the Diplomat problem (chain D1-D2-D3-D4-D5):
f_Bias(D_i = x, i) is 1 iff D_i = x
f_Trans(D_i = x, D_{i+1} = y, i) is 1 iff D_i = x and D_{i+1} = y
f_Flag(D_i = x, Flag_i = y, i) is 1 iff D_i = x and Flag_i = y
f_Lights(D_i = x, Lights_i = y, i) is 1 iff D_i = x and Lights_i = y
With a bit of subscript liberty, the equation is
P(D_1 ... D_5 | F_{1...5}, L_{1...5}) = \frac{1}{Z(F,L)} \exp\left( \sum_{i=1}^{5} \lambda_B f_{Bias}(D_i) + \sum_{i=1}^{5} \lambda_F f_{Flag}(D_i, F_i) + \sum_{i=1}^{5} \lambda_L f_{Lights}(D_i, L_i) + \sum_{i=1}^{4} \lambda_T f_{Trans}(D_i, D_{i+1}) \right)
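A brute-force sketch of this Diplomat-problem CRF follows; for brevity it collapses each feature-template family into a single agreement indicator with one made-up weight per template, and enumerates all 2^5 assignments to normalize:

```python
import itertools, math

# Illustrative weights, one per template family (made up for the sketch).
LAM_BIAS, LAM_FLAG, LAM_LIGHTS, LAM_TRANS = 0.1, 1.0, 0.7, 0.4

def score(d, flags, lights):
    """Unnormalized log-score of one diplomat assignment d (tuple of 0/1)."""
    s = 0.0
    for i in range(5):
        s += LAM_BIAS * (1.0 if d[i] == 1 else 0.0)            # f_Bias
        s += LAM_FLAG * (1.0 if d[i] == flags[i] else 0.0)      # f_Flag
        s += LAM_LIGHTS * (1.0 if d[i] == lights[i] else 0.0)   # f_Lights
    for i in range(4):
        s += LAM_TRANS * (1.0 if d[i] == d[i + 1] else 0.0)     # f_Trans
    return s

def posterior(d, flags, lights):
    z = sum(math.exp(score(dp, flags, lights))
            for dp in itertools.product([0, 1], repeat=5))
    return math.exp(score(d, flags, lights)) / z

flags, lights = (1, 1, 0, 0, 1), (1, 0, 0, 0, 1)
print(posterior((1, 1, 0, 0, 1), flags, lights))
```

Enumeration is only feasible because the chain is tiny; the forward recursions introduced later replace it for realistic lengths.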
Log-linear Linear-Chain CRFs
In the previous example, the transitions did not depend on the observations (HMM-like).
In general, transitions may depend on observations (MEMM-like).
The general form of the linear-chain CRF groups features as state features (bias, flag, lights) or transition features.
Let s range over state features and t over transition features; i indexes into the sequence to pick out the relevant observations:
P(D|O) = \frac{1}{Z(O)} \exp\left( \sum_{s \in stateFtrs} \sum_{i=1}^{n} \lambda_s f_s(D_i, O, i) + \sum_{t \in transFtrs} \sum_{i=1}^{n-1} \lambda_t f_t(D_i, D_{i+1}, O, i) \right)
A quick note on features for ASR
Both MEMMs and CRFs require the definition of feature functions.
These are somewhat obvious in NLP (word identity, POS tag, parse structure).
In ASR, we need some sort of "symbolic" representation of the acoustics:
Which Gaussians are closest to this point? (Kuo & Gao; Hifny & Renals)
Sufficient statistics (Layton & Gales; Gunawardana et al.): with sufficient statistics one can exactly replicate a single-Gaussian HMM in a CRF, or a mixture of Gaussians in an HCRF (next!)
Other classifiers, e.g. MLPs (Morris & Fosler-Lussier)
Phoneme / multi-phone detections (Zweig & Nguyen)
Sequencing: Hidden Structure (1)
So far there has been a 1-to-1 correspondence between labels and observations, and it has been fully observed in training:
DET  N    V
the  dog  ran
Sequencing: Hidden Structure (2)
But this is often not the case for speech recognition.
Suppose we have training data like this:
Transcript: "The Dog"
Audio: (spectral representation)
Sequencing: Hidden Structure (3)
DH IY IY D AH AH G
Is "The dog" segmented like this?
Sequencing: Hidden Structure (3)
DH DH IY D AH AH G
Or like this?
Sequencing: Hidden Structure (3)
DH DH IY D AH G G
Or maybe like this?
=> An added layer of complexity.
This Can Apply in NLP as Well
[Figure: two alternative callee/caller segmentations of the utterance "Hey John Deb Abrams calling how are you"]
How should this be segmented?
Note that a segment-level feature indicating that "Deb Abrams" is a 'good' name would be useful.
Approaches to Hidden Structure
Hidden CRFs (HCRFs): Gunawardana et al., 2005
Semi-Markov CRFs: Sarawagi & Cohen, 2005
Conditional Augmented Models: Layton, 2006 thesis (Lattice C-Aug chapter); Zhang, Ragni & Gales, 2010
Segmental CRFs: Zweig & Nguyen, 2009
These differ in:
Where the Markov assumption is applied
What labels are available at training
Convexity of the objective function
Definition of features
Approaches to Hidden Structure
Method                          Markov Assumption   Segmentation known in Training   Features Prescribed
HCRF                            Frame level         No                               No
Semi-Markov CRF                 Segment level       Yes                              No
Conditional Augmented Models    Segment level       No                               Yes
Segmental CRF                   Segment level       No                               No
One View of Structure
[Figure: two frame-level alignments of "DH AE T" to the observation frames]
Consider all segmentations consistent with the transcription / hypothesis.
Apply the Markov assumption at the frame level to simplify the recursions.
Appropriate for frame-level features.
Another View of Structure
[Figure: two segment-level alignments of "DH AE T" to observations o_1 ... o_n]
Consider all segmentations consistent with the transcription / hypothesis.
Apply the Markov assumption at the segment level only ("semi-Markov").
This means long-span segmental features can be used.
Examples of Segment-Level Features in ASR
Formant trajectories
Duration models
Syllable / phoneme counts
Min/max energy excursions
Existence, expectation & Levenshtein features (described later)
Examples of Segment-Level Features in NLP
Segment includes a name
POS pattern within segment is DET ADJ N
Number of capitalized words in segment
Segment is labeled "Name" and has 2 words
Segment is labeled "Name" and has 4 words
Segment is labeled "Phone Number" and has 7 words
Segment is labeled "Phone Number" and has 8 words
Is Segmental Analysis Any Different?
We are conditioning on all the observations: do we really need to hypothesize segment boundaries?
Yes. Many features are undefined otherwise:
Duration (of what?)
Syllable/phoneme count (count where?)
Difference in C_0 between the start and end of a word
Key example: Conditional Augmented Statistical Models
Conditional Augmented Statistical Models
Layton & Gales, "Augmented Statistical Models for Speech Recognition," ICASSP 2006.
As features, use:
the likelihood of the segment with respect to an HMM model
the derivative of that likelihood with respect to each HMM model parameter
The frame-wise conditional independence assumptions of the HMM are no longer present.
Defined only at the segment level.
Now for Some Details
We will examine the general segmental case, then relate the specific approaches to it.
[Figure: segment-level alignments of "DH AE T" to observations o_1 ... o_n]
Segmental Notation & Fine Print
We will consider feature functions that cover both transitions and observations.
So a more accurate representation actually has diagonal edges, but we will generally omit them for simpler pictures.
Look at a segmentation q in terms of its edges e:
s_l^e is the label associated with the left state of an edge
s_r^e is the label associated with the right state of an edge
o(e) is the span of observations associated with an edge
[Figure: edge e between states labeled s_l^e and s_r^e, with o(e) = o_3 ... o_4]
The Segmental Equations
P(s|o) = \frac{\sum_{q : |q|=|s|} \exp\left( \sum_{e \in q} \sum_i \lambda_i f_i(s_l^e, s_r^e, o(e)) \right)}{\sum_{s'} \sum_{q' : |q'|=|s'|} \exp\left( \sum_{e \in q'} \sum_i \lambda_i f_i(s_l'^e, s_r'^e, o(e)) \right)}
We must sum over all possible segmentations of the observations consistent with a hypothesized state sequence.
Conditional Augmented Model (Lattice Version) in This View
Same segmental form as above, but the features are precisely prescribed: for each edge, the exponent combines the HMM likelihood of the segment and the derivatives of that likelihood with respect to the HMM parameters:
\exp\left( \sum_{e \in q} \sum_i \lambda_i f_i(s_l^e, s_r^e, o(e)) \right) = \exp\left( \sum_{e \in q} \left[ \alpha_{s_r^e} L_{HMM(s_r^e)}(o(e)) + \boldsymbol{\lambda}_{s_r^e}^T \nabla L_{HMM(s_r^e)}(o(e)) \right] \right)
HCRF in This View
Same segmental form, but the feature functions are decomposable at the frame level, which leads to simpler computations:
\exp\left( \sum_{e \in q} \sum_i \lambda_i f_i(s_l^e, s_r^e, o(e)) \right) = \exp\left( \sum_{k=1}^{N} \sum_i \lambda_i f_i(s_{k-1}, s_k, o_k) \right)
Semi-Markov CRF in This View
Same segmental form, but a fixed segmentation q* is known at training, so the numerator's sum over segmentations disappears:
\sum_{q : |q|=|s|} \exp\left( \sum_{e \in q} \sum_i \lambda_i f_i(s_l^e, s_r^e, o(e)) \right) \;\rightarrow\; \exp\left( \sum_{e \in q^*} \sum_i \lambda_i f_i(s_l^e, s_r^e, o(e)) \right)
Optimization of the parameters then becomes convex.
Structure Summary
Sometimes only high-level information is available:
e.g., the words someone said (training), or the words we think someone said (decoding).
Then we must consider all segmentations of the observations consistent with this.
HCRFs do this using a frame-level Markov assumption.
Semi-CRFs / segmental CRFs do not assume independence between frames.
Downside: computations are more complex. Upside: they can use segment-level features.
Conditional Augmented Models prescribe a set of HMM-based features.
Key Tasks
Compute the optimal label sequence (decoding): \arg\max_s P(s|o, \lambda)
Compute the likelihood of a label sequence: P(s|o, \lambda)
Compute the optimal parameters (training): \arg\max_\lambda \prod_d P(s_d|o_d, \lambda)
Key Cases
Viterbi Assumption   Hidden Structure        Model
NA                   NA                      Log-linear classification
Frame-level          No                      CRF
Frame-level          Yes                     HCRF
Segment-level        Yes (decode only)       Semi-Markov CRF
Segment-level        Yes (train & decode)    C-Aug, Segmental CRF
Decoding
The simplest of the algorithms: straightforward dynamic-programming recursions.
(See the table of key cases above; the next slides go over the flat log-linear, chain-structured, and segmental cases.)
Flat Log-linear Model
p(y|x) = \frac{\exp\left( \sum_i \lambda_i f_i(x, y) \right)}{\sum_{y'} \exp\left( \sum_i \lambda_i f_i(x, y') \right)}
y^* = \arg\max_y \exp\left( \sum_i \lambda_i f_i(x, y) \right)
Simply enumerate the possibilities and pick the best.
A Chain-Structured CRF
P(s|o) = \frac{\exp\left( \sum_j \sum_i \lambda_i f_i(s_{j-1}, s_j, o_j) \right)}{\sum_{s'} \exp\left( \sum_j \sum_i \lambda_i f_i(s'_{j-1}, s'_j, o_j) \right)}
s^* = \arg\max_s \exp\left( \sum_j \sum_i \lambda_i f_i(s_{j-1}, s_j, o_j) \right)
Since s is a sequence, there might be too many candidates to enumerate.
Chain-Structured Recursions
Intuition: the best way of getting to s_m = q is the best way of getting to some s_{m-1} = q' and then making the transition and accounting for the observation o_m.
\delta(m, q) is the best label-sequence score that ends in position m with label q:
\delta(m, q) = \max_{q'} \delta(m-1, q') \exp\left( \sum_i \lambda_i f_i(q', q, o_m) \right)
\delta(0, \cdot) = 1
Recursively compute the \delta s; keep track of the best q' decisions to recover the sequence.
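A sketch of the resulting Viterbi decoder for a chain CRF is below, assuming generic feature functions f_i(prev_state, state, o_j) and weights; function and variable names are illustrative, and a real implementation would work with log-scores:

```python
import math

def viterbi_chain_crf(obs, states, features, weights, start_state=None):
    """Viterbi decoding for a linear-chain CRF.

    features: list of functions f_i(prev_state, state, obs_j) -> float
    weights:  list of lambda_i
    Returns the highest-scoring state sequence.
    """
    def local_score(prev, cur, o):
        return math.exp(sum(w * f(prev, cur, o) for w, f in zip(weights, features)))

    # delta[q] = best score of any sequence ending in state q; back holds best predecessors.
    delta = {q: local_score(start_state, q, obs[0]) for q in states}
    back = []
    for o in obs[1:]:
        prev_delta, delta, choices = delta, {}, {}
        for q in states:
            best_qp = max(states, key=lambda qp: prev_delta[qp] * local_score(qp, q, o))
            delta[q] = prev_delta[best_qp] * local_score(best_qp, q, o)
            choices[q] = best_qp
        back.append(choices)
    # Trace back the best path.
    q = max(states, key=lambda s: delta[s])
    path = [q]
    for choices in reversed(back):
        q = choices[q]
        path.append(q)
    return list(reversed(path))
```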
Segmental / Semi-Markov CRF
[Figure: edge e with labels s_l^e, s_r^e spanning observations o(e) = o_{m-d} ... o_m within o_1 ... o_n]
P(s|o) = \frac{\sum_{q : |q|=|s|} \exp\left( \sum_{e \in q} \sum_i \lambda_i f_i(s_l^e, s_r^e, o(e)) \right)}{\sum_{s'} \sum_{q' : |q'|=|s'|} \exp\left( \sum_{e \in q'} \sum_i \lambda_i f_i(s_l'^e, s_r'^e, o(e)) \right)}
Segmental / Semi-Markov Recursions
\delta(m, y) is the best label-sequence score that ends at observation m with state label y:
\delta(m, y) = \max_{y', d} \delta(m-d, y') \exp\left( \sum_i \lambda_i f_i(y', y, o_{m-d+1..m}) \right)
\delta(0, \cdot) = 1
Recursively compute the \delta s; keep track of the best y' and d decisions to recover the sequence.
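The segment-level recursion can be sketched the same way; here max_dur bounds the segment length d, and the feature functions take the whole observation span of a segment (all names are illustrative):

```python
import math

def viterbi_segmental_crf(obs, labels, seg_features, weights, max_dur):
    """Viterbi decoding for a segmental (semi-Markov) CRF.

    seg_features: list of functions f_i(prev_label, label, obs_span) -> float,
                  where obs_span is the list of observations in the segment.
    Returns the best (label, start, end) segmentation (end exclusive).
    """
    def seg_score(prev, cur, span):
        return math.exp(sum(w * f(prev, cur, span) for w, f in zip(weights, seg_features)))

    n = len(obs)
    delta = {0: {None: (1.0, None)}}   # delta[m][label] = (score, backpointer)
    for m in range(1, n + 1):
        delta[m] = {}
        for y in labels:
            best = (0.0, None)
            for d in range(1, min(max_dur, m) + 1):
                for yp, (sc, _) in delta[m - d].items():
                    s = sc * seg_score(yp, y, obs[m - d:m])
                    if s > best[0]:
                        best = (s, (m - d, yp))
            delta[m][y] = best
    # Trace back from the best final label.
    y = max(delta[n], key=lambda lbl: delta[n][lbl][0])
    segments, m = [], n
    while m > 0:
        start, yp = delta[m][y][1]
        segments.append((y, start, m))
        m, y = start, yp
    return list(reversed(segments))
```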
Computing Likelihood of a State Sequence
(See the table of key cases above; we go over the flat log-linear, chain-structured, and segmental cases.)
Flat Log-linear Model
Plug in the hypothesis:
p(y|x) = \frac{\exp\left( \sum_i \lambda_i f_i(x, y) \right)}{\sum_{y'} \exp\left( \sum_i \lambda_i f_i(x, y') \right)}
Enumerate the possibilities and sum.
A Chain-Structured CRF
For a single hypothesis s, plug in and compute:
P(s|o) = \frac{\exp\left( \sum_j \sum_i \lambda_i f_i(s_{j-1}, s_j, o_j) \right)}{\sum_{s'} \exp\left( \sum_j \sum_i \lambda_i f_i(s'_{j-1}, s'_j, o_j) \right)}
We need a clever way of summing over all hypotheses to get the normalizer Z.
CRF Recursions
\alpha(m, q) is the sum of the label-sequence scores that end in position m with label q:
\alpha(m, q) = \sum_{s_{1..m} : s_m = q} \exp\left( \sum_{j=1..m} \sum_i \lambda_i f_i(s_{j-1}, s_j, o_j) \right)
\alpha(m, q) = \sum_{q'} \alpha(m-1, q') \exp\left( \sum_i \lambda_i f_i(q', q, o_m) \right)
\alpha(0, \cdot) = 1, \quad Z = \sum_{q'} \alpha(N, q')
Recursively compute the \alpha s; compute Z and plug in to find P(s|o).
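A sketch of this forward recursion follows; a real implementation would work in log space to avoid underflow, which this illustration omits, and all names are placeholders:

```python
import math

def chain_crf_z_and_prob(obs, states, features, weights, s=None, start_state=None):
    """Forward algorithm for a linear-chain CRF: computes Z and, if a
    label sequence s is given, P(s|o)."""
    def local_score(prev, cur, o):
        return math.exp(sum(w * f(prev, cur, o) for w, f in zip(weights, features)))

    # alpha[q] = sum of scores of all sequences ending in state q
    alpha = {q: local_score(start_state, q, obs[0]) for q in states}
    for o in obs[1:]:
        alpha = {q: sum(alpha[qp] * local_score(qp, q, o) for qp in states)
                 for q in states}
    z = sum(alpha.values())
    if s is None:
        return z, None
    # Unnormalized score of the hypothesized sequence s.
    num = local_score(start_state, s[0], obs[0])
    for j in range(1, len(obs)):
        num *= local_score(s[j - 1], s[j], obs[j])
    return z, num / z
```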
Segmental / Semi-Markov CRF
P(s|o) = \frac{\sum_{q : |q|=|s|} \exp\left( \sum_{e \in q} \sum_i \lambda_i f_i(s_l^e, s_r^e, o(e)) \right)}{\sum_{s'} \sum_{q' : |q'|=|s'|} \exp\left( \sum_{e \in q'} \sum_i \lambda_i f_i(s_l'^e, s_r'^e, o(e)) \right)}
For the segmental CRF, the numerator requires a summation too (over segmentations).
Both the semi-CRF and the segmental CRF require the same denominator sum.
SCRF Recursions: Denominator
\alpha(m, y) is the sum of the scores of all labelings and segmentations that end in position m with label y:
\alpha(m, y) = \sum_{s : last(s) = y} \; \sum_{q : |q|=|s|, last(q) = m} \exp\left( \sum_{e \in q} \sum_i \lambda_i f_i(s_l^e, s_r^e, o(e)) \right)
\alpha(m, y) = \sum_{y', d} \alpha(m-d, y') \exp\left( \sum_i \lambda_i f_i(y', y, o_{m-d+1..m}) \right)
\alpha(0, \cdot) = 1, \quad Z = \sum_{y'} \alpha(N, y')
Recursively compute the \alpha s; compute Z and plug in to find P(s|o).
SCRF Recursions: Numerator
[Figure: states s_{y-1}, s_y spanning observations o_{m-d} ... o_m within o_1 ... o_n]
The recursion is similar, with the state sequence fixed.
\alpha^*(m, y) will now be the sum of the scores of all segmentations ending in an assignment of observation m to the y-th state.
Note that the value of the y-th state is given! Here y is a positional index rather than a state value.
Numerator (con't.)
\alpha^*(m, y) = \sum_{q : |q| = y, last(q) = m} \exp\left( \sum_{e \in q} \sum_i \lambda_i f_i(s_l^e, s_r^e, o(e)) \right)
\alpha^*(m, y) = \sum_{d} \alpha^*(m-d, y-1) \exp\left( \sum_i \lambda_i f_i(s_{y-1}, s_y, o_{m-d+1..m}) \right)
\alpha^*(0, \cdot) = 1
Note again that here y is the position into a given state sequence s.
Summary: SCRF Probability
P(s|o) = \frac{\sum_{q : |q|=|s|} \exp\left( \sum_{e \in q} \sum_i \lambda_i f_i(s_l^e, s_r^e, o(e)) \right)}{\sum_{s'} \sum_{q' : |q'|=|s'|} \exp\left( \sum_{e \in q'} \sum_i \lambda_i f_i(s_l'^e, s_r'^e, o(e)) \right)} = \frac{\alpha^*(N, |s|)}{\sum_q \alpha(N, q)}
Compute the alphas and the numerator-constrained alphas with forward recursions, then do the division.
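Putting the two forward recursions together, a sketch of the full SCRF probability computation (again without log-space arithmetic, and with illustrative function names) might look like this:

```python
import math

def scrf_sequence_probability(obs, labels, s, seg_features, weights, max_dur):
    """P(s|o) for a segmental CRF: numerator-constrained alphas over the
    denominator alphas, both computed with forward recursions."""
    def seg_score(prev, cur, span):
        return math.exp(sum(w * f(prev, cur, span) for w, f in zip(weights, seg_features)))

    n = len(obs)
    # Denominator: alpha[m][y] sums all labelings and segmentations ending at m with label y.
    alpha = [dict() for _ in range(n + 1)]
    alpha[0] = {None: 1.0}
    for m in range(1, n + 1):
        for y in labels:
            alpha[m][y] = sum(alpha[m - d].get(yp, 0.0) * seg_score(yp, y, obs[m - d:m])
                              for d in range(1, min(max_dur, m) + 1)
                              for yp in alpha[m - d])
    z = sum(alpha[n][y] for y in labels)

    # Numerator: alpha_star[m][k] sums segmentations assigning observation m to the k-th state of s.
    alpha_star = [dict() for _ in range(n + 1)]
    alpha_star[0][0] = 1.0
    for m in range(1, n + 1):
        for k in range(1, len(s) + 1):
            prev_label = s[k - 2] if k >= 2 else None
            alpha_star[m][k] = sum(alpha_star[m - d].get(k - 1, 0.0)
                                   * seg_score(prev_label, s[k - 1], obs[m - d:m])
                                   for d in range(1, min(max_dur, m) + 1))
    return alpha_star[n].get(len(s), 0.0) / z
```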
Training
(See the table of key cases above.) We will go over the simplest cases. See also:
Gunawardana et al., Interspeech 2005 (HCRFs)
Mahajan et al., ICASSP 2006 (HCRFs)
Sarawagi & Cohen, NIPS 2005 (Semi-Markov)
Zweig & Nguyen, ASRU 2009 (Segmental CRFs)
Training
Specialized approaches exploit the form of the max-ent model:
Iterative Scaling (Darroch & Ratcliff, 1972): requires f_i(x,y) >= 0 and \sum_i f_i(x,y) = 1
Improved Iterative Scaling (Berger, Della Pietra & Della Pietra, 1996): relies only on non-negativity
General approach: gradient descent
Write down the log-likelihood for one data sample
Differentiate it with respect to the model parameters
Do your favorite form of gradient descent: conjugate gradient, Newton's method, Rprop
Applicable regardless of convexity
Training with Multiple Examples
When multiple examples are present, the contributions to the log-probability (and therefore the gradient) are additive:
L = \prod_j P(s_j|o_j)
\log L = \sum_j \log P(s_j|o_j)
To minimize notation, we omit the indexing and summation over data samples below.
Flat Log-linear Model
p(y|x) = \frac{\exp\left( \sum_i \lambda_i f_i(x, y) \right)}{\sum_{y'} \exp\left( \sum_i \lambda_i f_i(x, y') \right)}
\log P(y|x) = \sum_i \lambda_i f_i(x, y) - \log \sum_{y'} \exp\left( \sum_i \lambda_i f_i(x, y') \right)
\frac{d}{d\lambda_k} \log P(y|x) = f_k(x, y) - \frac{\sum_{y'} f_k(x, y') \exp\left( \sum_i \lambda_i f_i(x, y') \right)}{\sum_{y'} \exp\left( \sum_i \lambda_i f_i(x, y') \right)}
Flat Log-linear Model (con't.)
\frac{d}{d\lambda_k} \log P(y|x) = f_k(x, y) - \frac{1}{Z} \sum_{y'} f_k(x, y') \exp\left( \sum_i \lambda_i f_i(x, y') \right)
= f_k(x, y) - \sum_{y'} f_k(x, y') P(y'|x)
This can be computed by enumerating y'.
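A sketch of this gradient for the flat model, enumerating y' directly (the feature functions and labels are placeholders):

```python
import math

def flat_loglinear_gradient(x, y, labels, features, weights):
    """Gradient of log P(y|x) for a flat log-linear model:
    observed feature values minus their expectation under the model."""
    def score(lbl):
        return math.exp(sum(w * f(x, lbl) for w, f in zip(weights, features)))

    z = sum(score(lbl) for lbl in labels)
    posterior = {lbl: score(lbl) / z for lbl in labels}
    grad = []
    for f_k in features:
        expected = sum(posterior[lbl] * f_k(x, lbl) for lbl in labels)
        grad.append(f_k(x, y) - expected)
    return grad
```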
A Chain-Structured CRF
P(s|o) = \frac{\exp\left( \sum_j \sum_i \lambda_i f_i(s_{j-1}, s_j, o_j) \right)}{\sum_{s'} \exp\left( \sum_j \sum_i \lambda_i f_i(s'_{j-1}, s'_j, o_j) \right)}
Chain-Structured CRF (con't.)
\log P(s|o) = \sum_j \sum_i \lambda_i f_i(s_{j-1}, s_j, o_j) - \log \sum_{s'} \exp\left( \sum_j \sum_i \lambda_i f_i(s'_{j-1}, s'_j, o_j) \right)
\frac{d}{d\lambda_k} \log P(s|o) = \sum_j f_k(s_{j-1}, s_j, o_j) - \frac{1}{Z} \sum_{s'} \left( \sum_j f_k(s'_{j-1}, s'_j, o_j) \right) \exp\left( \sum_j \sum_i \lambda_i f_i(s'_{j-1}, s'_j, o_j) \right)
= \sum_j f_k(s_{j-1}, s_j, o_j) - \sum_j \sum_{s'} P(s'|o) f_k(s'_{j-1}, s'_j, o_j)
The first term is easy to compute. The second is similar to the simple log-linear model, but we cannot enumerate s' because it is now a sequence, and we must sum over positions j.
Forward/Backward Recursions
\alpha(m, q) is the sum of partial path scores ending at position m with label q (inclusive of observation m):
\alpha(m, q) = \sum_{s_{1..m} : s_m = q} \exp\left( \sum_{j=1..m} \sum_i \lambda_i f_i(s_{j-1}, s_j, o_j) \right)
\alpha(m, q) = \sum_{q'} \alpha(m-1, q') \exp\left( \sum_i \lambda_i f_i(q', q, o_m) \right)
\alpha(0, \cdot) = 1, \quad Z = \sum_q \alpha(N, q)
\beta(m, q) is the sum of partial path scores starting at position m with label q (exclusive of observation m):
\beta(m, q) = \sum_{s_{m..N} : s_m = q} \exp\left( \sum_{j=m+1..N} \sum_i \lambda_i f_i(s_{j-1}, s_j, o_j) \right)
\beta(m, q) = \sum_{q'} \beta(m+1, q') \exp\left( \sum_i \lambda_i f_i(q, q', o_{m+1}) \right)
Gradient Computation
\frac{d}{d\lambda_k} \log P(s|o) = \sum_j f_k(s_{j-1}, s_j, o_j) - \frac{1}{Z} \sum_{s'} \left( \sum_j f_k(s'_{j-1}, s'_j, o_j) \right) \exp\left( \sum_j \sum_i \lambda_i f_i(s'_{j-1}, s'_j, o_j) \right)
= \sum_j f_k(s_{j-1}, s_j, o_j) - \frac{1}{Z} \sum_j \sum_{q, q'} \alpha(j-1, q) \exp\left( \sum_i \lambda_i f_i(q, q', o_j) \right) f_k(q, q', o_j) \, \beta(j, q')
1) Compute the alphas
2) Compute the betas
3) Compute the gradient
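These three steps can be sketched as follows for a chain CRF; the code computes observed minus expected feature counts using the alpha and beta quantities above (no log-space tricks, and all names are illustrative):

```python
import math

def chain_crf_gradient(obs, s, states, features, weights, start_state=None):
    """Gradient of log P(s|o) for a linear-chain CRF via forward-backward:
    observed feature counts minus expected counts under the model."""
    n = len(obs)
    def local(prev, cur, o):
        return math.exp(sum(w * f(prev, cur, o) for w, f in zip(weights, features)))

    # Forward pass (alpha) and backward pass (beta).
    alpha = [dict() for _ in range(n)]
    alpha[0] = {q: local(start_state, q, obs[0]) for q in states}
    for j in range(1, n):
        alpha[j] = {q: sum(alpha[j - 1][qp] * local(qp, q, obs[j]) for qp in states)
                    for q in states}
    beta = [dict() for _ in range(n)]
    beta[n - 1] = {q: 1.0 for q in states}
    for j in range(n - 2, -1, -1):
        beta[j] = {q: sum(local(q, qn, obs[j + 1]) * beta[j + 1][qn] for qn in states)
                   for q in states}
    z = sum(alpha[n - 1].values())

    grad = []
    for f_k in features:
        observed = sum(f_k(s[j - 1] if j > 0 else start_state, s[j], obs[j])
                       for j in range(n))
        expected = 0.0
        for j in range(n):
            for q in states:
                prevs = [start_state] if j == 0 else states
                for qp in prevs:
                    a = 1.0 if j == 0 else alpha[j - 1][qp]
                    expected += a * local(qp, q, obs[j]) * f_k(qp, q, obs[j]) * beta[j][q] / z
        grad.append(observed - expected)
    return grad
```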
Segmental Versions
More complex; see Sarawagi & Cohen, 2005 and Zweig & Nguyen, 2009.
The same basic process holds:
Compute alphas on the forward recursion
Compute betas on the backward recursion
Combine them to compute the gradient
Once We Have the Gradient
Any gradient descent technique is possible:
1) Find a direction to move the parameters, using some combination of information from the first and second derivative values
2) Decide how far to move in that direction: fixed or adaptive step size, or line search
3) Update the parameter values and repeat
Conventional Wisdom
Limited-memory BFGS often works well:
Liu & Nocedal, Mathematical Programming (45), 1989
Sha & Pereira, HLT-NAACL 2003
Malouf, CoNLL 2002
For HCRFs, stochastic gradient descent and Rprop are as good or better:
Gunawardana et al., Interspeech 2005
Mahajan, Gunawardana & Acero, ICASSP 2006
Rprop is exceptionally simple.
Rprop Algorithm
Martin Riedmiller, "Rprop - Description and Implementation Details," Technical Report, January 1994, University of Karlsruhe.
Basic idea:
Maintain a step size for each parameter; this identifies the "scale" of the parameter.
See whether the gradient says to increase or decrease the parameter; forget about the exact value of the gradient.
If you move in the same direction twice, take a bigger step!
If you flip-flop, take a smaller step!
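A sketch of one Rprop update is below; the increase/decrease factors 1.2 and 0.5 are the commonly cited defaults, and this version omits the weight-backtracking variant described in Riedmiller's report:

```python
def rprop_step(params, grads, prev_grads, step_sizes,
               inc=1.2, dec=0.5, step_min=1e-6, step_max=1.0):
    """One Rprop update (maximizing an objective): uses only the sign of the
    gradient and a per-parameter step size."""
    for k in range(len(params)):
        sign_change = grads[k] * prev_grads[k]
        if sign_change > 0:          # same direction twice: take a bigger step
            step_sizes[k] = min(step_sizes[k] * inc, step_max)
        elif sign_change < 0:        # flip-flop: take a smaller step
            step_sizes[k] = max(step_sizes[k] * dec, step_min)
        # Move each parameter by its own step size, in the gradient's direction.
        if grads[k] > 0:
            params[k] += step_sizes[k]
        elif grads[k] < 0:
            params[k] -= step_sizes[k]
        prev_grads[k] = grads[k]
    return params, step_sizes, prev_grads
```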
Regularization
In machine learning, we often want to simplify models.
The objective function can be changed to add a penalty term for complexity, typically an L1 or L2 norm of the weight (lambda) vector.
L1 leads to sparser models than L2.
For speech processing, some studies have found regularization
necessary: L1-regularized ACRFs by Hifny & Renals, Speech Communication 2009;
unnecessary if using weight averaging across time: Morris & Fosler-Lussier, ICASSP 2007.
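For concreteness, a short sketch of adding an L2 penalty to the (maximized) training objective and its gradient; the regularization strength c is an arbitrary illustrative constant:

```python
def l2_regularized_objective(log_likelihood, grad, weights, c=1.0):
    """Adds an L2 penalty to the training objective (to be maximized):
    objective = log-likelihood - (c/2) * ||lambda||^2."""
    penalty = 0.5 * c * sum(w * w for w in weights)
    penalized_grad = [g - c * w for g, w in zip(grad, weights)]
    return log_likelihood - penalty, penalized_grad
```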
CRF Speech Recognition with Phonetic Features
Acknowledgements to Jeremy Morris
Top-down vs. bottom-up processing
State-of-the-art ASR takes a top-down approach to this problem:
Extract acoustic features from the signal
Model a process that generates these features
Use these models to find the word sequence that best fits the features
"speech" -> / s p iy ch /
Bottom-up: detector combination
A bottom-up approach using CRFs (Morris & Fosler-Lussier, 2006-2010):
Look for evidence of speech in the signal: phones, phonological features (voicing? burst? frication?)
Combine this evidence together in a log-linear model to find the most probable sequence of words in the signal
evidence detection -> evidence combination via CRFs -> / s p iy ch / -> "speech"
Phone Recognition
What evidence do we have to combine?
An MLP ANN trained to estimate frame-level posteriors for phonological features: P(voicing|X), P(burst|X), P(frication|X), ...
An MLP ANN trained to estimate frame-level posteriors for phone classes: P(/ah/|X), P(/t/|X), P(/n/|X), ...
Phone Recognition
Use these MLP outputs to build state feature functions:
s_{/t/, P(/t/|x)}(y, x) = MLP_{P(/t/|x)}(x) if y = /t/, and 0 otherwise
Phone Recognition
Use these MLP outputs to build state feature functions:
s_{/t/, P(/t/|x)}(y, x) = MLP_{P(/t/|x)}(x) if y = /t/, and 0 otherwise
s_{/t/, P(/d/|x)}(y, x) = MLP_{P(/d/|x)}(x) if y = /t/, and 0 otherwise
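A sketch of how such state feature functions could be constructed from MLP posteriors; the frame representation as a dict of posteriors and all names are assumptions of this illustration, not the cited systems:

```python
def make_mlp_state_features(phone_labels, posterior_phones):
    """Builds one state feature function per (state label, MLP posterior) pair:
    s_{y0, p}(y, x) = posterior p(x) if y == y0, else 0."""
    features = []
    for y0 in phone_labels:
        for p in posterior_phones:
            def feat(y, x, y0=y0, p=p):
                # x is assumed to be a dict of MLP posteriors for this frame,
                # e.g. {"/t/": 0.6, "/d/": 0.3, ...}
                return x[p] if y == y0 else 0.0
            features.append(feat)
    return features

phones = ["/t/", "/d/", "/ah/"]
feats = make_mlp_state_features(phones, phones)
frame = {"/t/": 0.6, "/d/": 0.3, "/ah/": 0.1}   # illustrative MLP outputs
print([f("/t/", frame) for f in feats[:3]])      # features active for state /t/
```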