WFSTs in ASR & Basics of Speech Production
Lecture 6, CS 753
Instructor: Preethi Jyothi

Determinization/Minimization: Recap
• A (W)FST is deterministic if:
    • It has a unique start state
    • No two transitions from a state share the same input label
    • There are no epsilon input labels
• Minimization finds an equivalent deterministic FST with the least number of states (and transitions)
• For a deterministic weighted automaton, weight pushing + (unweighted) automata minimization leads to a minimal weighted automaton
• Determinization and minimization are guaranteed to yield a deterministic/minimized WFSA only under some technical conditions characterising the automata (e.g. twins property) and the weight semiring (allowing for weight pushing)
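The three determinism conditions above can be checked mechanically. The following is a minimal sketch, not part of the original slides: the arc representation `(src, dst, input_label, output_label)` and the function name `is_deterministic` are assumptions made for illustration.

```python
EPS = "<eps>"  # assumed spelling of the epsilon label


def is_deterministic(start_states, arcs):
    """Check the three conditions from the recap slide.

    start_states: iterable of start states (exactly one is required).
    arcs: iterable of (src, dst, input_label, output_label) tuples.
    """
    if len(set(start_states)) != 1:
        return False  # condition 1: unique start state
    seen = set()
    for src, _dst, ilabel, _olabel in arcs:
        if ilabel == EPS:
            return False  # condition 3: no epsilon input labels
        if (src, ilabel) in seen:
            return False  # condition 2: input labels unique per state
        seen.add((src, ilabel))
    return True


# Two arcs leaving state 0 on the same input "b": not deterministic.
nondet = [(0, 1, "b", "bad"), (0, 4, "b", "bead")]
# The output is delayed until the input disambiguates the word.
det = [(0, 1, "b", EPS), (1, 2, "a", "bad"), (1, 3, "e", "bead")]

print(is_deterministic([0], nondet))
print(is_deterministic([0], det))
```

Note that only input labels matter for determinism; epsilon output labels are allowed.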

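Example: Dictionary WFST
[Figure: a 21-state transducer (states 0-20) that spells the words bad, cab, bead, cede and decade letter by letter. Each word has its own path from state 0; the first arc of a path outputs the word (b:bad, c:cab, b:bead, c:cede, d:decade) and the remaining letter arcs output eps.]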

Determinized Dictionary WFST
[Figure: the determinized transducer (states 0-18). Words with a common prefix share arcs (b:eps, c:eps), and each word is output on the first arc that identifies it uniquely (a:bad, e:bead, a:cab, e:cede, d:decade); all other arcs output eps.]
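For an acyclic letter-to-word dictionary like this one, the determinized machine has the shape of a prefix tree with delayed outputs. The sketch below builds that tree directly instead of running general determinization, so it is only an illustration of the result, not the algorithm itself. It assumes the words are distinct and that no word is a prefix of another; the function name and arc layout are hypothetical.

```python
EPS = "<eps>"  # assumed spelling of the epsilon label


def build_prefix_tree_lexicon(words):
    """Return (arcs, num_states) of a deterministic letter-to-word transducer.

    arcs are (src, dst, letter, output) tuples; state 0 is the start state.
    Each word is output on the first arc that only this word passes through.
    """
    words = sorted(set(words))
    # Number of words passing through each non-empty prefix.
    prefix_count = {}
    for w in words:
        for i in range(1, len(w) + 1):
            prefix_count[w[:i]] = prefix_count.get(w[:i], 0) + 1

    state_of = {"": 0}  # prefix -> state id
    arcs = []
    for w in words:
        emitted = False
        for i, letter in enumerate(w):
            prefix, extended = w[:i], w[:i + 1]
            if extended in state_of:
                continue  # arc already created by an earlier word
            state_of[extended] = len(state_of)
            output = EPS
            if not emitted and prefix_count[extended] == 1:
                output, emitted = w, True
            arcs.append((state_of[prefix], state_of[extended], letter, output))
    return arcs, len(state_of)


arcs, num_states = build_prefix_tree_lexicon(
    ["bad", "cab", "bead", "cede", "decade"])
print(num_states)  # 19 states, matching states 0-18 in the figure
for arc in arcs:
    print(arc)
```

The minimized version on the next slide additionally merges equivalent suffix states, which this sketch does not attempt.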

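Minimized Dictionary WFST
[Figure: the minimized transducer (states 0-11). It keeps the arcs that output words (a:bad, e:bead, a:cab, e:cede, d:decade) and merges states with equivalent suffixes, so the eps-output letter arcs are shared across words.]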

WFST-based ASR System
Acoustic Indices → [Acoustic Models] → Triphones → [Context Transducer] → Monophones → [Pronunciation Model] → Words → [Language Model] → Word Sequence

WFST-based ASR System: H (Acoustic Models)
[Figure: one 3-state HMM for each triphone (a/a_b, b/a_b, ..., x/y_z), drawn as an FST whose arcs read acoustic-state indices f_k. One arc emits the triphone label (e.g. f0:a+a+b) and the remaining arcs (f1:ε, f2:ε, ...) emit ε. FST union + closure over the per-triphone FSTs gives the resulting FST H.]

WFST-based ASR System: C (Context Transducer)
[Figure: the transducer C⁻¹ over the two phones x and y. States are context pairs (ε,*), (x,ε), (x,x), (x,y), (y,x), (y,y), (y,ε); arcs include x:x/ε_ε, x:x/ε_x, x:x/ε_y, x:x/x_x, x:x/x_y, x:x/y_x, x:x/y_y, x:x/x_ε, x:x/y_ε and the corresponding y:y/... arcs.]
C⁻¹ arc labels: "monophone : phone / left-context_right-context"
Figure reproduced from "Weighted Finite State Transducers in Speech Recognition", Mohri et al., 2002

WFST-based ASR System: L (Pronunciation Model)
[Figure: a pronunciation transducer for the words "data" and "dew". Arcs for "data": d:data/1, then ey:ε/0.5 or ae:ε/0.5, then t:ε/0.3 or dx:ε/0.7, then ax:ε/1 (states 0 to 4). Arcs for "dew": d:dew/1, then uw:ε/1 (states 0, 5, 6).]
Figure reproduced from "Weighted Finite State Transducers in Speech Recognition", Mohri et al., 2002
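To make the role of L concrete, the sketch below enumerates every pronunciation it encodes along with its weight. Several things here are assumptions rather than facts from the slide: the exact state numbering of the arcs, the choice of states 4 and 6 as final, and the reading of the weights as probabilities that multiply along a path.

```python
EPS = "<eps>"

# (src, dst, phone, word, weight); topology assumed from the figure residue.
L_ARCS = [
    (0, 1, "d", "data", 1.0),
    (1, 2, "ey", EPS, 0.5),
    (1, 2, "ae", EPS, 0.5),
    (2, 3, "t", EPS, 0.3),
    (2, 3, "dx", EPS, 0.7),
    (3, 4, "ax", EPS, 1.0),
    (0, 5, "d", "dew", 1.0),
    (5, 6, "uw", EPS, 1.0),
]
L_FINALS = {4, 6}  # assumed final states


def pronunciations(arcs, start, finals):
    """List (word, phone string, probability) for every accepting path.

    Only valid for acyclic transducers; a cycle would recurse forever.
    """
    results = []

    def walk(state, phones, words, prob):
        if state in finals:
            results.append((" ".join(words), " ".join(phones), prob))
        for src, dst, phone, word, weight in arcs:
            if src == state:
                new_words = words + ([word] if word != EPS else [])
                walk(dst, phones + [phone], new_words, prob * weight)

    walk(start, [], [], 1.0)
    return results


for word, phones, prob in pronunciations(L_ARCS, 0, L_FINALS):
    print(f"{word:5s} {phones:12s} {prob:.2f}")
```

Under these assumptions "data" has four pronunciations whose probabilities sum to 1, and "dew" has one.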

WFST-based ASR System: G (Language Model)
[Figure: a weighted word-level acceptor starting at state 0, with arcs labeled the, birds/0.404, animals/1.789, boy/1.789, are/0.693, were/0.693, is, walking.]

Constructing the Decoding Graph
Decoding graph: D = H ∘ C ∘ L ∘ G
• Construct the decoding search graph using H ∘ C ∘ L ∘ G, which maps acoustic states to word sequences
• Carefully construct D using optimization algorithms: D = min(det(H ∘ det(C ∘ det(L ∘ G))))
• Decode a test utterance O by aligning the acceptor X (corresponding to O) with H ∘ C ∘ L ∘ G:
    $W^* = \arg\min_{W = \mathrm{out}[\pi]} X \circ H \circ C \circ L \circ G$
  where π is a path in the composed FST and out[π] is the output label sequence of π
"Weighted Finite State Transducers in Speech Recognition", Mohri et al., Computer Speech & Language, 2002
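The decoding rule can be illustrated on a toy scale: compose an observation acceptor with a graph, then take the lowest-cost path and read off its output labels. The sketch below is a simplified illustration under stated assumptions: weights are costs in the tropical semiring (added along a path, minimized across paths), composition ignores epsilon matching on the shared labels, and `toy_X` and `toy_D` are invented stand-ins for X and H ∘ C ∘ L ∘ G. Real systems do not build the composition eagerly like this.

```python
import heapq

EPS = "<eps>"


def compose(a, b):
    """Compose two FSTs, matching a's output labels with b's input labels.

    Each FST is a dict: 'start', 'finals' (state -> cost) and
    'arcs' (list of (src, dst, ilabel, olabel, cost)).
    """
    start = (a["start"], b["start"])
    arcs, finals, seen, stack = [], {}, {start}, [start]
    while stack:
        qa, qb = stack.pop()
        if qa in a["finals"] and qb in b["finals"]:
            finals[(qa, qb)] = a["finals"][qa] + b["finals"][qb]
        for sa, da, ia, oa, wa in a["arcs"]:
            if sa != qa:
                continue
            for sb, db, ib, ob, wb in b["arcs"]:
                if sb == qb and ib == oa:
                    nxt = (da, db)
                    arcs.append(((qa, qb), nxt, ia, ob, wa + wb))
                    if nxt not in seen:
                        seen.add(nxt)
                        stack.append(nxt)
    return {"start": start, "finals": finals, "arcs": arcs}


def best_path(fst):
    """Dijkstra search; returns (cost, output labels) or None.

    Assumes all costs are non-negative.
    """
    heap = [(0.0, 0, fst["start"], ())]
    tie = 1  # unique tie-breaker so states are never compared
    done = set()
    while heap:
        cost, _, q, outs = heapq.heappop(heap)
        if q is None:  # pseudo-state reached after paying a final cost
            return cost, list(outs)
        if q in done:
            continue
        done.add(q)
        if q in fst["finals"]:
            heapq.heappush(heap, (cost + fst["finals"][q], tie, None, outs))
            tie += 1
        for s, d, _i, o, w in fst["arcs"]:
            if s == q and d not in done:
                new_outs = outs if o == EPS else outs + (o,)
                heapq.heappush(heap, (cost + w, tie, d, new_outs))
                tie += 1
    return None


# Two frames, two acoustic states f0/f1, with made-up costs.
toy_X = {"start": 0, "finals": {2: 0.0}, "arcs": [
    (0, 1, "f0", "f0", 1.0), (0, 1, "f1", "f1", 2.0),
    (1, 2, "f0", "f0", 3.0), (1, 2, "f1", "f1", 0.5)]}
# "yes" is realised as f0 f1 and "no" as f1 f0 (hypothetical).
toy_D = {"start": 0, "finals": {2: 0.0}, "arcs": [
    (0, 1, "f0", "yes", 0.2), (1, 2, "f1", EPS, 0.0),
    (0, 3, "f1", "no", 0.1), (3, 2, "f0", EPS, 0.0)]}

cost, words = best_path(compose(toy_X, toy_D))
print(words, round(cost, 3))
```

Here the path for "yes" costs about 1.7 and the path for "no" about 5.1, so the search returns "yes".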

Constructing the Decoding Graph
Structure of X (derived from O): a chain with one link per frame; each link has one weighted arc per acoustic-state index f_k.

    arc label   link 1    link 2    link 3    link 4    ...
    f0          19.12     18.52     10.578    9.21
    f1          12.33     13.45     5.645     14.221
    ...
    f500        20.21     10.21     8.123     11.233
    ...
    f1000       11.11     15.99     5.678     15.638

Constructing the Decoding Graph
• Each f_k maps to a distinct triphone HMM state j
• Weights of arcs in the i-th chain link correspond to observation probabilities b_j(o_i)
• X is a very large FST which is never explicitly constructed!
• H ∘ C ∘ L ∘ G is typically traversed dynamically (search algorithms will be covered later in the semester)
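One way to picture X without materialising it is to generate its arcs one chain link at a time. The sketch below does this with a Python generator. It assumes the arc weight is the negative log of the observation probability b_j(o_i), which is a common convention but is not stated on the slide; the function name and input layout are hypothetical.

```python
import math


def acceptor_arcs(obs_probs):
    """Lazily yield arcs (src, dst, label, cost) of the chain acceptor X.

    obs_probs[i][j] is taken to be b_j(o_i), the probability of frame i
    under HMM state j. Link i connects chain state i to chain state i + 1.
    """
    for i, frame in enumerate(obs_probs):
        for j, prob in enumerate(frame):
            # Assumed weight convention: cost = -log probability.
            yield (i, i + 1, f"f{j}", -math.log(prob))


# Two frames and two HMM states, with made-up probabilities.
toy_probs = [[0.5, 0.25],
             [1.0, 0.5]]
for arc in acceptor_arcs(toy_probs):
    print(arc)
```

Because the arcs are produced on demand, a decoder can consume one link per frame and discard it, which is in the spirit of never constructing X explicitly.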

Impact of WFST Optimizations
40K NAB Evaluation Set '95 (83% word accuracy)

    network                   states        transitions
    G                         1,339,664     3,926,010
    L ∘ G                     8,606,729     11,406,721
    det(L ∘ G)                7,082,404     9,836,629
    C ∘ det(L ∘ G)            7,273,035     10,201,269
    det(H ∘ C ∘ L ∘ G)        18,317,359    21,237,992

    network                   x real-time
    C ∘ L ∘ G                 12.5
    C ∘ det(L ∘ G)            1.2
    det(H ∘ C ∘ L ∘ G)        1.0
    push(min(F))              0.7

Tables from http://www.openfst.org/twiki/pub/FST/FstHltTutorial/tutorial_part3.pdf

Toolkits to work with finite-state machines
• AT&T FSM Library (no longer supported)
  http://www3.cs.stonybrook.edu/~algorith/implement/fsm/implement.shtml
• RWTH FSA Toolkit
  https://www-i6.informatik.rwth-aachen.de/~kanthak/fsa.html
• Carmel
  https://www.isi.edu/licensed-sw/carmel/
• MIT FST Toolkit
  http://people.csail.mit.edu/ilh/fst/
• OpenFST Toolkit (actively supported)
  http://www.openfst.org/twiki/bin/view/FST/WebHome

Brief Introduction to the OpenFST Toolkit

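Quick Intro to OpenFst (www.openfst.org)
[Figure: a 3-state FST with arcs 0 → 1 labeled an:a, 1 → 2 labeled ε:n, and 0 → 2 labeled a:a.]

A.txt (text form of the FST):
    0 1 an a
    1 2 <eps> n
    0 2 a a
    2

Input alphabet (in.txt):
    <eps> 0
    an 1
    a 2

Output alphabet (out.txt):
    <eps> 0
    a 1
    n 2

The "0" label is reserved for epsilon.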

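Quick Intro to OpenFst (www.openfst.org)
[Figure: a 3-state weighted FST with arcs 0 → 1 labeled an:a/0.5, 1 → 2 labeled ε:n/1.0, and 0 → 2 labeled a:a/0.5; the final state 2 has weight 0.1.]

A.txt with weights:
    0 1 an a 0.5
    1 2 <eps> n 1.0
    0 2 a a 0.5
    2 0.1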
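When FSTs are produced by a script, the text files shown above can be generated rather than typed by hand. The sketch below is one possible way to do that; the helper names `fst_text` and `symbols_text` are hypothetical and not part of OpenFst. It relies on the text-format convention that the source state of the first arc line is taken as the start state.

```python
def fst_text(arcs, finals):
    """Render arcs and final states in the OpenFst text format.

    arcs: list of (src, dst, ilabel, olabel, weight); the first arc must
    leave the start state, since its source becomes the initial state.
    finals: list of (state, weight).
    """
    lines = [f"{s} {d} {i} {o} {w}" for s, d, i, o, w in arcs]
    lines += [f"{state} {w}" for state, w in finals]
    return "\n".join(lines) + "\n"


def symbols_text(symbols):
    """Render a symbol table; index 0 must be the epsilon symbol."""
    return "".join(f"{sym} {idx}\n" for idx, sym in enumerate(symbols))


a_txt = fst_text(
    arcs=[(0, 1, "an", "a", 0.5),
          (1, 2, "<eps>", "n", 1.0),
          (0, 2, "a", "a", 0.5)],
    finals=[(2, 0.1)],
)
in_txt = symbols_text(["<eps>", "an", "a"])
out_txt = symbols_text(["<eps>", "a", "n"])
print(a_txt, in_txt, out_txt, sep="\n")
```

Writing `a_txt`, `in_txt` and `out_txt` to A.txt, in.txt and out.txt gives files in the same layout as the slide.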

Compiling & Printing FSTs
The text FSTs need to be "compiled" into binary objects before further use with OpenFst utilities
• Command used to compile:
    fstcompile --isymbols=in.txt --osymbols=out.txt A.txt A.fst
• Get back the text FST using a print command with the binary file:
    fstprint --isymbols=in.txt --osymbols=out.txt A.fst A.txt

Composing FSTs
The text FSTs need to be "compiled" into binary objects before further use with OpenFst utilities
• Command used to compose:
    fstcompose A.fst B.fst AB.fst
• OpenFST requirement: One or both of the input FSTs should be appropriately sorted before composition
    fstarcsort --sort_type=olabel A.fst |\
      fstcompose - B.fst AB.fst

Drawing FSTs
Small FSTs can be visualized easily using the draw tool:
    fstdraw --isymbols=in.txt --osymbols=out.txt A.fst |\
      dot -Tpdf > A.pdf
[Figure: the drawn FST, with arcs 0 → 1 labeled an:a, 1 → 2 labeled <eps>:n, and 0 → 2 labeled a:a.]

FSTs can get very large!

Basics of Speech Production

Speech Production
Schematic representation of the vocal organs
Schematic from L. Rabiner and B.-H. Juang, Fundamentals of Speech Recognition, 1993
Figure from http://www.phon.ucl.ac.uk/courses/spsci/iss/week6.php

Sound units
• Phones are acoustically distinct units of speech
• Phonemes are abstract linguistic units that impart different meanings in a given language
    • Minimal pair: pan vs. ban
• Allophones are different acoustic realisations of the same phoneme
• Phonetics is the study of speech sounds and how they're produced
• Phonology is the study of patterns of sounds in different languages

Vowels
• Sounds produced with no obstruction to the flow of air through the vocal tract
[Figure: vowel quadrilateral]
Image from https://en.wikipedia.org/wiki/File:IPA_vowel_chart_2005.png

Formants of vowels
• Formants are resonance frequencies of the vocal tract (denoted by F1, F2, etc.)
• F0 denotes the fundamental frequency of the periodic source (vibrating vocal folds)
• Formant locations specify certain vowel characteristics
Image from: https://www.phon.ucl.ac.uk/courses/spsci/iss/week5.php
