Speech recognition (briefly)
Chapter 15, Section 6


Outline

♦ Speech as probabilistic inference
♦ Speech sounds
♦ Word pronunciation
♦ Word sequences

Speech as probabilistic inference

"It's not easy to wreck a nice beach"

Speech signals are noisy, variable, ambiguous

What is the most likely word sequence, given the speech signal?
I.e., choose Words to maximize P(Words | signal)

Use Bayes' rule:
    P(Words | signal) = αP(signal | Words)P(Words)
I.e., decomposes into acoustic model + language model

Words are the hidden state sequence, signal is the observation sequence

Phones

All human speech is composed from 40–50 phones, determined by the configuration of articulators (lips, teeth, tongue, vocal cords, air flow)

Phones form an intermediate level of hidden states between words and signal
⇒ acoustic model = pronunciation model + phone model

ARPAbet designed for American English:

    [iy] beat     [b] bet     [p] pet
    [ih] bit      [ch] Chet   [r] rat
    [ey] bet      [d] debt    [s] set
    [ao] bought   [hh] hat    [th] thick
    [ow] boat     [hv] high   [dh] that
    [er] Bert     [l] let     [w] wet
    [ix] roses    [ng] sing   [en] button
    ...

E.g., "ceiling" is [s iy l ih ng] / [s iy l ix ng] / [s iy l en]

Speech sounds

Raw signal is the microphone displacement as a function of time; processed into overlapping 30ms frames, each described by features

Analog acoustic signal → sampled, quantized digital signal:
    10 15 38 22 63 24 10 12 73 ...
→ frames with features:
    52 47 82 89 94 11 ...

Frame features are typically formants (peaks in the power spectrum)

Phone models

Frame features in P(features | phone) summarized by
– an integer in [0 ... 255] (using vector quantization); or
– the parameters of a mixture of Gaussians

Three-state phones: each phone has three phases (Onset, Mid, End)
E.g., [t] has silent Onset, explosive Mid, hissing End
⇒ P(features | phone, phase)

Triphone context: each phone becomes n² distinct phones, depending on the phones to its left and right
E.g., [t] in "star" is written [t(s,aa)] (different from "tar"!)

Triphones useful for handling coarticulation effects: the articulators have inertia and cannot switch instantaneously between positions
E.g., [t] in "eighth" has tongue against front teeth
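The Bayes-rule decomposition can be sketched in a few lines of Python. Both candidate transcriptions and all probabilities below are invented for illustration; the point is only that acoustic and language scores are computed separately and multiplied, and that the normalizer α can be ignored when taking the argmax:

```python
# Sketch of choosing Words to maximize P(Words | signal) = αP(signal | Words)P(Words).
# All numbers here are invented for illustration.
candidates = {
    "recognize speech":   {"acoustic": 0.30, "language": 1e-5},
    "wreck a nice beach": {"acoustic": 0.31, "language": 1e-9},
}

# α is the same for every candidate, so the argmax of the unnormalized
# product P(signal | Words) * P(Words) is the argmax of the posterior.
posterior = {w: p["acoustic"] * p["language"] for w, p in candidates.items()}
best = max(posterior, key=posterior.get)
print(best)  # the language-model prior overrules the nearly tied acoustic scores
```

Here the acoustic model alone slightly prefers the wrong transcription; the language model breaks the tie, which is exactly why the decomposition helps.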

Phone model example

Phone HMM for [m]:

    Onset –0.7→ Mid –0.1→ End –0.6→ FINAL
    (self-loop probabilities: Onset 0.3, Mid 0.9, End 0.4)

Output probabilities for the phone HMM:
    Onset: C1: 0.5, C2: 0.2, C3: 0.3
    Mid:   C3: 0.2, C4: 0.7, C5: 0.1
    End:   C4: 0.1, C6: 0.5, C7: 0.4

Word pronunciation models

Each word is described as a distribution over phone sequences

Distribution represented as an HMM transition model, e.g. for "tomato":

    [t] → [ow] (0.2) or [ah] (0.8) → [m] → [ey] (0.5) or [aa] (0.5) → [t] → [ow]

Each sequence probability is the product of its branch probabilities:
    P([towmeytow] | "tomato") = P([towmaatow] | "tomato") = 0.2 × 0.5 = 0.1
    P([tahmeytow] | "tomato") = P([tahmaatow] | "tomato") = 0.8 × 0.5 = 0.4

Structure is created manually, transition probabilities learned from data

Isolated words

Phone models + word models fix likelihood P(e_1:t | word) for isolated word

    P(word | e_1:t) = αP(e_1:t | word)P(word)

Prior probability P(word) obtained simply by counting word frequencies

P(e_1:t | word) can be computed recursively: define
    ℓ_1:t = P(X_t, e_1:t)
and use the recursive update
    ℓ_1:t+1 = Forward(ℓ_1:t, e_t+1)
and then P(e_1:t | word) = Σ_{x_t} ℓ_1:t(x_t)

Isolated-word dictation systems with training reach 95–99% accuracy

Continuous speech

Not just a sequence of isolated-word recognition problems!
– Adjacent words highly correlated
– Sequence of most likely words ≠ most likely sequence of words
– Segmentation: there are few gaps in speech
– Cross-word coarticulation, e.g., "next thing"

Continuous speech systems manage 60–80% accuracy on a good day

Language model

Prior probability of a word sequence is given by the chain rule:
    P(w_1 ⋯ w_n) = ∏_{i=1}^n P(w_i | w_1 ⋯ w_{i−1})

Bigram model:
    P(w_i | w_1 ⋯ w_{i−1}) ≈ P(w_i | w_{i−1})

Train by counting all word pairs in a large text corpus

More sophisticated models (trigrams, grammars, etc.) help a little bit

Combined HMM

States of the combined language+word+phone model are labelled by the word we're in + the phone in that word + the phone state in that phone

Viterbi algorithm finds the most likely phone state sequence

Does segmentation by considering all possible word sequences and boundaries

Doesn't always give the most likely word sequence, because each word sequence is the sum over many state sequences

In 1969 Jelinek invented an A* search to find the most likely word sequence, where the "step cost" is −log P(w_i | w_{i−1})
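The forward recursion from the "Isolated words" slide can be run directly on the [m] phone HMM of the "Phone model example" slide. This is a minimal Python sketch: the transition and output probabilities are those given on the slide, while the dictionary representation and the observation sequence [C1, C4, C6] are illustrative choices:

```python
# Forward algorithm for the [m] phone HMM, implementing the recursive update
# l_{1:t+1} = Forward(l_{1:t}, e_{t+1}) and P(e_1:t | phone) = sum_x l_1:t(x).
states = ["Onset", "Mid", "End"]
trans = {  # P(next state | current state); End -> FINAL (0.6) ends the phone
    "Onset": {"Onset": 0.3, "Mid": 0.7},
    "Mid":   {"Mid": 0.9, "End": 0.1},
    "End":   {"End": 0.4},
}
emit = {   # output probabilities from the slide
    "Onset": {"C1": 0.5, "C2": 0.2, "C3": 0.3},
    "Mid":   {"C3": 0.2, "C4": 0.7, "C5": 0.1},
    "End":   {"C4": 0.1, "C6": 0.5, "C7": 0.4},
}

def likelihood(obs):
    # l[s] = P(X_t = s, e_1:t); the phone starts in Onset
    l = {s: (emit[s].get(obs[0], 0.0) if s == "Onset" else 0.0) for s in states}
    for e in obs[1:]:
        l = {s2: sum(l[s1] * trans[s1].get(s2, 0.0) for s1 in states)
                 * emit[s2].get(e, 0.0)
             for s2 in states}
    return sum(l.values())

print(likelihood(["C1", "C4", "C6"]))  # 0.5 * (0.7*0.7) * (0.1*0.5) ≈ 0.01225
```

Because the Onset emits only C1–C3, any sequence not starting with one of those gets likelihood 0; summing ℓ over states at the last step gives P(e_1:t | phone) exactly as in the slide's formula.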
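Training the bigram model by counting word pairs, as the "Language model" slide describes, takes only a few lines. The toy corpus below is invented, and a real system would also need smoothing for unseen pairs:

```python
from collections import Counter

# Invented toy corpus; real language models are trained on large text corpora.
corpus = "it is not easy to wreck a nice beach it is easy to recognize speech".split()

pair_counts = Counter(zip(corpus, corpus[1:]))  # counts of adjacent word pairs
context_counts = Counter(corpus[:-1])           # counts of each word as a left context

def bigram(w_prev, w):
    # Maximum-likelihood estimate: P(w_i | w_{i-1}) = count(w_{i-1} w_i) / count(w_{i-1})
    return pair_counts[(w_prev, w)] / context_counts[w_prev]

print(bigram("to", "wreck"))  # 0.5: "to" is followed by "wreck" in 1 of its 2 occurrences
```

These estimates plug directly into the chain-rule approximation P(w_1 ⋯ w_n) ≈ ∏ P(w_i | w_{i−1}).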

DBNs for speech recognition

[Figure: a dynamic Bayesian network for speech, unrolled over time. Per-slice variables: an end-of-word observation with P(OBS | index = 2) = 1 and P(OBS | index ≠ 2) = 0; a phoneme index (deterministic, fixed); a transition variable (stochastic, learned); a phoneme (deterministic, fixed); articulators such as tongue and lips (stochastic, learned); and the acoustic observation (stochastic, learned)]

Also easy to add variables for, e.g., gender, accent, speed.

Zweig and Russell (1998) show up to 40% error reduction over HMMs

Summary

Since the mid-1970s, speech recognition has been formulated as probabilistic inference

Evidence = speech signal, hidden variables = word and phone sequences

"Context" effects (coarticulation etc.) are handled by augmenting state

Variability in human speech (speed, timbre, etc.) and background noise make continuous speech recognition in real settings an open problem
