A Corpus For Large-Scale Phonetic Typology Elizabeth Salesky - PowerPoint PPT Presentation

A Corpus For Large-Scale Phonetic Typology
Elizabeth Salesky, Eleanor Chodroff, Tiago Pimentel, Matthew Wiesner, Ryan Cotterell, Jason Eisner, Alan W Black
VoxClamantis in deserto: “a voice crying out in the wilderness”

In the beginning, there was SPEECH
‘ipeuhcan’ (Nahuatl), ‘am Anfang’ (German), ‘በመጀመሪያ’ (Amharic), ‘in the beginning’ (English)
Tower of Babel

Then the linguist asked: How do speech and language vary?
↳ prior cross-linguistic phonetic studies have relied on reported [language-aggregate] measurements
We create our new corpus, VoxClamantis v1.0, to answer this question!
✔ spoken readings of the Bible
✔ >600 languages
✔ time-aligned phonemic transcriptions
✔ phonetic measures for vowel and sibilant tokens

This talk
① WHY we want this data
② HOW we create it
③ CASE STUDIES validating the corpus & illustrating two possible uses

Why?

Motivation: Variation in and across languages
Spanish ⑤: /i/ /u/ /o/ /e/ /a/
Romanian ⑦: /i/ /u/ /o/ /e/ /a/ /ɨ/ /ə/
[Figure: scattered /s/ tokens for each language]
We know there is phonetic variation within a language, but what are its range and limits?
How does the number and set of phonemic categories influence their realizations?

How?

Resources Needed
① speech
② transcripts
③ phonemic labels
Grapheme-to-Phoneme (G2P), Amharic: በመጀመሪያ → b ə m ə dʒ ə m ə r i j a

Resources Needed
① speech
② transcripts
③ phonemic labels
④ time alignments
⑤ phonetic measures
Forced alignment (HMM acoustic model), Amharic: በመጀመሪያ → b ə m ə dʒ ə m ə r i j a
Phonetic measures (R or Praat): formant frequencies, mid-frequency peak, duration…

Extraction Process
① speech
② transcripts
CMU Wilderness (2019): 699 Bible readings! with ① speech! and ② transcripts!
Example: ‘በመጀመሪያ’ (Amharic)
>1TB 😲
>6 years of CPU compute 😲

Extraction Process
① speech
② transcripts
CMU Wilderness dataset
Chapter: ~30min
1 የፍጥረት አጀማመር በመጀመሪያ እግዚአብሔር ( ኤሎሂም ) ሰማያትንና ምድርን ፈጠረ። 2 ምድርም ቅርጽ የለሽና ባዶ ነበረች። ※ የምድርን ጥልቅ ስፍራ ሁሉ ጨለማ ውጦት ነበር። የእግዚአብሔርም ( ኤሎሂም ) መንፈስ በውሆች ላይ ይረብብ ነበር። 3 ከዚያም እግዚአብሔር ( ኤሎሂም ) “ ብርሃን ይሁን ” አለ፤ ብርሃንም ሆነ። 4 እግዚአብሔርም ( ኤሎሂም ) ብርሃኑ መልካም እንደሆነ አየ፤ ብርሃኑን ከጨለማ ለየ። 5 እግዚአብሔርም ( ኤሎሂም ) ብርሃኑን “ ቀን ” ፣ ጨለማውን “ ሌሊት ” ብሎ ጠራው። መሸ፤ ነጋም፤ የመጀመሪያ ቀን። 6 እግዚአብሔር ( ኤሎሂም ) ፣ “ ውሃን ከውሃ የሚለይ ጠፈር በውሆች መካከል ይሁን ” አለ። 7 ስለዚህ እግዚአብሔር ( ኤሎሂም ) ጠፈርን አድርጎ ከጠፈሩ በላይና ከጠፈሩ በታች ያለውን ውሃ ለየ፤ እንዳለውም ሆነ። 8 እግዚአብሔር ( ኤሎሂም ) ጠፈርን “ ሰማይ ” ብሎ ጠራው። መሸ፤ ነጋም፤ ሁለተኛ ቀን። 9 ከዚያም እግዚአብሔር ( ኤሎሂም ) ፣ “ ከሰማይ በታች ያለው ውሃ በአንድ . …
Utterance: < 30s 😲
በመጀመሪያ

Extraction Process
① speech
② transcripts
③ phonemic labels
Which phonemes are present?
G2P: text → phonemes
text: read, read
phonemes: /ɹɛt/, /ɹɛd/
vowel: /ɛ/, /i/

Extraction Process
① speech
② transcripts
③ phonemic labels
Phoneme “Transcriptions”: Grapheme-to-Phoneme
① Linguist-created rules (Epitran): 39 of 690 readings
② Wisdom of Crowds (Wiktionary/WikiPron) + our own WFST models (Phonetisaurus 🦖): 18 of 690 readings (disjoint from ①)
③ Naïve baseline (Unitran), “first-pass transcription”: all 690 readings 😲
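To make the "naïve baseline" idea concrete, here is a minimal sketch of a context-free grapheme-to-phoneme lookup. The table `TOY_G2P` and the function `naive_g2p` are hypothetical names for illustration only; this is not the Unitran, Epitran, or Phonetisaurus implementation. The five entries are chosen so that the Amharic example from the slides maps to the phoneme string shown there.

```python
# Toy grapheme-to-phoneme table (hypothetical; NOT the actual Unitran tables).
# Each Ge'ez-script character stands for a consonant + vowel syllable.
TOY_G2P = {
    "በ": ["b", "ə"],
    "መ": ["m", "ə"],
    "ጀ": ["dʒ", "ə"],
    "ሪ": ["r", "i"],
    "ያ": ["j", "a"],
}


def naive_g2p(word):
    """Map each character to phonemes independently, ignoring all context."""
    phonemes = []
    for ch in word:
        # Characters missing from the toy table are skipped in this sketch.
        phonemes.extend(TOY_G2P.get(ch, []))
    return phonemes


print(" ".join(naive_g2p("በመጀመሪያ")))
```

A lookup like this needs no language-specific resources, which is why it can cover every reading, but it cannot resolve context-dependent pronunciations.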

G2P Summary
“High-resource (HR)”: 57 readings (39 + 18)
“First-pass (FP)”: ALL 690 readings
🤕 Why provide FP alignments for languages with HR? We’ll come back to that 😊

Extraction Process
① speech
② transcripts
③ phonemic labels
Forced alignment (HMM acoustic model), Amharic: b ə m ə dʒ ə m ə r i j a

Extraction Process
① speech
② transcripts
③ phonemic labels
④ time alignments
Forced alignment (HMM acoustic model), Amharic: b ə m ə dʒ ə m ə r i j a
Each phoneme (e.g. b) gets a start time and an end time


Extraction Process
① speech
② transcripts
③ phonemic labels
④ time alignments
Amharic phoneme tokens: b, ə, m, … each with a start time and an end time
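One way to picture the output of this step is as a list of labeled intervals. The sketch below uses a hypothetical `PhoneToken` record and made-up times; the corpus's actual file format and field names are not shown on the slides.

```python
from dataclasses import dataclass


@dataclass
class PhoneToken:
    label: str    # phoneme label produced by G2P
    start: float  # start time in seconds
    end: float    # end time in seconds

    @property
    def duration(self):
        return self.end - self.start


def tokens_with_label(tokens, label):
    """Select all aligned tokens of one phoneme."""
    return [t for t in tokens if t.label == label]


# Hypothetical alignment for the first phones of the Amharic example.
tokens = [
    PhoneToken("b", 0.00, 0.06),
    PhoneToken("ə", 0.06, 0.11),
    PhoneToken("m", 0.11, 0.18),
    PhoneToken("ə", 0.18, 0.24),
]
schwas = tokens_with_label(tokens, "ə")
mean_dur = sum(t.duration for t in schwas) / len(schwas)
print(f"{len(schwas)} tokens of ə, mean duration {mean_dur:.3f} s")
```

Keeping every token rather than a per-language average is what makes token-level analyses possible later.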

Extraction Process: Phonetic Measures
① speech
② transcripts
③ phonemic labels
④ time alignments
⑤ phonetic measures
VOWELS (e.g. a, o): formants F1, F2, F3, F4
SIBILANTS (s, z): spectral peak (e.g. high-amplitude frequencies), COG, duration, ...
PRAAT TEXTGRID
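As a rough illustration of two sibilant measures named above, the sketch below computes a spectral center of gravity (COG) as a power-weighted mean frequency and picks the highest-power bin inside a frequency band. The function names, the toy spectrum, and the 3000 to 7000 Hz band are assumptions for illustration; the slides do not specify the extraction settings, and this is not the Praat implementation.

```python
def spectral_cog(freqs_hz, power):
    """Power-weighted mean frequency (center of gravity) of a spectrum."""
    total = sum(power)
    if total == 0:
        raise ValueError("spectrum has no energy")
    return sum(f * p for f, p in zip(freqs_hz, power)) / total


def peak_in_band(freqs_hz, power, lo_hz, hi_hz):
    """Frequency of the highest-power bin within [lo_hz, hi_hz]."""
    in_band = [(p, f) for f, p in zip(freqs_hz, power) if lo_hz <= f <= hi_hz]
    if not in_band:
        raise ValueError("no spectrum bins fall inside the band")
    return max(in_band)[1]


# Hypothetical power spectrum of a single sibilant token.
freqs = [1000, 2000, 3000, 4000, 5000, 6000, 7000, 8000]
power = [0.1, 0.2, 0.5, 1.0, 3.0, 2.0, 0.8, 0.4]

cog = spectral_cog(freqs, power)
peak = peak_in_band(freqs, power, 3000, 7000)  # band limits are an assumption
print(cog, peak)
```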

Evaluation
🤕 Why provide both Unitran and High-Resource alignments?
Use multiple sets of alignments to assess Unitran alignment quality
‣ How much does quality vary across languages?
‣ Are certain phonemes more accurate than others?
‣ What about time alignment accuracy?
See paper! (+ appendices)
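One simple way to compare two label sets for the same utterance is an edit-distance-based error rate. The sketch below is an assumption about how such a comparison could look, not the evaluation reported in the paper; the "first-pass" sequence is invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two phoneme sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def phoneme_error_rate(ref, hyp):
    """Edit distance normalized by the reference length."""
    if not ref:
        raise ValueError("reference sequence is empty")
    return edit_distance(ref, hyp) / len(ref)


# Reference labels from the slides; the first-pass labels are hypothetical.
high_resource = "b ə m ə dʒ ə m ə r i j a".split()
first_pass = "b e m e dʒ e m e r i j a".split()
print(phoneme_error_rate(high_resource, first_pass))
```

Breaking the errors down by reference phoneme would address the second question on the slide; that step is left out of this sketch.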

Corpus Summary
VoxClamantis v1.0 provides tokens of phoneme-level measurements in hundreds of languages!
‣ 690 recorded readings of the Bible
‣ 635 languages (ISO 639-3)
‣ 70 language families
‣ >400 million aligned phoneme-level segments
‣ Subsequent phonetic measures for all vowels and sibilants

Case Studies

Case studies with VoxClamantis v1.0
48 High-Resource Readings
Vowels: ~50 phonemes
Sibilants: /s/ /z/
① Reproduction of previous results suggests valid resource
② Research at scale suggests cross-linguistic principles

Phonetic Uniformity
Are shared characteristics (e.g. vowel height, POA) realized uniformly (e.g. measures strongly correlated) within languages?
Vowels: formants. /i/, /u/: high vowels
Sibilants: mid-frequency peak. /s/, /z/: alveolar place of articulation
While variation exists across languages, within language F1 strongly correlated
Supports hypothesis that this may be a universal principle
Reproduce previous results, but with many more languages

Phonetic Dispersion
Is inventory size correlated with articulatory precision?
[Figure: vowel inventories, Marshallese (4 vowels) vs. English (20 vowels)]


Phonetic Dispersion
Is inventory size correlated with articulatory precision?
No (Spearman ρ = 0.11, p = 0.44; Pearson r = 0.11, p = 0.46)
[Figure: vowel inventories, Marshallese (4 vowels) vs. English (20 vowels)]
Supports hypothesis that this may [not] be a universal principle
Previously shown, but not possible to study at scale
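For readers unfamiliar with the two statistics reported here, the following sketch shows how Pearson's r and Spearman's ρ (Pearson's r on ranks) can be computed. The data are invented and the variable names are hypothetical; p-values are not computed, and this does not reproduce the reported figures.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # average rank of the tied run i..j
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical per-language values: vowel inventory size vs. a precision score.
inventory_size = [4, 5, 5, 7, 12, 20]
precision = [0.61, 0.48, 0.70, 0.52, 0.66, 0.55]
print(spearman(inventory_size, precision), pearson(inventory_size, precision))
```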

CAUTION
‣ Utterance alignment: filter; in future, realign!
‣ Automatic phoneme labels: better G(+A)2P, alignment assessment! Curate more resources!
‣ Corpus representation (e.g. speakers): curate more resources! 😲

Summary

Conclusion
VoxClamantis v1.0 corpus: voxclamantisproject.github.io
aligned phoneme-level segments in hundreds of languages
57 high-resource, 690 first-pass
😲 methodology is not perfect: version 1.0!
⬇ download
🥴 use for research
⬆ contribute to v2.0!

Contact Us!
Questions! Comments! Contributions!
voxclamantisproject.github.io
voxclamantisproject@gmail.com
