[PPT] - Discovering the Vocabulary of a Language through Cross-Lingual PowerPoint Presentation

SLIDE 1

KIT – University of the State of Baden-Württemberg and National Research Center of the Helmholtz Association
Cognitive Systems Lab
Karlsruhe Institute of Technology

www.kit.edu

Discovering the Vocabulary of a Language through Cross-Lingual Alignment

Supervisors:

Prof. Dr. rer. nat. Stephan Vogel, Prof. Dr.-Ing. Tanja Schultz, Dipl.-Inf. Tim Schlippe

Speaker: Felix Stahlberg

SLIDE 2

2 05.09.2011

SCENARIO

Discovering the Vocabulary of a Language through Cross-Lingual Alignment

SLIDE 3

Scenario

Language barriers in international relief operations
Languages and dialects without written form
Rapid language adaptation in trouble areas
Dialects

How to get training material for MT and ASR systems in such situations?

SLIDE 4

Scenario

Objective: Discover vocabulary using only spoken translations of a well-studied source language

Say “I am sick.” in your mother tongue.
/b/ /o/ /l/ /e/ /s/ /t/ /a/ /n/ /s/ /a/ /m/
Say “I am healthy.” in your mother tongue.
/z/ /d/ /r/ /a/ /v/ /s/ /a/ /m/

/b/ /o/ /l/ /e/ /s/ /t/ /a/ /n/ seems to be a word (meaning sick)
/z/ /d/ /r/ /a/ /v/ seems to be a word (meaning healthy)
/s/ /a/ /m/ seems to be a word (meaning I am)

SLIDE 5

Basic Idea

Croatian (Target Language)
  Audio: [waveform figure]
  Phoneme sequence: /b/ /o/ /l/ /e/ /s/ /t/ /a/ /n/ /s/ /a/ /m/
English (Source Language)
  Sentence: I am sick

How to find such an alignment?

SLIDE 6

How to find an alignment?

Adapt methods of Statistical Machine Translation
  Word alignments identify word pairs in bitexts that are translations of one another
  Expectation-Maximization approach (iterates with the translation model (TM))
    Choose best alignment in terms of the TM
    Adjust TM parameters regarding the alignment
  GIZA++: Freely available implementation of IBM's model hierarchy and the HMM model
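The alternation between estimating alignments and re-estimating the translation model can be illustrated with a minimal sketch of IBM Model 1 trained on word/phoneme pairs. This is not GIZA++ itself: it uses soft (expected) counts rather than a single best alignment, and the function names `ibm_model1` and `viterbi_alignment` as well as the toy corpus layout are assumptions made for illustration.

```python
from collections import defaultdict

def ibm_model1(corpus, iterations=10):
    """Estimate t[(phoneme, word)] = P(phoneme | word) with EM.

    corpus: list of (source_words, target_phonemes) pairs.
    """
    phonemes = {p for _, tgt in corpus for p in tgt}
    t = defaultdict(lambda: 1.0 / len(phonemes))  # uniform start
    for _ in range(iterations):
        count = defaultdict(float)
        total = defaultdict(float)
        # E-step: collect expected alignment counts under the current TM.
        for src, tgt in corpus:
            src = ["NULL"] + list(src)
            for p in tgt:
                norm = sum(t[(p, w)] for w in src)
                for w in src:
                    frac = t[(p, w)] / norm
                    count[(p, w)] += frac
                    total[w] += frac
        # M-step: adjust the TM parameters to the expected counts.
        for (p, w), c in count.items():
            t[(p, w)] = c / total[w]
    return t

def viterbi_alignment(src, tgt, t):
    """For each phoneme, pick the most probable source word under the TM."""
    src = ["NULL"] + list(src)
    return [max(src, key=lambda w: t[(p, w)]) for p in tgt]

# Toy corpus taken from the scenario slide.
corpus = [
    ("i am sick".split(), "b o l e s t a n s a m".split()),
    ("i am healthy".split(), "z d r a v s a m".split()),
]
t = ibm_model1(corpus, iterations=5)
alignment = viterbi_alignment(*corpus[0], t)
```

Even after one iteration the phoneme /b/ is more probable given "sick" than given "i", because "sick" only ever co-occurs with the first utterance.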

SLIDE 7

Challenges

Spoken translations are the only available resources of the target language
  No target language phoneme recognizer is available
  Apply a related-language recognizer with a uniformly distributed language model
  High phoneme error rates (PERs)
Word alignment models from SMT are not designed for word-phoneme alignment
  Inaccurate alignments

SLIDE 8

EXPERIMENTAL SETUP

SLIDE 9

BMED

Basic Medical Expression Database
200 sentences in English, German, Slovene and Croatian
4h of speech data recorded from 5 Slovene and 8 Croatian native speakers

| English | German | Croatian | Slovene |
|---|---|---|---|
| I have bad headache. | Ich habe starke Kopfschmerzen. | Imam jaku glavobolju. | Imam močen glavobol. |
| I have a temperature. | Ich habe Fieber. | Imam temperaturu. | Imam vročino. |
| I have bad pain here. | Ich habe hier starke Schmerzen. | Ovdje imam jake bolove. | Imam zelo močne bolečine. |
| I am wounded. | Ich bin verletzt. | Ozlijeđen sam. | Poškodovan sem. |
| I am wounded on my right leg. | Ich bin an meinem rechten Bein verletzt. | Imam ozlijedu na desnoj nozi. | Desno nogo imam poškodovano. |
| … | | | |

SLIDE 10


CorpusGong

SLIDE 11

Recognizer Performance

Here: Phoneme recognizer with Croatian acoustic models and phoneme set
  trained on 20h Croatian speech from GlobalPhone corpus

| Test Data | PER |
|---|---|
| Croatian GlobalPhone test set | 33% |
| Croatian BMED | 43.4% |
| Slovene BMED | 55.2% |

SLIDE 12

ALIGNMENT AND VOCABULARY EXTRACTION

SLIDE 13

Example Corpus

| English Sentence | Croatian Sentence | Croatian Phonemes |
|---|---|---|
| This is the wounded patient. | To je ozlijeden pacient. | t o j e o z l i j e dp e n p a ts i n t |
| The patient is outside. | Vani je pacient. | v a n i j e p a ts i e n d |
| The patient waits for operation. | Pacient ceka na operaciju. | p a ts i e m t tS e k a n a o p e r a ts i j u |

SLIDE 14

Alignment Step

Run GIZA++ to generate alignments.

Alignment Ranking Interpolation Extraction

[Figure: word-to-phoneme alignments for the three example sentence pairs]
v a n i j e p a ts i e n d : The patient is outside
t o j e o z l i j e dp e n p a ts i n t : This is the wounded patient
p a ts i e m t tS e k a n a o p e r a ts i j u : The patient waits for operation

SLIDE 15

Alignment Step (2)

Output of the alignment step:

| Source language word | Aligned phoneme sequences |
|---|---|
| patient | j e p a ts I n t, v e p a ts I n d, p a ts i e m t |
| this | t o |
| is | j e |
| wounded | o z l i j e dp e n |
| outside | a n i |
| waits | tS e k a |
| for | n a |
| operation | o p e r a ts i j u |

SLIDE 16

Ranking Step

Idea: Rank all source language words, select n-best.
Combine multiple aspects in the ranking (weighted):
  Alignment Score
  Phoneme Confidence
  Frequency
  Levenshtein Distance
  Frequency-Levenshtein Product
  Length
  Alignment Error
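One simple way to combine several rankings with weights is a weighted sum of per-criterion rank positions. The slides do not state the actual combination rule, so the sketch below is an assumption, as are the function name `combine_rankings` and the two toy criteria (frequency and aligned-sequence length, with values read off the example corpus).

```python
def combine_rankings(scores, weights, n_best):
    """Select the n best words by a weighted sum of rank positions.

    scores:  {criterion: {word: value}}, higher value means better.
    weights: {criterion: weight}, hypothetical tuning parameters.
    """
    words = sorted(set.intersection(*(set(s) for s in scores.values())))
    combined = {w: 0.0 for w in words}
    for criterion, per_word in scores.items():
        # Rank 0 is the best word for this criterion.
        ordered = sorted(words, key=lambda w: per_word[w], reverse=True)
        for rank, w in enumerate(ordered):
            combined[w] += weights[criterion] * rank
    # A lower combined rank sum is better.
    return sorted(words, key=lambda w: combined[w])[:n_best]

scores = {
    "frequency": {"patient": 3, "is": 2, "waits": 1},
    "length": {"patient": 7, "is": 2, "waits": 4},
}
weights = {"frequency": 1.0, "length": 1.0}
selected = combine_rankings(scores, weights, n_best=1)
```

With these toy values "patient" ranks first under both criteria and is therefore selected.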

SLIDE 17

Interpolation Step – 1

Suppose we want to get the Croatian phoneme sequence corresponding to “patient” given the following alignment:

v a n i j e p a ts i e n d
The patient is outside

Possibility 1: Concatenate all aligned phonemes: v e p a ts i n d

SLIDE 18

Interpolation Step – 1

Suppose we want to get the Croatian phoneme sequence corresponding to “patient” given the following alignment:

v a n i j e p a ts i e n d
The patient is outside

Possibility 2: Select the largest subsequence: e p a ts i

SLIDE 19

Interpolation Step – 1

Suppose we want to get the Croatian phoneme sequence corresponding to “patient” given the following alignment:

v a n i j e p a ts i e n d
The patient is outside

Possibility 3: Use interval with minimum align error: e p a ts i e n d
Align error: 1 + 1 = 2
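The three possibilities can be compared in a short sketch. The set of aligned positions is read off the slide example (the phonemes v, e, p, a, ts, i, n, d are aligned to "patient"). The align error is not defined on the slides; the sketch assumes it counts unaligned phonemes inside the interval plus aligned phonemes outside it, which reproduces the 1 + 1 = 2 shown above. All function names are hypothetical.

```python
def concatenate_aligned(phonemes, aligned):
    """Possibility 1: keep every phoneme aligned to the word."""
    return [p for i, p in enumerate(phonemes) if i in aligned]

def longest_aligned_run(phonemes, aligned):
    """Possibility 2: longest contiguous run of aligned phonemes."""
    best, start = (0, 0), None
    for i in range(len(phonemes) + 1):
        if i < len(phonemes) and i in aligned:
            if start is None:
                start = i
        else:
            if start is not None and i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return phonemes[best[0]:best[1]]

def min_align_error_interval(phonemes, aligned):
    """Possibility 3: interval minimizing the assumed align error."""
    best, best_err = None, None
    for lo in range(len(phonemes)):
        for hi in range(lo + 1, len(phonemes) + 1):
            inside_unaligned = sum(1 for i in range(lo, hi) if i not in aligned)
            outside_aligned = sum(1 for i in aligned if not lo <= i < hi)
            err = inside_unaligned + outside_aligned
            if best_err is None or err < best_err:
                best, best_err = (lo, hi), err
    return phonemes[best[0]:best[1]], best_err

phonemes = "v a n i j e p a ts i e n d".split()
aligned = {0, 5, 6, 7, 8, 9, 11, 12}  # positions aligned to "patient"

p1 = concatenate_aligned(phonemes, aligned)
p2 = longest_aligned_run(phonemes, aligned)
p3, err = min_align_error_interval(phonemes, aligned)
```

Under this reading, the interval search tolerates the single unaligned /e/ inside the word and drops the stray /v/ at the start of the utterance.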

SLIDE 20

Interpolation Step – Levenshtein Graph

[Figure: Levenshtein graph built from the phoneme sequences aligned to “patient” in the three example sentence pairs: paTint, paTiemt, epaTiend]

SLIDE 21

Interpolation Step – Centralities

Degree: 3
PageRank: .06
Closeness: 7
Max. Dist.: 4
Betweenness: 1

SLIDE 22

Interpolation Step – Centralities

Degree: 4
PageRank: .08
Closeness: 4
Max. Dist.: 2
Betweenness: 3

SLIDE 23

Interpolation Step – Representative

“paTient” with high rankings → representative

SLIDE 24

Extraction Step

We extracted the new word ‘p a ts i e n t’
Feed this information back into the database

v a n i j e p a ts i e n d
The patient is outside

Alignment: p a ts i e n t
v a n i j e
The is outside
(realign in next iteration)

SLIDE 25

Evaluation Metrics

Find a mapping between entries in hypothesis and reference vocabulary
For each hypothesis vocabulary entry
  Vocabulary Phoneme Error Rate (VPER): Avg. Phoneme Error Rate to the reference entry
For each sentence
  Abstract from phoneme recognition errors
  Word Error Rate (WER): Avg. Word Error Rate between hypothesis and reference string
  hypoInRef: How many words are in the hypothesis, but not in the reference
  refInHypo: Other way round

SLIDE 26

RESULTS

SLIDE 27

Results – Vocabulary PER

55.2% PER Recognizer

| Hypo | Reference | VPER |
|---|---|---|
| im | ima | 0.333333 |
| premi | <not assigned> | 1 |
| o+kodo | po+kodova | 0.333333 |
| raa | rana | 0.25 |
| ee | te | 0.5 |
| laxko | laxko | |
| adobro | dobro | 0.2 |
| je | je | |
| pust | postel | 0.5 |
| uduobnu | udobna | 0.333333 |
| zron | <not assigned> | 1 |
| moaga | pomaga | 0.333333 |
| zgrab | zgradb | 0.166667 |
| skas | vas | 0.666667 |
| vaserazuu | <not assigned> | 1 |
| azuunemlva | razumem | 0.857143 |
| katexiee | kateri | 0.5 |

0% PER Recognizer

| Hypo | Reference | VPER |
|---|---|---|
| m | m | |
| draunik | zdraunik | 0.125 |
| ajea | <not assigned> | 1 |
| zgradb | zgradb | |
| vzglaunik | vzglaunik | |
| te | te | |
| zdravilo | zdravilo | |
| poTut | poTuti | 0.166667 |
| ibole | bole | 0.25 |
| udobno | udobno | |
| daness | danes | 0.2 |
| laxko | laxko | |
| ivalidskiv | invalidskivoziTek | 0.411765 |
| imam | imam | |
| nogo | nogo | |
| dobilis | dobili | 0.166667 |
| inekTij | inekTij | |

Despite frequent recognition errors, correct extraction is possible
Errors are often due to wrong single phonemes at the beginning or end of a word
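The per-entry values in the table are consistent with a Levenshtein distance divided by the length of the reference entry. The sketch below assumes exactly that, treats each character as one phoneme symbol, scores unassigned entries as 1, and averages over entries; the function names are hypothetical.

```python
def levenshtein(hyp, ref):
    """Edit distance with unit costs for insertion, deletion, substitution."""
    prev = list(range(len(ref) + 1))
    for i, x in enumerate(hyp, 1):
        cur = [i]
        for j, y in enumerate(ref, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def phoneme_error_rate(hyp, ref):
    """Edit distance normalized by reference length; 1.0 if no reference."""
    if ref is None:  # hypothesis entry with no assigned reference
        return 1.0
    return levenshtein(hyp, ref) / len(ref)

def vocabulary_per(pairs):
    """Average phoneme error rate over (hypothesis, reference) entries."""
    return sum(phoneme_error_rate(h, r) for h, r in pairs) / len(pairs)

# A few rows from the table above.
rows = [("im", "ima"), ("raa", "rana"), ("skas", "vas"), ("premi", None)]
per_entry = [phoneme_error_rate(h, r) for h, r in rows]
```

For the four sample rows this gives 1/3, 1/4, 2/3 and 1, matching the VPER column.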

SLIDE 28

Results – WER, RefInHypo, HypoInRef

0% PER recognizer

| Hypothesis String | Reference String | WER | RefInHypo | HypoInRef |
|---|---|---|---|---|
| govorite moram zdraunik | govori ti moram z (zdraunik, zdraunikom) | 0.6 | 0.33 | 0.6 |
| sem mediTinskasestra | jaz sem mediTinskasestra | 0.33 | 0.33 | |
| kater jezik govorite | (kater, kateri) jezik govorite | | | |

Difficulties with
  Inflections
  Short words

SLIDE 29

Results – Source Languages

[Figure: bar chart of VER, TER, hypoInRef and refInHypo (0% to 60%) for the language pairs cr-sl, de-sl and sl-sl; VER and TER are relabeled VPER and WER on the slide]

Alignment errors influence performance significantly
Choosing a related source language enhances performance slightly

SLIDE 30

Results – Vocabulary PER (de-sl)

[Figure: Vocabulary Phoneme Error Rate (0% to 100%) over Vocabulary Size / 10 (2 to 16) for three settings: 0% PER Recognizer (step size 10), 55.2% PER Recognizer (step size 10), 55.2% PER Recognizer (step size 50)]

Recognizer performance has significant impact
Step size is secondary

SLIDE 31

Results – Word Error Rate (de-sl)

[Figure: Word Error Rate (0% to 100%) over Vocabulary Size / 10 (2 to 16) for three settings: 0% PER Recognizer (step size 10), 55.2% PER Recognizer (step size 10), 55.2% PER Recognizer (step size 50)]

Word Error Rate decreases with vocabulary size
Recognizer performance becomes more important when a large vocabulary is extracted

SLIDE 32

Results – refInHypo, hypoInRef (de-sl)

[Figure: two panels, refInHypo and hypoInRef (0% to 100%), each over Vocabulary Size / 10 (5 to 20) for three settings: 0% PER Recognizer (step size 10), 55.2% PER Recognizer (step size 10), 55.2% PER Recognizer (step size 50)]

Generally, the hypothesis is shorter than the reference

SLIDE 33

Conclusion

We presented first steps towards a novel method for data-collection efforts
We are able to extract “good” vocabulary entries
There are still many deficiencies
  Inflections
Starting point for further research
  Factor analysis
  Enhance alignment models
  Enhance recognizer performance
  …

SLIDE 34


Thank you

SLIDE 35

BACKUP SLIDES

SLIDE 36

36 05.09.2011 Stahlberg Felix: Intermediate Results (3)

Alignment Step

Run GIZA++ to generate alignments.

Corpus Alignment and Vocabulary Extraction Miscellaneous

[Figure: word-to-phoneme alignments for the three example sentence pairs]
t o j e o z l i j e dp e n p a ts i n t : This is the wounded patient
p a ts i e m t tS e k a n a o p e r a ts i j u : The patient waits for operation
v a n i j e p a ts i e n d : The patient is outside

SLIDE 37

Alignment Step – Direct Alignment Approach

Possibility 1: Direct alignment approach
  Pass the alignment information directly through to the next step

Rank in next step: {j e p a ts I n t, v e p a ts I n d, p a ts i e m t}, {t o}, {j e}, {o z l i j e dp e n}, {a n i}, {tS e k a}, {n a}, {o p e r a ts i j u}

SLIDE 38

Alignment Step – Clustering Approach

Possibility 2: Clustering approach
  Use the alignment only for extracting word boundary markers
  Calculate clusters
    Example: cluster({p a ts I e m t, j e, o z l i j e dp e n, a n i…}) = {1 → {j e p a ts I n t, v e p a ts I n d, p a ts i e m t, j e}, 2 → {a n i, tS e k a, n a}, 3 → {o z l i j e dp e n}, 4 → {o p e r a ts i j u}}
  Use clusters as new source vocabulary

Rank in next step: {j e p a ts I n t, v e p a ts I n d, p a ts i e m t, j e}, {a n i, tS e k a, n a}, {o z l i j e dp e n}, {o p e r a ts i j u}

SLIDE 39

Ranking Step

Idea: Rank all source language words, select n-best.
Combine multiple aspects in the ranking
  Alignment Score
  Phoneme Confidence
  Frequency
  Levenshtein distance
  Freq/Lev Product
  Length
  Alignment Error

SLIDE 40

Ranking Step – Alignment Score

Concept: Prefer words with high average alignment score (output by GIZA++).
Intention: Reduce impact of alignment errors.

SLIDE 41

Ranking Step – Phoneme Confidence

Concept: Prefer words with high average phoneme confidence (output by phoneme recognizer).
Intention: Reduce impact of phoneme recognition errors.

SLIDE 42

Ranking Step – Frequency

Concept: Prefer frequent words.
Intention: Frequent words provide more information to work with.

SLIDE 43

Ranking Step – Levenshtein distance

Concept: Average Levenshtein distance of aligned phoneme sequences per length should be low.
Intention: Improve quality of interpolated sequence in the interpolation step.

SLIDE 44

Ranking Step – Freq/Lev Product

Concept: Frequent words may have a greater Levenshtein ranking.
Intention: Towards high ranking for single occurrence words (that rank optimal regarding Levenshtein).

SLIDE 45

Ranking Step – Length

Concept: Prefer words with high average length of aligned phoneme sequences.
Intention: Other rankings prefer short sequences.

SLIDE 46

Ranking Step – Alignment Error

Concept: See next slides.
Intention: Rank reasonability of alignments.

SLIDE 47

Interpolation Step – 1

Suppose we want to get the Croatian phoneme sequence corresponding to “Guten” given the following alignment:

Guten Tag
/D/ /O/ /B/ /A/ /R/ /D/ /A/ /N/ /E/

Possibility 1: Concatenate all aligned phonemes: /D/ /O/ /A/ /R/ /E/

SLIDE 48

Interpolation Step – 1

Suppose we want to get the Croatian phoneme sequence corresponding to “Guten” given the following alignment:

Guten Tag
/D/ /O/ /B/ /A/ /R/ /D/ /A/ /N/ /E/

Possibility 2: Use interval with minimum align error ranking: /D/ /O/ /B/ /A/ /R/
Align error ranking: -1 -1 = -2

SLIDE 49

Alignment Process – Step 3.1

Further possible intervals (with suboptimal align error rankings)

Guten Tag
/D/ /O/ /B/ /A/ /R/ /D/ /A/ /N/ /E/
Align error ranking: -1 -1 -1 -1 -1 = -5

Guten Tag
/D/ /O/ /B/ /A/ /R/ /D/ /A/ /N/ /E/
Align error ranking: -1 -1 -1 = -3

SLIDE 50

Interpolation Step – Closeness

Closeness: Sum of distances to all input nodes.
[Figure: Levenshtein graph with node values 5 6 5 6 4 7 6 5 5 6 7 6 6 6 6]

SLIDE 51

Interpolation Step – Max Distance

Max Distance: Maximum distance to an input node.
[Figure: Levenshtein graph with node values 3 3 2 3 2 4 3 2 3 3 4 4 4 4 3]

SLIDE 52

Interpolation Step – PageRank

PageRank: Google's importance measure for web pages.
[Figure: Levenshtein graph with node values .11 .06 .08 .05 .08 .06 .05 .08 .11 6 .05 .04 .04 .04 .06]

SLIDE 53

Interpolation Step – Degree

Degree: Number of adjacent nodes.
[Figure: Levenshtein graph with node values 6 4 4 3 4 3 3 4 6 3 3 2 2 2 3]

SLIDE 54

Interpolation Step – Betweenness

Betweenness: Number of shortest paths between input nodes containing the node.
[Figure: Levenshtein graph with node values 1 1 2 1 3 1 1 2 1 1 1 1 1 1 1]

SLIDE 55

Interpolation Step – 2

Input: kallo, ballo, halo

| Node | Degree | PageRank | Max distance | ∑ distance | Betweenness |
|---|---|---|---|---|---|
| kallo | 3 | 0.1881 | 2 | 3 | 1 |
| ballo | 3 | 0.1881 | 2 | 3 | 1 |
| halo | 3 | 0.1667 | 2 | 4 | 1 |
| hallo | 3 | 0.1981 | 1 | 3 | 2 |
| balo | 2 | 0.1295 | 2 | 4 | 1 |
| kalo | 2 | 0.1295 | 2 | 4 | 1 |

Select the row “hallo”
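The degree, max-distance and ∑-distance columns for the kallo/ballo/halo example can be reproduced with a small sketch. It assumes the Levenshtein graph is the union of all shortest single-edit paths between each pair of input strings, which matches the six nodes and their degrees in the table; the slides do not spell out the construction. PageRank and betweenness are left out, each character stands for one phoneme, and picking the node with the smallest (max distance, ∑ distance) is an illustrative selection rule rather than the deck's full ranking.

```python
from collections import deque
from itertools import combinations

def levenshtein(a, b):
    """Edit distance with unit costs for insertion, deletion, substitution."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def single_edits(w, alphabet):
    """All strings exactly one edit away from w."""
    out = set()
    for i in range(len(w) + 1):
        out.update(w[:i] + c + w[i:] for c in alphabet)          # insertions
        if i < len(w):
            out.add(w[:i] + w[i + 1:])                           # deletion
            out.update(w[:i] + c + w[i + 1:] for c in alphabet)  # substitutions
    out.discard(w)
    return out

def levenshtein_graph(inputs):
    """Union of all shortest edit paths between every pair of inputs."""
    alphabet = sorted(set("".join(inputs)))
    graph = {w: set() for w in inputs}
    for a, b in combinations(inputs, 2):
        d = levenshtein(a, b)
        frontier = {a}
        for step in range(d):
            nxt = set()
            for u in frontier:
                for v in single_edits(u, alphabet):
                    if levenshtein(v, b) == d - step - 1:  # still on a shortest path
                        graph.setdefault(u, set()).add(v)
                        graph.setdefault(v, set()).add(u)
                        nxt.add(v)
            frontier = nxt
    return graph

def distances_from(graph, start):
    """Breadth-first search distances from start to every reachable node."""
    dist, queue = {start: 0}, deque([start])
    while queue:
        u = queue.popleft()
        for v in graph[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return dist

def centralities(graph, inputs):
    table = {}
    for node in graph:
        dist = distances_from(graph, node)
        to_inputs = [dist[w] for w in inputs]
        table[node] = {"degree": len(graph[node]),
                       "max_distance": max(to_inputs),
                       "sum_distance": sum(to_inputs)}
    return table

inputs = ["kallo", "ballo", "halo"]
graph = levenshtein_graph(inputs)
table = centralities(graph, inputs)
representative = min(table, key=lambda w: (table[w]["max_distance"],
                                           table[w]["sum_distance"]))
```

On this input the sketch yields the same six nodes as the table, and "hallo" is the only node within distance 1 of every input, so it is chosen as the representative.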

SLIDE 56

[Figure: four panels (TER, hypoInRef, refInHypo, VER), each with a y-axis from 0.1 to 1 and an x-axis from 5 to 20, for three settings: 0% PER Recognizer (step size 10), 55.2% PER Recognizer (step size 10), 55.2% PER Recognizer (step size 50)]

SLIDE 57

Using Adaptor Grammars (…) [for] Unsup. Acquisition of Ling. Structure [Johnson 08]

Investigation
  Performance of different adaptor grammars
  Taking context into account
Results
  F-Score up to 0.78
  [Johnson, Goldwater 09] with better results through parameter tuning

Former Research Language Selection Corpus Selection

SLIDE 58

Towards Speech Transl. of Non Written Languages [Besacier et al. 06]

Investigation
  Bypass written forms of languages for MT
  Parallel corpus
    Standard word transcription (with punctuation, spaces) in the source language
    Perfect transcription as a long single string of phonemes in the target language
  Find word boundaries in target language, then train language models and translation models with them
Results
  BLEU scores:

SLIDE 59

Towards Human Transl. Guided Lang. Discovery for ASR Sys. [Stüker, Waibel 08]

Investigation
  Generating dictionaries by word-to-phoneme alignment
  Closely related to what we have in mind, but with phoneme sequences generated from the transcription
Results
  Feasible approach

SLIDE 60

60 05.09.2011 Stahlberg Felix: Intermediate Results (2)

Word Segmentation with Adaptor Grammars* (1)

| Grammar | Corpus | F-Score | Precision | Recall |
|---|---|---|---|---|
| Unigram | German GP Test Set | 0.2397 | 0.1852 | 0.34 |
| Unigram | German BMEC | 0.4004 | 0.3577 | 0.454 |
| Colloc-syllable | German GP Test Set | 0.3211 | 0.2453 | 0.4647 |
| Colloc-syllable | German BMEC | 0.4402 | 0.365 | 0.5544 |

[Figure: grammar definitions for Colloc-syllable and Unigram]
*According to [Johnson, Goldwater 09]

Phoneme Recognition Corpus Feature Annotation System Word Alignment

SLIDE 61

Word Segmentation with Adaptor Grammars (2)

Common segmentation errors
  Kranken schwester (unavoidable)
  Dasm edikament (larger corpus may help)
  Trink en (bad)

SLIDE 62

Prosodic Features (1)

[Figure: intensity and pitch contours annotated with the events F0_RUN_START, F0_RUN_END, F0_RUN, INTENSITY_LOCAL_MAX, INTENSITY_GLOBAL_MAX, INTENSITY_SIL and PITCH_RESET]

SLIDE 63

Prosodic Features (2)

Assumptions
  Prosodic features help us with word segmentation
    F0_RUN_END seems to mark word boundaries
    Especially if there is a PITCH_RESET
  Prosodic features help us with cross-lingual alignment
    INTENSITY_GLOBAL_MAX events may match across languages
    Maybe use cross-lingual similarities of other characteristic feature distributions
    Not evaluated yet

SLIDE 64

Requirement 2

AM should be capable of taking cross-lingual matching points into account
  Only if German speech data is available
  Use prosodic features INTENSITY_GLOBAL_MAX and F0_RUN

Ich glaube nicht. : n e v je r u je m
[Figure: INTENSITY_GLOBAL_MAX marked in both utterances]

SLIDE 65

PHP Toolkit

SLIDE 66

Database

SLIDE 67

Database – Corpora

SLIDE 68

Database – Recognizers

SLIDE 69

Database – Features

SLIDE 70

Overview

$\mathit{DataBase} \subset \mathit{Voc}^{*} \times \mathit{PhonemeSet}^{*}$

$\mathit{DataBase}$: BMED database
$\mathit{Voc}$: Source language vocabulary
$\mathit{PhonemeSet}$: Recognizer phoneme set

SLIDE 71

Example Corpus

| English Sentence | Croatian Sentence | Croatian Phonemes |
|---|---|---|
| This is the wounded patient. | To je ozlijeden pacient. | t o j e o z l i j e dp e n p a ts i n t |
| The patient is outside. | Vani je pacient. | v a n i j e p a ts i e n d |
| The patient waits for operation. | Pacient ceka na operaciju. | p a ts i e m t tS e k a n a o p e r a ts i j u |

$\mathit{DataBase}$ = { (<this, is, the, wounded, patient>, <t, o, j, e, o, z, l, i, j, e, dp, e, n, p, a, ts, i, e, n, t>),
(<the, patient, is, outside>, <v, a, n, i, j, e, p, a, ts, i, e, n, t>),
(<the, patient, waits, for, operation>, <p, a, ts, i, e, n, t, tS, e, k, a, n, a, … , ts, i, j, u>) }
$\mathit{Voc}$ = {this, is, the, wounded, patient, outside, waits, for, operation}
$\mathit{PhonemeSet}$ = {t, o, j, e, z, l, i, dp, n, p, a, ts}

SLIDE 72

Our Approach

1. Alignment step. Generate an alignment for each element in $\mathit{DataBase}$.
2. Ranking step. Choose a subset $S \subseteq \mathit{Voc}$ of $n$ promising source language words ($|S| = n$).
3. Interpolation step. For each $s_j \in S$, extract one single phoneme sequence $p_j \in \mathit{PhonemeSet}^{*}$ and use it as new vocabulary entry ($P = \{p_j \mid j < n\}$).
4. Extraction step. Remove all $s_j$ and $p_j$ from $\mathit{DataBase}$: $\mathit{DataBase} \leftarrow \{(X \setminus S,\ Y \setminus P) \mid (X, Y) \in \mathit{DataBase}\}$.
5. Start with 1.) until $\mathit{DataBase}$ is empty: $\sum_{(X,Y) \in \mathit{DataBase}} |X| = 0$.

SLIDE 73

THESIS

SLIDE 74

[Figure: word-to-phoneme alignments for the three example sentence pairs]
t o j e o z l i j e dp e n p a ts i n t : This is the wounded patient
p a ts i e m t tS e k a n a o p e r a ts i j u : The patient waits for operation
v a n i j e p a ts i e n d : The patient is outside

SLIDE 75

Say “I am sick.” in your mother tongue.
/b/ /o/ /l/ /e/ /s/ /t/ /a/ /n/ /s/ /a/ /m/
Say “I am healthy.” in your mother tongue.
/z/ /d/ /r/ /a/ /v/ /s/ /a/ /m/

/b/ /o/ /l/ /e/ /s/ /t/ /a/ /n/ seems to be a word (meaning sick)
/z/ /d/ /r/ /a/ /v/ seems to be a word (meaning healthy)
/s/ /a/ /m/ seems to be a word (meaning I am)

SLIDE 76

Align error ranking: -1 -1 = -2
v a n i j e p a ts i e n d
The patient is outside