MALACH: Multilingual Access to Large Spoken ArCHives


  1. MALACH : Multilingual Access to Large spoken ArCHives http://www.clsp.jhu.edu/research/malach (funded under NSF ITR Award 0122466)

  2. Sam Gustman, Survivors of the Shoah Visual History Foundation; Bhuvana Ramabhadran, Michael Picheny, Martin Franz, Nanda Kambhatla, IBM T. J. Watson Research Center; William Byrne, CLSP, Johns Hopkins University; Josef Psutka, University of West Bohemia; Jan Hajic, Charles University; Dagobert Soergel, Douglas W. Oard, CLIS, University of Maryland

  3. Examples of Spoken Archives
      Vincent Voice Library (MSU): speeches, performances, lectures, interviews, broadcasts, etc.; 50,000 recordings
      Oyez! Oyez! Oyez! (NWU): Supreme Court proceedings; 500 hours
      History and Politics Out Loud (NWU): significant political and historical events and personalities of the twentieth century
      Informedia (CMU): 2 TB of digital video
      National Gallery of the Spoken Word (MSU): spoken-word collections from the 20th century

  4. VHF Multimedia Data Collection
      VHF has collected 52,000 testimonies (2 1/2 hours each) in over 32 languages (180 TB of digital video), the largest and most complex single-topic digital video library in the world.
      http://www.vhf.org/archive.htm

  5. Number of Interviews by Country
      Argentina 737; Australia 2,483; Austria 184; Belarus 253; Belgium 207; Bolivia 22; Bosnia & Herzegovina 43; Brazil 567; Bulgaria 636; Canada 2,844; Chile 65; Colombia 14; Costa Rica 19; Croatia 330; Czech Republic 567; Denmark 95; Dominican Republic 1; Ecuador 9; Estonia 9; Finland 1; France 1,675; Georgia 6; Germany 677; Greece 303; Hungary 730; Ireland 5; Israel 8,474; Italy 419; Japan 1; Kazakhstan 6; Latvia 77; Lithuania 133; Macedonia 9; Mexico 112; Moldova 283; Netherlands 1,051; New Zealand 55; Norway 34; Peru 2; Poland 1,429; Portugal 2; Romania 147; Russia 712; Slovakia 665; Slovenia 12; South Africa 254; Spain 6; Sweden 331; Switzerland 68; Ukraine 3,434; United Kingdom 873; United States 19,843; Uruguay 126; Uzbekistan 25; Venezuela 227; Yugoslavia 361; Zimbabwe 6
      Total: 51,649 testimonies from 57 countries

  6. Testimony Language Statistics
      Bulgarian 622; Croatian 394; Czech 574; Danish 72; Dutch 1,080; English 24,947; Flemish 5; French 1,886; German 933; Greek 303; Hebrew 6,317; Hungarian 1,285; Italian 432; Japanese 1; Ladino 10; Latvian 6; Lithuanian 45; Macedonian 9; Norwegian 34; Polish 1,571; Portuguese 563; Romani 28; Romanian 123; Russian 7,011; Serbian 374; Sign (3 American & 1 Hungarian); Slovak 574; Slovenian 6; Spanish 1,350; Swedish 269; Ukrainian 318; Yiddish 513
      Total: 51,649 testimonies in 32 languages

  7. Manual Indexing System
      • Cataloguers listen to the audio data
      • Divide data into large segments
      • For each large segment:
        - Divide into smaller segments
        - For each smaller segment, make notes on what the speaker said
        - Annotate these notes with keywords that can be used to index this data
        - Associate with video, stills, artifacts, etc.
        - Summarize these notes
      • About 4,000 testimonies catalogued in this fashion
      • Clearly expensive and time-consuming; depending upon the nature of the archive, cost may be prohibitive
      • Alternatively, fixed 1-minute segments were used
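The cataloguing workflow above amounts to building an inverted index over hand-written segment records. A minimal sketch of such a record and its keyword index, in Python; the field names and the `Segment` class are hypothetical illustrations, not VHF's actual cataloguing schema:

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    """One catalogued segment of a testimony (hypothetical schema)."""
    testimony_id: str
    start_s: float                                # segment start, seconds into the interview
    end_s: float                                  # segment end
    notes: str = ""                               # cataloguer's notes on what the speaker said
    keywords: list = field(default_factory=list)  # index terms for retrieval
    media: list = field(default_factory=list)     # linked video, stills, artifacts
    summary: str = ""

def index_by_keyword(segments):
    """Invert a list of segments into a keyword -> [segment] lookup table."""
    index = {}
    for seg in segments:
        for kw in seg.keywords:
            index.setdefault(kw, []).append(seg)
    return index

seg = Segment("T0001", 0.0, 60.0,
              notes="Describes family life in Berlin before the war",
              keywords=["Berlin-1939", "Family life"])
idx = index_by_keyword([seg])
print(sorted(idx))  # keywords under which this segment is findable
```

The expense the slide describes is precisely that every `notes`, `keywords`, and `summary` field is filled in by a human listener.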

  8. An Example (segments ordered along interview time)
      Location-Time | Subject | Person
      Berlin-1939 | Employment | Josef Stein
      Berlin-1939 | Family life | Gretchen Stein, Anna Stein
      Dresden-1939 | Relocation, Transportation-rail |
      Dresden-1939 | Schooling | Gunter Wendt, Maria

  9. MALACH: Multilingual Access to Large Spoken ArCHives
      The objective of MALACH is to dramatically improve access to large multilingual spoken archives by capitalizing on the unique characteristics (unconstrained natural speech) of the Survivors of the Shoah Visual History Foundation's (VHF) multimedia digital archive of oral histories.
      Specific goals include:
      • Advances in speech recognition technology to handle spontaneous and emotional speech with disfluencies, heavy accents, elderly speech, and dynamic switching between multiple languages
      • Advances in information retrieval technologies to provide efficient indexing, search, and retrieval
      • Automated techniques for the generation of new metadata to label segments
      • Automated translation of domain-specific multilingual thesauri
      • Workshops and user studies to evaluate the social and scientific value of the technology and see how it can be applied to other large archives

  10. [System diagram: user needs drive query formulation and interactive selection; automatic search, content tagging, and boundary detection are built on NLP components, speech recognition (ASR), and a thesaurus over the speech.]

  11. [Chart: English ASR word error rate (%), plotted quarterly from Jan-02 to Jan-04 on a 0-100% scale.]

  12. Why is Speech Recognition Hard?
      • Unusual words
        Reference: "My middle name m- my my middle brother he had two names in lost- in- before the war Shloma Hasich and me, that's Chuna Moskovitch, I was the baby at home and the sisters name was Miriam all were Mosokowiz"
        ASR output: "my middle name from my mental emitter but out the heck in the shloma hostage the meat and scorn are much as I was the baby home and desist his name rose mary an"
      • Disfluencies
        Reference: "A- a- a- a- band with on- our- on- our- arm"
        ASR output: "a hat and bend with the on on our farm"
      • Emotional speech
        "a young man they ripped his teeth and beard out they beat him"
      • Sections of frequent interruptions (overlapping speaker in capitals)
        Reference: "CHURCH TWO DAYS these were the people who were to go to march TO MARCH and your brother smuggled himself SMUGGLED IN IN IN IN"
        ASR output: "church H. to data this these people who have to go to court each and two brothers smuggled some drugs and"
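The gap between those reference transcripts and ASR outputs is what the word error rate (WER) figures elsewhere in this deck measure. A minimal sketch of the standard WER computation, word-level edit distance; illustrative only, not the scoring tool the project used:

```python
def wer(reference, hypothesis):
    """Word error rate: (substitutions + insertions + deletions)
    divided by reference length, via dynamic-programming edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[-1][-1] / len(ref)

print(wer("the cat sat", "the cat sat on"))  # 1 insertion / 3 reference words
```

Note that WER can exceed 100% on heavily garbled hypotheses, since insertions count against a fixed reference length.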

  13. Unexpected Surprises
      • Stereo-format recordings with interviewee and interviewer in the same channel
      • Some with low volume and some with no data in them at all
      • Many, many non-English testimonies: there is no guarantee that a testimony is in English, even if the interviewer starts speaking in English and says that it is!
      • As many as 9 speakers in some testimonies
      • Lots of cross-talk (less of this with interviewers with British and Australian accents)
      • Some interviewees say very little: in a few testimonies, interviewers did all the talking, forcing yes/no-type answers

  14. Other Observations
      • Lots of foreign words, unsure words, names, and places
      • Noisy backgrounds: static, airplane noise, buzzing, hammering, coughing, laughter, emotion (crying, screaming), many background conversations, badly placed microphones

  15. [Histogram: transcription time in hours (0-10) on the x-axis vs. number of speakers (0-200) on the y-axis.]

  16. Examples of foreign words, names… ADAKCLAUS ADDUS-YIS-HOREL ARBEIT-MACHT-FREI ARNHEIM ARONAFISCHSTRASSEN BABUSHKAS CZESTOCHOWA HA-NOR-YAT-SA-NEE HASLACH JUDENANRAT SZMALCONIKI VERMIETEN YANZICHITZ YAKUBOVICH YITZKAH YU-OV-DOV-SKY YUDENLAGER ZWILLINGEN ZOSHA

  17. ASR Performance
      • Gender-dependent (GD) systems: two GD systems trained with about half the training data each (~100h male speakers, ~78h female speakers)

        WER (%)       65h training          200h training
                      GD      baseline      GD      baseline
        SI            45.5    46.6          41.0    42.3
        SAT           41.9    43.3          37.6    38.2
        MLLR          39.4    39.6          35.1    35.2

      • Performance improvements of 1.4% absolute at the SAT level obtained with 65h of training data went away after MLLR
      • Gains not seen with 200 hours of training data (0.6% overall gain with gender-dependent systems)

  18. Decoding the Test Collection
      • Why is this important? The test collection is being used to train models for automatic topic segmentation, categorization, and search
      • Collection details:
        - Compressed audio (sampling frequencies: 44.1 kHz and 48 kHz)
        - 625 hours done (computing done at ~4x real time), of which 580 hours are speech
        - Models used had a speaker-independent WER of 46.7% and a speaker-dependent WER of 39.6%
      • Total tapes: 1,294 (199 full testimonies, 47 partial testimonies)
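The roughly-4x-real-time throughput quoted above translates directly into compute budget. A quick arithmetic sketch (the helper function is hypothetical, single-machine arithmetic only):

```python
def decode_cost_hours(audio_hours, rt_factor):
    """Compute-hours needed to decode an archive at a given real-time
    factor (processing time divided by audio duration)."""
    return audio_hours * rt_factor

# the slide's collection: 625 hours of audio decoded at roughly 4x real time
print(decode_cost_hours(625, 4))  # 2500 compute-hours
```

At that rate, decoding the full VHF archive (tens of thousands of hours) is a major cluster job, which is why the segmentation and speed issues on the following slides matter.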

  19. Why is Acoustic Segmentation Necessary? (Eurospeech 2003)
      • Automatically identify and remove non-speech segments
      • Reduce computational load
      • Speaker labeling of segments allows adaptation to be performed on speaker-coherent clusters
      • The manual process is time-consuming and expensive
      • Goal: improve recognition performance on tens of thousands of hours of spoken material
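The speech/non-speech step can be illustrated with a crude energy threshold. This is a sketch only: the MALACH systems used model-based classifiers, and both the frame size and the threshold here are arbitrary choices:

```python
import numpy as np

def speech_segments(samples, rate, frame_ms=30, threshold_db=-35.0):
    """Crude energy-based speech/non-speech segmentation.
    Returns (start_s, end_s) spans whose frame RMS energy exceeds the
    threshold; everything else is treated as non-speech."""
    frame = int(rate * frame_ms / 1000)
    n = len(samples) // frame
    spans, start = [], None
    for i in range(n):
        chunk = samples[i * frame:(i + 1) * frame]
        rms = np.sqrt(np.mean(chunk ** 2)) + 1e-12   # avoid log(0) on silence
        loud = 20 * np.log10(rms) > threshold_db
        if loud and start is None:
            start = i * frame_ms / 1000              # span opens
        elif not loud and start is not None:
            spans.append((start, i * frame_ms / 1000))  # span closes
            start = None
    if start is not None:
        spans.append((start, n * frame_ms / 1000))
    return spans

# synthetic check: 1 s silence, 1 s tone, 1 s silence at 16 kHz
rate = 16000
t = np.arange(rate) / rate
audio = np.concatenate([np.zeros(rate),
                        0.5 * np.sin(2 * np.pi * 440 * t),
                        np.zeros(rate)])
spans = speech_segments(audio, rate)
print(spans)  # roughly one span covering the middle second
```

Dropping the silent spans before decoding is what buys the "reduce computational load" point above.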

  20. [Chart: first-pass decoding WER with several automatic segmentation schemes (Human, Speech vs. Non-Speech, BIC, Iterative, Audio/Visual segmentation), plotted on a 0-70% scale.]

  21. Segment Clustering (schemes compared)
      • Bottom-up clustering to two clusters (interviewee and interviewer)
      • Single cluster (i.e., one transform only)
      • Manually marked speaker IDs
      • Randomly assigned speaker IDs
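The bottom-up scheme can be sketched as plain agglomerative clustering over per-segment feature vectors, merging the nearest pair of clusters until two remain. Illustrative only: the systems above cluster acoustic statistics for adaptation, not the toy centroid distance used here:

```python
def cluster_two(vectors):
    """Bottom-up (agglomerative) clustering of per-segment feature vectors
    until exactly two clusters remain, e.g. interviewee vs. interviewer.
    Returns clusters as lists of segment indices."""
    clusters = [[i] for i in range(len(vectors))]

    def centroid(c):
        dim = len(vectors[0])
        return [sum(vectors[m][k] for m in c) / len(c) for k in range(dim)]

    def dist(a, b):
        ca, cb = centroid(a), centroid(b)
        return sum((x - y) ** 2 for x, y in zip(ca, cb))

    while len(clusters) > 2:
        # find and merge the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: dist(clusters[ij[0]], clusters[ij[1]]))
        clusters[i] += clusters.pop(j)
    return clusters

# two well-separated groups of toy 1-D "segment embeddings"
segs = [[0.1], [0.0], [0.2], [5.0], [5.1], [4.9]]
groups = cluster_two(segs)
print(groups)  # segments split into the two underlying groups
```

Each resulting cluster then gets its own adaptation transform, which is the point of the speaker-coherent clustering on the previous slide.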

  22. WER: Effect of Automatic Speaker Clustering on Automatic Segmentation (Speech/Non-Speech scheme)
      [Chart: WER for speaker-independent, single-transform, human speaker IDs, bottom-up clustering (BUC), and random speaker IDs, on a 0-80% scale.]
      • The clustering scheme has relatively little effect on performance when starting from speaker-mixed segments
      • Impact on the interviewer's speech (< 18%; can be as low as 4%)

  23. WER After Adaptation: How Far Are We from the Best We Can Do?
      [Chart: WER (%) for 5 speakers, human vs. automatic segmentation, on a 0-100% scale.]
      • Automatic segmentation is a relative 8% worse than human segmentation
