
Part I, Lecture 1: Introduction — Introduction/Signal Processing, Part I
Michael Picheny, Bhuvana Ramabhadran, Stanley F. Chen
IBM T.J. Watson Research Center, Yorktown Heights, New York, USA
{picheny,bhuvana,stanchen}@us.ibm.com
10 September 2012


  1. What Is Speech Recognition?
     - Converting speech to text (STT).
     - a.k.a. automatic speech recognition (ASR).
     - What it's not:
       - Natural language understanding: e.g., Siri.
       - Speech synthesis: converting text to speech (TTS), e.g., Watson.
       - Speaker recognition: identifying who is speaking.

     Why Is Speech Recognition Important?
     - Demo.

  2. Because It's Fast
     - Requires no specialized training to do fast.

       modality   method                    rate (words/min)
       sound      speech                    150–200
       sight      sign language; gestures   100–150
       touch      typing; mousing           60
       taste      covering self in food     < 1
       smell      not showering             < 1

     Other Reasons
     - Hands-free.
     - Speech-enabled devices are everywhere: phones, smart or dumb.
       - Access to phone > access to internet.
     - Text is easier to process than audio.
       - Storage/compression; indexing; human consumption.

     Key Applications
     - Transcription: archiving/indexing audio.
       - Legal; medical; television and movies.
     - Call centers.
     - Whenever you interact with a computer ... without sitting in front of one.
       - e.g., smart or dumb phone; car; home entertainment.
     - Accessibility.
       - People who can't type, or type slowly.
       - The hard of hearing.

     Why Study Speech Recognition?
     - Real-world problem. Potential market: ginormous.
     - Hasn't been solved yet. Not too easy; not too hard (e.g., vision).
     - Lots of data. One of the first learning problems of this scale.
     - Connections to other problems with sequence data: machine translation, bioinformatics, OCR, etc.

  3. Where Are We?
     1. Course Overview
     2. A Brief History of Speech Recognition
     3. Building a Speech Recognizer: The Basic Idea
     4. Speech Production and Perception

     Who Are We?
     - Michael Picheny: Sr. Manager, Speech and Language.
     - Bhuvana Ramabhadran: Manager, Acoustic Modeling.
     - Stanley F. Chen: Regular guy.
     - IBM T.J. Watson Research Center, Yorktown Heights, NY.

     Why Three Professors?
     - Too much knowledge to fit in one brain:
       - Signal processing.
       - Probability and statistics.
       - Phonetics; linguistics.
       - Natural language processing.
       - Machine learning; artificial intelligence.
       - Automata theory.

     How To Contact Us
     - In e-mail, prefix subject line with "EECS E6870:"!!!
     - Michael Picheny: picheny@us.ibm.com
     - Bhuvana Ramabhadran: bhuvana@us.ibm.com
     - Stanley F. Chen: stanchen@us.ibm.com
     - Office hours: right after class; before class by appointment.
     - TA: Xiao-Ming Wu: xw2223@columbia.edu
     - Courseworks: for posting questions about labs.

  4. Course Outline

       week  topic                          assigned  due
       1     Introduction
       2     Signal processing; DTW         lab 1
       3     Gaussian mixture models
       4     Hidden Markov models           lab 2     lab 1
       5     Language modeling
       6     Pronunciation modeling         lab 3     lab 2
       7     Finite-state transducers
       8     Search                         lab 4     lab 3
       9     Robustness; adaptation
       10    Discrim. training; ROVER       project   lab 4
       11    Advanced language modeling
       12    Neural networks; DBN's
       13    Project presentations                    project

     Programming Assignments
     - 80% of grade (√−, √, √+ grading).
     - Some short written questions.
     - Write key parts of a basic large-vocabulary continuous speech recognition system.
       - Only the "fun" parts.
       - C++ code infrastructure provided by us.
       - Also accessible from Java (via SWIG).
     - Get account on ILAB computer cluster (x86 Linux PC's).
       - Complete the survey.
     - Labs due at Wednesday 6pm.

     Final Project
     - 20% of grade.
     - Option 1: Reading project (individual).
       - Pick paper(s) from provided list, or propose your own.
       - Give 10-minute presentation summarizing the paper(s).
     - Option 2: Programming/experimental project (group).
       - Pick project from provided list, or propose your own.
       - Give 10-minute presentation summarizing the project.

     Readings
     - PDF versions of readings will be available on the web site.
     - Recommended text:
       - Speech Synthesis and Recognition, Holmes, 2nd edition (paperback, 256 pp., 2001) [Holmes].
     - Reference texts:
       - Theory and Applications of Digital Signal Processing, Rabiner, Schafer (hardcover, 1056 pp., 2010) [R+S].
       - Speech and Language Processing, Jurafsky, Martin (2nd edition, hardcover, 1024 pp., 2000) [J+M].
       - Statistical Methods for Speech Recognition, Jelinek (hardcover, 305 pp., 1998) [Jelinek].
       - Spoken Language Processing, Huang, Acero, Hon (paperback, 1008 pp., 2001) [HAH].

  5. Web Site
     - www.ee.columbia.edu/~stanchen/fall12/e6870/
     - Syllabus.
     - Slides from lectures (PDF).
       - Online by 8pm the night before each lecture.
       - Hardcopy of slides distributed at each lecture?
     - Lab assignments (PDF).
     - Reading assignments (PDF).
       - Online by the lecture they are assigned.
     - Username: speech, password: pythonrules

     Prerequisites
     - Basic knowledge of probability and statistics.
     - Fluency in C++ or Java.
     - Basic knowledge of Unix or Linux.
     - Knowledge of digital signal processing optional.
       - Helpful for understanding signal processing lectures.
       - Not needed for labs.

     Help Us Help You
     - Feedback questionnaire after each lecture (2 questions).
     - Feedback welcome any time.
     - You, the student, are partially responsible ... for the quality of the course.
       - Please ask questions anytime!
     - EE's may find CS parts challenging, and vice versa.
       - Together, we can get through this. Let's go!

     Where Are We?
     1. Course Overview
     2. A Brief History of Speech Recognition
     3. Building a Speech Recognizer: The Basic Idea
     4. Speech Production and Perception

  6. The Early Years: 1950–1960's
     - Ad hoc methods.
     - Many key ideas introduced; not used all together.
       - e.g., spectral analysis; statistical training; language modeling.
     - Small vocabulary.
       - Digits; yes/no; vowels.
     - Not tested with many speakers (usually < 10).

     Whither Speech Recognition?

       Speech recognition has glamour. Funds have been available. Results
       have been less glamorous ... General-purpose speech recognition
       seems far away. Special-purpose speech recognition is severely
       limited. It would seem appropriate for people to ask themselves why
       they are working in the field and what they can expect to
       accomplish ...
       ... These considerations lead us to believe that a general phonetic
       typewriter is simply impossible unless the typewriter has an
       intelligence and a knowledge of language comparable to those of a
       native speaker of English ...
           —John Pierce, Bell Labs, 1969

     Whither Speech Recognition? (cont'd)
     - Killed ASR research at Bell Labs for many years.
     - Partially served as impetus for first (D)ARPA program (1971–1976) funding ASR research.
       - Goal: integrate speech knowledge, linguistics, and AI to make a breakthrough in ASR.
       - Large vocabulary: 1000 words.
       - Speed: a few times real time.

     Knowledge-Driven or Data-Driven?
     - Knowledge-driven.
       - People know stuff about speech, language: e.g., linguistics, (acoustic) phonetics, semantics.
       - Hand-derived rules.
       - Use expert systems, AI to integrate knowledge.
     - Data-driven.
       - Ignore what we think we know.
       - Build dumb systems that work well if fed lots of data.
       - Train parameters statistically.

  7. The ARPA Speech Understanding Project
     [Bar chart: accuracy (0–100) of the SDC, HWIM, Hearsay, and Harpy systems.]
     * Each system graded on a different domain.

     The Birth of Modern ASR: 1970–1980's

       Every time I fire a linguist, the performance of the speech
       recognizer goes up.
           —Fred Jelinek, IBM, 1985(?)

     - Ignore (almost) everything we know about phonetics, linguistics.
     - View speech recognition as ...
       - Finding the most probable word sequence given the audio.
     - Train probabilities automatically w/ transcribed speech.

     The Birth of Modern ASR: 1970–1980's (cont'd)
     - Many key algorithms developed/refined.
       - Expectation-maximization algorithm; n-gram models; Gaussian mixtures; hidden Markov models; Viterbi decoding; etc.
     - Computing power still catching up to algorithms.
       - First real-time dictation system built in 1984 (IBM).
       - Specialized hardware ≈ 60 MHz Pentium.

     The Golden Years: 1990's–now

                              1984      now
       CPU speed              60 MHz    3 GHz
       training data          < 10h     10000h+
       output distributions   GMM*      GMM
       sequence modeling      HMM       HMM
       language models        n-gram    n-gram
       * Actually, 1989.

     - Basic algorithms have remained the same.
     - Bulk of performance gain due to more data, faster CPU's.
     - Significant advances in adaptation, discriminative training.
     - New technologies (e.g., Deep Belief Networks) on the cusp of adoption.
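The data-driven view above ("finding the most probable word sequence given the audio") is conventionally written as the Bayes decision rule; this equation does not appear on the slide itself, but is the standard formulation, with W a candidate word sequence and A the acoustic observations:

```latex
% Standard ASR decoding rule: choose the word sequence W that is most
% probable given the observed audio A. P(A) does not depend on W, so it
% drops out of the argmax.
\hat{W} = \operatorname*{argmax}_{W} P(W \mid A)
        = \operatorname*{argmax}_{W} \frac{P(A \mid W)\, P(W)}{P(A)}
        = \operatorname*{argmax}_{W} P(A \mid W)\, P(W)
```

Here P(A | W) is the acoustic model and P(W) the language model; both are the probabilities the slides describe as "trained automatically w/ transcribed speech" (e.g., GMM/HMM acoustic models and n-gram language models in this course).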
