Seminar C2NLU, Schloß Dagstuhl, Wadern, Germany, 24 January 2017

From Bayes Decision Rule to Neural Networks for Human Language Technology

Hermann Ney
with T. Alkhouli, P. Bahar, K. Irie, J.-T. Peter
Human Language Technology and Pattern Recognition
RWTH Aachen University, Aachen, Germany
IEEE Distinguished Lecturer 2016/17
Human Language Technology (HLT)

[Figure: three example panels]
– Automatic Speech Recognition (ASR): speech signal → "we want to preserve this great idea"
– Statistical Machine Translation (SMT): "wir wollen diese große Idee erhalten" → "we want to preserve this great idea"
– Handwriting Recognition (Text Image Recognition): text image → "we want to preserve this great idea"

tasks:
– speech recognition
– machine translation
– handwriting recognition
– sign language
Human Language Technology: Speech and Language

characteristic properties:
• well-defined 'classification' tasks:
  – due to the 5000-year history of (written!) language
  – well-defined goal: characters or words (= full forms) of the language
• easy task for humans (in their native language!)
• hard task for computers (as the last 50 years have shown!)

unifying view:
• formal task: input string → output string
• output string: string of words/characters in a natural language
• models of context and dependencies between input and output strings:
  – within the input and output strings
  – across the input and output strings
• abstract view of language understanding (?): mapping: natural language → formal language
ASR: what is the problem?
– ambiguities at all levels
– interdependencies of decisions

approach:
– score hypotheses
– probabilistic framework
– statistical decision theory (CMU and IBM 1975; Bahl & Jelinek+ 1983)

various terminologies:
– pattern recognition
– statistical learning
– connectionism
– machine learning

important: string context!

[Figure: ASR architecture as search over knowledge sources –
speech signal → acoustic analysis → segmentation and phoneme classification (phoneme models) → phoneme hypotheses → word boundary detection and lexical access (pronunciation lexicon) → word hypotheses → syntactic and semantic analysis (language model) → sentence hypotheses → search: interaction of knowledge sources → recognized sentence]
Machine Translation

• interaction between three models (or knowledge sources):
  – alignment model
  – lexicon model
  – language model
• handle interdependencies, ambiguities and conflicts by the Bayes decision rule, as for speech recognition

[Figure: MT architecture as generation over knowledge sources –
sentence in source language → word position re-ordering (alignment model) → alignment hypotheses → bilingual lexical choice (lexicon) → word+position hypotheses → syntactic and semantic analysis (language model) → sentence hypotheses → generation: interaction of knowledge sources → sentence in target language]
Bayes Decision Rule

1) performance measure or loss function (e.g. edit or Levenshtein distance) between the correct output sequence W and a hypothesized output sequence W̃: L[W, W̃]

2) probabilistic dependence pr(W | X) between input string X = x_1 ... x_t ... x_T and output string W = w_1 ... w_n ... w_N (e.g. the empirical distribution of a representative sample)

3) optimum performance: the Bayes decision rule minimizes the expected loss:

    X → Ŵ(X) := argmin_{W̃} { Σ_W pr(W | X) · L[W, W̃] }

Under these two conditions:
– L[W, W̃] satisfies the triangle inequality
– max_W { pr(W | X) } > 0.5

we obtain the MAP rule (MAP = maximum a posteriori) [Schlüter & Nussbaum+ 12]:

    X → Ŵ(X) := argmax_W { pr(W | X) }

Since [Bahl & Jelinek+ 83], this simplified Bayes decision rule has been widely used for speech recognition, handwriting recognition, machine translation, ...
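A minimal Python sketch of the MAP rule on a toy posterior distribution; the hypotheses and probabilities below are made up purely to illustrate the argmax and are not part of the original slides:

# Toy illustration of the MAP decision rule: pick the output string
# with the highest posterior probability given the input X.
posterior = {
    ("we", "want", "to", "preserve", "this", "great", "idea"): 0.62,
    ("we", "want", "to", "reserve", "this", "great", "idea"): 0.25,
    ("we", "want", "to", "preserve", "this", "grey", "tidier"): 0.13,
}

def map_decision(posterior):
    # argmax_W pr(W | X)
    return max(posterior, key=posterior.get)

print(" ".join(map_decision(posterior)))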
Statistical Approach to HLT Tasks

[Figure: block diagram of the statistical approach –
the performance measure (loss function) and the probabilistic models feed into the training criterion; training data and an efficient optimization algorithm yield the parameter estimates; these enter the Bayes decision rule (efficient algorithm), which maps test data to the output; the output is scored in the evaluation]
Statistical Approach and Machine Learning

four ingredients:
• performance measure: error measure (e.g. edit distance)
  we have to decide how to judge the quality of the system output
  (ASR: edit distance; SMT: edit distance + block movements)
• probabilistic models with suitable structures: to capture the dependencies within and between input and output strings
  – elementary observations: Gaussian mixtures, log-linear models, support vector machines (SVM), multi-layer perceptrons (MLP), ...
  – strings: n-gram Markov chains, CRFs, hidden Markov models (HMM), recurrent neural nets (RNN), LSTM-RNN, CTC, ANN-based attention models, ...
• training criterion:
  – ideally should be linked to the performance criterion (end-to-end training)
  two important issues:
  – what is a suitable training criterion?
  – what is a suitable optimization strategy?
• Bayes decision rule: to generate the output word sequence
  – combinatorial problem (efficient algorithms)
  – should exploit the structure of the models
  examples: dynamic programming and beam search, A* and heuristic search, ...
HLT and Neural Networks

• acoustic modelling
• language modelling (for ASR and SMT)
• machine translation
History: ANN in Acoustic Modelling

• 1988 [Waibel & Hanazawa+ 88]: phoneme recognition using time-delay neural networks (using CNNs!)
• 1989 [Bridle 89]: softmax operation for probability normalization in the output layer
• 1990 [Bourlard & Wellekens 90]:
  – for the squared error criterion, ANN outputs can be interpreted as class posterior probabilities (rediscovered: Patterson & Womack 1966)
  – they advocated the use of MLP outputs to replace the emission probabilities in HMMs
• 1993 [Haffner 93]: sum over label-sequence posterior probabilities in hybrid HMMs
• 1994 [Robinson 94]: recurrent neural network
  – competitive results on the WSJ task
  – his work remained a singularity in ASR

first clear improvements over the state of the art:
– 2008 handwriting: Graves, using LSTM-RNN and CTC
– 2011 speech: Hinton & Li Deng, using a deep FF MLP in a hybrid HMM
– more ...
What is Different Now, after 25 Years?

feedforward neural network (FF-NN; multi-layer perceptron, MLP):
– operations: matrix · vector
– nonlinear activation function
– ANN outputs: probability estimates

comparison for ASR, today vs. 1989-1994:
• number of hidden layers: 10 (or more) rather than 2-3
• number of output nodes (phonetic labels): 5000 rather than 50
• optimization strategy: practical experience and heuristics, e.g. layer-by-layer pretraining
• much more computing power

overall result:
– huge improvement by ANNs
– WER is (nearly) halved!!
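A minimal NumPy sketch of the feedforward pass described above (matrix-vector products, a nonlinear activation, and a softmax output layer giving probability estimates); the layer sizes and random weights are placeholders, not those of any actual system:

import numpy as np

def softmax(z):
    z = z - z.max()                      # numerical stability
    e = np.exp(z)
    return e / e.sum()

rng = np.random.default_rng(0)
layer_sizes = [40, 512, 512, 5000]       # e.g. acoustic features -> phonetic labels
weights = [rng.standard_normal((m, n)) * 0.01
           for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

def forward(x):
    # hidden layers: matrix * vector followed by a nonlinear activation
    for W in weights[:-1]:
        x = np.tanh(W @ x)
    # output layer: softmax yields label posterior estimates
    return softmax(weights[-1] @ x)

p = forward(rng.standard_normal(layer_sizes[0]))
print(p.shape, p.sum())                  # (5000,) and a total of 1.0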
Recurrent Neural Network: String Processing

principle for string processing over time t = 1, ..., T:
– introduce a memory (or context) component to keep track of the history
– result: there are two types of input: the memory h_{t-1} and the observation x_t

[Figure: recurrent network unrolled over the input string]

extensions:
– bidirectional variant [Schuster & Paliwal 1997]
– feedback of output labels
– long short-term memory [Hochreiter & Schmidhuber 97; Gers & Schraudolph+ 02]
– deep hidden layers
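A small sketch of this recurrence: the new hidden state is computed from the previous memory h_{t-1} and the current observation x_t. The dimensions and random weights are arbitrary placeholders for illustration:

import numpy as np

rng = np.random.default_rng(0)
d_in, d_hid = 40, 128                        # placeholder dimensions
W_x = rng.standard_normal((d_hid, d_in)) * 0.01
W_h = rng.standard_normal((d_hid, d_hid)) * 0.01
b = np.zeros(d_hid)

def rnn_step(h_prev, x_t):
    # two types of input at every time step: memory h_{t-1} and observation x_t
    return np.tanh(W_x @ x_t + W_h @ h_prev + b)

h = np.zeros(d_hid)
for x_t in rng.standard_normal((7, d_in)):   # a string of T = 7 observations
    h = rnn_step(h, x_t)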
Recurrent Neural Network: Details of Long Short-Term Memory

[Figure: LSTM cell with input gate, forget gate, output gate, and tanh net input]

ingredients:
– separate memory vector c_t in addition to h_t
– use of gates to control the information flow
– (additional) effect: makes backpropagation more robust
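The standard LSTM update behind the figure, as a small NumPy sketch; the weight shapes are placeholders, and variants such as peephole connections are omitted:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(c_prev, h_prev, x_t, W, b):
    # W maps the concatenated [x_t, h_{t-1}] to the four pre-activations
    z = W @ np.concatenate([x_t, h_prev]) + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)   # input, forget, output gates
    g = np.tanh(g)                                 # transformed net input
    c = f * c_prev + i * g                         # separate memory vector c_t
    h = o * np.tanh(c)                             # hidden output h_t
    return c, h

d_in, d_hid = 40, 128                              # placeholder dimensions
rng = np.random.default_rng(0)
W = rng.standard_normal((4 * d_hid, d_in + d_hid)) * 0.01
b = np.zeros(4 * d_hid)
c, h = np.zeros(d_hid), np.zeros(d_hid)
c, h = lstm_step(c, h, rng.standard_normal(d_in), W, b)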
Acoustic Modelling: HMM and ANN (CTC is similar [Graves & Fernandez+ 06])

– why HMM? mechanism for time alignment (or dynamic time warping)
– critical bottleneck: the emission probability model requires density estimation!
– hybrid approach: replace the HMM emission probabilities by label posterior probabilities, i.e. by the ANN output after suitable re-scaling

[Figure: HMM time alignment – states/labels (A, L, E, X) on the vertical axis, time on the horizontal axis]
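A minimal sketch of the "suitable re-scaling" in the hybrid approach: the ANN label posterior p(s | x_t) is divided by the label prior p(s), which gives a score proportional to the HMM emission probability p(x_t | s). The numbers below are toy values for illustration only:

import numpy as np

def scaled_emission_scores(posteriors, priors):
    # hybrid HMM/ANN: p(x_t | s) is proportional to p(s | x_t) / p(s),
    # so the ANN output can replace the emission probability after re-scaling
    return posteriors / priors

posteriors = np.array([0.70, 0.20, 0.10])   # ANN output p(s | x_t), toy values
priors     = np.array([0.50, 0.30, 0.20])   # label priors p(s) from the training data
print(scaled_emission_scores(posteriors, priors))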