Multitask Learning with Low-Level Auxiliary Tasks for Speech Recognition
Shubham Toshniwal, Hao Tang, Liang Lu, Karen Livescu
Toyota Technological Institute at Chicago
Conventional ASR Systems

• Traditional automatic speech recognition (ASR) systems are modular: different components of the system are trained separately.
• The components correspond to different levels of representation: frame-level states, phones, words, etc.

[Figure: the conventional pipeline, e.g. for "recognize speech" — speech → feature extraction → feature vectors → decoder (acoustic model, pronunciation dictionary, language model) → words.]
End-to-end ASR Models

• Neural end-to-end models for ASR have become viable and popular.
• End-to-end models are appealing because:
  • They are conceptually simple; all model parameters contribute to the same final goal.
  • They have produced impressive results in ASR (Zweig et al. 2016) as well as in other domains (Vinyals et al. 2015, Huang et al. 2016).

[Figure: an encoder-decoder model reads acoustic features x_1, ..., x_T into hidden states h_1, ..., h_T and emits "recognize speech" character by character, from GO to EOS.]
End-to-end Models: Cons

However, end-to-end models have some drawbacks as well:
• Optimization can be challenging.
• They ignore potentially useful domain-specific information about intermediate representations, as well as existing intermediate levels of supervision.
• Intermediate learned representations are hard to interpret, which makes the models harder to debug.
Motivation

• Analysis of some deep end-to-end models has found that different layers tend to specialize for different sub-tasks (Mohamed et al. 2012, Zeiler et al. 2014).
• Lower layers focus on lower-level representations and higher layers on higher-level representations.

[Figure: layer-wise visualization of a vision network — pixels at the input, edges at layer 1, parts of faces at layer 2, faces at layer 3.]
Motivation

• Can we encourage such intermediate representation learning more explicitly?
• Multitask learning: combine the final task loss (speech recognition) with losses corresponding to lower-level tasks (such as phonetic recognition) applied at lower layers (Søgaard et al. 2016).
Encoder-Decoder Model for Speech Recognition

• We use the attention-enabled encoder-decoder variant proposed by Chan et al. 2015.
• Speech encoder: a pyramidal bidirectional LSTM that
  (i) reads in acoustic features x = (x_1, ..., x_T), and
  (ii) outputs a sequence of high-level features (hidden states).
• Character decoder: attends to the high-level features generated by the encoder and outputs y = (y_1, ..., y_K).

[Figure: acoustic features x_1, ..., x_T feed the pyramidal encoder; the character decoder (CharDec, loss L_c) starts from GO and emits y_1, y_2, ....]

(A minimal code sketch of one pyramidal encoder layer follows.)
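The slides contain no code; the following is a minimal PyTorch sketch of a single pyramidal BLSTM encoder layer, in which pairs of adjacent frames are concatenated so that each layer halves the time resolution (as in Chan et al. 2015). The framework choice, class name, and tensor shapes are illustrative assumptions, not the authors' implementation; only the 256-unit hidden size comes from the slides.

```python
import torch
import torch.nn as nn


class PyramidalBLSTMLayer(nn.Module):
    """One layer of a pyramidal bidirectional LSTM (illustrative sketch).

    Concatenates each pair of consecutive input frames before the
    recurrence, halving the time resolution of its output.
    """

    def __init__(self, input_size: int, hidden_size: int = 256):
        super().__init__()
        # The input at each step is a pair of concatenated frames.
        self.blstm = nn.LSTM(2 * input_size, hidden_size,
                             bidirectional=True, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, feat); drop a trailing frame if time is odd.
        batch, time, feat = x.shape
        if time % 2 == 1:
            x = x[:, :-1, :]
            time -= 1
        # Merge adjacent frames: (batch, time // 2, 2 * feat).
        x = x.reshape(batch, time // 2, 2 * feat)
        out, _ = self.blstm(x)  # (batch, time // 2, 2 * hidden_size)
        return out
```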
Adding Phoneme Supervision

• Phoneme-level supervision is obtained using a pronunciation dictionary.
• We experiment with two types of sequence loss:
  (a) a phoneme decoder loss (L_p^Dec), and
  (b) a CTC loss (L_p^CTC).
• The training loss L is given by: L = (1/2)(L_c + L_p).

[Figure: the character decoder (CharDec, L_c) sits on the top encoder layer, while a phoneme decoder (PhoneDec, L_p^Dec) or a phoneme CTC head (PhoneCTC, L_p^CTC) is attached to a lower encoder layer and predicts the phoneme sequence z_1, z_2, ....]

(A code sketch of this combined loss follows.)
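A minimal sketch of the equally weighted multitask objective, again assuming PyTorch; the tensor names, shapes, and the choice of projecting an intermediate encoder layer's hidden states to phoneme logits are illustrative assumptions, not the paper's code. The phoneme decoder variant (L_p^Dec) would replace the CTC term with the cross entropy of a second attention decoder, analogous to L_c.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# CTC loss over phoneme labels; blank index 0 is an assumption.
ctc_loss = nn.CTCLoss(blank=0, zero_infinity=True)


def multitask_loss(char_logits, char_targets,      # character decoder outputs / labels
                   phone_logits, phone_targets,    # phoneme head on a lower encoder layer
                   enc_lengths, phone_lengths):
    # L_c: per-step cross entropy over the character vocabulary
    # (padded target positions are marked with -1 and ignored).
    L_c = F.cross_entropy(char_logits.reshape(-1, char_logits.size(-1)),
                          char_targets.reshape(-1), ignore_index=-1)

    # L_p (CTC variant): computed on an intermediate encoder layer's hidden
    # states, projected to phoneme logits of shape (batch, T', V_phone).
    log_probs = phone_logits.log_softmax(-1).transpose(0, 1)  # (T', batch, V_phone)
    L_p = ctc_loss(log_probs, phone_targets, enc_lengths, phone_lengths)

    # Equal weighting, as on the slide: L = (L_c + L_p) / 2.
    return 0.5 * (L_c + L_p)
```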
Adding Frame-Level Supervision

• We also experiment with frame-level state supervision.
• The training loss L is then: L = (1/3)(L_c + L_p + L_s).

[Figure: a frame-level state classifier (State, L_s) predicts states s_1, ..., s_T from a lower encoder layer, in addition to the phoneme heads and the character decoder.]

(A code sketch of the three-way loss follows.)
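In the same illustrative spirit (assumed PyTorch, not the paper's code), the frame-level state loss can be written as a per-frame cross entropy on a lower encoder layer, with the three terms averaged. Note that if the lower layers already subsample in time, the forced-alignment state targets would have to be subsampled to match; that detail is not specified on the slide.

```python
import torch.nn.functional as F


def frame_state_loss(state_logits, state_targets):
    # state_logits: (batch, frames, num_states), a projection of a lower
    # encoder layer's hidden states; state_targets: (batch, frames) state IDs
    # from a forced alignment, with -1 marking padded frames.
    return F.cross_entropy(state_logits.reshape(-1, state_logits.size(-1)),
                           state_targets.reshape(-1), ignore_index=-1)


def total_loss(L_c, L_p, L_s):
    # Equal weighting across the three levels of supervision,
    # as on the slide: L = (L_c + L_p + L_s) / 3.
    return (L_c + L_p + L_s) / 3.0
```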
Dataset & Model Details

Dataset:
• Switchboard corpus: 300 hours of conversational speech data.
• The standard training/development/test split is used.

Model:
• Speech encoder: 4-layer pyramidal bidirectional LSTM.
• Character decoder: 1-layer unidirectional LSTM.
• Both use 256 hidden units.

(An illustrative encoder stack with this configuration is sketched below.)
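As a usage illustration of the earlier encoder sketch, the configuration on this slide (4 pyramidal BLSTM layers, 256 hidden units) might be stacked as below, keeping each layer's outputs so that auxiliary losses can be attached to lower layers (e.g. PhoneDec-3 on layer 3, State-2 on layer 2). The 40-dimensional input features and the assumption that every layer subsamples are mine, not the slides'; `PyramidalBLSTMLayer` is the class from the earlier sketch.

```python
import torch
import torch.nn as nn


class SpeechEncoder(nn.Module):
    """4-layer pyramidal BLSTM speech encoder (illustrative sketch)."""

    def __init__(self, feat_dim: int = 40, hidden_size: int = 256, num_layers: int = 4):
        super().__init__()
        layers, in_dim = [], feat_dim
        for _ in range(num_layers):
            layers.append(PyramidalBLSTMLayer(in_dim, hidden_size))
            in_dim = 2 * hidden_size  # bidirectional output size
        self.layers = nn.ModuleList(layers)

    def forward(self, x: torch.Tensor):
        layer_states = []  # per-layer hidden states, for auxiliary losses
        for layer in self.layers:
            x = layer(x)
            layer_states.append(x)
        return x, layer_states  # top-layer states feed the character decoder
```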
Dev Results

Table 1: Character error rate (CER) and word error rate (WER) on development data.

Model                             Dev CER (%)   Dev WER (%)
Enc-Dec (baseline)                    14.6          26.0
Enc-Dec + PhoneCTC-3                  14.0          25.3
Enc-Dec + PhoneDec-3                  13.8          24.9
Enc-Dec + PhoneDec-4                  14.5          25.9
Enc-Dec + State-2                     13.6          24.1
Enc-Dec + PhoneDec-3 + State-2        13.4          24.1
Test Results

Table 2: WER (%) on test data for different end-to-end models.

Model                                         SWB    CHE    Full
CTC (Maas et al. 2015)                        38.0   56.1   47.1
Iterated CTC (Zweig et al. 2016)              24.7   37.1    —
Enc-Dec (Lu et al. 2016)                      27.3   48.2   37.8
Enc-Dec (word) + 3-gram (Lu et al. 2016)      25.8   46.0   36.0
Our models:
Enc-Dec (baseline)                            25.0   42.4   33.7
Enc-Dec + PhoneDec-3 + State-2                23.1   40.8   32.0
How does Multitask Learning help?

[Figure 1: Log-loss on training data (character loss L_c only) for different model variations.]