Training Procedure Formally

End-to-end training:
◮ Input
  ◮ Word embeddings: word → R^d
  ◮ Training data: pairs of <document, sentiment label>
◮ Output
  ◮ Parameter values: θ
◮ Standard training procedure (sketch below)
  ◮ Backpropagation
  ◮ Stochastic gradient descent

Test:
◮ Input
  ◮ Word embeddings: word → R^d
  ◮ Learned parameters: θ
  ◮ New data: <document>
◮ Output
  ◮ Prediction: <sentiment label>

[WFSA diagram: transition scores f_θ0(v) … f_θ3(v) over states s0–s4]

13 / 37
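To make the end-to-end recipe above concrete, here is a minimal PyTorch-style sketch, not the talk's actual implementation: `PatternScorer` is a hypothetical module that scores a document with one fixed-length soft pattern, and `train_pairs` stands in for the <document, sentiment label> training data (documents already mapped to word-vector matrices, labels as LongTensors of shape (1,)).

```python
import torch
import torch.nn as nn

class PatternScorer(nn.Module):
    """Scores a document with one fixed-length soft pattern, then maps the score to a sentiment logit."""
    def __init__(self, embed_dim=50, pattern_length=4):
        super().__init__()
        # one learnable transition vector theta_j per transition (states s0..s4 -> 4 transitions)
        self.theta = nn.Parameter(torch.randn(pattern_length, embed_dim))
        self.classifier = nn.Linear(1, 2)        # best pattern score -> {negative, positive}

    def forward(self, word_vectors):             # word_vectors: (num_words, embed_dim)
        n = self.theta.shape[0]
        windows = word_vectors.unfold(0, n, 1)   # (num_windows, embed_dim, n)
        scores = torch.einsum('wdn,nd->w', windows, self.theta)  # sum_j theta_j . v_j per window
        return self.classifier(scores.max().view(1, 1))          # max-pooling over windows

model = PatternScorer()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)          # stochastic gradient descent
loss_fn = nn.CrossEntropyLoss()

for word_vectors, label in train_pairs:          # label: LongTensor of shape (1,), 0 = negative, 1 = positive
    optimizer.zero_grad()
    loss = loss_fn(model(word_vectors), label)
    loss.backward()                               # backpropagation through the pattern scores
    optimizer.step()
```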
Benefits of Neural WFSAs 1: Informed Model Development

Fixed length:
  [WFSA: s0 → s1 → s2 → s3 → s4]   "such a great talk"

Self loops:
  [WFSA: s0 → s1 → s2 → s3 → s4, with a self-loop]   "such a great, wonderful, funny talk"

Epsilon transitions (scored by f_θε(), consuming no word):
  [WFSA: s0 → s1 → s2 → s3 → s4, with an ε-transition]   "such great shoes"

… [two further WFSA variants over states s0–s4]

14 / 37
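A hedged sketch of the three transition types above; the scoring functions are illustrative, not the exact parameterization. A main transition consumes a word and advances one state, a self-loop consumes a word without advancing (absorbing extra words such as "wonderful, funny"), and an ε-transition advances without consuming any word.

```python
import numpy as np

def main_transition_score(theta_j, v):
    """f_theta_j(v): consume word vector v and move from state s_j to s_{j+1}."""
    return float(theta_j @ v)

def self_loop_score(theta_j_loop, v):
    """Consume word vector v but stay in state s_j (handles extra words in the phrase)."""
    return float(theta_j_loop @ v)

def epsilon_score(theta_j_eps):
    """f_theta_eps(): move from s_j to s_{j+1} without consuming a word (a learned scalar)."""
    return float(theta_j_eps)
```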
Benefits of Neural WFSAs 2:

◮ They are neural
  ◮ Backpropagation
  ◮ Stochastic gradient descent
  ◮ PyTorch, TensorFlow, AllenNLP
◮ Coming up:
  ◮ Many deep models are mathematically equivalent to neural WFSAs
  ◮ A (new) joint framework
  ◮ Allows extension of these models

15 / 37
Overview

◮ Background: Weighted Finite-State Automata
◮ Neural Weighted Finite-State Automata
◮ Existing Deep Models as Weighted Finite-State Automata
◮ Case Study: Convolutional neural networks

16 / 37
Case Study: Convolutional Neural Networks (ConvNets)

A linear-kernel filter with max-pooling over the word vectors v1 … v7:

$S_\theta(v_{1:4}) = \sum_{j=1}^{4} \theta_j \cdot v_j$

where the θ_j are learnable parameters and the v_j are word vectors.

17 / 37
Proposition 1: ConvNet Filters are Computing WFSA Scores
Schwartz et al., ACL 2018

◮ $f_{\theta_j}(v) = \theta_j \cdot v$
◮ $s_\theta(v_{1:4}) = \sum_{j=1}^{4} f_{\theta_j}(v_j) = \sum_{j=1}^{4} (\theta_j \cdot v_j)$

[WFSA diagram: states s0–s4]

18 / 37
ConvNets are (Implicitly) Computing WFSA Scores!

ConvNet:      $S_\theta(v_{1:d}) = \sum_{j=1}^{d} (\theta_j \cdot v_j)$   (1)
Neural WFSA:  $s_\theta(v_{1:d}) = \sum_{j=1}^{d} (\theta_j \cdot v_j)$   (2)

Benefits:
◮ Interpret ConvNets
◮ Improve ConvNets

19 / 37
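A quick numerical check of the identity in (1) and (2), using NumPy with illustrative shapes: the score a linear-kernel ConvNet filter assigns to a window equals the WFSA path score that sums the per-transition scores θ_j · v_j.

```python
import numpy as np

rng = np.random.default_rng(0)
d, embed_dim = 4, 50                       # filter width d, word-vector dimension
theta = rng.normal(size=(d, embed_dim))    # filter weights = WFSA transition parameters
window = rng.normal(size=(d, embed_dim))   # word vectors v_1 .. v_d in one window

# ConvNet view: flatten the window and take one dot product with the filter
convnet_score = theta.reshape(-1) @ window.reshape(-1)

# WFSA view: one main transition per state, each scoring its own word
wfsa_score = sum(theta[j] @ window[j] for j in range(d))

assert np.isclose(convnet_score, wfsa_score)
```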
A ConvNet Learns a Fixed-Length Soft-Pattern!
Schwartz et al., ACL 2018

[WFSA diagram: states s0–s4]

◮ E.g., "such a great talk"
  ◮ what a great song
  ◮ such an awesome movie

20 / 37
Improving ConvNets: SoPa (Soft Patterns)
Schwartz et al., ACL 2018

◮ Language patterns are often of flexible length
  ◮ such a great talk
  ◮ such a great, funny, interesting talk
  ◮ such great shoes

Convolutional neural network: $S_\theta(v_{1:d}) = \sum_{j=1}^{d} (\theta_j \cdot v_j)$

Weighted finite-state automaton:
[WFSA diagram over states s0–s4: main transitions matching "such a great talk", a self-loop for "funny, interesting", and an ε-transition]

21 / 37
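A minimal sketch, assuming a max-plus (Viterbi-style) scoring pass, of matching a document against one flexible-length soft pattern with main transitions, self-loops, and ε-transitions. The parameter shapes and the handling of pattern start/end positions are simplifications, not SoPa's exact implementation.

```python
import numpy as np

NEG_INF = -np.inf

def best_span_score(word_vectors, theta_main, theta_self, theta_eps):
    """Best score of any document span against the pattern (max-plus Viterbi).

    word_vectors: (n_words, dim); theta_main, theta_self: (n_transitions, dim);
    theta_eps: (n_transitions,) learned scalar scores for epsilon transitions.
    """
    n_states = theta_main.shape[0] + 1
    best = np.full(n_states, NEG_INF)     # best[s] = best score of a path currently in state s
    best[0] = 0.0                         # the pattern may start before the first word
    result = NEG_INF
    for v in word_vectors:
        new = np.full(n_states, NEG_INF)
        new[0] = 0.0                      # ... or at any later position (sliding the pattern)
        for s in range(n_states - 1):
            if best[s] > NEG_INF:
                # main transition: consume the word, advance one state
                new[s + 1] = max(new[s + 1], best[s] + theta_main[s] @ v)
            if s > 0 and best[s] > NEG_INF:
                # self-loop: consume the word, stay in the same state
                new[s] = max(new[s], best[s] + theta_self[s] @ v)
        for s in range(n_states - 1):
            if new[s] > NEG_INF:
                # epsilon transition: advance one state without consuming a word
                new[s + 1] = max(new[s + 1], new[s] + theta_eps[s])
        best = new
        result = max(result, best[-1])    # max-pool over span end positions
    return float(result)
```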
Sentiment Analysis Experiments

Classify (positive / negative) from $v_{1:7} = v_1 \ldots v_7$, the word vectors of "I saw such a great talk today".

Sequence encoders:
◮ SoPa (ours)
◮ ConvNet
◮ LSTM

22 / 37
Sentiment Analysis Results
Schwartz et al., ACL 2018

[Plots: classification accuracy vs. number of training samples (100 / 1,000 / 10,000) on SST and Amazon, comparing SoPa (ours), ConvNet, and LSTM]

23 / 37
Interpreting SoPa Soft Patterns!

◮ For each learned pattern, extract the 4 top-scoring phrases in the training set (sketch below)

Pattern 1 — highest-scoring phrases:
  mesmerizing portrait of a
  engrossing portrait of a
  clear-eyed portrait of an
  fascinating portrait of a
  (shared pattern: "portrait of a …")

Pattern 2 — highest-scoring phrases:
  honest , and enjoyable
  forceful , and beautifully
  energetic , and surprisingly
  unpretentious , charming (SL) , quirky
  (shared pattern: "… , and …")

[Two WFSA diagrams over states s0–s4]

24 / 37
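A hedged sketch of the interpretation step referenced above: score every candidate phrase in the training set against each learned pattern and keep the 4 highest-scoring ones. It reuses `best_span_score` from the earlier sketch; `patterns` and `training_phrases` are hypothetical containers, not names from the paper.

```python
import heapq

def top_phrases_per_pattern(patterns, training_phrases, k=4):
    """patterns: name -> (theta_main, theta_self, theta_eps);
    training_phrases: list of (phrase_text, word_vector_matrix) pairs."""
    summary = {}
    for name, (theta_main, theta_self, theta_eps) in patterns.items():
        scored = [(best_span_score(vecs, theta_main, theta_self, theta_eps), text)
                  for text, vecs in training_phrases]
        summary[name] = heapq.nlargest(k, scored)   # the k top-scoring phrases for this pattern
    return summary
```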
[Summary diagram: a ConvNet, viewed as a WFSA over states s0–s4, extended into a more expressive WFSA with additional transitions — an interpretable, more robust "Convolutional Neural Network ++"]

25 / 37
Many Existing Deep Models are Neural WFSAs!
Peng, Schwartz et al., EMNLP 2018

◮ Six recent recurrent neural network (RNN) models are also implicitly computing WFSA scores:
  ◮ Mikolov et al., arXiv 2014
  ◮ Balduzzi and Ghifary, ICML 2016
  ◮ Lei et al., NAACL 2016
  ◮ Bradbury et al., ICLR 2017
  ◮ Foerster et al., ICML 2017
  ◮ Lei et al., EMNLP 2018

[WFSA diagrams with two, three, and four states]

26 / 37
Developing More Robust WFSA Models

Existing models correspond to small WFSAs:
◮ S2 [two states: s0, s1] and S3 [three states: s0, s1, s2]
  ◮ Mikolov et al. (2014); Lei et al. (2016); Balduzzi and Ghifary (2016); Bradbury et al. (2017); Lei et al. (2018)

New:
◮ S2,3 [three states: s0, s1, s2]
  ◮ Peng, Schwartz et al. (2018)

27 / 37
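For intuition, a hedged sketch of the kind of gated additive ("rational") recurrence shared by several of the models cited above (e.g., a QRNN-style f-pooling cell): the new cell state is an affine function of the previous one, which is what lets the unrolled computation decompose into WFSA path scores. Parameter names and shapes are illustrative.

```python
import numpy as np

def rational_recurrence(x, W, Wf, bf):
    """x: (T, d_in); W, Wf: (d_hidden, d_in); bf: (d_hidden,). Returns all cell states (T, d_hidden)."""
    c = np.zeros(W.shape[0])
    states = []
    for x_t in x:
        f = 1.0 / (1.0 + np.exp(-(Wf @ x_t + bf)))   # forget gate in (0, 1)
        c = f * c + (1.0 - f) * (W @ x_t)            # affine in the previous state c: WFSA-decomposable
        states.append(c.copy())
    return np.array(states)
```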
Sentiment Analysis Results
Peng, Schwartz et al., EMNLP 2018

[Results figure]

28 / 37
Language Modeling Results
Peng, Schwartz et al., EMNLP 2018

[Results figure — lower is better]

29 / 37
[WFSA diagram: states s0, s1]

Weighted Finite-State Automata:
  + widely studied
  + understandable
  + interpretable
  + informed model development
  − low performance

Deep Learning:
  + backpropagation
  + stochastic gradient descent
  + PyTorch, TensorFlow, AllenNLP
  + state-of-the-art
  − architecture engineering

30 / 37
Work in Progress 1: Are All Deep Models for NLP Equivalent to WFSAs?

◮ Elman RNN: $h_i = \sigma(W h_{i-1} + U v_i + b)$
◮ The interaction between $h_i$ and $h_{i-1}$ goes through an affine transformation followed by a nonlinearity
◮ The same holds for the LSTM
◮ So these models are most probably not equivalent to a WFSA

31 / 37
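For contrast with the rational recurrence sketched earlier, a one-function illustration of the Elman update from the slide: the nonlinearity wraps the previous hidden state itself, so $h_i$ is not an affine function of $h_{i-1}$ and the computation does not obviously factor into per-transition WFSA scores.

```python
import numpy as np

def elman_step(h_prev, v, W, U, b):
    # h_i = sigma(W h_{i-1} + U v_i + b): the nonlinearity wraps the previous hidden state
    return np.tanh(W @ h_prev + U @ v + b)
```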
Work in Progress 2: Automatic Model Development

◮ Deep learning: model engineering
◮ SoPa: informed model development
◮ New: automatic model development

[Diagrams: a sequence of increasingly complex WFSAs over states s0–s4]

32 / 37
Other Projects

Input: words ("I saw such a great talk today")
  ◮ Schwartz et al., EMNLP 2013; Schwartz et al., COLING 2014
Labeled datasets: <sentence, label> pairs
  ◮ Schwartz et al., ACL 2011; Schwartz et al., COLING 2012; Schwartz et al., CoNLL 2017; Gururangan et al., NAACL 2018; Kang et al., NAACL 2018; Zellers et al., EMNLP 2018
Word embeddings: v1 … v7
  ◮ Schwartz et al., CoNLL 2015; Rubinstein et al., ACL 2015; Schwartz et al., NAACL 2016; Vulić et al., CoNLL 2017; Peters et al., 2018
Sequence encoders: v1:7
  ◮ Schwartz et al., ACL 2018; Peng et al., EMNLP 2018; Liu et al., RepL4NLP 2018 *best paper award*
Output: prediction — classify (positive / negative)

33 / 37
Annotation Artifacts in NLP Datasets
Schwartz et al., CoNLL 2017; Gururangan, Swayamdipta, Levy, Schwartz et al., NAACL 2018

Textual entailment (state of the art: ~90% accuracy)
  Premise: "A person is running on the beach"
  Hypothesis: "The person is sleeping"
  Label: entailment? contradiction? neutral?

(AllenNLP demo)

◮ The word "sleeping" is over-represented in training examples labeled contradiction — an annotation artifact (detection sketch below)
◮ State-of-the-art models focus on this word rather than understanding the text
◮ Models are not as strong as we think they are

34 / 37
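A hedged sketch of how such artifacts can be surfaced: count how often each hypothesis word co-occurs with each label and flag words heavily skewed toward one label. The cited papers use PMI-style statistics; this simplified count is only illustrative, and `examples` is a hypothetical iterable of (hypothesis tokens, label) pairs.

```python
from collections import Counter, defaultdict

def label_skew(examples, min_count=20):
    """Return word -> (most frequent co-occurring label, share of that label)."""
    word_label = defaultdict(Counter)
    for tokens, label in examples:
        for w in set(tokens):
            word_label[w][label] += 1
    skew = {}
    for w, counts in word_label.items():
        total = sum(counts.values())
        if total >= min_count:
            label, n = counts.most_common(1)[0]
            skew[w] = (label, n / total)     # e.g., "sleeping" -> ("contradiction", 0.9)
    return skew
```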