INF5820: Language Technological Applications
Applications of Recurrent Neural Networks
Stephan Oepen, University of Oslo
November 6, 2018
Most Recently: Variants of Recurrent Neural Networks

“To a first approximation, the de facto consensus in NLP in 2017 is that, no matter what the task, you throw a BiLSTM at it [...]”
Christopher Manning, March 2017 (https://simons.berkeley.edu/talks/christopher-manning-2017-3-27)

Looking around at EMNLP 2018 last week, that rule of thumb seems no less valid today.
Very High-Level: The RNN Abstraction, Unrolled

s_i = R(x_i, s_{i−1}) = g(s_{i−1} W^s + x_i W^x + b)
y_i = O(s_i) = s_i
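As a concrete illustration, here is a minimal NumPy sketch of the unrolled abstraction above, with g = tanh and the identity output function O; all names and dimensions are chosen for illustration only.

```python
import numpy as np

def rnn_unroll(xs, W_s, W_x, b, s_0):
    """Unroll the basic RNN abstraction over an input sequence.

    xs       : list of input vectors x_1 .. x_n
    W_s, W_x : state and input weight matrices, shared across all time points
    b        : bias vector
    s_0      : initial state
    """
    s = s_0
    ys = []
    for x in xs:                              # one application of R per time point
        s = np.tanh(s @ W_s + x @ W_x + b)    # s_i = R(x_i, s_{i-1}), with g = tanh
        ys.append(s)                          # y_i = O(s_i) = s_i
    return ys

# toy dimensions: 4-dimensional inputs, 3-dimensional state
rng = np.random.default_rng(0)
W_s, W_x, b = rng.normal(size=(3, 3)), rng.normal(size=(4, 3)), np.zeros(3)
outputs = rnn_unroll([rng.normal(size=4) for _ in range(5)], W_s, W_x, b, np.zeros(3))
```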
RNNs: Take-Home Messages for the Casual User
◮ State vectors s_i reflect the complete history up to time point i;
◮ RNNs are sensitive to (basic) natural language structure: sequences;
◮ applicable to inputs of indeterminate and, in principle, unbounded length;
◮ few parameters: the matrices W^s and W^x are shared across all time points;
◮ analogous to (potentially) deep nesting: repeated multiplications;
◮ near-crippling practical limitation: ‘exploding’ or ‘vanishing’ gradients;
→ gated RNNs: Hochreiter & Schmidhuber (1997) and Cho et al. (2014).
Essentials: Long Short-Term Memory RNNs (LSTMs)
◮ Three additional gates (vectors) modulate the flow of information;
◮ state vectors s_i are partitioned into memory cells and hidden state;
◮ forget gate f: how much of the previous memory to keep;
◮ input gate i: how much of the proposed update to apply;
◮ output gate o: what parts of the updated memory to output.

s_i = R(x_i, s_{i−1}) = [c_i; h_i]
c_i = f_i ⊙ c_{i−1} + i_i ⊙ z_i
z_i = tanh(x_i W^x + h_{i−1} W^h)
h_i = O(x_i, s_{i−1}) = o_i ⊙ tanh(c_i)

◮ More parameters: separate W^{x·} and W^{h·} matrices for each gate.
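A minimal NumPy sketch of one LSTM step following the equations above; the gate activations are assumed to be the standard sigmoid of x_i W^{x·} + h_{i−1} W^{h·} (the slide does not spell them out), and the four parameter blocks are stacked into single W_x, W_h, b matrices for brevity.

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(x, c_prev, h_prev, W_x, W_h, b):
    """One LSTM step: returns the new memory cell c_i and hidden state h_i,
    i.e. the two halves of s_i = [c_i; h_i].

    W_x : (input_dim, 4 * d), W_h : (d, 4 * d), b : (4 * d,) hold the separate
    parameters for the forget (f), input (i) and output (o) gates and the
    update candidate (z), stacked so that one multiplication computes all
    four pre-activations at once.
    """
    d = h_prev.shape[0]
    pre = x @ W_x + h_prev @ W_h + b
    f = sigmoid(pre[0 * d:1 * d])       # forget gate: how much old memory to keep
    i = sigmoid(pre[1 * d:2 * d])       # input gate: how much of the update to apply
    o = sigmoid(pre[2 * d:3 * d])       # output gate: what parts of memory to expose
    z = np.tanh(pre[3 * d:4 * d])       # proposed update
    c = f * c_prev + i * z              # c_i = f_i ⊙ c_{i-1} + i_i ⊙ z_i
    h = o * np.tanh(c)                  # h_i = o_i ⊙ tanh(c_i)
    return c, h
```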
Variants: Bi-Directional Recurrent Networks
Variants: ‘Deep’ (Stacked) Recurrent Networks
A Side Note: Beyond Sequential Structures

[First page of: Kai Sheng Tai, Richard Socher, and Christopher D. Manning. Improved Semantic Representations From Tree-Structured Long Short-Term Memory Networks. The paper generalizes the linear-chain LSTM to tree structures (the Tree-LSTM), motivated by the syntactic properties of natural language that combine words into phrases; the accompanying figure contrasts a chain-structured with a tree-structured LSTM.]
Tree LSTMs Help Leverage Syntactic Structure

[text]        A woman is slicing a tomato.
[hypothesis]  A vegetable is being cut by a woman.

Method                                     Pearson's r      Spearman's ρ     MSE
Illinois-LH (Lai and Hockenmaier, 2014)    0.7993           0.7538           0.3692
UNAL-NLP (Jimenez et al., 2014)            0.8070           0.7489           0.3550
Meaning Factory (Bjerva et al., 2014)      0.8268           0.7721           0.3224
ECNU (Zhao et al., 2014)                   0.8414           –                –
Mean vectors                               0.7577 (0.0013)  0.6738 (0.0027)  0.4557 (0.0090)
DT-RNN (Socher et al., 2014)               0.7923 (0.0070)  0.7319 (0.0071)  0.3822 (0.0137)
SDT-RNN (Socher et al., 2014)              0.7900 (0.0042)  0.7304 (0.0076)  0.3848 (0.0074)
LSTM                                       0.8528 (0.0031)  0.7911 (0.0059)  0.2831 (0.0092)
Bidirectional LSTM                         0.8567 (0.0028)  0.7966 (0.0053)  0.2736 (0.0063)
2-layer LSTM                               0.8515 (0.0066)  0.7896 (0.0088)  0.2838 (0.0150)
2-layer Bidirectional LSTM                 0.8558 (0.0014)  0.7965 (0.0018)  0.2762 (0.0020)
Constituency Tree-LSTM                     0.8582 (0.0038)  0.7966 (0.0053)  0.2734 (0.0108)
Dependency Tree-LSTM                       0.8676 (0.0030)  0.8083 (0.0042)  0.2532 (0.0052)

Table 3: Test set results on the SICK semantic relatedness subtask. For our experiments, we report mean scores over 5 runs (standard deviations in parentheses). Results are grouped as follows: (1) SemEval 2014 submissions; (2) our own baselines; (3) sequential LSTMs; (4) tree-structured LSTMs.
Common Applications of RNNs (in NLP)
◮ Acceptors, e.g. (sentence-level) sentiment classification:

ŷ = softmax(MLP([RNN^f(x_{1:n}); RNN^b(x_{n:1})]))
P(class = k | w_{1:n}) = ŷ[k]
x_{1:n} = E[w_1], ..., E[w_n]

◮ transducers, e.g. part-of-speech tagging:

P(t_i = k | w_{1:n}) = softmax(MLP([RNN^f(x_{1:n}, i); RNN^b(x_{n:1}, i)]))[k]
x_i = [E[w_i]; RNN^f_c(c_{1:l_i}); RNN^b_c(c_{l_i:1})]

◮ character-level RNNs robust to unknown words; may capture affixation;
◮ encoder–decoder (sequence-to-sequence) models coming next week.
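For orientation, a minimal PyTorch sketch of a BiLSTM acceptor along the lines of the first equation block; the class name, layer sizes, and the use of the final hidden states are illustrative choices, not the only possible design.

```python
import torch
import torch.nn as nn

class BiLSTMAcceptor(nn.Module):
    """Sentence-level classifier: embed the tokens, run a bidirectional LSTM,
    and feed the concatenated final forward and backward states into an MLP."""

    def __init__(self, vocab_size, emb_dim, hidden_dim, n_classes):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim,
                              bidirectional=True, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim),
                                 nn.Tanh(),
                                 nn.Linear(hidden_dim, n_classes))

    def forward(self, word_ids):                      # word_ids: (batch, n)
        x = self.embed(word_ids)                      # x_{1:n} = E[w_1], ..., E[w_n]
        outputs, (h_n, _) = self.bilstm(x)            # h_n: (2, batch, hidden_dim)
        final = torch.cat([h_n[0], h_n[1]], dim=-1)   # [RNN^f(x_{1:n}); RNN^b(x_{n:1})]
        return torch.log_softmax(self.mlp(final), dim=-1)
```

A transducer (e.g. a tagger) would instead apply the MLP to `outputs[:, i, :]` at every position i, and a character-level BiLSTM could supply part of each x_i.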
RNNs as Feature Extractors
Sequence Labeling in Natural Language Processing
◮ Token-level class assignments in sequential context, aka tagging;
◮ e.g. phoneme sequences, parts of speech, chunks, named entities, etc.;
◮ some structure transcending individual tokens can be approximated.

          Michelle   Obama     visits   UiO      today   .
POS:      NNP        NNP       VBZ      NNP      RB      .
Entities: PERS       PERS      —        ORG      —       —
IOB:      B-PERS     I-PERS    O        B-ORG    O       O
IOBES:    B-PERS     E-PERS    O        S-ORG    O       O

◮ the IOB (aka BIO) labeling scheme, and variants such as IOBES, encode span constraints in per-token labels (see the sketch below).
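A small sketch of the span-to-BIO encoding illustrated in the table; the function name and the (start, end, label) span format are assumptions for illustration.

```python
def spans_to_bio(tokens, spans):
    """Encode labelled spans as per-token BIO tags.

    tokens : e.g. ["Michelle", "Obama", "visits", "UiO", "today", "."]
    spans  : list of (start, end, label) with end exclusive,
             e.g. [(0, 2, "PERS"), (3, 4, "ORG")]
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label           # first token of the span
        for i in range(start + 1, end):
            tags[i] = "I-" + label           # remaining tokens inside the span
    return tags

# spans_to_bio(["Michelle", "Obama", "visits", "UiO", "today", "."],
#              [(0, 2, "PERS"), (3, 4, "ORG")])
# -> ['B-PERS', 'I-PERS', 'O', 'B-ORG', 'O', 'O']
```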
Reflections on Negation as a Tagging Task

[Figure: dependency tree over the example sentence, with three separate, overlapping negation annotations (cues and scopes) and the flattened per-token label sequence used by the tagger.]

“we have never gone out without keeping a sharp watch, and no one could have escaped our notice.”

◮ Sherlock (Lapponi et al., 2012, 2017) is still state of the art today;
◮ it ‘flattens out’ multiple, potentially overlapping negation instances;
◮ post-classification: heuristic reconstruction of the separate structures.
◮ To what degree is cue classification a sequence labeling problem?
Constituent Parsing as Sequence Labeling (1:2)

[First page of: Carlos Gómez-Rodríguez and David Vilares. Constituent Parsing as Sequence Labeling. FASTPARSE Lab, LyS Group, Departamento de Computación, Universidade da Coruña.]

Abstract: We introduce a method to reduce constituent parsing to sequence labeling. For each word w_t, it generates a label that encodes: (1) the number of ancestors in the tree that the words w_t and w_{t+1} have in common, and (2) the nonterminal symbol at the lowest common ancestor. We first prove that the proposed encoding function is injective for any tree without unary branches. In practice, the approach is made extensible to all constituency trees by collapsing unary branches. We then use the PTB and CTB treebanks as testbeds [...]
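As a rough sketch of the absolute variant of this encoding, assuming an nltk.Tree as input: for every word except the last, count the ancestors it shares with the next word and read off the nonterminal at their lowest common ancestor. The exact counting convention (whether the root is included), the relative variant, and the collapsing of unary branches described in the paper are left out here.

```python
from nltk import Tree

def encode(tree):
    """For each word w_t except the last, emit (n_t, c_t): the number of tree
    ancestors that w_t and w_{t+1} share, and the nonterminal at their lowest
    common ancestor."""
    leaves = tree.treepositions('leaves')       # positions of the words, left to right
    labels = []
    for left, right in zip(leaves, leaves[1:]):
        k = 0
        while left[k] == right[k]:              # longest common prefix of the two positions
            k += 1
        labels.append((k + 1,                   # shared ancestors, counting the root
                       tree[left[:k]].label())) # nonterminal at the lowest common ancestor
    return labels

# t = Tree.fromstring("(S (NP (DT the) (NN cat)) (VP (VBZ sleeps)))")
# encode(t) -> [(2, 'NP'), (1, 'S')]
```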
Constituent Parsing as Sequence Labeling (2:2)
Two Definitions of ‘Sequence Labeling’
◮ Hmm, actually, what exactly does one mean by sequence labeling?
◮ gentle definition: class predictions for all elements, in context;
◮ pointwise classification: each individual decision is independent;
◮ no (direct) model of wellformedness conditions on the class sequence;
◮ strict definition: sequence labeling performs structured prediction;
◮ search for a ‘globally’ optimal solution, e.g. the most probable sequence;
◮ models (properties of) the output sequence explicitly, e.g. class bi-grams;
◮ later time points impact earlier choices, i.e. revision of the path prefix;
◮ search techniques: dynamic programming (see the sketch below), beam search, re-ranking.
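Under the strict definition, a minimal NumPy sketch of Viterbi decoding: per-position label scores (e.g. from a BiLSTM transducer) are combined with label bi-gram scores, and dynamic programming recovers the globally best sequence. The score conventions (log-scores, additive combination) are illustrative assumptions.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Most probable label sequence under position scores plus label bi-gram scores.

    emissions   : (n, K) array, emissions[i, k] = log-score of label k at position i
    transitions : (K, K) array, transitions[j, k] = log-score of the bi-gram (j, k)
    """
    n, K = emissions.shape
    score = np.empty((n, K))
    backptr = np.zeros((n, K), dtype=int)
    score[0] = emissions[0]
    for i in range(1, n):
        # cand[j, k]: best path ending in label j at i-1, extended with label k at i
        cand = score[i - 1][:, None] + transitions + emissions[i][None, :]
        backptr[i] = cand.argmax(axis=0)
        score[i] = cand.max(axis=0)
    best = [int(score[-1].argmax())]            # best final label ...
    for i in range(n - 1, 0, -1):               # ... then follow the back-pointers
        best.append(int(backptr[i, best[-1]]))
    return best[::-1]
```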
Wanted: Sequence-Level Output Constraints