

  1. IN5550: Neural Methods in Natural Language Processing
     Recurrent Neural Networks
     Lilja Øvrelid & Stephan Oepen, University of Oslo, March 14, 2019

  2. Obligatory assignment 3
     ◮ (Sentence-level) sentiment analysis with CNNs
       1. Baseline: architecture of Zhang & Wallace (2017)
       2. Tuning of hyperparameters
       3. The influence of word embeddings
       4. Theoretical assignment: summarize a research paper
     ◮ Data set: Stanford Sentiment Treebank (Socher et al., 2013)

  3. Sentiment Analysis
     ◮ Sentiment: attitudes, emotions, opinions
     ◮ Subjective language
     ◮ Sentiment analysis: automatically characterize the sentiment content of a text unit
     ◮ Performed at different levels of granularity:
       ◮ document
       ◮ sentence
       ◮ sub-sentence (aspect-based)

  4. Stanford Sentiment Treebank
     ◮ 11,855 sentences from movie reviews
     ◮ Parsed using a syntactic parser (the Stanford parser)
     ◮ 215,154 unique phrases, annotated by 3 annotators
     ◮ Sentiment compositionality: how the sentiment of a phrase is composed from that of its parts

  5. Crowdsourcing annotation
     ◮ Amazon Mechanical Turk: a crowdsourcing platform where requesters pay workers who help them with tasks that require human intelligence
     ◮ Used in NLP for a range of annotation tasks:
       ◮ translation
       ◮ summarization
       ◮ information extraction
       ◮ document relevance
       ◮ figure captions
       ◮ labeling sentiment, intent, style

  6. Crowdsourcing annotation

  7. SST in this course
     ◮ Subset of the original SST
     ◮ Only sentence-level sentiment annotation
     ◮ Split into training (6,500 sentences), development (800 sentences), and a secret held-out test set for final evaluation
     ◮ Neutral sentences excluded: binary positive/negative distinction
     ◮ Example entry: 7290  143658  negative  Alternative medicine obviously merits ... but Ayurveda does the field no favors .

  8. In conclusion: CNN pros and cons
     ◮ Can learn to represent large n-grams efficiently, without blowing up the parameter space and without having to represent the whole vocabulary (parameter sharing)
     ◮ Easily parallelizable: each 'region' that a convolutional filter operates on is independent of the others, so the entire input can be processed concurrently (and each filter is likewise independent)
     ◮ The cost is that convolutions must be stacked into deep layers in order to 'view' the entire input, and those layers are computed sequentially
     ◮ Not designed for modeling sequential language data: offers no very natural way of modeling long-range and structured dependencies

  9. But Language is So Rich in Structure
     Example sentence: A similar technique is almost impossible to apply to other crops.
     http://mrp.nlpl.eu/index.php?page=2

  10. Okay, Maybe Start with Somewhat Simpler Structures
      [Figure: dependency tree and part-of-speech tags for the example sentence]
      A similar technique is almost impossible to apply to other crops .
      Dependency relations: root, punct, nsubj, obl, det, cop, ccomp, case, amod, advmod, mark, amod
      POS tags: DET ADJ NOUN AUX ADV ADJ PART VERB ADP ADJ NOUN PUNCT
      http://epe.nlpl.eu/index.php?page=1

  11. Recurrent Neural Networks in the Abstract
      ◮ Recurrent Neural Networks (RNNs) take variable-length sequences as input
      ◮ are highly sensitive to linear order; need not make any Markov assumptions
      ◮ map an input sequence x_{1:n} to an output sequence y_{1:n}
      ◮ internal state sequence s_{1:n} serves as 'history'

      RNN(x_{1:n}, s_0) = y_{1:n}
      s_i = R(s_{i-1}, x_i)
      y_i = O(s_i)
      x_i ∈ R^{d_x};  y_i ∈ R^{d_y};  s_i ∈ R^{f(d_y)}
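
To make the abstraction concrete, here is a minimal Python sketch (my own, not from the slides) that unrolls an abstract RNN given arbitrary R and O functions; the names rnn, R, O, and s0 are illustrative only.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

X = TypeVar("X")   # input type (e.g. an embedding vector)
S = TypeVar("S")   # state type
Y = TypeVar("Y")   # output type

def rnn(xs: Sequence[X],
        s0: S,
        R: Callable[[S, X], S],
        O: Callable[[S], Y]) -> Tuple[List[Y], S]:
    """Unroll an abstract RNN: s_i = R(s_{i-1}, x_i), y_i = O(s_i)."""
    s = s0
    ys = []
    for x in xs:          # strictly left to right: each state sees the full history
        s = R(s, x)
        ys.append(O(s))
    return ys, s          # all outputs plus the final state
```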

  12. Still High-Level: The RNN Abstraction Unrolled
      ◮ Each state s_i and output y_i depend on the full previous context, e.g.
        s_4 = R(R(R(R(s_0, x_1), x_2), x_3), x_4)
      ◮ Functions R(·) and O(·) are shared across time points; fewer parameters

  13. Implementing the RNN Abstraction
      ◮ We have yet to define the nature of the R(·) and O(·) functions
      ◮ RNNs are actually a family of architectures; much variation in R(·)

      Arguably the Most Basic RNN Implementation
      s_i = R(s_{i-1}, x_i) = s_{i-1} + x_i
      y_i = O(s_i) = s_i

      ◮ Does this maybe look familiar? Merely a continuous bag of words
      ◮ order-insensitive: Cisco acquired Tandberg ≡ Tandberg acquired Cisco
      ◮ actually has no parameters of its own: θ = {}; thus, no learning ability
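
As a quick illustration (again my own sketch, with toy NumPy embeddings), plugging in R(s, x) = s + x gives exactly this continuous-bag-of-words 'RNN'; swapping the word order leaves the final state unchanged.

```python
import numpy as np

# Toy word embeddings (illustrative values only).
emb = {"Cisco": np.array([1.0, 0.0]),
       "acquired": np.array([0.0, 1.0]),
       "Tandberg": np.array([2.0, 2.0])}

def cbow_rnn(words, s0=np.zeros(2)):
    """'RNN' with R(s, x) = s + x and O(s) = s: just a running sum."""
    s = s0
    for w in words:
        s = s + emb[w]
    return s

# The final state ignores word order entirely:
print(cbow_rnn(["Cisco", "acquired", "Tandberg"]))   # [3. 3.]
print(cbow_rnn(["Tandberg", "acquired", "Cisco"]))   # [3. 3.]
```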

  14. The 'Simple' RNN (Elman, 1990)
      ◮ Want to learn the dependencies between elements of the sequence
      ◮ the nature of the R(·) function needs to be determined during training

      The Elman RNN
      s_i = R(s_{i-1}, x_i) = g(s_{i-1} W^s + x_i W^x + b)
      y_i = O(s_i) = s_i
      x_i ∈ R^{d_x};  s_i, y_i ∈ R^{d_s};  W^x ∈ R^{d_x × d_s};  W^s ∈ R^{d_s × d_s};  b ∈ R^{d_s}

      ◮ Linear transformations of states and inputs; non-linear activation g
      ◮ alternative, equivalent definition of R(·): s_i = g([s_{i-1}; x_i] W + b)
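
A minimal NumPy sketch of the Elman cell, assuming tanh as the non-linearity g and row-vector inputs; the concrete sizes, the random initialization, and the name elman_step are mine, the dimension names follow the slide.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x, d_s = 4, 3                        # illustrative dimensions

# Parameters to be learned during training (here: random initialization).
W_x = rng.normal(scale=0.1, size=(d_x, d_s))
W_s = rng.normal(scale=0.1, size=(d_s, d_s))
b = np.zeros(d_s)

def elman_step(s_prev, x):
    """s_i = g(s_{i-1} W^s + x_i W^x + b), with g = tanh."""
    return np.tanh(s_prev @ W_s + x @ W_x + b)

# Run over a toy sequence of n = 5 input vectors; y_i = s_i.
xs = rng.normal(size=(5, d_x))
s = np.zeros(d_s)
for x in xs:
    s = elman_step(s, x)
print(s)        # final state: a fixed-size encoding of the whole sequence
```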

  15. Training Recurrent Neural Networks
      ◮ Embed the RNN in an end-to-end task, e.g. classification from the output states y_i
      ◮ standard loss functions, backpropagation, and optimizers apply; backpropagation through the unrolled network is known as backpropagation through time (BPTT)
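
For concreteness, a hedged PyTorch sketch of such end-to-end training (the framework choice, dimensions, and the random toy data are my assumptions, not from the slides): the loss is computed from the per-token output states, and loss.backward() propagates gradients through the unrolled graph, i.e. BPTT.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
d_x, d_s, n_classes, n = 8, 16, 5, 10           # illustrative sizes

rnn = nn.RNN(d_x, d_s, batch_first=True)        # Elman RNN: tanh(s_{i-1} W^s + x_i W^x + b)
clf = nn.Linear(d_s, n_classes)                 # classification from each output state y_i
loss_fn = nn.CrossEntropyLoss()
optim = torch.optim.Adam(list(rnn.parameters()) + list(clf.parameters()), lr=1e-3)

# A toy batch: 4 sequences of length n with random inputs and token-level gold labels.
x = torch.randn(4, n, d_x)
gold = torch.randint(0, n_classes, (4, n))

for step in range(100):
    outputs, _ = rnn(x)                         # (4, n, d_s): one state per time step
    logits = clf(outputs)                       # (4, n, n_classes)
    loss = loss_fn(logits.reshape(-1, n_classes), gold.reshape(-1))
    optim.zero_grad()
    loss.backward()                             # gradients flow through all time steps (BPTT)
    optim.step()
```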

  16. An Alternate Training Regime
      ◮ Focus on the final output state: y_n as an encoding of the full sequence x_{1:n}
      ◮ looking familiar? map a variable-length sequence to a fixed-size vector
      ◮ sentence-level classification; or as input to a conditioned generator
      ◮ aka sequence-to-sequence model; e.g. translation or summarization

  17. Unrolled RNNs, in a Sense, are Very Deep MLPs
      s_i = R(s_{i-1}, x_i) = g(s_{i-1} W^s + x_i W^x + b)
          = g(g(s_{i-2} W^s + x_{i-1} W^x + b) W^s + x_i W^x + b)
      ◮ W^s, W^x shared across all layers → exploding or vanishing gradients
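
A small numeric illustration of why sharing W^s across time steps causes trouble (my own sketch): backpropagating through many steps multiplies the gradient by essentially the same recurrent Jacobian over and over, so its norm shrinks or grows geometrically depending on the scale of W^s.

```python
import numpy as np

rng = np.random.default_rng(1)
d_s = 50

def gradient_norms(scale, steps=60):
    """Norm of a gradient vector repeatedly multiplied by the (shared) recurrent Jacobian."""
    W_s = rng.normal(scale=scale / np.sqrt(d_s), size=(d_s, d_s))
    g = np.ones(d_s) / np.sqrt(d_s)
    norms = []
    for _ in range(steps):
        g = g @ W_s.T            # ignoring the activation's derivative for simplicity
        norms.append(np.linalg.norm(g))
    return norms

print(gradient_norms(scale=0.5)[-1])   # tends toward 0: vanishing gradient
print(gradient_norms(scale=2.0)[-1])   # grows enormous: exploding gradient
```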

  18. Variants: Bi-Directional Recurrent Networks
      ◮ Capture the full left and right context: 'history' and 'future' for each x_i
      ◮ moderate increase in parameters (doubled); still linear-time computation

  19. Variants: 'Deep' (Stacked) Recurrent Networks

  20. RNNs as Feature Extractors

  21. A Note on Architecture Design
      "While it is not theoretically clear what is the additional power gained by the deeper architectures, it was observed empirically that deep RNNs work better than shallower ones on some tasks. [...] Many works report results using layered RNN architectures, but do not explicitly compare to one-layer RNNs. In the experiments of my research group, using two or more layers indeed often improves over using a single one." (Goldberg, 2017, p. 172)

  22. Common Applications of RNNs (in NLP)
      ◮ Acceptors, e.g. (sentence-level) sentiment classification:
        P(c = k | w_{1:n}) = ŷ[k]
        ŷ = softmax(MLP([RNN^f(x_{1:n})[n]; RNN^b(x_{n:1})[1]]))
        x_{1:n} = E[w_1], ..., E[w_n]
      ◮ Transducers, e.g. part-of-speech tagging:
        P(c_i = k | w_{1:n}) = softmax(MLP([RNN^f(x_{1:n})[i]; RNN^b(x_{n:1})[i]]))[k]
        x_i = [E[w_i]; RNN^f_c(c_{1:l_i}); RNN^b_c(c_{l_i:1})], where c_{1:l_i} are the l_i characters of w_i
      ◮ character-level RNNs are robust to unknown words; may capture affixation
      ◮ encoder-decoder (sequence-to-sequence) models coming before Easter
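
A hedged PyTorch sketch of the acceptor pattern (the course's actual framework and hyperparameters are not specified here; the vocabulary size, dimensions, and the class and variable names are illustrative): a bidirectional Elman RNN reads the embedded sentence, the final forward state and the 'first-position' backward state are concatenated, and an MLP maps the result to class scores.

```python
import torch
import torch.nn as nn

class RNNAcceptor(nn.Module):
    """Sentence classifier: softmax(MLP([RNN^f(x)[n]; RNN^b(x)[1]]))."""
    def __init__(self, vocab_size=10_000, emb_dim=100, hidden_dim=128, n_classes=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.RNN(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.mlp = nn.Sequential(nn.Linear(2 * hidden_dim, hidden_dim),
                                 nn.Tanh(),
                                 nn.Linear(hidden_dim, n_classes))

    def forward(self, word_ids):                    # word_ids: (batch, n)
        x = self.embed(word_ids)                    # (batch, n, emb_dim)
        _, h_n = self.rnn(x)                        # h_n: (2, batch, hidden_dim)
        both = torch.cat([h_n[0], h_n[1]], dim=-1)  # forward final state + backward state at position 1
        return self.mlp(both)                       # unnormalized class scores (logits)

model = RNNAcceptor()
logits = model(torch.randint(0, 10_000, (4, 12)))   # a batch of 4 'sentences' of length 12
probs = torch.softmax(logits, dim=-1)
```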

  23. Outlook: Automated Image Captioning
      ◮ Andrej Karpathy (2016): Connecting Images and Natural Language

  24. Sequence Labeling in Natural Language Processing
      ◮ Token-level class assignments in sequential context, aka tagging
      ◮ e.g. phoneme sequences, parts of speech; chunks, named entities, etc.
      ◮ some structure transcending individual tokens can be approximated

      Token    Michelle  Obama   visits  UiO    today  .
      POS      NNP       NNP     VBZ     NNP    RB     .
      Entity   PERS      PERS    —       ORG    —      —
      IOB      B-PERS    I-PERS  O       B-ORG  O      O
      BIOES    B-PERS    E-PERS  O       S-ORG  O      O

      ◮ The IOB (aka BIO) labeling scheme, and variants such as BIOES, encode groupings (see the decoder sketch below).
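
As an illustration of how IOB/BIO tags encode spans (my own sketch, not from the slides), a small decoder that recovers (label, start, end) chunks from a BIO-tagged sequence:

```python
def bio_to_spans(tags):
    """Decode BIO tags into (label, start, end_exclusive) spans."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):        # sentinel flushes the last open span
        if tag.startswith("B-") or tag == "O":
            if label is not None:
                spans.append((label, start, i))
                label = None
            if tag.startswith("B-"):
                start, label = i, tag[2:]
        elif tag.startswith("I-") and label != tag[2:]:
            # ill-formed I- without a matching B-: treat it as the start of a new span
            if label is not None:
                spans.append((label, start, i))
            start, label = i, tag[2:]
    return spans

tags = ["B-PERS", "I-PERS", "O", "B-ORG", "O", "O"]
print(bio_to_spans(tags))    # [('PERS', 0, 2), ('ORG', 3, 4)]
```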

  25. Constituent Parsing as Sequence Labeling (1:2)
      [Slide shows the first page of the paper:]
      Constituent Parsing as Sequence Labeling
      Carlos Gómez-Rodríguez and David Vilares
      Universidade da Coruña, FASTPARSE Lab, LyS Group, Departamento de Computación, A Coruña, Spain
      Abstract: We introduce a method to reduce constituent parsing to sequence labeling. For each word w_t, it generates a label that encodes: (1) the number of ancestors in the tree that the words w_t and w_{t+1} have in common, and (2) the non-terminal symbol at the lowest common ancestor. We first prove that the proposed encoding function is injective for any tree without unary branches. In practice, the approach is made extensible to all constituency trees by collapsing unary branches. We then use the PTB and CTB treebanks as testbeds [...]

  26. Constituent Parsing as Sequence Labeling (2:2)
