  1. IN4080 – 2020 FALL NATURAL LANGUAGE PROCESSING, Jan Tore Lønning

  2. Neural LMs, Recurrent networks, Sequence labeling, Information Extraction, Named-Entity Recognition, Evaluation. Lecture 13, 9 Nov.

  3. Today
     - Feedforward neural networks
     - Neural Language Models
     - Recurrent networks
     - Information Extraction
     - Named Entity Recognition
     - Evaluation

  4. Last week
     - Feedforward neural networks (partly recap)
       - Model
       - Training
       - Computational graphs
     - Neural Language Models
     - Recurrent networks
     - Information Extraction

  5. Neural NLP
     - (Multi-layered) neural networks
     - Example: Neural language model (k-gram)
       - P(w_i | w_{i-k}, ..., w_{i-1})
     - Using embeddings as word representations
       - Use embeddings for representing the w_i's
       - Use a neural network for estimating P(w_i | w_{i-k}, ..., w_{i-1})

  6. (Figure from J&M, 3. ed., 2019)

  7. Pretrained embeddings
     - The last slide uses pretrained embeddings
       - Trained with some method: SkipGram, CBOW, GloVe, ...
       - On some specific corpus
       - Can be downloaded from the web
     - Pretrained embeddings can also be the input to other tasks, e.g. text classification
     - The task of neural language modeling was also the basis for training the embeddings

  8. Training the embeddings
     - Alternatively, we may start with one-hot representations of words and train the embeddings as the first layer in our model (= the way we trained the embeddings)
     - If the goal is a task different from language modeling, this may result in embeddings better suited for the specific task
     - We may even use two sets of embeddings for each word: one pretrained and one trained during the task
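
     A minimal PyTorch sketch of these options (PyTorch and all concrete numbers here are illustrative assumptions, not part of the lecture): a frozen pretrained table, an embedding layer trained as part of the model, and the two concatenated per word.

       import torch
       import torch.nn as nn

       vocab_size, dim = 10_000, 50

       # Option 1: embeddings trained from scratch as the first layer of the model
       trained_emb = nn.Embedding(vocab_size, dim)                 # weights updated by backprop

       # Option 2: pretrained embeddings, kept fixed during training
       pretrained = torch.randn(vocab_size, dim)                   # stand-in for e.g. downloaded GloVe vectors
       frozen_emb = nn.Embedding.from_pretrained(pretrained, freeze=True)

       # Option 3: two sets of embeddings per word, concatenated into one 100-dim vector
       word_ids = torch.tensor([12, 431, 7])                       # three arbitrary word ids
       both = torch.cat([frozen_emb(word_ids), trained_emb(word_ids)], dim=-1)
       print(both.shape)                                           # torch.Size([3, 100])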

  9. Computational graph
     - e_i = E x_i (one embedding per context word, x_i a one-hot vector)
     - e = concat(e_1, e_2, e_3)
     - h = ReLU(W e + b[1])
     - z = U h + b[2]
     - ŷ = softmax(z)
     - This picture is for the case where we train the embeddings E; with pretrained embeddings we instead look up e_i in a table for each word
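
     A NumPy sketch of the forward computation in this graph (the toy dimensions and random weights are illustrative assumptions; the ReLU/softmax choices follow the formulas above):

       import numpy as np

       rng = np.random.default_rng(0)
       V, dim, k, hidden = 10_000, 50, 3, 128   # vocabulary size, embedding dim, context length, hidden units

       E = rng.normal(scale=0.1, size=(V, dim))          # embedding table (pretrained or trained)
       W = rng.normal(scale=0.1, size=(hidden, k * dim)) # hidden-layer weights
       b1 = np.zeros(hidden)
       U = rng.normal(scale=0.1, size=(V, hidden))       # output-layer weights
       b2 = np.zeros(V)

       def softmax(z):
           e = np.exp(z - z.max())
           return e / e.sum()

       def predict_next(context_ids):
           """P(w_i | w_{i-k} ... w_{i-1}) for a context of k word ids."""
           e = np.concatenate([E[j] for j in context_ids])  # look up and concatenate embeddings
           h = np.maximum(0, W @ e + b1)                    # hidden layer with ReLU
           return softmax(U @ h + b2)                       # distribution over the whole vocabulary

       probs = predict_next([12, 431, 7])   # three arbitrary word ids as context
       print(probs.shape, probs.sum())      # (10000,) and a sum of (approximately) 1.0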

  10. Recurrent networks

  11. Today
     - Feedforward neural networks
     - Recurrent networks
       - Model
       - Language Model
       - Sequence Labeling
       - Advanced architecture
     - Information Extraction
     - Named Entity Recognition
     - Evaluation

  12. Recurrent neural nets
     - Model sequences/temporal phenomena
     - A cell may send a signal back to itself, at the next moment in time
     - (Figure: the network, and its processing unrolled over time)
     - https://en.wikipedia.org/wiki/Recurrent_neural_network

  13. Forward
     - U, V and W are edges with weights (matrices)
     - x_1, x_2, ..., x_n is the input sequence
     - Forward: calculate h_i from h_{i-1} and x_i, and y_i from h_i, for i = 1, ..., n (starting from h_0)
     - From J&M, 3. ed., 2019

  14. Forward
     - h_t = g(U h_{t-1} + W x_t)
     - y_t = f(V h_t)
     - g and f are activation functions
     - (There are also bias terms, which we didn't include in the formulas)
     - From J&M, 3. ed., 2019
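
     A NumPy sketch of this forward pass (toy dimensions, random weights, and the choice of tanh for g and softmax for f are illustrative assumptions):

       import numpy as np

       rng = np.random.default_rng(0)
       d_in, d_h, d_out = 50, 64, 30      # input (embedding) size, hidden size, output size

       W = rng.normal(scale=0.1, size=(d_h, d_in))    # input -> hidden
       U = rng.normal(scale=0.1, size=(d_h, d_h))     # hidden -> hidden (the recurrent weights)
       V = rng.normal(scale=0.1, size=(d_out, d_h))   # hidden -> output

       def softmax(z):
           e = np.exp(z - z.max())
           return e / e.sum()

       def rnn_forward(xs):
           """xs: list of input vectors x_1..x_n. Returns all hidden states and outputs."""
           h = np.zeros(d_h)                          # h_0
           hs, ys = [], []
           for x in xs:
               h = np.tanh(U @ h + W @ x)             # h_t = g(U h_{t-1} + W x_t)
               hs.append(h)
               ys.append(softmax(V @ h))              # y_t = f(V h_t)
           return hs, ys

       xs = [rng.normal(size=d_in) for _ in range(5)] # a toy sequence of 5 "word embeddings"
       hs, ys = rnn_forward(xs)
       print(len(ys), ys[0].shape)                    # 5 (30,)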

  15. Training
     - At each output node: calculate the loss and the δ-term
     - Backpropagate the error, e.g. the δ-term at h_2 is calculated from the δ-term at h_3 (through U) and the δ-term at y_2 (through V)
     - Update:
       - V from the δ-terms at the y_i's
       - U and W from the δ-terms at the h_i's
     - From J&M, 3. ed., 2019

  16. Remark
     - J&M, 3. ed., 2019, sec. 9.1.2 explains at a high level how this can be done using vectors and matrices, OK
     - The formulas, however, are not correct: describing derivatives of matrices and vectors demands a little more care, e.g. one has to transpose matrices
     - It is beyond this course to explain this in detail
     - But you should be able to do the actual calculations if you stick to the entries of the vectors and matrices, as we did above (ch. 7)

  17. Today
     - Feedforward neural networks
     - Recurrent networks
       - Model
       - Language Model
       - Sequence Labeling
       - Advanced architecture
     - Information Extraction
     - Named Entity Recognition
     - Evaluation

  18. RNN Language model
     - ŷ = P(w_n | w_1, ..., w_{n-1}) = softmax(V h_n)
     - In principle: unlimited history, a word depends on all preceding words
     - The word w_i is represented by an embedding, or by a one-hot vector so that the embedding is made by the LM
     - From J&M, 3. ed., 2019

  19. Autoregressive generation
     - Generated by probabilities: choose each word in accordance with the probability distribution
     - Part of more complex models
       - Encoder-decoder models
       - Translation
     - From J&M, 3. ed., 2019
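
     A NumPy sketch of autoregressive generation with a toy, untrained RNN LM (vocabulary, weights and dimensions are made up): at each step the next word is sampled from softmax(V h_n) and fed back as the next input.

       import numpy as np

       rng = np.random.default_rng(1)
       vocab = ["<s>", "the", "cat", "sat", "on", "mat", "</s>"]
       V_size, dim, d_h = len(vocab), 16, 32

       E = rng.normal(scale=0.1, size=(V_size, dim))   # embeddings
       W = rng.normal(scale=0.1, size=(d_h, dim))
       U = rng.normal(scale=0.1, size=(d_h, d_h))
       V = rng.normal(scale=0.1, size=(V_size, d_h))   # output layer over the vocabulary

       def softmax(z):
           e = np.exp(z - z.max())
           return e / e.sum()

       def generate(max_len=10):
           h = np.zeros(d_h)
           word = vocab.index("<s>")
           out = []
           for _ in range(max_len):
               h = np.tanh(U @ h + W @ E[word])        # advance the RNN with the previous word
               probs = softmax(V @ h)                  # P(w | history)
               word = rng.choice(V_size, p=probs)      # sample the next word
               if vocab[word] == "</s>":
                   break
               out.append(vocab[word])
           return out

       print(generate())   # random-looking output, since the weights are untrained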

  20. Today
     - Feedforward neural networks
     - Recurrent networks
       - Model
       - Language Model
       - Sequence Labeling
       - Advanced architecture
     - Information Extraction
     - Named Entity Recognition
     - Evaluation

  21. Neural sequence labeling: tagging
     - ŷ = P(t_n | w_1, ..., w_n) = softmax(V h_n)
     - From J&M, 3. ed., 2019
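
     Compared with the LM sketch above, only the output layer changes: the softmax at each position is over a tagset instead of the vocabulary. A minimal NumPy illustration (tagset and dimensions are made up):

       import numpy as np

       rng = np.random.default_rng(2)
       tags = ["DT", "JJ", "NN", "VB", "IN"]
       dim, d_h = 16, 32

       W = rng.normal(scale=0.1, size=(d_h, dim))
       U = rng.normal(scale=0.1, size=(d_h, d_h))
       V = rng.normal(scale=0.1, size=(len(tags), d_h))   # output layer over the tagset

       def softmax(z):
           e = np.exp(z - z.max())
           return e / e.sum()

       def tag_sequence(word_embeddings):
           h = np.zeros(d_h)
           predicted = []
           for x in word_embeddings:
               h = np.tanh(U @ h + W @ x)                 # same recurrence as before
               probs = softmax(V @ h)                     # P(t_i | w_1 ... w_i)
               predicted.append(tags[int(np.argmax(probs))])
           return predicted

       sentence = [rng.normal(size=dim) for _ in range(4)]   # stand-ins for word embeddings
       print(tag_sequence(sentence))                         # arbitrary tags, since untrained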

  22. Sequence labeling
     - Actual models for sequence labeling, e.g. tagging, are more complex
     - For example, they may take the words after the tag into consideration

  23. Today
     - Feedforward neural networks
     - Recurrent networks
       - Model
       - Language Model
       - Sequence Labeling
       - Advanced architecture
     - Information Extraction
     - Named Entity Recognition
     - Evaluation

  24. Stacked RNN
     - Can yield better results than a single layer
     - Reason? Higher layers of abstraction, similar to image processing (convolutional nets)
     - From J&M, 3. ed., 2019

  25. Bidirectional RNN
     - Example: a tagger
     - Considers both preceding and following words
     - From J&M, 3. ed., 2019

  26. LSTM
     - Problems for plain RNNs:
       - Keeping track of distant information
       - Vanishing gradient: during backpropagation, going backwards through several layers, the gradient approaches 0
     - Long Short-Term Memory (LSTM):
       - An advanced architecture with additional layers and weights
       - We do not consider the details here
     - Bi-LSTM (bidirectional LSTM):
       - Popular standard architecture in NLP
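
     A minimal sketch of a stacked bidirectional LSTM encoder in PyTorch (the library choice and the sizes are assumptions, not something the lecture prescribes); a Bi-LSTM tagger would put a linear layer and softmax over the tagset on top of its per-token outputs:

       import torch
       import torch.nn as nn

       # 2 stacked layers, reading the sequence in both directions
       lstm = nn.LSTM(input_size=100, hidden_size=64, num_layers=2,
                      batch_first=True, bidirectional=True)

       x = torch.randn(8, 20, 100)        # batch of 8 sequences, 20 tokens, 100-dim embeddings
       out, (h_n, c_n) = lstm(x)

       # For each token: forward and backward hidden states concatenated (64 + 64 = 128)
       print(out.shape)                   # torch.Size([8, 20, 128])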

  27. Information extraction

  28. Today
     - Feedforward neural networks (partly recap)
     - Recurrent networks
     - Information extraction, IE
       - Chunking
       - Named Entity Recognition
       - Evaluation

  29. IE basics
     - "Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents." (Wikipedia)
     - Bottom-up approach: start with unrestricted texts, and do the best you can
     - The approach was in particular developed by the Message Understanding Conferences (MUC) in the 1990s
       - Select a particular domain and task

  30. A typical pipeline (figure from NLTK)
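
     The NLTK version of this pipeline can be run directly; a short sketch (the example sentence is made up, and the commented-out downloads may be needed once):

       import nltk

       # one-time downloads of the required models/corpora (names as of NLTK 3.x)
       # nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")
       # nltk.download("maxent_ne_chunker"); nltk.download("words")

       raw = "Jan Tore Lønning teaches IN4080 at the University of Oslo."

       sentences = nltk.sent_tokenize(raw)                       # sentence segmentation
       tokens = [nltk.word_tokenize(s) for s in sentences]       # tokenization
       tagged = [nltk.pos_tag(t) for t in tokens]                # POS tagging
       chunked = [nltk.ne_chunk(t) for t in tagged]              # named-entity chunking

       print(chunked[0])   # a tree with NE subtrees such as (PERSON Jan/NNP ...)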

  31. Some example systems
     - Stanford CoreNLP: http://corenlp.run/
     - SpaCy (Python): https://spacy.io/docs/api/
     - OpenNLP (Java): https://opennlp.apache.org/docs/
     - GATE (Java): https://gate.ac.uk/ and https://cloud.gate.ac.uk/shopfront
     - UDPipe: http://ufal.mff.cuni.cz/udpipe (online demo: http://lindat.mff.cuni.cz/services/udpipe/)
     - Collection of tools for NER: https://www.clarin.eu/resource-families/tools-named-entity-recognition
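
     As an example of using such a system, a minimal spaCy sketch for NER (assumes the en_core_web_sm model is installed, e.g. with "python -m spacy download en_core_web_sm"; the sentence is made up):

       import spacy

       nlp = spacy.load("en_core_web_sm")          # small English pipeline: tagger, parser, NER, ...
       doc = nlp("Apple is looking at buying a U.K. startup for $1 billion.")

       for ent in doc.ents:                        # named entities found by the pipeline
           print(ent.text, ent.label_)             # e.g. Apple ORG, U.K. GPE, $1 billion MONEY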

  32. Today
     - Feedforward neural networks (partly recap)
     - Recurrent networks
     - Information extraction, IE
       - Chunking
       - Named Entity Recognition
       - Evaluation

  33. Next steps
     - Chunk words together into phrases

  34. NP-chunks
     - [ The/DT market/NN ] for/IN [ system-management/NN software/NN ] for/IN [ Digital/NNP ] [ 's/POS hardware/NN ] is/VBZ fragmented/JJ enough/RB that/IN [ a/DT giant/NN ] such/JJ as/IN [ Computer/NNP Associates/NNPS ] should/MD do/VB well/RB there/RB ./.
     - Exactly what is an NP-chunk?
       - It is an NP
       - But not all NPs are chunks
       - Flat structure: no NP-chunk is part of another NP-chunk
       - Maximally large
       - Opposing restrictions

  35. Chunking methods
     - Hand-written rules
     - Regular expressions
     - Supervised machine learning

  36. Regular Expression Chunker
     - Input: POS-tagged sentences
     - Use a regular expression over POS tags to identify NP-chunks
     - It inserts brackets around the chunks
     - NLTK example:
       grammar = r"""
         NP: {<DT|PP\$>?<JJ>*<NN>}   # determiner/possessive, adjectives and a noun
             {<NNP>+}                # one or more proper nouns
       """
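
     A sketch of how this grammar is used with NLTK's RegexpParser (the tagged example sentence is made up):

       import nltk

       grammar = r"""
         NP: {<DT|PP\$>?<JJ>*<NN>}   # determiner/possessive, adjectives and a noun
             {<NNP>+}                # one or more proper nouns
       """
       chunker = nltk.RegexpParser(grammar)

       sentence = [("the", "DT"), ("little", "JJ"), ("dog", "NN"),
                   ("barked", "VBD"), ("at", "IN"), ("Mary", "NNP")]

       tree = chunker.parse(sentence)   # a tree with NP subtrees for "the little dog" and "Mary"
       print(tree)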

  37. IOB-tags
     - B-NP: first word of an NP
     - I-NP: part of an NP, but not its first word
     - O: not part of an NP (phrase)
     - Properties:
       - One tag per token
       - Unambiguous
       - Does not insert anything in the text itself
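
     In NLTK, chunk trees can be converted to and from IOB triples; a short sketch (grammar and sentence are made up):

       import nltk
       from nltk.chunk import tree2conlltags, conlltags2tree

       chunker = nltk.RegexpParser(r"NP: {<DT>?<JJ>*<NN.*>+}")
       sentence = [("the", "DT"), ("little", "JJ"), ("dog", "NN"), ("barked", "VBD")]

       tree = chunker.parse(sentence)            # chunk tree with an NP subtree for "the little dog"
       iob = tree2conlltags(tree)                # one (word, POS, IOB) triple per token
       print(iob)   # [('the', 'DT', 'B-NP'), ('little', 'JJ', 'I-NP'), ('dog', 'NN', 'I-NP'), ('barked', 'VBD', 'O')]

       tree_again = conlltags2tree(iob)          # and back: IOB triples -> chunk tree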

  38. Assigning IOB-tags
     - The process can be considered a form of tagging:
       - POS-tagging: word to POS-tag
       - IOB-tagging: POS-tag to IOB-tag
     - But one may also use additional features, e.g. the words
     - Can use various types of classifiers
       - NLTK uses a MaxEnt classifier (= logistic regression, but the implementation is slow)
       - We can modify along the lines of mandatory assignment 2, using scikit-learn
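
     A sketch of the scikit-learn route (the feature choices and the tiny training set are illustrative stand-ins, not the assignment's actual setup): each token becomes a feature dict, and a logistic regression classifier predicts its IOB tag.

       from sklearn.feature_extraction import DictVectorizer
       from sklearn.linear_model import LogisticRegression
       from sklearn.pipeline import make_pipeline

       def token_features(tagged_sentence, i):
           """Features for token i: its POS tag, the surrounding POS tags, and the word itself."""
           word, pos = tagged_sentence[i]
           prev_pos = tagged_sentence[i - 1][1] if i > 0 else "<s>"
           next_pos = tagged_sentence[i + 1][1] if i < len(tagged_sentence) - 1 else "</s>"
           return {"pos": pos, "prev_pos": prev_pos, "next_pos": next_pos, "word": word.lower()}

       # A tiny IOB-annotated example in place of real training data (e.g. the CoNLL-2000 chunking corpus)
       train = [
           [("the", "DT", "B-NP"), ("little", "JJ", "I-NP"), ("dog", "NN", "I-NP"), ("barked", "VBD", "O")],
           [("a", "DT", "B-NP"), ("cat", "NN", "I-NP"), ("slept", "VBD", "O")],
       ]

       X, y = [], []
       for sent in train:
           tagged = [(w, p) for w, p, _ in sent]
           for i, (_, _, iob) in enumerate(sent):
               X.append(token_features(tagged, i))
               y.append(iob)

       model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
       model.fit(X, y)

       test = [("the", "DT"), ("big", "JJ"), ("cat", "NN"), ("slept", "VBD")]
       print(model.predict([token_features(test, i) for i in range(len(test))]))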

  39. (Figure from J&M, 3. ed.)

  40. Today
     - Feedforward neural networks (partly recap)
     - Recurrent networks
     - Information extraction, IE
       - Chunking
       - Named Entity Recognition
       - Evaluation
