Natural Language Processing (almost) from Scratch Ronan Collobert, Jason Weston, Leon Bottou, Michael Karlen, Koray Kavukcuoglu, Pavel Kuksa (2011) Presented by Tara Vijaykumar tgv2@illinois.edu
Content
1. Sequence Labeling
2. The benchmark tasks
   a. Part-of-speech Tagging
   b. Chunking
   c. Named Entity Recognition
   d. Semantic Role Labeling
3. The networks
   a. Transforming Words into Feature Vectors
   b. Extracting Higher Level Features from Word Feature Vectors
   c. Training
   d. Results
Sequence Labeling
● Assignment of a categorical label to each member of a sequence of observed values
● Eg: part-of-speech tagging — Mary had a little lamb → (noun) (verb) (det) (adj) (noun)
● Can be treated as a set of independent classification tasks
  ○ or choose the globally best set of labels for the entire sequence at once
● These algorithms are probabilistic in nature
  ○ Markov assumption
  ○ Hidden Markov Model (HMM)
POS Tagging
● Label each word with its syntactic tag (verb, noun, adverb, …)
● Best POS classifiers
  ○ trained on windows of text, which are then fed to a bidirectional decoding algorithm during inference
  ○ Features: previous and next tag context, multiple-word (bigram, trigram, …) context
● Shen et al. (2007) "Guided Learning"
  ○ bidirectional sequence classification using perceptrons
Chunking
● Labeling segments of a sentence with syntactic constituents (NP or VP)
● Each word is assigned only one unique tag, encoded as a begin-chunk (B-NP) or inside-chunk (I-NP) tag
● Evaluated using the CoNLL shared task
● Sha and Pereira (2003)
  ○ systems based on second-order random fields (Conditional Random Fields)
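To make the begin-chunk / inside-chunk encoding above concrete, here is a minimal Python sketch (the sentence and chunk spans are a hypothetical example, not taken from the paper):

    # Minimal illustration of B-/I-/O chunk encoding (hypothetical example).
    def spans_to_bio(num_words, spans):
        """spans: list of (start, end_exclusive, label) chunk segments."""
        tags = ["O"] * num_words                    # words outside any chunk
        for start, end, label in spans:
            tags[start] = "B-" + label              # first word of the chunk
            for i in range(start + 1, end):
                tags[i] = "I-" + label              # remaining words of the chunk
        return tags

    words = ["He", "reckons", "the", "current", "deficit"]
    # "He" is an NP, "reckons" a VP, "the current deficit" an NP
    print(list(zip(words, spans_to_bio(len(words), [(0, 1, "NP"), (1, 2, "VP"), (2, 5, "NP")]))))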
Named Entity Recognition
● Labels atomic elements in the sentence into categories ("PERSON", "LOCATION")
● Ando and Zhang (2005)
  ○ semi-supervised approach
  ○ Viterbi decoding at test time
  ○ Features: words, POS tags, suffixes and prefixes, CHUNK tags
Semantic Role Labeling
● Give a semantic role to a syntactic constituent of a sentence
● State-of-the-art SRL systems consist of several stages:
  ○ producing a parse tree
  ○ identifying which parse tree nodes represent the arguments of a given verb
  ○ classifying nodes to compute the corresponding SRL tags
● Koomen et al. (2005)
  ○ takes the output of multiple classifiers and combines them into a coherent predicate-argument output
  ○ an optimization stage takes into account the recommendations of the classifiers and problem-specific constraints
Introduction
● Existing systems
  ○ Find intermediate representations with task-specific features
    ■ Derived from the output of existing systems (runtime dependencies)
  ○ Advantage: effective due to extensive use of linguistic knowledge
  ○ How to progress toward broader goals of NL understanding?
● Collobert et al. (2011)
  ○ Single learning system to discover internal representations
  ○ Avoid a large body of linguistic knowledge; instead, transfer intermediate representations discovered on large unlabeled data sets
  ○ "Almost from scratch": reduced reliance on prior NLP knowledge
Remarks
● Comparing systems
  ○ We do not learn anything about the quality of each system if they were trained with different labeled data
  ○ We refer to benchmark systems: top existing systems which avoid the use of external data and are well established in the NLP field
● For more complex tasks (with correspondingly lower accuracies), the best systems have more engineered features
  ○ POS is one of the simplest of the four tasks, and has relatively few engineered features
  ○ SRL is the most complex, and many kinds of features have been designed for it
Networks
● Traditional NLP approach
  ○ extract a rich set of hand-designed features (based on linguistic intuition, trial and error)
    ■ task dependent
  ○ Complex tasks (SRL) then require a large number of possibly complex features (eg: extracted from a parse tree)
    ■ can impact the computational cost
● Proposed approach
  ○ pre-process features as little as possible - makes it generalizable
  ○ use a multilayer neural network (NN) architecture trained in an end-to-end fashion
Transforming Words into Feature Vectors
● For efficiency, words are fed to our architecture as indices taken from a finite dictionary D
● The first layer of the network maps each of these word indices into a feature vector, by a lookup table operation
  ○ (the word lookup table will later be initialized with pre-trained representations instead of randomly)
● For each word w ∈ D, an internal d_wrd-dimensional feature vector representation is given by the lookup table layer LT_W(·):
  ○ LT_W(w) = ⟨W⟩_w^1, where W is a d_wrd × |D| matrix of parameters to be learned, ⟨W⟩_w^1 is the w-th column of W, and d_wrd is the word vector size (a hyper-parameter)
● Given a sentence or any sequence of T words [w]_1^T, the lookup table layer produces the output matrix LT_W([w]_1^T) = ( ⟨W⟩_{[w]_1}^1  ⟨W⟩_{[w]_2}^1  …  ⟨W⟩_{[w]_T}^1 )
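A minimal sketch of the lookup-table layer in Python/NumPy (the dimensions and the toy sentence are illustrative values, not the paper's):

    import numpy as np

    d_wrd, dict_size = 50, 100000                          # word vector size and |D| (illustrative)
    rng = np.random.default_rng(0)
    W = rng.normal(scale=0.01, size=(d_wrd, dict_size))    # parameters to be learned

    def lookup_table(word_indices):
        """LT_W([w]_1^T): a d_wrd x T matrix whose t-th column is the column of W for word w_t."""
        return W[:, word_indices]

    sentence = np.array([42, 7, 99, 7])                    # a sentence as dictionary indices
    features = lookup_table(sentence)
    print(features.shape)                                  # (50, 4): one d_wrd-dim vector per word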
Extracting Higher Level Features from Word Feature Vectors
● Window approach: assumes the tag of a word depends mainly on its neighboring words
● Word feature window given by the first network layer: the lookup-table vectors of the d_win words centered on the word to tag, concatenated into a single vector
● Linear Layer: f^l = W^l f^(l-1) + b^l, where the matrix W^l and bias b^l are trained parameters
● HardTanh Layer: element-wise non-linearity, HardTanh(x) = -1 if x < -1, x if -1 ≤ x ≤ 1, 1 if x > 1
● Scoring: the last linear layer outputs a vector whose size is the number of tags, with one score per tag
● The feature window is not well defined for words near the beginning or the end of a sentence - augment the sentence with a special "PADDING" word, akin to the use of "start" and "stop" symbols in sequence models
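A minimal NumPy sketch of the window-approach scorer described above (layer sizes and the PADDING index are illustrative assumptions; this is not the authors' code):

    import numpy as np

    rng = np.random.default_rng(0)
    d_wrd, d_win, n_hu, n_tags, dict_size = 50, 5, 100, 10, 1000   # illustrative sizes
    PAD = 0                                                        # index of the "PADDING" word
    W_lt = rng.normal(scale=0.01, size=(d_wrd, dict_size))         # word lookup table
    W1 = rng.normal(scale=0.01, size=(n_hu, d_wrd * d_win)); b1 = np.zeros(n_hu)
    W2 = rng.normal(scale=0.01, size=(n_tags, n_hu));        b2 = np.zeros(n_tags)

    def hardtanh(x):
        return np.clip(x, -1.0, 1.0)

    def window_scores(sentence, t):
        """Scores for all tags of word t, using a d_win window padded at the sentence edges."""
        half = d_win // 2
        padded = [PAD] * half + list(sentence) + [PAD] * half
        window = padded[t : t + d_win]
        f1 = W_lt[:, window].reshape(-1, order="F")    # concatenated word feature vectors
        f2 = hardtanh(W1 @ f1 + b1)                    # linear layer + HardTanh
        return W2 @ f2 + b2                            # one score per tag

    print(window_scores([5, 17, 3, 42], t=0).shape)    # (10,)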
Extracting Higher Level Features from Word Feature Vectors
● Sentence approach: the window approach fails with SRL, where the tag of a word depends on a verb chosen beforehand in the sentence
● Convolutional Layer: generalization of the window approach - the same linear layer is applied to every window t in the sentence, producing one output column of the l-th layer per window
● Max Layer: take the maximum over time (over all windows t) of each convolutional feature, giving a fixed-size sentence representation
  ○ an average operation does not make much sense - most words in the sentence do not have any influence on the semantic role of a given word to tag
  ○ the max approach forces the network to capture the most useful local features
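A minimal sketch of convolution over time followed by the max layer, in the same spirit as the window sketch above (sizes and zero padding are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(1)
    d_in, d_win, n_hu, T = 50, 3, 100, 8           # illustrative sizes; T = sentence length
    X = rng.normal(size=(d_in, T))                 # word feature vectors for one sentence
    Wc = rng.normal(scale=0.01, size=(n_hu, d_in * d_win)); bc = np.zeros(n_hu)

    def conv_over_time(X, Wc, bc, d_win):
        """Apply the same linear layer to every d_win window (zero padding at the edges)."""
        half = d_win // 2
        Xp = np.pad(X, ((0, 0), (half, half)))
        cols = [Wc @ Xp[:, t : t + d_win].reshape(-1, order="F") + bc for t in range(X.shape[1])]
        return np.stack(cols, axis=1)              # shape (n_hu, T): one column per window

    C = conv_over_time(X, Wc, bc, d_win)
    sentence_feature = C.max(axis=1)               # max over time: fixed-size vector of length n_hu
    print(C.shape, sentence_feature.shape)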
Extracting Higher Level Features from Word Feature Vectors
● Tagging schemes:
  ○ window approach
    ■ tags apply to the word located in the center of the window
  ○ sentence approach
    ■ tags apply to the word designated by additional markers in the network input
● The most expressive IOBES tagging scheme is used
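Extending the BIO sketch shown earlier, a minimal illustration of the IOBES scheme, where S marks single-word chunks and E marks chunk ends (hypothetical example, not from the paper):

    # IOBES encoding of chunk spans: B(egin), I(nside), O(utside), E(nd), S(ingle)
    def spans_to_iobes(num_words, spans):
        tags = ["O"] * num_words
        for start, end, label in spans:            # end is exclusive
            if end - start == 1:
                tags[start] = "S-" + label         # single-word chunk
            else:
                tags[start] = "B-" + label
                tags[end - 1] = "E-" + label
                for i in range(start + 1, end - 1):
                    tags[i] = "I-" + label
        return tags

    print(spans_to_iobes(5, [(0, 1, "NP"), (1, 2, "VP"), (2, 5, "NP")]))
    # ['S-NP', 'S-VP', 'B-NP', 'I-NP', 'E-NP']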
Training
● For trainable parameters θ and a training set T, maximize the following log-likelihood with respect to θ:
  ○ θ ↦ Σ_{(x,y)∈T} log p(y | x, θ)
● Stochastic gradient: maximization is achieved by iteratively selecting a random example (x, y) and making a gradient step:
  ○ θ ← θ + λ ∂ log p(y | x, θ) / ∂θ
● Word-level log-likelihood: each word in the sentence is considered independently
  ○ the conditional tag probability is obtained by applying a softmax over the network's tag scores
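A minimal sketch of the word-level criterion and one stochastic gradient step, taken only with respect to the output scores for brevity (NumPy; gradients through the lower layers are omitted, values illustrative):

    import numpy as np

    def log_softmax(scores):
        # log p(i | x): numerically stable log-softmax over the tag scores
        s = scores - scores.max()
        return s - np.log(np.exp(s).sum())

    def word_level_step(scores, true_tag, lr=0.01):
        """One SGD step on the scores: gradient of log p(true_tag | x) is one_hot(y) - softmax."""
        p = np.exp(log_softmax(scores))
        grad = -p
        grad[true_tag] += 1.0
        return scores + lr * grad                  # gradient *ascent* on the log-likelihood

    scores = np.array([0.2, -1.0, 0.5])            # illustrative scores for 3 tags
    print(log_softmax(scores)[2], word_level_step(scores, true_tag=2))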
Training
● Introduce scores:
  ○ Transition score [A]_ij: for jumping from tag i to tag j in successive words
  ○ Initial score [A]_i0: for starting from the i-th tag
● Sentence-level log-likelihood: enforces dependencies between the predicted tags in a sentence
  ○ Score of a sentence along a path of tags, combining the network scores with the initial and transition scores
  ○ Maximize this score for the correct tag path
    ■ Viterbi algorithm for inference
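A minimal Viterbi-decoding sketch for inference with per-word tag scores plus initial and transition scores (NumPy; the score matrices here are random placeholders, not trained values):

    import numpy as np

    def viterbi(emit, A_init, A_trans):
        """emit: (T, K) per-word tag scores; A_init: (K,) initial scores; A_trans: (K, K) transitions."""
        T, K = emit.shape
        delta = A_init + emit[0]                   # best path score ending in each tag at t = 0
        back = np.zeros((T, K), dtype=int)
        for t in range(1, T):
            cand = delta[:, None] + A_trans        # cand[i, j]: best score ending in i, then jump i -> j
            back[t] = cand.argmax(axis=0)
            delta = cand.max(axis=0) + emit[t]
        path = [int(delta.argmax())]
        for t in range(T - 1, 0, -1):              # follow back-pointers to recover the best tag path
            path.append(int(back[t, path[-1]]))
        return path[::-1]

    rng = np.random.default_rng(0)
    print(viterbi(rng.normal(size=(6, 4)), rng.normal(size=4), rng.normal(size=(4, 4))))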
Results
● Remarks:
  ○ Architecture: the choice of hyperparameters such as the number of hidden units has a limited impact on the generalization performance
  ○ We would prefer semantically similar words to be close in the embedding space represented by the word lookup table, but this is not the case
NLP (Almost) From Scratch Pt. 2 Harrison Ding
Word Embeddings
- Goal: obtain word embeddings that can capture syntactic and semantic differences
Datasets
- English Wikipedia (631 million words)
  - Constructed a dictionary of the 100k most common words in WSJ
  - Replaced non-dictionary words with a "RARE" token
- Reuters RCV1 dataset (221 million words)
  - Extended the dictionary to a size of 130k words, where the extra 30k were Reuters' most common words
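A minimal sketch of the dictionary construction and "RARE" replacement described above (a tiny hypothetical corpus and a cutoff of 5 words stand in for the real 100k-word WSJ dictionary):

    from collections import Counter

    corpus = "the cat sat on the mat and the dog sat on the cat".split()
    dict_size = 5                                  # illustrative; the slides use the 100k most common WSJ words
    counts = Counter(corpus)
    dictionary = {w for w, _ in counts.most_common(dict_size)}

    # Replace out-of-dictionary words with a special "RARE" token
    mapped = [w if w in dictionary else "RARE" for w in corpus]
    print(dictionary)
    print(mapped)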
Ranking Criterion
- Cohen et al. (1998): binary preference function, ranking ordering
- Training is done with a windowed approach
- X = set of all possible text windows
- D = all words in the dictionary
- x^(w) = text window with the center word replaced by the chosen word w
- f(x) = score of the text window
- Ranking criterion: minimize Σ_{x∈X} Σ_{w∈D} max(0, 1 - f(x) + f(x^(w)))
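A minimal sketch of the pairwise ranking criterion for one text window: the observed window should score at least a margin of 1 above the same window with its center word replaced (NumPy; the linear scorer standing in for f and all sizes are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(0)
    d_wrd, dict_size, d_win = 10, 50, 5
    W = rng.normal(scale=0.1, size=(d_wrd, dict_size))     # word embeddings
    v = rng.normal(scale=0.1, size=d_wrd * d_win)          # stand-in scorer: f(x) = v . concat(embeddings)

    def f(window):
        return float(v @ W[:, window].reshape(-1, order="F"))

    def ranking_loss(window):
        """Sum over candidate words w of max(0, 1 - f(x) + f(x^(w)))."""
        center = len(window) // 2
        loss = 0.0
        for w in range(dict_size):                         # in practice a random w is sampled per step
            corrupted = list(window)
            corrupted[center] = w                          # x^(w): replace the center word by w
            loss += max(0.0, 1.0 - f(window) + f(corrupted))
        return loss

    print(ranking_loss([3, 14, 27, 8, 41]))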
Result of Embeddings for LM1 - Goal of capturing semantic and syntactic differences appears to have been achieved
Tricks with Training
- Training time is measured in weeks
- Problem: difficult to try a large number of hyperparameter combinations
- Efficient solution: "breeding"
  - Train new networks based on earlier networks
  - Construct embeddings based on small dictionaries and use the best ones from there
Language Models Information
- Language Model LM1
  - Window size d_win = 11
  - Hidden layer n_hu^1 = 100 units
  - Trained on English Wikipedia
  - Dictionary sizes: 5k, 10k, 30k, 50k, 100k
  - Training time: 4 weeks
- Language Model LM2
  - Same dimensions as LM1
  - Embeddings initialized from LM1
  - Trained on English Wikipedia + Reuters
  - Dictionary size: 130k
  - Training time: 3 more weeks
Comparison of Generalization Performance
Multi-Task Learning
- Joint training = training a single neural network for two tasks (a minimal sketch follows below)
- Easy to do when similar patterns appear in training tasks with different labels
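A minimal sketch of joint training with shared parameters: the tasks share the word-window features and first hidden layer while keeping task-specific output layers; the sizes, task names, and the toy input here are illustrative assumptions, not the authors' configuration:

    import numpy as np

    rng = np.random.default_rng(0)
    d_in, n_hu, n_tags_pos, n_tags_chunk = 250, 100, 45, 23     # illustrative sizes

    # Shared parameters (applied to the concatenated word-window features)
    W_shared = rng.normal(scale=0.01, size=(n_hu, d_in)); b_shared = np.zeros(n_hu)
    # Task-specific output layers
    W_pos = rng.normal(scale=0.01, size=(n_tags_pos, n_hu));   b_pos = np.zeros(n_tags_pos)
    W_chk = rng.normal(scale=0.01, size=(n_tags_chunk, n_hu)); b_chk = np.zeros(n_tags_chunk)

    def scores(x, task):
        hidden = np.clip(W_shared @ x + b_shared, -1.0, 1.0)    # shared linear + HardTanh layer
        if task == "pos":
            return W_pos @ hidden + b_pos
        return W_chk @ hidden + b_chk

    # Joint training (schematic): alternate examples from the two tasks, updating the shared
    # layer with gradients from both tasks and each output layer only from its own task.
    x = rng.normal(size=d_in)                                   # one window's concatenated word features
    print(scores(x, "pos").shape, scores(x, "chunk").shape)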