
arXiv:1508.01991v1 [cs.CL] 9 Aug 2015



Bidirectional LSTM-CRF Models for Sequence Tagging

Zhiheng Huang, Baidu research, huangzhiheng@baidu.com
Wei Xu, Baidu research, xuwei06@baidu.com
Kai Yu, Baidu research, yukai@baidu.com

Abstract

In this paper, we propose a variety of Long Short-Term Memory (LSTM) based models for sequence tagging. These models include LSTM networks, bidirectional LSTM (BI-LSTM) networks, LSTM with a Conditional Random Field (CRF) layer (LSTM-CRF) and bidirectional LSTM with a CRF layer (BI-LSTM-CRF). Our work is the first to apply a bidirectional LSTM CRF (denoted as BI-LSTM-CRF) model to NLP benchmark sequence tagging data sets. We show that the BI-LSTM-CRF model can efficiently use both past and future input features thanks to a bidirectional LSTM component. It can also use sentence level tag information thanks to a CRF layer. The BI-LSTM-CRF model can produce state of the art (or close to) accuracy on POS, chunking and NER data sets. In addition, it is robust and has less dependence on word embedding as compared to previous observations.

1 Introduction

Sequence tagging, including part of speech tagging (POS), chunking, and named entity recognition (NER), has been a classic NLP task. It has drawn research attention for a few decades. The output of taggers can be used for downstream applications. For example, a named entity recognizer trained on user search queries can be utilized to identify which spans of text are products, thus triggering certain product ads. Another example is that such tag information can be used by a search engine to find relevant webpages.

Most existing sequence tagging models are linear statistical models, which include Hidden Markov Models (HMM), Maximum entropy Markov models (MEMMs) (McCallum et al., 2000), and Conditional Random Fields (CRF) (Lafferty et al., 2001). Convolutional network based models (Collobert et al., 2011) have been recently proposed to tackle the sequence tagging problem. We denote such a model as Conv-CRF as it consists of a convolutional network and a CRF layer on the output (the term sentence level log-likelihood (SSL) was used in the original paper). The Conv-CRF model has generated promising results on sequence tagging tasks. In the speech language understanding community, recurrent neural network (Mesnil et al., 2013; Yao et al., 2014) and convolutional net (Xu and Sarikaya, 2013) based models have been recently proposed. Other relevant work includes (Graves et al., 2005; Graves et al., 2013), which proposed a bidirectional recurrent neural network for speech recognition.

In this paper, we propose a variety of neural network based models for the sequence tagging task. These models include LSTM networks, bidirectional LSTM networks (BI-LSTM), LSTM networks with a CRF layer (LSTM-CRF), and bidirectional LSTM networks with a CRF layer (BI-LSTM-CRF). Our contributions can be summarized as follows. 1) We systematically compare the performance of the aforementioned models on NLP tagging data sets; 2) Our work is the first to apply a bidirectional LSTM CRF (denoted as BI-LSTM-CRF) model to NLP benchmark sequence tagging data sets. This model can use both past and future input features thanks to a bidirectional LSTM component. In addition, this model can use sentence level tag information thanks to a CRF layer. Our model can produce state of the art (or close to) accuracy on POS, chunking and NER data sets; 3) We show that the BI-LSTM-CRF model is robust and has less dependence on word embedding as compared to previous observations (Collobert et al., 2011). It can produce accurate tagging performance without resorting to word embedding.

The remainder of the paper is organized as follows. Section 2 describes the sequence tagging models used in this paper. Section 3 shows the training procedure. Section 4 reports the experimental results. Section 5 discusses related research. Finally, Section 6 draws conclusions.

2 Models

In this section, we describe the models used in this paper: LSTM, BI-LSTM, CRF, LSTM-CRF and BI-LSTM-CRF.

2.1 LSTM Networks

Recurrent neural networks (RNN) have been employed to produce promising results on a variety of tasks including language modeling (Mikolov et al., 2010; Mikolov et al., 2011) and speech recognition (Graves et al., 2005). An RNN maintains a memory based on history information, which enables the model to predict the current output conditioned on long distance features. Figure 1 shows the RNN structure (Elman, 1990), which has an input layer $x$, a hidden layer $h$ and an output layer $y$. In the named entity tagging context, $x$ represents input features and $y$ represents tags. Figure 1 illustrates a named entity recognition system in which each word is tagged with other (O) or one of four entity types: Person (PER), Location (LOC), Organization (ORG), and Miscellaneous (MISC). The sentence "EU rejects German call to boycott British lamb ." is tagged as B-ORG O B-MISC O O O B-MISC O O, where B- and I- tags indicate beginning and intermediate positions of entities.

Figure 1: A simple RNN model. (Input layer x: "EU rejects German call"; hidden layer h; output layer y: B-ORG O B-MISC O.)

An input layer represents features at time $t$. They could be a one-hot encoding of the word feature, dense vector features, or sparse features. An input layer has the same dimensionality as the feature size. An output layer represents a probability distribution over labels at time $t$. It has the same dimensionality as the number of labels. Compared to a feedforward network, an RNN introduces a connection between the previous hidden state and the current hidden state (and thus the recurrent layer weight parameters). This recurrent layer is designed to store history information. The values in the hidden and output layers are computed as follows:

$$h(t) = f(U x(t) + W h(t-1)), \tag{1}$$
$$y(t) = g(V h(t)), \tag{2}$$

where $U$, $W$, and $V$ are the connection weights to be computed at training time, and $f(z)$ and $g(z)$ are the sigmoid and softmax activation functions, defined as follows:

$$f(z) = \frac{1}{1 + e^{-z}}, \tag{3}$$
$$g(z_m) = \frac{e^{z_m}}{\sum_k e^{z_k}}. \tag{4}$$

In this paper, we apply Long Short-Term Memory (Hochreiter and Schmidhuber, 1997; Graves et al., 2005) to sequence tagging. Long Short-Term Memory networks are the same as RNNs, except that the hidden layer updates are replaced by purpose-built memory cells. As a result, they may be better at finding and exploiting long range dependencies in the data. Fig. 2 illustrates a single LSTM memory cell (Graves et al., 2005).

Figure 2: A Long Short-Term Memory Cell. (Components shown: input gate $i_t$, forget gate $f_t$, output gate $o_t$, cell $c_t$, input $x_t$, output $h_t$.)

The LSTM memory cell is implemented as follows:

$$i_t = \sigma(W_{xi} x_t + W_{hi} h_{t-1} + W_{ci} c_{t-1} + b_i)$$
$$f_t = \sigma(W_{xf} x_t + W_{hf} h_{t-1} + W_{cf} c_{t-1} + b_f)$$
$$c_t = f_t c_{t-1} + i_t \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c)$$
$$o_t = \sigma(W_{xo} x_t + W_{ho} h_{t-1} + W_{co} c_t + b_o)$$
$$h_t = o_t \tanh(c_t)$$
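The RNN recurrence of Eqs. (1)-(4) and the LSTM memory cell update can be sketched in plain Python as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the products between gates and cell values are taken to be elementwise, the cell-to-gate weights (W_ci, W_cf, W_co) are assumed to be diagonal and are therefore stored as vectors, and the names `rnn_step`, `lstm_step`, `zero_params` and the parameter dictionary layout are hypothetical.

```python
import math

def sigmoid(z):
    # Eq. (3): logistic function f(z)
    return 1.0 / (1.0 + math.exp(-z))

def softmax(zs):
    # Eq. (4); the maximum is subtracted first for numerical stability
    top = max(zs)
    exps = [math.exp(z - top) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

def matvec(M, v):
    # Dense matrix-vector product; M is a list of rows
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def rnn_step(x_t, h_prev, U, W, V):
    # Eq. (1): h(t) = f(U x(t) + W h(t-1)); Eq. (2): y(t) = g(V h(t))
    a, r = matvec(U, x_t), matvec(W, h_prev)
    h_t = [sigmoid(ai + ri) for ai, ri in zip(a, r)]
    return h_t, softmax(matvec(V, h_t))

def lstm_step(x_t, h_prev, c_prev, p):
    # p maps names such as "W_xi", "W_hi", "W_ci", "b_i" to weights
    # (hypothetical layout; peephole weights W_c* assumed diagonal).
    def pre(g):
        # W_x? x_t + W_h? h_{t-1} + b_? for gate or cell input g
        a, r = matvec(p["W_x" + g], x_t), matvec(p["W_h" + g], h_prev)
        return [ai + ri + bi for ai, ri, bi in zip(a, r, p["b_" + g])]

    i_t = [sigmoid(z + w * c) for z, w, c in zip(pre("i"), p["W_ci"], c_prev)]
    f_t = [sigmoid(z + w * c) for z, w, c in zip(pre("f"), p["W_cf"], c_prev)]
    c_t = [f * c + i * math.tanh(z)
           for f, c, i, z in zip(f_t, c_prev, i_t, pre("c"))]
    # The output gate peeks at the updated cell state c_t, not c_{t-1}
    o_t = [sigmoid(z + w * c) for z, w, c in zip(pre("o"), p["W_co"], c_t)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def zero_params(n_in, n_hid):
    # Hypothetical helper: all-zero parameters, useful only for sanity checks
    p = {}
    for g in ("i", "f", "c", "o"):
        p["W_x" + g] = [[0.0] * n_in for _ in range(n_hid)]
        p["W_h" + g] = [[0.0] * n_hid for _ in range(n_hid)]
        p["b_" + g] = [0.0] * n_hid
    for g in ("i", "f", "o"):
        p["W_c" + g] = [0.0] * n_hid
    return p

# Sanity check: one step with zero weights on a 3-feature input, 2 cells
h_t, c_t = lstm_step([1.0, 0.0, 0.0], [0.0, 0.0], [1.0, -2.0], zero_params(3, 2))
```

With all-zero weights every gate evaluates to 0.5, so the new cell state is half the previous one and the hidden output is 0.5 times tanh of that state. A trained model would use learned weights instead, and a tagger would apply `lstm_step` once per token while carrying `h_t` and `c_t` forward.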
