Multi-Level Structured Self-Attentions for Distantly Supervised Relation Extraction

Jinhua Du†, Jingguang Han§, Andy Way†, Dadong Wan§
† ADAPT Centre, School of Computing, Dublin City University, Ireland
§ Accenture Labs Dublin, Ireland
{jinhua.du, andy.way}@adaptcentre.ie
{jingguang.han, dadong.wan}@accenture.com

arXiv:1809.00699v1 [cs.CL] 3 Sep 2018

Abstract

Attention mechanisms are often used in deep neural networks for distantly supervised relation extraction (DS-RE) to distinguish valid from noisy instances. However, traditional 1-D vector attention models are insufficient for the learning of different contexts in the selection of valid instances to predict the relationship for an entity pair. To alleviate this issue, we propose a novel multi-level structured (2-D matrix) self-attention mechanism for DS-RE in a multi-instance learning (MIL) framework using bidirectional recurrent neural networks. In the proposed method, a structured word-level self-attention mechanism learns a 2-D matrix where each row vector represents a weight distribution for different aspects of an instance regarding two entities. Targeting the MIL issue, the structured sentence-level attention learns a 2-D matrix where each row vector represents a weight distribution on the selection of different valid instances. Experiments conducted on two publicly available DS-RE datasets show that the proposed framework with a multi-level structured self-attention mechanism significantly outperforms state-of-the-art baselines in terms of PR curves, P@N and F1 measures.

1 Introduction

Relation extraction (RE) is a fundamental task in information extraction (IE), which studies the issue of predicting semantic relations between pairs of entities in a sentence (Zelenko et al., 2003; Bunescu and Mooney, 2005; Zhou et al., 2005). One crucial problem in RE is the relative lack of large-scale, high-quality labeled data. In recent years, one commonly used and effective technique for dealing with this challenge is the distant supervision method via knowledge bases (KBs) (Mintz et al., 2009; Riedel et al., 2010; Hoffmann et al., 2011), which assumes that if an entity pair appearing in some sentences can be observed in a KB with a certain relationship, then these sentences will be labeled as the context of this entity pair and this relationship. The distant supervision strategy is an effective and efficient method for automatically labeling large-scale training data. However, it also introduces a severe mislabelling problem due to the fact that a sentence that mentions two entities does not necessarily express their relation in a KB (Surdeanu et al., 2012; Zeng et al., 2015).

Plenty of research work has been proposed to deal with distantly supervised data and has achieved significant progress, especially with the rapid development of deep neural networks (DNN) for relation extraction in recent years (Zeng et al., 2014, 2015; Lin et al., 2016, 2017a; Wang et al., 2016; Zhou et al., 2016; Ji et al., 2017; Yang et al., 2017; Zeng et al., 2017). DNN models under an MIL framework for DS-RE have become state-of-the-art, replacing statistical methods, such as feature-based and graphical models (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012). In the MIL framework for distantly supervised RE, each entity pair often has multiple instances where some are noisy and some are valid. The attention mechanism in DNNs, such as convolutional (CNN) and recurrent neural networks (RNN), is an effective way to select valid instances by learning a weight distribution over multiple instances. However, there are two important representation learning problems in DNN-based distantly supervised RE: (1) Problem I: entity pair-targeted context representation learning from an instance; and (2) Problem II: valid instance selection representation learning over multiple instances. The former can use a word-level attention mechanism to learn a weight distribution on words and then a weighted sentence representation regarding two entities; the latter can employ a sentence-level attention mechanism to
learn a weight distribution on multiple instances so that valid sentences with higher weights can be focused on and selected, and noisy instances with lower weights are suppressed.

Both the word-level and sentence-level attention mechanisms in previous work on the RE task are simple 1-D vectors which are learned using the hidden states of the RNN, or via pooling from either the RNNs’ hidden states or convolved n-grams (Zeng et al., 2014, 2015; Zhou et al., 2016; Wang et al., 2016; Ji et al., 2017; Yang et al., 2017). The deficiency of the 1-D attention vector is that it only focuses on one or a small number of aspects of the sentence, or one or a small number of instances (Lin et al., 2017b), with the result that different semantic aspects of the sentence, or multiple different valid sentences, are ignored and cannot be utilised.

Inspired by the structured self-attentive sentence embedding in Lin et al. (2017b), we propose a novel multi-level structured (2-D) self-attention mechanism (MLSSA) in a bidirectional LSTM-based (BiLSTM) (Hochreiter and Schmidhuber, 1997) MIL framework to alleviate these two problems in distantly supervised RE. Regarding Problem I, we propose a 2-D matrix-based word-level attention mechanism, which contains multiple vectors, each focusing on different aspects of the sentence for better context representation learning. In terms of Problem II, we propose a 2-D sentence-level attention mechanism for multiple instance learning, which contains multiple vectors, each focusing on different valid instances for better sentence selection. The term “structured” indicates that the weight vectors in the learned 2-D matrix try to construct a structural dependency relationship by learning different weight distributions for different contexts or instances given the entity pair. Our structured attention mechanism is different from that in Kim et al. (2017), which incorporates richer structural distributions and is a simple extension of the basic attention procedure. We verify the proposed framework on two distantly supervised RE datasets, namely the New York Times (NYT) dataset (Riedel et al., 2010) and the DBpedia Portuguese dataset (Batista et al., 2013). Experimental results show that our MLSSA framework significantly outperforms state-of-the-art baseline systems in terms of different evaluation metrics.

The main contributions of this paper include: (1) we propose a novel multi-level structured (2-D) self-attention mechanism for DS-RE which can make full use of input sequences to learn different contexts, without integrating extra resources; (2) we propose a 2-D matrix-based word-level attention for better context representation learning targeting two entities; (3) we propose a 2-D sentence-level attention mechanism over multiple instances to select different valid instances; and (4) we verify the proposed framework on two publicly available distantly supervised datasets.

2 Related Work

Most existing work on distant supervision data mainly focuses on denoising the data under the MIL strategy by learning a valid sentence representation or features, and then selecting one or more valid instances for relation classification (Riedel et al., 2010; Hoffmann et al., 2011; Surdeanu et al., 2012; Zeng et al., 2015; Lin et al., 2016, 2017a; Zhou et al., 2016; Ji et al., 2017; Zeng et al., 2017; Yang et al., 2017).

Riedel et al. (2010) and Surdeanu et al. (2012) use a graphical model and MIL to select the valid sentences and classify the relations. However, these models are based on statistical methods and feature engineering, i.e. extracting sentence features using other NLP tools. Zeng et al. (2015) proposed a piece-wise CNN (PCNN) method to automatically learn sentence-level features and select one valid instance for the relation classification. The one-sentence-selection strategy does not make full use of the supervision information among multiple instances.

Lin et al. (2016) and Ji et al. (2017) introduce an attention mechanism to the PCNN-based MIL framework to select informative sentences, which outperforms all baseline systems on the NYT data set. However, their attention mechanism is only a sentence-level model without incorporating word-level attention. Zhou et al. (2016) introduce a word-level attention model to the BiLSTM-based MIL framework and obtain significant improvements on the SemEval2010 (Hendrickx et al., 2010) data set. Wang et al. (2016) extend the single word-level attention model to multiple word levels in CNNs to discern patterns in heterogeneous contexts of the input sentence, and achieve the best performance on the SemEval2010 data set. However, these two works were not targeting the distantly supervised RE problem.
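
The contrast drawn in the Introduction between a 1-D attention vector and a 2-D structured attention matrix can be illustrated with a short NumPy example. This is a minimal sketch, assuming the formulation of Lin et al. (2017b), A = softmax(W2 tanh(W1 H^T)), and not necessarily the exact model of this paper. The parameter names, the dimensions, the random weights standing in for learned parameters, the random matrices standing in for BiLSTM hidden states, and the averaging of the r attention views into one sentence vector are all illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def structured_self_attention(H, W1, W2):
    """2-D self-attention in the style of Lin et al. (2017b).

    H  : (n, d)   one row per item (word hidden state or sentence vector)
    W1 : (d_a, d) first projection (hypothetical parameter name)
    W2 : (r, d_a) second projection; r is the number of attention rows
    Returns A (r, n), whose rows are weight distributions over the n items,
    and M = A @ H (r, d), the r weighted views of the input.
    """
    A = softmax(W2 @ np.tanh(W1 @ H.T), axis=1)
    return A, A @ H

rng = np.random.default_rng(0)
d, d_a, r = 8, 5, 3                 # illustrative sizes, not from the paper
sentence_lengths = [6, 9, 4, 7]     # a bag of four instances for one entity pair

# Word level: random matrices stand in for BiLSTM hidden states.
W1_word = rng.standard_normal((d_a, d))
W2_word = rng.standard_normal((r, d_a))
sentence_vectors = []
for n in sentence_lengths:
    H_words = rng.standard_normal((n, d))
    A_word, M_word = structured_self_attention(H_words, W1_word, W2_word)
    # Assumption: average the r views into a single sentence vector.
    sentence_vectors.append(M_word.mean(axis=0))
S = np.stack(sentence_vectors)      # (number of instances, d)

# Sentence level: the same operation over the instances in the bag.
W1_sent = rng.standard_normal((d_a, d))
W2_sent = rng.standard_normal((r, d_a))
A_sent, M_bag = structured_self_attention(S, W1_sent, W2_sent)

print(A_sent.shape, M_bag.shape)    # (3, 4) (3, 8)
```

Each row of `A_sent` sums to one, so with r > 1 different rows are free to place their weight on different instances in the bag. With r = 1 the matrix reduces to the single 1-D attention vector used in earlier work.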