An Interpretable Knowledge Transfer Model for Knowledge Base Completion Qizhe Xie, Xuezhe Ma, Zihang Dai, Eduard Hovy Carnegie Mellon University Language Technologies Institute August 2, 2017 1 / 28
Outline Introduction Task Motivation Model Experiments Main Results Performance on Rare Relations Interpretability Analysis on Sparseness 2 / 28
Task: Knowledge base completion (KBC) ◮ Recover missing facts in knowledge bases ◮ Given lots of triples such as ( Leonardo DiCaprio , won award , Oscar ) ◮ Predict missing facts ( Leonardo DiCaprio , Profession , ? ) ◮ Embedding-based approaches 4 / 28
Data Sparsity Issue (a) WN18 (b) FB15k Figure 1: Frequencies of relations are subject to Zipf’s law. 5 / 28
Problems Our Model Tackles ◮ Data sparsity: Transfer learning ◮ On WN18, the rarer the relation is, the greater the improvement ◮ Interpretability: ℓ0-regularized representation ◮ Reverse relations, undirected relations and similar relations are identified by the sparse representation ◮ Model size: Compression ◮ On FB15k, the number of parameters can be reduced to 1/90 of the original model 6 / 28
Notation and Previous Models ◮ Data: Triples (h, r, t) ◮ Training data: (h = Leonardo DiCaprio, r = won award, t = Oscar) ◮ Test data: (h = Leonardo DiCaprio, r = Profession, t = ?) ◮ Energy function f_r(h, t) of triples (h, r, t) ◮ Minimize the energy of true triples and maximize the energy of false triples ◮ TransE [Bordes et al., 2013]: f_r(h, t) = ‖h + r − t‖_ℓ; parameters: entity embeddings h, t and relation embeddings r ◮ STransE [Nguyen et al., 2016]: f_r(h, t) = ‖W_{r,1} h + r − W_{r,2} t‖_ℓ; parameters: relation-specific projection matrices W_{r,1}, W_{r,2} and embeddings ◮ All parameters are trained by SGD 8 / 28
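The two energy functions above can be sketched directly in NumPy; the embedding dimension and the choice of norm order below are illustrative, not taken from the paper.

```python
import numpy as np

def transe_energy(h, r, t, ord=2):
    """TransE: f_r(h, t) = ||h + r - t||_l; low energy means a plausible triple."""
    return np.linalg.norm(h + r - t, ord=ord)

def stranse_energy(h, r, t, W_r1, W_r2, ord=2):
    """STransE: f_r(h, t) = ||W_{r,1} h + r - W_{r,2} t||_l, with
    relation-specific projection matrices for the head and the tail."""
    return np.linalg.norm(W_r1 @ h + r - W_r2 @ t, ord=ord)
```

With identity projection matrices, STransE collapses to TransE, which is why STransE is the more general of the two.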
STransE: Parametrizing Each Relation Separately ◮ Prone to the data sparsity problem 9 / 28
Sharing Parameters through Common Concepts ◮ Relation-concept mapping example with attention weights ◮ Parametrize concepts instead of relations ◮ Relation matrices are weighted averages of concept matrices with attention weights, e.g., W_{r1,1} = 0.2 D_1 + 0.8 D_2 10 / 28
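The composition above can be sketched as a single tensor contraction; the shapes are assumptions for illustration.

```python
import numpy as np

def relation_matrix(alpha, concepts):
    """Compose a relation's projection matrix as an attention-weighted
    average of shared concept matrices: W_r = sum_j alpha_j * D_j.
    alpha: (m,) attention weights; concepts: (m, d, d) concept matrices."""
    return np.tensordot(alpha, concepts, axes=1)
```

For example, with weights (0.2, 0.8) over two concepts this reproduces the slide's W_{r1,1} = 0.2 D_1 + 0.8 D_2; because the D_j are shared across relations, gradients from any relation update the common concepts.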
Sharing Parameters through Common Concepts ◮ Suppose a ground-truth mapping is given, then ◮ Transfer learning can be done effectively through parameter sharing ◮ We can interpret similar relations ◮ All parameters are trainable by SGD ◮ Concepts need to be learned end-to-end ◮ How do we obtain the mapping? 11 / 28
Dense Mapping ◮ Dense attention: Construct a dense bipartite graph and train attention weights ◮ Problems: ◮ Uninterpretable: In practice, even with ℓ1 regularization, we get distributed weights W_{r1,1} = 0.2 D_1 + 0.52 D_2 + 0.1 D_3 + 0.15 D_4 + 0.03 D_5 ◮ Inefficient: Computation involves all concept matrices ◮ Unnecessary: Intuitively, each relation can be composed of at most K concepts 12 / 28
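A minimal sketch of the dense variant, assuming an ℓ1 coefficient added to the training loss (the coefficient value and shapes are illustrative): every relation attends to all concept matrices, which is exactly the inefficiency noted above.

```python
import numpy as np

def dense_relation_matrix(alpha, concepts, l1_coef=0.01):
    """Dense mapping: the relation matrix touches EVERY concept matrix
    D_j, and an l1 penalty on the attention weights is meant to push
    them toward sparsity (in practice they stay distributed)."""
    W = np.tensordot(alpha, concepts, axes=1)   # sum over all m concepts
    penalty = l1_coef * np.abs(alpha).sum()     # added to the training loss
    return W, penalty
```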
Sparse Mapping ◮ Problem: Not differentiable ◮ An approximate approach: ◮ Given current embeddings, a correct mapping should minimize the loss function ◮ For each relation, assign a single concept to the relation and compute the loss ◮ Greedily choose the top K concepts that minimize the loss 13 / 28
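The greedy selection step can be sketched as follows; `single_concept_loss` is a hypothetical helper standing in for evaluating the training loss when one concept alone parametrizes the relation.

```python
import numpy as np

def greedy_topk_concepts(single_concept_loss, num_concepts, K):
    """Greedy sparse assignment: score each concept by the loss obtained
    when it alone is assigned to the relation, then keep the K concepts
    with the lowest loss. This sidesteps the non-differentiability of
    the discrete relation-to-concept mapping."""
    losses = np.array([single_concept_loss(j) for j in range(num_concepts)])
    return np.argsort(losses)[:K]
```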
Block Iterative Optimization ◮ Randomly initialize mappings and concepts. ◮ Repeat ◮ Optimize embeddings and attention weights with SGD ◮ Reassign mappings 14 / 28
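The alternating loop above can be sketched generically; all three callables are hypothetical stand-ins for the real training steps.

```python
def block_iterative_train(init_state, sgd_epoch, reassign, num_iters):
    """Block-iterative optimization: alternate SGD updates of the
    continuous parameters (embeddings, attention weights) with discrete
    reassignment of the relation-to-concept mapping."""
    state = init_state()                # random mappings and concepts
    for _ in range(num_iters):
        state = sgd_epoch(state)        # optimize embeddings + attention
        state = reassign(state)         # greedy top-K concept reassignment
    return state
```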
A Better Sampling Approach: Domain sampling ◮ Loss function involves negative sampling ◮ Sample from domain-specific entities with an adaptive probability ◮ E.g., negative sample of ( Steve Jobs , was born in , US ) : ◮ Uniform negative sample: ( Steve Jobs , was born in , CMU ) ◮ Domain negative sample: ( Steve Jobs , was born in , China ) 15 / 28
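A sketch of domain sampling under stated assumptions: the paper sets the domain probability adaptively per relation, while here it is just a parameter, and the data structures are illustrative.

```python
import random

def sample_negative_tail(relation, all_entities, domain_tails, p_domain, rng=random):
    """Domain sampling: with probability p_domain, corrupt the tail using
    an entity from the relation's own domain (entities observed as tails
    of that relation); otherwise sample uniformly over all entities."""
    candidates = domain_tails.get(relation, ())
    if candidates and rng.random() < p_domain:
        return rng.choice(sorted(candidates))
    return rng.choice(all_entities)
```

For (Steve Jobs, was born in, US), a domain sample prefers a hard negative like China over an easy one like CMU, matching the slide's example.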
Main Results

Model                                              Additional Info        WN18 MR    WN18 H@10    FB15k MR  FB15k H@10
SE [Bordes et al., 2011]                           No                     985        80.5         162       39.8
Unstructured [Bordes et al., 2014]                 No                     304        38.2         979       6.3
TransE [Bordes et al., 2013]                       No                     251        89.2         125       47.1
TransH [Wang et al., 2014]                         No                     303        86.7         87        64.4
TransR [Lin et al., 2015b]                         No                     225        92.0         77        68.7
CTransR [Lin et al., 2015b]                        No                     218        92.3         75        70.2
KG2E [He et al., 2015]                             No                     348        93.2         59        74.0
TransD [Ji et al., 2015]                           No                     212        92.2         91        77.3
TATEC [García-Durán et al., 2016]                  No                     -          -            58        76.7
NTN [Socher et al., 2013]                          No                     -          66.1         -         41.4
DISTMULT [Yang et al., 2015]                       No                     -          94.2         -         57.7
STransE [Nguyen et al., 2016]                      No                     206 (244)  93.4 (94.7)  69        79.7
ITransF                                            No                     205        94.2         65        81.0
ITransF (domain sampling)                          No                     223        95.2         77        81.4
RTransE [García-Durán et al., 2015]                Path                   -          -            50        76.2
PTransE [Lin et al., 2015a]                        Path                   -          -            58        84.6
NLFeat [Toutanova and Chen, 2015]                  Node + Link Features   -          94.3         -         87.0
Random Walk [Wei et al., 2016]                     Path                   -          94.8         -         74.7
IRN [Shen et al., 2016]                            External Memory        249        95.3         38        92.7

Table 1: Link prediction results on two datasets. Hits@10 (H@10) is the top-10 accuracy. Higher Hits@10 or lower Mean Rank (MR) indicates better performance. 17 / 28
Performance on Rare Relations Figure 2: Average Hits@10 on WN18 relations, ordered from frequent to rare (ITransF (ours) vs. STransE) 18 / 28
Performance on Rare Relations Figure 3: Average Hits@10 on relation bins (Frequent, Medium, Rare) for (a) WN18 and (b) FB15k (ITransF (ours) vs. STransE) 19 / 28
Interpretability: How Is Knowledge Shared? ◮ Each relation’s head and tail have their own concepts. (a) WN18 (b) FB15k Figure 4: Heatmap visualization of attention weights on WN18 and FB15k. 20 / 28
Interpretability: How Is Knowledge Shared? ◮ Each relation’s head and tail have their own concepts. ◮ Interpretation: ◮ Reverse relations: hyponym and hypernym; award winning work and won award for. (a) WN18 (b) FB15k Figure 5: Heatmap visualization of attention weights on WN18 and FB15k. 21 / 28
Interpretability: How Is Knowledge Shared? ◮ Each relation’s head and tail have their own concepts. ◮ Interpretation: ◮ Reverse relations: hyponym and hypernym; award winning work and won award for. ◮ Undirected relations: spouse; similar to. (a) WN18 (b) FB15k Figure 6: Heatmap visualization of attention weights on WN18 and FB15k. 22 / 28
Interpretability: How Is Knowledge Shared? ◮ Each relation’s head and tail have their own concepts. ◮ Interpretation: ◮ Reverse relations: hyponym and hypernym; award winning work and won award for. ◮ Undirected relations: spouse; similar to. ◮ Similar relations: was nominated for and won award for; instance hypernym and hypernym. (a) WN18 (b) FB15k 23 / 28
Interpretability of ℓ1-Regularized Dense Mapping (a) WN18 (b) FB15k Figure 8: Heatmap visualization of the ℓ1-regularized dense mapping ◮ The mapping cannot be made sparse without performance loss. 24 / 28
A Byproduct of Parameter Sharing: Model Compression Figure 9: Hits@10 with different numbers of concepts on (a) FB15k and (b) WN18 (ITransF vs. STransE and CTransR) ◮ On FB15k, the model can be compressed by nearly 90 times. 25 / 28
Analysis on Sparseness ◮ Does sparseness hurt performance?

Method       WN18 MR  WN18 H@10  Time    FB15k MR  FB15k H@10  Time
Dense        199      94.0       4m34s   69        79.4        4m30s
Dense + ℓ1   228      94.2       4m25s   131       78.9        5m47s
Sparse       207      94.1       2m32s   67        79.6        1m52s

Table 2: Performance of the model with a dense graph or a sparse graph with only 15 or 22 concepts. The time gap is more significant when we use more concepts.

◮ How does our approach compare to sparse encoding methods?

Method                                              WN18 MR  WN18 H@10  FB15k MR  FB15k H@10
Pretrain + Sparse Encoding [Faruqui et al., 2015]   211      86.6       66        79.1
Ours                                                205      94.2       65        81.0

Table 3: Different methods to obtain sparse representations 26 / 28
Conclusion ◮ Propose a knowledge embedding model which can discover shared hidden concepts ◮ Perform transfer learning through parameter sharing ◮ Design a learning algorithm to induce the interpretable sparse representation ◮ Outperform baselines on two benchmark datasets for the knowledge base completion task 27 / 28