
An Interpretable Knowledge Transfer Model for Knowledge Base Completion



  1. An Interpretable Knowledge Transfer Model for Knowledge Base Completion
     Qizhe Xie, Xuezhe Ma, Zihang Dai, Eduard Hovy
     Carnegie Mellon University, Language Technologies Institute
     August 2, 2017

  2. Outline
     ◮ Introduction
       ◮ Task
       ◮ Motivation
     ◮ Model
     ◮ Experiments
       ◮ Main Results
       ◮ Performance on Rare Relations
       ◮ Interpretability
       ◮ Analysis on Sparseness

  3. Outline (section divider: Introduction)

  4. Task: Knowledge Base Completion (KBC)
     ◮ Recover missing facts in knowledge bases
     ◮ Given lots of triples such as (Leonardo DiCaprio, won award, Oscar)
     ◮ Predict missing facts: (Leonardo DiCaprio, Profession, ?)
     ◮ Embedding-based approaches

  5. Data Sparsity Issue
     Figure 1: Frequencies of relations are subject to Zipf's law ((a) WN18, (b) FB15k; x-axis: relation, y-axes: frequency and log-frequency).

  6. Problems Our Model Tackles
     ◮ Data sparsity: transfer learning
       ◮ On WN18, the rarer the relation, the greater the improvement
     ◮ Interpretability: ℓ0-regularized representation
       ◮ Reverse relations, undirected relations and similar relations are identified by the sparse representation
     ◮ Model size: compression
       ◮ On FB15k, the number of parameters can be reduced to 1/90 of the original model

  7. Outline (section divider: Model)

  8. Notation and Previous Models
     ◮ Data: triples (h, r, t)
       ◮ Training data: (h = Leonardo DiCaprio, r = won award, t = Oscar)
       ◮ Test data: (h = Leonardo DiCaprio, r = Profession, t = ?)
     ◮ Energy function f_r(h, t) of a triple (h, r, t)
       ◮ Minimize the energy of true triples and maximize the energy of false triples
     ◮ TransE [Bordes et al., 2013]: f_r(h, t) = ‖h + r − t‖_ℓ
       Parameters: entity embeddings h, t and relation embeddings r
     ◮ STransE [Nguyen et al., 2016]: f_r(h, t) = ‖W_{r,1} h + r − W_{r,2} t‖_ℓ
       Parameters: relation-specific projection matrices W_{r,1}, W_{r,2} and embeddings
     ◮ All parameters are trained by SGD
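
     The two energy functions above can be sketched in a few lines of NumPy. This is only an illustration, not the authors' code; the embedding dimension, the ℓ1 norm choice, and the random parameters below are arbitrary stand-ins for trained values.

     ```python
     import numpy as np

     d = 50  # embedding dimension (illustrative choice)

     def transe_energy(h, r, t, p=1):
         """TransE: f_r(h, t) = ||h + r - t||_p; true triples should get low energy."""
         return np.linalg.norm(h + r - t, ord=p)

     def stranse_energy(h, r, t, W_r1, W_r2, p=1):
         """STransE: f_r(h, t) = ||W_{r,1} h + r - W_{r,2} t||_p with relation-specific projections."""
         return np.linalg.norm(W_r1 @ h + r - W_r2 @ t, ord=p)

     # Toy usage with random parameters; in practice all of these are trained by SGD
     # so that observed triples have lower energy than corrupted ones.
     rng = np.random.default_rng(0)
     h, r, t = rng.normal(size=(3, d))
     W_r1, W_r2 = rng.normal(size=(2, d, d))
     print(transe_energy(h, r, t), stranse_energy(h, r, t, W_r1, W_r2))
     ```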

  9. STransE: Parametrizing Each Relation Separately
     ◮ Prone to the data sparsity problem

  10. Sharing Parameters through Common Concepts
      ◮ Relation-concept mapping example with attention weights (figure)
      ◮ Parametrize concepts instead of relations
      ◮ Relation matrices are weighted averages of concept matrices with attention weights, e.g. W_{r1,1} = 0.2 D_1 + 0.8 D_2
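
      A minimal sketch of the parameter-sharing idea above, under assumed sizes: a small pool of shared concept matrices D_i, with each relation's projection matrix formed as an attention-weighted average of them. The pool size, dimension, and weights are invented for illustration.

      ```python
      import numpy as np

      d, n_concepts = 50, 5                                         # illustrative sizes
      D = np.random.default_rng(0).normal(size=(n_concepts, d, d))  # shared concept matrices D_1..D_5

      def relation_matrix(alpha, D):
          """Compose a relation-specific projection as sum_i alpha_i * D_i."""
          return np.tensordot(np.asarray(alpha), D, axes=1)

      # Example from the slide: W_{r1,1} = 0.2 D_1 + 0.8 D_2
      alpha_r1_head = [0.2, 0.8, 0.0, 0.0, 0.0]
      W_r1_head = relation_matrix(alpha_r1_head, D)
      print(W_r1_head.shape)  # (50, 50)
      ```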

  11. Sharing Parameters through Common Concepts
      ◮ Suppose a ground-truth mapping is given; then:
        ◮ Transfer learning can be done effectively through parameter sharing
        ◮ We can interpret similar relations
      ◮ All parameters are trainable by SGD
      ◮ Concepts need to be learned end-to-end
      ◮ How do we obtain the mapping?

  12. Dense Mapping
      ◮ Dense attention: construct a dense bipartite graph and train the attention weights
      ◮ Problems:
        ◮ Uninterpretable: in practice, even with ℓ1 regularization, we get distributed weights, e.g. W_{r1,1} = 0.2 D_1 + 0.52 D_2 + 0.1 D_3 + 0.15 D_4 + 0.03 D_5
        ◮ Inefficient: computation involves all concept matrices
        ◮ Unnecessary: intuitively, each relation can be composed of at most K concepts

  13. Sparse Mapping
      ◮ Problem: the sparse mapping (at most K concepts per relation) is not differentiable
      ◮ An approximate approach (see the sketch after the next slide):
        ◮ Given the current embeddings, a correct mapping should minimize the loss function
        ◮ For each relation, assign a single concept to it and compute the loss
        ◮ Greedily choose the top K concepts that minimize the loss

  14. Block Iterative Optimization
      ◮ Randomly initialize mappings and concepts
      ◮ Repeat:
        ◮ Optimize embeddings and attention weights with SGD
        ◮ Reassign the mappings
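
      A rough sketch of slides 13 and 14 together: because the at-most-K constraint on the relation-to-concept mapping is not differentiable, the mapping is reassigned greedily between SGD phases. The loss matrix and the SGD step below are stand-ins; in the real model the per-concept losses come from the current embeddings.

      ```python
      import numpy as np

      rng = np.random.default_rng(0)
      n_relations, n_concepts, K = 8, 5, 2

      def reassign_mappings(loss, K):
          """Greedy l0 step: for each relation, keep the K concepts with the lowest loss."""
          mapping = np.zeros_like(loss, dtype=bool)
          for r in range(loss.shape[0]):
              mapping[r, np.argsort(loss[r])[:K]] = True
          return mapping

      # loss[r, c] stands in for the training loss on relation r's triples when its
      # projection is tied to concept c alone; here it is random just to run the sketch.
      loss = rng.random((n_relations, n_concepts))
      mapping = reassign_mappings(loss, K)

      # Block-iterative schedule: alternate SGD on embeddings/attention with reassignment.
      for iteration in range(3):
          # (hypothetical) sgd_phase(embeddings, attention_weights, mapping)
          loss = rng.random((n_relations, n_concepts))  # pretend the losses changed after SGD
          mapping = reassign_mappings(loss, K)
      ```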

  15. A Better Sampling Approach: Domain Sampling
      ◮ The loss function involves negative sampling
      ◮ Sample from domain-specific entities with an adaptive probability
      ◮ E.g., negative samples of (Steve Jobs, was born in, US):
        ◮ Uniform negative sample: (Steve Jobs, was born in, CMU)
        ◮ Domain negative sample: (Steve Jobs, was born in, China)
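
      A hedged sketch of the domain-sampling idea: when corrupting a triple, draw the replacement tail from the entities already observed with that relation (its domain) most of the time, and from all entities otherwise. The toy data and the fixed fallback probability are assumptions; the paper uses an adaptive, relation-dependent probability and corrupts heads as well.

      ```python
      import random

      # Toy triples (head, relation, tail); real data would come from the KB.
      triples = [
          ("Steve Jobs", "was born in", "US"),
          ("Sergey Brin", "was born in", "Russia"),
          ("Leonardo DiCaprio", "won award", "Oscar"),
      ]
      entities = {e for h, _, t in triples for e in (h, t)}
      tail_domain = {}
      for h, r, t in triples:
          tail_domain.setdefault(r, set()).add(t)   # tails ever observed with relation r

      def corrupt_tail(h, r, t, p_domain=0.9):
          """Return a negative triple, preferring tails from the relation's domain
          (e.g. another country for 'was born in') over arbitrary entities."""
          pool = tail_domain[r] - {t}
          if pool and random.random() < p_domain:
              return (h, r, random.choice(sorted(pool)))
          return (h, r, random.choice(sorted(entities - {t})))

      print(corrupt_tail("Steve Jobs", "was born in", "US"))
      ```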

  16. Outline (section divider: Experiments)

  17. Main Results

      Table 1: Link prediction results on two datasets. Hits@10 is the top-10 accuracy (%); higher Hits@10 or lower Mean Rank (MR) indicates better performance.

      Model                                  Additional Information   WN18 MR    WN18 Hits@10   FB15k MR   FB15k Hits@10
      SE [Bordes et al., 2011]               No                       985        80.5           162        39.8
      Unstructured [Bordes et al., 2014]     No                       304        38.2           979        6.3
      TransE [Bordes et al., 2013]           No                       251        89.2           125        47.1
      TransH [Wang et al., 2014]             No                       303        86.7           87         64.4
      TransR [Lin et al., 2015b]             No                       225        92.0           77         68.7
      CTransR [Lin et al., 2015b]            No                       218        92.3           75         70.2
      KG2E [He et al., 2015]                 No                       348        93.2           59         74.0
      TransD [Ji et al., 2015]               No                       212        92.2           91         77.3
      TATEC [García-Durán et al., 2016]      No                       -          -              58         76.7
      NTN [Socher et al., 2013]              No                       -          66.1           -          41.4
      DISTMULT [Yang et al., 2015]           No                       -          94.2           -          57.7
      STransE [Nguyen et al., 2016]          No                       206 (244)  93.4 (94.7)    69         79.7
      ITransF                                No                       205        94.2           65         81.0
      ITransF (domain sampling)              No                       223        95.2           77         81.4
      RTransE [García-Durán et al., 2015]    Path                     -          -              50         76.2
      PTransE [Lin et al., 2015a]            Path                     -          -              58         84.6
      NLFeat [Toutanova and Chen, 2015]      Node + Link Features     -          94.3           -          87.0
      Random Walk [Wei et al., 2016]         Path                     -          94.8           -          74.7
      IRN [Shen et al., 2016]                External Memory          249        95.3           38         92.7
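
      For reference, the two metrics in Table 1 are computed from the rank of the gold entity among all candidate completions of each test triple; a minimal sketch with toy ranks:

      ```python
      def mean_rank(ranks):
          """Mean Rank: average position of the gold entity (lower is better)."""
          return sum(ranks) / len(ranks)

      def hits_at_k(ranks, k=10):
          """Hits@k: percentage of test triples whose gold entity ranks in the top k."""
          return 100.0 * sum(r <= k for r in ranks) / len(ranks)

      ranks = [1, 3, 12, 7, 250, 2]   # toy ranks, one per test triple
      print(mean_rank(ranks), hits_at_k(ranks))
      ```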

  18. Performance on Rare Relations
      Figure 2: Average Hits@10 on WN18 relations, ordered from frequent to rare (ITransF (ours) vs. STransE).

  19. Performance on Rare Relations
      Figure 3: Average Hits@10 on relation bins of different frequencies (Frequent, Medium, Rare) on (a) WN18 and (b) FB15k (ITransF (ours) vs. STransE).

  20. Interpretability: How Is Knowledge Shared?
      ◮ Each relation's head and tail have their own concepts.
      Figure 4: Heatmap visualization of attention weights on (a) WN18 and (b) FB15k.

  21. Interpretability: How Is Knowledge Shared?
      ◮ Each relation's head and tail have their own concepts.
      ◮ Interpretation:
        ◮ Reverse relations: hyponym and hypernym; award winning work and won award for.
      Figure 5: Heatmap visualization of attention weights on (a) WN18 and (b) FB15k.

  22. Interpretability: How Is Knowledge Shared?
      ◮ Each relation's head and tail have their own concepts.
      ◮ Interpretation:
        ◮ Reverse relations: hyponym and hypernym; award winning work and won award for.
        ◮ Undirected relations: spouse; similar to.
      Figure 6: Heatmap visualization of attention weights on (a) WN18 and (b) FB15k.

  23. Interpretability: How Is Knowledge Shared?
      ◮ Each relation's head and tail have their own concepts.
      ◮ Interpretation:
        ◮ Reverse relations: hyponym and hypernym; award winning work and won award for.
        ◮ Undirected relations: spouse; similar to.
        ◮ Similar relations: was nominated for and won award for; instance hypernym and hypernym.
      (Figure: heatmap visualization of attention weights on (a) WN18 and (b) FB15k.)

  24. Interpretability of the ℓ1-Regularized Dense Mapping
      Figure 8: Heatmap visualization of the ℓ1-regularized dense mapping on (a) WN18 and (b) FB15k.
      ◮ The dense mapping cannot be made sparse without a performance loss.

  25. A Byproduct of Parameter Sharing: Model Compression
      Figure 9: Hits@10 with different numbers of concepts on (a) FB15k and (b) WN18 (ITransF vs. STransE and CTransR).
      ◮ On FB15k, the model can be compressed by nearly 90 times.
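
      A back-of-the-envelope check of the "nearly 90 times" figure, counting only relation projection matrices and assuming that STransE's two matrices per FB15k relation are replaced by a shared pool of roughly 30 concept matrices (an assumption read off Figure 9's x-axis, not stated on this slide); entity and relation embeddings are ignored here.

      ```python
      n_relations_fb15k = 1345                    # relations in FB15k
      stranse_matrices = 2 * n_relations_fb15k    # two relation-specific projections each
      itransf_concepts = 30                       # assumed size of the shared concept pool
      print(stranse_matrices / itransf_concepts)  # ~89.7, i.e. roughly a 90x reduction
      ```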

  26. Analysis on Sparseness
      ◮ Does sparseness hurt performance?

      Table 2: Performance of the model with a dense graph or a sparse graph with only 15 or 22 concepts. The time gap is more significant when more concepts are used.

      Method        WN18 MR   WN18 H10   WN18 Time   FB15k MR   FB15k H10   FB15k Time
      Dense         199       94.0       4m34s       69         79.4        4m30s
      Dense + ℓ1    228       94.2       4m25s       131        78.9        5m47s
      Sparse        207       94.1       2m32s       67         79.6        1m52s

      ◮ How does our approach compare to sparse encoding methods?

      Table 3: Different methods to obtain sparse representations.

      Method                                              WN18 MR   WN18 H10   FB15k MR   FB15k H10
      Pretrain + Sparse Encoding [Faruqui et al., 2015]   211       86.6       66         79.1
      Ours                                                205       94.2       65         81.0

  27. Conclusion
      ◮ Propose a knowledge embedding model which can discover shared hidden concepts
      ◮ Perform transfer learning through parameter sharing
      ◮ Design a learning algorithm to induce the interpretable sparse representation
      ◮ Outperform baselines on two benchmark datasets for the knowledge base completion task
