Interpretable and Compositional Relation Learning by Joint Training with an Autoencoder



  1. Interpretable and Compositional Relation Learning by Joint Training with an Autoencoder. Ryo Takahashi*¹, Ran Tian*¹, Kentaro Inui¹,² (*equal contribution). ¹Tohoku University, ²RIKEN, Japan

  2. Task: Knowledge Base Completion
     • Knowledge Bases (KBs) store a large amount of facts in the form of <head entity, relation, tail entity> triples, e.g. <The Matrix, country_of_film, Australia> and <The Matrix, country_of_film, United States>.
     • The Knowledge Base Completion (KBC) task aims to predict the missing part of an incomplete triple, e.g. <Finding Nemo, country_of_film, ?>, and thus helps discover missing facts in a KB.
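Since KBC prediction amounts to ranking candidate tail entities, the query format can be sketched as below; this is a toy Python illustration, where the scoring function is a random placeholder standing in for a trained model.

```python
import numpy as np

# Toy sketch of a KBC query <Finding Nemo, country_of_film, ?>: a trained
# model scores every candidate tail entity, and prediction means ranking
# candidates by that score. The score below is a random placeholder.
entities = ["Australia", "United States", "Finding Nemo", "The Matrix"]
rng = np.random.default_rng(0)

def score(head: str, relation: str, tail: str) -> float:
    return rng.random()  # placeholder; later slides define real score functions

ranked = sorted(entities, key=lambda t: -score("Finding Nemo", "country_of_film", t))
print(ranked)  # candidate tails, best first (random here)
```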

  3. Vector-Based Approach
     A common approach to KBC is to model triples in a low-dimensional vector space:
     • Entity: represented by a low-dimensional vector, so that similar entities are close to each other.
     • Relation: represented as a transformation of the vector space, which can be a vector translation, a linear map, or a non-linear map (up to design choice).
     (Figure: Australia, US, Finding Nemo, The Matrix as points in the vector space.)
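A minimal sketch of this setup; the dimension, entity names, and values are illustrative, with random vectors standing in for trained embeddings.

```python
import numpy as np

# Entities as low-dimensional vectors; training should place similar
# entities close together. Random values stand in for trained embeddings.
e = 4  # entity embedding dimension (illustrative)
rng = np.random.default_rng(0)
emb = {name: rng.standard_normal(e)
       for name in ["Australia", "US", "Finding Nemo", "The Matrix"]}

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Similar entities should have high cosine similarity after training.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(emb["Finding Nemo"], emb["The Matrix"]))
```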

  4. Two Popular Types of Relation Representations
     • TransE [Bordes+'13]: relation as a vector translation, $\boldsymbol{v}_h + \boldsymbol{s}_r \approx \boldsymbol{w}_t$, with $e$ parameters per relation. Intuitively suitable for 1-to-1 relations such as currency (Australia → AUD, US → USD), which map the same number of entities while preserving the distances within.
     • Bilinear [Nickel+'11]: relation as a linear transformation (matrix), scored by $\boldsymbol{v}_h^\top \boldsymbol{N}_r \boldsymbol{w}_t$, with $e^2$ parameters per relation. Flexibly models N-to-N relations such as country_of_film (The Matrix → Australia, US; Finding Nemo → US). We follow this approach.
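The two score functions can be sketched as follows; the dimension and values are illustrative, but the bilinear form matches the one used throughout these slides.

```python
import numpy as np

e = 4
rng = np.random.default_rng(0)
v_h, w_t = rng.standard_normal(e), rng.standard_normal(e)  # head/tail vectors
s_r = rng.standard_normal(e)       # TransE: relation as a translation vector
N_r = rng.standard_normal((e, e))  # bilinear: relation as an e x e matrix

def transe_score(v_h, s_r, w_t):
    # Higher is better: v_h + s_r should land close to w_t (1-to-1 bias).
    return -np.linalg.norm(v_h + s_r - w_t)

def bilinear_score(v_h, N_r, w_t):
    # v_h^T N_r w_t: the matrix can mix dimensions, fitting N-to-N relations.
    return float(v_h @ N_r @ w_t)

print(transe_score(v_h, s_r, w_t), bilinear_score(v_h, N_r, w_t))
```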

  5. Matrices are Difficult to Train
     • More parameters than an entity vector: an entity vector is low-dimensional ($e$ parameters), while a relation matrix is high-dimensional ($e^2$ parameters).
     • The training objective built on $\boldsymbol{v}_h^\top \boldsymbol{N}_r \boldsymbol{w}_t$ is highly non-convex.
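To make the $e$ vs. $e^2$ gap concrete, here is the parameter count for an assumed embedding dimension, using the FB15k-237 entity/relation counts from slide 16.

```python
# Parameter counts behind the "e vs. e^2" comparison, for an assumed
# embedding dimension; entity/relation counts are FB15k-237 (slide 16).
e = 256
n_entities, n_relations = 14_541, 237
entity_params = n_entities * e         # one e-dim vector per entity
relation_params = n_relations * e * e  # one e x e matrix per relation
print(entity_params)    # 3,722,496
print(relation_params)  # 15,532,032: relations dominate despite being few
```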

  6. In this work, we:
     ① propose jointly training relation matrices with an autoencoder, in order to reduce the high dimensionality;
     ② modify SGD with separated learning rates, in order to handle the highly non-convex training objective;
     ③ use the modified SGD to enhance joint training with the autoencoder;
     ④ apply other techniques for training relation matrices;
     and achieve state-of-the-art results on standard KBC datasets.

  7. TRAINING TECHNIQUES

  8. ① Joint Training with an Autoencoder
     • Base model: represent relations as matrices $\boldsymbol{N}_r$ in a bilinear model, which can be extended with compositional training, e.g. $\boldsymbol{v}_h^\top \boldsymbol{N}_{r_1} \boldsymbol{N}_{r_2} \boldsymbol{w}_t$ [Nickel+'11, Guu+'15, Tian+'16].
     • Proposed: train an autoencoder, jointly with the base model, to reconstruct each relation matrix (original $\boldsymbol{N}_r$, reconstructed $\boldsymbol{N}_r'$, both $e^2$-dimensional) from a $d$-dimensional coding, with $d \ll e^2$.
     • This differs from usual autoencoders, in which the original input is not updated.
     • Benefits: (1) reduce the high dimensionality of relation matrices; (2) help learn composition of relations.
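A minimal sketch of the joint objective, assuming a linear encoder/decoder with a ReLU coding; the slide fixes only the shapes, so this particular parameterization is an assumption.

```python
import numpy as np

e, d = 4, 2                          # e^2-dim matrices, d-dim codings, d << e^2
rng = np.random.default_rng(0)
N_r = rng.standard_normal((e, e))    # a relation matrix, itself a trained parameter
A = rng.standard_normal((d, e * e))  # encoder weights (assumed linear)
B = rng.standard_normal((e * e, d))  # decoder weights (assumed linear)

n = N_r.reshape(-1)                  # flatten the e x e matrix to e^2 dims
c = np.maximum(A @ n, 0.0)           # low-dimensional coding (assumed ReLU)
n_rec = B @ c                        # reconstruction N'_r

# Joint training minimizes the KB objective plus this reconstruction loss;
# unlike a usual autoencoder, the "input" N_r is updated as well.
recon_loss = float(np.sum((n_rec - n) ** 2))
print(recon_loss)
```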

  9. ① Joint Training with an Autoencoder (cont.)
     • However, joint training is not easy to carry out: the training objective is highly non-convex, so optimization easily falls into local minima.

  10. ② Modified SGD (Separated Learning Rates)
      Our strategy: different learning rates for different parts of our model.
      • Previous: the common practice for setting SGD learning rates [Bottou, 2012] uses a single schedule, $\beta(\upsilon) := \theta / (1 + \theta\mu\upsilon)$.
      • Modified: different parts of a neural network may have different learning rates, e.g. $\beta_{\mathrm{KB}}(\upsilon_r) := \theta_{\mathrm{KB}} / (1 + \theta_{\mathrm{KB}}\mu_{\mathrm{KB}}\upsilon_r)$ and $\beta_{\mathrm{AE}}(\upsilon_r) := \theta_{\mathrm{AE}} / (1 + \theta_{\mathrm{AE}}\mu_{\mathrm{AE}}\upsilon_r)$.
      Notation: $\theta$ is the initial learning rate ($\theta_{\mathrm{KB}}$ for the KB-learning objective, $\theta_{\mathrm{AE}}$ for the autoencoder objective); $\mu$ is the coefficient of the L2-regularizer ($\mu_{\mathrm{KB}}$ and $\mu_{\mathrm{AE}}$ likewise); $\upsilon$ counts trained examples, with $\upsilon_e$ a counter for each entity $e$ and $\upsilon_r$ a counter for each relation $r$.
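The schedules above can be written directly in code; the $\theta$ and $\mu$ values below are illustrative, not the paper's settings.

```python
def lr_common(theta: float, mu: float, upsilon: int) -> float:
    # Common SGD schedule [Bottou, 2012]: one global example counter.
    return theta / (1.0 + theta * mu * upsilon)

def lr_separated(theta_part: float, mu_part: float, upsilon_part: int) -> float:
    # Modified schedule: each objective (KB, AE) and each entity/relation
    # keeps its own counter and its own theta/mu.
    return theta_part / (1.0 + theta_part * mu_part * upsilon_part)

# A relation seen 10,000 times decays further than one seen 100 times:
print(lr_separated(0.1, 1e-4, 10_000))  # ~0.0909
print(lr_separated(0.1, 1e-4, 100))     # ~0.0999
```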

  11. ② Modified SGD (Separated Learning Rates, cont.)
      • With per-entity counters $\upsilon_e$ and per-relation counters $\upsilon_r$, the learning rates for frequent entities and relations can decay more quickly.

  12. ② Modified SGD: Rationale
      • A neural network can usually be decomposed into several parts, each of which is convex when the other parts are fixed.
      • Training a neural network is thus approximately the joint co-training of many simple convex models.
      • It is then natural to assume a different learning rate for each part, as illustrated below.
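A small illustration of the "convex when other parts are fixed" point, using the bilinear score with a squared loss; the loss choice here is illustrative.

```python
import numpy as np

# With N_r and w_t held fixed, the bilinear score v_h @ N_r @ w_t is
# linear in v_h, so a squared loss is a convex quadratic in v_h alone;
# non-convexity only arises when all parts are trained jointly.
e = 4
rng = np.random.default_rng(0)
N_r, w_t = rng.standard_normal((e, e)), rng.standard_normal(e)
a = N_r @ w_t                           # fixed direction once N_r, w_t are fixed

def loss(v_h: np.ndarray) -> float:
    return float((v_h @ a - 1.0) ** 2)  # convex in v_h

print(loss(rng.standard_normal(e)))
```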

  13. ③ Learning Rates for Joint Training
      • $\beta_{\mathrm{KB}}(\upsilon_r)$ governs the KB objective, which tries to predict entities; $\beta_{\mathrm{AE}}(\upsilon_r)$ governs the autoencoder (AE) objective, which tries to fit the matrices to the low-dimensional coding.
      • At the beginning of training, the AE is initialized randomly, so fitting matrices to it does not make much sense: $\theta_{\mathrm{AE}}$ is kept small relative to $\theta_{\mathrm{KB}}$.
      • As training proceeds, $\beta_{\mathrm{KB}}$ and $\beta_{\mathrm{AE}}$ should balance.
      (Figure: the two decay curves $\beta(\upsilon_r)$, approaching $1/(\mu_{\mathrm{KB}}\upsilon_r)$ and $1/(\mu_{\mathrm{AE}}\upsilon_r)$ as $\upsilon_r$ grows from 0.)
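A numeric illustration of this balance, with assumed $\theta$ and $\mu$ values: taking $\theta_{\mathrm{AE}} < \theta_{\mathrm{KB}}$ keeps the AE rate small early, and both schedules flatten toward $1/(\mu\,\upsilon_r)$ as the counter grows.

```python
# Assumed values, chosen only to show the shape of the two schedules.
theta_kb, mu_kb = 0.1, 1e-4
theta_ae, mu_ae = 0.01, 1e-4

for upsilon_r in [1, 100, 10_000, 1_000_000]:
    b_kb = theta_kb / (1 + theta_kb * mu_kb * upsilon_r)
    b_ae = theta_ae / (1 + theta_ae * mu_ae * upsilon_r)
    # The AE rate starts ~10x smaller; the gap narrows over training.
    print(upsilon_r, round(b_kb, 6), round(b_ae, 6))
```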

  14. ④ Other Training Techniques
      • Normalization (+2.6 Hits@10 on FB15k-237): normalize relation matrices to $\lVert\boldsymbol{N}_r\rVert = \sqrt{e}$ during training.
      • Regularization (+1.2 Hits@10): push $\boldsymbol{N}_r$ toward an orthogonal matrix by minimizing $\lVert \boldsymbol{N}_r^\top \boldsymbol{N}_r - \tfrac{1}{e}\operatorname{tr}(\boldsymbol{N}_r^\top \boldsymbol{N}_r)\,\boldsymbol{I} \rVert$.
      • Initialization (+0.4 Hits@10): initialize $\boldsymbol{N}_r$ as $(\boldsymbol{I} + \boldsymbol{G})/2$, where $\boldsymbol{G}$ is a random Gaussian matrix, instead of a pure Gaussian matrix.
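Sketches of the three tricks follow; the Frobenius-norm target $\sqrt{e}$ and the exact form of $(\boldsymbol{I} + \boldsymbol{G})/2$ are reconstructed from the slide, so treat them as assumptions.

```python
import numpy as np

e = 4
rng = np.random.default_rng(0)

# Initialization: identity plus Gaussian noise, halved, instead of a
# pure Gaussian matrix (assumed reading of "(I + G)/2").
G = rng.standard_normal((e, e))
N_r = (np.eye(e) + G) / 2.0

# Normalization during training: rescale the Frobenius norm to sqrt(e),
# the norm an e x e orthogonal matrix would have (assumed target).
N_r *= np.sqrt(e) / np.linalg.norm(N_r)

# Regularization: penalize the distance of N_r^T N_r from a scalar
# multiple of the identity, pushing N_r toward an orthogonal matrix.
M = N_r.T @ N_r
penalty = float(np.linalg.norm(M - (np.trace(M) / e) * np.eye(e)))
print(penalty)
```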

  15. EXPERIMENTS

  16. Datasets for Knowledge Base Completion

      Dataset                          #Entity  #Relation  #Train   #Valid  #Test
      WN18RR [Dettmers+'18]            40,943   11         86,835   3,034   3,134
      FB15k-237 [Toutanova&Chen'15]    14,541   237        272,115  17,535  20,466

      • WN18RR: a subset of WordNet [Miller '95].
      • FB15k-237: a subset of Freebase [Bollacker+'08].
      • The previous WN18 and FB15k datasets have an information-leakage issue (refer to our paper for test results on them).
      • Models are evaluated by how high they rank the gold test triples.

  17. Base Model vs. Joint Training with an Autoencoder

      Model            WN18RR               FB15k-237
                       MR    MRR   H10      MR   MRR   H10
      BASE             2447  .310  54.1     203  .328  51.5
      JOINT with AE    2268  .343  54.8     197  .331  51.6

      • Models: BASE is the bilinear model [Nickel+'11]; JOINT is the proposed joint training of relation matrices with an autoencoder.
      • Metrics: MR (Mean Rank, lower is better), MRR (Mean Reciprocal Rank, higher is better), H10 (Hits@10, higher is better).
      • Joint training with an autoencoder improves upon the base bilinear model.
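The three metrics are simple functions of the rank assigned to each gold test triple; a sketch with illustrative ranks follows.

```python
import numpy as np

# Each gold test triple receives a rank among all candidates; the ranks
# below are illustrative, not from the experiments.
ranks = np.array([1, 3, 12, 2, 120])

mr  = ranks.mean()                # Mean Rank: lower is better
mrr = (1.0 / ranks).mean()        # Mean Reciprocal Rank: higher is better
h10 = (ranks <= 10).mean() * 100  # Hits@10 in %: higher is better
print(mr, round(mrr, 3), h10)     # 27.6 0.385 60.0
```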

  18. Compared to Previous Research

      Model                         WN18RR               FB15k-237
                                    MR    MRR   H10      MR    MRR   H10
      Ours (with normalization, regularization, initialization):
      BASE                          2447  .310  54.1     203   .328  51.5
      JOINT with AE                 2268  .343  54.8     197   .331  51.6
      Re-experiments:
      TransE [Bordes+'13]           4311  .202  45.6     278   .236  41.6
      RESCAL [Nickel+'11]           9689  .105  20.3     457   .178  31.9
      HolE [Nickel+'16]             8096  .376  40.0     1172  .169  30.9
      Published results:
      ComplEx [Trouillon+'16]       5261  .440  51.0     339   .247  42.8
      ConvE [Dettmers+'18]          5277  .460  48.0     246   .316  49.1

      • Our base model is competitive enough.
      • Our models achieved state-of-the-art results.
