A Meta-Transfer Objective for Learning to Disentangle Causal Mechanisms

Behrad Moniri, Mahdiyar Shahbazi
Department of Electrical Engineering, Sharif University of Technology
December 30, 2019

Behrad Moniri - Mahdiyar Shahbazi, Transfer Learning and Causal Inference, December 30, 2019, 1 / 23
Accepted at ICLR 2020. Code available on GitHub.
Introduction

Idea 1: What are the right representations? Causal variables explaining the data.

Idea 2: How do we modularize knowledge for easier re-use, adaptation, and good transfer? How do we disentangle the unobserved explanatory variables?
Hypotheses about how the environment changes

Main assumptions:
- Changing one mechanism does not change the others (Peters, Janzing & Schölkopf, 2017).
- Non-stationarities, i.e. changes in distribution, involve few mechanisms (e.g. the result of a single-variable intervention).
Claims

Under the hypothesis of independent mechanisms and small changes across different distributions, we get a smaller sample complexity to recover from a distribution change, e.g. for transfer learning, agent learning, domain adaptation, etc.
Learning a Causal Graph with Two Discrete Variables

If we have the right knowledge representation, then we should get fast adaptation to the transfer distribution when starting from a model that is well trained on the training distribution.

Core idea: a "regret" function based on the speed of adaptation.

However, much more work will be needed to evaluate the proposed approach in a diversity of settings and with different specific parametrizations, training objectives, environments, etc.
Let both $A$ and $B$ be discrete variables, each taking $N$ possible values, and consider the following two parametrizations:
$$P_{A \to B}(A, B) = P_{A \to B}(A)\, P_{A \to B}(B \mid A)$$
$$P_{B \to A}(A, B) = P_{B \to A}(B)\, P_{B \to A}(A \mid B)$$
This amounts to four modules: $P_{A \to B}(A)$, $P_{A \to B}(B \mid A)$, $P_{B \to A}(B)$ and $P_{B \to A}(A \mid B)$. We train both models independently. Maximum likelihood estimation of these parameters: normalized relative frequencies.

$\theta$: parameters of all these modules, $\theta_A$, $\theta_{B \mid A}$, $\theta_B$, $\theta_{A \mid B}$.
$$\theta_i = P_{A \to B}(A = i), \quad \theta_{j \mid i} = P_{A \to B}(B = j \mid A = i), \quad \eta_j = P_{B \to A}(B = j), \quad \eta_{i \mid j} = P_{B \to A}(A = i \mid B = j).$$
Maximum-likelihood estimates from the counts:
$$\hat\theta_i = n_i / n, \quad \hat\theta_{j \mid i} = n_{ij} / n_i, \quad \hat\eta_j = n_j / n, \quad \hat\eta_{i \mid j} = n_{ij} / n_j.$$
We can now compute the likelihood for each model:
$$\hat P_{A \to B}(A = i, B = j) = \hat\theta_i\, \hat\theta_{j \mid i} = n_{ij} / n, \qquad \hat P_{B \to A}(A = i, B = j) = \hat\eta_j\, \hat\eta_{i \mid j} = n_{ij} / n.$$
Both models achieve the same likelihood on the training data. Which direction can adapt faster? Answer: the causal direction.
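The count-based estimation above can be sketched in a few lines of NumPy (a toy illustration; the ground-truth distributions, the value range $N$, and the sample size are arbitrary choices, not from the paper). It confirms that both factorizations recover the same joint, $n_{ij}/n$:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5           # values each variable can take (arbitrary choice)
n = 10_000      # sample size (arbitrary choice)

# Ground truth is A -> B (assumed for the illustration)
p_a = rng.dirichlet(5 * np.ones(N))                  # true P(A)
p_b_given_a = rng.dirichlet(5 * np.ones(N), size=N)  # true P(B|A), one row per a

a = rng.choice(N, size=n, p=p_a)
b = np.array([rng.choice(N, p=p_b_given_a[ai]) for ai in a])

# Joint counts n_ij and marginal counts n_i, n_j
n_ij = np.zeros((N, N))
np.add.at(n_ij, (a, b), 1)
n_i, n_j = n_ij.sum(axis=1), n_ij.sum(axis=0)

# Maximum-likelihood estimates for the four modules
theta_i = n_i / n                      # P_{A->B}(A)
theta_j_given_i = n_ij / n_i[:, None]  # P_{A->B}(B|A)
eta_j = n_j / n                        # P_{B->A}(B)
eta_i_given_j = n_ij / n_j[None, :]    # P_{B->A}(A|B)

# Both factorizations reproduce the same joint, n_ij / n
joint_fwd = theta_i[:, None] * theta_j_given_i
joint_bwd = eta_j[None, :] * eta_i_given_j
print(np.allclose(joint_fwd, n_ij / n), np.allclose(joint_bwd, n_ij / n))
```

Since $\hat\theta_i \hat\theta_{j \mid i} = (n_i/n)(n_{ij}/n_i) = n_{ij}/n$ and likewise for the backward model, the training likelihoods are identical, which is why training fit alone cannot identify the causal direction.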
Simulation

[Figure: online adaptation curves of $\log P(D)$ (y-axis, roughly $-5.0$ to $-4.2$) versus number of examples (x-axis, 0 to 400), for the two models $A \to B$ and $B \to A$.]
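A minimal sketch of the experiment behind this figure (assumptions: online SGD on softmax-logit parametrizations, with $N$, the learning rate, and the episode length chosen arbitrarily). Both directions start at the training-distribution optimum, an intervention changes only $P(A)$, and both then adapt online on transfer samples:

```python
import numpy as np

rng = np.random.default_rng(1)
N, T, lr = 10, 1000, 0.5   # values, adaptation steps, learning rate (arbitrary)

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Ground truth A -> B; the intervention changes only P(A)
p_b_given_a = rng.dirichlet(np.ones(N), size=N)
p_a_train, p_a_transfer = rng.dirichlet(np.ones(N)), rng.dirichlet(np.ones(N))

# "Perfectly trained" modules: logits set to the training-distribution optimum
joint = p_a_train[:, None] * p_b_given_a
fwd_a, fwd_ba = np.log(p_a_train), np.log(p_b_given_a)   # A -> B modules
p_b = joint.sum(axis=0)
bwd_b, bwd_ab = np.log(p_b), np.log((joint / p_b).T)     # B -> A modules

I = np.eye(N)
ll_fwd = ll_bwd = 0.0
for _ in range(T):
    a = rng.choice(N, p=p_a_transfer)
    b = rng.choice(N, p=p_b_given_a[a])
    # Forward model: log P(a) + log P(b|a); one SGD step per module
    # (gradient of log-softmax w.r.t. logits is onehot - softmax)
    pa, pba = softmax(fwd_a), softmax(fwd_ba[a])
    ll_fwd += np.log(pa[a]) + np.log(pba[b])
    fwd_a += lr * (I[a] - pa)
    fwd_ba[a] += lr * (I[b] - pba)
    # Backward model: log P(b) + log P(a|b)
    pb, pab = softmax(bwd_b), softmax(bwd_ab[b])
    ll_bwd += np.log(pb[b]) + np.log(pab[a])
    bwd_b += lr * (I[b] - pb)
    bwd_ab[b] += lr * (I[a] - pab)

print(ll_fwd > ll_bwd)  # the causal direction accumulates higher log-likelihood
```

The forward model only has to relearn its marginal module, while the backward model must relearn both its marginal and its conditional, so its cumulative log-likelihood lags behind, matching the gap between the two curves in the figure.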
Proposition

The expected gradient over the transfer distribution of the regret (the accumulated negative log-likelihood during the adaptation episode) with respect to the module parameters is zero for the parameters of the modules that (a) were correctly learned in the training phase, and (b) have the correct set of causal parents, corresponding to the ground-truth causal graph, if (c) the corresponding ground-truth conditional distributions did not change from the training distribution to the transfer distribution.
As a consequence, the effective number of parameters that need to be adapted, when one has the correct causal graph structure, is reduced to those of the mechanisms that actually changed from the training to the transfer distribution.
Proposition

Consider conditional probability modules $P_{\theta_i}(V_i \mid \mathrm{pa}(i, V, B_i))$, where $B_{ij} = 1$ indicates that $V_j$ is among the parents $\mathrm{pa}(i, V, B_i)$ of $V_i$ in a directed acyclic causal graph. Consider ground-truth training distribution $P_1$ and transfer distribution $P_2$ over these variables, and ground-truth causal structure $B$. The joint log-likelihood $L(V)$ for a sample $V$ with respect to the module parameters $\theta$, decomposed into module parameters $\theta_i$, is
$$L(V) = \sum_i \log P_{\theta_i}(V_i \mid \mathrm{pa}(i, V, B_i)).$$
If (a) the model has the correct causal structure $B$, (b) it has been trained perfectly on $P_1$, leading to estimated parameters $\hat\theta$, and (c) the ground truth $P_1$ and $P_2$ differ only in some $P(V_i \mid \mathrm{pa}(i, V, B_i))$ for $i \in C$, then
$$\mathbb{E}_{V \sim P_2}\!\left[\frac{\partial L(V)}{\partial \theta_i}\right] = 0 \quad \text{for } i \notin C.$$
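The proposition can be checked in closed form for the bivariate case (a sketch with a softmax parametrization; $N$ and the distributions are arbitrary). When only $P(A)$ changes, the expected gradient of the unchanged module $P(B \mid A)$ under the transfer distribution is a reweighting of per-row terms $P(B \mid A = a) - \mathrm{softmax}(\theta_a)$, which all vanish at the training optimum:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 4  # arbitrary

p_b_given_a = rng.dirichlet(np.ones(N), size=N)  # unchanged mechanism P(B|A)
p_a_transfer = rng.dirichlet(np.ones(N))         # only P(A) changed under P2

# Conditional module trained perfectly on P1: softmax(theta[a]) = P(B|A=a)
theta = np.log(p_b_given_a)
model = np.exp(theta) / np.exp(theta).sum(axis=1, keepdims=True)

# E_{V ~ P2}[ d log P_theta(B|A) / d theta[a] ]
#   = P2(A=a) * ( P(B|A=a) - softmax(theta[a]) )   for each row a,
# regardless of how P2(A) differs from P1(A)
expected_grad = p_a_transfer[:, None] * (p_b_given_a - model)
print(np.abs(expected_grad).max())  # ~0, up to floating-point rounding
```

Changing the weighting $P_2(A)$ cannot make the expectation non-zero, which is exactly why adaptation pressure concentrates on the modules in $C$.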
Bi-Variate Example

The transfer distribution changed only the true $P(A)$ (the cause). For the correct model, only $N - 1$ parameters need to be re-estimated. In the backward model, all $N(N-1) + (N-1) = N^2 - 1$ parameters must be re-estimated.
More than Two Variables

We cannot enumerate all DAGs and pick the best one after observing episodes of adaptation. Instead, we can parameterize our belief about an exponentially large set of hypotheses by keeping track of the probability of each directed edge of the graph being present.
Formalization

Modeling edges:
$$B_{ij} \sim \mathrm{Bernoulli}(p_{ij}), \qquad P(B) = \prod_{ij} P(B_{ij}).$$
The parents of $V_i$, given $B$, are the set of $V_j$'s such that $B_{ij} = 1$:
$$\mathrm{pa}(i, V, B_i) = \{ V_j \mid B_{ij} = 1,\ j \neq i \}.$$
The structural causal model:
$$V_i = f_i(\theta_i, B_i, V, N_i),$$
where $N_i$ is an independent noise source used to generate $V_i$, and $f_i$ parametrizes the generator (the input $V_j$ is active if $B_{ij} = 1$).
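A sketch of the edge-belief parametrization (the variable count M and the uniform edge probabilities are placeholder choices): each directed edge gets an independent Bernoulli parameter, and sampling a structure $B$ yields the parent sets $\mathrm{pa}(i, V, B_i)$:

```python
import numpy as np

rng = np.random.default_rng(3)
M = 4                              # number of variables (arbitrary)

# Belief over structures: one Bernoulli parameter p_ij per directed edge
p_edge = np.full((M, M), 0.5)
np.fill_diagonal(p_edge, 0.0)      # no self-loops

def sample_structure():
    """Draw B with B[i, j] = 1 meaning V_j is a parent of V_i."""
    return (rng.random((M, M)) < p_edge).astype(int)

def parents(i, B):
    """pa(i, V, B_i) as a list of parent indices."""
    return [j for j in range(M) if B[i, j] == 1 and j != i]

B = sample_structure()
for i in range(M):
    print(f"pa({i}) =", parents(i, B))
```

With one parameter per directed edge, the belief state has only $M(M-1)$ numbers, even though it induces a distribution over $2^{M(M-1)}$ candidate structures.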
The conditional likelihood $P_{B_i}(V_i = v_{t,i} \mid \mathrm{pa}(i, v_t, B_i))$ measures how well the model that uses the incoming edges $B_i$ for node $i$ performs on example $v_t$:
$$L_{B_i} = \prod_t P_{B_i}(V_i = v_{t,i} \mid \mathrm{pa}(i, v_t, B_i)). \tag{1}$$
The overall likelihood for the given graph structure $B$ is
$$L_B = \prod_i L_{B_i},$$
and for the generalized multi-variable case the regret is
$$R = -\log \mathbb{E}_B[L_B]. \tag{2}$$
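The regret in Eq. (2) averages over exponentially many structures, so in practice it would be estimated by sampling. A Monte Carlo sketch (the edge beliefs and the toy log-likelihood function are made up for illustration) estimates $R = -\log \mathbb{E}_B[L_B]$ with a log-mean-exp for numerical stability:

```python
import numpy as np

rng = np.random.default_rng(4)

def estimate_regret(log_lik_fn, p_edge, n_graphs=1000):
    """Monte Carlo estimate of R = -log E_B[L_B], where B has independent
    Bernoulli(p_edge) edges and log_lik_fn(B) returns log L_B."""
    M = p_edge.shape[0]
    log_liks = np.array([
        log_lik_fn((rng.random((M, M)) < p_edge).astype(int))
        for _ in range(n_graphs)
    ])
    m = log_liks.max()  # log-mean-exp: avoids underflow of exp(log L_B)
    return -(m + np.log(np.exp(log_liks - m).mean()))

# Toy check: a made-up log-likelihood that rewards the edge (0, 1)
p_edge = np.array([[0.0, 0.9],
                   [0.1, 0.0]])
toy_log_lik = lambda B: -1.0 if B[0, 1] == 1 else -5.0
R = estimate_regret(toy_log_lik, p_edge)
print(R)  # close to -log(0.9 * e**-1 + 0.1 * e**-5), about 1.10
```

Because structures with higher likelihood dominate the expectation inside the log, minimizing $R$ with respect to the edge parameters $p_{ij}$ pushes belief toward the structures that adapt best.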