Towards Understanding Learning Representations: To What Extent Do Different Neural Networks Learn the Same Representation
Liwei Wang, Lunjia Hu, Jiayuan Gu, Yue Wu, Zhiqiang Hu, Kun He, John Hopcroft
NeurIPS 2018 Spotlight
Motivation

▶ It's widely believed that deep nets learn particular features/representations in their intermediate layers, and people design architectures in order to learn these representations better (e.g. CNN).
▶ However, there is a lack of theory on what these representations really are.
▶ One fundamental question: are the representations learned by deep nets robust? In other words, are the learned representations commonly shared across multiple deep nets trained on the same task?
Motivation

▶ In particular, suppose we have two deep nets with the same architecture trained on the same training data but from different initializations.
▶ Given a set of test examples, do the two deep nets share similarity in their output of layer i? (A sketch of one such numerical comparison follows this list.)
▶ When layer i is the input layer, the similarity is high because both deep nets take the same test examples as input.
▶ When layer i is the final output layer that predicts the classification labels, the similarity is also high assuming both deep nets have tiny test error.
▶ How similar are intermediate layers?
▶ Do some groups of neurons in an intermediate layer learn features/representations that both deep nets share in common? How large are these groups?
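As one concrete way to pose this question numerically, here is a minimal sketch that compares the layer-i outputs of the two nets on the same test examples via average cosine similarity. The `net(examples, layer=...)` call and the choice of cosine similarity are assumptions for illustration only; the slides' actual notion of similarity, developed next, matches groups of neurons by the subspaces their activation vectors span.

```python
import numpy as np

# Minimal sketch, assuming each net exposes its post-ReLU layer outputs via a
# hypothetical call net(examples, layer=i) -> array of shape (num_examples, ...).
def layer_similarity(net_a, net_b, layer, examples):
    """Average cosine similarity between the two nets' outputs at `layer`,
    computed per test example on the flattened activations."""
    A = np.asarray(net_a(examples, layer=layer)).reshape(len(examples), -1)
    B = np.asarray(net_b(examples, layer=layer)).reshape(len(examples), -1)
    cos = np.sum(A * B, axis=1) / (
        np.linalg.norm(A, axis=1) * np.linalg.norm(B, axis=1) + 1e-12
    )
    return float(cos.mean())
```

Under this (naive) metric one would expect values near 1 at the input and output layers, per the bullets above; the interesting question is what happens in between.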
Two Groups of Neurons Learning the Same Representation: Exact Matches

Consider two neurons X, Z in layer i of net A and two neurons Y, W in layer i of net B, where each neuron's output is taken after the ReLU of layer i. For test examples a_1, · · · , a_d, the activation vector of neuron X is (X(a_1), · · · , X(a_d)); the activation vectors of Z, Y, W are defined analogously.

We say {X, Z} and {Y, W} form an exact match if

    span{activation vector of X, activation vector of Z} = span{activation vector of Y, activation vector of W},

i.e., there exist matrices A and B such that, for every test example a_k (k = 1, · · · , d),

    (X(a_k), Z(a_k))^T = A (Y(a_k), W(a_k))^T   and   (Y(a_k), W(a_k))^T = B (X(a_k), Z(a_k))^T.

A numerical sketch of this span check follows below.
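A minimal sketch of how one might test this definition, assuming the activation matrices have already been extracted: two groups of neurons form an exact match iff their activation vectors span the same subspace of R^d, which can be checked with matrix ranks. The function name and tolerance are assumptions for illustration; in finite precision exact equality essentially never holds, which is why one would relax exact matches to approximate ones in practice.

```python
import numpy as np

def exact_match(V, U, tol=1e-8):
    """V, U: activation matrices for the two neuron groups, one row per neuron,
    one column per test example (e.g. rows (X(a_1..a_d)) and (Z(a_1..a_d))).
    The row spans are equal iff stacking the matrices adds no rank to either."""
    r_v = np.linalg.matrix_rank(V, tol)
    r_u = np.linalg.matrix_rank(U, tol)
    r_joint = np.linalg.matrix_rank(np.vstack([V, U]), tol)
    return r_v == r_u == r_joint

# Example with the slide's groups {X, Z} and {Y, W}: Y and W are (different)
# linear combinations of X and Z, so the two spans coincide.
d = 5
X, Z = np.random.rand(d), np.random.rand(d)
Y, W = 2.0 * X + Z, X - 3.0 * Z
print(exact_match(np.vstack([X, Z]), np.vstack([Y, W])))  # True
```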