CS 103: Representation Learning, Information Theory and Control Lecture 6, Feb 15, 2019
VAEs and disentanglement

A β-VAE minimizes the loss function

L = H_{p,q}(x|z) + β 𝔼_x[ KL( q(z|x) ∥ p(z) ) ] = H_{p,q}(x|z) + β { I(z; x) + TC(z) }

where p(z) is a factorized prior, the I(z; x) term enforces minimality, and the total-correlation term TC(z) enforces disentanglement. Assuming a factorized prior for z, a β-VAE therefore optimizes both the IB Lagrangian and disentanglement.

Achille and Soatto, "Information Dropout: Learning Optimal Representations Through Noisy Computation", PAMI 2018 (arXiv 2016)
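As a concrete reference, here is a minimal sketch (not from the slides) of the β-VAE objective above, assuming a Gaussian encoder q(z|x), a Bernoulli decoder, and a factorized standard-normal prior p(z); the function name beta_vae_loss and the decoder choice are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def beta_vae_loss(x, x_recon_logits, mu, logvar, beta=4.0):
    """Sketch of the beta-VAE objective: H_{p,q}(x|z) + beta * E_x[KL(q(z|x) || p(z))]."""
    # Reconstruction term: cross-entropy of x under a Bernoulli decoder.
    recon = F.binary_cross_entropy_with_logits(
        x_recon_logits, x, reduction="sum") / x.size(0)
    # KL(q(z|x) || N(0, I)) in closed form for a diagonal Gaussian encoder.
    kl = -0.5 * torch.sum(1 + logvar - mu.pow(2) - logvar.exp()) / x.size(0)
    # With a factorized prior, this KL term contains I(z; x) + TC(z):
    # minimality plus (for beta > 1) pressure toward disentanglement.
    return recon + beta * kl
```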
Learning disentangled representations (Higgins et al., 2017; Burgess et al., 2017)

Start with a very high β and slowly decrease it during training.
Beginning: very strict bottleneck, only the most important factor is encoded.
End: very large bottleneck, all remaining factors are encoded.

[Figure: traversals of the components of the representation z for different image seeds.]

Think of it as a non-linear PCA, where training time disentangles the factors.
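A minimal sketch of the annealing schedule described above, assuming a geometric decay of β from a very strict to a very loose bottleneck; the start and end values and the helper name beta_schedule are illustrative choices, not from the slides.

```python
def beta_schedule(step, total_steps, beta_start=250.0, beta_end=1.0):
    """Geometric interpolation from a very tight bottleneck to a loose one."""
    t = min(step / max(total_steps, 1), 1.0)
    return beta_start * (beta_end / beta_start) ** t

# Hypothetical usage inside a training loop:
# for step in range(total_steps):
#     beta = beta_schedule(step, total_steps)
#     loss = beta_vae_loss(x, x_recon_logits, mu, logvar, beta=beta)
```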
Learning disentangled representations (Higgins et al., 2017; Burgess et al., 2017)

Each component of the learned representation corresponds to a different semantic factor.

[Figure: traversals of the components of the representation z for different image seeds; pictures courtesy of Higgins et al. and Burgess et al.]

Higgins et al., "β-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework", 2017
Burgess et al., "Understanding Disentangling in β-VAE", 2017
Multiple Objects

Attend, Infer, Repeat (Eslami et al.)
Multi-Entity VAE (Nash et al.)
Is the representation “semantic” and domain invariant?

Achille et al., Life-Long Disentangled Representation Learning with Cross-Domain Latent Homologies, 2018
Corollary: Ways of enforcing invariance

The standard architecture alone already promotes invariant representations:
1. Only nuisance information is dropped in a bottleneck (sufficiency).
2. An increasingly minimal representation is increasingly invariant to nuisances.
3. The classifier cannot overfit to nuisances.

Regularization by architecture: reducing dimension (max-pooling) or adding noise (dropout) increases minimality and invariance.
Stacking layers: stacking multiple layers makes the representation increasingly minimal.

[Figure: nuisance information I(x; n) versus task information I(x; y) across layers.]
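To make the architectural point concrete, here is a sketch assuming a standard convolutional encoder: max-pooling reduces dimension, dropout adds noise, and stacking such blocks makes the representation increasingly minimal. The layer sizes and dropout rate are arbitrary illustrative choices.

```python
import torch.nn as nn

def conv_block(c_in, c_out, p_drop=0.2):
    """One block that discards spatial detail (max-pooling) and injects noise (dropout)."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.ReLU(),
        nn.MaxPool2d(2),       # dimension reduction
        nn.Dropout2d(p_drop),  # noise
    )

# Stacking several such blocks makes the representation increasingly minimal,
# hence increasingly invariant to nuisances.
encoder = nn.Sequential(
    conv_block(3, 32),
    conv_block(32, 64),
    conv_block(64, 128),
)
```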
Information Dropout: a Variational Bottleneck

Creating a soft bottleneck with controlled noise: multiplicative noise ε ~ log N(0, τ(x)) is applied to the activations, and the loss is

ℒ = H_{p,q}(y|z) + 𝔼_x[ KL( p(z|x) ∥ q(z) ) ] = H_{p,q}(y|z) + 𝔼_x[ −log |Σ(x)| ]

The second term, the average log-variance of the noise, acts as the bottleneck.

[Figure: nuisance information I(x; n) versus task information I(x; y).]

Achille and Soatto, "Information Dropout: Learning Optimal Representations Through Noisy Computation", PAMI 2018 (arXiv 2016)
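A minimal sketch of an Information Dropout layer in the spirit of the loss above: multiplicative log-normal noise whose scale α(x) is predicted from the input, with a penalty equal to the negative average log noise scale. The 1×1-convolution α network, the cap max_alpha, and the class name are assumptions, not the paper's exact parameterization.

```python
import torch
import torch.nn as nn

class InformationDropout(nn.Module):
    """Multiplicative log-normal noise eps ~ logN(0, alpha(x)^2) with a KL-style penalty."""

    def __init__(self, channels, max_alpha=0.7):
        super().__init__()
        # Small network predicting the per-unit noise scale alpha(x) in (0, max_alpha].
        self.alpha_net = nn.Sequential(nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.max_alpha = max_alpha

    def forward(self, x):
        alpha = self.max_alpha * self.alpha_net(x) + 1e-3
        if self.training:
            eps = torch.exp(alpha * torch.randn_like(x))  # log-normal multiplicative noise
            z = x * eps
        else:
            z = x
        # Penalty ~ negative log noise variance: units kept nearly noise-free
        # (small alpha, informative) pay a high cost.
        kl = -torch.log(alpha).mean()
        return z, kl
```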
Learning invariant representations (Achille and Soatto, 2017)

Deeper layers filter increasingly more nuisances; a stronger bottleneck means more filtering.

[Figure: only the informative part of the image is retained; the other information is discarded.]

Achille and Soatto, "Information Dropout: Learning Optimal Representations Through Noisy Computation", PAMI 2018 (arXiv 2016)
The catch

What if we just represent an image by its index in the training set (or by a unique hash)?

[Figure: x (an image, 24,576 bits) → z (a 16-bit index: 0000000000000000, 0000000000000001, ...) → y (a 4-bit label: 0100, 0001, 0010, 0101).]

It is a sufficient representation and it is close to minimal.
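A toy sketch of this catch (all names and values are illustrative): an index-based representation is sufficient for the training labels and nearly minimal, yet it is a pure lookup table and carries no information about unseen images.

```python
# Toy "representation": each image is encoded by its index in the training set.
train_images = ["img_0", "img_1", "img_2", "img_3"]   # stand-ins for 24,576-bit images
train_labels = ["car", "horse", "deer", "car"]

encode = {img: i for i, img in enumerate(train_images)}  # z: the (tiny) index
decode = {i: y for i, y in enumerate(train_labels)}       # sufficient: z determines y exactly

print(decode[encode["img_2"]])  # "deer" -- perfect on the training data
# decode[encode["new_img"]]     # KeyError: the representation says nothing about new data
```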
This Information Bottleneck is wishful thinking

The IB is a statement of desire for future data we do not have:

min_{q(z|x)} L = H_{p,q}(y|z) + β I(z; x)

What we have is the data collected in the past. What is the best way to use the past data in view of future tasks?
[Figure: training data (images labeled car, horse, deer, ...) and learned weights yield an invariant representation, which is then used at testing.]