CS 103: Representation Learning, Information Theory and Control
Lecture 8, Mar 1, 2019
Recap

- Group nuisances: group convolutions, canonical reference frames, SIFT descriptors
- General nuisances: minimal information in the activations ⇒ invariance to nuisances
- Invariance to nuisances via the Information Bottleneck: the IB loss can be upper-bounded by introducing an auxiliary variable
  - Aside: the Variational Auto-Encoder can be seen as a particular case
  - Aside: disentanglement in VAEs

How does this relate to standard deep learning?
The Kolmogorov Structure of a Task

How can we define the structure of a task? Define the Kolmogorov Structure Function:

S_D(t) = min_{K(M) ≤ t} L(D; M)

[Plot: training loss L(D; M) vs. Kolmogorov complexity K(M) of the model; the knee of the curve marks the Kolmogorov minimal sufficient statistic, after which the curve approaches a straight asymptote.]

- At first, increasing the complexity of the model leads to big gains in accuracy: we are learning the structure of the problem.
- After learning all the structure, we can only memorize: an inefficient asymptotic phase.
- Tangent = 1 in the asymptote: we need to store 1 bit in the model to decrease the loss by 1 bit.

Kolmogorov's Structure Functions and Model Selection, Vereshchagin and Vitanyi, 2002
Information Complexity of Tasks, their Structure and their Distance, Achille et al., 2018
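To make the asymptote claim explicit (a sketch; K* is shorthand introduced here, not on the slide, for the complexity of the Kolmogorov minimal sufficient statistic):

```latex
S_{D}(t) \;\approx\; S_{D}(K^{*}) - (t - K^{*}) \qquad \text{for } t \ge K^{*},
```

that is, once all the structure has been learned, each additional bit of model complexity removes exactly one bit of training loss: pure memorization.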
Optimizing using Deep Neural Networks

How do we find the optimal solution?

S_D(t) = min_{K(M) ≤ t} L(D; M)

Corresponding Lagrangian:

ℒ(M) = L(D; M) + λ K(M)

Let w be the parameters of the model, and use the bound K(M) ≤ KL(q(w|D) ∥ p(w)):

ℒ(M) = L(D; M) + λ KL(q(w|D) ∥ p(w))

This loss can be implemented using a DNN and the local reparameterization trick.*

* Variational Dropout and the Local Reparameterization Trick, Kingma et al., 2015
Information Complexity of Tasks, their Structure and their Distance, Achille et al., 2018
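To make the last step concrete, here is a minimal sketch (assuming PyTorch; the layer, initialization, and value of λ are illustrative, not the lecture's code) of the loss ℒ(M) = L(D; M) + λ KL(q(w|D) ∥ p(w)) with a factorized Gaussian posterior over the weights. The local reparameterization trick of Kingma et al. samples pre-activations instead of weights, which lowers gradient variance:

```python
# Sketch: variational loss L(D; M) + lambda * KL(q(w|D) || p(w)) with the
# local reparameterization trick (sample pre-activations, not weights).
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalReparamLinear(nn.Module):
    """Linear layer with factorized Gaussian posterior q(w|D) = N(mu, sigma^2)."""
    def __init__(self, in_features, out_features):
        super().__init__()
        self.mu = nn.Parameter(0.01 * torch.randn(in_features, out_features))
        self.log_var = nn.Parameter(torch.full((in_features, out_features), -6.0))

    def forward(self, x):
        # q(w|D) induces a Gaussian over pre-activations; sample those directly.
        act_mu = x @ self.mu
        act_var = (x ** 2) @ self.log_var.exp()
        return act_mu + (act_var + 1e-8).sqrt() * torch.randn_like(act_mu)

    def kl(self):
        # Closed-form KL(N(mu, sigma^2) || N(0, 1)), summed over all weights.
        return 0.5 * (self.log_var.exp() + self.mu ** 2 - 1.0 - self.log_var).sum()

layer = LocalReparamLinear(784, 10)
x, y = torch.randn(32, 784), torch.randint(0, 10, (32,))
lam = 1e-3  # hypothetical lambda; in practice swept or annealed
loss = F.cross_entropy(layer(x), y) + lam * layer.kl()  # L(D; M) + lambda * KL
loss.backward()
```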
Let's rewrite it using Information Theory

We used an upper bound; what is the best value it can assume?

ℒ(M) = 𝔼_{w ∼ q(w|D)}[H_{p,q}(D | w)] + λ KL(q(w|D) ∥ p(w))

Recall that I(w; D) ≤ 𝔼_D[KL(q(w|D) ∥ p(w))], with equality when p(w) is the marginal 𝔼_D[q(w|D)].

Hence, in expectation over datasets, the best loss function to use to recover the task structure is:

ℒ(M) = 𝔼_D[H(D | w)] + λ I(w; D)

This is the IB Lagrangian for the weights.
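The inequality follows from a one-line decomposition (a sketch, writing q(w) = 𝔼_D[q(w|D)] for the marginal of the weights over datasets):

```latex
\mathbb{E}_{D}\!\left[\mathrm{KL}\big(q(w \mid D)\,\|\,p(w)\big)\right]
  \;=\; I(w; D) \;+\; \mathrm{KL}\big(q(w)\,\|\,p(w)\big)
  \;\ge\; I(w; D),
```

with equality if and only if p(w) = q(w): the best prior is the marginal of the weights over datasets.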
<latexit sha1_base64="eu/9thb7n8qPJF1t/o56inDEc=">ACfHicdVFdSxtBFJ1d26qxrVEfbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z3pryidfCDV1AMDh3Pu4V7OJkUFoPgp+cvXj5anltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtklrbDoMOwU1KJjkZSXKLc8ou6F93nZU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bI/if/gLpa73S</latexit> <latexit sha1_base64="eu/9thb7n8qPJF1t/o56inDEc=">ACfHicdVFdSxtBFJ1d26qxrVEfbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z3pryidfCDV1AMDh3Pu4V7OJkUFoPgp+cvXj5anltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtklrbDoMOwU1KJjkZSXKLc8ou6F93nZU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bI/if/gLpa73S</latexit> <latexit sha1_base64="eu/9thb7n8qPJF1t/o56inDEc=">ACfHicdVFdSxtBFJ1d26qxrVEfbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z3pryidfCDV1AMDh3Pu4V7OJkUFoPgp+cvXj5anltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtklrbDoMOwU1KJjkZSXKLc8ou6F93nZU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bI/if/gLpa73S</latexit> <latexit sha1_base64="eu/9thb7n8qPJF1t/o56inDEc=">ACfHicdVFdSxtBFJ1d26qxrVEfbk0Cgm2YVdKFaQg2AdbfLBgVMiGZXYySQZnZteZu43Juj+0z3pryidfCDV1AMDh3Pu4V7OJkUFoPgp+cvXj5anltbL2+s3b9erG5qVNc8N4i6UyNdcJtVwKzVsoUPLrzHCqEsmvkpuTiX/1gxsrUn2Bo4x3FO1r0ROMopPiahkpoePitj6+v2uUECmKA0ZlcVbCZziNi+w93JZQH92PG7AHUcKRwtf6+AjuGpXdWXb4TCweLgYfBr+URzBsxNVa0AymgEUSzkmNzHEeV39H3ZTlimtklrbDoMOwU1KJjkZSXKLc8ou6F93nZU8Vtp5jWVMKuU7rQS417GmGq/psoqLJ2pBI3OTnTPvUm4v+8do69w04hdJYj12y2qJdLwBQmnUNXGM5QjhyhzAh3K7ABNZSh+5nKozVMJUb0B1i6asKnRSySy/1mGDTD7x9rx9/mJa2QbfKO1ElIDsgxOSXnpEUY+eWtepvelvfH3/H3/A+zUd+bZ7bI/if/gLpa73S</latexit> <latexit sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">ACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+ecw/3ciZMpDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUex6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD02zFLI6QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S86s+E67je35nrTUdeZedbIs3L3HgEoL3n</latexit> <latexit 
sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">ACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+ecw/3ciZMpDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUex6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD02zFLI6QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S86s+E67je35nrTUdeZedbIs3L3HgEoL3n</latexit> <latexit sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">ACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+ecw/3ciZMpDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUex6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD02zFLI6QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S86s+E67je35nrTUdeZedbIs3L3HgEoL3n</latexit> <latexit sha1_base64="07Llzu3pX36qhiNEl5FSqs7OCBw=">ACf3iclVFdSxtBFJ3d2hqjrVHRF18uDdKEStgtBRURBPug0gcLRoUkLOTSTI4O7vO3G0+1gX/pn+g+DOcfCD148ULA+ecw/3ciZMpDoefeO+2Hu46f5wkJxcenzl+XSyuqFiVPNeJ3FMtZXITVcCsXrKFDyq0RzGoWSX4bXR2P98i/XRsTqHIcJb0W0q0RHMIqWCkp3W9CMhAqym8rodlDNbUex6jMfudwAMdBlmzDTdDPoTK8HVXhOzRDjhROKqN9GFSLU3P/Hb6nwV/5PvSrQans1bxJwWvgz0CZzOosKD02zFLI6QSWpMw/cSbGVUo2CS58VmanhC2TXt8oaFikbctLJUjlsWaYNnVjbpxAm7P+OjEbGDKPQTo7PNC+1MfmW1kixs9vKhEpS5IpNF3VSCRjDOHZoC80ZyqEFlGlhbwXWo5oytJ9TfLaGRaEW3R7mNhr/ZRCvwcWPmu/V/D8/y4ens5AKZJN8JRXikx1ySI7JGakTRv45S86s+E67je35nrTUdeZedbIs3L3HgEoL3n</latexit> A new Information Bottleneck p(y|x) D w Weights IB dataset real distribution weights Overfitting min w L = H p,q w ( y | z ) + Ξ² I ( D ; w ) y z x Activations IB data label activations Invariance q ( z | x ) L = H p,q ( y | z ) + Ξ² I ( z ; x ) min 6
The PAC-Bayes generalization bound

PAC-Bayes bound (Catoni, 2007; McAllester, 2013).

Corollary: minimizing the IB Lagrangian for the weights minimizes an upper bound on the test error.

This gives non-vacuous generalization bounds! (Dziugaite and Roy, 2017)
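For concreteness, a representative McAllester-style form of the bound (a sketch; the exact constants vary across the cited papers): for a loss bounded in [0, 1], with probability at least 1 − δ over the draw of the n-sample dataset D, simultaneously for all posteriors q(w|D),

```latex
\mathbb{E}_{w \sim q(w \mid D)}\!\left[L_{\mathrm{test}}(w)\right]
  \;\le\; \mathbb{E}_{w \sim q(w \mid D)}\!\left[L_{\mathrm{train}}(w)\right]
  + \sqrt{\frac{\mathrm{KL}\big(q(w \mid D)\,\|\,p(w)\big) + \ln \frac{2\sqrt{n}}{\delta}}{2n}}.
```

Since 𝔼_D[KL(q(w|D) ∥ p(w))] upper-bounds I(w; D) (previous slide), the complexity term here is exactly what the IB Lagrangian for the weights penalizes, which yields the corollary.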