partial differential equations approaches to optimization
play

Partial Differential Equations Approaches to Optimization and - PowerPoint PPT Presentation

Partial Differential Equations Approaches to Optimization and Regularization of Deep Neural Networks Celebrating 75 Years of Mathematics of Computation ICERM Nov 2, 2018 Adam Oberman McGill Dept of Math and Stats supported by NSERC, Simons


  1. Partial Differential Equations Approaches to Optimization and Regularization of Deep Neural Networks Celebrating 75 Years of Mathematics of Computation ICERM Nov 2, 2018 Adam Oberman McGill Dept of Math and Stats supported by NSERC, Simons Fellowship, AFOSR FA9550-18-1-0167

  2. Background AI • Artificial Intelligence is loosely defined as intelligence exhibited by machines • Operationally: R&D in CS academic sub-disciplines: Computer Vision, Natural Language Processing (NLP), Robotics, etc AlphaGo uses DL to beat world champion at Go

  3. Artificial General Intelligence (AGI) • AI : specific tasks, • AGI : general cognitive abilities. • AGI is a small research area within AI: build machines that can successfully perform any task that a human might do • So far, no progress on AGI.

  4. Deep Learning vs. traditional Machine Learning • Machine Learning (ML) has been around for some time. • Deep Learning is newer branch of ML which uses Deep Neural networks. • ML has theory: error estimates and convergence proofs. • DL less theory. But DL can e ff ectively solve substantially larger scale problems

  5. What are DNNs (in Math Language)? • ImageNet: Total number of classes: m =21841 • Total number of images: n =14,197,122 • Color images d= 3*256*256= 196,608 Facebook used 256 GPUs, working in parallel, to train ImageNet. Still an academic dataset . Total number of images on Facebook is much larger

  6. What is the function? Looking for a map from images to labels f(x) x in M = manifold of images graph of list of word labels

  7. Doing the impossible? In theory, due to curse of dimensionality, impossible to accurately interpolate a high dimensional function. In practice, possible using Deep Neural Network architecture, training to fit the data with SGD. However we don’t know why it works. Can train a computer to caption images more accurately that human performance.

  8. Loss Functions versus Error Classification problem: map image to discrete label space {1,2,3,…,10} In practice: map to a probability vector, then assign label of the arg max. Classification is not differentiable, so instead, in order to train, use a loss function as a surrogate.

  9. <latexit sha1_base64="3uHpidkJhxBpaO95Fmt7WdytCQ=">ACQ3icbZBNSysxFIbP+HW9avq0k1QhA6IdNwoyAVBZcKVoudGjJpxoYmSHJaMsw/8Wf4sY/4M4/4MaF3ErmLYifh0IPLznPZycN0oFN7ZavfdGRsfGJ/5M/i1NTc/MzpXnF05MkmnKajQRia5HxDBFatZbgWrp5oRGQl2GnV2+/3TS6YNT9Sx7aWsKcmF4jGnxDoJl89CyRW+QqEkth1F+X6B8y4KDZco1O0EqwKFTIhKXOluX/lrvUrX9E/Z8gkzrmjoDhXHxbM302Y+z4ur1TXq4NCHxB8h5WdfVW/BoBDXL4LWwnNJFOWCmJMI9hIbTMn2nIqWFEKM8NSQjvkgjUcKiKZaeaDAq06pQWihPtnrJoH6eyIk0picj5+yfar73+uJvUZm461mzlWaWabocFGcCWQT1A8Utbhm1IqeA0I1d39FtE0odbFXnIh/Dj5J5xsrAeOj1waezCsSViCZahAJuwAwdwCDWgcAMP8AT/vVv0Xv2XobWEe9ZhG+lPf6Bur+sIg=</latexit> <latexit sha1_base64="LDMLH6UNa2jZV0Aeyzoau7EZiQ=">ACQ3icbZDLSgMxFIYz9V5vVZdugiJ0QGqnGwURBVcKlgtdmrIpBkbmSGJGNbhnkIn8EXceMLuPMF3LhQxK1g2op4OxD4+M9/ODl/EHOmTbn84ORGRsfGJyan8tMzs3PzhYXFUx0litAqiXikagHWlDNJq4YZTmuxolgEnJ4F7b1+/+yKs0ieWJ6MW0IfClZyAg2VkKFc18wiTrQF9i0giA9yFDahb5mAvqFSGZQZ9yXgyL3e2Ou94rdl0X7lhDIlDKLHnZhfyIPZpQsx1UWG1XCoPCn6B9xtWdw9k7WZj/oIFe79ZkQSQaUhHGtd9yqxaRYGUY4zfJ+omMSRtf0rpFiQXVjXSQbXrNKEYaTskwYO1O8TKRZa90Rgnf1T9e9eX/yvV09MuNVImYwTQyUZLgoTDk0E+4HCJlOUGN6zgIli9q+QtLDCxNjY8zaEPyf/hdNKybN8bNPYB8OaBMtgBRSBzbBLjgER6AKCLgFj+AZvDh3zpPz6rwNrTnc2YJ/Cjn/QNKGrGP</latexit> <latexit sha1_base64="LDMLH6UNa2jZV0Aeyzoau7EZiQ=">ACQ3icbZDLSgMxFIYz9V5vVZdugiJ0QGqnGwURBVcKlgtdmrIpBkbmSGJGNbhnkIn8EXceMLuPMF3LhQxK1g2op4OxD4+M9/ODl/EHOmTbn84ORGRsfGJyan8tMzs3PzhYXFUx0litAqiXikagHWlDNJq4YZTmuxolgEnJ4F7b1+/+yKs0ieWJ6MW0IfClZyAg2VkKFc18wiTrQF9i0giA9yFDahb5mAvqFSGZQZ9yXgyL3e2Ou94rdl0X7lhDIlDKLHnZhfyIPZpQsx1UWG1XCoPCn6B9xtWdw9k7WZj/oIFe79ZkQSQaUhHGtd9yqxaRYGUY4zfJ+omMSRtf0rpFiQXVjXSQbXrNKEYaTskwYO1O8TKRZa90Rgnf1T9e9eX/yvV09MuNVImYwTQyUZLgoTDk0E+4HCJlOUGN6zgIli9q+QtLDCxNjY8zaEPyf/hdNKybN8bNPYB8OaBMtgBRSBzbBLjgER6AKCLgFj+AZvDh3zpPz6rwNrTnc2YJ/Cjn/QNKGrGP</latexit> <latexit sha1_base64="1t3sG2NE1Erwu7My7b6VndIN3d0=">ACQ3icbZDLSgMxFIYz9VbrerSTbAIHRCZcaMgqCywrWip0aMmDSaZIcloyzDv5sYXcOcLuHGhiFvBtJbi7UDg4z/4eT8YcKZNp736BQmJqemZ4qzpbn5hcWl8vLKuY5TRWidxDxWFyHWlDNJ64YZTi8SRbEIOW2E14eDfuOGKs1ieWb6CW0J3JEsYgQbK6HyZSCYRLcwENh0wzA7zlHWg4FmAgaqGyOZw4ByXo2qvb1bd7Nf7bku3LeGVKCMWfLzKzm2IDYyIea6qFzxtrxhwTH4v6ECRlVD5YegHZNUGkIx1o3/e3EtDKsDCOc5qUg1TB5Bp3aNOixILqVjbMIcbVmnDKFb2SQOH6veJDAut+yK0zsGp+ndvIP7Xa6Ym2m1lTCapoZJ8LYpSDk0MB4HCNlOUGN63gIli9q+QdLHCxNjYSzaEPyf/hfPtLd/yqVc5OBrFUQRrYB1UgQ92wAE4ATVQBwTcgSfwAl6de+fZeXPev6wFZzSzCn6U8/EJVBWunQ=</latexit> DNNs in Math Language: high dimensional function fitting n X min w E x ∼ ρ n ` ( f ( x ; w ) , y ( x )) = ` ( f ( x i ; w ) , y ( x i )) i =1 Data fitting problem: f parameterized map from images to probability vectors on labels. y is the correct label. Try to fit data by minimizing the loss. Training: minimize expected loss, by taking stochastic (approximate) gradients Note: train on an empirical distribution sampled from the density rho.

Recommend


More recommend