Direction Matters: On the Implicit Regularization Effect of Stochastic Gradient Descent with Moderate Learning Rate - PowerPoint PPT Presentation



  1. Direction Matters: On the Implicit Regularization Effect of Stochastic Gradient Descent with Moderate Learning Rate. Jingfeng Wu, Difan Zou, Vladimir Braverman, Quanquan Gu. Johns Hopkins University & UCLA. November 2020.

  2. Overview • Background • SGD vs. GD: Different Convergence Directions • Small Learning Rate • Moderate Learning Rate • Direction Matters: SGD + Moderate LR is Good • Proof Sketches

  3. Implicit Regularization: SGD >> GD. Training loss: $L_S(w) = \frac{1}{n}\sum_{i=1}^{n} \ell_i(w)$. GD: $w \leftarrow w - \eta\,\nabla L_S(w)$. SGD: $w \leftarrow w - \eta\,\nabla \ell_k(w)$. Setting: CIFAR-10, ResNet-18, w/o weight decay, w/o data augmentation. Wu, Jingfeng, et al. "On the Noisy Gradient Descent that Generalizes as SGD." ICML 2020.
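  To make the two update rules on this slide concrete, here is a minimal illustrative Python sketch (not from the talk) on a toy least-squares problem; the problem, the step size, and the helper names grad_full and grad_sample are assumptions of this sketch.

  ```python
  # Illustrative sketch: GD vs. SGD updates from slide 3 on a toy least-squares
  # problem with L_S(w) = (1/n) * sum_i 0.5 * (x_i . w - y_i)^2.
  import numpy as np

  rng = np.random.default_rng(0)
  n, d = 100, 5
  X = rng.standard_normal((n, d))
  y = X @ rng.standard_normal(d) + 0.1 * rng.standard_normal(n)

  def grad_full(w):
      # Full-batch gradient: (1/n) * sum_i grad of the per-sample squared loss
      return X.T @ (X @ w - y) / n

  def grad_sample(w, k):
      # Gradient of a single randomly chosen sample k
      return X[k] * (X[k] @ w - y[k])

  eta = 0.01
  w_gd = np.zeros(d)    # GD:  w <- w - eta * grad of L_S(w)
  w_sgd = np.zeros(d)   # SGD: w <- w - eta * grad of l_k(w), k uniform in {0,...,n-1}
  for _ in range(1000):
      w_gd = w_gd - eta * grad_full(w_gd)
      k = rng.integers(n)
      w_sgd = w_sgd - eta * grad_sample(w_sgd, k)

  print("GD loss :", 0.5 * np.mean((X @ w_gd - y) ** 2))
  print("SGD loss:", 0.5 * np.mean((X @ w_sgd - y) ** 2))
  ```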

  4. Two More Figures about SGD (Less Relevant). [Figure: test accuracy (%) vs. iteration; final accuracies: GD 66.96, GLD const 66.66, GLD dynamic 69.25, GLD diag 67.96, SGD 75.21.] Wilson, Ashia C., et al. "The marginal value of adaptive gradient methods in machine learning." NIPS 2017. Zhu, Zhanxing, et al. "The Anisotropic Noise in Stochastic Gradient Descent: Its Behavior of Escaping from Sharp Minima and Regularization Effects." ICML 2019.

  5. SGD vs. GD: Learning Rate Matters! Small LR: GD ☹, SGD ☹. Moderate LR: GD ☹, SGD ☺. Q1: Small LR, SGD ≈ GD? Q2: Moderate LR, SGD >> GD? Q3: GD is bad anyhow?

  6. In Theory, SGD ≈ GD or SGD ≠ GD? GD: $w \leftarrow w - \eta\,\nabla L_S(w)$. SGD: $w \leftarrow w - \eta\,\nabla L_S(w) + \eta\,(\nabla L_S(w) - \nabla \ell_k(w))$, where the second term is unbiased noise that scales with $\eta$. Theory disagrees with practice ☹: it is "easy" to prove SGD ≈ GD by concentration (e.g., for a small LR), but "hard" to prove the reverse, SGD ≠ GD, since there is no concentration to exploit.
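  A small numerical sanity check of this decomposition (again an illustrative sketch, not from the talk; the toy squared-loss data and all names are assumptions): the noise term $\eta(\nabla L_S(w) - \nabla \ell_k(w))$ averages to zero over the uniform choice of $k$, and its magnitude is proportional to $\eta$.

  ```python
  # Sketch: rewrite one SGD step as a GD step plus the noise term
  # eta * (grad L_S(w) - grad l_k(w)), and check that the noise is mean-zero over k.
  import numpy as np

  rng = np.random.default_rng(1)
  n, d, eta = 50, 3, 0.1
  X = rng.standard_normal((n, d))
  y = rng.standard_normal(n)
  w = rng.standard_normal(d)

  per_sample_grads = (X @ w - y)[:, None] * X     # row k: gradient of 0.5*(x_k.w - y_k)^2
  full_grad = per_sample_grads.mean(axis=0)       # gradient of L_S(w)
  noise = eta * (full_grad - per_sample_grads)    # noise term, one row per choice of k

  print(np.abs(noise.mean(axis=0)).max())         # ~1e-16: unbiased over uniform k
  print(np.linalg.norm(noise, axis=1).mean())     # typical noise size, proportional to eta
  ```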

  7. Small Learning Rate: SGD ≈ GD ($\eta = \mathrm{d}t \to 0$). GD: $w \leftarrow w - \eta\,\nabla L_S(w)$, whose limit is the Gradient Flow (GF) $\mathrm{d}w = -\nabla L_S(w)\,\mathrm{d}t$. SGD: $w \leftarrow w - \eta\,\nabla L_S(w) + \eta\,(\nabla L_S(w) - \nabla \ell_k(w))$, which is modeled by the Stochastic Modified Equation (SME) $\mathrm{d}w = -\nabla L_S(w)\,\mathrm{d}t + \sqrt{\eta}\,\Sigma(w)^{1/2}\,\mathrm{d}B_t$, in which the diffusion term is of higher order: $\mathrm{Var}[\epsilon] = \sigma^2 \Rightarrow \sigma = O(\sqrt{\eta})$, and $\mathrm{Var}[\epsilon_1 + \cdots + \epsilon_\eta] = \eta\sigma^2 \sim O(\eta^2)$.
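  One way to read the variance bookkeeping above (my interpretation of the slide, assuming $\Sigma(w)$ denotes the covariance of the per-sample gradients $\nabla \ell_k(w)$ and that one SGD step corresponds to time $\mathrm{d}t = \eta$ in the SME):

  ```latex
  % Sketch of the scale matching (assumptions: Sigma(w) = Cov_k[grad l_k(w)],
  % and dt = eta, i.e., one SGD step per SME time increment eta).
  \[
    \operatorname{Var}\bigl[\eta\,(\nabla L_S(w) - \nabla \ell_k(w))\bigr]
      = \eta^2\,\Sigma(w)
      \qquad \text{(noise of one SGD step)},
  \]
  \[
    \operatorname{Var}\bigl[\sqrt{\eta}\,\Sigma(w)^{1/2}\,\mathrm{d}B_t\bigr]
      = \eta\,\Sigma(w)\,\mathrm{d}t
      = \eta^2\,\Sigma(w)
      \qquad \text{(SME diffusion over } \mathrm{d}t = \eta \text{)}.
  \]
  ```

  Under this reading, the drift per step is $O(\eta)$ while the noise variance per step is $O(\eta^2)$; accumulated over a fixed time horizon the noise therefore has standard deviation $O(\sqrt{\eta})$, which vanishes as $\eta \to 0$, so SGD tracks the gradient flow and SGD ≈ GD in the small-LR limit.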
