Obtaining Adjustable Regularization for Free via Iterate Averaging
Jingfeng Wu, Vladimir Braverman, Lin F. Yang
Johns Hopkins University & UCLA
June 2020
<latexit sha1_base64="/BQPh6/OvMQrEs/7E6XsCNoHT0c=">ACNXicbVBNb9NAEF2XrxK+Ahy5jFohpYqI7AqJckCK4MKh4JIWymOrPF6nKyXlu740aR5X/BL+HC/+BEDxAiCt/gU3SA7Q8aWn92b0Zl9aeU4DM+DrWvXb9y8tX27c+fuvfsPug8fHbuytpJGstSlPU3RkVaGRqxY02lCYtU0k6f7PyT87IOlWaD7ysaFLg1KhcSWQvJd3DRdLM+1EL8AoWyRyeQUyMEGvKuRcbTDXCYW+xB31o4nVcYylr/YDPyLCF9ysztmo6472kuxsOwjXgKokuyO5wJ+5/PB8uj5LulzgrZV2QYanRuXEUVjxp0LKSmtpOXDuqUM5xSmNPDRbkJs36jBaeiWDvLT+GYa1+vdGg4VzyL1kwXyzF32VuL/vHN+cGkUaqmYzcBOW1Bi5hVSFkypJkvfQEpVX+VpAztCjZF93xJUSXv3yVHO8PoueDl+98G6/FBtvidgRPRGJF2Io3ojMRJSfBJfxXfxI/gcfAt+Br82o1vBxc5j8Q+C38AktCr+A=</latexit> <latexit sha1_base64="NgkEF/pzBUBbtk0dDOe5Z6kCGk=">ACE3icbVDLSgMxFM3UV62vqks3wSJUhTIjgnZXdCPiop9QGcomUzahmYyQ5KxlGH+wY1bP8ONC0XcunHnb/gFptMutPVA4HDOfeW4IaNSmeaXkZmbX1hcyi7nVlbX1jfym1t1GUQCkxoOWCaLpKEU5qipGmqEgyHcZabj985HfuCNC0oDfqmFIHB91Oe1QjJSW2vkD26e8HQ8SeFUc7MNDGNvp0FgQL4E205M8lMAbzBbNkpoCzxJqQqWMLp3vwmO1nf+0vQBHPuEKMyRlyzJD5cRIKIoZSXJ2JEmIcB91SUtTjnwinThdn8A9rXiwEwj9uIKp+rsjRr6UQ9/VlT5SPTntjcT/vFakOqdOTHkYKcLxeFEnYlAFcBQ9KgWLGhJgLqm+FuIcEwkrHmNMhWNfniX1o5J1XCpf6zTOwBhZsAN2QRFY4ARUwAWoghrA4B48gRfwajwYz8ab8T4uzRiTnm3wB8bHD4ocoFU=</latexit> <latexit sha1_base64="WqLor5Nc19eHJ3+zI7DqK83WEhc=">ACnicbVC7TsMwFHXKq5RXgJHFtEKqGKoEIRW2ChbGItGH1ITIcdzWqvOQ7VBVUWYW/oBvYGEAIVYWVjYEH4ObdoCWI1k6Ouc+fI8bMSqkYXxquYXFpeWV/GphbX1jc0vf3mKMOaYNHDIQt52kSCMBqQhqWSkHXGCfJeRljs4H/utG8IFDYMrOYqI7aNeQLsUI6kR98fOgNoyRAOrw+dxMoGJpx4KbSYmuKh1NFLRsXIAOeJOSWlWrH8/V9v687+oflhTj2SAxQ0J0TCOSdoK4pJiRtGDFgkQID1CPdBQNkE+EnWSbU3igFA92Q65eIGm/u5IkC/EyHdVpY9kX8x6Y/E/rxPL7omd0CKJQnwZFE3ZlCdPs4FepQTLNlIEYQ5VX+FuI84wlKlV1AhmLMnz5PmUcU8rpxeqjTOwAR5sAeKoAxMUAU1cAHqoAEwuAUP4Ak8a3fao/aivU5Kc9q0Zxf8gfb2A2ZJno=</latexit> Searching optimal hyperparameter ML/Opt problem min w L ( w ) + λ R ( w ) Regularization Main loss Hyperparameter GD/SGD w k +1 = w k � η ( r L ( w ) + λ R ( w )) w k → w ∗ λ Learning rate/step size
Re-running the optimizer is expensive! ResNet-50 + ImageNet + 8 GPUs:
• A single round of training takes about 3 days.
• Almost a year to try a hundred different hyperparameters.
Can we obtain adjustable regularization for free?
Iterate averaging => regularization (Neu et al.)
[Figure: contour of L(x), the SGD path for solving min L(x), and the geometric average of the path landing at the solution of min L(x) + λ R(x).]
Iterate averaging protocol
• Require: a stored optimization path
• Input: a hyperparameter λ
• Compute a weighting scheme
• Average the path
• Output: the regularized solution
Iterate averaging is cheap, but Neu et al.'s result is limited.
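In code, the protocol amounts to a single weighted sum over the stored iterates; a minimal sketch (hypothetical names, NumPy) is:

```python
import numpy as np

def average_path(path, weights):
    """Iterate-averaging protocol: given the stored optimization path from one
    unregularized run and a weighting scheme computed from the hyperparameter,
    return the weighted average of the iterates as the regularized solution."""
    path = np.asarray(path)          # shape (K, d): iterates w_1, ..., w_K
    weights = np.asarray(weights)    # shape (K,):   weights p_1, ..., p_K
    return weights @ path            # sum_k p_k * w_k
```

Sweeping the hyperparameter then costs one weighted sum per value of λ instead of one full training run per value.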
<latexit sha1_base64="f25a0wzOI/NgSEYzE82YR0fnMxg=">ACBXicbVC7SgNBFJ2NrxhfUQsRLQaDEJuwGwRjIQRtLKOYB2TXZXYymwyZfTAzawibNDb+io2FIrb+gthpY+tnOHkUmnjgwuGce7n3HidkVEhd/9ASM7Nz8wvJxdTS8srqWnp9oyKCiGNSxgELeM1BgjDqk7KkpFayAnyHEaqTvts4FdvCBc08K9kNySWh5o+dSlGUkl2evcy2zmAJ9B0OcKx0Y/zfWj2OmbPzl/n7XRGz+lDwGlijEmWPh62/r83i7Z6XezEeDI7EDAlRN/RQWjHikmJG+ikzEiREuI2apK6ojzwirHj4R/uK6UB3YCr8iUcqr8nYuQJ0fUc1ekh2RKT3kD8z6tH0i1YMfXDSBIfjxa5EYMygINIYINygiXrKoIwp+pWiFtI5SFVcCkVgjH58jSp5HPGYe74QqVxCkZIgh2wB7LAEegCM5BCZQBrfgHjyCJ+1Oe9CetZdRa0Ibz2yCP9BefwCil5su</latexit> <latexit sha1_base64="FAcAPX0VgpeDU5bzVZuBTWV3X4=">ACGnicbVDLSsNAFJ34rPUVdelmsAi6sCRFUBdC0YUuXFToQ2jSMJlO6uBkEmYm1pDmO9z4Bf6DGxeKuBM3/o3T1oWvAxcO59zLvf4MaNSWdaHMTE5NT0zW5grzi8sLi2bK6tNGSUCkwaOWCQufCQJo5w0FWMXMSCoNBnpOVfHQ/91jURka8rtKYuCHqcRpQjJSWPNM+2+pvw0PoBALhzM4znkNHJqGX0UM73DoDPqdOryBOzB1Bl6lU/HMklW2RoB/if1FStUTeO94g17NM9+cboSTkHCFGZKybVuxcjMkFMWM5EUnkSRG+Ar1SFtTjkIi3Wz0Wg43tdKFQSR0cQVH6veJDIVSpqGvO0OkLuVvbyj+57UTFey7GeVxogjH40VBwqCK4DAn2KWCYMVSTRAWVN8K8SXSISmdZlGHYP9+S9pVsr2bvngXKdxBMYogHWwAbaADfZAFZyCGmgADG7BA3gCz8ad8Wi8GK/j1gnja2YN/IDx/gkcCqIH</latexit> <latexit sha1_base64="DN39c4r+tOUEleqYdUsr2NFJXpY=">ACInicbVDLSgMxFM34rPVdekmKIKiLTMiaBdi0YUuFawVOrVkMndqaGYmJBmxDPMtbvwVNy6U6koQ/BXTx0KtFwIn59x7T3I8wZnStv1hjY1PTE5N52bys3PzC4uFpeUrFSeSQpXGPJbXHlHAWQRVzTSHayGBhB6Hmtc+6em1O5CKxdGl7ghohKQVsYBRog3VLJRFs40P8aZTFvipr3jYmGubiAJTZ0sdbZTt2+SvAzl5vFPslc0CRrFtbtkt0vPAqcIVivH174sHp+fNwpvrxzQJIdKUE6Xqji10IyVSM8ohy7uJAkFom7SgbmBEQlCNtO+e4Q3D+DiIpTmRxn3250RKQqU6oWc6Q6Jv1V+tR/6n1RMdHDRSFolEQ0QHRkHCsY5xLy/sMwlU84BhEpm3orpLTHxaJNq3oTg/P3yKLjaLTl7pfKFSeMYDSqHVtEa2kQO2kcVdIbOURVR9ICe0At6tR6tZ6trvQ9ax6zhzAr6VdbnN7MkplI=</latexit> <latexit sha1_base64="NgkEF/pzBUBbtk0dDOe5Z6kCGk=">ACE3icbVDLSgMxFM3UV62vqks3wSJUhTIjgnZXdCPiop9QGcomUzahmYyQ5KxlGH+wY1bP8ONC0XcunHnb/gFptMutPVA4HDOfeW4IaNSmeaXkZmbX1hcyi7nVlbX1jfym1t1GUQCkxoOWCaLpKEU5qipGmqEgyHcZabj985HfuCNC0oDfqmFIHB91Oe1QjJSW2vkD26e8HQ8SeFUc7MNDGNvp0FgQL4E205M8lMAbzBbNkpoCzxJqQqWMLp3vwmO1nf+0vQBHPuEKMyRlyzJD5cRIKIoZSXJ2JEmIcB91SUtTjnwinThdn8A9rXiwEwj9uIKp+rsjRr6UQ9/VlT5SPTntjcT/vFakOqdOTHkYKcLxeFEnYlAFcBQ9KgWLGhJgLqm+FuIcEwkrHmNMhWNfniX1o5J1XCpf6zTOwBhZsAN2QRFY4ARUwAWoghrA4B48gRfwajwYz8ab8T4uzRiTnm3wB8bHD4ocoFU=</latexit> <latexit sha1_base64="FhViANzgI41mA7lEGHD9+SXO/+o=">ACDXicbZDLSsNAFIYnXmu9RV0qMlgFQShJEdRd0Y3LFuwF2hAmk0k7dDIJMxOlhC7duPFV3AhVxK17dz6DL+E07UJbfzjw8Z9zmDm/FzMqlWV9GXPzC4tLy7mV/Ora+samubVdl1EiMKnhiEWi6SFJGOWkpqhipBkLgkKPkYbXuxr1G7dESBrxG9WPiROiDqcBxUhpyzUPY9eGd7pOYOyWNJU0tbEfKZlZPW31XLNgFa1McBbsCRTKe8Pq9/3+sOKan20/wklIuMIMSdmyrVg5KRKYkYG+XYiSYxwD3VISyNHIZFOml0zgEfa8WEQCV1cwcz9vZGiUMp+6OnJEKmunO6NzP96rUQF505KeZwowvH4oSBhUEVwFA30qSBYsb4GhAXVf4W4iwTCSgeY1yHY0yfPQr1UtE+LF1WdxiUYKwd2wQE4BjY4A2VwDSqgBjB4AE/gBbwaj8az8Wa8j0fnjMnODvgj4+MHgBqcaw=</latexit> <latexit sha1_base64="wYnVpEecl+QCJ/GPzD21/2VLADQ=">AC3icbVA9SwNBEN3zM8avqKXNkiBEguFOBLUQgjYWFhGMCeTCMbfZJMvt7R27e4ZwpLex83fYWChi6x+wy79xk1io8cHA470Zub5MWdK2/bImptfWFxazqxkV9fWNzZzW9u3KkokoTUS8Ug2fFCUM0FrmlOG7GkEPqc1v3gYuzX76hULBI3ehDTVghdwTqMgDaSl8v3vTQoOUOMz3DfC/ABdqkG7ArwOeCrYn8fe7mCXbYnwLPE+SaFSt4tPY4qg6qX+3TbEUlCKjThoFTsWPdSkFqRjgdZt1E0RhIAF3aNFRASFUrnfwyxHtGaeNOJE0JjSfqz4kUQqUGoW86Q9A9dcbi/95zUR3TlopE3GiqSDTRZ2EYx3hcTC4zSQlmg8MASKZuRWTHkg2sSXNSE4f1+eJbeHZeofHpt0jhHU2TQLsqjInLQMaqgS1RFNUTQPXpCL+jVerCerTfrfdo6Z3P7KBfsD6+AM2Im1o=</latexit> Formally, Neu et al. shows n • Linear regression L ( w ) = 1 k w T x � y k 2 X 2 n i =1 • ℓ ! 
-regularization R ( w ) = 1 2 k w k 2 2 • GD/SGD path w k +1 = w k � η r L ( w ) • Geometric averaging 1 p k = (1 − p ) p k , p = 1 + λη solves min w L ( w ) + λ R ( w ) p 1 w 1 + p 2 w 2 + · · · + p k w k
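This can be checked numerically. The sketch below (synthetic data, w_0 = 0; an illustration, not code from the paper) runs unregularized GD once and verifies that the geometrically averaged iterates match the closed-form ridge solution:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 200, 5
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

eta, lam, K = 0.05, 1.0, 5000

# One run of *unregularized* GD on L(w) = (1/2n) ||Xw - y||^2; store the path.
w = np.zeros(d)
path = [w.copy()]
for _ in range(K):
    w = w - eta * X.T @ (X @ w - y) / n
    path.append(w.copy())

# Geometric weights p_k = (1 - p) p^k with p = 1 / (1 + lam * eta).
p = 1.0 / (1.0 + lam * eta)
weights = (1.0 - p) * p ** np.arange(len(path))
w_avg = weights @ np.asarray(path)

# Closed-form ridge solution of min_w L(w) + lam * (1/2) ||w||^2.
w_ridge = np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)
print(np.linalg.norm(w_avg - w_ridge))   # should be close to zero
```

Changing λ only changes the weights, so the same stored path serves every hyperparameter value.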
Our contributions: iterate averaging works for more general
1. regularizers <= generalized ℓ₂-regularizers
2. optimizers <= Nesterov's acceleration
3. objectives <= strongly convex and smooth losses
4. deep neural networks! (empirically)
<latexit sha1_base64="70MwI2fPbgrJ7PozSD5+vfNk4j4=">ACEnicbVA9SwNBEN2LXzF+RS1tlohgEMOdCGohBG0sLBIwMZCLx9xmkyzZ2zt29wzhyG+wyV+xsVDE1sou/8bNR6HGBwOP92aYmedHnClt2yMrtbC4tLySXs2srW9sbmW3d6oqjCWhFRLyUNZ8UJQzQSuaU5rkaQ+Jze+93rsX/SKViobjT/Yg2AmgL1mIEtJG8bL7nJd0jZ4DxJe5XyMXaoBlx+SYyO6AnwO+Pawl8dedt8u2BPgeLMyH4x5x4NR8V+yct+uc2QxAEVmnBQqu7YkW4kIDUjnA4ybqxoBKQLbVo3VEBAVSOZvDTAB0Zp4lYoTQmNJ+rPiQCpfqBbzoD0B31xuL/3n1WLfOGwkTUaypINFrZhjHeJxPrjJCWa9w0BIpm5FZMOSCDapJgxITh/X54n1ZOCc1q4KJs0rtAUabSHcugQOegMFdENKqEKIugJPaNX9GYNrRfr3fqYtqas2cwu+gXr8xth2J3F</latexit> <latexit sha1_base64="ws5cJAX3oCOTYhK7CiVfRA+KIzQ=">ACBnicbVDLSsNAFJ34rPUVdSnCYBHqpiSloC6EohuXrdgHNLFMpN26CQTZiaWErJy46+4caGIW7/BnX/jtM1CWw9cOJxzL/fe40WMSmVZ38bS8srq2npuI7+5tb2za+7tNyWPBSYNzBkXbQ9JwmhIGoqRtqRICjwGl5w+uJ3ogQlIe3qlxRNwA9UPqU4yUlrm0W1xdAovoeMLhBM7TcopHN07ikewDkds2CVrCngIrEzUgAZal3zy+lxHAckVJghKTu2FSk3QUJRzEiad2JIoSHqE86moYoINJNpm+k8EQrPehzoStUcKr+nkhQIOU48HRngNRAznsT8T+vEyv/3E1oGMWKhHi2yI8ZVBxOMoE9KghWbKwJwoLqWyEeIB2I0snldQj2/MuLpFku2ZXSRb1SqF5lceTAITgGRWCDM1AFN6AGgCDR/AMXsGb8WS8GO/Gx6x1ychmDsAfGJ8/tb2XZw=</latexit> <latexit sha1_base64="FhViANzgI41mA7lEGHD9+SXO/+o=">ACDXicbZDLSsNAFIYnXmu9RV0qMlgFQShJEdRd0Y3LFuwF2hAmk0k7dDIJMxOlhC7duPFV3AhVxK17dz6DL+E07UJbfzjw8Z9zmDm/FzMqlWV9GXPzC4tLy7mV/Ora+samubVdl1EiMKnhiEWi6SFJGOWkpqhipBkLgkKPkYbXuxr1G7dESBrxG9WPiROiDqcBxUhpyzUPY9eGd7pOYOyWNJU0tbEfKZlZPW31XLNgFa1McBbsCRTKe8Pq9/3+sOKan20/wklIuMIMSdmyrVg5KRKYkYG+XYiSYxwD3VISyNHIZFOml0zgEfa8WEQCV1cwcz9vZGiUMp+6OnJEKmunO6NzP96rUQF505KeZwowvH4oSBhUEVwFA30qSBYsb4GhAXVf4W4iwTCSgeY1yHY0yfPQr1UtE+LF1WdxiUYKwd2wQE4BjY4A2VwDSqgBjB4AE/gBbwaj8az8Wa8j0fnjMnODvgj4+MHgBqcaw=</latexit> <latexit sha1_base64="NgkEF/pzBUBbtk0dDOe5Z6kCGk=">ACE3icbVDLSgMxFM3UV62vqks3wSJUhTIjgnZXdCPiop9QGcomUzahmYyQ5KxlGH+wY1bP8ONC0XcunHnb/gFptMutPVA4HDOfeW4IaNSmeaXkZmbX1hcyi7nVlbX1jfym1t1GUQCkxoOWCaLpKEU5qipGmqEgyHcZabj985HfuCNC0oDfqmFIHB91Oe1QjJSW2vkD26e8HQ8SeFUc7MNDGNvp0FgQL4E205M8lMAbzBbNkpoCzxJqQqWMLp3vwmO1nf+0vQBHPuEKMyRlyzJD5cRIKIoZSXJ2JEmIcB91SUtTjnwinThdn8A9rXiwEwj9uIKp+rsjRr6UQ9/VlT5SPTntjcT/vFakOqdOTHkYKcLxeFEnYlAFcBQ9KgWLGhJgLqm+FuIcEwkrHmNMhWNfniX1o5J1XCpf6zTOwBhZsAN2QRFY4ARUwAWoghrA4B48gRfwajwYz8ab8T4uzRiTnm3wB8bHD4ocoFU=</latexit> 1. Generalized ℓ " -regularization R ( w ) = 1 2 w > Qw Use a preconditioned GD/SGD path instead! w k +1 = w k � η Q − 1 r L ( w ) solves min w L ( w ) + λ R ( w ) p 1 w 1 + p 2 w 2 + · · · + p k w k