Stochastic Gradient Descent: Batch Gradient Descent
• $\nabla L_T(w) = \frac{1}{N} \sum_{n=1}^{N} \nabla \ell_n(w)$
• Taking a macro-step $-\alpha \nabla L_T(w_t)$ is the same as taking the $N$ micro-steps $-\frac{\alpha}{N} \nabla \ell_1(w_t), \ldots, -\frac{\alpha}{N} \nabla \ell_N(w_t)$
• First compute all $N$ steps at $w_t$, then take all the steps
• Thus, standard gradient descent is a batch method: Compute the gradient at $w_t$ using the entire batch of data, then move
• Even with no line search, computing $N$ micro-steps is still expensive
Stochastic Gradient Descent: Stochastic Descent
• Taking a macro-step $-\alpha \nabla L_T(w_t)$ is the same as taking the $N$ micro-steps $-\frac{\alpha}{N} \nabla \ell_1(w_t), \ldots, -\frac{\alpha}{N} \nabla \ell_N(w_t)$
• First compute all the $N$ steps at $w_t$, then take all the steps
• Can we use this effort more effectively?
• Key observation: $-\nabla \ell_n(w)$ is a poor estimate of $-\nabla L_T(w)$, but an estimate all the same: Micro-steps are correct on average!
• After each micro-step, we are on average in a better place
• How about computing a new micro-gradient after every micro-step?
• Now each micro-step gradient is evaluated at a point that is on average better (lower risk) than in the batch method
Stochastic Gradient Descent: Batch versus Stochastic Gradient Descent
• Let $s_n(w) = -\frac{\alpha}{N} \nabla \ell_n(w)$
• Batch:
  • Compute $s_1(w_t), \ldots, s_N(w_t)$
  • Move by $s_1(w_t)$, then $s_2(w_t)$, ... then $s_N(w_t)$ (or equivalently move once by $s_1(w_t) + \ldots + s_N(w_t)$)
• Stochastic (SGD):
  • Compute $s_1(w_t)$, then move by $s_1(w_t)$ from $w_t$ to $w_t^{(1)}$
  • Compute $s_2(w_t^{(1)})$, then move by $s_2(w_t^{(1)})$ from $w_t^{(1)}$ to $w_t^{(2)}$
  • ...
  • Compute $s_N(w_t^{(N-1)})$, then move by $s_N(w_t^{(N-1)})$ from $w_t^{(N-1)}$ to $w_t^{(N)} = w_{t+1}$
• In SGD, each micro-step is taken from a better (lower risk) place on average (see the sketch below)
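The following is a minimal sketch, not part of the slides, contrasting one batch macro-step with one stochastic sweep of micro-steps. The toy least-squares losses $\ell_n(w) = \frac{1}{2}(x_n^\top w - y_n)^2$, the generated data, and the step size $\alpha$ are assumptions made purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 100, 5
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)
alpha = 0.1

def grad_ell(w, n):
    """Gradient of the per-sample loss ell_n(w) = 0.5 * (x_n @ w - y_n)**2."""
    return (X[n] @ w - y[n]) * X[n]

def batch_step(w_t):
    """Batch macro-step: all N micro-steps are computed at the same point w_t."""
    steps = [-(alpha / N) * grad_ell(w_t, n) for n in range(N)]
    return w_t + np.sum(steps, axis=0)

def sgd_sweep(w_t):
    """Stochastic sweep: each micro-gradient is re-evaluated after every micro-step."""
    w = w_t.copy()
    for n in rng.permutation(N):
        w = w - (alpha / N) * grad_ell(w, n)
    return w

w_batch = batch_step(np.zeros(d))
w_sgd = sgd_sweep(np.zeros(d))
print(w_batch, w_sgd)
```

Both routines use the same micro-steps $s_n$; the only difference is whether the gradients are all evaluated at $w_t$ or at the most recent iterate.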
Stochastic Gradient Descent: Why "Stochastic?"
• Progress occurs only on average
• Many micro-steps are bad, but they are good on average
• Progress is a random walk
(figure credit: https://towardsdatascience.com/)
Stochastic Gradient Descent: Reducing Variance: Mini-Batches
• Each single data sample gives a poor estimate of the gradient over $T$: High-variance micro-steps
• Each micro-step takes full advantage of its estimate by moving right away: Low-bias micro-steps
• High variance may hurt more than low bias helps
• Can we lower variance at the expense of bias?
• Average $B$ samples at a time: Take mini-steps
• With bigger $B$:
  • Higher bias
  • Lower variance (see the numerical sketch below)
• The $B$ samples are a mini-batch
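A small numerical sketch of the variance reduction, under the same assumed toy least-squares setup as above: averaging $B$ per-sample gradients shrinks the variance of the gradient estimate by roughly a factor of $B$.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d, B = 10_000, 5, 32
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + rng.normal(size=N)
w = np.zeros(d)

per_sample = (X @ w - y)[:, None] * X              # all N per-sample gradients at w
full = per_sample.mean(axis=0)                     # gradient of L_T at w

J = N // B
mini = per_sample[: J * B].reshape(J, B, d).mean(axis=1)    # J mini-batch gradients

var_single = ((per_sample - full) ** 2).sum(axis=1).mean()  # spread of single-sample estimates
var_mini = ((mini - full) ** 2).sum(axis=1).mean()          # spread of mini-batch estimates
print(var_single / var_mini)                                # roughly B
```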
Stochastic Gradient Descent: Mini-Batches
• Scramble $T$ at random
• Divide $T$ into $J$ mini-batches $T_j$ of size $B$
• $w^{(0)} = w$
• For $j = 1, \ldots, J$:
  • Batch gradient: $g_j = \nabla L_{T_j}(w^{(j-1)}) = \frac{1}{B} \sum_{n=(j-1)B+1}^{jB} \nabla \ell_n(w^{(j-1)})$
  • Move: $w^{(j)} = w^{(j-1)} - \alpha\, g_j$
• This for loop amounts to one macro-step (a code sketch follows)
• Each execution of the entire loop uses the training data once
• Each execution of the entire loop is an epoch
• Repeat over several epochs until a stopping criterion is met
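A minimal sketch of this loop, again on the assumed toy least-squares losses: shuffle $T$, split it into $J$ mini-batches of size $B$, and take one mini-step per batch; one full pass over the data is an epoch.

```python
import numpy as np

rng = np.random.default_rng(2)
N, d, B, alpha, epochs = 1024, 5, 32, 0.05, 10
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

w = np.zeros(d)                                    # w^(0)
J = N // B                                         # number of mini-batches per epoch
for epoch in range(epochs):                        # each execution of the inner loop is one epoch
    order = rng.permutation(N)                     # scramble T at random
    for j in range(J):
        idx = order[j * B:(j + 1) * B]             # mini-batch T_j
        residual = X[idx] @ w - y[idx]
        g_j = (residual[:, None] * X[idx]).mean(axis=0)   # gradient of L_{T_j} at w^(j-1)
        w = w - alpha * g_j                        # mini-step: w^(j) = w^(j-1) - alpha * g_j
```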
Stochastic Gradient Descent: Momentum
• Sometimes $w^{(j)}$ meanders around in shallow valleys
• No $\alpha$ adjustment here: $\alpha$ is too small, but the direction is still promising
• Add momentum (a code sketch follows):
  $v^{(0)} = 0$
  $v^{(j+1)} = \mu^{(j)} v^{(j)} - \alpha \nabla L_T(w^{(j)}) \quad (0 \le \mu^{(j)} < 1)$
  $w^{(j+1)} = w^{(j)} + v^{(j+1)}$
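A minimal sketch of the momentum update on a toy quadratic risk; the step size $\alpha$, the constant momentum coefficient $\mu$, and the quadratic itself are assumptions for illustration. The velocity $v$ accumulates past gradients, so consistent directions build up speed while oscillating components tend to cancel.

```python
import numpy as np

alpha, mu = 0.02, 0.9                    # assumed step size and momentum coefficient
A = np.diag([1.0, 50.0])                 # toy ill-conditioned quadratic risk L(w) = 0.5 w^T A w

def grad_L(w):
    return A @ w

w = np.array([5.0, 5.0])
v = np.zeros_like(w)                     # v^(0) = 0
for j in range(200):
    v = mu * v - alpha * grad_L(w)       # v^(j+1) = mu * v^(j) - alpha * grad L_T(w^(j))
    w = w + v                            # w^(j+1) = w^(j) + v^(j+1)
print(w)                                 # close to the minimizer at the origin
```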
Regularization
• The capacity of deep networks is very high: It is often possible to achieve near-zero training loss
• "Memorize the training set" $\Rightarrow$ overfitting
• All training methods use some type of regularization
• Regularization can be seen as inductive bias: Bias the training algorithm to find weights with certain properties
• Simplest method: weight decay, add a term $\lambda \|w\|^2$ to the risk function: Keep the weights small (Tikhonov); see the sketch below
• Many proposals have been made
• Not yet clear which method works best; a few proposals follow
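A minimal sketch of weight decay on the assumed toy least-squares risk: minimizing $L_T(w) + \lambda \|w\|^2$ simply adds $2\lambda w$ to every gradient, which pulls the weights toward zero.

```python
import numpy as np

rng = np.random.default_rng(3)
N, d, B, alpha, lam, epochs = 512, 5, 32, 0.05, 1e-3, 20
X = rng.normal(size=(N, d))
y = X @ rng.normal(size=d) + 0.1 * rng.normal(size=N)

w = np.zeros(d)
for epoch in range(epochs):
    for idx in np.array_split(rng.permutation(N), N // B):
        residual = X[idx] @ w - y[idx]
        grad_risk = (residual[:, None] * X[idx]).mean(axis=0)   # gradient of L_{T_j}
        grad_penalty = 2.0 * lam * w                            # gradient of lam * ||w||^2
        w = w - alpha * (grad_risk + grad_penalty)              # the penalty keeps ||w|| small
print(np.linalg.norm(w))
```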
Regularization: Early Termination
• Terminating training well before $L_T$ is minimized is somewhat similar to "implicit" weight decay
• Progress at each iteration is limited, so stopping early keeps us close to $w_0$, which is a set of small random weights
• Therefore, the norm of $w_t$ is restrained, albeit in terms of how long the learner takes to get there rather than in absolute terms
• A more informed approach to early termination stops when a validation risk (or, even better, error rate) stops declining (see the sketch below)
• This (with validation check) is arguably the most widely used regularization method
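A minimal sketch of validation-based early termination. The helpers `train_epoch` and `validation_risk`, and the `patience` parameter, are hypothetical names introduced here for illustration; the idea is just to keep training while validation risk keeps improving and return the best weights seen so far.

```python
import copy

def early_stopping_loop(model, train_epoch, validation_risk, patience=5, max_epochs=200):
    """train_epoch(model) runs one epoch in place; validation_risk(model) returns a float."""
    best_risk = float("inf")
    best_model = copy.deepcopy(model)
    epochs_since_improvement = 0
    for epoch in range(max_epochs):
        train_epoch(model)                         # one pass over the training data
        risk = validation_risk(model)              # risk (or error rate) on held-out data
        if risk < best_risk:
            best_risk = risk
            best_model = copy.deepcopy(model)      # remember the best weights so far
            epochs_since_improvement = 0
        else:
            epochs_since_improvement += 1
        if epochs_since_improvement >= patience:   # validation risk stopped declining
            break
    return best_model, best_risk
```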
Regularization: Dropout
• Dropout is inspired by ensemble methods: Regularize by averaging multiple predictors
• Key difficulty: It is too expensive to train an ensemble of deep neural networks
• Efficient (crude!) approximation:
  • Before processing a new mini-batch, flip a coin with $P[\text{heads}] = p$ (typically $p = 1/2$) for each neuron
  • Turn off the neurons for which the coin comes up tails
  • Restore all neurons at the end of the mini-batch
• When training is done, multiply all weights by $p$
• This is very loosely akin to training a different network for every mini-batch
• Multiplication by $p$ takes the "average" of all networks
• There are flaws in the reasoning, but the method works (a code sketch follows)
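A minimal sketch of the scheme, assuming a layer's activations are stored in a numpy array. For simplicity it draws a fresh mask on each call and scales the activations (rather than the outgoing weights) by $p$ after training; for a linear next layer this has the same effect as the weight scaling on the slide.

```python
import numpy as np

rng = np.random.default_rng(4)
p = 0.5                                      # probability that a unit is kept (coin = heads)

def dropout(h, train=True):
    """Apply dropout to a layer's activations h."""
    if train:
        mask = rng.random(h.shape) < p       # heads: keep the unit
        return h * mask                      # tails: the unit is turned off
    return h * p                             # after training: scale by p to "average" the networks

h = rng.normal(size=(8,))
print(dropout(h, train=True))                # some entries zeroed out
print(dropout(h, train=False))               # all entries scaled by p
```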
Regularization: Data Augmentation
• Data augmentation is not a regularization method, but it combats overfitting
• Make new training data out of thin air
• Given a data sample $(x, y)$, create perturbed copies $x_1, \ldots, x_k$ of $x$ (these have the same label!)
• Add samples $(x_1, y), \ldots, (x_k, y)$ to the training set $T$
• With images this is easy: The $x_i$ are cropped, rotated, stretched, re-colored, ... versions of $x$ (see the sketch below)
• One training sample generates $k$ new ones
• $T$ grows by a factor of $k + 1$
• Very effective, used almost universally
• Need to use realistic perturbations
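A minimal sketch of image augmentation, assuming images are float numpy arrays of shape HxWx3 with values in [0, 1]; the particular perturbations (flip, small shift, brightness jitter) are illustrative choices, not the ones used in the course.

```python
import numpy as np

rng = np.random.default_rng(5)

def perturb(x):
    """Return one randomly perturbed copy of image x (the label stays the same)."""
    out = x
    if rng.random() < 0.5:
        out = out[:, ::-1, :]                             # horizontal flip
    dy, dx = rng.integers(0, 5, size=2)                   # small random shift via pad-and-crop
    out = np.pad(out, ((4, 4), (4, 4), (0, 0)), mode="edge")[dy:dy + x.shape[0],
                                                             dx:dx + x.shape[1], :]
    out = np.clip(out * rng.uniform(0.8, 1.2), 0.0, 1.0)  # brightness jitter
    return out

def augment(x, y, k=4):
    """One training sample (x, y) generates k new ones; T grows by a factor of k + 1."""
    return [(x, y)] + [(perturb(x), y) for _ in range(k)]
```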