Stochastic Gradient Descent




  1. Tufts COMP 135: Introduction to Machine Learning (https://www.cs.tufts.edu/comp/135/2020f/)
     Stochastic Gradient Descent
     Many slides attributable to: Prof. Mike Hughes; Erik Sudderth (UCI); Emily Fox (UW); Finale Doshi-Velez (Harvard); James, Witten, Hastie, Tibshirani (ISL/ESL books)

  2. Objectives Today (day 12): Stochastic Gradient Descent
     • Review: Gradient Descent (repeatedly step downhill until converged)
     • Review: Training Neural Nets with Backprop (backprop = chain rule plus dynamic programming)
     • L-BFGS: how to step in a better direction?
     • Stochastic Gradient Descent: how to go fast?

  3. Review: Gradient Descent in 1D
     input: initial θ ∈ R
     input: step size α ∈ R+
     while not converged:
         θ ← θ − α (d/dθ) J(θ)
     Q: Which direction to step?  A: Straight downhill (steepest descent at current location).
     Q: How far to step in that direction?  A: A distance of α · |(d/dθ) J(θ)|.
     Step size parameter is picked in advance, unaware of current location.
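     A minimal Python sketch of the 1D update above (not from the slides; the function names and the quadratic example objective are illustrative assumptions):

        # Sketch: 1D gradient descent, theta <- theta - alpha * dJ/dtheta
        def gradient_descent_1d(grad_J, theta_init, alpha, max_iters=1000, tol=1e-8):
            theta = theta_init
            for _ in range(max_iters):
                step = alpha * grad_J(theta)   # step length = alpha * |dJ/dtheta|
                theta = theta - step           # move straight downhill
                if abs(step) < tol:            # crude convergence check
                    break
            return theta

        # Example: J(theta) = theta^2, so dJ/dtheta = 2 * theta; minimum at 0.
        theta_star = gradient_descent_1d(lambda th: 2.0 * th, theta_init=5.0, alpha=0.1)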

  4. Review: Gradient Descent in 2D+
     gradient = vector of D partial derivatives: ∇_θ J(θ) = [ ∂J/∂θ_0, ∂J/∂θ_1, … ]ᵀ
     input: initial θ ∈ R^D
     input: step size α ∈ R+
     while not converged:
         θ ← θ − α ∇_θ J(θ)
     Q: Which direction to step?  A: Straight downhill (steepest descent at current location).
     Q: How far to step in that direction?  A: A distance of α · ||∇_θ J(θ)||.
     Step size parameter is picked in advance, unaware of current location.
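     The same loop in D dimensions, sketched with NumPy; the assumption here is that grad_J returns the vector of partial derivatives, and the quadratic example is again hypothetical:

        import numpy as np

        # Sketch: theta <- theta - alpha * grad J(theta), steepest descent in R^D
        def gradient_descent(grad_J, theta_init, alpha, max_iters=1000, tol=1e-8):
            theta = np.asarray(theta_init, dtype=float)
            for _ in range(max_iters):
                g = grad_J(theta)                      # vector of D partial derivatives
                theta = theta - alpha * g              # step straight downhill
                if alpha * np.linalg.norm(g) < tol:    # step length = alpha * ||grad J||
                    break
            return theta

        # Example: J(theta) = 0.5 * ||theta||^2, whose gradient is theta itself.
        theta_star = gradient_descent(lambda th: th, theta_init=[3.0, -2.0], alpha=0.1)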

  5. Review: Step size matters
     Even in one dimension, tough to select step size.
     [Figure: plots of g(y) versus y illustrating different step-size behaviors]
     Recommendations:
     - Try multiple values
     - Might need different sizes at different locations
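     One way to act on the "try multiple values" recommendation, sketched under the assumption that J and grad_J are callables, reusing the gradient_descent sketch above:

        # Sketch: run descent with several candidate step sizes, keep the best result
        def pick_step_size(J, grad_J, theta_init, alphas=(1.0, 0.1, 0.01, 0.001)):
            best = None
            for alpha in alphas:
                theta = gradient_descent(grad_J, theta_init, alpha)
                if best is None or J(theta) < best[2]:
                    best = (alpha, theta, J(theta))
            return best  # (best alpha, final theta, final objective value)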

  6. Review: Neural Net as computational graph
     Two directions of propagation:
     Forward: compute loss
     Backward: compute grad
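     A toy illustration of the two directions, assuming a hypothetical one-layer model ŷ = wᵀx with squared-error loss (the slides cover general computational graphs; this is only a sketch of the per-example forward/backward helper used in the next slide):

        import numpy as np

        def forward_and_backward_prop(x_n, y_n, w):
            # Forward: propagate the input to a prediction and a loss
            yhat = np.dot(w, x_n)
            loss = 0.5 * (y_n - yhat) ** 2
            # Backward: chain rule, dloss/dw = (dloss/dyhat) * (dyhat/dw)
            grad_wrt_w = (yhat - y_n) * x_n
            return loss, grad_wrt_w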

  7. Review: Training Neural Nets
     Training objective: min_w Σ_{n=1}^{N} E(y_n, ŷ(x_n, w))
     Gradient Descent Algorithm:
         w = initialize_weights_at_random_guess(random_state=0)
         while not converged:
             total_grad_wrt_w = zeros_like(w)
             for n in 1, 2, …, N:
                 loss[n], grad_wrt_w[n] = forward_and_backward_prop(x[n], y[n], w)
                 total_grad_wrt_w += grad_wrt_w[n]
             w = w - alpha * total_grad_wrt_w
     Q: How to pick the step size reliably?  Q: How to go fast on big datasets?
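     As a preview of the "how to go fast" question, a hedged sketch of the stochastic (minibatch) variant: instead of summing gradients over all N examples before each step, update on a small random batch. The batch size, epoch count, and the forward_and_backward_prop helper from the earlier sketch are assumptions, not the course's reference implementation:

        import numpy as np

        def sgd(x, y, w, alpha=0.01, batch_size=32, n_epochs=10, random_state=0):
            rng = np.random.RandomState(random_state)
            N = len(x)
            for _ in range(n_epochs):
                order = rng.permutation(N)                 # shuffle examples each epoch
                for start in range(0, N, batch_size):
                    batch = order[start:start + batch_size]
                    grad = np.zeros_like(w)
                    for n in batch:
                        _, grad_n = forward_and_backward_prop(x[n], y[n], w)
                        grad += grad_n
                    w = w - alpha * grad / len(batch)      # noisy but cheap update
            return w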
