

  1. One-and-a-Half Simple Differential Programming Languages Gordon Plotkin Calgary, 2019 ~ Joint work at Google with Martín Abadi ~

2. Talk Synopsis ● Review of neural nets ● Review of differentiation ● A minilanguage ● Differentiating conditionals and loops ● Language semantics: operational and denotational ● Beyond powers of R ● Conclusion and future work

3. Neural networks: a very brief introduction. Deep learning is based on neural networks: ● loosely inspired by the brain; ● built from simple, trainable functions. [Figure: a network mapping an input to an output.]

4. Primitives: the programmer’s neuron. y = F(w_1 x_1 + ... + w_n x_n + b), where ● w_1, ..., w_n are weights, ● b is a bias, ● weights and biases are parameters, ● F is a “differentiable” non-linear function, e.g., the “ReLU” F(x) = max(0, x). [Figure: a neuron with inputs x_1, ..., x_n.]
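A minimal sketch of this neuron in Python (using JAX’s NumPy; the names and values are illustrative, not from the slides):

```python
import jax.numpy as jnp

def relu(x):
    # The "ReLU" non-linearity: F(x) = max(0, x)
    return jnp.maximum(0.0, x)

def neuron(x, w, b):
    # y = F(w_1*x_1 + ... + w_n*x_n + b)
    return relu(jnp.dot(w, x) + b)

x = jnp.array([1.0, -2.0, 0.5])   # inputs x_1 ... x_n
w = jnp.array([0.3, 0.1, -0.4])   # weights w_1 ... w_n (parameters)
b = 0.2                           # bias (parameter)
print(neuron(x, w, b))            # 0.1: the weighted sum plus bias, passed through the ReLU
```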

5. Two activation functions: ReLU and Swish. ReLU: max(x, 0). Swish: x · σ(βx), where σ(z) = (1 + e^(−z))^(−1) (Ramachandran, Zoph, and Le, 2017). [Plots from the Desmos graphing calculator, https://www.desmos.com/calculator.]
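The two activation functions, sketched in Python/JAX (β is left at 1 by default; the slide does not fix a value):

```python
import jax.numpy as jnp

def relu(x):
    # ReLU: max(x, 0)
    return jnp.maximum(x, 0.0)

def swish(x, beta=1.0):
    # Swish: x * sigma(beta * x), where sigma(z) = 1 / (1 + exp(-z))
    return x / (1.0 + jnp.exp(-beta * x))

xs = jnp.linspace(-3.0, 3.0, 7)
print(relu(xs))
print(swish(xs))
```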

6. Conv Nets: A Modular Perspective (from colah’s blog, posted July 8, 2014). Introduction: In the last few years, deep neural networks have led to breakthrough results on a variety of pattern recognition problems, such as computer vision and voice recognition. One of the essential components leading to these results has been a special kind of neural network called a convolutional neural network. At its most basic, a convolutional neural network can be thought of as a kind of neural network that uses many identical copies of the same neuron. This allows the network to have lots of neurons and express computationally large models while keeping the number of actual parameters (the values describing how neurons behave) that need to be learned fairly small. [Figure: a 2D convolutional neural network. With thanks to C. Olah.] This trick of having multiple copies of the same neuron is roughly analogous to the abstraction of functions in mathematics and computer science. When programming, we write a function once and use it in many places; not writing the same code a hundred times in different places makes it faster to program, and results in fewer bugs. Similarly, a convolutional neural network can learn a neuron once and use it in many places, making it easier to learn the model and reducing error. Structure of Convolutional Neural Networks: Suppose you want a neural network to look at audio samples and predict whether a human is speaking or not. Maybe you want to do more analysis if someone is speaking. You get audio samples at different points in time, and the samples are evenly spaced.
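The weight-sharing idea behind convolutional networks can be sketched in a few lines of Python/JAX: a single small set of weights is applied at every window of the input, so the parameter count does not grow with the input length (the filter and samples below are made up for illustration):

```python
import jax.numpy as jnp

def conv1d(xs, w, b):
    # Apply the same neuron (weights w, bias b) to every length-k window of xs;
    # the parameters are shared across positions.
    k = w.shape[0]
    windows = jnp.stack([xs[i:i + k] for i in range(xs.shape[0] - k + 1)])
    return jnp.maximum(windows @ w + b, 0.0)   # ReLU of each windowed weighted sum

xs = jnp.array([0.1, 0.8, 0.3, -0.5, 0.9, 0.2])  # e.g. evenly spaced audio samples
w = jnp.array([0.5, -0.25, 0.5])                 # one shared 3-sample filter
print(conv1d(xs, w, 0.0))                        # one output per window position
```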

7. Some tensors. [Figure: examples of tensors, © Knoldus Inc.]

  8. ������ �������� �������� ���� ����������� ������������ �������� ���� ��������� ���������� ���������� ���������� �������� �������� �������� ������� ��������� �������� ���������������� ��������� ����������� ������� � ��������������� 4282 ����������� ��������� ����������������������� ���� ����������������� ��������� ����������������������� ����������������� ����������������� ����������� ����������������� ����������������������� � ����������� ������ ��������������� ������������������ ����������������� ���������������� ���������������� ������������ ������� ������������ ������������ �������� � ������������� � ��������������� Convolutional image classification Inception-Resnet-v1 achitecture Schema Stem Szegedy et al, arxiv.org/abs/1602.07261

9. Recurrent neural networks (RNNs). [Figure: an RNN is equivalent to a chain of cells with shared parameters.] There are many variants, e.g., LSTMs.
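A minimal sketch of an unrolled RNN in Python/JAX, with the same cell parameters reused at every time step (the tanh cell and the numbers are illustrative; variants such as LSTMs use a more elaborate cell):

```python
import jax.numpy as jnp

def cell(params, h, x):
    # One RNN cell; the same parameters are used at every step.
    W_h, W_x, b = params
    return jnp.tanh(W_h @ h + W_x @ x + b)

def rnn(params, h0, xs):
    # Unroll the cell along the input sequence, threading the hidden state.
    h = h0
    for x in xs:
        h = cell(params, h, x)
    return h

d = 2
params = (0.5 * jnp.eye(d), jnp.eye(d), jnp.zeros(d))
xs = [jnp.array([1.0, 0.0]), jnp.array([0.0, 1.0]), jnp.array([1.0, 1.0])]
print(rnn(params, jnp.zeros(d), xs))   # final hidden state after three steps
```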

10. Mixture of experts: a mixture-of-experts (MoE) model architecture with a conditional and a loop. [Figure with thanks to Yu et al.]

11. Neural Turing Machines. Neural Turing Machines combine an RNN with an external memory bank. Since vectors are the natural language of neural networks, the memory is an array of vectors. [Figure: a network A writes to and reads from this memory at each step, mapping inputs x0, x1, x2, x3 to outputs y0, y1, y2, y3. With thanks to C. Olah.] But how does reading and writing work? The challenge is that we want to make them differentiable. In particular, we want to make them differentiable with respect to the location we read from or write to, so that we can learn where to read and write. This is tricky because memory addresses seem to be fundamentally discrete. NTMs take a very clever solution to this: every step, they read and write everywhere, just to different extents. As an example, let’s focus on reading. Instead of specifying a single location, the RNN outputs an “attention distribution” that describes how we spread out the amount we care about different memory positions. As such, the result of the read operation is a weighted sum. Similarly, we write everywhere at once, to different extents. Again, an attention distribution describes how much we write at every location. We do this by having the new value of a position in memory be a convex combination of the old memory content and the write value, with the position between the two decided by the attention weight. (Source: https://distill.pub/2016/augmented-rnns/)
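The differentiable read and write described above are just a weighted sum and a convex combination; a small sketch in Python/JAX (memory size, width, and the attention weights are made up for illustration):

```python
import jax.numpy as jnp

def soft_read(memory, attention):
    # Read everywhere at once: a weighted sum of the memory rows.
    return attention @ memory

def soft_write(memory, attention, write_value):
    # New memory = convex combination of the old contents and the write value,
    # with the mixing weight at each location given by the attention.
    a = attention[:, None]
    return (1.0 - a) * memory + a * write_value[None, :]

memory = jnp.array([[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]])   # 3 slots of width 2
attention = jnp.array([0.7, 0.2, 0.1])                     # non-negative, sums to 1
print(soft_read(memory, attention))
print(soft_write(memory, attention, jnp.array([9.0, 9.0])))
```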

12. Supervised learning. Given a training dataset of (input, output) pairs, e.g., a set of images with labels: While not done: ● pick a pair (x, y); ● run the neural network on x to get Net(x, b, ...); ● compare this to y to calculate the loss (= error = cost), Loss(b, ...) = |y − Net(x, b, ...)|; ● adjust the parameters b, ... to reduce the loss. More generally, pick a “mini-batch” (x_1, y_1), ..., (x_n, y_n) and minimise the loss Loss(b, ...) = √(Σ_{i=1}^{n} (y_i − Net(x_i, b, ...))²).
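The training loop on this slide, sketched in Python/JAX with jax.grad supplying the derivative of the loss with respect to the parameters (the one-neuron “network” and the data are made up for illustration):

```python
import jax
import jax.numpy as jnp

def net(x, w, b):
    # A one-neuron stand-in for Net(x, b, ...).
    return jnp.maximum(0.0, w * x + b)

def loss(w, b, x, y):
    # Loss for a single pair, as on the slide: |y - Net(x, b, ...)|.
    return jnp.abs(y - net(x, w, b))

data = [(1.0, 2.0), (2.0, 3.0), (3.0, 4.0)]    # (input, output) pairs
w, b, r = 0.1, 0.0, 0.05                       # parameters and learning rate
grad_loss = jax.grad(loss, argnums=(0, 1))     # derivatives w.r.t. w and b

for _ in range(100):                  # "while not done"
    for x, y in data:                 # pick a pair (x, y)
        dw, db = grad_loss(w, b, x, y)
        w, b = w - r * dw, b - r * db   # adjust parameters to reduce the loss
print(w, b)
```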

13. Slope of a line. slope = (change in y)/(change in x) = Δy/Δx. So Δy = slope × Δx. So x′ = x − r·slope ⇒ y′ = y − r·slope².

14. Gradient descent. Compute partial derivatives along paths in the neural network. Follow the gradient of the loss with respect to the parameters. Thus: x′ := x − r·(slope of Loss at x) = x − r·(dLoss(x)/dx).
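The update x′ := x − r·(dLoss(x)/dx) is exactly what automatic differentiation supplies; a minimal sketch in Python/JAX (the one-dimensional loss below is made up):

```python
import jax

def loss(x):
    # An illustrative one-dimensional loss with its minimum at x = 3.
    return (x - 3.0) ** 2

dloss = jax.grad(loss)     # x -> dLoss(x)/dx

x, r = 0.0, 0.1            # starting point and learning rate
for _ in range(50):
    x = x - r * dloss(x)   # x' := x - r * (slope of Loss at x)
print(x)                   # close to 3.0
```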

15. Multi-dimensional gradient descent. x′ := x − r·∂L(x, y)/∂x and y′ := y − r·∂L(x, y)/∂y, i.e., (x′, y′) := (x, y) − r·(∂L(x, y)/∂x, ∂L(x, y)/∂y), i.e., v′ := v − r·∇L.
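The vector form v′ := v − r·∇L looks the same in code, with jax.grad now returning the whole gradient vector (the loss L below is made up):

```python
import jax
import jax.numpy as jnp

def L(v):
    # An illustrative loss on R^2 with its minimum at (1, -2).
    x, y = v
    return (x - 1.0) ** 2 + (y + 2.0) ** 2

grad_L = jax.grad(L)        # v -> (dL/dx, dL/dy)

v, r = jnp.array([0.0, 0.0]), 0.1
for _ in range(100):
    v = v - r * grad_L(v)   # v' := v - r * grad L(v)
print(v)                    # close to [1.0, -2.0]
```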

16. Looking at differentiation. Expressions with several variables: ∂e[x, y]/∂x |_{x, y = a, b}. Gradient of functions f : R² → R of two arguments: ∇(f) : R² → R², where ∇(f)(u, v) = (∂f(u, v)/∂u, ∂f(u, v)/∂v). Chain rule: ∂f(g(x, y, z), h(x, y, z))/∂x = ∂f(u, v)/∂u · ∂g(x, y, z)/∂x + ∂f(u, v)/∂v · ∂h(x, y, z)/∂x, where u, v = g(x, y, z), h(x, y, z).
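The chain rule can be checked mechanically with automatic differentiation; a small sketch in Python/JAX (f, g, h are made-up examples, not from the talk):

```python
import jax
import jax.numpy as jnp

f = lambda u, v: u * v + jnp.sin(u)   # f : R^2 -> R
g = lambda x, y, z: x * y + z         # g : R^3 -> R
h = lambda x, y, z: x + y * z         # h : R^3 -> R

def composed(x, y, z):
    return f(g(x, y, z), h(x, y, z))

x, y, z = 1.0, 2.0, 3.0
u, v = g(x, y, z), h(x, y, z)

# Left-hand side: d/dx of the composition.
lhs = jax.grad(composed, argnums=0)(x, y, z)

# Right-hand side: df/du * dg/dx + df/dv * dh/dx.
rhs = (jax.grad(f, argnums=0)(u, v) * jax.grad(g, argnums=0)(x, y, z)
       + jax.grad(f, argnums=1)(u, v) * jax.grad(h, argnums=0)(x, y, z))
print(lhs, rhs)   # the two values agree
```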

17. A matrix view of the multi-argument chain rule. We have:
∂f(g(x, y, z), h(x, y, z))/∂x = ∂f(u, v)/∂u · ∂g(x, y, z)/∂x + ∂f(u, v)/∂v · ∂h(x, y, z)/∂x
∂f(g(x, y, z), h(x, y, z))/∂y = ∂f(u, v)/∂u · ∂g(x, y, z)/∂y + ∂f(u, v)/∂v · ∂h(x, y, z)/∂y
∂f(g(x, y, z), h(x, y, z))/∂z = ∂f(u, v)/∂u · ∂g(x, y, z)/∂z + ∂f(u, v)/∂v · ∂h(x, y, z)/∂z
Set k = ⟨g, h⟩ : R³ → R² and define its Jacobian to be the 2 × 3 matrix:
J_k = ( ∂g/∂x  ∂g/∂y  ∂g/∂z
        ∂h/∂x  ∂h/∂y  ∂h/∂z )
Then the gradient of the composition f ∘ k is given by the vector-Jacobian product:
∇f(g(x, y, z), h(x, y, z)) = ∇f(u, v) · J_k(x, y, z)
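This vector-Jacobian product is precisely what reverse-mode automatic differentiation computes; a sketch in Python/JAX using jax.vjp (g, h, f below are made-up examples):

```python
import jax
import jax.numpy as jnp

def k(xyz):
    # k = <g, h> : R^3 -> R^2
    x, y, z = xyz
    return jnp.array([x * y + z,      # g(x, y, z)
                      x + y * z])     # h(x, y, z)

def f(uv):
    # f : R^2 -> R
    u, v = uv
    return u * v + jnp.sin(u)

xyz = jnp.array([1.0, 2.0, 3.0])

# Gradient of the composition f o k, computed directly:
direct = jax.grad(lambda p: f(k(p)))(xyz)

# The same gradient as a vector-Jacobian product: grad f(u, v) . J_k(x, y, z).
uv, vjp_k = jax.vjp(k, xyz)            # records J_k at xyz
(via_vjp,) = vjp_k(jax.grad(f)(uv))    # multiplies grad f(uv) by J_k
print(direct, via_vjp)                 # the two vectors agree
```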
