However...

x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0

Learned weights: [0.004697241052453581, -0.009743527387551375, -0.00476408160440969]

Input [0, 0], target 0 -> prediction 0.5011743081039458
Input [0, 1], target 1 -> prediction 0.4999832898620173
Input [1, 0], target 1 -> prediction 0.49873843109337934
Input [1, 1], target 0 -> prediction 0.4975474276853999

This is the XOR problem 17
So far, not very spectacular... One neuron on its own is hardly a brain Multilayer Perceptron (MLP): stack different neurons in layers Input layer Hidden layer Output layer Connect all outputs with all inputs of next layer ("fully connected" or "dense" architecture) The question is now: how to train? 18
Backpropagation We can’t use the same approach as we did for a single perceptron as we don’t know what the “true outcome” should be for the lower layers This issue took quite some time to solve Eventually, a method called “backpropagation” was devised to overcome this 19
Backpropagation Note that feedforward is still easy http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html 20
Backpropagation We can also still compare the predicted output with the expected one, from which we can derive a loss value http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html 21
Backpropagation The idea of backpropagation is to “back propagate” the error through the network Using the chain rule of partial derivatives http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html 22
Backpropagation Using this, we know how to shift the weights http://home.agh.edu.pl/~vlsi/AI/backp_t_en/backprop.html Further information: https://victorzhou.com/blog/intro-to-neural-networks/ http://www.emergentmind.com/neural-network https://www.youtube.com/watch? v=aircAruvnKk&list=PLZHQObOWTQDNU6R1_67000Dx_ZCJB-3pi 23
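To make the mechanics concrete, here is a minimal numpy sketch of backpropagation on the XOR problem from before (a 2-4-1 sigmoid network; the layer sizes, learning rate, and iteration count are illustrative choices, and convergence is not guaranteed for every random initialization):

import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([[0], [1], [1], [0]], dtype=float)

W1 = rng.normal(size=(2, 4)); b1 = np.zeros(4)   # hidden layer
W2 = rng.normal(size=(4, 1)); b2 = np.zeros(1)   # output layer

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

lr = 1.0
for _ in range(10000):
    # feedforward: compute hidden activations, then the prediction
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    # backpropagation: the chain rule pushes the output error back layer by layer
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    W2 -= lr * h.T @ d_out; b2 -= lr * d_out.sum(axis=0)
    W1 -= lr * X.T @ d_h;   b1 -= lr * d_h.sum(axis=0)

print(out.round(3))  # should approach [[0], [1], [1], [0]]

Unlike the single perceptron, the hidden layer lets the network carve up the input space non-linearly, which is exactly what XOR requires.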
Further aspects
Here, we have used one output neuron, but more than one output is possible as well (e.g. for a multi-class problem)
Multiple hidden layers can be added in as well
Note that the neurons in the hidden layers commonly use different activation functions than the output neurons
E.g. ReLU is common for the hidden layers. The activation function of the output layer depends on the task (regression, binary classification, multiclass)
For multiclass, a "softmax" layer is added on top of the output neurons so that their outputs sum to 1 (see the sketch below)
The discussion regarding backpropagation reveals that the error function and activation functions used should be differentiable
For the perceptron model, a naïve error function was used (the absolute difference)
Many other error (or "loss") functions exist as well… 24
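As an illustration of that softmax layer, a small numpy sketch (the input values are made up):

import numpy as np

def softmax(z):
    e = np.exp(z - z.max())        # subtract the max for numerical stability
    return e / e.sum()

logits = np.array([2.0, 1.0, 0.1])   # raw outputs of the output neurons
print(softmax(logits))               # ~[0.659 0.242 0.099]
print(softmax(logits).sum())         # 1.0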
Further aspects: loss functions
Common choices include
For regression:
Mean squared error (MSE): E = (1/N) ∑_{i=1}^{N} (ŷ_i − y_i)²
For classification:
Cross entropy: E = − ∑_{i=1}^{N} ∑_{c=1}^{C} y_ic × ln(ŷ_ic)
Cross entropy for binary classification: E = − ∑_{i=1}^{N} (y_i × ln(ŷ_i) + (1 − y_i) × ln(1 − ŷ_i))
Note that a great deal of the research on new architectures consists of finding appropriate loss functions (numpy sketches of these losses follow below) 25
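These translate directly into numpy; a hedged sketch (the function names are mine, y holds targets, y_hat predicted values or probabilities):

import numpy as np

def mse(y, y_hat):
    return np.mean((y_hat - y) ** 2)

def binary_cross_entropy(y, y_hat, eps=1e-12):
    y_hat = np.clip(y_hat, eps, 1 - eps)       # avoid log(0)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

def cross_entropy(y_onehot, y_hat, eps=1e-12):
    # y_onehot and y_hat are (N, C): one row of class probabilities per instance
    return -np.sum(y_onehot * np.log(np.clip(y_hat, eps, None)))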
Further aspects: gradient descent
Recall: normal gradient descent ("batch" gradient descent) presents all training instances to the network
One update of the weights follows, based on gradients averaged over the whole training set
Very precise, but very time-consuming
Stochastic gradient descent updates the weights after every instance
Quicker, but more sensitive to particular examples (looks like a "drunk walk" towards the minimum)
Might need to shuffle instances every epoch
Most implementations hence use a "mini-batch" approach (a sketch of this pattern follows below)
Shuffle the training set, present it in small batches
Update the weights after each mini-batch 26
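A sketch of the mini-batch pattern in plain numpy (the generator function minibatches is a hypothetical helper, not a library call):

import numpy as np

def minibatches(X, y, batch_size=128, epochs=20):
    n = len(X)
    rng = np.random.default_rng()
    for _ in range(epochs):
        idx = rng.permutation(n)              # reshuffle every epoch
        for start in range(0, n, batch_size):
            sel = idx[start:start + batch_size]
            yield X[sel], y[sel]              # compute gradients and update
                                              # the weights on this batch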
Further aspects: initialization In our simple example, we have initialized all weights to 0, though it is also common to initialize the weights randomly A possible approach here as well is preliminary training Good starting values for the weights are essential for getting good solutions (and not getting stuck in local minima) Preliminary training uses a small number of random starting weights and takes a few iterations from each Use the best of the final values as the new starting value and continue with those 27
Further aspects: backpropagation alternatives
Most implementations will use backpropagation, though other approaches to train an artificial neural network exist as well
Advanced nonlinear optimization algorithms
Hessian-based (Newton) methods
Conjugate gradient
Levenberg-Marquardt
Genetic algorithm based
… 28
Further aspects: beyond stochastic gradient descent
Even when using backpropagation, different optimization strategies exist other than plain stochastic gradient descent
Momentum
RMSprop
Adagrad
Adadelta
Eve
Adabound
...
http://cs231n.github.io/neural-networks-3/
Adaptive learning rate tuning, see e.g. the learning rate finder (https://github.com/surmenok/keras_lr_finder) and cyclical training (bouncing the learning rate back and forth, https://arxiv.org/abs/1506.01186)
A lot of research is being put into this field 29
Further aspects: ReLU
Another interesting aspect to note is the popularity of ReLU (f(x) = max(0, x)) as the activation function in the hidden nodes, instead of the previously used tanh or sigmoid functions
ReLU reduces the likelihood of a vanishing gradient
The problem of vanishing gradients occurs for activation functions whose gradient becomes increasingly small as the absolute value of x increases, causing updates in the "lower" layers to happen very slowly and to "vanish out" (see the small illustration below)
The constant gradient of ReLUs results in faster learning
Variants such as noisy or leaky ReLUs are also commonly used
https://towardsdatascience.com/activation-functions-and-its-types-which-is-better-a9a5310cc8f 30
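A tiny numpy illustration of why this matters: the derivative of the sigmoid shrinks toward zero for large |x|, while ReLU's stays constant at 1 wherever the unit is active (the probe values are arbitrary):

import numpy as np

def relu_grad(x):
    return (x > 0).astype(float)      # 1 where active, 0 elsewhere

def sigmoid_grad(x):
    s = 1 / (1 + np.exp(-x))
    return s * (1 - s)                # at most 0.25, tiny for large |x|

x = np.array([-5.0, 0.5, 5.0])
print(relu_grad(x))                   # [0. 1. 1.]
print(sigmoid_grad(x))                # ~[0.0066 0.235 0.0066]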
Further aspects: preventing overfitting
Continuous training will keep lowering the error on the training set, but will eventually lead to overfitting (memorizing the training data)
As such, validation is crucial (commonly with a validation split)
Early stopping: stop training when the validation error has reached its minimum level
Regularization (penalizing large weights) is another approach, as large weights are generally a sign that overfitting is occurring
(a Keras sketch of both follows below) 31
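A hedged Keras sketch of both techniques, in the style of the MNIST model later in these slides (patience=3 and the 0.01 penalty are illustrative values, not prescriptions):

from keras.callbacks import EarlyStopping
from keras.regularizers import l2
from keras.layers import Dense

early_stop = EarlyStopping(monitor='val_loss', patience=3,
                           restore_best_weights=True)
penalized = Dense(8, activation='relu', kernel_regularizer=l2(0.01))

# model.fit(X_train, y_train, validation_split=0.1,
#           callbacks=[early_stop], epochs=100)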
Further aspects: preventing overfitting
Dropout is another method: at each training stage, individual nodes are "dropped out" of the net with a given probability, so that a reduced network is left; incoming and outgoing edges of a dropped-out node are also removed
Improves training and reduces node interactions
Forces the network to learn alternative pathways -- e.g. enforces redundancy
Leading to better generalization
Batch normalization has also become popular: one often normalizes the input layer by adjusting and scaling the activations; if the input layer benefits from it, why not do the same for the values in the hidden layers, which are changing all the time?
Batch normalization reduces the amount by which the hidden unit values shift around (covariate shift)
Allows higher learning rates, because batch normalization makes sure that no activation goes really extreme
Reduces overfitting because it has a slight regularization effect: similar to dropout, it adds some noise to each hidden layer's activations
Some people have recently argued against dropout altogether, in favor of (heavily) using batch normalization only
Though not all: http://nyus.joshuawise.com/batchnorm.pdf (Batch Normalization for Improved DNN Performance, My Ass) 32
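In Keras, both are one-line layers; a hedged sketch (the layer sizes and the 0.5 rate are illustrative):

from keras.models import Sequential
from keras.layers import Dense, Dropout, BatchNormalization

model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(784,)))
model.add(BatchNormalization())   # normalize activations per mini-batch
model.add(Dropout(0.5))           # randomly silence half the units in training
model.add(Dense(10, activation='softmax'))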
Example Let's summarize a bit... https://playground.tensorflow.org/ 33
MLPs are already powerful

import keras
from keras.models import Sequential
from keras.layers import Dense, Flatten
from keras.datasets import mnist
from matplotlib import pyplot as plt
import numpy as np

(X_train, y_train), (X_test, y_test) = mnist.load_data()
print(X_train.shape)   # (60000, 28, 28)
plt.imshow(X_train[0], cmap='gray'); plt.show()
print(y_train[0])      # 5

X_train = X_train.astype('float32') / 255   # 60000 train samples
X_test = X_test.astype('float32') / 255     # 10000 test samples

num_classes = 10
y_train = keras.utils.to_categorical(y_train, num_classes)
y_test = keras.utils.to_categorical(y_test, num_classes)

34
MLPs are already powerful

model = Sequential()
model.add(Flatten(input_shape=(28, 28)))
model.add(Dense(8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(num_classes, activation='softmax'))
model.summary()

model.compile(loss='categorical_crossentropy',
              optimizer='adam',
              metrics=['accuracy'])

35
MLPs are already powerful

batch_size = 128
epochs = 20

model.fit(X_train, y_train,
          batch_size=batch_size,
          epochs=epochs,
          verbose=2,
          validation_data=(X_test, y_test))

score = model.evaluate(X_test, y_test, verbose=0)
print('Test loss:', score[0])
print('Test accuracy:', score[1])

# Epoch 1/20
#  - 3s - loss: 0.9510 - acc: 0.7003 - val_loss: 0.4914 - val_acc: 0.8681
# Epoch 2/20
#  - 2s - loss: 0.4345 - acc: 0.8772 - val_loss: 0.3707 - val_acc: 0.8979
# ...
# Epoch 20/20
#  - 2s - loss: 0.2511 - acc: 0.9295 - val_loss: 0.2701 - val_acc: 0.9262
# Test loss: 0.27007879534959794
# Test accuracy: 0.9262

36
MLPs are already powerful

Layer (type)         Output Shape    Param #
=================================================================
flatten_1 (Flatten)  (None, 784)     0
dense_1 (Dense)      (None, 8)       6280 = 784 * 8 + 8
dense_2 (Dense)      (None, 8)       72 = 8 * 8 + 8
dense_3 (Dense)      (None, 10)      90 = 8 * 10 + 10
=================================================================
Total params: 6,442

37
MLPs are already powerful, but how do they learn? 38
MLPs are already powerful, but how do they learn? [[0 0 0.016035 *0.983951* 0 0.000015 0 0 0 0]] # This is a three? 39
Deep learning 40
Deep what?
The deep in deep learning isn't a reference to any kind of deeper understanding achieved by the approach; it stands for the idea of successive layers of representations
Other appropriate names for the field could have been:
Layered representations learning
Hierarchical representations learning
Differentiable function learning
Modern deep learning often involves tens or even hundreds of successive layers of representations
Enabled by the rise in computational power
Main contributions follow from architectures and loss functions
“The goal is to create algorithms that can take in very unstructured data, like images, audio waves or text blocks (things traditionally very hard for computers to process) and predict the properties of those inputs” – Andrew Ng 41
We'll look at the following types Convolutional neural networks Recurrent neural networks Generative adversarial networks Reinforcement learning Embeddings and representational learning: when we discuss text mining 42
Convolutional neural networks (CNNs)
Our "deep" MLP already does pretty well on a simple data set: black-and-white images, small, only 10 classes
How about a data set with pictures of 1000 classes? (Cats, dogs, cars, boats, …)
Increase the number of layers? Hidden units? Lots of weights to train! 43
Convolutional neural networks (CNNs)
In 2010, a large database known as "ImageNet", containing millions of labeled images, was created and published by a research group at Stanford
In 2012, Alex Krizhevsky, Ilya Sutskever, and Geoff Hinton entered a submission that would halve the existing error rate
This model combined several critical components
Probably the most important piece was the use of graphics processing units (GPUs) to train the model
They also introduced a method to reduce overfitting known as dropout, and used the rectified linear unit (ReLU) activation
The network went on to become known as "AlexNet", and the paper describing it has been cited nearly 10000 times since it was published
And even before this: the first convolutional neural networks (CNNs) to recognize handwritten digits, by Yann LeCun at AT&T Bell Labs ("LeNet") 44
Convolutional neural networks (CNNs)
A series of convolutional and pooling layers, followed by fully connected layers and a softmax output layer
width × height × 3 input layer for colored images
The convolutional layer does most of the heavy lifting: it learns a number of "filters" while retaining the spatial topology
I.e. don't fully connect everything
The pooling layer applies simple downsampling (i.e. downsizing the image)
(a minimal Keras sketch of this pattern follows below) 45
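A hedged Keras sketch of this layer pattern for 32×32 color images (the filter counts and sizes are illustrative choices, not a reference architecture):

from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense

model = Sequential()
model.add(Conv2D(32, (3, 3), activation='relu',
                 input_shape=(32, 32, 3)))   # width x height x 3 input
model.add(MaxPooling2D((2, 2)))              # downsample
model.add(Conv2D(64, (3, 3), activation='relu'))
model.add(MaxPooling2D((2, 2)))
model.add(Flatten())                         # hand over to dense layers
model.add(Dense(64, activation='relu'))
model.add(Dense(10, activation='softmax'))   # softmax output layer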
Convolutional neural networks (CNNs)
https://ujjwalkarn.me/2016/08/11/intuitive-explanation-convnets/
Look at every w_f × h_f × 3 window and apply the filter: a spatially local weighted sum
Do this for every such window by moving it n pixels at a time (the stride)
Paddings can be defined for the edges of an image
Every window leads to one output value: together these form a convolved image
The same kernel is used for all positions in the image: parameter sharing!
A convolutional layer learns multiple such filters (the depth) 46
Convolutional neural networks (CNNs) See example: http://scs.ryerson.ca/~aharley/vis/ 47
Convolutional neural networks (CNNs)
Many variations have been developed
Newer architectures remove the pooling layers
Use dropout, batch normalization
Data augmentation: add in variety (e.g. see https://keras.io/preprocessing/image/, https://github.com/aleju/imgaug)
Prevents overfitting
Can also be applied at prediction time (test-time augmentation)
LeNet: the first successful applications of CNNs, developed by Yann LeCun in the 1990s
AlexNet: the first work that popularized CNNs in computer vision
ZF Net: a convolutional network from Matthew Zeiler and Rob Fergus; an improvement on AlexNet obtained by tweaking the architecture hyperparameters
GoogLeNet: main contribution was the development of an "Inception module" that dramatically reduced the number of parameters in the network
VGGNet: showed that the depth of the network is a critical component for good performance (140M parameters)
ResNet and ResNeXt: feature special skip connections and a heavy use of batch normalization; the architecture is also missing fully connected layers at the end of the network
SqueezeNet: achieves AlexNet performance levels with 50x fewer parameters, leading to a very small model that is easy to deploy on e.g. smart devices 48
Transfer learning
While data is a critical part of creating the network, the idea of transfer learning has helped to lessen the data demands
Transfer learning is the process of taking a pre-trained model and "fine-tuning" it with your own dataset
The idea is that the pre-trained model acts as a feature extractor: you remove the last layer(s) of the network, replace them with your own classifier, and only retrain those weights while keeping the rest frozen
Or simply keep the network as is but only retrain the last layers
When we think about the lower layers of the network, we know that they will detect features like edges and curves
Rather than training the whole network from a random initialization of the weights, we can reuse the weights of the pre-trained model and focus training on the more important layers, the ones that are higher up (a hedged sketch follows below)
"Clever" example: https://teachablemachine.withgoogle.com/
Recently also heavily applied in the textual domain! 49
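A hedged Keras sketch of this recipe, using VGG16 pre-trained on ImageNet as a frozen feature extractor (num_classes and the head sizes are hypothetical):

from keras.applications import VGG16
from keras.models import Model
from keras.layers import Flatten, Dense

base = VGG16(weights='imagenet', include_top=False,
             input_shape=(224, 224, 3))   # drop the original classifier
for layer in base.layers:
    layer.trainable = False               # keep the pre-trained weights frozen

num_classes = 5                           # hypothetical
x = Flatten()(base.output)
x = Dense(256, activation='relu')(x)
out = Dense(num_classes, activation='softmax')(x)
model = Model(inputs=base.input, outputs=out)
# compile and fit on your own (much smaller) dataset as usual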
Convolutional neural networks (CNNs) and other image tasks Basic CNNs are easy to set up for image classification Taking an input image and outputting a class number out of a set of categories For object localization, the goal is not only to produce a class label but also a bounding box that describes where the object is in the picture RCNN, Fast RCNN, Faster RCNN, MultiBox, Bayesian Optimization, Multi-region, RCNN Minus R, Image Windows For object segmentation, the task is to output a class label as well as an outline of every object in the input image Semantic Seg, Unconstrained Video, Shape Guided, Object Regions, Shape Sharing 50
Convolutional neural networks (CNNs) and other image tasks
A basic CNN setup can also be used to localize objects of interest by preprocessing the data appropriately
At prediction time, the model is queried for each slice over the image 51
Convolutional neural networks (CNNs) and other tasks
One-dimensional CNNs have been used for text and time series analysis as well
Capsule networks (Geoffrey Hinton) try to remove long-standing issues of the traditional CNN architecture
Standard CNNs focus heavily on small texture- and edge-based filters, but have difficulty with pose and overall composition 52
Convolutional neural networks (CNNs) and other tasks
Image classification, segmentation, detection
Face recognition and classification (from proper to "bad" science)
Alibaba launches 'smile to pay' facial recognition system at KFC in China: https://www.cnbc.com/2017/09/04/alibaba-launches-smile-to-pay-facial-recognition-system-at-kfc-china.html
Beijing KFC is pioneering technology to try to predict and remember people's fast food choices: https://www.theguardian.com/technology/2017/jan/11/china-beijing-first-smart-restaurant-kfc-facial-recognition
New AI can guess whether you're gay or straight from a photograph: https://www.theguardian.com/technology/2017/sep/07/new-artificial-intelligence-can-tell-whether-youre-gay-or-straight-from-a-photograph
Facial Recognition Is Accurate, if You're a White Guy: https://www.nytimes.com/2018/02/09/technology/facial-recognition-race-artificial-intelligence.html
Pose and gait detection
Business applications, e.g. in insurance (take a picture of your car to file a damage claim), fraud detection (forged signatures), etc.
Stylistic and artistic use cases, e.g. photo editing and processing 53
Convolutional neural networks (CNNs) and style transfer https://mspoweruser.com/popular-ios-app-prisma-coming-windows-10-month/ 54
Convolutional neural networks (CNNs) and style transfer
Uses a pre-trained network with three input images: the original, the style image, and the combined image
A simple optimizer is used to minimize a custom loss by tweaking the combined image (starting from the original or a random image)
Content loss (difference between original and combination), style loss (difference between style and combination) and variance loss (keep the generated image smooth) 55
Convolutional neural networks (CNNs) and deep dreaming
https://ai.googleblog.com/2015/06/inceptionism-going-deeper-into-neural.html
“One way to visualize what goes on is to turn the network upside down and ask it to enhance an input image in such a way as to elicit a particular interpretation. Say you want to know what sort of image would result in “Banana”. Start with an image full of random noise, then gradually tweak the image towards what the neural net considers a banana. By itself, that doesn’t work very well, but it does if we impose a prior constraint that the image should have similar statistics to natural images, such as neighboring pixels needing to be correlated.” 56
Convolutional neural networks (CNNs) and deep dreaming
Convolutional layer outputs attain higher values when the corresponding pattern has been detected
Therefore, we choose some layers in the network and aim to maximize the intensity of their output
The selection of the layers to maximize depends primarily on whether we want to focus on lower- or higher-level feature representations (or perhaps a combination)
A continuity loss (total variation loss) gives the image local coherence and avoids messy blurs
An L2 norm loss on the resulting image prevents pixels from taking very high values (otherwise, the image overall would become too bright) 57
One-shot learning
Deep neural networks are really good at learning from high-dimensional data like images or spoken language, but only when they have huge amounts of labelled examples to train on
Humans, on the other hand, are capable of one-shot learning
Take a human who's never seen a tomato before and show them a single picture of a tomato: they will probably be able to distinguish tomatoes from other fruits with astoundingly high precision
Trivial to us, but not so much for a computer 58
One-shot learning
1-nearest neighbor (take the nearest known sample based on Euclidean distance)
Very low accuracy, but still better than random
Hierarchical Bayesian Learning (Lake et al.)
Better results, but inputs modified or annotated
Naïve deep neural network approach
Would horribly overfit
Transfer learning
Works better, makes sense
Siamese networks (Koch et al.)
Provide two images and train the network to predict whether they have the same category
At prediction time, the network can be used to compare a new image to each image in the support set and pick the best matching category based on this
We want an architecture that takes two inputs and outputs the probability that they share the same class
Symmetry: p(x1, x2) = p(x2, x1) – which means we cannot just "join" both images together into one large image
Siamese network: shared parameters for identical convnets, joined by a distance function (a hedged sketch follows below)
Also possible: zero-shot learning (no examples for some classes), student-teacher networks (an alternative transfer learning approach) 59
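A hedged Keras sketch of the siamese idea: one shared encoder applied to both inputs (which guarantees the symmetry p(x1, x2) = p(x2, x1)), joined by an absolute-difference distance and a sigmoid "same class?" output; the tiny encoder is an illustrative choice:

from keras.models import Sequential, Model
from keras.layers import Input, Conv2D, MaxPooling2D, Flatten, Dense, Lambda
import keras.backend as K

def make_encoder():
    enc = Sequential()
    enc.add(Conv2D(32, (3, 3), activation='relu', input_shape=(28, 28, 1)))
    enc.add(MaxPooling2D((2, 2)))
    enc.add(Flatten())
    enc.add(Dense(64, activation='relu'))
    return enc

encoder = make_encoder()                         # shared parameters
a, b = Input((28, 28, 1)), Input((28, 28, 1))
dist = Lambda(lambda t: K.abs(t[0] - t[1]))([encoder(a), encoder(b)])
same = Dense(1, activation='sigmoid')(dist)      # probability of same class
model = Model(inputs=[a, b], outputs=same)
model.compile(loss='binary_crossentropy', optimizer='adam')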
http://openaccess.thecvf.com/content_cvpr_2018/CameraReady/2406.pdf 60
Recurrent neural networks (RNNs) The basic idea behind RNNs is to make use of sequential information In a traditional neural network we assume that all inputs (and outputs) are independent of each other However, if you want to predict the next word in a sentence, it makes sense to know the words that came before it RNNs are called recurrent because they perform the same task for every element of a sequence, with the output being dependent on the previous computations RNNs have a “memory” which captures information about what has been calculated so far Similar to human reasoning: humans don’t start their thinking from scratch for every input. Your thoughts (previously seen instances) have persistence 61
Recurrent neural networks (RNNs)
RNNs have shown great success in many NLP tasks
Text classification
Language modeling and generating text
Machine translation
Question answering
Chatbots
Speech recognition
Generating image descriptions (RNN + CNN)
RCNN: object detection 62
Recurrent neural networks (RNNs)
The most well-known variant of the RNN is probably the LSTM (long short-term memory) network
Sometimes, we only need to look at recent information to perform the present task
But there are also cases where we need more context, from further back
Standard RNNs don't remember context that far back
Long Short-Term Memory networks solve this issue by learning long-term dependencies
Introduced by Hochreiter and Schmidhuber
Work very well on a large variety of problems, and are still widely used
Instead of having a single neural network layer per repeating block as in a standard RNN, there are multiple layers, interacting in a special way: an input gate, a "forget gate" and an output gate (a minimal Keras sketch follows below)
http://colah.github.io/posts/2015-08-Understanding-LSTMs/ 63
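A minimal, hedged Keras sketch of an LSTM for binary text classification (the vocabulary size and dimensions are illustrative):

from keras.models import Sequential
from keras.layers import Embedding, LSTM, Dense

model = Sequential()
model.add(Embedding(input_dim=10000, output_dim=32))  # word index -> vector
model.add(LSTM(32))          # reads the sequence, carries a learned "memory"
model.add(Dense(1, activation='sigmoid'))
model.compile(loss='binary_crossentropy', optimizer='adam',
              metrics=['accuracy'])
# model.fit(padded_sequences, labels, ...)  # sequences padded to equal length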
Recurrent neural networks (RNNs) https://www.altumintelligence.com/articles/a/Time-Series-Prediction-Using-LSTM-Deep-Neural-Networks 64
Generative adversarial networks (GANs)
Generative adversarial networks (GANs) are deep neural net architectures composed of two nets, pitting one against the other
The generative network tries to fool the discriminator
The discriminator tries to spot fooling attempts
GANs were introduced in a paper by Ian Goodfellow et al. https://arxiv.org/abs/1406.2661
Yann LeCun called adversarial training "the most interesting idea in the last 10 years in ML" 65
Generative adversarial networks (GANs) Traditional discriminative algorithms try to classify input data; that is, given the features of an instance, they predict the class to which that instance belongs, or P(Y=1|X) Discriminative algorithms map features to labels. They are concerned solely with that correlation One way to think about generative algorithms is that they do the opposite. Instead of predicting a label given certain features, they attempt to predict features given a certain label, or P(X|Y) The question a generative algorithm tries to answer is: assuming this label, how likely are these features? While discriminative models care about the relation between y and x, generative models care about “how you get x” In other words: discriminative models learn the boundary between classes, generative models model the distribution of individual classes 66
Auto-encoders 67
Generative adversarial networks (GANs) https://www.kdnuggets.com/2017/01/generative-adversarial-networks-hot-topic-machine-learning.html 68
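To make the two-network game concrete, a heavily simplified, hedged sketch of the alternating training loop (generator, discriminator and gan, i.e. the generator chained with a discriminator whose weights are frozen inside that combined model, are assumed to be Keras models you have already defined; sizes are illustrative):

import numpy as np

def train_gan(generator, discriminator, gan, real_images,
              latent_dim=100, batch_size=64, steps=1000):
    for _ in range(steps):
        # 1. train the discriminator on half real, half generated images
        noise = np.random.normal(size=(batch_size, latent_dim))
        fake = generator.predict(noise)
        real = real_images[np.random.randint(0, len(real_images), batch_size)]
        discriminator.train_on_batch(real, np.ones((batch_size, 1)))
        discriminator.train_on_batch(fake, np.zeros((batch_size, 1)))
        # 2. train the generator to fool the discriminator: labels say "real"
        noise = np.random.normal(size=(batch_size, latent_dim))
        gan.train_on_batch(noise, np.ones((batch_size, 1)))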
Generative adversarial networks (GANs) See example: https://poloclub.github.io/ganlab/ Text to image generation: https://arxiv.org/abs/1605.05396 Image to image translation: https://arxiv.org/abs/1611.07004 Increasing image resolution: https://arxiv.org/abs/1609.04802 Predicting next video frames: https://arxiv.org/abs/1511.06380 See example: https://affinelayer.com/pixsrv/ 69
Generative adversarial networks (GANs) https://thispersondoesnotexist.com/ https://thisrentaldoesnotexist.com/ https://www.thiswaifudoesnotexist.net/ We're getting better at this: DCGAN, StyleGAN, ... "Deep fakes": http://fortune.com/2019/01/31/what-is-deep-fake-video/, http://fortune.com/2018/09/11/deep-fakes-obama-video/, https://www.theguardian.com/technology/2018/nov/12/deep-fakes-fake-news- truth 70
Generative adversarial networks (GANs)
“In May, a video appeared on the internet of Donald Trump offering advice to the people of Belgium on the issue of climate change. The video was created by a Belgian political party, sp.a, and posted on sp.a’s Twitter and Facebook. It provoked hundreds of comments, many expressing outrage that the American president would dare weigh in on Belgium’s climate policy. But this anger was misdirected. The speech, it was later revealed, was nothing more than a hi-tech forgery. It was a small-scale demonstration of how this technology might be used to threaten our already vulnerable information ecosystem – and perhaps undermine the possibility of a reliable, shared reality. Fake videos can now be created using a machine learning technique called a “generative adversarial network”, or a GAN. The use of this machine learning technique was mostly limited to the AI research community until late 2017, when a Reddit user who went by the moniker “Deepfakes” – a portmanteau of “deep learning” and “fake” – started posting digitally altered pornographic videos. He was building GANs using TensorFlow, Google’s free open source machine learning software, to superimpose celebrities’ faces on the bodies of women in pornographic movies. When Danielle Citron, a professor of law at the University of Maryland, first became aware of the fake porn movies, she was initially struck by how viscerally they violated these women’s right to privacy. But once she started thinking about deep fakes, she realized that if they spread beyond the trolls on Reddit they could be even more dangerous. They could be weaponized in ways that weaken the fabric of democratic society itself. "What would’ve happened if a deep fake emerged of the police chief saying something racist?" In particular, they could foresee deep fakes being exploited by purveyors of “fake news”.” 71
Reinforcement learning
Reinforcement learning allows us to create AI agents that learn from the environment by interacting with it
The agent learns by trial and error
The environment exposes a state to the agent, with a number of possible actions the agent can perform
After each action, the agent receives feedback
The feedback consists of the reward and the next state of the environment
See: http://projects.rajivshah.com/rldemo/ 72
Q-learning
Given one run of the agent through the environment (one episode), we can easily calculate the total reward for that episode: R = r_1 + r_2 + ... + r_n
The total future reward from time point t onward can be expressed as: R_t = r_t + r_{t+1} + r_{t+2} + ... + r_n
Because the environment is stochastic, it is common to use the discounted future reward instead: R_t = r_t + γ r_{t+1} + γ² r_{t+2} + ... + γ^{n−t} r_n
This can be written recursively: R_t = r_t + γ (r_{t+1} + γ (r_{t+2} + ...)) = r_t + γ R_{t+1} (a small numeric check follows below)
If we set the discount factor γ = 0, our strategy will be short-sighted and we rely only on the immediate rewards
Balance between immediate and future rewards with e.g. γ = 0.9
If our environment is fully deterministic and the same actions always result in the same rewards: γ = 1
A good strategy for an agent would be to always choose the action that maximizes the (discounted) future reward 73
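A small numpy check of the recursion R_t = r_t + γ R_{t+1} (the reward sequence is made up):

import numpy as np

def discounted_returns(rewards, gamma=0.9):
    # walk backwards so that R_t can reuse the already-computed R_{t+1}
    R = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        R[t] = running
    return R

print(discounted_returns([0.0, 0.0, 1.0]))  # [0.81 0.9  1.  ]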
Q-learning
Define a function Q(s, a) representing the maximum discounted future reward when performing action a in state s and continuing optimally from that point onwards: Q(s_t, a_t) = max R_{t+1}
The best possible score at the end of the game after performing action a in state s
The quality of a certain action in a certain state 74
Q-learning
But: how can we estimate the score at the end of the game? We know just the current state and action, and not the actions and rewards coming after that
We can't: Q is just a theoretical construct
If we could find an estimate for Q, we could determine a policy as follows: just pick the action with the highest Q-value in a certain state: π(s) = argmax_a Q(s, a)
Here π represents the policy, the rule for how we choose an action in each state 75
Q-learning
Say we have one transition: <s, a, r, s'>
Just like with discounted future rewards, we can express the Q-value of state s and action a in terms of the Q-value of the next state s': Q(s, a) = r + γ × Q(s', π(s'))
This is called the Bellman equation
The main idea in Q-learning is that we can iteratively approximate the Q-function using the Bellman equation
In the simplest case the Q-function is implemented as a table, with states as rows and actions as columns 76
Q-learning

initialize Q[num_states, num_actions] arbitrarily
observe initial state s
repeat
    select action a according to policy
    execute action a, observe reward r and new state s'
    Q[s,a] = Q[s,a] + α * (r + γ * max_a' Q[s',a'] - Q[s,a])
    move to new state s'
until termination

α is a learning rate that controls how much of the difference between the previous Q-value and the newly proposed Q-value is taken into account; in particular, when α = 1, the update is exactly the Bellman equation
The estimate max_a' Q[s',a'] that we use to update Q[s,a] is only an approximation, and in the early stages of learning it may be completely wrong
However, the approximation gets more and more accurate with every iteration, and it has been shown that if we perform this update enough times, the Q-function will converge and represent the true Q-value 77
Q-learning: example http://mnemstudio.org/path-finding-q-learning-tutorial.htm 78
Q-learning: example
Reward matrix R
Indicates the possible actions from a certain state
And the reward per action
Q matrix
The "brain" of our agent
Initially all values are 0 79
Q-learning: example

for each episode, do
    start from a random initial state as the current state
    while the current state is not the goal state, do
        select a random action from the possible ones
        see to which new state that action leads
        get the max Q value in that new state
        update Q[state][action] = Q[state][action] + alpha * (R[state][action] + gamma * max_Q - Q[state][action])
        set the new state as the current one
    end while
end for

80
Q-learning: example
Start from a random state (let's say 3) and see which actions are possible
In state 3, we can do actions: go to 1, go to 2, go to 4
Pick a random action to explore, e.g. go to 4
This would bring us to a new state (4)
Check the actions which are possible there and determine the max Q value
0, 3 and 5 are possible: max(Q[4][0], Q[4][3], Q[4][5]) = 0 81
Q-learning: example
We are in state 3, are exploring action 4, and have determined max_Q
Q[state][action] = Q[state][action] + alpha * (R[state][action] + gamma * max_Q - Q[state][action])
If alpha = 1, the formula is easy: Q[3][4] = R[3][4] + 0.8 * 0 = 0 (gamma = 0.8)
We now move to state 4 and continue 82
Q-learning: example
We are now in state 4, and can go to 0, 3 or 5 from here; let's randomly pick 5 to explore
The immediate reward R[4][5] = 100
Determine max_Q in state 5: max(Q[5][1], Q[5][4], Q[5][5]) = 0
Q[state][action] = Q[state][action] + alpha * (R[state][action] + gamma * max_Q - Q[state][action])
If alpha = 1, the formula is easy: Q[4][5] = R[4][5] + 0.8 * 0 = 100
We now move to state 5. This is the end goal, so we start a new episode 83
Q-learning: example
The approximation gets more and more accurate with every iteration, and it has been shown that if we perform this update enough times, the Q-function will converge and represent the true Q-value (a runnable version of this example follows below)

[[0, 0, 0, 0, 80, 0],
 [0, 0, 0, 64, 0, 100],
 [0, 0, 0, 64, 0, 0],
 [0, 80, 51, 0, 80, 0],
 [64, 0, 0, 64, 0, 100],
 [0, 0, 0, 0, 0, 0]] 84
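A runnable numpy version of this example (alpha = 1, gamma = 0.8). The reward matrix encodes the graph from the linked tutorial as I reconstruct it here, so treat it as an assumption: -1 marks impossible actions, 100 the actions reaching the goal state 5:

import numpy as np

R = np.array([[-1, -1, -1, -1,  0,  -1],
              [-1, -1, -1,  0, -1, 100],
              [-1, -1, -1,  0, -1,  -1],
              [-1,  0,  0, -1,  0,  -1],
              [ 0, -1, -1,  0, -1, 100],
              [-1,  0, -1, -1,  0, 100]])
Q = np.zeros((6, 6))
gamma, goal = 0.8, 5
rng = np.random.default_rng(0)

for _ in range(1000):                    # episodes
    state = rng.integers(6)              # random initial state
    while state != goal:
        # pure random exploration over the possible actions
        action = rng.choice(np.where(R[state] >= 0)[0])
        max_q = Q[action].max()          # best Q-value in the new state
        Q[state, action] = R[state, action] + gamma * max_q   # alpha = 1
        state = action

print(Q.round())  # converges to the Q matrix shown on this slide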
Q-learning
“If we apply the same preprocessing to game screens as in the DeepMind paper – take the four last screen images, resize them to 84×84 and convert to grayscale with 256 gray levels – we would have 256^(84×84×4) ≈ 10^67970 possible game states”
This means 10^67970 rows in our Q-table, more than the number of atoms in the known universe
One could argue that many states never occur, so we could possibly represent it as a sparse table containing only visited states
Even so, most of the states are very rarely visited and it would take a lifetime of the universe for the Q-table to converge
Ideally, we would also like to have a good guess for Q-values for states we have never seen before 85
Deep Q-learning
This is the point where deep learning steps in
We could represent our Q-function with a neural network that takes the state and action as input and outputs the corresponding Q-value
"According to the network, which action leads to the highest payoff in a given state?" 86
Deep Q-learning
Estimate the future reward in each state using Q-learning and approximate the Q-function using a convolutional neural network
It turns out that approximating Q-values using non-linear functions is not very stable
Not easy to converge, and it takes a long time: almost a week on a single GPU
Hence, experience replay is applied: during gameplay, all experiences <s, a, r, s'> are stored in a replay memory
When training the network, random mini-batches from the replay memory are used instead of the most recent transition (a sketch follows below)
This breaks the similarity of subsequent training samples, which otherwise might drive the network into a local minimum
It also helps avoid the network overly adjusting its weights to the most recent state, which might affect the action output for other states 87
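A hedged sketch of such a replay memory (the capacity and batch size are illustrative):

import random
from collections import deque

class ReplayMemory:
    def __init__(self, capacity=10000):
        self.buffer = deque(maxlen=capacity)  # oldest experiences fall out

    def store(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size=32):
        # random mini-batch: breaks the correlation between consecutive steps
        return random.sample(self.buffer, batch_size)

The training step then fits the Q-network on a sampled batch of decorrelated transitions instead of only the most recent one.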
Deep Q-learning
Q-learning attempts to solve the credit assignment problem: it propagates rewards back in time until it reaches the crucial decision point which was the actual cause of the obtained reward
When a Q-table or Q-network is initialized randomly, its predictions are initially random as well; if we pick the action with the highest Q-value, the action will be random and the agent performs crude "exploration"
As the Q-function converges, it returns more consistent Q-values and the amount of exploration decreases
But this exploration is "greedy": it settles on the first effective strategy it finds. We need a tradeoff between exploration and exploitation
A simple and effective fix for this problem is ε-greedy exploration: with probability ε choose a random action, otherwise go with the "greedy" action with the highest Q-value (a sketch follows below)
In their system, DeepMind actually decreases ε over time from 1 to 0.1: in the beginning the system makes completely random moves to explore the state space maximally, and then it settles down to a fixed exploration rate 88
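A hedged sketch of ε-greedy action selection with a DeepMind-style linear anneal from 1.0 to 0.1 (the schedule length is an illustrative choice):

import random

def epsilon_greedy(q_values, step, total_steps=1000000):
    # q_values: list of estimated Q-values for the current state
    eps = max(0.1, 1.0 - 0.9 * step / total_steps)  # decay 1.0 -> 0.1
    if random.random() < eps:
        return random.randrange(len(q_values))      # explore
    return q_values.index(max(q_values))            # exploit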
Reinforcement learning Deep Q Learning (DQN) https://arxiv.org/abs/1312.5602 Used to play simple Atari games together with a CNN Double DQN https://arxiv.org/abs/1509.06461 The Q-learning algorithm is known to overestimate action values under certain conditions Deep Deterministic Policy Gradient (DDPG) https://arxiv.org/abs/1509.02971 Asynchronous Advantage Actor-Critic (A3C) https://arxiv.org/abs/1602.01783 Continuous DQN (CDQN or NAF) https://arxiv.org/abs/1603.00748 Cross-Entropy Method (CEM) http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.81.6579&rep=rep1&type=pdf Dueling network DQN (Dueling DQN) https://arxiv.org/abs/1511.06581 Deep SARSA http://ieeexplore.ieee.org/document/7849837/ 89
Reinforcement learning https://www.alexirpan.com/2018/02/14/rl-hard.html 90
Conclusions CNNs: “using filters to construct topologically relevant abstractions” RNNs: “order matters: dealing with sequences, temporal aspect” GANs: “if I can generate something, I understand it” RL: “learning how to behave optimally in an environment” Artificial neural networks are back Powerful But: require a lot of tuning, configuration, risk of overfitting, and require huge amounts of samples for non-explored problems So many architectures What are best practices? Black box! Probability versus uncertainty: https://alexgkendall.com/computer_vision/bayesian_deep_learning_for_safe_ai/ The last 10%: deep learning gets you quickly to 90% of good results, but the last 10% is still very hard to reach CNN and RNN pretty “stable”, RL and GANs still face a lot of open questions 91
Conclusions
Tooling support has definitely improved: PyTorch and Torch
Torch is a computational framework with an API written in Lua that supports machine-learning algorithms
Its powerful functionality saw interest from e.g. Facebook and Twitter
Though the use of Lua was a bit of a drawback for wide adoption
A Python version of Torch, known as PyTorch, was open-sourced by Facebook in January 2017
PyTorch offers dynamic computation graphs, which let you process variable-length inputs and outputs instead of being limited to a fixed neural net architecture: a very powerful concept!
PyTorch has quickly become a favorite among researchers, because it allows complex architectures to be built easily
Adoption in industry is slowly growing
Caffe and Caffe2
Caffe2 is the long-awaited successor to the original Caffe
Its creator Yangqing Jia now works at Facebook
The main difference with Torch is that Caffe2 is somewhat more light-weight
Though not much in use these days 92
Conclusions
Tooling support has definitely improved: TensorFlow and Theano
Theano is the grand-daddy of deep-learning frameworks
Written in Python, with a focus on fast handling of multidimensional arrays
GPU support not perfect, speed not perfect, but solid for experimentation, learning and research
Yoshua Bengio announced in September 2017 that development on Theano would cease, so it is no longer a viable choice
Google created TensorFlow to replace Theano
Some of the contributors to Theano, such as Ian Goodfellow, went on to create TensorFlow at Google before leaving for OpenAI
TensorFlow is written with a Python API over a C/C++ engine
Java API as well
In October 2017, Google introduced Eager, a dynamic computation graph module for TensorFlow, to compete with PyTorch
Very popular, with strong industry adoption 93
Conclusions
Tooling support has definitely improved: Keras
Keras is a deep-learning library that sits on top of Theano, TensorFlow, or CNTK
Provides a high-level, easy API, inspired by Torch, on top of these engines
Created by François Chollet, a software engineer at Google
Chosen as an official high-level TensorFlow API by Google
On its way to becoming a standard wrapper around different "engines"
For newcomers: an easy way to get started!
Relatively easy to install
Lots of tutorials, code, … available
High-level, pre-made layers and neurons
Less preferred by expert-level coders 94
Conclusions https://towardsdatascience.com/deep-learning-framework-power-scores-2018-23607ddf297a 95
Conclusions
“Lessons from Optics, The Other Deep Learning” http://www.argmin.net/2018/01/25/optics/
“There’s a mass influx of newcomers to our field and we’re equipping them with little more than folklore and pre-trained deep nets, then asking them to innovate. We can barely agree on the phenomena that we should be explaining away. I think we’re far from teaching this stuff in high schools.”
“It would be nice if we could provide mental models, at various layers of abstraction, of the action of the layers of a deep net. What could be our equivalent of refraction, dispersion, and diffraction? Maybe you already think in terms of these actions, but we just haven’t standardized our language around these concepts?”
“Don’t believe the short-term hype, but do believe in the long-term vision. It may take a while for AI to be deployed to its true potential—a potential the full extent of which no one has yet dared to dream—but AI is coming, and it will transform our world in a fantastic way.” – Francois Chollet 96
Conclusions 97
Conclusions

                        Traditional algorithms                      Deep learning
Accuracy                Fair to good (on structured data)           Good to excellent
Training time           Short (seconds) to medium (hours)           Medium to (very) long (weeks)
Data requirements       Limited (a couple of hundred rows of        High (many thousands of e.g. images, though
                        "small" data)                               "transfer learning" possible in some cases)
Feature engineering     Manual (trend features, windowing,          Automatic, done "by the model"
                        aggregations, domain-specific approaches)
Hyperparameters         Few to some (depending on the algorithm)    Many (architecture, number of hidden layers,
                                                                    activation functions, optimizer, …)
Interpretability        High (white-box models) to reasonable       Low (black-box model, though some
                                                                    explanations can be extracted)
Cost and operational    Low to reasonable                           Reasonable to high (GPU, cloud, parallel
efficiency                                                          computational requirements)

98
Opening the black box (part 2) 99
Adversarial learning: a motivating example NIPS 2017 100