  1. Deep Learning Barun Patra

  2. Index
  ● Introduction to Neural Nets
  ● Activations
    ○ Sigmoid
    ○ Tanh
    ○ ReLU (Derivatives)
  ● Gradients
  ● Initialization
  ● Regularization
    ○ Dropout
    ○ Batch Norm
  ● Convolutional Networks
    ○ Inspiration
    ○ Kernels
    ○ Idea
    ○ As used in NLP
  ● Paper Discussion

  3. Introduction
  ● Image from Stanford’s CS231n supplementary notes

  4. Representational Power
  ● A single hidden layer NN can approximate any function
    ○ http://neuralnetworksanddeeplearning.com/chap4.html
  ● So why do we use deep neural networks?
    ○ Sometimes more intuitive (e.g. for images)
    ○ Works well in practice

  5. Commonly Used Activations: Sigmoid
  ● Historically used
  ● Has a nice interpretation as a neuron firing
  ● Tendency to saturate and kill the gradient
    ○ In regions where the neuron's output is near 1 or 0, the gradient is ~0

  6. Commonly Used Activations: Tanh
  ● Still saturates, killing the gradient
  ● But unlike sigmoid, the gradient is not 0 where the output is 0 (tanh is zero-centered)

  7. Commonly Used Activations: ReLU
  ● Does not saturate (for positive inputs)
  ● Faster to compute
  ● Can cause parts of the network to die (units stuck at 0 with zero gradient)
  ● Converges faster in practice

  8. Commonly Used Activations: Leaky ReLU & Maxout
  ● Leaky ReLU: a small non-zero slope for negative inputs
  ● Maxout generalizes Leaky ReLU
    ○ Doubles the number of parameters
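  A minimal NumPy sketch of these activations and their derivatives, to make the saturation and dead-gradient points concrete (not from the slides; the leaky slope alpha=0.01 is an assumed default):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def d_sigmoid(x):
        s = sigmoid(x)
        return s * (1.0 - s)            # ~0 wherever the output saturates near 0 or 1

    def d_tanh(x):
        return 1.0 - np.tanh(x) ** 2    # equals 1 at x = 0, so gradient != 0 where output is 0

    def relu(x):
        return np.maximum(0.0, x)

    def d_relu(x):
        return (x > 0).astype(float)    # exactly 0 for x < 0: units can "die" here

    def leaky_relu(x, alpha=0.01):      # alpha: assumed small negative-side slope
        return np.where(x > 0, x, alpha * x)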

  9. Backpropagation and Gradient Computation
  ● Let z^(i) be the output of the i-th layer, and s^(i) be its input
  ● Let f be the activation being applied
  ● Let w^(i)_jk be the weight connecting the j-th and the k-th unit in the i-th layer
  ● We have:

  10. Backpropagation and Gradient Computation

  11. Backpropagation and Gradient Computation
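  The equations on slides 9-11 were images; a standard reconstruction in the notation above (assuming w^(i)_jk connects unit j of layer i-1 to unit k of layer i, and E is the loss) is:

    s^{(i)}_k = \sum_j w^{(i)}_{jk}\, z^{(i-1)}_j, \qquad z^{(i)}_k = f\big(s^{(i)}_k\big)

    \delta^{(L)}_k = \frac{\partial E}{\partial z^{(L)}_k}\, f'\big(s^{(L)}_k\big) \qquad \text{(output layer } L\text{)}

    \delta^{(i)}_j = f'\big(s^{(i)}_j\big) \sum_k w^{(i+1)}_{jk}\, \delta^{(i+1)}_k \qquad \text{(hidden layers)}

    \frac{\partial E}{\partial w^{(i)}_{jk}} = z^{(i-1)}_j\, \delta^{(i)}_k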

  12. Backpropagation and Activation
  ● Why does the sigmoid learn slowly?
  ● Figure taken from “Understanding the difficulty of training deep feedforward neural networks”, Glorot and Bengio

  13. Babysitting your gradient
  ● For a few examples (4-5 in a batch), compute the numerical gradient
  ● Compare the gradient from backprop with the numerical gradient
    ○ Use relative error instead of absolute error
  ● Rule of thumb:
    ○ relative error > 1e-2 usually means the gradient is probably wrong
    ○ 1e-2 > relative error > 1e-4 should make you feel uncomfortable
    ○ 1e-4 > relative error is usually okay for objectives with kinks; if there are no kinks (e.g. tanh nonlinearities and softmax), then 1e-4 is too high
    ○ 1e-7 and less: you should be happy
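  A sketch of such a check, assuming a function loss_fn(W) that returns the scalar loss for weights W (a hypothetical name) and an analytic gradient already computed by backprop:

    import numpy as np

    def numerical_gradient(loss_fn, W, h=1e-5):
        # Central-difference estimate of dLoss/dW, one coordinate at a time.
        grad = np.zeros_like(W)
        it = np.nditer(W, flags=['multi_index'])
        while not it.finished:
            idx = it.multi_index
            old = W[idx]
            W[idx] = old + h; loss_plus = loss_fn(W)
            W[idx] = old - h; loss_minus = loss_fn(W)
            W[idx] = old                                  # restore the weight
            grad[idx] = (loss_plus - loss_minus) / (2.0 * h)
            it.iternext()
        return grad

    def max_relative_error(g_analytic, g_numeric, eps=1e-12):
        num = np.abs(g_analytic - g_numeric)
        den = np.maximum(np.abs(g_analytic) + np.abs(g_numeric), eps)
        return np.max(num / den)                          # compare against the thresholds above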

  14. Initialization: Glorot Uniform / Xavier
  ● Do not start with all 0's (nothing to break symmetry within a layer)
  ● Instead, sample from a suitably scaled uniform / Gaussian distribution (derivation on the next slides)

  15. Initialization: Anything Goes?
  ● Consider a network with linear neurons
  ● Let z^(i) be the output of the i-th layer, and s^(i) be its input
  ● Let the input x have mean 0 and the same variance Var[x] in each component
  ● Let all weights be i.i.d. Then:

  16. Initialization: Anything Goes?
  ● Similarly:

  17. Initialization: Anything Goes?
  ● For information to flow, we want:
  ● And hence:
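  The derivation on slides 15-17 was shown as images; under the stated assumptions (linear units, zero-mean i.i.d. weights, n_i units in layer i), the standard Glorot-Bengio argument is:

    \mathrm{Var}\big[z^{(i)}\big] = \mathrm{Var}[x]\,\prod_{i'=1}^{i} n_{i'-1}\,\mathrm{Var}\big[W^{(i')}\big]
    \qquad \text{(an analogous product governs the back-propagated gradient variances)}

    \text{For information to flow we want } n_{i-1}\,\mathrm{Var}\big[W^{(i)}\big] = 1 \text{ and } n_{i}\,\mathrm{Var}\big[W^{(i)}\big] = 1,

    \text{and hence the compromise } \mathrm{Var}\big[W^{(i)}\big] = \frac{2}{n_{i-1}+n_i}
    \;\Rightarrow\; W^{(i)} \sim \mathcal{U}\!\left[-\frac{\sqrt{6}}{\sqrt{n_{i-1}+n_i}},\,\frac{\sqrt{6}}{\sqrt{n_{i-1}+n_i}}\right]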

  18. Initialization: Sigmoid and ReLU
  ● The linear assumption is good enough for tanh
  ● For sigmoid and ReLU, small modifications are needed
  ● The modification for ReLU + some other stuff:
    ○ By He, Zhang, Ren and Sun: https://arxiv.org/pdf/1502.01852.pdf
    ○ Surpassed human-level performance on ImageNet classification
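  A small NumPy sketch of both schemes (function names and the fan_in/fan_out arguments are illustrative):

    import numpy as np

    def glorot_uniform(fan_in, fan_out):
        # Xavier / Glorot: Var[W] = 2 / (fan_in + fan_out)
        limit = np.sqrt(6.0 / (fan_in + fan_out))
        return np.random.uniform(-limit, limit, size=(fan_in, fan_out))

    def he_normal(fan_in, fan_out):
        # He et al. (2015), derived for ReLU units: Var[W] = 2 / fan_in
        std = np.sqrt(2.0 / fan_in)
        return np.random.normal(0.0, std, size=(fan_in, fan_out))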

  19. Regularization (Motivation)
  ● Neural nets have a strong tendency to overfit

  20. Regularization (Motivation)
  ● Effect of L2 regularization

  21. Regularization (Methodology)
  ● L1 regularization
  ● L2 regularization (sketched below)
  ● Introducing noise
  ● Max norm constraint on the weights
  ● Early stopping using a validation set
  ● Dropout
  ● Batch Normalization
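  To make the first two items concrete, a sketch of how an L2 penalty enters the loss and the gradient (the strength lam is an assumed hyperparameter; L1 would use absolute values and their signs instead):

    import numpy as np

    def l2_regularized_loss(data_loss, weights, lam=1e-4):
        # Total loss = data loss + (lam / 2) * sum of squared weights
        penalty = 0.5 * lam * sum(np.sum(W ** 2) for W in weights)
        return data_loss + penalty

    def l2_regularized_grads(data_grads, weights, lam=1e-4):
        # Each weight gradient picks up an extra lam * W term ("weight decay")
        return [g + lam * W for g, W in zip(data_grads, weights)]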

  22. Dropout
  ● Each hidden unit in a network trained with dropout must learn to work with a randomly chosen sample of other units
  ● This should make each hidden unit more robust and drive it towards creating useful features on its own, without relying on other hidden units to correct its mistakes
  ● http://www.jmlr.org/papers/volume15/srivastava14a.old/source/srivastava14a.pdf
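  A sketch of the common "inverted dropout" variant (scaling by 1/p_keep at train time so nothing changes at test time; the paper above instead rescales the weights at test time):

    import numpy as np

    def dropout(h, p_keep=0.5, train=True):
        # h: activations of a hidden layer; p_keep: probability of keeping a unit (assumed hyperparameter).
        if not train:
            return h                                        # identity at test time
        mask = (np.random.rand(*h.shape) < p_keep) / p_keep
        return h * mask                                     # dropped units output 0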

  23. Batch Normalization
  ● Normalizing the input helps training
  ● What if we could normalize the input to every layer of the network?
  ● For each layer with a d-dimensional input (x1 ... xd), we want each dimension to have zero mean and unit variance over the mini-batch
  ● But normalizing like this may change what the layer can represent
  ● To overcome that, the transformation inserted in the network must be able to represent the identity transform
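  A training-time sketch of the transform described above (per-dimension normalization over the mini-batch, then a learned scale gamma and shift beta so the layer can still represent the identity; the running statistics used at inference are omitted):

    import numpy as np

    def batchnorm_forward(x, gamma, beta, eps=1e-5):
        # x: (batch, d) input to a layer; gamma, beta: learned (d,) parameters.
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        x_hat = (x - mu) / np.sqrt(var + eps)   # zero mean, unit variance per dimension
        return gamma * x_hat + beta             # gamma = sqrt(var), beta = mu recovers the identity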

  24. Batch Normalization
  ● Leads to faster training
  ● Less dependence on initialization

  25. Some practical advice
  ● Gradient check on small data
  ● Overfit without regularization on small data
  ● Decay the learning rate with time
  ● Regularize
  ● Always check the learning curves

  26. Introduction to Convolutional Networks
  ● What are these convolutions and kernels?
  ● https://docs.gimp.org/en/plug-in-convmatrix.html

  27. Introduction to Convolutional Networks
  ● Animation at http://cs231n.github.io/convolutional-networks/
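  A naive sketch of the 2-D convolution those links illustrate (no padding or stride, and no kernel flip, i.e. the cross-correlation convention used in deep learning libraries):

    import numpy as np

    def conv2d(image, kernel):
        # Slide the kernel over the image and take a dot product at each position.
        kh, kw = kernel.shape
        out_h = image.shape[0] - kh + 1
        out_w = image.shape[1] - kw + 1
        out = np.zeros((out_h, out_w))
        for i in range(out_h):
            for j in range(out_w):
                out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
        return out

    # e.g. a 3x3 edge-detection kernel of the kind shown on the GIMP page above
    edge_kernel = np.array([[0.,  1., 0.],
                            [1., -4., 1.],
                            [0.,  1., 0.]])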

  28. Kind of features learnt

  29. Convolutional Networks in NLP
  ● Give a good generalization of unigram, bigram, etc. features in embedding space
  ● With more layers, the receptive field increases
  ● Figure taken from Hierarchical Question Answering using Co Attention

  30. Relation Extraction with Conv Networks:

  31. Issue 1 with the Previous Task (Mintz et al., 2009)
  ● Assumption: every sentence containing the two entities expresses the relation
  ● Issue: this assumption is too strong
  ● Solution: use a Multi-Instance Multi-Label model
  ● Figure taken from (Zeng et al., 2015)

  32. Issue 2 with the Previous Task
  ● Used hand-crafted features + other NLP tools like dependency parsers
    ○ These have poor performance as sentence length increases
    ○ Long sentences form nearly 50% of the corpus being used to extract the relations
  ● Solution: use deep learning!
    ○ Enter convolutional networks

  33. The Model (Overview)
  ● Figure taken from (Zeng et al., 2015)

  34. The Model (Embedding)
  ● Train word2vec (skip-gram model) [Why not CBOW?]
  ● Use positional embeddings (distances from the two entities):
    ○ Capture the notion of a word's distance from the entities
    ○ The same word, at different locations in the sentence, might have different semantics
    ○ A proxy for LSTM embeddings
  ● Final dimension for one word: R^(embed_dim + 2*embed_position)
  ● Final dimension of the embedding layer: R^(|Sentence| * (embed_dim + 2*embed_position))

  35. The Model (Convolution)
  ● Convolution with kernel width W
  ● W ∈ R^(W * n_dim_vector)
  ● N filters (hyperparameter)
  ● Zero padding to ensure every word gets convolved
  ● Final layer dimension: R^(N * (|S| + W - 1))

  36. The Model (Pooling + Softmax)
  ● Pooling is done in a piecewise manner
    ○ Idea: pool separately over the three parts of the sentence delimited by the two entities
    ○ Remember ReVerb?
  ● Less coarse than a single max pool over the whole sentence
  ● The final dimension is R^(num_filters * 3)
  ● Tanh
  ● Softmax to get a probability over all relations
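  A sketch of the piecewise max-pooling step, assuming the three segments are delimited by the two entity positions e1 < e2 in the convolved sequence (names and indexing are illustrative, non-empty segments assumed); the output matches the R^(num_filters * 3) dimension above:

    import numpy as np

    def piecewise_max_pool(conv_out, e1, e2):
        # conv_out: (num_filters, length) feature map from the convolution layer.
        # Max-pool each of the three segments (before / between / after the entities).
        segments = [conv_out[:, :e1 + 1],
                    conv_out[:, e1 + 1:e2 + 1],
                    conv_out[:, e2 + 1:]]
        pooled = [seg.max(axis=1) for seg in segments]   # each (num_filters,)
        return np.tanh(np.stack(pooled, axis=1))         # (num_filters, 3), tanh applied as above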

  37. The Data
  ● A bag contains all sentences containing a given pair of entities
  ● A bag is labeled r if there is at least one sentence which expresses r
  ● Potentially multiple copies of the same bag with different labels [unclear]

  38. The Objective Function and Training
  ● Trained with mini-batches of bags, using AdaDelta
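  The objective itself was an image on the slide; in Zeng et al. (2015) the multi-instance objective over T bags, where m_i^j is the j-th sentence of bag i and y_i its label, is roughly:

    J(\theta) = \sum_{i=1}^{T} \log p\big(y_i \mid m_i^{\,j^{*}};\,\theta\big),
    \qquad j^{*} = \arg\max_{j}\; p\big(y_i \mid m_i^{\,j};\,\theta\big)

  In other words, each update uses only the most confident sentence in each bag, which connects to the "a lot of training examples not being used" criticism on slide 44.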

  39. Inference
  ● Given a bag and a relation r
  ● The bag is marked as containing r if there exists at least one sentence in the bag with predicted relation r

  40. Experiment Setup
  ● Dataset: Freebase relations aligned with the NYT corpus
  ● Training: sentences from 2005-06
  ● Testing: sentences from 2007
  ● Held-out evaluation: extracted relations compared against Freebase
  ● Manual evaluation: evaluation by humans
  ● word2vec trained on the NYT corpus; entity tokens concatenated with ##
  ● Grid search over hyperparameters

  41. Results (Held-Out Evaluation)
  ● Half of the Freebase relations used for testing [Doubt]
  ● Relations extracted from test articles compared against the Freebase relations
  ● Results compared against Mintz, MultiR and Multi-Instance Multi-Label (MIML) learning

  42. Results (Manual Evaluation)
  ● Chose entity pairs where at least one entity was not present in Freebase as candidates (to avoid overlap with the held-out set)
  ● Top N relations extracted, and precision computed
  ● Since not all true relations are known, recall is not reported (pseudo-recall?)

  43. Results (Ablation Study):

  44. Problems
  ● Analysis of where PCNN improves over MultiR/MIML is lacking [Surag]
  ● No coreference resolution [Rishab]
  ● No alternatives to the 3-segment piecewise convolution are explored [Haroun]
  ● Suffers from the incompleteness of Freebase [Daraksha]
  ● Does not consider overlapping relations [Daraksha]
  ● A lot of training examples are not being used [Shantanu]
  ● No comparison with other architectures [Akg]
