Introduction to Deep Learning


  1. NPFL114, Lecture 1: Introduction to Deep Learning
Milan Straka, February 24, 2020
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

  2. Deep Learning Highlights
- Image recognition
- Object detection
- Image segmentation, human pose estimation
- Image labeling
- Visual question answering
- Speech recognition and generation
- Lip reading
- Machine translation
- Machine translation without parallel data
- Chess, Go and Shogi
- Multiplayer Capture the Flag

  3. Notation
- $a$, $\boldsymbol a$, $\boldsymbol A$, $\mathsf A$: scalar (integer or real), vector, matrix, tensor
- $\mathrm a$, $\mathbf a$, $\mathbf A$: scalar, vector, matrix random variable
- $\frac{\mathrm d f}{\mathrm d x}$: derivative of $f$ with respect to $x$
- $\frac{\partial f}{\partial x}$: partial derivative of $f$ with respect to $x$
- $\nabla_{\boldsymbol x} f(\boldsymbol x) = \left(\frac{\partial f(\boldsymbol x)}{\partial x_1}, \frac{\partial f(\boldsymbol x)}{\partial x_2}, \ldots, \frac{\partial f(\boldsymbol x)}{\partial x_n}\right)$: gradient of $f$ with respect to $\boldsymbol x$
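
As an illustration of the gradient notation (an addition, not part of the original slides), here is a minimal NumPy sketch comparing the analytic gradient of $f(\boldsymbol x) = \sum_i x_i^2$, which is $2\boldsymbol x$, with a finite-difference approximation:

    import numpy as np

    def f(x):
        # f(x) = sum_i x_i^2, so the analytic gradient is 2x
        return np.sum(x ** 2)

    def numeric_gradient(f, x, eps=1e-6):
        # Central differences: (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps) for each coordinate
        grad = np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x)
            e[i] = eps
            grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
        return grad

    x = np.array([1.0, -2.0, 3.0])
    print(numeric_gradient(f, x))  # approximately [ 2. -4.  6.]
    print(2 * x)                   # analytic gradient 2x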

  4. Random Variables
A random variable $\mathrm x$ is a result of a random process. It can be discrete or continuous.

Probability Distribution
A probability distribution describes how likely the individual values a random variable can take are.
- The notation $\mathrm x \sim P$ stands for a random variable $\mathrm x$ having a distribution $P$.
- For discrete variables, the probability that $\mathrm x$ takes a value $x$ is denoted as $P(x)$, or explicitly as $P(\mathrm x = x)$.
- For continuous variables, the probability that the value of $\mathrm x$ lies in the interval $[a, b]$ is given by $\int_a^b p(x)\,\mathrm d x$.
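
A numerical illustration of $\int_a^b p(x)\,\mathrm d x$ (added here; the standard normal density is only an example choice of $p$): a midpoint Riemann sum is compared with a Monte Carlo estimate.

    import numpy as np

    def p(x):
        # Standard normal density, used only as an example of a continuous p(x)
        return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

    a, b = -1.0, 1.0

    # Midpoint Riemann sum approximating the integral of p(x) over [a, b]
    dx = (b - a) / 10000
    xs = a + dx * (np.arange(10000) + 0.5)
    prob_integral = np.sum(p(xs)) * dx

    # Monte Carlo estimate: fraction of samples falling into [a, b]
    samples = np.random.default_rng(42).normal(size=1_000_000)
    prob_monte_carlo = np.mean((samples >= a) & (samples <= b))

    print(prob_integral, prob_monte_carlo)  # both close to 0.6827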

  5. Random Variables
Expectation
The expectation of a function $f(x)$ with respect to a discrete probability distribution $P(x)$ is defined as
$\mathbb E_{\mathrm x \sim P}[f(x)] \stackrel{\mathrm{def}}{=} \sum_x P(x) f(x).$
For continuous variables it is computed as
$\mathbb E_{\mathrm x \sim p}[f(x)] \stackrel{\mathrm{def}}{=} \int p(x) f(x)\,\mathrm d x.$
If the random variable is obvious from context, we can write only $\mathbb E_P[x]$ or even $\mathbb E[x]$.
Expectation is linear, i.e.,
$\mathbb E_{\mathrm x}[\alpha f(x) + \beta g(x)] = \alpha \mathbb E_{\mathrm x}[f(x)] + \beta \mathbb E_{\mathrm x}[g(x)].$
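
A small check of the discrete expectation formula and of linearity (an illustrative addition; the distribution and the functions $f$, $g$ are chosen arbitrarily):

    import numpy as np

    # A discrete distribution P over the values 0, 1, 2
    values = np.array([0.0, 1.0, 2.0])
    P = np.array([0.2, 0.5, 0.3])

    def E(f):
        # E_{x~P}[f(x)] = sum_x P(x) f(x)
        return np.sum(P * f(values))

    f = lambda x: x ** 2
    g = lambda x: 3 * x + 1
    alpha, beta = 2.0, -0.5

    # Linearity: E[alpha*f(x) + beta*g(x)] = alpha*E[f(x)] + beta*E[g(x)]
    lhs = E(lambda x: alpha * f(x) + beta * g(x))
    rhs = alpha * E(f) + beta * E(g)
    print(lhs, rhs)  # the two values coincide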

  6. Random Variables
Variance
Variance measures how much the values of a random variable differ from its mean $\mu = \mathbb E[x]$:
$\operatorname{Var}(x) \stackrel{\mathrm{def}}{=} \mathbb E\big[(x - \mathbb E[x])^2\big]$, or more generally $\operatorname{Var}(f(x)) \stackrel{\mathrm{def}}{=} \mathbb E\big[(f(x) - \mathbb E[f(x)])^2\big].$
It is easy to see, using linearity of expectation, that
$\operatorname{Var}(x) = \mathbb E\big[x^2 - 2x\,\mathbb E[x] + (\mathbb E[x])^2\big] = \mathbb E[x^2] - (\mathbb E[x])^2.$
Variance is connected to $\mathbb E[x^2]$, the second moment of a random variable; it is in fact a centered second moment.
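
An illustrative numerical check (not part of the slides) of the identity $\operatorname{Var}(x) = \mathbb E[x^2] - (\mathbb E[x])^2$ on an arbitrary small discrete distribution:

    import numpy as np

    values = np.array([1.0, 2.0, 5.0])
    P = np.array([0.5, 0.3, 0.2])

    mean = np.sum(P * values)                           # E[x]
    var_centered = np.sum(P * (values - mean) ** 2)     # E[(x - E[x])^2]
    var_moments = np.sum(P * values ** 2) - mean ** 2   # E[x^2] - (E[x])^2

    print(var_centered, var_moments)  # identical values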

  7. Common Probability Distributions
Bernoulli Distribution
The Bernoulli distribution is a distribution over a binary random variable. It has a single parameter $\varphi \in [0, 1]$, which specifies the probability of the random variable being equal to 1.
$P(x) = \varphi^x (1 - \varphi)^{1-x}$
$\mathbb E[x] = \varphi$, $\operatorname{Var}(x) = \varphi(1 - \varphi)$

Categorical Distribution
Extension of the Bernoulli distribution to random variables taking one of $k$ different discrete outcomes. It is parametrized by $\boldsymbol p \in [0, 1]^k$ such that $\sum_{i=1}^k p_i = 1$.
$P(\boldsymbol x) = \prod_{i=1}^k p_i^{x_i}$
$\mathbb E[x_i] = p_i$, $\operatorname{Var}(x_i) = p_i(1 - p_i)$
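
A brief sketch (added for illustration; the parameter values $\varphi$ and $\boldsymbol p$ are arbitrary) checking the stated means and variances against empirical samples:

    import numpy as np

    rng = np.random.default_rng(0)

    # Bernoulli with parameter phi
    phi = 0.3
    samples = rng.binomial(1, phi, size=1_000_000)
    print(samples.mean(), phi)               # E[x] = phi
    print(samples.var(), phi * (1 - phi))    # Var(x) = phi * (1 - phi)

    # Categorical with k = 3 outcomes, parametrized by p summing to 1
    p = np.array([0.2, 0.5, 0.3])
    outcomes = rng.choice(len(p), p=p, size=1_000_000)
    one_hot = np.eye(len(p))[outcomes]       # x represented as a one-hot vector
    print(one_hot.mean(axis=0))              # E[x_i] = p_i
    print(one_hot.var(axis=0), p * (1 - p))  # Var(x_i) = p_i * (1 - p_i)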

  8. Information Theory
Self Information
Amount of surprise when a random variable is sampled.
- Should be zero for events with probability 1.
- Less likely events are more surprising.
- Independent events should have additive information.
$I(x) \stackrel{\mathrm{def}}{=} -\log P(x) = \log \frac{1}{P(x)}$

Entropy
Amount of surprise in the whole distribution.
$H(P) \stackrel{\mathrm{def}}{=} \mathbb E_{\mathrm x \sim P}[I(x)] = -\mathbb E_{\mathrm x \sim P}[\log P(x)]$
- for discrete $P$: $H(P) = -\sum_x P(x) \log P(x)$
- for continuous $P$: $H(P) = -\int P(x) \log P(x)\,\mathrm d x$
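
A minimal computation of self-information and entropy for an arbitrary discrete distribution (an added example; with base-2 logarithms the results are in bits, with natural logarithms in nats):

    import numpy as np

    P = np.array([0.5, 0.25, 0.25])

    self_information = -np.log2(P)           # I(x) = -log P(x), here in bits
    entropy = np.sum(P * self_information)   # H(P) = -sum_x P(x) log P(x)

    print(self_information)  # [1. 2. 2.]
    print(entropy)           # 1.5 bits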

  9. Information Theory
Cross-Entropy
$H(P, Q) \stackrel{\mathrm{def}}{=} -\mathbb E_{\mathrm x \sim P}[\log Q(x)]$

Gibbs inequality
- $H(P, Q) \ge H(P)$
- $H(P) = H(P, Q) \Leftrightarrow P = Q$
Proof: Using Jensen's inequality, we get
$\sum_x P(x) \log \frac{Q(x)}{P(x)} \le \log \sum_x P(x) \frac{Q(x)}{P(x)} = \log \sum_x Q(x) = 0.$
Corollary: For a categorical distribution with $n$ outcomes, $H(P) \le \log n$, because for $Q(x) = 1/n$ we get
$H(P) \le H(P, Q) = -\sum_x P(x) \log Q(x) = \log n.$
Note that generally $H(P, Q) \ne H(Q, P)$.
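
An illustrative numerical check (not from the slides) of the Gibbs inequality and of the asymmetry of cross-entropy, on two arbitrarily chosen categorical distributions:

    import numpy as np

    P = np.array([0.7, 0.2, 0.1])
    Q = np.array([0.4, 0.4, 0.2])

    def entropy(P):
        # H(P) = -sum_x P(x) log P(x)
        return -np.sum(P * np.log(P))

    def cross_entropy(P, Q):
        # H(P, Q) = -E_{x~P}[log Q(x)]
        return -np.sum(P * np.log(Q))

    print(entropy(P), cross_entropy(P, Q))           # H(P) <= H(P, Q)
    print(cross_entropy(P, Q), cross_entropy(Q, P))  # generally not equal
    print(entropy(P), cross_entropy(P, P))           # equal when Q = P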

  10. Information Theory
Kullback-Leibler Divergence (KL Divergence)
Sometimes also called relative entropy.
$D_{\mathrm{KL}}(P \,\|\, Q) \stackrel{\mathrm{def}}{=} H(P, Q) - H(P) = \mathbb E_{\mathrm x \sim P}[\log P(x) - \log Q(x)]$
- $D_{\mathrm{KL}}(P \,\|\, Q) \ge 0$, as a consequence of the Gibbs inequality
- generally $D_{\mathrm{KL}}(P \,\|\, Q) \ne D_{\mathrm{KL}}(Q \,\|\, P)$
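
A short sketch (added for illustration, reusing the same two arbitrary distributions as above) computing the KL divergence in both directions, showing nonnegativity and nonsymmetry:

    import numpy as np

    P = np.array([0.7, 0.2, 0.1])
    Q = np.array([0.4, 0.4, 0.2])

    def kl_divergence(P, Q):
        # D_KL(P || Q) = E_{x~P}[log P(x) - log Q(x)]
        return np.sum(P * (np.log(P) - np.log(Q)))

    print(kl_divergence(P, Q))  # >= 0
    print(kl_divergence(Q, P))  # >= 0, but different from D_KL(P || Q)
    print(kl_divergence(P, P))  # 0 when the distributions coincide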

  11. Nonsymmetry of KL Divergence
Figure 3.6, page 76 of the Deep Learning Book, http://deeplearningbook.org

  12. Common Probability Distributions
Normal (or Gaussian) Distribution
Distribution over real numbers, parametrized by a mean $\mu$ and variance $\sigma^2$:
$\mathcal N(x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}}\, \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
For standard values $\mu = 0$ and $\sigma^2 = 1$ we get $\mathcal N(x; 0, 1) = \sqrt{\frac{1}{2\pi}}\, e^{-\frac{x^2}{2}}$.
Figure 3.1, page 64 of the Deep Learning Book, http://deeplearningbook.org
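
A hedged sketch (an addition, with $\mu$ and $\sigma^2$ chosen arbitrarily) implementing the density formula directly and checking that it integrates to approximately one:

    import numpy as np

    def normal_pdf(x, mu=0.0, sigma2=1.0):
        # N(x; mu, sigma^2) = sqrt(1 / (2 pi sigma^2)) * exp(-(x - mu)^2 / (2 sigma^2))
        return np.sqrt(1.0 / (2 * np.pi * sigma2)) * np.exp(-(x - mu) ** 2 / (2 * sigma2))

    # The density should integrate to 1; midpoint Riemann sum over a wide interval
    dx = 0.001
    xs = np.arange(-10, 10, dx) + dx / 2
    print(np.sum(normal_pdf(xs, mu=1.0, sigma2=4.0)) * dx)  # approximately 1.0
    print(normal_pdf(0.0))  # 1 / sqrt(2 pi), about 0.3989, for the standard normal at x = 0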

  13. Why Normal Distribution
Central Limit Theorem
The sum of independent identically distributed random variables with finite variance converges to a normal distribution.

Principle of Maximum Entropy
Given a set of constraints, a distribution with maximal entropy fulfilling the constraints can be considered the most general one, containing as few additional assumptions as possible.
Considering distributions with a given mean and variance, it can be proven (using variational inference) that such a distribution with maximal entropy is exactly the normal distribution.
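
An illustrative Monte Carlo demonstration of the central limit theorem (added here; the choice of uniform summands and of the sample sizes is arbitrary): standardized sums of uniform variables place roughly 68.3% of their mass within one standard deviation, as a normal distribution would.

    import numpy as np

    rng = np.random.default_rng(1)
    n, trials = 30, 100_000

    # Sums of n i.i.d. Uniform(0, 1) variables, standardized to zero mean and unit variance
    # (a single Uniform(0, 1) variable has mean 1/2 and variance 1/12)
    sums = rng.random((trials, n)).sum(axis=1)
    standardized = (sums - n * 0.5) / np.sqrt(n / 12)

    print(np.mean(np.abs(standardized) <= 1))       # close to 0.6827, the normal value
    print(standardized.mean(), standardized.var())  # close to 0 and 1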

  14. Machine Learning
A possible definition of learning from Mitchell (1997): "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Task T
- classification: assigning one of $k$ categories to a given input
- regression: producing a real number $x \in \mathbb R$ for a given input
- structured prediction, denoising, density estimation, …

Experience E
- supervised: usually a dataset with desired outcomes (labels or targets)
- unsupervised: usually data without any annotation (raw text, raw images, …)
- reinforcement learning, semi-supervised learning, …

Measure P
- accuracy, error rate, F-score, …
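
As a concrete illustration of the listed performance measures (an added sketch; the gold labels and predictions are hypothetical), accuracy, error rate and F1-score can be computed as follows:

    import numpy as np

    # Hypothetical gold labels and binary predictions
    gold = np.array([1, 0, 1, 1, 0, 1, 0, 0])
    pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

    accuracy = np.mean(pred == gold)
    error_rate = 1 - accuracy

    # F1-score: harmonic mean of precision and recall for the positive class
    tp = np.sum((pred == 1) & (gold == 1))
    fp = np.sum((pred == 1) & (gold == 0))
    fn = np.sum((pred == 0) & (gold == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    print(accuracy, error_rate, f1)  # 0.75, 0.25, 0.75 for this toy example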

  15. Well-known Datasets
- MNIST: Images (28x28, grayscale) of handwritten digits; 60k instances.
- CIFAR-10: Images (32x32, color) of 10 classes of objects; 50k instances.
- CIFAR-100: Images (32x32, color) of 100 classes of objects (with 20 defined superclasses); 50k instances.
- ImageNet: Labeled object image database (labeled objects, some with bounding boxes); 14.2M instances.
- ImageNet-ILSVRC: Subset of ImageNet for the Large Scale Visual Recognition Challenge, annotated with 1000 object classes and their bounding boxes; 1.2M instances.
- COCO (Common Objects in Context): Complex everyday scenes with descriptions (5) and highlighting of objects (91 types); 2.5M instances.

  16. Well-known Datasets
ImageNet-ILSVRC
Image from the "ImageNet Classification with Deep Convolutional Neural Networks" paper by Alex Krizhevsky et al.
Image from http://image-net.org/challenges/LSVRC/2014/.

  17. Well-known Datasets
COCO
Image from http://mscoco.org/dataset/#detections-challenge2016.
