Introduction to Deep Learning


  1. NPFL114, Lecture 1: Introduction to Deep Learning
Milan Straka, February 24, 2020
Charles University in Prague, Faculty of Mathematics and Physics, Institute of Formal and Applied Linguistics

  2. Deep Learning Highlights
- Image recognition
- Object detection
- Image segmentation, human pose estimation
- Image labeling
- Visual question answering
- Speech recognition and generation
- Lip reading
- Machine translation
- Machine translation without parallel data
- Chess, Go and Shogi
- Multiplayer Capture the Flag

  3. Notation
- $a$, $\boldsymbol a$, $\boldsymbol A$, $\mathsf A$: scalar (integer or real), vector, matrix, tensor
- $\mathrm a$, $\mathbf a$, $\mathbf A$: scalar, vector, matrix random variable
- $\frac{\mathrm d f}{\mathrm d x}$: derivative of $f$ with respect to $x$
- $\frac{\partial f}{\partial x}$: partial derivative of $f$ with respect to $x$
- $\nabla_{\boldsymbol x} f(\boldsymbol x) = \left(\frac{\partial f(\boldsymbol x)}{\partial x_1}, \frac{\partial f(\boldsymbol x)}{\partial x_2}, \ldots, \frac{\partial f(\boldsymbol x)}{\partial x_n}\right)$: gradient of $f$ with respect to $\boldsymbol x$
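
As an illustration of the gradient notation (an addition, not part of the original slides), here is a minimal NumPy sketch comparing the analytic gradient of $f(\boldsymbol x) = \sum_i x_i^2$, which is $2\boldsymbol x$, with a finite-difference approximation:

    import numpy as np

    def f(x):
        # f(x) = sum_i x_i^2, so the analytic gradient is 2x
        return np.sum(x ** 2)

    def numeric_gradient(f, x, eps=1e-6):
        # Central differences: (f(x + eps*e_i) - f(x - eps*e_i)) / (2*eps) for each coordinate
        grad = np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x)
            e[i] = eps
            grad[i] = (f(x + e) - f(x - e)) / (2 * eps)
        return grad

    x = np.array([1.0, -2.0, 3.0])
    print(numeric_gradient(f, x))  # approximately [ 2. -4.  6.]
    print(2 * x)                   # analytic gradient 2x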

  4. Random Variables
A random variable $\mathrm x$ is a result of a random process. It can be discrete or continuous.

Probability Distribution
A probability distribution describes how likely the individual values a random variable can take are.
- The notation $\mathrm x \sim P$ stands for a random variable $\mathrm x$ having a distribution $P$.
- For discrete variables, the probability that $\mathrm x$ takes a value $x$ is denoted as $P(x)$, or explicitly as $P(\mathrm x = x)$.
- For continuous variables, the probability that the value of $\mathrm x$ lies in the interval $[a, b]$ is given by $\int_a^b p(x)\,\mathrm d x$.
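
A numerical illustration of $\int_a^b p(x)\,\mathrm d x$ (added here; the standard normal density is only an example choice of $p$): a midpoint Riemann sum is compared with a Monte Carlo estimate.

    import numpy as np

    def p(x):
        # Standard normal density, used only as an example of a continuous p(x)
        return np.exp(-x ** 2 / 2) / np.sqrt(2 * np.pi)

    a, b = -1.0, 1.0

    # Midpoint Riemann sum approximating the integral of p(x) over [a, b]
    dx = (b - a) / 10000
    xs = a + dx * (np.arange(10000) + 0.5)
    prob_integral = np.sum(p(xs)) * dx

    # Monte Carlo estimate: fraction of samples falling into [a, b]
    samples = np.random.default_rng(42).normal(size=1_000_000)
    prob_monte_carlo = np.mean((samples >= a) & (samples <= b))

    print(prob_integral, prob_monte_carlo)  # both close to 0.6827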

  5. Random Variables
Expectation
The expectation of a function $f(x)$ with respect to a discrete probability distribution $P(x)$ is defined as
$\mathbb E_{\mathrm x \sim P}[f(x)] \stackrel{\mathrm{def}}{=} \sum_x P(x) f(x).$
For continuous variables it is computed as
$\mathbb E_{\mathrm x \sim p}[f(x)] \stackrel{\mathrm{def}}{=} \int p(x) f(x)\,\mathrm d x.$
If the random variable is obvious from context, we can write only $\mathbb E_P[x]$ or even $\mathbb E[x]$.
Expectation is linear, i.e.,
$\mathbb E_{\mathrm x}[\alpha f(x) + \beta g(x)] = \alpha \mathbb E_{\mathrm x}[f(x)] + \beta \mathbb E_{\mathrm x}[g(x)].$
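
A small check of the discrete expectation formula and of linearity (an illustrative addition; the distribution and the functions $f$, $g$ are chosen arbitrarily):

    import numpy as np

    # A discrete distribution P over the values 0, 1, 2
    values = np.array([0.0, 1.0, 2.0])
    P = np.array([0.2, 0.5, 0.3])

    def E(f):
        # E_{x~P}[f(x)] = sum_x P(x) f(x)
        return np.sum(P * f(values))

    f = lambda x: x ** 2
    g = lambda x: 3 * x + 1
    alpha, beta = 2.0, -0.5

    # Linearity: E[alpha*f(x) + beta*g(x)] = alpha*E[f(x)] + beta*E[g(x)]
    lhs = E(lambda x: alpha * f(x) + beta * g(x))
    rhs = alpha * E(f) + beta * E(g)
    print(lhs, rhs)  # the two values coincide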

  6. Random Variables
Variance
Variance measures how much the values of a random variable differ from its mean $\mu = \mathbb E[x]$:
$\operatorname{Var}(x) \stackrel{\mathrm{def}}{=} \mathbb E\big[(x - \mathbb E[x])^2\big]$, or more generally $\operatorname{Var}(f(x)) \stackrel{\mathrm{def}}{=} \mathbb E\big[(f(x) - \mathbb E[f(x)])^2\big].$
It is easy to see, using linearity of expectation, that
$\operatorname{Var}(x) = \mathbb E\big[x^2 - 2x\,\mathbb E[x] + (\mathbb E[x])^2\big] = \mathbb E[x^2] - (\mathbb E[x])^2.$
Variance is connected to $\mathbb E[x^2]$, the second moment of a random variable; it is in fact a centered second moment.
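
An illustrative numerical check (not part of the slides) of the identity $\operatorname{Var}(x) = \mathbb E[x^2] - (\mathbb E[x])^2$ on an arbitrary small discrete distribution:

    import numpy as np

    values = np.array([1.0, 2.0, 5.0])
    P = np.array([0.5, 0.3, 0.2])

    mean = np.sum(P * values)                           # E[x]
    var_centered = np.sum(P * (values - mean) ** 2)     # E[(x - E[x])^2]
    var_moments = np.sum(P * values ** 2) - mean ** 2   # E[x^2] - (E[x])^2

    print(var_centered, var_moments)  # identical values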

  7. Common Probability Distributions
Bernoulli Distribution
The Bernoulli distribution is a distribution over a binary random variable. It has a single parameter $\varphi \in [0, 1]$, which specifies the probability of the random variable being equal to 1.
$P(x) = \varphi^x (1 - \varphi)^{1-x}$
$\mathbb E[x] = \varphi$, $\operatorname{Var}(x) = \varphi(1 - \varphi)$

Categorical Distribution
Extension of the Bernoulli distribution to random variables taking one of $k$ different discrete outcomes. It is parametrized by $\boldsymbol p \in [0, 1]^k$ such that $\sum_{i=1}^k p_i = 1$.
$P(\boldsymbol x) = \prod_{i=1}^k p_i^{x_i}$
$\mathbb E[x_i] = p_i$, $\operatorname{Var}(x_i) = p_i(1 - p_i)$
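
A brief sketch (added for illustration; the parameter values $\varphi$ and $\boldsymbol p$ are arbitrary) checking the stated means and variances against empirical samples:

    import numpy as np

    rng = np.random.default_rng(0)

    # Bernoulli with parameter phi
    phi = 0.3
    samples = rng.binomial(1, phi, size=1_000_000)
    print(samples.mean(), phi)               # E[x] = phi
    print(samples.var(), phi * (1 - phi))    # Var(x) = phi * (1 - phi)

    # Categorical with k = 3 outcomes, parametrized by p summing to 1
    p = np.array([0.2, 0.5, 0.3])
    outcomes = rng.choice(len(p), p=p, size=1_000_000)
    one_hot = np.eye(len(p))[outcomes]       # x represented as a one-hot vector
    print(one_hot.mean(axis=0))              # E[x_i] = p_i
    print(one_hot.var(axis=0), p * (1 - p))  # Var(x_i) = p_i * (1 - p_i)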

  8. Information Theory
Self Information
Amount of surprise when a random variable is sampled.
- Should be zero for events with probability 1.
- Less likely events are more surprising.
- Independent events should have additive information.
$I(x) \stackrel{\mathrm{def}}{=} -\log P(x) = \log \frac{1}{P(x)}$

Entropy
Amount of surprise in the whole distribution.
$H(P) \stackrel{\mathrm{def}}{=} \mathbb E_{\mathrm x \sim P}[I(x)] = -\mathbb E_{\mathrm x \sim P}[\log P(x)]$
- for discrete $P$: $H(P) = -\sum_x P(x) \log P(x)$
- for continuous $P$: $H(P) = -\int P(x) \log P(x)\,\mathrm d x$
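
A minimal computation of self-information and entropy for an arbitrary discrete distribution (an added example; with base-2 logarithms the results are in bits, with natural logarithms in nats):

    import numpy as np

    P = np.array([0.5, 0.25, 0.25])

    self_information = -np.log2(P)           # I(x) = -log P(x), here in bits
    entropy = np.sum(P * self_information)   # H(P) = -sum_x P(x) log P(x)

    print(self_information)  # [1. 2. 2.]
    print(entropy)           # 1.5 bits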

  9. Information Theory
Cross-Entropy
$H(P, Q) \stackrel{\mathrm{def}}{=} -\mathbb E_{\mathrm x \sim P}[\log Q(x)]$

Gibbs inequality
- $H(P, Q) \ge H(P)$
- $H(P) = H(P, Q) \Leftrightarrow P = Q$
Proof: Using Jensen's inequality, we get
$\sum_x P(x) \log \frac{Q(x)}{P(x)} \le \log \sum_x P(x) \frac{Q(x)}{P(x)} = \log \sum_x Q(x) = 0.$
Corollary: For a categorical distribution with $n$ outcomes, $H(P) \le \log n$, because for $Q(x) = 1/n$ we get
$H(P) \le H(P, Q) = -\sum_x P(x) \log Q(x) = \log n.$
Note that generally $H(P, Q) \ne H(Q, P)$.
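
An illustrative numerical check (not from the slides) of the Gibbs inequality and of the asymmetry of cross-entropy, on two arbitrarily chosen categorical distributions:

    import numpy as np

    P = np.array([0.7, 0.2, 0.1])
    Q = np.array([0.4, 0.4, 0.2])

    def entropy(P):
        # H(P) = -sum_x P(x) log P(x)
        return -np.sum(P * np.log(P))

    def cross_entropy(P, Q):
        # H(P, Q) = -E_{x~P}[log Q(x)]
        return -np.sum(P * np.log(Q))

    print(entropy(P), cross_entropy(P, Q))           # H(P) <= H(P, Q)
    print(cross_entropy(P, Q), cross_entropy(Q, P))  # generally not equal
    print(entropy(P), cross_entropy(P, P))           # equal when Q = P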

  10. Information Theory
Kullback-Leibler Divergence (KL Divergence)
Sometimes also called relative entropy.
$D_{\mathrm{KL}}(P \,\|\, Q) \stackrel{\mathrm{def}}{=} H(P, Q) - H(P) = \mathbb E_{\mathrm x \sim P}[\log P(x) - \log Q(x)]$
- $D_{\mathrm{KL}}(P \,\|\, Q) \ge 0$, as a consequence of the Gibbs inequality
- generally $D_{\mathrm{KL}}(P \,\|\, Q) \ne D_{\mathrm{KL}}(Q \,\|\, P)$
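
A short sketch (added for illustration, reusing the same two arbitrary distributions as above) computing the KL divergence in both directions, showing nonnegativity and nonsymmetry:

    import numpy as np

    P = np.array([0.7, 0.2, 0.1])
    Q = np.array([0.4, 0.4, 0.2])

    def kl_divergence(P, Q):
        # D_KL(P || Q) = E_{x~P}[log P(x) - log Q(x)]
        return np.sum(P * (np.log(P) - np.log(Q)))

    print(kl_divergence(P, Q))  # >= 0
    print(kl_divergence(Q, P))  # >= 0, but different from D_KL(P || Q)
    print(kl_divergence(P, P))  # 0 when the distributions coincide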

  11. Nonsymmetry of KL Divergence
Figure 3.6, page 76 of the Deep Learning Book, http://deeplearningbook.org

  12. Common Probability Distributions
Normal (or Gaussian) Distribution
Distribution over real numbers, parametrized by a mean $\mu$ and variance $\sigma^2$:
$\mathcal N(x; \mu, \sigma^2) = \sqrt{\frac{1}{2\pi\sigma^2}}\, \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right)$
For standard values $\mu = 0$ and $\sigma^2 = 1$ we get $\mathcal N(x; 0, 1) = \sqrt{\frac{1}{2\pi}}\, e^{-\frac{x^2}{2}}$.
Figure 3.1, page 64 of the Deep Learning Book, http://deeplearningbook.org
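
A hedged sketch (an addition, with $\mu$ and $\sigma^2$ chosen arbitrarily) implementing the density formula directly and checking that it integrates to approximately one:

    import numpy as np

    def normal_pdf(x, mu=0.0, sigma2=1.0):
        # N(x; mu, sigma^2) = sqrt(1 / (2 pi sigma^2)) * exp(-(x - mu)^2 / (2 sigma^2))
        return np.sqrt(1.0 / (2 * np.pi * sigma2)) * np.exp(-(x - mu) ** 2 / (2 * sigma2))

    # The density should integrate to 1; midpoint Riemann sum over a wide interval
    dx = 0.001
    xs = np.arange(-10, 10, dx) + dx / 2
    print(np.sum(normal_pdf(xs, mu=1.0, sigma2=4.0)) * dx)  # approximately 1.0
    print(normal_pdf(0.0))  # 1 / sqrt(2 pi), about 0.3989, for the standard normal at x = 0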

  13. Why Normal Distribution
Central Limit Theorem
The sum of independent identically distributed random variables with finite variance converges to a normal distribution.

Principle of Maximum Entropy
Given a set of constraints, a distribution with maximal entropy fulfilling the constraints can be considered the most general one, containing as few additional assumptions as possible.
Considering distributions with a given mean and variance, it can be proven (using variational inference) that such a distribution with maximal entropy is exactly the normal distribution.
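
An illustrative Monte Carlo demonstration of the central limit theorem (added here; the choice of uniform summands and of the sample sizes is arbitrary): standardized sums of uniform variables place roughly 68.3% of their mass within one standard deviation, as a normal distribution would.

    import numpy as np

    rng = np.random.default_rng(1)
    n, trials = 30, 100_000

    # Sums of n i.i.d. Uniform(0, 1) variables, standardized to zero mean and unit variance
    # (a single Uniform(0, 1) variable has mean 1/2 and variance 1/12)
    sums = rng.random((trials, n)).sum(axis=1)
    standardized = (sums - n * 0.5) / np.sqrt(n / 12)

    print(np.mean(np.abs(standardized) <= 1))       # close to 0.6827, the normal value
    print(standardized.mean(), standardized.var())  # close to 0 and 1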

  14. Machine Learning
A possible definition of learning from Mitchell (1997): "A computer program is said to learn from experience E with respect to some class of tasks T and performance measure P, if its performance at tasks in T, as measured by P, improves with experience E."

Task T
- classification: assigning one of $k$ categories to a given input
- regression: producing a real number $x \in \mathbb R$ for a given input
- structured prediction, denoising, density estimation, …

Experience E
- supervised: usually a dataset with desired outcomes (labels or targets)
- unsupervised: usually data without any annotation (raw text, raw images, …)
- reinforcement learning, semi-supervised learning, …

Measure P
- accuracy, error rate, F-score, …
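
As a concrete illustration of the listed performance measures (an added sketch; the gold labels and predictions are hypothetical), accuracy, error rate and F1-score can be computed as follows:

    import numpy as np

    # Hypothetical gold labels and binary predictions
    gold = np.array([1, 0, 1, 1, 0, 1, 0, 0])
    pred = np.array([1, 0, 0, 1, 0, 1, 1, 0])

    accuracy = np.mean(pred == gold)
    error_rate = 1 - accuracy

    # F1-score: harmonic mean of precision and recall for the positive class
    tp = np.sum((pred == 1) & (gold == 1))
    fp = np.sum((pred == 1) & (gold == 0))
    fn = np.sum((pred == 0) & (gold == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)

    print(accuracy, error_rate, f1)  # 0.75, 0.25, 0.75 for this toy example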

  15. Well-known Datasets
- MNIST: Images (28x28, grayscale) of handwritten digits; 60k instances.
- CIFAR-10: Images (32x32, color) of 10 classes of objects; 50k instances.
- CIFAR-100: Images (32x32, color) of 100 classes of objects (with 20 defined superclasses); 50k instances.
- ImageNet: Labeled object image database (labeled objects, some with bounding boxes); 14.2M instances.
- ImageNet-ILSVRC: Subset of ImageNet for the Large Scale Visual Recognition Challenge, annotated with 1000 object classes and their bounding boxes; 1.2M instances.
- COCO (Common Objects in Context): Complex everyday scenes with descriptions (5) and highlighting of objects (91 types); 2.5M instances.

  16. Well-known Datasets
ImageNet-ILSVRC
Image from the "ImageNet Classification with Deep Convolutional Neural Networks" paper by Alex Krizhevsky et al.
Image from http://image-net.org/challenges/LSVRC/2014/.

  17. Well-known Datasets
COCO
Image from http://mscoco.org/dataset/#detections-challenge2016.
