

  1. Machine Learning Lecture 01-2: Basics of Information Theory. Nevin L. Zhang (lzhang@cse.ust.hk), Department of Computer Science and Engineering, The Hong Kong University of Science and Technology.

  2. Jensen’s Inequality: Outline. 1 Jensen’s Inequality, 2 Entropy, 3 Divergence, 4 Mutual Information.

  3. Jensen’s Inequality: Concave functions. A function f is concave on an interval I if for any x, y ∈ I and any λ ∈ [0, 1],
     λ f(x) + (1 − λ) f(y) ≤ f(λ x + (1 − λ) y).
     The weighted average of the function is upper bounded by the function of the weighted average. It is strictly concave if the equality holds only when x = y.

  4. Jensen’s Inequality: Jensen’s Inequality. Theorem (1.1): Suppose the function f is concave on interval I. Then for any p_i ∈ [0, 1] with Σ_{i=1}^n p_i = 1, and any x_i ∈ I,
     Σ_{i=1}^n p_i f(x_i) ≤ f(Σ_{i=1}^n p_i x_i).
     The weighted average of the function is upper bounded by the function of the weighted average. If f is strictly concave, the equality holds iff p_i p_j ≠ 0 implies x_i = x_j. Exercise: Prove this (using induction).

  5. Jensen’s Inequality: Logarithmic function. The logarithmic function is concave on the interval (0, ∞). Hence, for x_i > 0,
     Σ_{i=1}^n p_i log(x_i) ≤ log(Σ_{i=1}^n p_i x_i).
     In words, exchanging Σ_i p_i with log increases the quantity. Or, swapping expectation and logarithm increases the quantity: E[log x] ≤ log E[x].
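A quick numerical sanity check of E[log x] ≤ log E[x] (not part of the original slides); the weights p and points x are arbitrary illustrative values:

```python
import numpy as np

# Arbitrary weights p_i (summing to 1) and points x_i > 0, chosen for illustration.
p = np.array([0.2, 0.5, 0.3])
x = np.array([1.0, 4.0, 9.0])

lhs = np.sum(p * np.log(x))   # E[log x]: weighted average of log(x_i)
rhs = np.log(np.sum(p * x))   # log E[x]: log of the weighted average
print(lhs, "<=", rhs)         # Jensen's inequality: E[log x] <= log E[x]
```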

  6. Entropy: Outline. 1 Jensen’s Inequality, 2 Entropy, 3 Divergence, 4 Mutual Information.

  7. Entropy: Entropy. The entropy of a random variable X:
     H(X) = Σ_X P(X) log(1/P(X)) = −E_P[log P(X)],
     with the convention that 0 log(1/0) = 0. The base of the logarithm is 2, and the unit is the bit. It is sometimes also called the entropy of the distribution, H(P). H(X) measures the amount of uncertainty about X. For a real-valued variable, replace Σ_X ... with ∫ ... dx.
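A minimal sketch of this definition in code, using the base-2 logarithm so the result is in bits; the function name entropy is an illustrative choice, and zero probabilities are dropped following the convention 0 log(1/0) = 0:

```python
import numpy as np

def entropy(p):
    """Entropy, in bits, of a discrete distribution given as an array of probabilities."""
    p = np.asarray(p, dtype=float)
    p = p[p > 0]                      # convention: 0 * log(1/0) = 0, so drop zero entries
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))            # fair coin: 1 bit
```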

  8. Entropy: Entropy. Example: X is the result of a coin toss, Y the result of a die throw, and Z the result of randomly picking a card from a deck of 54. Which one has the highest uncertainty? Entropy:
     H(X) = (1/2) log 2 + (1/2) log 2 = log 2 (= 1 bit),
     H(Y) = (1/6) log 6 + ... + (1/6) log 6 = log 6,
     H(Z) = (1/54) log 54 + ... + (1/54) log 54 = log 54.
     Indeed we have H(X) < H(Y) < H(Z).
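The three values can be reproduced directly, assuming the coin toss, die throw, and card draw are uniform as on the slide:

```python
import numpy as np

# Entropies (in bits) of uniform distributions over 2, 6, and 54 outcomes.
for n in (2, 6, 54):                       # coin, die, 54-card deck
    p = np.full(n, 1 / n)
    print(n, -np.sum(p * np.log2(p)))      # equals log2(n)
```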

  9. Entropy: Entropy. For a binary X, the chart on the right shows H(X) as a function of p = P(X = 1). The higher H(X) is, the more uncertainty there is about the value of X.

  10. Entropy: Entropy. Proposition (1.2): H(X) ≥ 0, and H(X) = 0 iff P(X = x) = 1 for some x ∈ Ω_X, i.e. iff there is no uncertainty. H(X) ≤ log(|X|), with equality iff P(X = x) = 1/|X|; uncertainty is highest in the case of the uniform distribution. Proof: Because log is concave, by Jensen’s inequality,
      H(X) = Σ_X P(X) log(1/P(X)) ≤ log(Σ_X P(X) · (1/P(X))) = log |X|.

  11. Entropy: Conditional entropy. The conditional entropy of Y given the event X = x is the entropy of the conditional distribution P(Y | X = x):
      H(Y | X = x) = Σ_Y P(Y | X = x) log(1/P(Y | X = x)).
      It is the uncertainty that remains about Y when X is known to be x. It is possible that H(Y | X = x) > H(Y): intuitively, X = x might contradict our prior knowledge about Y and increase our uncertainty about Y. Exercise: Give an example.

  12. Entropy: Conditional Entropy. The conditional entropy of Y given the variable X:
      H(Y | X) = Σ_x P(X = x) H(Y | X = x)
               = Σ_X Σ_Y P(X) P(Y | X) log(1/P(Y | X))
               = Σ_{X,Y} P(X, Y) log(1/P(Y | X))
               = −E[log P(Y | X)].
      It is the average uncertainty that remains about Y when X is known.
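A minimal sketch of H(Y | X) computed from a joint probability table; the function name and the example table are illustrative assumptions, not from the slides:

```python
import numpy as np

def conditional_entropy(joint):
    """H(Y|X) in bits from a joint table joint[x, y] = P(X=x, Y=y),
    using H(Y|X) = -sum_{x,y} P(x, y) log2 P(y|x)."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)            # marginal P(X=x)
    with np.errstate(divide="ignore", invalid="ignore"):
        cond = np.where(px > 0, joint / px, 0.0)     # conditional P(Y=y | X=x)
    mask = joint > 0                                 # 0 log(1/0) = 0 convention
    return float(-np.sum(joint[mask] * np.log2(cond[mask])))

# Illustrative joint distribution (rows: x, columns: y).
joint = np.array([[0.25, 0.25],
                  [0.40, 0.10]])
print(conditional_entropy(joint))
```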

  13. Divergence: Outline. 1 Jensen’s Inequality, 2 Entropy, 3 Divergence, 4 Mutual Information.

  14. Divergence: Kullback-Leibler divergence. Relative entropy, or Kullback-Leibler divergence, measures how much a distribution Q(X) differs from a "true" probability distribution P(X). The K-L divergence of Q from P is defined as
      KL(P || Q) = Σ_X P(X) log(P(X)/Q(X)),
      with the conventions 0 log(0/0) = 0 and p log(p/0) = ∞ if p ≠ 0.
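A minimal sketch of the definition in code (base 2, so the result is in bits), following the slide’s conventions for zero probabilities; the function name and the example distributions are illustrative:

```python
import numpy as np

def kl_divergence(p, q):
    """KL(P || Q) in bits for discrete distributions over the same support.
    Terms with p_i = 0 contribute 0; p_i > 0 with q_i = 0 gives infinity."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    if np.any(q[mask] == 0):
        return float("inf")
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

print(kl_divergence([0.5, 0.5], [0.9, 0.1]))   # > 0
print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # 0 when P = Q
```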

  15. Divergence: Kullback-Leibler divergence. Theorem (1.2) (Gibbs’ inequality): KL(P || Q) ≥ 0, with equality iff P is identical to Q. Proof:
      KL(P || Q) = Σ_X P(X) log(P(X)/Q(X)) = −Σ_X P(X) log(Q(X)/P(X))
                 ≥ −log(Σ_X P(X) · Q(X)/P(X))    (Jensen’s inequality)
                 = −log(Σ_X Q(X)) = −log 1 = 0.
      Hence the KL divergence between P and Q is larger than 0 unless P and Q are identical.

  16. Divergence: Cross Entropy. Entropy: H(P) = Σ_X P(X) log(1/P(X)) = −E_P[log P(X)]. Cross entropy:
      H(P, Q) = Σ_X P(X) log(1/Q(X)) = −E_P[log Q(X)].
      Relationship with KL:
      KL(P || Q) = Σ_X P(X) log(P(X)/Q(X)) = E_P[log P(X)] − E_P[log Q(X)] = H(P, Q) − H(P).
      Or, H(P, Q) = KL(P || Q) + H(P).
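A small sketch (with illustrative distributions assumed) that computes the cross entropy and checks the identity H(P, Q) = KL(P || Q) + H(P):

```python
import numpy as np

def cross_entropy(p, q):
    """H(P, Q) = -sum_x P(x) log2 Q(x); assumes q > 0 wherever p > 0."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log2(q[mask])))

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])
h_p = -np.sum(p * np.log2(p))                       # entropy H(P)
kl = np.sum(p * np.log2(p / q))                     # KL(P || Q)
print(np.isclose(cross_entropy(p, q), kl + h_p))    # H(P, Q) = KL(P || Q) + H(P)
```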

  17. Divergence: A corollary. Corollary (1.1) (Gibbs inequality): H(P, Q) ≥ H(P), or
      Σ_X P(X) log Q(X) ≤ Σ_X P(X) log P(X).
      In general, let f(X) be a non-negative function. Then
      Σ_X f(X) log Q(X) ≤ Σ_X f(X) log P*(X),
      where P*(X) = f(X) / Σ_X f(X).

  18. Divergence: Unsupervised Learning. Unknown true distribution P(x):
      P(x) --(sampling)--> D = {x_i}_{i=1}^N --(learning)--> Q(x).
      Objective: minimize the KL divergence KL(P || Q), which is the same as minimizing the cross entropy H(P, Q). Approximating the cross entropy using data:
      H(P, Q) = −∫ P(x) log Q(x) dx ≈ −(1/N) Σ_{i=1}^N log Q(x_i) = −(1/N) log Q(D).
      This is the same as maximizing the likelihood log Q(D).
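A hedged sketch of the approximation step: samples are drawn from an assumed "true" P (a standard normal here, purely for illustration), and the cross entropy H(P, Q) is estimated as the average negative loglikelihood under a candidate Gaussian model Q; the closer Q is to P, the smaller the estimate:

```python
import numpy as np

rng = np.random.default_rng(0)

# Samples from the unknown "true" P(x); here a standard normal, used only to draw data.
samples = rng.normal(loc=0.0, scale=1.0, size=10_000)

def log_q(x, mu, sigma):
    """Log-density of a candidate Gaussian model Q(x) (natural log)."""
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2 * np.pi))

# Monte Carlo estimate of H(P, Q) ≈ -(1/N) sum_i log Q(x_i).
for mu in (0.0, 1.0):
    print(mu, -np.mean(log_q(samples, mu, 1.0)))   # smaller when Q is closer to P
```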

  19. Divergence: Supervised Learning. Unknown true distribution P(x, y), where y is the label of input x:
      P(x, y) --(sampling)--> D = {x_i, y_i}_{i=1}^N --(learning)--> Q(y | x).
      Objective: minimize the cross (conditional) entropy
      H(P, Q) = −∫ P(x, y) log Q(y | x) dx dy ≈ −(1/N) Σ_{i=1}^N log Q(y_i | x_i).
      This is the same as maximizing the loglikelihood Σ_{i=1}^N log Q(y_i | x_i), or minimizing the negative loglikelihood (NLL) −Σ_{i=1}^N log Q(y_i | x_i).
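A minimal sketch of the resulting NLL objective for a discrete classifier; probs, labels, and nll_loss are illustrative names, not from the slides:

```python
import numpy as np

def nll_loss(probs, labels):
    """Average negative loglikelihood -(1/N) sum_i log Q(y_i | x_i).
    probs[i, c] is the model's predicted probability Q(y=c | x_i);
    labels[i] is the observed class of example i."""
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=int)
    return float(-np.mean(np.log(probs[np.arange(len(labels)), labels])))

# Illustrative predictions for 3 examples and 2 classes.
probs = np.array([[0.9, 0.1],
                  [0.2, 0.8],
                  [0.6, 0.4]])
labels = np.array([0, 1, 0])
print(nll_loss(probs, labels))
```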

  20. Divergence: Jensen-Shannon divergence. KL is not symmetric: KL(P || Q) is usually not equal to the reverse KL, KL(Q || P). The Jensen-Shannon divergence is one symmetrized version of KL:
      JS(P || Q) = (1/2) KL(P || M) + (1/2) KL(Q || M), where M = (P + Q)/2.
      Properties: 0 ≤ JS(P || Q) ≤ log 2; JS(P || Q) = 0 if P = Q; JS(P || Q) = log 2 if P and Q have disjoint supports.
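A minimal sketch of the definition, in bits; the example distributions illustrate the two boundary properties:

```python
import numpy as np

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits: 0.5*KL(P||M) + 0.5*KL(Q||M), M = (P+Q)/2."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    m = 0.5 * (p + q)

    def kl(a, b):
        mask = a > 0
        return np.sum(a[mask] * np.log2(a[mask] / b[mask]))

    return float(0.5 * kl(p, m) + 0.5 * kl(q, m))

print(js_divergence([0.5, 0.5], [0.5, 0.5]))   # 0 when P = Q
print(js_divergence([1.0, 0.0], [0.0, 1.0]))   # log2(2) = 1 bit for disjoint supports
```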

  21. Mutual Information: Outline. 1 Jensen’s Inequality, 2 Entropy, 3 Divergence, 4 Mutual Information.

  22. Mutual Information: Mutual information. The mutual information of X and Y:
      I(X; Y) = H(X) − H(X | Y).
      It is the average reduction in uncertainty about X from learning the value of Y, or the average amount of information Y conveys about X.

  23. Mutual Information: Mutual information and KL Divergence. Note that:
      I(X; Y) = H(X) − H(X | Y)
              = Σ_X P(X) log(1/P(X)) − Σ_{X,Y} P(X, Y) log(1/P(X | Y))
              = Σ_{X,Y} P(X, Y) log(1/P(X)) − Σ_{X,Y} P(X, Y) log(1/P(X | Y))
              = Σ_{X,Y} P(X, Y) log(P(X | Y)/P(X))
              = Σ_{X,Y} P(X, Y) log(P(X, Y)/(P(X) P(Y)))    (equivalent definition)
              = KL(P(X, Y) || P(X) P(Y)).
      Due to the equivalent definition, I(X; Y) = H(X) − H(X | Y) = I(Y; X) = H(Y) − H(Y | X).
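A minimal sketch computing I(X; Y) from a joint table via the equivalent definition KL(P(X, Y) || P(X) P(Y)); the example tables are illustrative assumptions:

```python
import numpy as np

def mutual_information(joint):
    """I(X; Y) in bits as KL(P(X, Y) || P(X) P(Y)), from a joint table joint[x, y]."""
    joint = np.asarray(joint, dtype=float)
    px = joint.sum(axis=1, keepdims=True)    # marginal P(X)
    py = joint.sum(axis=0, keepdims=True)    # marginal P(Y)
    prod = px * py                           # product distribution P(X) P(Y)
    mask = joint > 0
    return float(np.sum(joint[mask] * np.log2(joint[mask] / prod[mask])))

# Independent case: joint equals the product of its marginals, so I(X; Y) = 0.
print(mutual_information(np.outer([0.3, 0.7], [0.6, 0.4])))
# Dependent case: perfectly correlated binary variables, I(X; Y) = 1 bit.
print(mutual_information(np.array([[0.5, 0.0], [0.0, 0.5]])))
```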

  24. Mutual Information: Property of Mutual information. Theorem (1.3): I(X; Y) ≥ 0, with equality iff X ⊥ Y. Interpretation: X and Y are independent iff X contains no information about Y and vice versa. Proof: Follows from the previous slide and Theorem 1.2.

  25. Mutual Information: Conditional Entropy Revisited. Theorem (1.4): H(X | Y) ≤ H(X), with equality iff X ⊥ Y. On average, observation reduces uncertainty, except in the case of independence. Proof: Follows from Theorem 1.3.

  26. Mutual Information: Mutual information and Entropy. From the definition of mutual information, I(X; Y) = H(X) − H(X | Y), and the chain rule, H(X, Y) = H(Y) + H(X | Y), we get
      H(X) + H(Y) = H(X, Y) + I(X; Y), i.e. I(X; Y) = H(X) + H(Y) − H(X, Y).
      Consequently, H(X, Y) ≤ H(X) + H(Y), with equality iff X ⊥ Y.
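A small sketch (with an illustrative joint table assumed) that evaluates I(X; Y) = H(X) + H(Y) − H(X, Y) and checks the resulting bound H(X, Y) ≤ H(X) + H(Y):

```python
import numpy as np

def h(p):
    """Entropy in bits of a (possibly multi-dimensional) probability table."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

# Illustrative joint table joint[x, y] = P(X=x, Y=y).
joint = np.array([[0.25, 0.25],
                  [0.40, 0.10]])
px, py = joint.sum(axis=1), joint.sum(axis=0)

mi = h(px) + h(py) - h(joint)              # I(X; Y) = H(X) + H(Y) - H(X, Y)
print(mi)                                  # nonnegative
assert h(joint) <= h(px) + h(py) + 1e-12   # H(X, Y) <= H(X) + H(Y)
```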
