  1. Entropy & Information. Jilles Vreeken, 29 May 2015

  2. Question of the day: What is information? (and what do talking drums have to do with it?)

  3. Bits and Pieces. What are: information, a bit, entropy, mutual information, divergence, information theory, …

  4. Information Theory. Field founded by Claude Shannon in 1948 with ‘A Mathematical Theory of Communication’; a branch of statistics that is essentially about uncertainty in communication: not what you say, but what you could say.

  5. The Big Insight. Communication is a series of discrete messages. Each message reduces the uncertainty of the recipient about a) the series and b) that message; by how much it does so is the amount of information.

  6. Uncertainty. Shannon showed that uncertainty can be quantified, linking physical entropy to messages, and defined the entropy of a discrete random variable $Y$ as $I(Y) = -\sum_j Q(y_j) \log Q(y_j)$.
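
To make the definition concrete, here is a minimal sketch in Python (base-2 logarithms, so the result is in bits; the function name `entropy` and the example distributions are illustrative only, not from the slides):

```python
import math

def entropy(probs):
    """Shannon entropy I(Y) = -sum_j Q(y_j) * log2 Q(y_j), in bits."""
    return -sum(q * math.log2(q) for q in probs if q > 0)

print(entropy([0.5, 0.5]))  # 1.0: a fair coin carries one full bit per outcome
print(entropy([0.9, 0.1]))  # ~0.469: a biased coin is more predictable, hence less informative
```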

  7. Optimal Prefix Codes. Shannon showed that uncertainty can be quantified, linking physical entropy to messages. A (key) result of Shannon entropy is that $-\log_2 Q(y_j)$ gives the length in bits of the optimal prefix code for a message $y_j$.

  8. Codes and Lengths. A code $D$ maps a set of messages $Y$ to a set of code words $Z$. $M_D(\cdot)$ is the code length function for $D$, with $M_D(y) = |D(y)|$ the length in bits of the code word $D(y) \in Z$ that $D$ assigns to symbol $y \in Y$.

  9. Efficiency. Not all codes are created equal. Let $D_1$ and $D_2$ be two codes for a set of messages $Y$. 1. We call $D_1$ more efficient than $D_2$ if for all $y \in Y$, $M_1(y) \le M_2(y)$, while for at least one $y \in Y$, $M_1(y) < M_2(y)$. 2. We call a code $D$ for set $Y$ complete if there does not exist a code $D'$ that is more efficient than $D$. A code is complete when it does not waste any bits.

  10. The Most Important Slide. We only care about code lengths.

  11. The Most Important Slide. Actual code words are of no interest to us whatsoever.

  12. The Most Important Slide. Our goal is measuring complexity, not instantiating an actual compressor.

  13. My First Code. Let us consider a sequence $T$ over a discrete alphabet $Y = \{y_1, y_2, \ldots, y_n\}$. As code $D$ for $T$ we can instantiate a block code, identifying the value of $t_j \in T$ by an index over $Y$, which requires a constant number of $\log_2 |Y|$ bits per message in $T$, i.e., $M(y_j) = \log_2 |Y|$. We can always instantiate a prefix-free code with code words of these lengths.
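
For illustration, a block code can be written down directly. This is my own sketch, assigning every symbol a fixed-width binary index of ceil(log2 |Y|) bits:

```python
import math

def block_code(alphabet):
    """Assign every symbol a fixed-width binary index (a block code)."""
    width = math.ceil(math.log2(len(alphabet)))
    return {y: format(i, f"0{width}b") for i, y in enumerate(alphabet)}

print(block_code(["a", "b", "c", "d"]))
# {'a': '00', 'b': '01', 'c': '10', 'd': '11'}  ->  log2(4) = 2 bits per message
```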

  14. Codes in a Tree. [Figure: a binary code tree with root, branches 0 and 1, and leaves 00, 01, 10, 11.]

  15. Beyond Uniform. What if we know the distribution $Q(y_j \in Y)$ over $T$ and it is not uniform? We do not want to waste any bits, so using block codes is a bad idea. We do not want to introduce any undue bias, so we want an efficient code that is uniquely decodable without having to use arbitrary-length stop-words. We want an optimal prefix code.

  16. Prefix Codes. A code $D$ is a prefix code iff there is no code word $D(y)$ that is an extension of another code word $D(y')$. In other words, $D$ defines a binary tree with the code words as its leaves. [Figure: a code tree with root and leaves 00, 01, and 1.] How do we find the optimal tree?

  17. Shannon Entropy. Let $Q(y_j)$ be the probability of $y_j \in Y$ in $T$; then the average number of bits needed per message $t_j \in T$, $I(T) = -\sum_{y_j \in Y} Q(y_j) \log Q(y_j)$, is the Shannon entropy of $T$ (wrt $Y$). Here $Q(y_j)$ is the ‘weight’, how often we see $y_j$, and $-\log Q(y_j)$ is the number of bits needed to identify $y_j$ under $Q$ (see Shannon 1948).
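
The same quantity computed from a concrete sequence $T$, using empirical symbol frequencies as $Q$ (an illustrative sketch, not code from the deck):

```python
from collections import Counter
import math

def sequence_entropy(T):
    """Shannon entropy of sequence T, using empirical symbol frequencies as Q."""
    counts = Counter(T)
    n = len(T)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

print(sequence_entropy("aaab"))      # ~0.811 bits per message
print(sequence_entropy("abcdabcd"))  # 2.0 bits per message: uniform over 4 symbols
```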

  18. Optimal Prefix Code Lengths. What if the distribution of $Y$ in $T$ is not uniform? Let $Q(y_j)$ be the probability of $y_j$ in $T$; then $M(y_j) = -\log Q(y_j)$ is the length of the optimal prefix code for message $y_j$ knowing distribution $Q$ (see Shannon 1948).

  19. Kraft’s Inequality. For any code $C$ for finite alphabet $Y = \{y_1, \ldots, y_n\}$, the code word lengths $M_C(\cdot)$ must satisfy the inequality $\sum_{y_j \in Y} 2^{-M(y_j)} \le 1$. a) When a set of code word lengths satisfies the inequality, there exists a prefix code with these code word lengths; b) when it holds with strict equality, the code is complete: it does not waste any part of the coding space; c) when it does not hold, the code is not uniquely decodable.
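
A quick sanity check of the inequality for a few sets of code word lengths (the helper name `kraft_sum` is my own):

```python
def kraft_sum(lengths):
    """Sum of 2^(-M(y_j)) over a set of code word lengths."""
    return sum(2 ** -m for m in lengths)

print(kraft_sum([2, 2, 2, 2]))  # 1.0  -> a complete prefix code exists, e.g. 00, 01, 10, 11
print(kraft_sum([1, 2, 3, 3]))  # 1.0  -> complete, e.g. 0, 10, 110, 111
print(kraft_sum([1, 1, 2]))     # 1.25 -> no uniquely decodable code with these lengths
```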

  20. What’s a bit? Binary digit: the smallest and most fundamental piece of information, a yes or a no; invented by Claude Shannon in 1948, the name is due to John Tukey. Bits have been in use for a long, long time, though: punch cards (1725, 1804), Morse code (1844), African ‘talking drums’.

  21. Morse code

  22. Natural language. Punishes ‘bad’ redundancy: often-used words are shorter. Rewards useful redundancy: cotxent alolws mishaireng/raeding. African talking drums have used this for efficient, fast, long-distance communication: they mimic vocalized sounds (a tonal language), a very reliable means of communication.

  23. Measuring bits. How much information does a given string carry? How many bits? Say we have a binary string of 10000 ‘messages’:
  1) 00010001000100010001…000100010001000100010001000100010001
  2) 01110100110100100110…101011101011101100010110001011011100
  3) 00011000001010100000…001000100001000000100011000000100110
  4) 0000000000000000000000000000100000000000000000000…0000000
  Obviously, all four are 10000 bits long. But are they worth those 10000 bits?

  24. So, how many bits? It depends on the encoding! What is the best encoding? One that takes the entropy of the data into account: things that occur often should get a short code, things that occur seldom should get a long code. An encoding matching the Shannon entropy is optimal.

  25. Tell us! How many bits? Please? In our simplest example we have $Q(1) = 1/100000$ and $Q(0) = 99999/100000$, so $|\mathrm{code}(1)| = -\log(1/100000) = 16.61$ and $|\mathrm{code}(0)| = -\log(99999/100000) = 0.0000144$. So, knowing $Q$, our string contains $1 \cdot 16.61 + 99999 \cdot 0.0000144 = 18.049$ bits of information.
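
This arithmetic can be verified directly (a sketch; the exact values differ slightly from the rounded figures on the slide):

```python
import math

q1, q0 = 1 / 100000, 99999 / 100000
bits_1 = -math.log2(q1)  # ~16.61 bits for the single 1
bits_0 = -math.log2(q0)  # ~0.0000144 bits for each 0
print(1 * bits_1 + 99999 * bits_0)  # ~18.05 bits of information in total
```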

  26. Optimal…. Shannon lets us calculate optimal code lengths, but what about actual codes? 0.0000144 bits? Shannon and Fano invented a near-optimal encoding in 1948: within one bit of the optimum, but not the lowest expected length. Fano gave his students an option: take the regular exam, or invent a better encoding. David Huffman didn’t like exams; he invented Huffman codes (1952), which are optimal for symbol-by-symbol encoding with fixed probabilities (arithmetic coding is overall optimal, Rissanen 1976).
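
As an illustration of the idea (not the deck’s own implementation), a compact Huffman construction in Python, which repeatedly merges the two least probable subtrees:

```python
import heapq

def huffman_code(probs):
    """Build a Huffman code: repeatedly merge the two least probable subtrees."""
    heap = [(q, i, {y: ""}) for i, (y, q) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)  # tie-breaker so equal probabilities never compare dicts
    while len(heap) > 1:
        q1, _, c1 = heapq.heappop(heap)
        q2, _, c2 = heapq.heappop(heap)
        merged = {y: "0" + w for y, w in c1.items()}
        merged.update({y: "1" + w for y, w in c2.items()})
        heapq.heappush(heap, (q1 + q2, counter, merged))
        counter += 1
    return heap[0][2]

print(huffman_code({"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}))
# e.g. {'a': '0', 'b': '10', 'c': '110', 'd': '111'}; lengths match -log2 Q exactly here
```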

  27. Optimality. To encode optimally, we need optimal probabilities. What happens if we don’t?

  28. Measuring Divergence. The Kullback-Leibler divergence from $R$ to $Q$, denoted $E(Q \| R)$, measures the number of bits we ‘waste’ when we use $R$ while $Q$ is the ‘true’ distribution: $E(Q \| R) = \sum_j Q(j) \log \frac{Q(j)}{R(j)}$.
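
A direct sketch of this formula (the function name `kl_divergence` is mine; base-2 logs, so the result is in bits, and $R$ is assumed nonzero wherever $Q$ is):

```python
import math

def kl_divergence(Q, R):
    """E(Q || R) = sum_j Q(j) * log2( Q(j) / R(j) ), in bits."""
    return sum(q * math.log2(q / r) for q, r in zip(Q, R) if q > 0)

true_dist = [0.9, 0.1]
assumed = [0.5, 0.5]
print(kl_divergence(true_dist, assumed))    # ~0.531 bits wasted per message
print(kl_divergence(true_dist, true_dist))  # 0.0: no waste when we use the true distribution
```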

  29. Multivariate Entropy. So far we have been thinking about a single sequence of messages. How does entropy work for multivariate data? Simple!

  30. Towards Mutual Information. Conditional entropy is defined as $I(Y \mid Z) = \sum_{z \in Z} Q(z)\, I(Y \mid Z = z)$, the ‘average number of bits needed for a message $y \in Y$ knowing $Z$’.
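
A sketch of conditional entropy computed from a joint distribution table; the dictionary-of-pairs representation and the toy example are my own choices, not from the slides:

```python
import math
from collections import defaultdict

def conditional_entropy(joint):
    """I(Y|Z) = sum_z Q(z) * I(Y | Z=z), with joint[(y, z)] = Q(y, z)."""
    q_z = defaultdict(float)
    for (_, z), q in joint.items():
        q_z[z] += q
    # Equivalent form: -sum_{y,z} Q(y,z) * log2( Q(y,z) / Q(z) )
    return -sum(q * math.log2(q / q_z[z]) for (_, z), q in joint.items() if q > 0)

# Y and Z fully dependent: knowing Z leaves no uncertainty about Y.
joint = {("a", 0): 0.5, ("b", 1): 0.5}
print(conditional_entropy(joint))  # 0.0
```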

  31. Mutual Information. The amount of information shared between two variables $Y$ and $Z$: $J(Y, Z) = I(Y) - I(Y \mid Z) = I(Z) - I(Z \mid Y) = \sum_{z \in Z} \sum_{y \in Y} Q(y, z) \log \frac{Q(y, z)}{Q(y)\, Q(z)}$. High $J(Y, Z)$ implies correlation, low $J(Y, Z)$ implies independence. Information is symmetric!
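
Mutual information from the same kind of joint table, again as an illustrative sketch: the fully dependent example shares exactly one bit, the independent one shares none.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """J(Y,Z) = sum_{y,z} Q(y,z) * log2( Q(y,z) / (Q(y)*Q(z)) ), with joint[(y, z)] = Q(y, z)."""
    q_y, q_z = defaultdict(float), defaultdict(float)
    for (y, z), q in joint.items():
        q_y[y] += q
        q_z[z] += q
    return sum(q * math.log2(q / (q_y[y] * q_z[z]))
               for (y, z), q in joint.items() if q > 0)

dependent = {("a", 0): 0.5, ("b", 1): 0.5}
print(mutual_information(dependent))   # 1.0 bit shared
independent = {("a", 0): 0.25, ("a", 1): 0.25,
               ("b", 0): 0.25, ("b", 1): 0.25}
print(mutual_information(independent)) # 0.0
```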

  32. Information Gain. (A small aside.) Entropy and KL divergence are used in decision trees. What is the best split in a tree? One that results in label distributions in the sub-nodes that are as homogeneous as possible: minimal entropy. How do we compare over multiple options? $IG(U, b) = I(U) - I(U \mid b)$.
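
A sketch of information gain for a binary split on toy labels (all names and data here are invented for illustration):

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    """Shannon entropy of a label list, from empirical frequencies."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(labels, split_mask):
    """IG = I(labels) - I(labels | split): entropy minus weighted sub-node entropy."""
    left = [l for l, s in zip(labels, split_mask) if s]
    right = [l for l, s in zip(labels, split_mask) if not s]
    cond = (len(left) * entropy_of_labels(left) +
            len(right) * entropy_of_labels(right)) / len(labels)
    return entropy_of_labels(labels) - cond

labels = ["+", "+", "-", "-"]
print(information_gain(labels, [True, True, False, False]))  # 1.0: a perfect split
print(information_gain(labels, [True, False, True, False]))  # 0.0: a useless split
```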

  33. Low-Entropy Sets. Goal: find sets of attributes that interact strongly. Task: mine all sets of attributes such that the entropy over their value instantiations is ≤ τ (Heikinheimo et al. 2007). Example:
  Theory of Probability | Computation Theory | count
  No                    | No                 | 1887
  Yes                   | No                 | 156
  No                    | Yes                | 143
  Yes                   | Yes                | 219
  entropy: 1.087 bits
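
The 1.087 bits on this slide can be reproduced from the counts (a quick check, reusing the entropy idea from earlier):

```python
import math

counts = {("No", "No"): 1887, ("Yes", "No"): 156,
          ("No", "Yes"): 143, ("Yes", "Yes"): 219}
n = sum(counts.values())
entropy = -sum((c / n) * math.log2(c / n) for c in counts.values())
print(round(entropy, 3))  # 1.087 bits over the value instantiations, matching the slide
```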
