T-61.182 Information Theory and Machine Learning Data Compression (Chapters 4-6) presented by Tapani Raiko Feb 26, 2004
Contents (Data Compression)

            Chap. 4                          Chap. 5                    Chap. 6
  Data      Block                            Symbol                     Stream
  Lossy?    Lossy                            Lossless                   Lossless
  Result    Shannon’s source coding theorem  Huffman coding algorithm   Arithmetic coding algorithm
Weighing Problem (What is information?)
• 12 balls, all equal in weight except for one
• A two-pan balance to use
• Determine which is the odd ball and whether it is heavier or lighter
• As few uses of the balance as possible!
• The outcome of a random experiment is guaranteed to be most informative if the probability distribution over outcomes is uniform
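As a quick check of the last point (a minimal Python sketch of the counting argument, not of the strategy itself): there are 12 × 2 = 24 equiprobable hypotheses and each use of the balance has at most three outcomes, so three weighings are needed and, as the tree below shows, also sufficient.

```python
import math

# 12 balls, each could be the odd one and either heavier or lighter: 24 hypotheses
num_hypotheses = 12 * 2
info_needed = math.log2(num_hypotheses)      # about 4.58 bits

# A two-pan balance has 3 outcomes (left heavier, right heavier, balanced).
# It delivers at most log2(3) bits, and exactly that only when the three
# outcomes are equally likely (the "uniform is most informative" point).
info_per_weighing = math.log2(3)             # about 1.58 bits

print(math.ceil(info_needed / info_per_weighing))   # 3 weighings at minimum
```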
[Figure: decision tree for the optimal weighing strategy. The first weighing is balls 1 2 3 4 against 5 6 7 8; each subsequent weighing is chosen so that its three outcomes are as close to equiprobable as possible, and every branch identifies the odd ball and whether it is heavier (+) or lighter (−) in three weighings.]
Definitions
• Shannon information content: h(x = a_i) ≡ log_2(1/p_i)
• Entropy: H(X) = Σ_i p_i log_2(1/p_i)
• Both are additive for independent variables

  p      h(p)   H_2(p)
  0.001  10.0   0.011
  0.01    6.6   0.081
  0.1     3.3   0.47
  0.2     2.3   0.72
  0.5     1.0   1.0

[Figure: h(p) = log_2(1/p) and the binary entropy H_2(p) plotted as functions of p.]
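A small sketch reproducing the table above, assuming nothing beyond the two definitions:

```python
import math

def h(p):
    """Shannon information content (in bits) of an outcome with probability p."""
    return math.log2(1.0 / p)

def H2(p):
    """Binary entropy (in bits) of the distribution {p, 1 - p}."""
    return p * math.log2(1.0 / p) + (1.0 - p) * math.log2(1.0 / (1.0 - p))

for p in (0.001, 0.01, 0.1, 0.2, 0.5):
    print(f"{p:<6} {h(p):5.1f} {H2(p):6.3f}")
# 0.001   10.0  0.011
# 0.01     6.6  0.081
# 0.1      3.3  0.469
# 0.2      2.3  0.722
# 0.5      1.0  1.000
```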
Game of Submarine
• Player hides a submarine in one square of an 8 by 8 grid
• Another player tries to hit it

[Figure: the 8 by 8 grid after selected moves; misses are marked ×, the final hit S.]

  move #       1       2       32      48      49
  question     G3      B1      E5      F3      H3
  outcome      x = n   x = n   x = n   x = n   x = y
  P(x)         63/64   62/63   32/33   16/17   1/16
  h(x)         0.0227  0.0230  0.0443  0.0874  4.0
  Total info.  0.0227  0.0458  1.0     2.0     6.0

• Compare to asking 6 yes/no questions about the location
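A sketch of the bookkeeping behind the table: each miss rules out one square, the information contents of the answers add, and they reach exactly 6 bits when the submarine is hit (the move on which the hit occurs, 49, is taken from the slide):

```python
import math

def h(p):
    """Shannon information content of an outcome with probability p, in bits."""
    return math.log2(1.0 / p)

remaining = 64        # squares still possible before each question
total_info = 0.0
hit_on_move = 49      # the answer is "yes" on the 49th question (from the slide)

for move in range(1, hit_on_move + 1):
    if move < hit_on_move:
        p = (remaining - 1) / remaining   # probability of the answer "no"
    else:
        p = 1 / remaining                 # probability of the answer "yes"
    total_info += h(p)
    remaining -= 1

print(total_info)     # 6.0 bits, the same as asking 6 yes/no questions
```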
Raw Bit Content
• A binary name is given to each outcome of a random variable X
• The length of the names would be log_2 |A_X| (assuming |A_X| happens to be a power of 2)
• Define: the raw bit content of X is H_0(X) = log_2 |A_X|
• Simply counts the possible outcomes, no compression yet
• Additive: H_0(X, Y) = H_0(X) + H_0(Y)
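A tiny sketch of the definition and its additivity (the outcome counts 8 and 4 are arbitrary example values):

```python
import math

def H0(num_outcomes):
    """Raw bit content: log2 of the number of possible outcomes; probabilities are ignored."""
    return math.log2(num_outcomes)

print(H0(8), H0(4))                # 3.0, 2.0
print(H0(8 * 4), H0(8) + H0(4))    # the joint variable has 8 * 4 outcomes: 5.0 = 5.0
```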
Lossy Compression
• Let A_X = {a, b, c, d, e, f, g, h} and P_X = {1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64}
• The raw bit content is 3 bits (8 binary names)
• If we are willing to run a risk of δ = 1/16 of not having a name for x, then we can get by with 2 bits (4 names)

  δ = 0        δ = 1/16
  x   c(x)     x   c(x)
  a   000      a   00
  b   001      b   01
  c   010      c   10
  d   011      d   11
  e   100      e   −
  f   101      f   −
  g   110      g   −
  h   111      h   −
[Figure: the outcomes of X ranked by their probability on the log_2 P(x) axis; S_0 contains all eight outcomes, while S_{1/16} = {a, b, c, d} drops the four least probable ones.]
Essential Bit Content
• Allow an error with probability δ
• Choose the smallest sufficient subset S_δ such that P(x ∈ S_δ) ≥ 1 − δ (arrange the elements of A_X in order of decreasing probability and take enough from the beginning)
• Define: the essential bit content of X is H_δ(X) = log_2 |S_δ|
• Note that the raw bit content H_0 is the special case H_δ with δ = 0
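A sketch of H_δ(X) for the eight-outcome ensemble of the Lossy Compression slide (the helper name essential_bit_content is ours): sort the probabilities in decreasing order, keep outcomes until their total probability reaches 1 − δ, and take log_2 of the count.

```python
import math

# The ensemble from the Lossy Compression slide
p = [1/4, 1/4, 1/4, 3/16, 1/64, 1/64, 1/64, 1/64]   # a, b, ..., h

def essential_bit_content(probs, delta):
    """H_delta(X): log2 of the size of the smallest subset with probability >= 1 - delta."""
    probs = sorted(probs, reverse=True)
    total, count = 0.0, 0
    for pi in probs:
        if total >= 1.0 - delta:   # already enough probability mass
            break
        total += pi
        count += 1
    return math.log2(count)

print(essential_bit_content(p, 0))       # 3.0, i.e. the raw bit content H_0(X)
print(essential_bit_content(p, 1/16))    # 2.0, since e, f, g, h (total mass 1/16) are dropped
```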
[Figure: the essential bit content H_δ(X) as a function of the allowed probability of error δ; it drops in steps from 3 bits (S_δ = {a, b, c, d, e, f, g, h}) down to 0 bits (S_δ = {a}) as δ grows.]
Extended Ensembles (Blocks)
• Consider a tuple of N i.i.d. random variables
• Denote by X^N the ensemble (X_1, X_2, ..., X_N)
• Entropy is additive: H(X^N) = N H(X)
• Example: N flips of a bent coin: p_0 = 0.9, p_1 = 0.1
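A sketch that builds the block ensemble X^N for the bent coin and checks the additivity of the entropy (N = 4 is chosen to match the figures that follow):

```python
import math
from itertools import product

p0, p1 = 0.9, 0.1
N = 4

# Probability of every length-N outcome of the bent coin
P = {x: p0 ** x.count('0') * p1 ** x.count('1')
     for x in (''.join(bits) for bits in product('01', repeat=N))}

# Entropy of a single flip and of the block: H(X^N) = N * H(X)
H1 = p0 * math.log2(1 / p0) + p1 * math.log2(1 / p1)
HN = sum(px * math.log2(1 / px) for px in P.values())
print(round(H1, 3), round(HN, 3), round(N * H1, 3))   # 0.469  1.876  1.876
```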
[Figure: outcomes of the bent coin ensemble X^4 ranked by their probability on the log_2 P(x) axis, from 0000 (most probable) to 1111 (least probable), with the subsets S_0.01 and S_0.1 indicated.]
[Figure: the essential bit content H_δ(X^4) of the bent coin ensemble as a function of δ.]
[Figure: the essential bit content H_δ(X^10) of the bent coin ensemble as a function of δ.]
[Figure: essential bit content per toss, (1/N) H_δ(X^N), for N = 10, 210, 410, 610, 810, 1010; as N grows the curves flatten out towards H(X).]
Shannon’s Source Coding Theorem
Given ε > 0 and 0 < δ < 1, there exists a positive integer N_0 such that for N > N_0,
  | (1/N) H_δ(X^N) − H(X) | < ε.
• Proof involves
  – the law of large numbers
  – Chebyshev’s inequality

[Figure: (1/N) H_δ(X^N) as a function of δ; it starts at H_0(X) for δ = 0 and, for large N, lies between H − ε and H + ε for all 0 < δ < 1.]
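The theorem can be illustrated numerically for the bent coin. The sketch below computes (1/N) H_δ(X^N) exactly by grouping strings with the same number of 1s, and shows it creeping towards H(X) ≈ 0.47 bits as N grows; the particular N and δ values are arbitrary choices.

```python
import math

p0, p1 = 0.9, 0.1
H = p0 * math.log2(1 / p0) + p1 * math.log2(1 / p1)   # H(X), about 0.469 bits

def H_delta_per_symbol(N, delta):
    """(1/N) * H_delta(X^N) for the bent coin.

    All strings with r ones share the probability p1^r * p0^(N-r), so outcomes are
    added in whole (or partial) groups of C(N, r) strings, most probable first,
    until their total mass reaches 1 - delta.
    """
    total, count = 0.0, 0
    for r in range(N + 1):                  # r = 0 gives the most probable strings
        if total >= 1.0 - delta:
            break
        px = p1 ** r * p0 ** (N - r)
        need = min(math.comb(N, r), math.ceil((1.0 - delta - total) / px))
        total += need * px
        count += need
    return math.log2(count) / N

for N in (10, 100, 1000):
    print(N, round(H_delta_per_symbol(N, 0.1), 3))     # approaches H(X) ~ 0.469 as N grows
```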
[Table: fifteen random samples from X^100 with their log_2 P(x) values, ranging from about −37 to −66 and clustering around −46.9, plus the all-zeros string (log_2 P(x) = −15.2, the single most probable outcome) and the all-ones string (log_2 P(x) = −332.1). Compare to H(X^100) = 46.9 bits.]
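A sketch reproducing this experiment: draw strings of length 100 from the bent coin and print their log_2 P(x), together with the two extreme strings (the random seed and the number of samples are arbitrary choices):

```python
import math
import random

p0, p1 = 0.9, 0.1
N = 100
random.seed(0)                    # arbitrary seed, only for reproducibility

def log2P(x):
    """log2 of the probability of a 0/1 string under the bent coin."""
    r = sum(x)
    return r * math.log2(p1) + (N - r) * math.log2(p0)

for _ in range(15):
    x = [1 if random.random() < p1 else 0 for _ in range(N)]
    print(round(log2P(x), 1))     # clusters around -H(X^100) = -46.9

print(round(log2P([0] * N), 1))   # -15.2: the single most probable string
print(round(log2P([1] * N), 1))   # -332.2: the least probable string
```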
Typicality
• A string contains r 1s and N − r 0s
• Consider r as a random variable (binomial distribution)
• Mean and standard deviation: r ∼ N p_1 ± √(N p_1 (1 − p_1))
• A typical string is one with r ≃ N p_1
• In general, the information content of a typical string lies within N [H(X) ± β]:
  log_2(1/P(x)) ≃ N Σ_i p_i log_2(1/p_i) = N H(X)
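A sketch of the typicality statement for the bent coin: the number of 1s in a sample concentrates around N p_1, so log_2(1/P(x)) concentrates around N H(X).

```python
import math
import random

p0, p1 = 0.9, 0.1
N = 1000
H = p0 * math.log2(1 / p0) + p1 * math.log2(1 / p1)

print(N * p1, round(math.sqrt(N * p1 * (1 - p1)), 1))   # r ~ 100 +- 9.5

random.seed(1)                                           # arbitrary seed
for _ in range(5):
    r = sum(1 for _ in range(N) if random.random() < p1)
    info = r * math.log2(1 / p1) + (N - r) * math.log2(1 / p0)
    print(r, round(info, 1), "vs N*H(X) =", round(N * H, 1))
```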
[Figure: anatomy of the typical set T, for N = 100 and N = 1000. Top: the number of strings with r ones, n(r) = C(N, r). Middle: log_2 P(x) for a string with r ones, with the typical set T marked around r ≈ N p_1. Bottom: the total probability n(r) P(x) = C(N, r) p_1^r (1 − p_1)^(N − r), which concentrates around r = N p_1.]
[Figure: outcomes of X^N ranked by their probability on the log_2 P(x) axis, from the all-ones string (least probable) down to the all-zeros string (most probable); the typical set T_{Nβ} is the band of width 2Nβ around log_2 P(x) = −N H(X).]