

  1. Information Theory, Statistics, and Decision Trees Léon Bottou COS 424 – 4/6/2010

  2. Summary 1. Basic information theory. 2. Decision trees. 3. Information theory and statistics. Léon Bottou 2/31 COS 424 – 4/6/2010

  3. I. Basic information theory Léon Bottou 3/31 COS 424 – 4/6/2010

  4. Why do we care? Information theory – Invented by Claude Shannon in 1948: A Mathematical Theory of Communication, Bell System Technical Journal, October 1948. – The “quantity of information” measured in “bits”. – The “capacity of a transmission channel”. – Data coding and data compression. Information gain – A derived concept. – Quantifies how much information we acquire about a phenomenon. – A justification for the Kullback-Leibler divergence. Léon Bottou 4/31 COS 424 – 4/6/2010

  5. The coding paradigm Intuition – The quantity of information of a message is the length of the smallest code that can represent the message. Paradigm – Assume there are n possible messages i = 1, ..., n. – We want a signal that indicates the occurrence of one of them. – We can transmit an alphabet of r symbols. For instance a wire could carry r = 2 electrical levels. – The code for message i is a sequence of $l_i$ symbols. Properties – Codes should be uniquely decodable. – Average code length for a message: $\sum_{i=1}^{n} p_i l_i$. Léon Bottou 5/31 COS 424 – 4/6/2010
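
As a quick illustration of the average code length formula above, the short sketch below (not from the lecture; the probabilities and lengths are made-up values) simply evaluates $\sum_i p_i l_i$ for a hypothetical four-message source.

```python
# Illustrative sketch: expected code length sum_i p_i * l_i for an assumed
# four-message source. Probabilities and code lengths are made-up values.
probs = [0.5, 0.25, 0.125, 0.125]   # assumed message probabilities p_i
lengths = [1, 2, 3, 3]              # assumed code lengths l_i (in symbols)

avg_length = sum(p * l for p, l in zip(probs, lengths))
print(avg_length)  # 1.75 symbols per message on average
```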

  6. Prefix codes [Figure: a code tree giving the codewords for messages 1 to 6.] – Messages 1 and 2 have codes one symbol long ($l_i = 1$). – Messages 3 and 4 have codes two symbols long ($l_i = 2$). – Messages 5 and 6 have codes three symbols long ($l_i = 3$). – There is an unused three-symbol code. That’s inefficient. Properties – Prefix codes are uniquely decodable. – There are trickier kinds of uniquely decodable codes, e.g. a ↦ 0, b ↦ 01, c ↦ 011 versus a ↦ 0, b ↦ 10, c ↦ 110. Léon Bottou 6/31 COS 424 – 4/6/2010
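
To make the unique-decodability of prefix codes concrete, here is a small decoding sketch (my own example code table, not the tree from the slide): because no codeword is a prefix of another, a single left-to-right scan can emit each message as soon as its codeword is recognized.

```python
# Sketch of prefix-code decoding. The code table below is a made-up example.
code = {"0": 1, "10": 2, "110": 3, "111": 4}

def decode(bits):
    messages, buffer = [], ""
    for symbol in bits:
        buffer += symbol
        if buffer in code:          # unique match thanks to the prefix property
            messages.append(code[buffer])
            buffer = ""
    return messages

print(decode("011010111"))  # [1, 3, 2, 4]
```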

  7. Kraft inequality Uniquely decodable codes satisfy $\sum_{i=1}^{n} \left(\frac{1}{r}\right)^{l_i} \le 1$. – All uniquely decodable codes satisfy this inequality. – If integer code lengths $l_i$ satisfy this inequality, there exists a prefix code with such code lengths. Consequences – If some messages have short codes, others must have long codes. – To minimize the average code length: - give short codes to high probability messages. - give long codes to low probability messages. – Equiprobable messages should have similar code lengths. Léon Bottou 7/31 COS 424 – 4/6/2010
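
The inequality itself is easy to test numerically. The helper below is a generic sketch (not course code) that evaluates $\sum_i r^{-l_i}$ for a proposed set of code lengths.

```python
def kraft_sum(lengths, r=2):
    """Sum of r**(-l_i); it must be <= 1 for a uniquely decodable code to exist."""
    return sum(r ** (-l) for l in lengths)

print(kraft_sum([1, 2, 3, 3]))   # 1.0  -> a prefix code with these lengths exists
print(kraft_sum([1, 1, 2]))      # 1.25 -> no uniquely decodable code has these lengths
```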

  8. Kraft inequality for prefix codes Prefix codes satisfy the Kraft inequality: $\sum_i r^{\,l - l_i} \le r^{l} \iff \sum_i \left(\frac{1}{r}\right)^{l_i} \le 1$ [Figure: a balanced r-ary code tree with pruned subtrees.] All uniquely decodable codes satisfy the Kraft inequality – Proof must deal with infinite sequences of messages. Given integer code lengths $l_i$: – Build a balanced r-ary tree of depth $l = \max_i l_i$. – For each message, prune one subtree at depth $l_i$. – The Kraft inequality ensures that there will be enough branches left to define a code for each message. Léon Bottou 8/31 COS 424 – 4/6/2010
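
The constructive half of the statement (lengths satisfying the inequality admit a prefix code) can be mimicked in a few lines. The function below is a sketch of the tree-pruning argument using the usual "next free codeword" bookkeeping; it assumes the given lengths already satisfy the Kraft inequality.

```python
def prefix_code_from_lengths(lengths, r=2):
    """Sketch: build a prefix code with the given codeword lengths, assuming
    they satisfy the Kraft inequality. Processing lengths in increasing order
    and taking the next free codeword mirrors pruning one subtree of depth
    l_i from a balanced r-ary tree."""
    order = sorted(range(len(lengths)), key=lambda i: lengths[i])
    codes = [""] * len(lengths)
    value, prev_len = 0, 0
    for i in order:
        value *= r ** (lengths[i] - prev_len)   # descend to depth l_i
        digits, v = [], value
        for _ in range(lengths[i]):             # write `value` in base r
            digits.append(str(v % r))
            v //= r
        codes[i] = "".join(reversed(digits))
        value += 1                              # next free node at this depth
        prev_len = lengths[i]
    return codes

print(prefix_code_from_lengths([1, 2, 3, 3]))       # ['0', '10', '110', '111']
print(prefix_code_from_lengths([1, 1, 2, 2], r=3))  # ['0', '1', '20', '21']
```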

  9. Redundant codes Assume $\sum_i r^{-l_i} < 1$ – There are leftover branches in the tree. – There are codes that are not used, or – There are multiple codes for each message. For best compression, $\sum_i r^{-l_i} = 1$ – This is not always possible with integer code lengths $l_i$. – But we can use this to compute a lower bound. Léon Bottou 9/31 COS 424 – 4/6/2010

  10. Lower bound for the average code length Choose code lengths $l_i$ that solve $\min_{l_1 \ldots l_n} \sum_i p_i l_i$ subject to $\sum_i r^{-l_i} = 1$, $l_i > 0$. – Define $s_i = r^{-l_i}$, that is, $l_i = -\log_r(s_i)$. – Maximize $C = \sum_i p_i \log_r(s_i)$ subject to $\sum_i s_i = 1$. – We get $\frac{\partial C}{\partial s_i} = \frac{p_i}{s_i \log(r)} = \text{constant}$, that is, $s_i \propto p_i$. – Replacing in the constraint gives $s_i = p_i$. Therefore $l_i = -\log_r(p_i)$ and $\sum_i p_i l_i = -\sum_i p_i \log_r(p_i)$. Fractional code lengths – What does it mean to code a message on 0.5 symbols? Léon Bottou 10/31 COS 424 – 4/6/2010
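
Numerically, the lower bound $-\sum_i p_i \log_r(p_i)$ can be compared with what integer code lengths actually achieve. The sketch below (illustrative distribution, not from the slides) rounds the fractional lengths $-\log_2(p_i)$ up to integers, as Shannon coding would, and shows the gap.

```python
from math import ceil, log

def entropy_lower_bound(probs, r=2):
    """-sum_i p_i log_r(p_i): best achievable average code length (in r-ary symbols)."""
    return -sum(p * log(p, r) for p in probs if p > 0)

probs = [0.4, 0.3, 0.2, 0.1]              # assumed source distribution
ideal = [-log(p, 2) for p in probs]       # fractional lengths -log_2(p_i)
integer = [ceil(l) for l in ideal]        # integer lengths (still satisfy Kraft)

print(entropy_lower_bound(probs))                  # ~1.85 bits/message (lower bound)
print(sum(p * l for p, l in zip(probs, integer)))  # 2.4 bits/message with integer lengths
```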

  11. Arithmetic coding – An infinite sequence of messages i_1, i_2, ... can be viewed as a number x = 0.i_1 i_2 i_3 ... in base n. – An infinite sequence of symbols c_1, c_2, ... can be viewed as a number y = 0.c_1 c_2 c_3 ... in base r. [Figure: nested subdivisions of the unit interval for messages (base n) and code symbols (base r).] Léon Bottou 11/31 COS 424 – 4/6/2010

  12. Arithmetic coding [Figure: the interval representing the message sequence narrows with each successive message.] To encode a sequence of L messages i_1, ..., i_L: – The code y must belong to an interval of size $\prod_{k=1}^{L} p_{i_k}$. – It is sufficient to specify $l(i_1 i_2 \ldots i_L) = \left\lceil -\log_r \prod_{k=1}^{L} p_{i_k} \right\rceil$ digits of y. Léon Bottou 12/31 COS 424 – 4/6/2010

  13. Arithmetic coding To encode a sequence of L messages i_1, ..., i_L: – It is sufficient to specify $l(i_1 i_2 \ldots i_L) = \left\lceil -\log_r \prod_{k=1}^{L} p_{i_k} \right\rceil$ digits of y. – The average code length per message is $\frac{1}{L} \sum_{i_1 i_2 \ldots i_L} p_{i_1} \cdots p_{i_L} \left\lceil -\sum_{k=1}^{L} \log_r(p_{i_k}) \right\rceil \xrightarrow{L \to \infty} -\frac{1}{L} \sum_{i_1 i_2 \ldots i_L} p_{i_1} \cdots p_{i_L} \sum_{k=1}^{L} \log_r(p_{i_k}) = -\frac{1}{L} \sum_{k=1}^{L} \sum_{i_k} p_{i_k} \log_r(p_{i_k}) \sum_{i_1 \ldots i_L \setminus i_k} \prod_{h \ne k} p_{i_h} = -\sum_i p_i \log_r(p_i)$. Arithmetic coding reaches the lower bound when $L \to \infty$. Léon Bottou 13/31 COS 424 – 4/6/2010
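
A toy encoder makes the interval argument concrete. The sketch below (idealized, infinite-precision floats; the probabilities are assumptions, not from the slides) narrows the unit interval once per message; the final interval has size $\prod_k p_{i_k}$, so roughly $-\log_2$ of that size in bits suffices to locate a code point inside it.

```python
from math import ceil, log2, prod

probs = {1: 0.5, 2: 0.25, 3: 0.25}   # assumed message probabilities

def encode_interval(messages):
    """Return the subinterval of [0, 1) that represents the message sequence."""
    low, high = 0.0, 1.0
    for m in messages:
        width = high - low
        cumulative = 0.0
        for symbol, p in sorted(probs.items()):
            if symbol == m:
                low, high = low + cumulative * width, low + (cumulative + p) * width
                break
            cumulative += p
    return low, high

sequence = [1, 3, 2, 1]
low, high = encode_interval(sequence)
bits = ceil(-log2(prod(probs[m] for m in sequence)))
print(low, high, bits)   # interval of size 1/64, so 6 bits locate a point inside it
```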

  14. Quantity of information Optimal code length: $l_i = -\log_r(p_i)$. Optimal expected code length: $\sum_i p_i l_i = -\sum_i p_i \log_r(p_i)$. Receiving a message x with probability $p_x$: – The acquired information is $h(x) = -\log_2(p_x)$ bits. – An informative message is a surprising message! Expecting a message X with distribution $p_1 \ldots p_n$: – The expected information is $H(X) = -\sum_{x \in \mathcal{X}} p_x \log_2(p_x)$ bits. – This is also called entropy. These are two distinct definitions! Note how we switched to logarithms in base two. This is a multiplicative factor: $\log_2(p) = \log_r(p) \log_2(r)$. Choosing base 2 defines a unit of information: the bit. Léon Bottou 14/31 COS 424 – 4/6/2010
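
The two definitions are easy to confuse, so here is a tiny numerical sketch (values assumed for illustration) separating the information of one observed message, $h(x)$, from the entropy of the source, $H(X)$.

```python
from math import log2

def surprise(p):
    """h(x) = -log2(p(x)): information, in bits, carried by one observed message."""
    return -log2(p)

def entropy(probs):
    """H(X) = -sum_x p_x log2(p_x): expected information of the source, in bits."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(surprise(0.5))               # 1.0 bit: a fair coin flip is mildly informative
print(surprise(1 / 1024))          # 10.0 bits: a rare message is very informative
print(entropy([0.5, 0.25, 0.25]))  # 1.5 bits expected per message
```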

  15. Mutual information [Worked example table omitted.] – Expected information: $H(X) = -\sum_i P(X=i) \log P(X=i)$ – Joint information: $H(X,Y) = -\sum_{i,j} P(X=i, Y=j) \log P(X=i, Y=j)$ – Mutual information: $I(X,Y) = H(X) + H(Y) - H(X,Y)$ Léon Bottou 15/31 COS 424 – 4/6/2010
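
Given a joint probability table, the identity $I(X,Y) = H(X) + H(Y) - H(X,Y)$ translates directly into code. The 2x2 joint distribution below is a made-up illustration, not the example table from the slide.

```python
from math import log2

def entropy_bits(probs):
    """Entropy in bits of a list of probabilities (zeros ignored)."""
    return -sum(p * log2(p) for p in probs if p > 0)

def mutual_information(joint):
    """I(X,Y) = H(X) + H(Y) - H(X,Y) from a joint probability table (list of rows)."""
    px = [sum(row) for row in joint]          # marginal distribution of X
    py = [sum(col) for col in zip(*joint)]    # marginal distribution of Y
    pxy = [p for row in joint for p in row]   # flattened joint distribution
    return entropy_bits(px) + entropy_bits(py) - entropy_bits(pxy)

# Assumed 2x2 joint distribution P(X=i, Y=j); illustrative only.
joint = [[0.4, 0.1],
         [0.1, 0.4]]
print(mutual_information(joint))   # ~0.278 bits shared between X and Y
```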

  16. II. Decision trees Léon Bottou 16/31 COS 424 – 4/6/2010
