Information Theory
Maneesh Sahani
Gatsby Computational Neuroscience Unit, University College London
February 2019

Quantifying a Code: How much information does a neural response carry about a stimulus? How efficient is a hypothetical code?


1. Kullback-Leibler Divergence

Another useful information-theoretic quantity measures the difference between two distributions:

\[
\mathrm{KL}[P(S)\,\|\,Q(S)] = \sum_s P(s) \log \frac{P(s)}{Q(s)}
= -H[P] + \underbrace{\sum_s P(s) \log \frac{1}{Q(s)}}_{\text{cross entropy}}
\]

This is the excess cost in bits paid by encoding according to $Q$ instead of $P$.

\[
-\mathrm{KL}[P\|Q] = \sum_s P(s) \log \frac{Q(s)}{P(s)}
\;\le\; \log \sum_s P(s)\,\frac{Q(s)}{P(s)} \quad \text{(by Jensen)}
\;=\; \log \sum_s Q(s) = \log 1 = 0
\]

So $\mathrm{KL}[P\|Q] \ge 0$, with equality iff $P = Q$.
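As a quick numerical check (my own illustration, not from the slides), the sketch below computes $H[P]$, the cross entropy, and $\mathrm{KL}[P\|Q]$ for two small discrete distributions, confirming that KL equals cross entropy minus entropy and is non-negative; the particular P and Q are arbitrary choices.

```python
import numpy as np

def entropy(p):
    """H[P] in bits."""
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def cross_entropy(p, q):
    """Expected code length (bits) when samples from P are encoded with a code built for Q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(q[nz]))

def kl(p, q):
    """KL[P || Q] = cross entropy - H[P]."""
    return cross_entropy(p, q) - entropy(p)

P = np.array([0.5, 0.25, 0.25])   # illustrative source distribution
Q = np.array([0.25, 0.25, 0.5])   # illustrative coding distribution

print(kl(P, Q))   # > 0 since P != Q
print(kl(P, P))   # 0.0: equality iff P = Q
```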

2. Mutual Information and KL

\[
I[S;R] = \sum_{s,r} P(s,r) \log \frac{P(s,r)}{P(s)P(r)} = \mathrm{KL}\big[P(S,R)\,\|\,P(S)P(R)\big]
\]

Thus:
1. Mutual information is always non-negative: $I[S;R] \ge 0$.
2. Conditioning never increases entropy: $H[S|R] \le H[S]$.
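A small sketch (my own, not from the slides): compute $I[S;R]$ directly from a joint table as the KL divergence between the joint and the product of its marginals, and check non-negativity. The example joint distributions are illustrative.

```python
import numpy as np

def mutual_information(joint):
    """I[S;R] in bits, computed as KL[P(S,R) || P(S)P(R)]."""
    joint = np.asarray(joint, dtype=float)
    ps = joint.sum(axis=1, keepdims=True)    # P(s)
    pr = joint.sum(axis=0, keepdims=True)    # P(r)
    nz = joint > 0
    return np.sum(joint[nz] * np.log2(joint[nz] / (ps * pr)[nz]))

# Illustrative joint distribution P(s, r): rows index s, columns index r.
P_sr = np.array([[0.3, 0.1],
                 [0.1, 0.5]])
print(mutual_information(P_sr))                              # > 0: S and R are dependent
print(mutual_information(np.outer([0.4, 0.6], [0.5, 0.5])))  # 0 when S and R are independent
```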

3. Multiple Responses

Two responses to the same stimulus, $R_1$ and $R_2$, may provide either more or less information jointly than independently.

\[
I_{12} = I[S; R_1, R_2] = H[R_1, R_2] - H[R_1, R_2 \mid S]
\]
\[
R_1 \perp\!\!\!\perp R_2 \;\Rightarrow\; H[R_1,R_2] = H[R_1] + H[R_2]
\qquad\qquad
R_1 \perp\!\!\!\perp R_2 \mid S \;\Rightarrow\; H[R_1,R_2 \mid S] = H[R_1 \mid S] + H[R_2 \mid S]
\]

|                        | $R_1 \perp\!\!\!\perp R_2$ | $R_1 \perp\!\!\!\perp R_2 \mid S$ |             |
|------------------------|----------------------------|-----------------------------------|-------------|
| $I_{12} < I_1 + I_2$   | no                         | yes                               | redundant   |
| $I_{12} = I_1 + I_2$   | yes                        | yes                               | independent |
| $I_{12} > I_1 + I_2$   | yes                        | no                                | synergistic |
| any of the above       | no                         | no                                | ?           |

$I_{12} \ge \max(I_1, I_2)$: the second response cannot destroy information.

Thus, the Venn-like diagram with three variables is misleading. (A concrete synergistic example appears in the sketch below.)
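To make the synergistic row of the table concrete, here is a hedged sketch (my own example) using the classic XOR code: $S$ and $R_1$ are independent uniform bits and $R_2 = S \oplus R_1$. Each response alone carries no information about $S$, yet jointly they determine it, so $I_{12} = 1$ bit $> I_1 + I_2 = 0$.

```python
import numpy as np
from itertools import product

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def mi(joint):
    """I[X;Y] from a joint table: X on axis 0, remaining axes flattened into Y."""
    joint = joint.reshape(joint.shape[0], -1)
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)

# Joint P(s, r1, r2) for the XOR code: s, r1 uniform and independent, r2 = s XOR r1.
P = np.zeros((2, 2, 2))
for s, r1 in product(range(2), range(2)):
    P[s, r1, s ^ r1] = 0.25

I1  = mi(P.sum(axis=2))   # I[S; R1] = 0
I2  = mi(P.sum(axis=1))   # I[S; R2] = 0
I12 = mi(P)               # I[S; R1, R2] = 1 bit: synergy
print(I1, I2, I12)
```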

4. Data Processing Inequality

Suppose $S \to R_1 \to R_2$ form a Markov chain; that is, $R_2 \perp\!\!\!\perp S \mid R_1$. Then

\[
P(R_2, S \mid R_1) = P(R_2 \mid R_1)\, P(S \mid R_1)
\;\Rightarrow\;
P(S \mid R_1, R_2) = P(S \mid R_1)
\]

Thus,

\[
H[S \mid R_2] \ge H[S \mid R_1, R_2] = H[S \mid R_1]
\;\Rightarrow\;
I[S; R_2] \le I[S; R_1]
\]

So any computation based on $R_1$ that does not have separate access to $S$ cannot add information (in the Shannon sense) about the world.

Equality holds iff $S \to R_2 \to R_1$ as well. In this case $R_2$ is called a sufficient statistic for $S$.
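A quick sanity check of the inequality (my own sketch, with arbitrary illustrative transition matrices): build a chain $S \to R_1 \to R_2$ from stochastic matrices and verify $I[S;R_2] \le I[S;R_1]$.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float).ravel()
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

def mi_from_joint(joint):
    return entropy(joint.sum(axis=1)) + entropy(joint.sum(axis=0)) - entropy(joint)

p_s = np.array([0.2, 0.8])                    # illustrative P(S)
A   = np.array([[0.9, 0.1], [0.3, 0.7]])      # P(R1|S), rows sum to 1
B   = np.array([[0.6, 0.4], [0.2, 0.8]])      # P(R2|R1), rows sum to 1

P_s_r1 = p_s[:, None] * A                     # joint P(S, R1)
P_s_r2 = P_s_r1 @ B                           # joint P(S, R2), valid because R2 is independent of S given R1

print(mi_from_joint(P_s_r1), mi_from_joint(P_s_r2))   # I[S;R1] >= I[S;R2]
```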

5. Entropy Rate

So far we have discussed $S$ and $R$ as single (or iid) random variables. But real stimuli and responses form time series.

Let $\mathcal{S} = \{S_1, S_2, S_3, \ldots\}$ be a stochastic process. By the chain rule,

\[
H[S_1, \ldots, S_n] = H[S_n \mid S_1, \ldots, S_{n-1}] + H[S_1, \ldots, S_{n-1}]
= H[S_n \mid S_1, \ldots, S_{n-1}] + H[S_{n-1} \mid S_1, \ldots, S_{n-2}] + \cdots + H[S_1]
\]

The entropy rate of $\mathcal{S}$ is defined as

\[
H[\mathcal{S}] = \lim_{n \to \infty} \frac{H[S_1, \ldots, S_n]}{n}
\]

or alternatively as

\[
H[\mathcal{S}] = \lim_{n \to \infty} H[S_n \mid S_1, \ldots, S_{n-1}]
\]

If $S_i \overset{\text{iid}}{\sim} P(S)$ then $H[\mathcal{S}] = H[S]$. If $\mathcal{S}$ is Markov (and stationary) then $H[\mathcal{S}] = H[S_n \mid S_{n-1}]$.
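For a stationary Markov chain the entropy rate $H[S_n \mid S_{n-1}]$ follows from the stationary distribution and the transition matrix. A minimal sketch, with an illustrative two-state transition matrix of my own choosing:

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz]))

T = np.array([[0.9, 0.1],     # illustrative P(S_n = j | S_{n-1} = i); rows sum to 1
              [0.4, 0.6]])

# Stationary distribution: left eigenvector of T with eigenvalue 1.
evals, evecs = np.linalg.eig(T.T)
pi = np.real(evecs[:, np.argmax(np.isclose(evals, 1))])
pi = pi / pi.sum()

# Entropy rate H[S_n | S_{n-1}] = sum_i pi_i H[T[i, :]]  (bits per symbol)
rate = sum(pi[i] * entropy(T[i]) for i in range(len(pi)))
print(pi, rate)
```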

6. Continuous Random Variables

The discussion so far has involved discrete $S$ and $R$. Now let $S \in \mathbb{R}$ with density $p(s)$. What is its entropy?

Suppose we discretise with bin length $\Delta s$:

\[
H_\Delta[S] = -\sum_i p(s_i)\,\Delta s \,\log\big(p(s_i)\,\Delta s\big)
= -\sum_i p(s_i)\,\Delta s \,\big(\log p(s_i) + \log \Delta s\big)
\]
\[
= -\sum_i p(s_i)\,\Delta s \,\log p(s_i) \;-\; \log \Delta s \sum_i p(s_i)\,\Delta s
= -\Delta s \sum_i p(s_i) \log p(s_i) \;-\; \log \Delta s
\;\to\; -\int ds\, p(s) \log p(s) \;+\; \infty
\]

We therefore define the differential entropy:

\[
h(S) = -\int ds\, p(s) \log p(s).
\]

Note that $h(S)$ can be $< 0$, and can be $\pm\infty$.
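A sketch (my own check) of the relation $H_\Delta[S] \approx h(S) - \log \Delta s$: discretise a standard Gaussian and compare the discrete entropy with the analytic differential entropy $\tfrac12 \log_2 2\pi e \sigma^2$.

```python
import numpy as np

sigma, ds = 1.0, 0.01
s = np.arange(-8, 8, ds)
p = np.exp(-0.5 * (s / sigma) ** 2) / np.sqrt(2 * np.pi * sigma ** 2)

mass = p * ds                                    # probability of each bin
H_discrete = -np.sum(mass * np.log2(mass))       # H_Delta[S]
h_analytic = 0.5 * np.log2(2 * np.pi * np.e * sigma ** 2)   # differential entropy of N(0, sigma^2)

print(H_discrete, h_analytic - np.log2(ds))      # nearly equal: H_Delta ≈ h(S) - log2(Δs)
```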

7. Continuous Random Variables

We can define other information-theoretic quantities similarly. The conditional differential entropy is

\[
h(S \mid R) = -\int ds\, dr\, p(s,r) \log p(s \mid r)
\]

and, like the differential entropy itself, may be poorly behaved.

The mutual information, however, is well defined:

\[
I_\Delta[S;R] = H_\Delta[S] - H_\Delta[S \mid R]
= \left(-\Delta s \sum_i p(s_i) \log p(s_i) - \log \Delta s\right)
- \int dr\, p(r) \left(-\Delta s \sum_i p(s_i \mid r) \log p(s_i \mid r) - \log \Delta s\right)
\;\to\; h(S) - h(S \mid R)
\]

since the $\log \Delta s$ terms cancel; the same holds for other KL divergences.

8. Maximum Entropy Distributions

1. $H[R_1, R_2] \le H[R_1] + H[R_2]$, with equality iff $R_1 \perp\!\!\!\perp R_2$: for fixed marginals, the joint entropy is maximised by independence.

2. Let $\int ds\, p(s) f(s) = a$ for some function $f$. What distribution has maximum entropy?

Use Lagrange multipliers:

\[
L = \int ds\, p(s) \log p(s) - \lambda_0 \left(\int ds\, p(s) - 1\right) - \lambda_1 \left(\int ds\, p(s) f(s) - a\right)
\]
\[
\frac{\delta L}{\delta p(s)} = 1 + \log p(s) - \lambda_0 - \lambda_1 f(s) = 0
\;\Rightarrow\; \log p(s) = \lambda_0 + \lambda_1 f(s) - 1
\;\Rightarrow\; p(s) = \frac{1}{Z}\, e^{\lambda_1 f(s)}
\]

The constants $\lambda_0$ and $\lambda_1$ can be found by solving the constraint equations.

Thus:
- $f(s) = s \;\Rightarrow\; p(s) = \frac{1}{Z} e^{\lambda_1 s}$: Exponential (need $p(s) = 0$ for $s < T$).
- $f(s) = s^2 \;\Rightarrow\; p(s) = \frac{1}{Z} e^{\lambda_1 s^2}$: Gaussian.

Both results together imply that the maximum-entropy point process (for fixed mean arrival rate) is the homogeneous Poisson process: independent, exponentially distributed ISIs. (A numerical check of the second-moment case follows below.)
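As a hedged check of the second-moment case: among a few densities with the same variance and closed-form entropies, the Gaussian has the largest differential entropy. The entropy formulas below are standard; the comparison set (uniform, Laplace) is my own choice.

```python
import numpy as np

var = 2.0   # fixed second moment about the mean

h_gauss   = 0.5 * np.log2(2 * np.pi * np.e * var)   # N(0, var)
h_uniform = 0.5 * np.log2(12 * var)                 # uniform with variance var (width sqrt(12*var))
h_laplace = np.log2(2 * np.e * np.sqrt(var / 2))    # Laplace with variance var (scale b = sqrt(var/2))

print(h_gauss, h_uniform, h_laplace)   # the Gaussian entropy is the largest
```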

9. Channels

We now direct our focus to the conditional $P(R \mid S)$, which defines the channel linking $S$ to $R$:

\[
S \;\longrightarrow\; \boxed{P(R \mid S)} \;\longrightarrow\; R
\]

The mutual information

\[
I[S;R] = \sum_{s,r} P(s,r) \log \frac{P(s,r)}{P(s)P(r)} = \sum_{s,r} P(s)\, P(r \mid s) \log \frac{P(r \mid s)}{P(r)}
\]

depends on the marginals $P(s)$ and $P(r) = \sum_s P(r \mid s) P(s)$ as well, and thus is unsuitable to characterise the conditional alone.

Instead, we characterise the channel by its capacity

\[
C_{R|S} = \sup_{P(s)} I[S;R]
\]

Thus the capacity gives the theoretical limit on the amount of information that can be transmitted over a channel. Clearly, this is limited by the properties of the noise.

10. Joint source-channel coding theorem

The remarkable central result of information theory:

\[
S \;\xrightarrow{\;\text{encoder}\;}\; \tilde S \;\xrightarrow{\;\text{channel } C_{R|\tilde S}\;}\; R \;\xrightarrow{\;\text{decoder}\;}\; T
\]

Any source ensemble $S$ with entropy $H[S] < C_{R|\tilde S}$ can be transmitted (in sufficiently long blocks) with $P_{\text{error}} \to 0$.

The proof is beyond our scope. Some of the key ideas that appear in the proof are:
- block coding
- error correction
- joint typicality
- random codes

11. The channel coding problem

\[
S \;\xrightarrow{\;\text{encoder}\;}\; \tilde S \;\xrightarrow{\;\text{channel } C_{R|\tilde S}\;}\; R \;\xrightarrow{\;\text{decoder}\;}\; T
\]

Given the channel $P(R \mid \tilde S)$ and source $P(S)$, find an encoding $P(\tilde S \mid S)$ (which may be deterministic) to maximise $I[S;R]$.

By the data processing inequality, and the definition of capacity:

\[
I[S;R] \le I[\tilde S; R] \le C_{R|\tilde S}
\]

By the JSCT, equality can be achieved (in the limit of increasing block size). Thus $I[\tilde S; R]$ should saturate $C_{R|\tilde S}$.

See homework for an algorithm (Blahut-Arimoto) to find the $P(\tilde S)$ that saturates $C_{R|\tilde S}$ for a general discrete channel; a rough sketch follows below.
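The slide defers the algorithm to homework; what follows is only a sketch of the standard Blahut-Arimoto iteration for a discrete memoryless channel (alternate between the posterior implied by the current input distribution and the input distribution that maximises the resulting bound), not the course's reference implementation. Function names and the test channel are my own.

```python
import numpy as np

def blahut_arimoto(Q, n_iter=200):
    """Estimate the capacity (bits) of a discrete memoryless channel Q[x, y] = P(y|x),
    together with the capacity-achieving input distribution p(x)."""
    Q = np.asarray(Q, dtype=float)
    p = np.full(Q.shape[0], 1.0 / Q.shape[0])              # start from a uniform input
    for _ in range(n_iter):
        joint = p[:, None] * Q                             # p(x) P(y|x)
        post = joint / (joint.sum(axis=0, keepdims=True) + 1e-300)   # posterior P(x|y)
        # Update: p(x) proportional to exp( sum_y P(y|x) log P(x|y) )
        log_c = np.sum(Q * np.log(post + 1e-300), axis=1)
        p = np.exp(log_c - log_c.max())
        p /= p.sum()
    joint = p[:, None] * Q
    py = joint.sum(axis=0)
    capacity = np.sum(joint * np.log2(Q / (py[None, :] + 1e-300) + 1e-300))
    return capacity, p

# Binary symmetric channel with crossover 0.1: capacity should be 1 - H2(0.1) ≈ 0.531 bits.
Q_bsc = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
print(blahut_arimoto(Q_bsc))
```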

12. Entropy maximisation

\[
I[\tilde S; R] = \underbrace{H[R]}_{\text{marginal entropy}} \;-\; \underbrace{H[R \mid \tilde S]}_{\text{noise entropy}}
\]

If the noise is small and roughly "constant", maximising $I[\tilde S; R]$ amounts to maximising the marginal (output) entropy $H[R]$.

Consider a (rate-coding) neuron with $r \in [0, r_{\max}]$:

\[
h(r) = -\int_0^{r_{\max}} dr\, p(r) \log p(r)
\]

To maximise the marginal entropy, we add a Lagrange multiplier $\mu$ to enforce normalisation and then differentiate:

\[
\frac{\delta}{\delta p(r)} \left[ h(r) - \mu \int_0^{r_{\max}} dr\, p(r) \right]
= \begin{cases} -\log p(r) - 1 - \mu & r \in [0, r_{\max}] \\ 0 & \text{otherwise} \end{cases}
\]

Setting this to zero gives $p(r) = \text{const}$ for $r \in [0, r_{\max}]$, i.e.

\[
p(r) = \begin{cases} \dfrac{1}{r_{\max}} & r \in [0, r_{\max}] \\ 0 & \text{otherwise} \end{cases}
\]

13. Histogram Equalisation

Suppose $r = \tilde s + \eta$, where $\eta$ represents a (relatively small) source of noise, and consider a deterministic encoding $\tilde s = f(s)$. How do we ensure that $p(r) = 1/r_{\max}$?

\[
\frac{1}{r_{\max}} = p(r) \approx p(\tilde s) = \frac{p(s)}{f'(s)}
\;\Rightarrow\; f'(s) = r_{\max}\, p(s)
\;\Rightarrow\; f(s) = r_{\max} \int_{-\infty}^{s} ds'\, p(s')
\]

[Figure: the cumulative nonlinearity $\tilde s = f(s)$ plotted against $s$.]
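A minimal sketch of the cumulative mapping $f(s) = r_{\max} \int_{-\infty}^{s} p(s')\,ds'$, here estimated from samples via the empirical CDF. This is my own illustration (Gaussian "stimuli", arbitrary function names), not Laughlin's analysis of measured contrast statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
r_max = 1.0

# Illustrative stimulus samples; any distribution works.
s = rng.normal(0.0, 1.0, size=100_000)

# Empirical CDF as the encoding nonlinearity: f(s) = r_max * P(S <= s).
s_sorted = np.sort(s)
def f(x):
    return r_max * np.searchsorted(s_sorted, x) / len(s_sorted)

r = f(s)                                                   # encoded responses
hist, _ = np.histogram(r, bins=10, range=(0, r_max), density=True)
print(hist)                                                # approximately flat: p(r) ≈ 1 / r_max
```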

14. Histogram Equalisation

[Figure: data from Laughlin (1981).]

15. Gaussian channel

A similar idea of output-entropy maximisation appears in the theory of Gaussian channel coding, where it is called the water-filling algorithm.

We will need the differential entropy of a (multivariate) Gaussian distribution. Let

\[
p(Z) = |2\pi\Sigma|^{-1/2} \exp\!\left(-\tfrac12 (Z-\mu)^{\mathsf T} \Sigma^{-1} (Z-\mu)\right).
\]

Then

\[
h(Z) = -\int dZ\, p(Z) \left[ -\tfrac12 \log|2\pi\Sigma| - \tfrac12 (Z-\mu)^{\mathsf T} \Sigma^{-1} (Z-\mu)\, \log e \right]
\]
\[
= \tfrac12 \log|2\pi\Sigma| + \tfrac12 \log e \int dZ\, p(Z)\, \mathrm{Tr}\!\left[\Sigma^{-1} (Z-\mu)(Z-\mu)^{\mathsf T}\right]
\]
\[
= \tfrac12 \log|2\pi\Sigma| + \tfrac12 \log e\; \mathrm{Tr}\!\left[\Sigma^{-1}\Sigma\right]
= \tfrac12 \log|2\pi\Sigma| + \tfrac12\, d \log e
= \tfrac12 \log|2\pi e \Sigma|
\]
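A quick Monte Carlo check of $h(Z) = \tfrac12 \log_2 |2\pi e \Sigma|$ (my own sketch with an illustrative covariance): estimate $h(Z) = \mathbb{E}[-\log_2 p(Z)]$ from samples and compare with the closed form.

```python
import numpy as np

rng = np.random.default_rng(0)
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])          # illustrative covariance
d = Sigma.shape[0]

h_analytic = 0.5 * np.log2(np.linalg.det(2 * np.pi * np.e * Sigma))

# Monte Carlo: h(Z) = E[-log2 p(Z)]
Z = rng.multivariate_normal(np.zeros(d), Sigma, size=200_000)
Sigma_inv = np.linalg.inv(Sigma)
log_p = -0.5 * (np.log(np.linalg.det(2 * np.pi * Sigma))
                + np.einsum('ni,ij,nj->n', Z, Sigma_inv, Z))   # natural-log density
h_mc = -np.mean(log_p) / np.log(2)                             # convert to bits

print(h_analytic, h_mc)   # should agree closely
```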

16. Gaussian channel – white noise

\[
\tilde S \;\longrightarrow\; \oplus \;\longrightarrow\; R = \tilde S + Z, \qquad Z \sim \mathcal{N}(0, k_z)
\]

\[
I[\tilde S; R] = h(R) - h(R \mid \tilde S) = h(R) - h(\tilde S + Z \mid \tilde S) = h(R) - h(Z)
\;\Rightarrow\;
I[\tilde S; R] = h(R) - \tfrac12 \log 2\pi e k_z.
\]

Without constraint, $h(R) \to \infty$ and $C_{R|\tilde S} = \infty$. Therefore, constrain the signal power: $\langle \tilde S^2 \rangle \le P$, i.e. $\frac{1}{n} \sum_{i=1}^n \tilde s_i^2 \le P$.

Then

\[
\langle R^2 \rangle = \big\langle (\tilde S + Z)^2 \big\rangle = \langle \tilde S^2 \rangle + \langle Z^2 \rangle + 2\langle \tilde S Z \rangle \le P + k_z + 0
\]
\[
\Rightarrow\; h(R) \le h\big(\mathcal{N}(0, P + k_z)\big) = \tfrac12 \log 2\pi e (P + k_z)
\]
\[
\Rightarrow\; I[\tilde S; R] \le \tfrac12 \log 2\pi e (P + k_z) - \tfrac12 \log 2\pi e k_z = \tfrac12 \log\left(1 + \frac{P}{k_z}\right)
\;\Rightarrow\;
C_{R|\tilde S} = \tfrac12 \log\left(1 + \frac{P}{k_z}\right)
\]

The capacity is achieved iff $\tilde S \sim \mathcal{N}(0, P)$, i.e. $R \sim \mathcal{N}(0, P + k_z)$.
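A numerical illustration (my own sketch, with arbitrary values of $P$ and $k_z$): compute $h(R) - h(Z)$ on a grid for a Gaussian input and for a uniform input of the same power. The Gaussian input meets the capacity $\tfrac12 \log_2(1 + P/k_z)$; the uniform input falls short.

```python
import numpy as np

P, k_z = 1.0, 0.25
dx = 0.01
x = np.arange(-12, 12, dx)

def gauss(x, var):
    return np.exp(-0.5 * x**2 / var) / np.sqrt(2 * np.pi * var)

def diff_entropy(p, dx):
    """Riemann-sum approximation of -∫ p log2 p."""
    nz = p > 0
    return -np.sum(p[nz] * np.log2(p[nz])) * dx

p_z = gauss(x, k_z)                                  # noise density

# Gaussian input: R ~ N(0, P + k_z); information is h(R) - h(Z).
I_gauss = diff_entropy(gauss(x, P + k_z), dx) - diff_entropy(p_z, dx)

# Uniform input with the same power P: p(r) is the convolution of input and noise densities.
w = np.sqrt(12 * P)                                  # width giving variance P
p_s = (np.abs(x) <= w / 2) / w
p_r = np.convolve(p_s, p_z, mode="same") * dx
I_unif = diff_entropy(p_r, dx) - diff_entropy(p_z, dx)

print(0.5 * np.log2(1 + P / k_z), I_gauss, I_unif)   # capacity; achieved by the Gaussian input; uniform is lower
```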

17. Gaussian channel – correlated noise

Now consider a vector Gaussian channel:

\[
\tilde S = (\tilde S_1, \ldots, \tilde S_d) \;\longrightarrow\; \oplus \;\longrightarrow\; R = (R_1, \ldots, R_d),
\qquad Z = (Z_1, \ldots, Z_d) \sim \mathcal{N}(0, K_z),
\qquad \tfrac{1}{d}\, \mathrm{Tr}\,\big\langle \tilde S \tilde S^{\mathsf T} \big\rangle \le P
\]

Following the same approach as before:

\[
I[\tilde S; R] = h(R) - h(Z) \le \tfrac12 \log\!\big((2\pi e)^d\, |K_{\tilde s} + K_z|\big) - \tfrac12 \log\!\big((2\pi e)^d\, |K_z|\big),
\]

so $C_{R|\tilde S}$ is achieved when $\tilde S$ (and thus $R$) is Gaussian, with $|K_{\tilde s} + K_z|$ maximised subject to $\tfrac{1}{d}\,\mathrm{Tr}[K_{\tilde s}] \le P$.

Diagonalise $K_z$; then $K_{\tilde s}$ is diagonal in the same basis. For noise that is stationary (with respect to the dimension indexed by $d$) this can be achieved by a Fourier transform, so we index the diagonal elements by $\omega$:

\[
k^*_{\tilde s}(\omega) = \operatorname*{argmax}_{k_{\tilde s}} \prod_\omega \big(k_{\tilde s}(\omega) + k_z(\omega)\big)
\quad \text{such that} \quad \tfrac{1}{d} \sum_\omega k_{\tilde s}(\omega) \le P
\]

18. Water filling

Assume that the optimum is achieved at the maximum input power. Then, with a Lagrange multiplier $\lambda$ for the power constraint,

\[
k^*_{\tilde s}(\omega) = \operatorname*{argmax}_{k_{\tilde s}} \left[ \sum_\omega \log\big(k_{\tilde s}(\omega) + k_z(\omega)\big) - \lambda \left( \tfrac{1}{d}\sum_\omega k_{\tilde s}(\omega) - P \right) \right]
\]
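Setting the derivative of this Lagrangian to zero gives the standard water-filling solution $k_{\tilde s}(\omega) = \max\!\big(0,\ \nu - k_z(\omega)\big)$, with the "water level" $\nu$ chosen to satisfy the power constraint. A minimal sketch of this allocation (the noise spectrum, total power, and function name are illustrative; here the total power corresponds to $dP$):

```python
import numpy as np

def water_fill(k_z, P_total):
    """Allocate input power across channels: k_s(w) = max(0, nu - k_z(w)),
    with the water level nu found by bisection so the allocation sums to P_total."""
    k_z = np.asarray(k_z, dtype=float)
    lo, hi = k_z.min(), k_z.max() + P_total
    for _ in range(100):
        nu = 0.5 * (lo + hi)
        k_s = np.maximum(0.0, nu - k_z)
        if k_s.sum() > P_total:
            hi = nu
        else:
            lo = nu
    return k_s

k_z = np.array([0.1, 0.5, 1.0, 2.0, 4.0])        # illustrative noise power at each frequency
k_s = water_fill(k_z, P_total=3.0)

capacity = 0.5 * np.sum(np.log2(1 + k_s / k_z))  # bits per block of d samples
print(k_s, k_s.sum(), capacity)
```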
