Information, Learning and Falsification
David Balduzzi, December 17, 2011


  1. Information, Learning and Falsification. David Balduzzi. December 17, 2011. Max Planck Institute for Intelligent Systems, Tübingen, Germany.
     Outline: Effective information, Algorithmic information, Learning theory, Falsification, Conclusion.

  2-6. Three main theories of information:
     - Algorithmic information. "Description." The information embedded in a single string depends on its shortest description.
     - Shannon information. "Transmission." The information transmitted by a symbol depends on the transmission probabilities of the other symbols in an ensemble.
     - Statistical learning theory. "Prediction." The information about the world embedded in a classifier (its expected error) depends on the complexity of the learning algorithm.
     Can these be related?
     - Effective information. "Discrimination." The information produced by a physical process when it produces an output depends on how sharply it discriminates between inputs.

  7. Effective information

  8. Nature decomposes into specific, bounded physical systems, which we model as deterministic functions f : X → Y or, more generally, as Markov matrices p_m(y | x), where X and Y are finite sets.

  9. Physical processes discriminate between inputs. [Figure: a thermometer as an example of a discriminating process.]

  10. Definition. The discrimination given by Markov matrix m outputting y is
      $\hat{p}_m(x \mid y) := \frac{p_m(y \mid \mathrm{do}(x)) \cdot p_{\mathrm{unif}}(x)}{p_m(y)}$,
      where $p_m(y) := \sum_x p_m(y \mid \mathrm{do}(x)) \cdot p_{\mathrm{unif}}(x)$ is the effective distribution.
      Definition. Effective information is the Kullback-Leibler divergence
      $\mathrm{ei}(m, y) := H\!\left[\hat{p}_m(X \mid y) \,\middle\|\, p_{\mathrm{unif}}(X)\right]$.
      (Balduzzi and Tononi, PLoS Computational Biology, 2008)
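A minimal sketch of this computation, assuming the Markov matrix is stored as a NumPy array m[x, y] holding p_m(y | do(x)) over small finite X and Y; the matrix values below are made up for illustration, and logarithms are taken in base 2 (bits).

```python
import numpy as np

def effective_information(m, y):
    """Effective information ei(m, y), in bits, for a Markov matrix m.

    m[x, y] holds p_m(y | do(x)); rows index inputs x, columns index outputs y.
    Inputs are perturbed according to the uniform distribution.
    """
    n_inputs = m.shape[0]
    p_unif = np.full(n_inputs, 1.0 / n_inputs)

    # Effective distribution: p_m(y) = sum_x p_m(y | do(x)) * p_unif(x)
    p_y = np.dot(p_unif, m[:, y])

    # Discrimination: Bayes' rule against the uniform interventional prior
    p_hat = m[:, y] * p_unif / p_y

    # ei(m, y) = KL divergence between the discrimination and the uniform prior
    mask = p_hat > 0
    return float(np.sum(p_hat[mask] * np.log2(p_hat[mask] / p_unif[mask])))

# Made-up example: 4 inputs, 2 outputs; output 0 is produced only by inputs {0, 1}
m = np.array([[1.0, 0.0],
              [1.0, 0.0],
              [0.0, 1.0],
              [0.0, 1.0]])
print(effective_information(m, y=0))  # 1.0 bit: the output rules out half the inputs
```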

  11. Special case: deterministic f : X → Y.
      Definition. The discrimination given by f outputting y assigns equal probability to all elements of the pre-image $f^{-1}(y)$.
      Definition. Effective information is $\mathrm{ei}(f, y) := -\log \frac{|f^{-1}(y)|}{|X|}$.
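For the deterministic special case, a tiny illustration with a made-up "thermometer" function, represented as a Python dict from inputs to outputs:

```python
import math

def ei_deterministic(f, y):
    """ei(f, y) = -log2(|f^{-1}(y)| / |X|) for a function given as a dict x -> f(x)."""
    preimage_size = sum(1 for x in f if f[x] == y)
    return -math.log2(preimage_size / len(f))

# Made-up thermometer: 8 temperatures mapped to two readings
f = {t: ('low' if t < 4 else 'high') for t in range(8)}
print(ei_deterministic(f, 'low'))  # 1.0 bit: the reading halves the set of possible inputs
```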

  12. [Figure: the discrimination of the thermometer. The discrimination when the thermometer outputs a reading is the set of inputs compatible with that reading; ei = -log(size of that set / size of the input space).]

  13. Algorithmic information

  14. Definition. Given a universal prefix Turing machine T, the Kolmogorov complexity of a string s is
      $K_T(s) := \min_{\{i \,:\, T(i) = s\bullet\}} \mathrm{len}(i)$,
      the length of the shortest program that generates s.
      For any universal Turing machine $U \neq T$, there exists a constant c such that $K_U(s) - c \leq K_T(s) \leq K_U(s) + c$ for all s.
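Kolmogorov complexity is uncomputable in general, so the following is only a toy illustration of the definition, with the universal machine replaced by a small hard-coded program table (entirely made up):

```python
# Made-up, prefix-free table of programs (binary strings) and their outputs;
# it stands in for T(i), which cannot actually be tabulated like this.
toy_programs = {"0": "ab", "10": "abc", "110": "ba", "111": "abab"}

def K_toy(s):
    """Toy stand-in for K_T(s): length of the shortest listed program whose output begins with s."""
    lengths = [len(p) for p, out in toy_programs.items() if out.startswith(s)]
    return min(lengths) if lengths else float("inf")

print(K_toy("ab"))  # 1: the one-bit program "0" already outputs "ab"
```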

  15. Definition. Given T, the (unnormalized) Solomonoff prior probability of string s is
      $p_T(s) := \sum_{\{i \,:\, T(i) = s\bullet\}} 2^{-\mathrm{len}(i)}$,
      where the sum is over strings i that cause T to output s as a prefix, and no proper prefix of i outputs s.
      The Turing machine discriminates between programs according to which strings they output; the Solomonoff prior counts the programs in each class (weighted by length).
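The same kind of made-up, prefix-free program table as above can illustrate the counting in this definition: each program contributes 2^(-len(i)) to the prior of every string it outputs as a prefix. The table is only a stand-in for a real universal prefix machine.

```python
# Made-up, prefix-free program table; not a real universal machine.
toy_programs = {"0": "ab", "10": "abc", "110": "ba", "111": "abab"}

def solomonoff_prior(s):
    """Unnormalized prior p_T(s): sum of 2^{-len(i)} over programs whose output starts with s."""
    return sum(2 ** -len(prog) for prog, out in toy_programs.items() if out.startswith(s))

print(solomonoff_prior("ab"))  # 0.875: programs "0", "10", "111" contribute 1/2 + 1/4 + 1/8
```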

  16. Kolmogorov complexity = Algorithmic probability.
      Theorem (Levin). For all s, $-\log p_T(s) = K_T(s)$ up to an additive constant c.
      Upshot: for my purposes, Solomonoff's formulation of Kolmogorov complexity is the right one: $K_T(s) := -\log p_T(s)$.

  17. Recall that the effective distribution was the denominator when computing discriminations using Bayes' rule:
      $\hat{p}_m(x \mid y) := \frac{p_m(y \mid \mathrm{do}(x)) \cdot p_{\mathrm{unif}}(x)}{p_m(y)}$.

  18. Solomonoff prior → Effective distribution.
      Proposition. The effective distribution on Y induced by f is
      $p_f(y) = \sum_{\{x \,:\, f(x) = y\}} 2^{-\mathrm{len}(x)}$.
      Compare with the Solomonoff distribution:
      $p_T(s) := \sum_{\{i \,:\, T(i) = s\bullet\}} 2^{-\mathrm{len}(i)}$.
      To compute the effective distribution, replace the universal Turing machine T with f : X → Y and give each input the length $\mathrm{len}(x) = \log |X|$, its length in the optimal code for the uniform distribution on X.

  19. Kolmogorov complexity → Effective information.
      Proposition. For a function f : X → Y, effective information equals
      $\mathrm{ei}(f, y) = -\log p_f(y) = -\log \Big( \sum_{\{x \,:\, f(x) = y\}} 2^{-\mathrm{len}(x)} \Big)$.
      Compare with Kolmogorov complexity:
      $K_T(s) = -\log p_T(s) = -\log \Big( \sum_{\{i \,:\, T(i) = s\bullet\}} 2^{-\mathrm{len}(i)} \Big)$.
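A small numerical check of this proposition, under the assumption from the previous slide that every input gets length len(x) = log2|X|; the toy function is the made-up thermometer from the earlier sketch.

```python
import math

# Made-up thermometer function from the earlier sketch
f = {t: ('low' if t < 4 else 'high') for t in range(8)}

def ei_via_code_lengths(f, y):
    """ei(f, y) = -log2( sum_{x : f(x)=y} 2^{-len(x)} ) with len(x) = log2|X|."""
    length = math.log2(len(f))
    return -math.log2(sum(2 ** -length for x in f if f[x] == y))

def ei_via_preimage(f, y):
    """ei(f, y) = -log2( |f^{-1}(y)| / |X| ), the definition from the deterministic case."""
    return -math.log2(sum(1 for x in f if f[x] == y) / len(f))

print(ei_via_code_lengths(f, 'high'), ei_via_preimage(f, 'high'))  # both print 1.0
```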

  20. Statistical learning theory

  21. Hypothesis space. Given unlabeled data $D = (x_1, \ldots, x_l) \subset X^l$, let the hypothesis space $\Sigma_D = \{\sigma : D \to \pm 1\}$ be the set of all possible labelings. [Figure: a sample of points, each labeled +1 or -1, illustrating the hypothesis space.]
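A quick way to see the size of this hypothesis space, with made-up unlabeled data points: Σ_D contains exactly 2^l labelings.

```python
from itertools import product

D = ["x1", "x2", "x3"]  # made-up unlabeled data points

# Sigma_D: every assignment of +1/-1 labels to the points of D
hypothesis_space = [dict(zip(D, labels)) for labels in product([+1, -1], repeat=len(D))]

print(len(hypothesis_space))  # 2^3 = 8 labelings
print(hypothesis_space[0])    # {'x1': 1, 'x2': 1, 'x3': 1}
```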

  22. Setup. Suppose data $D = (x_1, \ldots, x_l)$ is drawn from an unknown probability distribution $P_X$ and labeled $y_i = \sigma(x_i)$ by an unknown supervisor $\sigma \in \Sigma_X$.
      The learning problem: find a classifier $\hat{f}$ guaranteed to perform well on future (unseen) data sampled via $P_X$ and labeled by $\sigma$.

  23-25. Empirical risk minimization. Suppose we are given a class F of functions to work with. A simple algorithm for tackling the learning problem is:
      Algorithm: given data labeled by $\sigma \in \Sigma_D$, find the classifier $\hat{f} \in F \subset \Sigma_D$ that minimizes the empirical risk,
      $\hat{f} := \arg\min_{f \in F} \frac{1}{l} \sum_{i=1}^{l} \mathbb{I}_{\{f(x_i) \neq \sigma(x_i)\}}$.
      Key step. Reformulate the algorithm as a function between finite sets (a sketch in code follows below):
      $R_{F,D} : \text{HYPOTHESIS SPACE} \to \text{EMPIRICAL RISK}, \qquad \Sigma_D \to \mathbb{R}, \qquad \sigma \mapsto \min_{f \in F} \frac{1}{l} \sum_{i=1}^{l} \mathbb{I}_{\{f(x_i) \neq \sigma(x_i)\}}$.
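A hedged sketch of empirical risk minimization and of the reformulation R_{F,D} : Σ_D → R, using made-up data points and a small made-up class F of threshold classifiers:

```python
def empirical_risk(f, sigma, D):
    """Fraction of points in D where classifier f disagrees with the labeling sigma."""
    return sum(f(x) != sigma[x] for x in D) / len(D)

def erm(F, sigma, D):
    """Classical ERM: return a classifier in F minimizing empirical risk."""
    return min(F, key=lambda f: empirical_risk(f, sigma, D))

def R(F, D):
    """Reformulation as a function between finite sets:
    R_{F,D} maps a labeling sigma in Sigma_D to the minimal empirical risk over F."""
    return lambda sigma: min(empirical_risk(f, sigma, D) for f in F)

# Made-up example: points on the line, F = three threshold classifiers
D = [0.1, 0.4, 0.6, 0.9]
F = [lambda x, t=t: +1 if x > t else -1 for t in (0.0, 0.5, 1.0)]

sigma = {0.1: -1, 0.4: -1, 0.6: +1, 0.9: +1}  # realizable by the threshold at 0.5
print(R(F, D)(sigma))                              # 0.0, the minimal empirical risk over F
print(empirical_risk(erm(F, sigma, D), sigma, D))  # same value via the classical formulation
```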
