An Alphabet-Size Bound for the Information Bottleneck Function. ISIT 2020. Christoph Hirche, Andreas Winter
What for? DNNs, video processing, clustering. C. Hirche – IBM bounds 2/16
Sufficient Statistics. Sufficient statistics are maps or partitions of $X$, $S(X)$, that capture all the information that $X$ has on $Y$. Namely, $I(S(X);Y) = I(X;Y)$. Minimal sufficient statistics, $T(X)$, are the simplest sufficient statistics: $T(X) = \arg\min_{S(X):\, I(S(X);Y) = I(X;Y)} I(S(X);X)$. Approximate minimal sufficient statistics ⇔ Information Bottleneck: $\min_{S(X):\, I(S(X);Y) \ge a} I(S(X);X)$.
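The sufficiency condition $I(S(X);Y) = I(X;Y)$ can be checked numerically on a small example. The sketch below is only an illustration: the toy distribution, the merging map `s`, and the helper `mutual_information` are hypothetical and not taken from the slides, and logarithms are assumed to be base 2.

```python
import numpy as np

def mutual_information(p_ab):
    """Mutual information I(A;B) in bits from a joint probability table p_ab[a, b]."""
    p_ab = np.asarray(p_ab, dtype=float)
    p_a = p_ab.sum(axis=1, keepdims=True)
    p_b = p_ab.sum(axis=0, keepdims=True)
    product = p_a @ p_b          # product of the marginals, same shape as p_ab
    mask = p_ab > 0              # skip zero entries (0 log 0 = 0)
    return float(np.sum(p_ab[mask] * np.log2(p_ab[mask] / product[mask])))

# Hypothetical toy source: symbols 0,1 share one conditional P(Y|x), symbols 2,3 another.
p_x = np.array([0.1, 0.4, 0.3, 0.2])
p_y_given_x = np.array([[0.9, 0.1], [0.9, 0.1], [0.2, 0.8], [0.2, 0.8]])
p_xy = p_x[:, None] * p_y_given_x

# Candidate statistic S: merge symbols with identical conditionals.
s = np.array([0, 0, 1, 1])
p_sy = np.zeros((2, p_xy.shape[1]))
np.add.at(p_sy, s, p_xy)         # P(S = n, Y = y) = sum of P(x, y) over x with S(x) = n

# Joint of (S(X), X): S is a deterministic function of X.
p_sx = np.zeros((2, len(p_x)))
p_sx[s, np.arange(len(p_x))] = p_x

i_xy = mutual_information(p_xy)
i_sy = mutual_information(p_sy)
i_sx = mutual_information(p_sx)  # equals H(S(X)) because S is deterministic
```

Here `i_sy` matches `i_xy`, so `s` is sufficient, while `i_sx` is the cost $I(S(X);X)$ that the minimal sufficient statistic minimizes.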
Application in ML
IB optimality? From Shwartz-Ziv and Tishby: the DNN layers converge to fixed points of the IB equations.
Dimension Bounds. Generally known: $|W| \le |X| + 1$. But can we get bounds in terms of $|Y|$? Maybe approximately? $I_{XY}(R,N) \le I_{XY}(R) \le I_{XY}(R,N) + \delta(\epsilon,|Y|)$ for some $\delta(\epsilon,|Y|)$ and $|W| \le N(\epsilon,|Y|)$.
Recoverability. Lemma: Given a joint distribution $P_{XY}$ of two random variables $X$ and $Y$, assume that there exist $N$ probability distributions $Q_1, \ldots, Q_N$ on $Y$ and a function $f : X \to [N]$ with the property that $\forall x:\ \frac{1}{2}\| P_{Y|X=x} - Q_{f(x)} \|_1 \le \epsilon$ for some $\epsilon > 0$. Then there exists a recovery channel $S : [N] \to X$ such that the Markov chain $Y - X - X' - \widehat{X}$ defined by $X' = f(X)$ and $P_{\widehat{X}|X'} = S$ satisfies $P_X = P_{\widehat{X}}$ and $\frac{1}{2}\| P_{XY} - P_{\widehat{X}Y} \|_1 \le \epsilon' = 2\epsilon$.
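The lemma can be illustrated numerically. The slide does not say how the recovery channel is built, so the sketch below assumes one natural choice, the Bayesian posterior $S(x|n) = P(X = x \mid f(X) = n)$. The function name `recovery_channel` and the toy numbers are hypothetical.

```python
import numpy as np

def recovery_channel(p_xy, f, n_cells):
    """Assumed construction: S[n, x] = P(X = x | f(X) = n), the posterior given the cell.

    p_xy : array of shape (|X|, |Y|), joint distribution of X and Y
    f    : integer array of length |X| with values in range(n_cells)
    Returns S and the joint distribution of (X_hat, Y), shape (|X|, |Y|).
    """
    p_xy = np.asarray(p_xy, dtype=float)
    p_x = p_xy.sum(axis=1)
    S = np.zeros((n_cells, len(p_x)))
    S[f, np.arange(len(p_x))] = p_x                    # unnormalised posterior
    cell_mass = S.sum(axis=1, keepdims=True)           # P(X' = n)
    S = np.divide(S, cell_mass, out=np.zeros_like(S), where=cell_mass > 0)
    p_xprime_y = np.zeros((n_cells, p_xy.shape[1]))
    np.add.at(p_xprime_y, f, p_xy)                     # joint of (X', Y)
    p_xhat_y = S.T @ p_xprime_y                        # X_hat depends on Y only through X'
    return S, p_xhat_y

# Hypothetical toy data: four inputs grouped into two cells.
p_x = np.array([0.1, 0.4, 0.3, 0.2])
p_y_given_x = np.array([[0.90, 0.10], [0.88, 0.12], [0.20, 0.80], [0.22, 0.78]])
p_xy = p_x[:, None] * p_y_given_x
f = np.array([0, 0, 1, 1])
Q = np.array([[0.89, 0.11], [0.21, 0.79]])             # one reference distribution per cell

eps = 0.5 * np.abs(p_y_given_x - Q[f]).sum(axis=1).max()   # smallest eps satisfying the hypothesis
S, p_xhat_y = recovery_channel(p_xy, f, n_cells=2)
tv = 0.5 * np.abs(p_xy - p_xhat_y).sum()                   # (1/2) || P_XY - P_XhatY ||_1
```

With this choice the marginal of the recovered variable equals `p_x`, and `tv` stays within `2 * eps`, as the lemma states.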
Bounds on N? How large does $N$ need to be? Easy: $N \le |X|$, but that's still too big. In the worst case, we need to choose an $\epsilon$-net of the probability simplex $\mathcal{P}(Y)$ of all probability distributions on $Y$ with respect to the total variation distance, which results in $N \le \left(\frac{2}{\epsilon}\right)^{|Y|}$. Generally, one can do much better (e.g. for deterministic data sets).
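To see why data-dependent choices can beat the worst-case net, one can cover only the conditional distributions that actually occur. The greedy procedure below is an illustrative assumption, not a method from the slides, and `greedy_cover` and `worst_case_net_size` are hypothetical names. It returns a map `f` and centers `Q` satisfying the hypothesis of the recoverability lemma.

```python
import numpy as np

def greedy_cover(p_y_given_x, eps):
    """Assign each row P(Y|x) to the first center within total variation eps,
    opening a new center when none is close enough."""
    centers = []
    f = np.empty(len(p_y_given_x), dtype=int)
    for x, row in enumerate(p_y_given_x):
        for n, q in enumerate(centers):
            if 0.5 * np.abs(row - q).sum() <= eps:
                f[x] = n
                break
        else:                       # no existing center is eps-close
            centers.append(row)
            f[x] = len(centers) - 1
    return f, np.array(centers)

def worst_case_net_size(eps, y_size):
    """Worst-case bound (2 / eps) ** |Y| quoted on the slide."""
    return (2.0 / eps) ** y_size

# Deterministic data set: every input x has exactly one label y.
labels = np.array([0, 1, 2, 0, 1, 2, 0, 0])
p_y_given_x = np.eye(3)[labels]
f, Q = greedy_cover(p_y_given_x, eps=0.1)
```

For deterministic data the cover needs at most $|Y|$ centers (here 3), far below `worst_case_net_size(0.1, 3)`.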
IBM Bound. Lemma: Let $Y - X - \widehat{X}$ be a Markov chain. Then the IB function of $P_{XY}$ dominates the IB function of $P_{\widehat{X}Y}$ pointwise: $I_{XY}(R) \ge I_{\widehat{X}Y}(R)\ \forall R$.
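A short argument for this domination, sketched under the assumption (not spelled out on the slides) that the IB function is defined as $I_{XY}(R) = \max\{ I(Y;W) : Y - X - W,\ I(X;W) \le R \}$:

```latex
% Assumed definition: I_{XY}(R) = \max \{ I(Y;W) : Y - X - W,\ I(X;W) \le R \}.
\begin{align*}
&\text{Let } P_{W|\widehat{X}} \text{ be feasible for } P_{\widehat{X}Y},
  \text{ i.e. } I(\widehat{X};W) \le R. \\
&\text{Define } P_{W|X}(w|x) = \sum_{\hat{x}} P_{W|\widehat{X}}(w|\hat{x})\,
  P_{\widehat{X}|X}(\hat{x}|x),
  \text{ so that } Y - X - \widehat{X} - W. \\
&\text{Data processing: } I(X;W) \le I(\widehat{X};W) \le R,
  \text{ so } P_{W|X} \text{ is feasible for } P_{XY}. \\
&\text{The joint law of } (Y,W) \text{ is the same in both problems, hence }
  I_{XY}(R) \ge I(Y;W). \\
&\text{Maximizing over feasible } P_{W|\widehat{X}} \text{ gives }
  I_{XY}(R) \ge I_{\widehat{X}Y}(R).
\end{align*}
```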
Alphabet-Size Bounds. Corollary: Under the assumptions of our main lemma, $I_{X'Y}(R) \le I_{XY}(R) \le I_{X'Y}(R) + \delta(\epsilon,|Y|)$, where $\delta(\epsilon,|Y|) := \epsilon' \log|Y| + (1+\epsilon')\, h\!\left(\frac{\epsilon'}{1+\epsilon'}\right)$. Corollary: Under the assumptions of our main lemma, $I_{XY}(R,N) \le I_{XY}(R) \le I_{XY}(R,N) + \delta(\epsilon,|Y|)$, where $\delta(\epsilon,|Y|) := \epsilon' \log|Y| + (1+\epsilon')\, h\!\left(\frac{\epsilon'}{1+\epsilon'}\right)$ and $N \le \left(\frac{2}{\epsilon}\right)^{|Y|}$.
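The penalty term $\delta(\epsilon,|Y|)$ is easy to evaluate. The sketch below assumes base-2 logarithms (the slides do not fix the base), takes $h$ to be the binary entropy, and uses $\epsilon' = 2\epsilon$ from the recoverability lemma. The function names are hypothetical.

```python
import math

def binary_entropy(p):
    """Binary entropy h(p) in bits, with h(0) = h(1) = 0."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def delta(eps, y_size):
    """delta(eps, |Y|) = eps' log|Y| + (1 + eps') h(eps' / (1 + eps')), with eps' = 2 eps."""
    eps_prime = 2.0 * eps
    return (eps_prime * math.log2(y_size)
            + (1.0 + eps_prime) * binary_entropy(eps_prime / (1.0 + eps_prime)))

# Additive gap in bits for a few accuracies at |Y| = 10 (illustrative values only).
gaps = {eps: delta(eps, 10) for eps in (0.001, 0.01, 0.1)}
```

The gap vanishes as $\epsilon \to 0$, while the alphabet bound $(2/\epsilon)^{|Y|}$ grows, which is the trade-off the second corollary expresses.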
Quantum IB
QIB. For a quantum state $\rho_{XY}$, we define $R_q(a) = \inf_{\mathcal{N}_{X \to W}:\ I(Y;W)_\sigma \ge a} I(YR;W)_\sigma$, with $\sigma_{WYR} := (\mathcal{N}_{X \to W} \otimes \mathrm{id}_{YR})\,\Psi_{XYR}$ and $\Psi_{XYR}$ a purification of $\rho_{XY}$.
QIB. Lemma: For $X$ and $Y$ quantum, and $W$ classical, an optimal solution for the quantum information bottleneck can be achieved with $|W| \le |Y|^2 |R|^2 + 1$. Lemma: For $Y$ quantum, but $X$ and $W$ classical, an optimal solution for the quantum information bottleneck can be achieved with $|W| \le |X| + 1$.
QIB. Lemma: Given a classical-quantum state $\rho_{XY} = \sum_x p(x)\, |x\rangle\langle x| \otimes \rho^x_Y$ (1), assume that there exist $N$ quantum states $\sigma^1_Y, \ldots, \sigma^N_Y$ and a function $f : X \to [N]$ with the property that $\forall x:\ \frac{1}{2}\| \rho^x_Y - \sigma^{f(x)}_Y \|_1 \le \epsilon$ (2) for given $\epsilon > 0$. Then there exists a recovery channel $S : [N] \to X$ such that the Markov chain $Y - X - X' - \widehat{X}$ defined by $X' = f(X)$ and $P_{\widehat{X}|X'} = S$ satisfies $P_X = P_{\widehat{X}}$ and $\frac{1}{2}\| \rho_{XY} - \rho_{\widehat{X}Y} \|_1 \le \epsilon' = 2\epsilon$.
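Condition (2) replaces total variation distance with trace distance. A minimal sketch of how it could be checked for given states, assuming density matrices are stored as Hermitian NumPy arrays (the helper names and the qubit example are hypothetical):

```python
import numpy as np

def trace_distance(rho, sigma):
    """(1/2) || rho - sigma ||_1, computed from the eigenvalues of the Hermitian difference."""
    eigenvalues = np.linalg.eigvalsh(rho - sigma)
    return 0.5 * float(np.abs(eigenvalues).sum())

def max_deviation(states, centers, f):
    """Smallest eps for which condition (2) holds for the given states, centers and map f."""
    return max(trace_distance(states[x], centers[f[x]]) for x in range(len(states)))

# Hypothetical qubit example: |0><0|, |1><1| and |+><+|.
rho_0 = np.array([[1.0, 0.0], [0.0, 0.0]])
rho_1 = np.array([[0.0, 0.0], [0.0, 1.0]])
rho_plus = 0.5 * np.array([[1.0, 1.0], [1.0, 1.0]])

states = [rho_0, rho_plus, rho_1]
centers = [rho_0, rho_1]
f = [0, 0, 1]                       # |+><+| is grouped with |0><0|
eps = max_deviation(states, centers, f)
```

In this example `eps` is the trace distance between $|0\rangle\langle 0|$ and $|+\rangle\langle +|$, so the grouping is only useful when that accuracy is acceptable.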
QIB. For $Y$ quantum, but $X$ and $W$ classical: Corollary: Under the assumptions of the previous lemma, $I^{cq}_{X'Y}(R) \le I^{cq}_{XY}(R) \le I^{cq}_{X'Y}(R) + \delta(\epsilon,|Y|)$, where $\delta(\epsilon,|Y|) := \epsilon' \log|Y| + (1+\epsilon')\, h\!\left(\frac{\epsilon'}{1+\epsilon'}\right)$. Corollary: Under the assumptions of the previous lemma, $I^{cq}_{XY}(R,N) \le I^{cq}_{XY}(R) \le I^{cq}_{XY}(R,N) + \delta(\epsilon,|Y|)$, where $\delta(\epsilon,|Y|)$ is as before and $N \le \left(\frac{3}{\epsilon}\right)^{2|Y|^2}$.
The End. Summary: A new approach to alphabet-size bounds via recoverability. New bounds on approximating the IB with alphabet size limited by $|Y|$ (instead of $|X|$). Open Questions: Other applications of the recoverability approach. The fully quantum case. (Stay tuned for more on this soon [1].) Thanks!! [1] M. Christandl, CH, AW, in preparation, 2020.