On Data-Processing and Majorization Inequalities for f-Divergences

Igal Sason
EE Department, Technion - Israel Institute of Technology

IZS 2020, Zurich, Switzerland, February 26-28, 2020
Introduction

f-Divergences

f-divergences form a general class of divergence measures which are commonly used in information theory, learning theory and related fields.

I. Csiszár, "Eine informationstheoretische Ungleichung und ihre Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten," Publ. Math. Inst. Hungar. Acad. Sci., vol. 8, pp. 85-108, Jan. 1963.
I. Csiszár, "On topological properties of f-divergences," Studia Scientiarum Mathematicarum Hungarica, vol. 2, pp. 329-339, Jan. 1967.
I. Csiszár, "A class of measures of informativity of observation channels," Periodica Mathematica Hungarica, vol. 2, pp. 191-213, Mar. 1972.
S. M. Ali and S. D. Silvey, "A general class of coefficients of divergence of one distribution from another," Journal of the Royal Statistical Society, series B, vol. 28, no. 1, pp. 131-142, Jan. 1966.
This Talk is Restricted to the Discrete Setting

f: (0, ∞) → R is a convex function with f(1) = 0;
P, Q are probability mass functions defined on a (finite or countably infinite) set X.

f-Divergence: Definition

The f-divergence from P to Q is given by

D_f(P \| Q) := \sum_{x \in \mathcal{X}} Q(x) \, f\!\left( \frac{P(x)}{Q(x)} \right)

with the convention that

f(0) := \lim_{t \downarrow 0} f(t), \quad 0 f\!\left( \frac{0}{0} \right) := 0, \quad 0 f\!\left( \frac{a}{0} \right) := \lim_{t \downarrow 0} t f\!\left( \frac{a}{t} \right) = a \lim_{u \to \infty} \frac{f(u)}{u}, \quad a > 0.
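As a concrete illustration of this definition (a sketch of ours, not part of the talk), the following Python function computes D_f(P‖Q) for finite alphabets, including the zero-mass conventions above; the function and argument names are illustrative.

```python
import numpy as np

def f_divergence(P, Q, f, f_limit_ratio=np.inf):
    """D_f(P||Q) = sum_x Q(x) f(P(x)/Q(x)) for finite pmfs P, Q.

    f must implement f(0) = lim_{t->0} f(t) when needed;
    f_limit_ratio is lim_{u->inf} f(u)/u, used for the convention
    0*f(a/0) := a * lim_{u->inf} f(u)/u when Q(x) = 0 < P(x).
    """
    total = 0.0
    for p, q in zip(P, Q):
        if q > 0:
            total += q * f(p / q)        # regular term (covers f(0) when p = 0)
        elif p > 0:                      # Q(x) = 0 < P(x)
            total += p * f_limit_ratio
        # p = q = 0 contributes 0 by convention
    return total
```

For instance, f_divergence([1, 0], [0.5, 0.5], lambda t: t*np.log(t) if t > 0 else 0.0) returns log 2 nats, the relative entropy of a point mass from the uniform distribution on two elements.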
f-divergences: Examples

Relative entropy:

f(t) = t \log t, \ t > 0 \ \Rightarrow \ D_f(P \| Q) = D(P \| Q),
f(t) = -\log t, \ t > 0 \ \Rightarrow \ D_f(P \| Q) = D(Q \| P).

Total variation (TV) distance:

f(t) = |t - 1|, \ t \ge 0 \ \Rightarrow \ D_f(P \| Q) = |P - Q| := \sum_{x \in \mathcal{X}} |P(x) - Q(x)|.

Chi-squared divergence:

f(t) = (t - 1)^2, \ t \ge 0 \ \Rightarrow \ D_f(P \| Q) = \chi^2(P \| Q) := \sum_{x \in \mathcal{X}} \frac{\bigl(P(x) - Q(x)\bigr)^2}{Q(x)}.
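A quick numerical cross-check of these three special cases (our own sketch; it assumes P and Q have full support, so none of the zero-mass conventions are needed):

```python
import numpy as np

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.25, 0.25, 0.5])

# generic D_f for full-support pmfs (natural logarithms, i.e. nats)
Df = lambda f: np.sum(Q * f(P / Q))

kl   = Df(lambda t: t * np.log(t))        # relative entropy D(P||Q)
tv   = Df(lambda t: np.abs(t - 1.0))      # total variation |P - Q|
chi2 = Df(lambda t: (t - 1.0) ** 2)       # chi-squared divergence

print(np.isclose(kl,   np.sum(P * np.log(P / Q))))     # True
print(np.isclose(tv,   np.sum(np.abs(P - Q))))         # True
print(np.isclose(chi2, np.sum((P - Q) ** 2 / Q)))      # True
```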
f-divergences: Examples (cont.)

E_γ divergence (Polyanskiy, Poor and Verdú, IEEE T-IT, 2010)

For γ ≥ 1,

E_\gamma(P \| Q) := D_{f_\gamma}(P \| Q)

with f_γ(t) = (t - γ)^+ for t > 0, and (x)^+ := max{x, 0}.

E_1(P‖Q) = ½ |P - Q|, so the E_γ divergence generalizes the TV distance.

E_\gamma(P \| Q) = \max_{\mathcal{E} \subseteq \mathcal{X}} \bigl\{ P(\mathcal{E}) - \gamma \, Q(\mathcal{E}) \bigr\}.

Other Important f-divergences

Triangular discrimination (Vincze-Le Cam distance '81; Topsøe 2000);
Jensen-Shannon divergence (Lin 1991; Topsøe 2000);
DeGroot statistical information (DeGroot '62; Liese & Vajda '06); see later;
Marton's divergence (Marton 1996; Samson 2000).
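The identity between the f-divergence form and the maximization over events can be checked by brute force on a small alphabet (our own sketch; finite alphabets only, variable names are illustrative):

```python
import numpy as np
from itertools import chain, combinations

P = np.array([0.5, 0.3, 0.2])
Q = np.array([0.25, 0.25, 0.5])
gamma = 1.5

# f-divergence form: sum_x Q(x) (P(x)/Q(x) - gamma)^+ = sum_x (P(x) - gamma Q(x))^+
e_gamma_f = np.sum(np.maximum(P - gamma * Q, 0.0))

# variational form: max over all events E of P(E) - gamma Q(E)
idx = range(len(P))
subsets = chain.from_iterable(combinations(idx, r) for r in range(len(P) + 1))
e_gamma_var = max(sum(P[i] for i in E) - gamma * sum(Q[i] for i in E) for E in subsets)

print(np.isclose(e_gamma_f, e_gamma_var))   # True
```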
Data-Processing Inequality for f-Divergences

Let
X and Y be finite or countably infinite sets;
P_X and Q_X be probability mass functions that are supported on X;
W_{Y|X}: X → Y be a stochastic transformation;
P_Y := P_X W_{Y|X} and Q_Y := Q_X W_{Y|X} be the output distributions;
f: (0, ∞) → R be a convex function with f(1) = 0.

Then,

D_f(P_Y \| Q_Y) \le D_f(P_X \| Q_X).
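The inequality is easy to observe numerically; the following sketch (ours, with arbitrary random choices) pushes two input pmfs through a random channel and compares the relative entropies:

```python
import numpy as np

rng = np.random.default_rng(0)

P_X = rng.dirichlet(np.ones(4))             # two input pmfs on |X| = 4
Q_X = rng.dirichlet(np.ones(4))
W = rng.dirichlet(np.ones(3), size=4)       # row-stochastic channel W_{Y|X}, |Y| = 3

P_Y, Q_Y = P_X @ W, Q_X @ W                 # output pmfs P_Y = P_X W, Q_Y = Q_X W

D = lambda P, Q: np.sum(P * np.log(P / Q))  # relative entropy (full support here)
print(D(P_Y, Q_Y) <= D(P_X, Q_X))           # data-processing inequality: True
```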
Contraction Coefficient for f-Divergences

Let
Q_X be a probability mass function defined on a set X that is not a point mass;
W_{Y|X}: X → Y be a stochastic transformation.

The contraction coefficient for f-divergences is defined as

\mu_f(Q_X, W_{Y|X}) := \sup_{P_X : \, D_f(P_X \| Q_X) \in (0, \infty)} \frac{D_f(P_Y \| Q_Y)}{D_f(P_X \| Q_X)}.
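Since μ_f is a supremum over input distributions, even a crude Monte-Carlo search only yields a lower estimate; the sketch below (ours, with illustrative names) is meant to make the definition concrete rather than to compute μ_f exactly.

```python
import numpy as np

def contraction_coeff_lower_est(Q_X, W, f, trials=20000, seed=1):
    """Monte-Carlo lower estimate of mu_f(Q_X, W_{Y|X}) by sampling inputs P_X."""
    rng = np.random.default_rng(seed)
    Q_Y = Q_X @ W
    Df = lambda P, Q: np.sum(Q * f(P / Q))      # full-support pmfs assumed
    best = 0.0
    for _ in range(trials):
        P_X = rng.dirichlet(np.ones(len(Q_X)))
        num, den = Df(P_X @ W, Q_Y), Df(P_X, Q_X)
        if 0.0 < den < np.inf:
            best = max(best, num / den)
    return best

rng = np.random.default_rng(2)
Q_X = rng.dirichlet(np.ones(3))
W = rng.dirichlet(np.ones(3), size=3)
print(contraction_coeff_lower_est(Q_X, W, lambda t: (t - 1.0) ** 2))  # chi-squared case
```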
Strong Data Processing Inequalities (SDPI)

If μ_f(Q_X, W_{Y|X}) < 1, then

D_f(P_Y \| Q_Y) \le \mu_f(Q_X, W_{Y|X}) \, D_f(P_X \| Q_X).

Contraction coefficients for f-divergences play a key role in strong data-processing inequalities: Ahlswede and Gács ('76); Cohen et al. ('93); Raginsky ('16); Polyanskiy and Wu ('16, '17); Makur, Polyanskiy and Wu ('18).
New Results: SDPI for f-divergences

Theorem 1: SDPI for f-divergences

Let

\xi_1 := \inf_{x \in \mathcal{X}} \frac{P_X(x)}{Q_X(x)} \in [0, 1], \qquad \xi_2 := \sup_{x \in \mathcal{X}} \frac{P_X(x)}{Q_X(x)} \in [1, \infty],

and let c_f := c_f(ξ_1, ξ_2) ≥ 0 and d_f := d_f(ξ_1, ξ_2) ≥ 0 satisfy

2 c_f \le \frac{f'_+(v) - f'_+(u)}{v - u} \le 2 d_f, \qquad \forall \, u, v \in I, \ u < v,

where f'_+ is the right-side derivative of f, and I := [ξ_1, ξ_2] ∩ (0, ∞). Then,

d_f \bigl[ \chi^2(P_X \| Q_X) - \chi^2(P_Y \| Q_Y) \bigr] \ge D_f(P_X \| Q_X) - D_f(P_Y \| Q_Y) \ge c_f \bigl[ \chi^2(P_X \| Q_X) - \chi^2(P_Y \| Q_Y) \bigr] \ge 0.
Theorem 1: SDPI (Cont.)

If f is twice differentiable on I, then the best coefficients are given by

c_f = \frac{1}{2} \inf_{t \in I(\xi_1, \xi_2)} f''(t), \qquad d_f = \frac{1}{2} \sup_{t \in I(\xi_1, \xi_2)} f''(t).

This SDPI is Locally Tight

Let

\lim_{n \to \infty} \inf_{x \in \mathcal{X}} \frac{P_X^{(n)}(x)}{Q_X(x)} = 1, \qquad \lim_{n \to \infty} \sup_{x \in \mathcal{X}} \frac{P_X^{(n)}(x)}{Q_X(x)} = 1.

If f has a continuous second derivative at unity, then

\lim_{n \to \infty} \frac{D_f(P_X^{(n)} \| Q_X) - D_f(P_Y^{(n)} \| Q_Y)}{\chi^2(P_X^{(n)} \| Q_X) - \chi^2(P_Y^{(n)} \| Q_Y)} = \frac{1}{2} f''(1).
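As a sanity check (our own sketch, not from the talk): for the relative entropy, f(t) = t log t has f''(t) = 1/t, so on I = [ξ_1, ξ_2] the coefficients c_f = 1/(2ξ_2) and d_f = 1/(2ξ_1) are admissible (in nats), and Theorem 1's sandwich can be verified numerically for a random channel.

```python
import numpy as np

rng = np.random.default_rng(3)
P_X, Q_X = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
W = rng.dirichlet(np.ones(4), size=4)
P_Y, Q_Y = P_X @ W, Q_X @ W

D    = lambda P, Q: np.sum(P * np.log(P / Q))      # relative entropy (nats)
chi2 = lambda P, Q: np.sum((P - Q) ** 2 / Q)

xi1, xi2 = np.min(P_X / Q_X), np.max(P_X / Q_X)
c_f, d_f = 1.0 / (2.0 * xi2), 1.0 / (2.0 * xi1)    # from f''(t) = 1/t on [xi1, xi2]

dD   = D(P_X, Q_X) - D(P_Y, Q_Y)
dChi = chi2(P_X, Q_X) - chi2(P_Y, Q_Y)
print(c_f * dChi <= dD <= d_f * dChi)              # Theorem 1's sandwich: True
```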
Advantage: Tensorization of the Chi-Squared Divergence

\chi^2(P_1 \times \ldots \times P_m \,\|\, Q_1 \times \ldots \times Q_m) = \prod_{i=1}^{m} \bigl( 1 + \chi^2(P_i \| Q_i) \bigr) - 1.
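A short numerical confirmation of the tensorization identity for m = 2 (our own sketch):

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(4)
Ps = [rng.dirichlet(np.ones(3)) for _ in range(2)]
Qs = [rng.dirichlet(np.ones(3)) for _ in range(2)]

chi2 = lambda P, Q: np.sum((P - Q) ** 2 / Q)
product_pmf = lambda Ms: np.array([np.prod(v) for v in product(*Ms)])  # pmf of M_1 x M_2

lhs = chi2(product_pmf(Ps), product_pmf(Qs))
rhs = np.prod([1.0 + chi2(P, Q) for P, Q in zip(Ps, Qs)]) - 1.0
print(np.isclose(lhs, rhs))   # True
```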
Theorem 2: SDPI for f-divergences

Let f: (0, ∞) → R satisfy the conditions:
f is a convex function, differentiable at 1, f(1) = 0, and f(0) := lim_{t → 0+} f(t) < ∞;
the function g: (0, ∞) → R, defined by g(t) := (f(t) - f(0))/t for all t > 0, is convex.

Let

\kappa(\xi_1, \xi_2) := \sup_{t \in (\xi_1, 1) \cup (1, \xi_2)} \frac{f(t) + f'(1)\,(1 - t)}{(t - 1)^2}.

Then,

\bigl( f(0) + f'(1) \bigr) \, \frac{\chi^2(P_Y \| Q_Y)}{\chi^2(P_X \| Q_X)} \;\le\; \kappa(\xi_1, \xi_2) \, \frac{D_f(P_Y \| Q_Y)}{D_f(P_X \| Q_X)}.
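A rough numerical check of the inequality as reconstructed above (our own sketch; the generator f(t) = t^3 - 1 is just an illustrative choice satisfying the hypotheses: it is convex with f(1) = 0, f(0) = -1 is finite, and g(t) = (f(t) - f(0))/t = t^2 is convex):

```python
import numpy as np

rng = np.random.default_rng(5)
P_X, Q_X = rng.dirichlet(np.ones(4)), rng.dirichlet(np.ones(4))
W = rng.dirichlet(np.ones(4), size=4)
P_Y, Q_Y = P_X @ W, Q_X @ W

f = lambda t: t ** 3 - 1.0                 # convex, f(1) = 0, f(0) = -1, g(t) = t^2 convex
f0, df1 = -1.0, 3.0                        # f(0) and f'(1)

Df   = lambda P, Q: np.sum(Q * f(P / Q))
chi2 = lambda P, Q: np.sum((P - Q) ** 2 / Q)

xi1, xi2 = np.min(P_X / Q_X), np.max(P_X / Q_X)
ts = np.linspace(xi1, xi2, 100001)
ts = ts[np.abs(ts - 1.0) > 1e-9]           # exclude t = 1 from the supremum
kappa = np.max((f(ts) + df1 * (1.0 - ts)) / (ts - 1.0) ** 2)

lhs = (f0 + df1) * chi2(P_Y, Q_Y) / chi2(P_X, Q_X)
rhs = kappa * Df(P_Y, Q_Y) / Df(P_X, Q_X)
print(lhs <= rhs + 1e-12)                  # True
```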
Numerical Results

The tightness of the bounds (SDPIs) in Theorems 1 and 2 was exemplified numerically for transmission over a BEC and a BSC.
Application: List Decoding Error Bounds

List Decoding

The decision rule outputs a list of choices.
The extension of Fano's inequality to list decoding, expressed in terms of H(X|Y), was initiated by Ahlswede, Gács and Körner ('75).
Useful to prove converse results (jointly with the blowing-up lemma).

Generalized Fano's Inequality for Fixed List Size

H(X|Y) \le \log M - d\!\left( P_{\mathcal{L}} \,\Big\|\, 1 - \frac{L}{M} \right)

where d(·‖·) denotes the binary relative entropy:

d(x \| y) := x \log\frac{x}{y} + (1 - x) \log\frac{1 - x}{1 - y}, \qquad x, y \in (0, 1).
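Expanding the binary relative entropy shows that this is the familiar form of the fixed-list-size Fano bound; the following identity (a short check of ours, not shown on the slide, with h(·) denoting the binary entropy function) makes the step explicit:

\log M - d\!\left( P_{\mathcal{L}} \,\Big\|\, 1 - \frac{L}{M} \right) = h(P_{\mathcal{L}}) + P_{\mathcal{L}} \log(M - L) + (1 - P_{\mathcal{L}}) \log L.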
Theorem 3: Tightened Bound by Strong DPI (SDPI)

Let P_XY be a probability measure defined on X × Y with |X| = M. Consider a decision rule L: Y → \binom{\mathcal{X}}{L}, where \binom{\mathcal{X}}{L} stands for the set of subsets of X with cardinality L, and L < M is fixed. Denote the list decoding error probability by P_L := P[X ∉ L(Y)].

If the L most probable elements of X are selected, given Y ∈ Y, then

H(X|Y) \le \log M - d\!\left( P_{\mathcal{L}} \,\Big\|\, 1 - \frac{L}{M} \right) - \frac{\log e}{2 \sup_{(x,y) \in \mathcal{X} \times \mathcal{Y}} P_{X|Y}(x|y)} \left( \mathbb{E}\bigl[ P_{X|Y}(X|Y) \bigr] - \frac{1 - P_{\mathcal{L}}}{L} \right).

Proof: use Theorem 1 (our first SDPI) with f(t) = t log t for t > 0, with P_{X|Y=y} and with Q_{X|Y=y} equiprobable over {1, ..., M}, with the deterministic channel W_{Z|X,Y=y} whose output is 1 if X ∈ L(y) and 0 if X ∉ L(y), and average over Y.

Numerical experimentation exemplifies this improvement.
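The following sketch (ours; it relies on the reconstruction of the display above, and all names and parameter values are illustrative) compares H(X|Y), the generalized Fano bound, and the tightened bound on a random joint distribution, using base-2 logarithms:

```python
import numpy as np

rng = np.random.default_rng(6)
M, L, Ny = 8, 2, 5
P_XY = rng.dirichlet(np.ones(M * Ny)).reshape(M, Ny)      # joint pmf on X x Y
P_Y = P_XY.sum(axis=0)
P_XgY = P_XY / P_Y                                        # columns are P_{X|Y}(.|y)

H_XgY = -np.sum(P_XY * np.log2(P_XgY))                    # H(X|Y) in bits

top_L = np.argsort(-P_XgY, axis=0)[:L, :]                 # L most probable x per y
P_err = 1.0 - sum(P_XY[top_L[:, y], y].sum() for y in range(Ny))

d = lambda a, b: a * np.log2(a / b) + (1 - a) * np.log2((1 - a) / (1 - b))

fano = np.log2(M) - d(P_err, 1.0 - L / M)
corr = (np.log2(np.e) / (2.0 * np.max(P_XgY))) * (np.sum(P_XY * P_XgY) - (1.0 - P_err) / L)
print(H_XgY, fano, fano - corr)    # conditional entropy and the two upper bounds
```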
Generalized Fano's Inequality for Variable List Size (1975)

Let P_XY be a probability measure defined on X × Y with |X| = M;
consider a decision rule L: Y → 2^X;
let the (average) list decoding error probability be given by P_L := P[X ∉ L(Y)], with |L(y)| ≥ 1 for all y ∈ Y.

Then,

H(X|Y) \le h(P_{\mathcal{L}}) + \mathbb{E}\bigl[ \log |\mathcal{L}(Y)| \bigr] + P_{\mathcal{L}} \log M.
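This bound can also be checked numerically; the sketch below (ours, with an arbitrarily chosen thresholding rule) builds a variable-size list from the posterior, keeping at least one element per y, and evaluates both sides in bits:

```python
import numpy as np

rng = np.random.default_rng(7)
M, Ny = 8, 5
P_XY = rng.dirichlet(np.ones(M * Ny)).reshape(M, Ny)     # joint pmf on X x Y
P_Y = P_XY.sum(axis=0)
P_XgY = P_XY / P_Y                                       # columns are P_{X|Y}(.|y)
H_XgY = -np.sum(P_XY * np.log2(P_XgY))                   # H(X|Y) in bits

# variable-size list: all x with posterior >= 0.2, keeping at least one element
lists = []
for y in range(Ny):
    members = np.where(P_XgY[:, y] >= 0.2)[0]
    lists.append(members if members.size > 0 else np.array([np.argmax(P_XgY[:, y])]))

P_err = 1.0 - sum(P_XY[lists[y], y].sum() for y in range(Ny))
h = lambda p: 0.0 if p in (0.0, 1.0) else -p * np.log2(p) - (1 - p) * np.log2(1 - p)
E_logL = sum(P_Y[y] * np.log2(len(lists[y])) for y in range(Ny))

bound = h(P_err) + E_logL + P_err * np.log2(M)
print(H_XgY <= bound + 1e-12)                            # generalized Fano bound: True
```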