Comparison of Local and Global Contraction Coefficients for KL Divergence
Anuran Makur and Lizhong Zheng
EECS Department, Massachusetts Institute of Technology
5 November 2015
A. Makur & L. Zheng (MIT), Local and Global Contraction Coefficients, 5 November 2015
Outline

1. Introduction to Contraction Coefficients
   Measuring Ergodicity
   Contraction Coefficients of Strong Data Processing Inequalities
2. Motivation from Inference
3. Contraction Coefficients for KL and $\chi^2$-Divergences
4. Bounds between Contraction Coefficients
Measuring Ergodicity

Consider an ergodic Markov chain with $n \times n$ column stochastic transition matrix $W$.
irreducible $\Rightarrow$ unique stationary distribution $\pi$: $W\pi = \pi$
aperiodic $\Rightarrow$ $W^k \to \pi \mathbf{1}^T$ (rank 1 matrix)
Rate of convergence?
Perron-Frobenius: $1 = \lambda_1(W) > |\lambda_2(W)| \geq \cdots \geq |\lambda_n(W)|$
Rate of convergence determined by $|\lambda_2(W)|$ $\leftarrow$ coefficient of ergodicity
Want: A guarantee on the relative improvement, i.e. for any distribution $p$, $W^{k+1}p$ is "closer" to $\pi$ than $W^k p$.
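The convergence rate above can be checked numerically on a small chain. The sketch below is only an illustration, not part of the talk: the two-state matrix, the parameters `a` and `b`, and the helper names are all assumptions chosen for this example. For a two-state chain the second eigenvalue is $1 - a - b$, and the $\ell^1$ distance to $\pi$ shrinks by exactly $|\lambda_2(W)|$ at every step; for larger chains this ratio is generally only reached asymptotically.

```python
def mat_vec(W, p):
    """Apply a column-stochastic matrix W (given as a list of rows) to a pmf p."""
    return [sum(W[i][j] * p[j] for j in range(len(p))) for i in range(len(W))]


def l1_distance(p, q):
    """l1 distance between two vectors of equal length."""
    return sum(abs(x - y) for x, y in zip(p, q))


# Hypothetical two-state chain: each column sums to 1.
a, b = 0.3, 0.1
W = [[1 - a, b],
     [a, 1 - b]]

# Closed forms that hold for a two-state chain.
pi = [b / (a + b), a / (a + b)]   # stationary distribution, W pi = pi
lambda_2 = 1 - a - b              # second eigenvalue of W

# Track the distance to pi along the trajectory p, Wp, W^2 p, ...
p = [1.0, 0.0]
distances = []
for k in range(6):
    distances.append(l1_distance(p, pi))
    p = mat_vec(W, p)

# Ratio of successive distances; equals |lambda_2| for this chain.
ratios = [distances[k + 1] / distances[k] for k in range(5)]
```

Here every entry of `ratios` matches `abs(lambda_2)` up to floating-point error, which is the geometric convergence the slide refers to.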
Measuring Ergodicity

Let $d : \mathcal{P} \times \mathcal{P} \to [0, \infty]$ be a divergence measure on the simplex $\mathcal{P}$.
Want:
$$\forall p \in \mathcal{P}, \quad d(Wp, \underbrace{W\pi}_{=\,\pi}) \leq \eta_d(\pi, W)\, d(p, \pi)$$
for some contraction coefficient $\eta_d(\pi, W) \in [0, 1]$.
This would mean that:
$$\forall p \in \mathcal{P}, \quad d(W^k p, \pi) \leq \eta_d(\pi, W)^k\, d(p, \pi).$$
$\eta_d(\pi, W) < 1 \Rightarrow W^k p \xrightarrow{d} \pi$ geometrically fast with rate $\eta_d(\pi, W)$.
So, $\eta_d(\pi, W)$ is a coefficient of ergodicity, and we define it as:
$$\eta_d(\pi, W) \triangleq \sup_{p \,:\, p \neq \pi} \frac{d(Wp, W\pi)}{d(p, \pi)}.$$
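As a rough illustration of this definition, the supremum can be approximated by a grid search when the simplex is one-dimensional. The sketch below takes $d$ to be KL divergence, which is an assumption made for the example (the slide leaves $d$ generic), and reuses a hypothetical two-state chain. Because the grid is finite, `eta_hat` is only a lower estimate of $\eta_d(\pi, W)$, not its exact value.

```python
import math


def mat_vec(W, p):
    """Apply a column-stochastic matrix W (given as a list of rows) to a pmf p."""
    return [sum(W[i][j] * p[j] for j in range(len(p))) for i in range(len(W))]


def kl(p, q):
    """KL divergence D(p || q) in nats, assuming q has full support."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Hypothetical two-state chain and its stationary distribution.
a, b = 0.3, 0.1
W = [[1 - a, b],
     [a, 1 - b]]
pi = [b / (a + b), a / (a + b)]

# Grid search over the binary simplex for the ratio d(Wp, W pi) / d(p, pi).
# Since W pi = pi, the numerator is d(Wp, pi).
eta_hat = 0.0
grid_size = 1000
for i in range(1, grid_size):
    p = [i / grid_size, 1 - i / grid_size]
    denom = kl(p, pi)
    if denom < 1e-12:
        continue  # skip p = pi, where the ratio is undefined
    eta_hat = max(eta_hat, kl(mat_vec(W, p), pi) / denom)
```

The estimate stays at or below 1, as the data processing inequality requires, and a finer grid or a proper optimizer would be needed to pin down the supremum more tightly.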
Measuring Ergodicity

Can we define notions of distance between distributions which make $W$ a contraction?
Does the $\ell^2$-norm work?
$$\|W\pi - Wp\|_2 = \|W(\pi - p)\|_2 \leq \|W\|_2 \, \|\pi - p\|_2$$
where the spectral norm $\|W\|_2$ is the largest singular value of $W$.
$\|W\|_2 > 1$ is possible ...
Dobrushin-Doeblin Coefficient of Ergodicity: The $\ell^1$-norm (total variation distance) works!
$$\|W\pi - Wp\|_1 = \|W(\pi - p)\|_1 \leq \eta_{\mathsf{TV}}(\pi, W) \, \|\pi - p\|_1$$
where
$$\eta_{\mathsf{TV}}(\pi, W) \triangleq \sup_{p \,:\, p \neq \pi} \frac{\|W\pi - Wp\|_1}{\|\pi - p\|_1} \in [0, 1]$$
is the Dobrushin-Doeblin contraction coefficient.
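A small numerical example may make the contrast between the two norms concrete. The three-state matrix, the perturbation direction, and the helper names below are assumptions picked for illustration and do not come from the talk. The chain has all positive entries, so it is ergodic, yet one step of it increases the $\ell^2$ distance to $\pi$ while decreasing the $\ell^1$ distance.

```python
import math


def mat_vec(W, p):
    """Apply a column-stochastic matrix W (given as a list of rows) to a vector p."""
    return [sum(W[i][j] * p[j] for j in range(len(p))) for i in range(len(W))]


def l1_norm(v):
    return sum(abs(x) for x in v)


def l2_norm(v):
    return math.sqrt(sum(x * x for x in v))


# Hypothetical ergodic chain: states 1 and 2 both move mostly to state 1,
# and state 3 mostly stays put. Each column sums to 1.
W = [[0.98, 0.98, 0.01],
     [0.01, 0.01, 0.01],
     [0.01, 0.01, 0.98]]

# Stationary distribution of this W, worked out by hand from W pi = pi.
pi = [197 / 300, 1 / 100, 1 / 3]

# Perturb pi along the zero-sum direction (1/2, 1/2, -1); p stays a valid pmf.
t = 0.01
p = [pi[0] + 0.5 * t, pi[1] + 0.5 * t, pi[2] - t]

diff_in = [x - y for x, y in zip(p, pi)]
diff_out = [x - y for x, y in zip(mat_vec(W, p), mat_vec(W, pi))]

l2_ratio = l2_norm(diff_out) / l2_norm(diff_in)   # exceeds 1: l2 expands
l1_ratio = l1_norm(diff_out) / l1_norm(diff_in)   # at most 1: l1 contracts
```

For this choice the $\ell^2$ ratio comes out near 1.12 and the $\ell^1$ ratio near 0.97, which is consistent with the slide's point that $\|W\|_2 > 1$ can occur while the $\ell^1$ norm always contracts.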
Csiszár $f$-Divergence

Definition (Csiszár $f$-Divergence): Given distributions $R_X$ and $P_X$ on $\mathcal{X}$, we define their $f$-divergence as:
$$D_f(R_X \| P_X) \triangleq \sum_{x \in \mathcal{X}} P_X(x) \, f\!\left(\frac{R_X(x)}{P_X(x)}\right)$$
where $f : \mathbb{R}^+ \to \mathbb{R}$ is convex and $f(1) = 0$.
Non-negativity: $D_f(R_X \| P_X) \geq 0$ with equality iff $R_X = P_X$.
Data Processing Inequality: For a fixed channel $P_{Y|X}$:
$$\forall R_X, P_X, \quad D_f(R_Y \| P_Y) \leq D_f(R_X \| P_X)$$
where $R_Y$ and $P_Y$ are the output pmfs corresponding to $R_X$ and $P_X$.
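The definition and the data processing inequality can be sanity-checked on a toy example. The sketch below is a minimal illustration under stated assumptions: the input pmfs, the channel matrix, and the function names are hypothetical, $P_X$ is assumed to have full support, and the two choices $f(t) = t \log t$ (KL divergence) and $f(t) = (t-1)^2$ ($\chi^2$-divergence) are the standard ones for those divergences.

```python
import math


def f_divergence(r, p, f):
    """D_f(r || p) = sum_x p(x) f(r(x) / p(x)), assuming p(x) > 0 for all x."""
    return sum(px * f(rx / px) for rx, px in zip(r, p))


def f_kl(t):
    """f(t) = t log t, with the convention 0 log 0 = 0; gives KL divergence."""
    return t * math.log(t) if t > 0 else 0.0


def f_chi2(t):
    """f(t) = (t - 1)^2; gives the chi-squared divergence."""
    return (t - 1) ** 2


def channel_output(P_Y_given_X, p_x):
    """Push an input pmf through a channel stored as a column-stochastic matrix
    (rows indexed by y, columns indexed by x)."""
    return [sum(row[x] * p_x[x] for x in range(len(p_x))) for row in P_Y_given_X]


# Hypothetical input pmfs on a 3-letter alphabet and a 3-input, 2-output channel.
R_X = [0.2, 0.5, 0.3]
P_X = [0.4, 0.4, 0.2]
P_Y_given_X = [[0.9, 0.5, 0.2],
               [0.1, 0.5, 0.8]]

R_Y = channel_output(P_Y_given_X, R_X)
P_Y = channel_output(P_Y_given_X, P_X)

# Divergences before and after the channel, for both choices of f.
results = {
    name: (f_divergence(R_X, P_X, f), f_divergence(R_Y, P_Y, f))
    for name, f in [("KL", f_kl), ("chi2", f_chi2)]
}
```

Each entry of `results` is an (input divergence, output divergence) pair, and in both cases the output divergence is no larger than the input divergence, as the inequality predicts.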