Vacuous Mutual Information & Mis-Estimation

Proposition (Informal)
Deterministic DNNs with strictly monotone nonlinearities (e.g., tanh or sigmoid) ⇒ I(X; T_ℓ) is independent of the DNN parameters: it is a.s. infinite (continuous X) or constant H(X) (discrete X).

Past Works: use a binning-based proxy of I(X; T_ℓ) (aka quantization)
1. For non-negligible bin size, I(X; Bin(T_ℓ)) ≠ I(X; T_ℓ)
2. I(X; Bin(T_ℓ)) is highly sensitive to the user-defined bin size:
[Figure: binned MI (nats) of layers 1–5 vs. training epoch, for bin sizes 0.0001, 0.001, 0.01, and 0.1]

⊛ Real Problem: mismatch between the I(X; T_ℓ) measurement and the model
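A minimal sketch of the binning proxy used in past works, assuming inputs are identified by discrete labels `x_ids` and the layer activations are a NumPy array; the bin size is precisely the user-set parameter whose effect the plot above illustrates.

```python
import numpy as np

def binned_mi(x_ids, activations, bin_size):
    """Binning-based proxy I(X; Bin(T)): discretize each activation coordinate into
    bins of width `bin_size`, then compute the discrete MI (in nats) between the
    input label and the resulting cell index."""
    cells = np.floor(activations / bin_size).astype(np.int64)     # (n_samples, d) cell ids
    _, t_ids = np.unique(cells, axis=0, return_inverse=True)      # one id per occupied cell
    joint = np.zeros((x_ids.max() + 1, t_ids.max() + 1))
    np.add.at(joint, (x_ids, t_ids), 1.0)                         # empirical joint counts
    p = joint / joint.sum()
    px, pt = p.sum(axis=1, keepdims=True), p.sum(axis=0, keepdims=True)
    nz = p > 0
    return float((p[nz] * np.log(p[nz] / (px @ pt)[nz])).sum())
```

Sweeping the bin size over a few orders of magnitude is exactly the kind of experiment summarized in the plot above: the proxy changes qualitatively with the bin size even though the network itself is unchanged.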
Auxiliary Framework - Noisy Deep Neural Networks

Modification: inject (small) Gaussian noise into each neuron's output

Formally: T_ℓ = S_ℓ + Z_ℓ, where S_ℓ ≜ f_ℓ(T_{ℓ−1}) and Z_ℓ ∼ N(0, σ²I_d)
[Figure: noisy DNN diagram: X → f_1 → S_1 → (+Z_1) → T_1 → f_2 → S_2 → (+Z_2) → T_2 → ···]

⇒ X ↦ T_ℓ is a channel parametrized by the DNN's parameters
⇒ I(X; T_ℓ) is a function of the parameters!

⊛ Challenge: how to accurately track I(X; T_ℓ)?
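A minimal sketch of the noisy-layer modification, assuming a PyTorch-style tanh MLP; the noise level σ and the treatment of the last layer are illustrative choices, not values fixed by the talk.

```python
import torch
import torch.nn as nn

class NoisyTanhLayer(nn.Module):
    """One noisy layer: T_l = S_l + Z_l with S_l = tanh(W T_{l-1} + b), Z_l ~ N(0, sigma^2 I_d)."""
    def __init__(self, d_in, d_out, sigma=0.1):
        super().__init__()
        self.lin = nn.Linear(d_in, d_out)
        self.sigma = sigma

    def forward(self, t_prev):
        s = torch.tanh(self.lin(t_prev))       # deterministic part S_l = f_l(T_{l-1})
        z = self.sigma * torch.randn_like(s)   # injected Gaussian noise Z_l
        return s + z                           # noisy representation T_l


# Example: the 12-10-7-5-4-3-2 tanh architecture used later in the deck, with noise
# injected at every layer (classification head and loss omitted for brevity).
net = nn.Sequential(*[NoisyTanhLayer(a, b) for a, b in
                      zip([12, 10, 7, 5, 4, 3], [10, 7, 5, 4, 3, 2])])
```

Because the noise is additive and independent of the input, X ↦ T_ℓ becomes a genuinely stochastic channel, which is what makes I(X; T_ℓ) depend on the parameters.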
High-Dim. & Nonparametric Functional Estimation

Distill I(X; T_ℓ) estimation into noisy differential entropy estimation:
Estimate h(P ∗ N_σ) from n i.i.d. samples S^n ≜ (S_i)_{i=1}^n of P ∈ F_d (a nonparametric class) and knowledge of N_σ (the Gaussian measure N(0, σ²I_d)).

Theorem (ZG-Greenewald-Polyanskiy-Weed'19)
The sample complexity of any accurate estimator (additive gap η) is Ω(2^d / (ηd)).

Structured Estimator⋆: ĥ(S^n, σ) ≜ h(P̂_n ∗ N_σ), where P̂_n = (1/n) Σ_{i=1}^n δ_{S_i}
⋆ Efficient and parallelizable

Theorem (ZG-Greenewald-Polyanskiy-Weed'19)
For F^(SG)_{d,K} ≜ {P : P is K-subgaussian in R^d}, d ≥ 1, and σ > 0, we have
  sup_{P ∈ F^(SG)_{d,K}} E_{S^n} | h(P ∗ N_σ) − ĥ(S^n, σ) | ≤ c_{σ,K}^d · n^{−1/2}.

Optimality: ĥ(S^n, σ) attains the sharp dependence on both n and d!
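A minimal sketch of the structured estimator ĥ(S^n, σ) = h(P̂_n ∗ N_σ): the smoothed empirical measure is an equal-weight Gaussian mixture centered at the samples, and its differential entropy is approximated here by Monte Carlo with m auxiliary draws (m is an implementation choice of this sketch, not part of the estimator's definition).

```python
import numpy as np
from scipy.special import logsumexp

def entropy_plugin(samples, sigma, m=10_000, rng=None):
    """Monte Carlo estimate of h(P_hat_n * N_sigma) for samples of shape (n, d):
    the entropy of the equal-weight Gaussian mixture with centers `samples` and
    covariance sigma^2 I_d, via  h ≈ -E[log q(Y)]  with  Y ~ P_hat_n * N_sigma."""
    rng = np.random.default_rng(rng)
    n, d = samples.shape
    idx = rng.integers(n, size=m)
    y = samples[idx] + sigma * rng.standard_normal((m, d))   # draws from the mixture
    # (m, n) matrix of squared distances between Monte Carlo points and mixture centers
    sq = np.maximum((y**2).sum(1)[:, None] + (samples**2).sum(1)[None, :]
                    - 2.0 * y @ samples.T, 0.0)
    log_comp = -sq / (2 * sigma**2) - 0.5 * d * np.log(2 * np.pi * sigma**2)
    log_q = logsumexp(log_comp, axis=1) - np.log(n)          # log mixture density at each Y
    return float(-log_q.mean())
```

The inner distance matrix decomposes over (Monte Carlo point, sample) pairs, so the computation batches and parallelizes naturally; this is one way to realize the 'efficient and parallelizable' note above.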
I(X; T_ℓ) Dynamics - Illustrative Minimal Example

Single-neuron classification: T = tanh(wX + b) + Z, with Z ∼ N(0, σ²)
[Figure: X → tanh(wX + b) → S_{w,b} → (+Z) → T]

Input: X ∼ Unif{±1, ±3}, with classes X_{y=−1} ≜ {−3, −1, 1} and X_{y=1} ≜ {3}

[Figure: the neuron's response S_{1,0} vs. the centered & sharpened response S_{5,−10}]
⊛ Center & sharpen the transition (⇔ increase w while keeping b = −2w)
✓ Correct classification performance

Mutual Information: I(X; T) = I(S_{w,b}; S_{w,b} + Z)
⇒ I(X; T) is the number of bits (nats) transmittable over an AWGN channel with input symbols
  S_{w,b} ≜ {tanh(−3w + b), tanh(−w + b), tanh(w + b), tanh(3w + b)} → {±1}

[Figure: I(X; T) (nats) vs. training epoch, with reference levels ln(2), ln(3), ln(4)]
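A minimal sketch of the exact computation in this example, assuming the mixture entropy is evaluated by one-dimensional numerical quadrature; the grid resolution and the σ values implied in the usage comment are illustrative.

```python
import numpy as np

def mi_single_neuron(w, b, sigma):
    """I(X; T) for T = tanh(w X + b) + Z with X ~ Unif{-3, -1, 1, 3} and Z ~ N(0, sigma^2).
    Here T is a 4-component equal-weight Gaussian mixture, so I(X; T) = h(T) - h(Z)."""
    symbols = np.tanh(w * np.array([-3.0, -1.0, 1.0, 3.0]) + b)   # the 4 AWGN input symbols
    t = np.linspace(-1 - 6 * sigma, 1 + 6 * sigma, 20_001)        # quadrature grid
    dt = t[1] - t[0]
    comps = np.exp(-(t[:, None] - symbols[None, :])**2 / (2 * sigma**2))
    p_t = comps.mean(axis=1) / np.sqrt(2 * np.pi * sigma**2)      # mixture density of T
    h_t = -np.sum(p_t * np.log(p_t + 1e-300)) * dt                # h(T) by Riemann sum
    h_z = 0.5 * np.log(2 * np.pi * np.e * sigma**2)               # h(T | X) = h(Z)
    return h_t - h_z

# For small sigma: mi_single_neuron(1.0, 0.0, sigma) is close to ln(4) (all four symbols
# distinguishable), while mi_single_neuron(5.0, -10.0, sigma) is much smaller, since three
# of the four symbols saturate near -1 and merge into a single cluster.
```

This makes the compression mechanism concrete: as the transition sharpens (larger w with b = −2w), tanh saturation merges symbols into clusters, and I(X; T), the number of distinguishable symbols over the AWGN channel, drops.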
Clustering of Representations - Larger Networks

Noisy version of the DNN from [Shwartz-Tishby'17]:
Binary classification: 12-bit input & a 12–10–7–5–4–3–2 tanh MLP

Verified in multiple additional experiments
⇒ Compression of I(X; T_ℓ) is driven by clustering of representations
Circling Back to Deterministic DNNs

I(X; T_ℓ) is constant/infinite ⇒ it doesn't measure clustering

Reexamine measurements: past works computed I(X; Bin(T_ℓ)) = H(Bin(T_ℓ))
◮ H(Bin(T_ℓ)) measures clustering (it is maximized by the uniform distribution)
◮ Test: I(X; T_ℓ) and H(Bin(T_ℓ)) are highly correlated in noisy DNNs⋆
  ⋆ when the bin size is chosen ∝ the noise std.
⇒ Past works were not measuring MI but clustering (via binned MI)!

By-Product Result: refutes the 'compression (tight clustering) improves generalization' claim
[Come see us at poster #96 for details]
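A minimal sketch of the binned-entropy clustering measure H(Bin(T_ℓ)), assuming the hidden activations are available as a NumPy array; the bin size is the same user-set parameter as before (chosen ∝ the noise std. in the test above).

```python
import numpy as np

def binned_entropy(activations, bin_size):
    """H(Bin(T)) in nats: discretize each coordinate into cells of width `bin_size`
    and return the entropy of the empirical distribution over occupied cells.
    Small values indicate tightly clustered representations."""
    cells = np.floor(activations / bin_size).astype(np.int64)
    _, counts = np.unique(cells, axis=0, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log(p)).sum())
```

For a deterministic network each input falls in a single cell, so H(Bin(T_ℓ) | X) = 0 and the binned proxy I(X; Bin(T_ℓ)) from earlier reduces to exactly this quantity.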
Summary

Reexamined Information Bottleneck compression:
◮ I(X; T) fluctuations in deterministic DNNs are theoretically impossible
◮ Yet, past works presented (binned) I(X; T) dynamics during training

Noisy DNN framework: studying IT quantities over DNNs
◮ Optimal estimator (in n and d) for accurate MI estimation
◮ Clustering of learned representations is the source of compression

Clarified past observations of compression: they in fact show clustering
◮ Compression/clustering and generalization are not necessarily related

Thank you!
Clustering of Representations - Larger Networks

Noisy version of the DNN from [Shwartz-Tishby'17]:
Binary classification: 12-bit input & a 12–10–7–5–4–3–2 tanh MLP
⊛ Also with weight orthonormality regularization [Cisse et al. '17]

Verified in multiple additional experiments
⇒ Compression of I(X; T_ℓ) is driven by clustering of representations
Mutual Information Estimation in Noisy DNNs

Noisy DNN: T_ℓ = S_ℓ + Z_ℓ, where S_ℓ ≜ f_ℓ(T_{ℓ−1}) and Z_ℓ ∼ N(0, σ²I_d)
[Figure: noisy DNN diagram: X → f_1 → S_1 → (+Z_1) → T_1 → f_2 → S_2 → (+Z_2) → T_2 → ···]

Mutual Information: I(X; T_ℓ) = h(T_ℓ) − ∫ dP_X(x) h(T_ℓ | X = x)

Structure: S_ℓ ⊥ Z_ℓ ⇒ T_ℓ = S_ℓ + Z_ℓ ∼ P ∗ N_σ
⊛ We know the distribution N_σ of Z_ℓ (noise injected by design)
⊛ P is extremely complicated ⇒ treat it as unknown
⊛ We can easily get i.i.d. samples from P via DNN forward passes
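A minimal sketch of how the decomposition above is estimated, reusing the entropy_plugin sketch from the estimator slide earlier; it assumes the empirical input distribution is uniform over the provided inputs and that each input contributes the same number of forward-pass samples.

```python
import numpy as np

def mi_estimate(s_by_input, sigma):
    """Estimate I(X; T_l) = h(T_l) - sum_x P_X(x) h(T_l | X = x) for a noisy layer.
    `s_by_input` maps each distinct input x to an (n_x, d) array of noiseless
    activations S_l from forward passes; P_X is taken uniform over the keys and the
    counts n_x are assumed equal, so concatenation samples the marginal of S_l."""
    all_s = np.concatenate(list(s_by_input.values()), axis=0)
    h_marginal = entropy_plugin(all_s, sigma)                 # estimate of h(T_l)
    h_cond = np.mean([entropy_plugin(s, sigma)                # average of h(T_l | X = x)
                      for s in s_by_input.values()])
    return h_marginal - h_cond
```

Each term is a noisy-differential-entropy estimation problem of exactly the form treated by the theorems earlier, which is what reduces MI tracking to that single primitive.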
Structured Estimator (with Implementation in Mind)

Differential Entropy Estimation under Gaussian Convolutions
Estimate h(P ∗ N_σ) via n i.i.d. samples S^n ≜ (S_i)_{i=1}^n from an unknown P ∈ F_d (a nonparametric class) and knowledge of N_σ (the noise distribution).

Nonparametric class: specified by the DNN architecture (d = width of T_ℓ)
Goal: simple & parallelizable for efficient implementation
Estimator: ĥ(S^n, σ) ≜ h(P̂_{S^n} ∗ N_σ), where P̂_{S^n} ≜ (1/n) Σ_{i=1}^n δ_{S_i}
Plug-in: ĥ is the plug-in estimator for the functional T_σ(P) ≜ h(P ∗ N_σ)
Structured Estimator - Convergence Rate

Theorem (ZG-Greenewald-Weed-Polyanskiy'19)
For any σ > 0 and d ≥ 1, we have
  sup_{P ∈ F^(SG)_{d,K}} E | h(P ∗ N_σ) − h(P̂_{S^n} ∗ N_σ) | ≤ C_{σ,d,K} · n^{−1/2},
where C_{σ,d,K} = O_{σ,K}(c^d) for a constant c.

Comments:
◮ Explicit expression: enables concrete error bounds in simulations
◮ Minimax rate optimal: attains the parametric estimation rate O(n^{−1/2})

Proof (initial step): based on [Polyanskiy-Wu'16],
  | h(P ∗ N_σ) − h(P̂_{S^n} ∗ N_σ) | ≲ W_1(P ∗ N_σ, P̂_{S^n} ∗ N_σ)
⇒ Analyze the empirical 1-Wasserstein distance under Gaussian convolutions
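A small numerical sanity check one could run with the entropy_plugin sketch from earlier, using P = N(0, I_d), for which h(P ∗ N_σ) = (d/2) log(2πe(1 + σ²)) is known in closed form; the dimension, σ, and sample sizes are arbitrary choices.

```python
import numpy as np

d, sigma = 3, 0.5
rng = np.random.default_rng(0)
h_true = 0.5 * d * np.log(2 * np.pi * np.e * (1 + sigma**2))   # exact h(P * N_sigma)

for n in (100, 400, 1600):
    s = rng.standard_normal((n, d))                            # n i.i.d. samples of P
    err = abs(entropy_plugin(s, sigma) - h_true)
    print(f"n={n:5d}  |error|={err:.3f}")   # should tend to shrink roughly like n^{-1/2}
```

The point of the explicit constant C_{σ,d,K} is that such empirical errors can be compared against a concrete bound rather than only an asymptotic rate.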
Empirical W_1 & The Magic of Gaussian Convolution

p-Wasserstein distance: for two distributions P and Q on R^d and p ≥ 1,
  W_p(P, Q) ≜ inf ( E ‖X − Y‖^p )^{1/p},
where the infimum is over all couplings of (X, Y) with X ∼ P and Y ∼ Q.

Empirical 1-Wasserstein Distance:
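A minimal one-dimensional illustration of the quantities involved, using SciPy's wasserstein_distance (exact for 1-D empirical measures); the distributions, sample sizes, and σ are arbitrary choices, and in one dimension this only illustrates the definition, not the high-dimensional rate phenomenon.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
sigma = 0.5
pop = rng.standard_normal(100_000)   # large reference sample standing in for P
emp = rng.standard_normal(200)       # the n points defining the empirical measure P_hat_n

# W_1(P, P_hat_n): empirical measure vs. the underlying distribution
print(wasserstein_distance(pop, emp))

# W_1(P * N_sigma, P_hat_n * N_sigma): compare the Gaussian-smoothed versions by adding
# independent N(0, sigma^2) noise to each sample set
print(wasserstein_distance(pop + sigma * rng.standard_normal(pop.size),
                           emp + sigma * rng.standard_normal(emp.size)))
```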