
Learning Additive Noise Channels: Generalization Bounds and Algorithms - PowerPoint PPT Presentation



  1. Learning Additive Noise Channels: Generalization Bounds and Algorithms
     Nir Weinberger, Massachusetts Institute of Technology, MA, USA
     IEEE International Symposium on Information Theory, June 2020


  2. In a nutshell
     An additive noise channel: $Y = X + Z$, where $X \in \mathbb{R}^d$ is the input, $Y$ is the output, and the noise $Z$ is independent of $X$.
     $Z \sim \mu$, but $\mu$ is unknown and non-parametric.
     Can we learn to efficiently communicate from $(Z_1, \ldots, Z_n) \overset{\mathrm{i.i.d.}}{\sim} \mu$?
     Generalization bounds for:
       1. learning under the error probability loss
          - applies to empirical risk minimization (ERM);
       2. learning under a surrogate error probability loss
          - a new alternating optimization algorithm;
       3. a "codeword-expurgating" Gibbs learning algorithm.
     Caveat: a distilled learning-theoretic framework.
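To make the setup concrete, here is a minimal simulation sketch of the additive noise channel and of the data available to the learner. The Gaussian-mixture noise law, the sample sizes, and all variable names are illustrative assumptions, not part of the original slides; the true $\mu$ is unknown and non-parametric.

```python
# Sketch of the setup: the channel adds i.i.d. noise Z ~ mu to a transmitted
# codeword, and the learner only observes noise samples Z_1, ..., Z_n.
# The Gaussian mixture below is a placeholder for the unknown noise law.
import numpy as np

rng = np.random.default_rng(0)
d, n = 2, 1000  # noise dimension and number of training samples

def sample_noise(size):
    """Stand-in for Z ~ mu: a two-component Gaussian mixture (purely illustrative)."""
    comp = rng.integers(0, 2, size=size)              # mixture component per sample
    means = np.array([[-1.0, 0.0], [1.0, 0.5]])
    return means[comp] + 0.3 * rng.standard_normal((size, d))

Z_train = sample_noise(n)     # the only data the learner sees: Z_1, ..., Z_n
x = np.array([2.0, -1.0])     # some transmitted codeword X
y = x + sample_noise(1)[0]    # channel output Y = X + Z
print(Z_train.shape, y)
```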


  3. Motivation
     Why? Justification of learning-based methods:
       1. The success of deep neural networks (DNNs) [OH17; Gru+17].
       2. Avoid channel modeling [Wan+17; OH17; FG17; Shl+19]:
          - interference, jamming signals, non-linearities [Sch08], finite-resolution quantization;
          - high-dimensional parameters, e.g., massive MIMO.
       3. Existing theory on learning-based quantizer design [LLZ94; LLZ97; BLL98; Lin02].
       4. Exploit efficient optimization methods, e.g., for the design of low-latency codes [Kim+18; Jia+19].

  4. Outline
     1. Learning to Minimize Error Probability
     2. Learning to Minimize a Surrogate to the Error Probability
     3. Learning by Codebook Expurgation


  5. Model
     Channel: $Y = X + Z$, where $X \in \mathbb{R}^d$ is the input, $Y$ is the output, and the noise $Z$ is independent of $X$.
     Encoder: a codebook $C = \{x_j\}_{j \in [m]} \in \mathcal{C} \subseteq (\mathbb{R}^d)^m$.
     Decoder: the minimum (Mahalanobis) distance decoder
       $$\hat{j}(y) \in \arg\min_{j \in [m]} \|x_j - y\|_S,$$
     with respect to an inverse covariance matrix $S \in \mathcal{S} \subseteq \mathbb{S}^d_+$.
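A minimal sketch of the minimum Mahalanobis-distance decoder defined on this slide, assuming the squared weighted norm $\|v\|_S^2 = v^\top S v$; the function name, the toy codebook, and the choice $S = I$ are illustrative, not from the slides.

```python
# Minimum Mahalanobis-distance decoding: j_hat(y) = argmin_j ||x_j - y||_S.
import numpy as np

def mahalanobis_decode(y, codebook, S):
    """Return the index of the codeword closest to y in the S-weighted norm.

    y:        received vector, shape (d,)
    codebook: array of m codewords, shape (m, d)
    S:        positive semi-definite weighting matrix, shape (d, d)
    """
    diffs = codebook - y                                # shape (m, d)
    dists = np.einsum('md,de,me->m', diffs, S, diffs)   # ||x_j - y||_S^2 for each j
    return int(np.argmin(dists))

# Example: m = 4 codewords in d = 2 dimensions, S = identity (Euclidean decoding).
codebook = np.array([[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]])
S = np.eye(2)
print(mahalanobis_decode(np.array([0.9, -1.2]), codebook, S))  # -> 1
```

When $S$ is taken as the inverse of the noise covariance, this is equivalent to nearest-neighbor decoding after whitening the received vector.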


  6. Expected and empirical error probability
     Expected average error probability:
       $$p_\mu(C, S) := \frac{1}{m} \sum_{j=1}^{m} p_\mu(C, S \mid j),$$
     with
       $$p_\mu(C, S \mid j) := \mathbb{E}_\mu\left[\mathbb{1}\left\{\min_{j' \in [m],\, j' \neq j} \|x_j + Z - x_{j'}\|_S < \|Z\|_S\right\}\right].$$
     Ultimate goal: find $\arg\min_{C, S} p_\mu(C, S)$.
     Empirical average error probability: replace
       $$\mathbb{E}_\mu[\ell(Z)] \;\to\; \mathbb{E}_{\mathbf{Z}}[\ell(Z)] := \frac{1}{n} \sum_{i=1}^{n} \ell(Z_i),$$
     so that
       $$p_{\mathbf{Z}}(C, S) := \frac{1}{m} \sum_{j=1}^{m} \mathbb{E}_{\mathbf{Z}}\left[\mathbb{1}\left\{\min_{j' \in [m] \setminus j} \|x_j + Z - x_{j'}\|_S < \|Z\|_S\right\}\right].$$
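The empirical quantity $p_{\mathbf{Z}}(C, S)$ can be computed directly from the noise samples. Below is a sketch that compares squared $S$-norms (equivalent, since both sides are non-negative); the function and variable names, the toy codebook, and the Gaussian noise samples are illustrative assumptions.

```python
# Empirical average error probability p_Z(C, S): for each codeword x_j and each
# noise sample Z_i, count an error when some competing codeword x_j' satisfies
# min_{j' != j} ||x_j + Z_i - x_j'||_S < ||Z_i||_S.
import numpy as np

def empirical_error_prob(codebook, S, Z):
    """Average of the error indicator over codewords j and noise samples Z_i."""
    m, d = codebook.shape
    n = Z.shape[0]
    errors = 0
    for j in range(m):
        y = codebook[j] + Z                               # received vectors, shape (n, d)
        own = np.einsum('nd,de,ne->n', Z, S, Z)           # ||Z_i||_S^2
        best_other = np.full(n, np.inf)                   # min over competing codewords
        for jp in range(m):
            if jp == j:
                continue
            diff = y - codebook[jp]
            best_other = np.minimum(best_other, np.einsum('nd,de,ne->n', diff, S, diff))
        errors += np.count_nonzero(best_other < own)
    return errors / (m * n)

# Example with a toy two-codeword codebook and Gaussian noise samples.
rng = np.random.default_rng(1)
codebook = np.array([[1.5, 0.0], [-1.5, 0.0]])
Z = 0.8 * rng.standard_normal((500, 2))
print(empirical_error_prob(codebook, np.eye(2), Z))
```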


  7. Uniform error bound and ERM
     Theorem. Assume that $n \ge d+1$. Then, with probability at least $1 - \delta$,
       $$\sup_{C \subset (\mathbb{R}^d)^m,\, S \in \mathbb{S}^d_+} \left| p_\mu(C, S) - p_{\mathbf{Z}}(C, S) \right| \le 4m \sqrt{\frac{2(d+1)\log\left(\frac{en}{d+1}\right)}{n}} + \sqrt{\frac{2\log(2/\delta)}{n}}.$$
     Holds for the output $(C_{\mathbf{Z}}, S_{\mathbf{Z}})$ of any learning algorithm.
     Specifically, for ERM,
       $$(C_{\mathbf{Z}}, S_{\mathbf{Z}})_{\mathrm{ERM}} \in \arg\min_{C, S}\, p_{\mathbf{Z}}(C, S),$$
     a sample size of $n = \tilde{O}\!\left(\frac{m^2 d + \log(1/\delta)}{\epsilon^2}\right)$ guarantees
       $$p_\mu\!\left((C_{\mathbf{Z}}, S_{\mathbf{Z}})_{\mathrm{ERM}}\right) \le \inf_{C, S} p_\mu(C, S) + \epsilon.$$
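A quick numeric evaluation of the uniform deviation bound in the theorem, following the reconstruction above (the placement of the factor $4m$ under that reconstruction should be treated as an assumption); the values of $m$, $d$, and $\delta$ are illustrative.

```python
# Numeric sketch: the deviation bound shrinks as O(sqrt(log(n)/n)) in the
# number of noise samples n, for fixed codebook size m and dimension d.
import numpy as np

def uniform_deviation_bound(m, d, n, delta):
    assert n >= d + 1, "the theorem assumes n >= d + 1"
    vc_term = 4 * m * np.sqrt(2 * (d + 1) * np.log(np.e * n / (d + 1)) / n)
    conf_term = np.sqrt(2 * np.log(2 / delta) / n)
    return vc_term + conf_term

m, d, delta = 16, 2, 0.05   # illustrative values, not from the slides
for n in (10**3, 10**4, 10**5, 10**6):
    print(n, round(uniform_deviation_bound(m, d, n, delta), 4))
```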

  8. Uniform error bound and ERM - cont.
     Open questions: the term $\tilde{O}\!\left(\frac{\log(1/\delta)}{\epsilon^2}\right)$ can be shown to be minimax tight.
