Bayesian Sparsification of Deep Complex-valued Networks
Ivan Nazarov, Evgeny Burnaev
ADASE, Skoltech, Moscow, Russia
Synopsis

Motivation for C-valued neural networks
◮ perform better on naturally C-valued data
◮ use half as much storage, but the same number of flops

Propose Sparse Variational Dropout for C-valued neural networks
◮ Bayesian sparsification method with C-valued distributions
◮ empirically explore the compression-performance trade-off

Conclusions
◮ C-valued methods compress similarly to their R-valued predecessors
◮ final performance benefits from fine-tuning the sparsified network
◮ a SOTA C-VNN on MusicNet can be compressed 50-100× at a moderate performance penalty
C-valued neural networks: Applications

Data with a natural C-valued representation
◮ radar and satellite imaging [Hirose, 2009, Hänsch and Hellwich, 2010, Zhang et al., 2017]
◮ magnetic resonance imaging [Hui and Smith, 1995, Wang et al., 2020]
◮ radio signal classification [Yang et al., 2019, Tarver et al., 2019]
◮ spectral speech modelling and music transcription [Wisdom et al., 2016, Trabelsi et al., 2018, Yang et al., 2019]

Exploring benefits beyond C-valued data
◮ sequence modelling, dynamical system identification [Danihelka et al., 2016, Wisdom et al., 2016]
◮ image classification, road / lane segmentation [Popa, 2017, Trabelsi et al., 2018, Gaudet and Maida, 2018]
◮ unitary transition matrices in recurrent networks [Arjovsky et al., 2016, Wisdom et al., 2016]
C-valued neural networks: Implementation

Geometric representation C ≃ R²
◮ z = ℜz + i ℑz, with i² = −1
◮ ℜz and ℑz are the real and imaginary parts of z

An intricate double-R network that respects C-arithmetic (see the sketch below)
◮ R-VNN linear operation: unconstrained real matrix [[W₁₁, W₁₂], [W₂₁, W₂₂]] acting on (x₁, x₂)
◮ C-VNN linear operation: constrained block matrix [[W₁₁, −W₂₁], [W₂₁, W₁₁]] acting on (x₁, x₂) = (ℜx, ℑx)

Activations z ↦ σ(z), e.g. r e^{iφ} ↦ σ(r, φ) or z ↦ σ(ℜz) + i σ(ℑz)
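A minimal NumPy sketch (not from the slides) of the block-matrix view above: the C-valued linear map W x is reproduced exactly by the constrained double-real matrix [[A, −B], [B, A]] acting on the stacked real and imaginary parts of x.

    # Sketch: C-valued linear map vs. its double-real block-matrix equivalent.
    import numpy as np

    rng = np.random.default_rng(0)
    A, B = rng.standard_normal((3, 2)), rng.standard_normal((3, 2))  # W = A + iB
    u, v = rng.standard_normal(2), rng.standard_normal(2)            # x = u + iv

    y = (A + 1j * B) @ (u + 1j * v)        # native complex arithmetic

    block = np.block([[A, -B],             # constrained real block matrix
                      [B,  A]])
    y_ri = block @ np.concatenate([u, v])  # double-real computation

    assert np.allclose(y_ri, np.concatenate([y.real, y.imag]))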
Sparsity and compression

Improve power, storage or throughput efficiency of deep nets
◮ Knowledge distillation [Hinton et al., 2015, Balasubramanian, 2016]
◮ Network pruning [LeCun et al., 1990, Seide et al., 2011, Zhu and Gupta, 2018]
◮ Low-rank matrix / tensor decomposition [Denton et al., 2014, Novikov et al., 2015]
◮ Quantization and fixed-point arithmetic [Courbariaux et al., 2015, Han et al., 2016, Chen et al., 2017]

Applications to C-VNNs:
◮ C-modulus pruning, quantization with k-means in R² [Wu et al., 2019]
◮ ℓ₁ regularization for hypercomplex-valued networks [Vecchi et al., 2020]
Sparse Variational Dropout [Molchanov et al., 2017]

Variational inference with an automatic relevance determination effect

    \max_{q \in \mathcal{Q}} \; \underbrace{\mathbb{E}_{w \sim q} \log p(\mathcal{D} \mid w)}_{\text{data model likelihood}} \;-\; \underbrace{\mathrm{KL}(q \,\|\, \pi)}_{\text{variational regularization}} \qquad \text{(ELBO)}

prior π → data model likelihood → posterior q (close to p(w | D))

Factorized Gaussian dropout posterior family Q (sampled as in the sketch below)
◮ w_ij ∼ q(w_ij) = N(w_ij | μ_ij, α_ij μ_ij²), with α_ij > 0 and μ_ij ∈ R

Factorized priors
◮ (VD) π(w_ij) ∝ 1 / |w_ij| [Molchanov et al., 2017]
◮ (ARD) π(w_ij) = N(w_ij | 0, 1/τ_ij) [Kharitonov et al., 2018]
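A minimal sketch (assuming a PyTorch implementation; names are illustrative, not the authors' code) of drawing from the Gaussian dropout posterior above with the reparameterization trick, w = μ (1 + √α ε), ε ∼ N(0, 1):

    # Sketch: reparameterized sample from q(w) = N(mu, alpha * mu^2).
    import torch

    def sample_weight(mu: torch.Tensor, log_alpha: torch.Tensor) -> torch.Tensor:
        eps = torch.randn_like(mu)                            # eps ~ N(0, 1)
        return mu * (1.0 + torch.exp(0.5 * log_alpha) * eps)  # std = sqrt(alpha) * |mu|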
C-valued Variational Dropout
Our proposal

Factorized complex-valued posterior q(w) = ∏_ij q(w_ij)
◮ w_ij are independent CN(w | μ, σ², σ²ξ), with σ² = α|μ|² and |ξ| ≤ 1

    \begin{pmatrix} \Re w \\ \Im w \end{pmatrix} \sim \mathcal{N}\!\left( \begin{pmatrix} \Re \mu \\ \Im \mu \end{pmatrix}, \; \frac{\sigma^2}{2} \begin{pmatrix} 1 + \Re \xi & \Im \xi \\ \Im \xi & 1 - \Re \xi \end{pmatrix} \right)

◮ w_ij are circularly symmetric about μ_ij (ξ_ij = 0); see the sampling sketch below
◮ relevance ∝ 1/α_ij, and 2|w_ij − μ_ij|² / (α_ij |μ_ij|²) is χ²₂

Factorized complex-valued priors π
◮ (C-VD) π(w_ij) ∝ |w_ij|^{−ρ}, ρ ≥ 1
◮ (C-ARD) π(w_ij) = CN(0, 1/τ_ij, 0)

[Figure: density of CN(0, 1, η e^{iφ}), |η| ≤ 1, over the real / imaginary plane]
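A minimal sketch (assuming PyTorch complex tensors; illustrative names) of sampling the circularly symmetric case ξ = 0 used above, where the real and imaginary parts are independent with variance σ²/2 each:

    # Sketch: sample w ~ CN(mu, alpha * |mu|^2, 0) via the reparameterization trick.
    import torch

    def sample_complex_weight(mu: torch.Tensor, log_alpha: torch.Tensor) -> torch.Tensor:
        std = torch.sqrt(0.5 * torch.exp(log_alpha)) * mu.abs()  # per-part std, sigma / sqrt(2)
        eps = torch.complex(torch.randn_like(std), torch.randn_like(std))
        return mu + std * eps                                     # mu is a complex tensor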
C-valued Variational Dropout
KL(q ‖ π) term in (ELBO)

    \mathrm{KL}(q \,\|\, \pi) = \sum_{ij} \mathrm{KL}(q(w_{ij}) \,\|\, \pi(w_{ij}))

(C-VD) improper prior:

    \mathrm{KL}_{ij} \propto \tfrac{\rho - 2}{2} \log |\mu_{ij}|^2 + \log \tfrac{1}{\alpha_{ij}} - \tfrac{\rho}{2} \, \mathrm{Ei}\!\left(-\tfrac{1}{\alpha_{ij}}\right), \qquad \mathrm{Ei}(x) = \int_{-\infty}^{x} e^{t} t^{-1} \, dt

(C-ARD) the prior is optimized w.r.t. τ_ij by empirical Bayes:

    \mathrm{KL}_{ij} = -1 - \log(\sigma^2_{ij} \tau_{ij}) + \tau_{ij} (\sigma^2_{ij} + |\mu_{ij}|^2), \qquad \min_{\tau_{ij}} \mathrm{KL}_{ij} = \log\!\left(1 + \tfrac{1}{\alpha_{ij}}\right)
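A minimal sketch (NumPy/SciPy, not the authors' code) of both per-weight KL terms above, up to the additive constant of the improper C-VD prior, with α_ij = σ²_ij / |μ_ij|²:

    # Sketch: per-weight KL terms for the C-VD and C-ARD priors.
    import numpy as np
    from scipy.special import expi  # exponential integral Ei(x)

    def kl_c_vd(mu, alpha, rho=1.0):
        # (rho - 2)/2 * log|mu|^2 + log(1/alpha) - rho/2 * Ei(-1/alpha)   (+ const)
        return (0.5 * (rho - 2.0) * np.log(np.abs(mu) ** 2)
                - np.log(alpha)
                - 0.5 * rho * expi(-1.0 / alpha))

    def kl_c_ard(alpha):
        # empirical-Bayes optimum over tau_ij: log(1 + 1/alpha)
        return np.log1p(1.0 / alpha)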
Experiments: Goals and Setup

We conduct numerous experiments on various datasets to
◮ validate the proposed C-valued sparsification methods
◮ explore the compression-performance profiles
◮ compare to the R-valued Sparse Variational Dropout

Pipeline: 'pre-train' → 'compress' → 'fine-tune' (see the sketch below)
◮ 'compress' with R- / C-valued Variational Dropout layers
◮ 'fine-tune' the pruned network (keeping weights with log α_ij ≤ −1/2)

    \max_{q} \; \mathbb{E}_{w \sim q} \log p(\mathcal{D} \mid w) - \beta \, \mathrm{KL}(q \,\|\, \pi) \qquad (\beta\text{-ELBO})
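A minimal sketch (PyTorch, illustrative names) of the two ingredients of the pipeline above: the β-ELBO training objective and the pruning mask applied before fine-tuning:

    # Sketch: negative beta-ELBO loss and the log-alpha pruning mask.
    import torch

    def neg_beta_elbo(nll: torch.Tensor, kl: torch.Tensor, beta: float) -> torch.Tensor:
        # nll: negative log-likelihood of the data, kl: sum of per-weight KL terms
        return nll + beta * kl

    def keep_mask(log_alpha: torch.Tensor, threshold: float = -0.5) -> torch.Tensor:
        # keep weights with a small dropout rate, i.e. log alpha_ij <= -1/2
        return log_alpha <= threshold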
Experiments: Datasets

Four MNIST-like datasets
◮ channel features (R ↪ C) or 2d Fourier features (see the sketch below)
◮ fixed random subset of 10k train samples
◮ simple dense and convolutional nets

CIFAR10 dataset (R³ ↪ C³)
◮ random cropping and horizontal flipping
◮ C-valued variant of VGG16 [Simonyan and Zisserman, 2015]

Music transcription on MusicNet [Thickstun et al., 2017]
◮ audio dataset of 330 annotated musical compositions
◮ use the power spectrum to tell which piano keys are pressed
◮ compress the deep C-VNN proposed by Trabelsi et al. [2018]
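A minimal sketch (NumPy, hypothetical shapes) of the two input liftings mentioned for the MNIST-like datasets: embedding R into C with a zero imaginary part, and 2d Fourier features:

    # Sketch: lifting a real grayscale image to a C-valued input.
    import numpy as np

    image = np.random.rand(28, 28).astype(np.float32)  # placeholder 28x28 image

    as_complex = image.astype(np.complex64)             # R -> C, zero imaginary part
    fourier = np.fft.fft2(image)                        # 2d Fourier features (complex)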
Results: CIFAR10

[Figure: compression-accuracy trade-off on CIFAR10 (raw), β = 0.5; accuracy (0.80-0.92) vs. compression (×100 to ×1000) for C-VGG and R-VGG under the ARD and VD priors]

C-valued version of VGG16 [Simonyan and Zisserman, 2015]
Results: MusicNet

[Figure: compression-performance trade-off on MusicNet (fft), β = 0.5; average precision (0.600-0.750) vs. compression (×10 to ×1000) for the C DeepConvNet with ARD, VD, and k3 VD, with the Trabelsi et al. (2018) baseline marked]

The C-VNN of Trabelsi et al. [2018]