  1. Jianchao Yang, Toutiao AI Lab in Silicon Valley. Joint work with Xiaojie Jin (NUS), Ning Xu (Snap), and Yingzhen Yang.

  2. • Quest for compact and efficient deep models • Memory usage • Computation cost • App size

  3. • Quest for compact and efficient deep models • Memory usage • Computation cost • App size • WSNet: Compact and efficient network design • Smaller models (e.g., up to 180x smaller on ESC-50) • Faster computation (e.g., up to 18x faster on ESC-50) • Accuracy comparable to the state of the art

  4. • Conventional convolution filters are initialized and trained separately.

  5. • Conventional convolution filters are initialized and trained separately. • Convolution filters are highly redundant.

  6. • Conventional convolution filters are initialized and trained separately. • Convolution filters are highly redundant. • Existing remedies: model quantization, model pruning, low rank, signal sparsity, …

  7. • Main idea: the convolution filters are generated from a compact learnable parameter set (a low-dimensional manifold), instead of being learned separately.

  8. • Main idea: the convolution filters are generated from a compact learnable parameter set (a low-dimensional manifold), instead of being learned separately. • K_i = f_i(Φ), where Φ is the learnable compact parameter set and f_i is the mapping function that generates the i-th convolution filter.

  9. • Main idea: the convolution filters are generated from a compact learnable parameter set (a low-dimensional manifold), instead of being learned separately. • K_i = f_i(Φ), where Φ is the learnable compact parameter set and f_i is the mapping function that generates the i-th convolution filter. • We focus on weight sampling for the functions f_i in this work: weight tying!
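
To make the weight-tying idea concrete, below is a minimal PyTorch sketch (my own illustration, not the authors' code): every filter K_i is a window sliced out of a single learnable tensor Φ inside forward(), so all filters share, and jointly update, the same underlying parameters. The layer sizes, stride, and initialization are assumptions for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WeightSampledConv1d(nn.Module):
    """Toy 1D convolution whose M filters are all views into one condensed
    parameter vector phi (weight tying): K_i = f_i(phi)."""

    def __init__(self, channels, filter_len=3, num_filters=5, stride=1):
        super().__init__()
        condensed_len = filter_len + (num_filters - 1) * stride
        self.phi = nn.Parameter(0.1 * torch.randn(condensed_len, channels))
        self.starts = [i * stride for i in range(num_filters)]  # f_i = slice at i*stride
        self.filter_len = filter_len

    def forward(self, x):  # x: (batch, channels, length)
        # Re-materialize the filters from phi on every call; autograd then
        # accumulates each filter's gradient back into the shared phi.
        filters = torch.stack([self.phi[s:s + self.filter_len].t()
                               for s in self.starts])            # (M, C, L)
        return F.conv1d(x, filters, padding=self.filter_len // 2)
```

With the defaults above the layer stores a 7-step condensed vector per channel yet behaves like 5 filters of length 3, matching the shift-sampling example shown later in the deck.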

  10. • Model quantization (e.g., Han et al. 2015) • Weight tying as a result of weight quantization on a learned model • HashedNet (Chen et al. 2015) • Random weight tying with hashing before model training • Epitome (Jojic et al. 2003) • A statistical model that ties pixel values in overlapping patches Jojic et al. Epitomic analysis of appearance and shape. ICCV 2003. Chen et al. Compressing convolutional neural networks with the hashing trick. ICML 2015.

  11. • Simplest case: 1D convolution with a single channel • Shift sampling from Φ • K_i = f_i(Φ) • f_i: projection (sampling) matrix • Φ: condensed parameter set

  16. • Simplest case: 1D convolution with a single channel • Shift sampling from Φ • K_i = f_i(Φ) • f_i: projection (sampling) matrix • Φ: condensed parameter set • 7 weights to generate 5 1x3 filters (15 weights)

  17. • 1D convolution • Input feature map X ∈ ℝ^(T×C), where (T, C) denote the input length and the number of channels • Output feature map Y ∈ ℝ^(T×M), where M denotes the number of filters • Convolution kernel K ∈ ℝ^(L×C×M), where L denotes the filter length • Number of Multi-Adds: TLCM

  18. • Weight sampling overview • L*: length of Φ • C*: number of channels of Φ • L: filter length of K • C: number of channels of K • M: number of filters • S: sampling stride • C/C*: channel repeating factor • Compactness = LCM / (L*C*)

  19. • Weight shift sampling in the spatial dimension • Conventional CNN: M independent filters of size L, #params = ML • WSNet: condensed filter of size L* = L + (M - 1)S, #params = L + (M - 1)S • Compactness = ML / (L + (M - 1)S) ≈ L / S
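
A quick numeric check of this formula (my own arithmetic), using the earlier single-channel toy setting:

```python
L, M, S = 3, 5, 1                  # filter length, number of filters, sampling stride
dense = M * L                      # conventional: 15 independent weights
condensed = L + (M - 1) * S        # WSNet condensed filter: 7 shared weights
print(dense / condensed)           # ~2.1x here; tends to L / S = 3 as M grows
```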

  20. • Repeating weight sampling in the channel dimension • Conventional CNN: each filter has C channels • WSNet: condensed filter has C* channels • Compactness = C / C*

  21. • Example • Stride S = 1 • Channel repeating 4 times • Filter length L = 16 • Compactness = LCM / (L*C*) ≈ L × 4 = 64 • The same idea can be generalized to fully connected layers!
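
To show how the spatial and channel factors combine in this example, here is a short NumPy sketch; the slide does not give the layer shape, so the filter and channel counts (M, C) below are assumptions:

```python
import numpy as np

L, S, repeat = 16, 1, 4                   # values from the example
M, C = 256, 64                            # assumed baseline layer shape
C_star = C // repeat                      # condensed channels
L_star = L + (M - 1) * S                  # condensed length
phi = np.random.randn(L_star, C_star)     # condensed parameter set

def virtual_filter(m):
    """Generate the m-th filter: shift in space, then tile across channels."""
    window = phi[m * S : m * S + L]       # (L, C_star) spatial shift sampling
    return np.tile(window, (1, repeat))   # (L, C)      channel repeat sampling

dense_params = L * C * M                  # what a conventional layer would store
condensed_params = L_star * C_star
print(dense_params / condensed_params)    # ~60x here; tends to 16 * 4 = 64 as M grows
```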

  22. • Sample more filters with a larger condensed filter (bigger L*) and a smaller sampling stride to increase capacity • Sampling stride S vs. denser sampling stride Ŝ • Increased computation?

  23. • Recap of conventional 1D convolution • Input feature map X ∈ ℝ^(T×C) • Output feature map Y ∈ ℝ^(T×M) • Convolution kernel K ∈ ℝ^(L×C×M) • Number of Multi-Adds: TLCM

  24. • Re-use the convolution results shared between overlapping input windows and overlapping filters

  25. • An efficient variant of the integral image method
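
The slides do not spell the computation out, so the following NumPy sketch is one way to realize it (my reconstruction under the assumption of a single channel group, C* = C; not the authors' implementation): all filters share one inner-product map between the input and the condensed weights, and a diagonal integral image reduces each filter response to a single subtraction.

```python
import numpy as np

def wsnet_conv1d(x, phi, L, M, S):
    """Valid 1D convolution whose M filters of length L are shift-sampled
    from the condensed parameter set phi with stride S.
    x:   (T, C) input feature map
    phi: (L_star, C) condensed parameters; requires L_star >= (M - 1) * S + L."""
    T, C = x.shape
    L_star = phi.shape[0]

    # 1. Inner-product map shared by every filter: P[t, j] = <x[t], phi[j]>.
    P = x @ phi.T                                    # (T, L_star): T * L_star * C Multi-Adds

    # 2. Integral image along diagonals, zero-padded so out-of-range reads are 0:
    #    I[t, j] = P[t-1, j-1] + I[t-1, j-1].
    I = np.zeros((T + 1, L_star + 1))
    for t in range(1, T + 1):
        I[t, 1:] = P[t - 1] + I[t - 1, :-1]          # T * L_star additions

    # 3. Filter m covers phi[m*S : m*S + L]; its response at every position is a
    #    difference of two integral-image entries instead of L * C Multi-Adds:
    #    y[t, m] = I[t + L, m*S + L] - I[t, m*S].
    t = np.arange(T - L + 1)
    y = np.stack([I[t + L, m * S + L] - I[t, m * S] for m in range(M)], axis=1)
    return y
```

On small random inputs this matches a naive loop that slides each sampled filter over x, which is an easy way to sanity-check the bookkeeping.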

  26. • Acceleration in terms of Multi-Adds: LCM / (L*C* + L* + M) • Example convolution layer • Conv kernel size (L, C, M) = (8, 64, 128) • Condensed kernel (L*, C*) = (135, 16) • Input feature map (T, C) • Computation acceleration of ~27x for this layer
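
Plugging the example numbers into that count (my arithmetic; the paper's exact bookkeeping may differ in lower-order terms):

```python
L, C, M = 8, 64, 128            # baseline kernel: length, channels, filters
L_star, C_star = 135, 16        # condensed kernel
baseline = L * C * M                      # 65,536 Multi-Adds per output position
wsnet = L_star * C_star + L_star + M      # product map + integral image + lookups = 2,423
print(baseline / wsnet)                   # ~27x
```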

  27. • Direct extension • Spatial sampling: shifting patch sampling from a 2D condensed filter • Channel sampling: repeat sampling in the channel dimension

  30. • Direct extension • Spatial sampling: shifting patch sampling from a 2D condensed filter • Channel sampling: repeat sampling in the channel dimension • Compactness • Conventional filters: (w, h, C, M) • Condensed filter: (W*, H*, C*) • Sampling strides: S_w, S_h
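
A minimal sketch of the 2D spatial sampling (all sizes below are illustrative assumptions, not values from the slides): each w×h filter is an overlapping patch cut from a single condensed 2D sheet, and channel repeating works exactly as in the 1D case.

```python
import numpy as np

w, h = 3, 3                              # conventional filter spatial size
W_star, H_star, C_star = 12, 12, 16      # condensed 2D filter (assumed sizes)
S_w, S_h = 1, 1                          # spatial sampling strides
phi = np.random.randn(W_star, H_star, C_star)

def filter_2d(i, j):
    """(i, j)-th virtual filter: a shifted w x h patch of the condensed sheet."""
    return phi[i * S_w : i * S_w + w, j * S_h : j * S_h + h]   # (w, h, C_star)
```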

  31. • Tensor decomposition extension • Decompose 3D weight tensors into three 1D vectors (Jin et al. 2015) • Apply WSNet to each 1D vector as in the 1D CNN case • 3D convolution → 1D convolutions over three directions Jin et al. Flattened convolutional neural networks for feedforward acceleration. ICLR 2015.
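
One plausible reading of this decomposition in PyTorch (a sketch following Jin et al.'s flattened-convolution idea, not the exact WSNet architecture; channel counts are assumptions): the 3D kernel becomes a pointwise convolution over channels followed by depthwise 1D convolutions over height and width, and WSNet's 1D weight sampling can then be applied to each of those 1D kernels.

```python
import torch.nn as nn

class FlattenedBlock(nn.Module):
    """Replace one k x k x C_in x C_out kernel with three 1D directions:
    channels, then height, then width."""

    def __init__(self, c_in, c_out, k=3):
        super().__init__()
        self.channel = nn.Conv2d(c_in, c_out, kernel_size=1)              # 1D over channels
        self.vertical = nn.Conv2d(c_out, c_out, kernel_size=(k, 1),
                                  padding=(k // 2, 0), groups=c_out)      # 1D over height
        self.horizontal = nn.Conv2d(c_out, c_out, kernel_size=(1, k),
                                    padding=(0, k // 2), groups=c_out)    # 1D over width

    def forward(self, x):
        return self.horizontal(self.vertical(self.channel(x)))
```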

  32. • Channel dimension dominates model size and computation. • Channel reordering to reduce computation cost.

  33. • Tasks and datasets • WSNet-1D: Audio classification • ESC-50 • UrbanSound8K • DCASE • WSNet-2D: Image classification • CIFAR 10 • MNIST • ImageNet

  34. • Notation settings of WSNet • Each WSNet model name encodes four settings: • the compactness in the spatial dimension • the number of times channels are repeated • the ratio of filters between WSNet and the baseline obtained through denser sampling • the compression ratio from weight quantization, when used

  35. • Baseline network for ESC-50, UrbanSound8K, DCASE • Network adopted from SoundNet (Aytar et al. 2016) for fair comparison Aytar et al. SoundNet: Learning sound representations from unlabeled video. NIPS 2016.

  36. • ESC-50 : A collection of 2000 short environmental recordings comprising 50 equally balanced classes of sound events (e.g., animals, water sounds, urban noises, human non-speech sounds, etc.)

  37. • UrbanSound8K: A collection of 8732 short recordings of various urban sound sources (air conditioner, car horn, children playing, etc.)

  38. • DCASE: Detection and Classification of Acoustic Scenes and Events Challenge. It contains 10 acoustic scene categories, with ten 30-second recordings per category for training.

  39. • Direct 2D extension on CIFAR 10 and MNIST • Same baseline network as HashedNet (Chen et al. 2015) Chen et al. Compressing convolutional neural networks with the hashing trick. ICML 2015.

  40. • Tensor decomposition extension on ImageNet • Single-view testing • Baseline network is Res34

      Model       #Params   #Multi-Adds   Top-1
      Res18       11.2M     1800M         70.6
      Res34       21.3M     3600M         73.1
      MobileNet   4.2M      575M          70.6
      WSNet       2.7M      540M          70.4

  41. • WSNet provides a novel design scheme for convolutional neural networks that yields compact and efficient models. • It achieves accuracy comparable to the state of the art, with far fewer parameters and much lower computation cost. • Future work: explore more filter generation methods, e.g., learning a generative statistical model or a low-dimensional basis.

  42. We are hiring research scientists, software engineers, and interns. • Areas: computer vision, computer graphics, machine learning, natural language processing, knowledge discovery and data mining, speech and audio processing, recommender systems • Sites: Beijing, Silicon Valley (USA), Seattle (USA) • Send your resume to lab-hr@bytedance.com for Beijing positions and rdus.staffing@bytedance.com for Silicon Valley positions.

  43. Thank You! Reference: https://arxiv.org/abs/1711.10067?context=cs
