
Understanding and Robustifying Differentiable Architecture Search



1. Understanding and Robustifying Differentiable Architecture Search
Arber Zela¹, Thomas Elsken²,¹, Tonmoy Saikia¹, Yassine Marrakchi¹, Thomas Brox¹ & Frank Hutter¹,²
¹ Department of Computer Science, University of Freiburg, {zelaa, saikiat, marrakch, brox, fh}@cs.uni-freiburg.de
² Bosch Center for Artificial Intelligence, Thomas.Elsken@de.bosch.com
February 19, 2020. Accepted as an oral presentation at ICLR 2020.

2. The Choice of Architecture Matters
Performance improvements on various tasks are mostly due to novel architectural design choices.
Figure: larger circles indicate more network parameters [Canziani et al. 2017].

3. The Choice of Architecture Matters
Performance improvements on various tasks are mostly due to novel architectural design choices.
Figure: Inception-v4 modules [Szegedy et al. '17].
Designing network architectures is hard and requires a lot of human effort. Can we automate this design process?

4. Towards efficient Neural Architecture Search (NAS)
RL & evolution for NAS by Google Brain [Quoc Le's group, '16-'18]:
- New state-of-the-art results on CIFAR-10, ImageNet and Penn Treebank
- Large computational demands: 800 GPUs for 2 weeks; 12,800 architectures evaluated
- Code not public
Figure taken from FastAI.

5. Towards efficient Neural Architecture Search (NAS)
RL & evolution for NAS by Google Brain [Quoc Le's group, '16-'18]:
- New state-of-the-art results on CIFAR-10, ImageNet and Penn Treebank
- Large computational demands: 800 GPUs for 2 weeks; 12,800 architectures evaluated
- Code not public
Weight sharing / one-shot NAS [Pham et al. '18; Bender et al. '18; Liu et al. '19; Xie et al. '19; Cai et al. '19; Zhang et al. '19]:
- All possible architectures are subgraphs of a large supergraph (the one-shot model)
- Weights are shared between architectures that have common edges/nodes in the supergraph
- Search cost reduced to < 1 GPU day

6. Differentiable NAS (DARTS) [Liu et al. '19]
Neural network as a directed acyclic graph:
- Nodes: fixed operators (element-wise addition, concatenation) on feature maps
- Edges: operations (sep conv 3×3, sep conv 5×5, dil conv 3×3, dil conv 5×5, max pool 3×3, avg pool 3×3, identity and zero)
Between two nodes there is a categorical choice of which operation to use:
- Relax this discrete space to a continuous representation using a convex combination of the candidate operations (MixedOps), yielding the one-shot model (a minimal sketch follows below)
- Use SGD to search in the space of architectures
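To make the MixedOp idea concrete, here is a minimal sketch in PyTorch, assuming a toy set of candidate operations; the class name, operation choices and channel handling are illustrative placeholders rather than the original DARTS implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MixedOp(nn.Module):
    """One edge of the one-shot model: a convex combination of candidate ops."""

    def __init__(self, channels: int):
        super().__init__()
        # Illustrative subset of the DARTS candidate operations.
        self.ops = nn.ModuleList([
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),  # stand-in for sep_conv_3x3
            nn.MaxPool2d(kernel_size=3, stride=1, padding=1),         # max_pool_3x3
            nn.Identity(),                                            # identity / skip connection
        ])
        # One architectural weight alpha_o per candidate operation.
        self.alpha = nn.Parameter(torch.zeros(len(self.ops)))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Softmax over alpha gives the convex-combination weights of the MixedOp.
        weights = F.softmax(self.alpha, dim=0)
        return sum(w * op(x) for w, op in zip(weights, self.ops))

# Example usage on a dummy feature map.
edge = MixedOp(channels=16)
out = edge(torch.randn(1, 16, 8, 8))
```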

7. Differentiable Architecture Search (DARTS) [Liu et al. '19]
x^{(j)} = \sum_{i<j} \tilde{o}^{(i,j)}\left(x^{(i)}\right) = \sum_{i<j} \sum_{o \in \mathcal{O}} \frac{e^{\alpha_o^{(i,j)}}}{\sum_{o' \in \mathcal{O}} e^{\alpha_{o'}^{(i,j)}}} \, o\left(x^{(i)}\right)
Figure: a cell with nodes 0, 1, 2. (a) At the start of search every operation on each edge has equal weight (0.33); (b) at the end of search the weights have concentrated on individual operations.

8. Differentiable Architecture Search (DARTS) [Liu et al. '19]
x^{(j)} = \sum_{i<j} \tilde{o}^{(i,j)}\left(x^{(i)}\right) = \sum_{i<j} \sum_{o \in \mathcal{O}} \frac{e^{\alpha_o^{(i,j)}}}{\sum_{o' \in \mathcal{O}} e^{\alpha_{o'}^{(i,j)}}} \, o\left(x^{(i)}\right)
After search, each edge is discretized by keeping its strongest operation:
o^{(i,j)} \in \arg\max_{o \in \mathcal{O}} \alpha_o^{(i,j)}
Figure: (d) search start, (e) search end, (f) final cell after discretization (a small sketch of this step follows below).
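A small sketch of this discretization step in plain Python; the alpha values and the three edges below are hypothetical, chosen only to illustrate the argmax rule.

```python
OPS = ["sep_conv_3x3", "sep_conv_5x5", "max_pool_3x3", "skip_connect"]

# Hypothetical softmaxed architectural weights, one row per edge.
alpha = [
    [0.84, 0.03, 0.13, 0.00],
    [0.05, 0.71, 0.24, 0.00],
    [0.17, 0.38, 0.45, 0.00],
]

# o^(i,j) in argmax_o alpha_o^(i,j): keep only the strongest operation per edge.
final_cell = [OPS[max(range(len(OPS)), key=lambda k: row[k])] for row in alpha]
print(final_cell)  # ['sep_conv_3x3', 'sep_conv_5x5', 'max_pool_3x3']
```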

9. DARTS: Architecture Optimization
Optimizing both \mathcal{L}_{train} and \mathcal{L}_{valid} corresponds to a bilevel optimization problem:
\min_{\alpha} \; f(\alpha) := \mathcal{L}_{valid}\left(w^{*}(\alpha), \alpha\right)
\text{s.t.} \;\; w^{*}(\alpha) = \arg\min_{w} \mathcal{L}_{train}(w, \alpha),
where
- \alpha are the architectural weights
- w are the operation weights

10. DARTS: Architecture Optimization
Optimizing both \mathcal{L}_{train} and \mathcal{L}_{valid} corresponds to a bilevel optimization problem:
\min_{\alpha} \; f(\alpha) := \mathcal{L}_{valid}\left(w^{*}(\alpha), \alpha\right)
\text{s.t.} \;\; w^{*}(\alpha) = \arg\min_{w} \mathcal{L}_{train}(w, \alpha),
where
- \alpha are the architectural weights
- w are the operation weights
Approximate the inner solution by a single gradient step, w^{*}(\alpha) \approx w - \xi \nabla_w \mathcal{L}_{train}(w, \alpha). The optimization then alternates between:
1. Update w by \nabla_w \mathcal{L}_{train}(w, \alpha)
2. Update \alpha by \nabla_\alpha \mathcal{L}_{valid}\left(w - \xi \nabla_w \mathcal{L}_{train}(w, \alpha), \alpha\right)
A toy sketch of this alternation follows below.
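The alternation above can be sketched on a toy problem; this assumes PyTorch, uses made-up quadratic losses in place of L_train and L_valid, and implements the first-order variant (ξ = 0) rather than the unrolled second-order update.

```python
import torch

# Toy stand-ins for the operation weights w and architectural weights alpha;
# the quadratic losses below are placeholders, not the actual DARTS objectives.
w = torch.randn(5, requires_grad=True)
alpha = torch.zeros(3, requires_grad=True)

def L_train(w, alpha):
    # Placeholder training loss that couples w and alpha.
    return ((alpha.softmax(dim=0) * w[:3]).sum() - 1.0) ** 2

def L_valid(w, alpha):
    # Placeholder validation loss on "held-out data".
    return ((alpha.softmax(dim=0) * w[:3]).sum() - 0.8) ** 2

opt_w = torch.optim.SGD([w], lr=0.025)
opt_alpha = torch.optim.Adam([alpha], lr=3e-4)

for step in range(100):
    # 1) Update w by the gradient of the training loss (alpha held fixed).
    opt_w.zero_grad()
    L_train(w, alpha).backward()
    opt_w.step()

    # 2) Update alpha by the gradient of the validation loss at the current w
    #    (first-order approximation, i.e. xi = 0). The gradient that also lands
    #    in w.grad here is cleared by opt_w.zero_grad() at the next iteration.
    opt_alpha.zero_grad()
    L_valid(w, alpha).backward()
    opt_alpha.step()
```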

11. Works quite well on many benchmarks
Original CNN search space:
- 8 candidate operations on each MixedOp
- 28 MixedOps in total
- more than 10^23 possible architectures
- < 3% test error on CIFAR-10 in less than 1 GPU day of search

12. But not always...
- S1: a different set of two operators per edge, derived by iteratively running DARTS and pruning the unimportant operations
- S2: {3×3 SepConv, SkipConnect}
- S3: {3×3 SepConv, SkipConnect, Zero}
- S4: {3×3 SepConv, Noise}

13. But not always...
- S1: a different set of two operators per edge, derived by iteratively running DARTS and pruning the unimportant operations
- S2: {3×3 SepConv, SkipConnect}
- S3: {3×3 SepConv, SkipConnect, Zero}
- S4: {3×3 SepConv, Noise}
Figure: cells found by DARTS on these spaces, connecting c_{k-2} and c_{k-1} to c_{k}; they are dominated by parameter-free skip_connect operations (and, on S4, by the harmful noise operation), with only a few sep_conv_3x3 operations.
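Written out as configuration, the reduced per-edge operation sets of S2-S4 look roughly as follows (operation names follow the DARTS convention; S1 is omitted because its per-edge sets are derived by pre-running DARTS):

```python
# Candidate operations available on every edge of the reduced search spaces.
SEARCH_SPACES = {
    "S2": ["sep_conv_3x3", "skip_connect"],
    "S3": ["sep_conv_3x3", "skip_connect", "none"],  # 'none' is the zero operation
    "S4": ["sep_conv_3x3", "noise"],                 # 'noise' outputs random noise instead of its input
}
```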

14. Architecture overfitting
S5: a very small search space with a known global optimum; all 81 possible architectures were trained 3 independent times using the default DARTS training settings.

15. Architecture overfitting
S5: a very small search space with a known global optimum; all 81 possible architectures were trained 3 independent times using the default DARTS training settings.
The architectural parameters start overfitting to the validation set.
Figure (L2 factor 0.0003): DARTS test regret, RS-ws test regret, and DARTS one-shot validation error (%) over 50 search epochs.

16. Architecture overfitting
What would be a good feature for detecting this overfitting, without training and evaluating the architectures from scratch (which is too expensive)?

17. Architecture overfitting
What would be a good feature for detecting this overfitting, without training and evaluating the architectures from scratch (which is too expensive)?
Hint: the flatness/sharpness of minima, e.g. in large- vs. small-batch training of neural networks, is a good indicator of generalization [Yao et al., Hessian-based Analysis of Large Batch Training and Robustness to Adversaries, NeurIPS '18].

18. Generalization of architectures and sharpness of minima
Compute the full Hessian \nabla^2_\alpha \mathcal{L}_{valid} on a randomly sampled mini-batch from the validation set (a sketch follows below).
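A sketch of this computation, assuming PyTorch and toy placeholders for the validation loss and the architectural parameters; since α is low-dimensional, the full Hessian and its dominant eigenvalue are cheap to obtain.

```python
import torch

# Toy stand-in for the architectural parameters alpha (in DARTS this vector is
# small, so materializing the full Hessian is feasible).
alpha = torch.randn(6)

def L_valid(a):
    # Placeholder for the validation loss on one randomly sampled mini-batch.
    return (a.softmax(dim=0) ** 2).sum() + 0.1 * (a ** 2).sum()

# Full Hessian of the validation loss w.r.t. alpha.
H = torch.autograd.functional.hessian(L_valid, alpha)

# Dominant eigenvalue: the sharpness indicator tracked during search.
lambda_max = torch.linalg.eigvalsh(H).max().item()
print(lambda_max)
```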

19. Generalization of architectures and sharpness of minima
Compute the full Hessian \nabla^2_\alpha \mathcal{L}_{valid} on a randomly sampled mini-batch from the validation set.
The dominant eigenvalue starts increasing at the point where the generalization error of the architecture starts increasing.
Figure: test error (%), one-shot validation error (%) and dominant eigenvalue over 50 search epochs on S1-S4.

20. Generalization of architectures and sharpness of minima
Compute the full Hessian \nabla^2_\alpha \mathcal{L}_{valid} on a randomly sampled mini-batch from the validation set.
The dominant eigenvalue starts increasing at the point where the generalization error of the architecture starts increasing.
There is a high correlation between generalization and the dominant eigenvalue (EV).
Figure: test error (%) vs. the dominant eigenvalue averaged over the EV trajectory, for S1 on CIFAR-10; Pearson corr. coef. 0.867, p-value 0.00000.
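For one's own runs, this correlation can be computed directly with SciPy; the arrays below are hypothetical placeholders, one entry per search run.

```python
from scipy.stats import pearsonr

# Hypothetical per-run values: dominant EV averaged over the search trajectory,
# and test error (%) of the architecture found by that run.
avg_dominant_ev = [0.16, 0.19, 0.24, 0.31, 0.38]
test_error = [3.1, 3.3, 3.9, 4.6, 5.2]

r, p = pearsonr(avg_dominant_ev, test_error)
print(f"Pearson corr. coef.: {r:.3f}, p-value: {p:.5f}")
```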
