
Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)



  1. Convolutional Neural Networks (CNNs) and Recurrent Neural Networks (RNNs)

  2. Project Brainwave: Pretrained DNN Model → Scalable DNN Hardware → Neural Processing Unit → Microservice. [Diagram: an L1 scalar processor over L0 M*V (matrix-vector) processors feeding neuron units N.]

  3. Extract parallelism from a single thread of execution; achieve high utilization without batching; scale to O(100k) spatial units; synthesis specialization.

  4. Serial dependence: each RNN time step consumes the hidden state produced by the previous step, so the time loop cannot be parallelized across steps (see the sketch below).
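
A minimal C sketch of that dependence (illustrative only; the function and variable names are my assumptions, not from the deck). The read of h_prev at step t is a loop-carried dependence on step t-1, so only the work inside a single step can run in parallel:

    #include <math.h>

    // One RNN layer over time: h_t = tanh(W*x_t + U*h_{t-1}).
    void rnn_forward(int steps, int n,
                     const float *W, const float *U, const float *x,
                     float *h_prev, float *h) {
        for (int t = 0; t < steps; t++) {      // serial over time steps
            for (int i = 0; i < n; i++) {      // parallel over hidden units
                float acc = 0.0f;
                for (int j = 0; j < n; j++)
                    acc += W[i*n + j] * x[t*n + j] + U[i*n + j] * h_prev[j];
                h[i] = tanhf(acc);
            }
            for (int i = 0; i < n; i++)
                h_prev[i] = h[i];              // carry state into step t+1
        }
    }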

  5. Extract parallelism from a single thread of execution; achieve high utilization without batching; scale to O(100k) spatial units; synthesis specialization.

  6. [Figure: batched RNNs; labels O(0), O(N²).]

  7. [Figure: batched RNNs; labels O(0), O(N), O(N²).]
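
The point of these builds, as I read them (the interpretation is an assumption): a matrix-vector multiply performs O(N²) multiply-adds over O(N²) weights, so at batch = 1 each weight fetched from memory is used exactly once. Batching B requests turns the work into a matrix-matrix product that reuses every weight B times, which is how conventional accelerators reach high utilization; the deck's stated goal is to get there without batching. A hedged C sketch of the reuse:

    // Batched gate computation: Y = W * X for a batch of B inputs.
    // Row i of the N*N weight matrix is fetched once and reused across
    // all B columns, so arithmetic intensity grows with B; at B == 1
    // there is no reuse and the kernel is memory-bandwidth bound.
    void matmul_batched(int n, int b, const float *W,
                        const float *X, float *Y) {
        for (int i = 0; i < n; i++)
            for (int j = 0; j < b; j++) {
                float acc = 0.0f;
                for (int k = 0; k < n; k++)
                    acc += W[i*n + k] * X[k*b + j];
                Y[i*b + j] = acc;
            }
    }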

  8. Extract parallelism from a single thread of execution; achieve high utilization without batching; scale to O(100k) spatial units; synthesis specialization.

  9. LSTM kernel in the NPU's vector/matrix instruction set (reconstructed from the slide's two-column listing; comments added):

     void LSTM(int steps) {
       for (int t = 0; t < steps; t++) {
         v_rd(NetQ);                    // read input vector x_t from the network queue
         v_wr(InitialVrf, xt);
         v_rd(InitialVrf, xt);
         mv_mul(Wf);                    // Wf * x_t
         vv_add(bf);                    // + forget-gate bias
         v_wr(AddSubVrf, xWf);
         v_rd(InitialVrf, h_prev);
         mv_mul(Uf);                    // Uf * h_{t-1}
         vv_add(xWf);
         v_sigm();                      // forget gate f_t = sigmoid(...)
         vv_mul(c_prev);                // f_t * c_{t-1}
         v_wr(AddSubVrf, ft_mod);
         v_rd(InitialVrf, h_prev);
         mv_mul(Uc);                    // Uc * h_{t-1}
         vv_add(xWc);
         v_tanh();                      // candidate cell state
         vv_mul(it);                    // gated by input gate i_t
         vv_add(ft_mod);                // c_t = f_t*c_{t-1} + i_t*c~_t
         v_wr(MultiplyVrf, c_prev);     // carry c_t into the next step
         v_wr(InitialVrf, ct);
         v_rd(InitialVrf, ct);
         v_tanh();
         vv_mul(ot);                    // h_t = o_t * tanh(c_t)
         v_wr(InitialVrf, h_prev);
         v_wr(NetQ);                    // emit h_t
       }
     }

     The values xWc, it, and ot are read but never produced here; they come from analogous instruction sequences elided on the slide.
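
For reference, a scalar C sketch of the same cell-update math, assuming (as the listing implies) that the input gate it, output gate ot, and the precomputed xWc term arrive from elided code. Names mirror the slide; the layout and the helper sigmoidf are my additions:

    #include <math.h>

    static float sigmoidf(float z) { return 1.0f / (1.0f + expf(-z)); }

    // One LSTM step: updates the cell state c in place and writes h_out.
    void lstm_step(int n,
                   const float *Wf, const float *Uf, const float *bf,
                   const float *Uc, const float *xWc,
                   const float *it, const float *ot,
                   const float *xt, const float *h_prev,
                   float *c, float *h_out) {
        for (int i = 0; i < n; i++) {
            float f = bf[i], c_hat = xWc[i];
            for (int j = 0; j < n; j++) {
                f     += Wf[i*n + j] * xt[j] + Uf[i*n + j] * h_prev[j];
                c_hat += Uc[i*n + j] * h_prev[j];
            }
            f = sigmoidf(f);                    // forget gate f_t
            c_hat = tanhf(c_hat);               // candidate cell state
            c[i] = f * c[i] + it[i] * c_hat;    // c_t = f_t*c_{t-1} + i_t*c~_t
            h_out[i] = ot[i] * tanhf(c[i]);     // h_t = o_t * tanh(c_t)
        }
    }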

  10. Extract parallelism from a single thread of execution; achieve high utilization without batching; scale to O(100k) spatial units; synthesis specialization.

  11. [Diagram: a single multiply (×) and add (+) unit.]

  12. [Diagram: a row of eight multiply-add (× +) units.]

  13. [Diagram: four rows of eight multiply-add units: a 4 × 8 array.]
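
These builds sketch how the M*V processor is composed: one multiply-add unit, replicated into a row that computes a dot product, replicated again into an array whose rows each produce one element of a matrix-vector product. A hedged C model of that structure (dimensions match the drawing; names are illustrative):

    #define LANES 8   // multiply-add units per row, as drawn
    #define ROWS  4   // rows of lanes, as drawn

    // One row of units: a LANES-wide dot product.
    static float dot_row(const float *w, const float *x) {
        float acc = 0.0f;
        for (int k = 0; k < LANES; k++)
            acc += w[k] * x[k];       // one × and one + per unit
        return acc;
    }

    // The full array: y = W * x; in hardware the rows run in parallel.
    void mv_array(float W[ROWS][LANES], float x[LANES], float y[ROWS]) {
        for (int i = 0; i < ROWS; i++)
            y[i] = dot_row(W[i], x);
    }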

  14. [Figure: build steps 1–4.]

  15. [Diagram: a Dispatcher fanning out to multiple units labeled D.]

  16. [Diagram: the scalar processor issues instructions to a top-level scheduler, which feeds per-unit schedulers such as the MVM's.]
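
One way to read this organization (an assumption on my part; the deck does not spell it out): the scalar processor decodes a single instruction stream, and the top-level scheduler routes each operation to the queue of the unit that executes it, so independent units drain their queues concurrently. A toy C sketch of that routing:

    enum unit { MVM, VECTOR };        // hypothetical functional units

    struct insn { enum unit target; const char *op; };

    // Top-level scheduler: route each decoded instruction to its unit's
    // queue; the units then execute independently, overlapping work.
    void schedule(const struct insn *stream, int n,
                  struct insn mvm_q[], int *mvm_n,
                  struct insn vec_q[], int *vec_n) {
        for (int i = 0; i < n; i++) {
            if (stream[i].target == MVM) mvm_q[(*mvm_n)++] = stream[i];
            else                         vec_q[(*vec_n)++] = stream[i];
        }
    }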

  17. Extract parallelism from a single thread of execution; achieve high utilization without batching; scale to O(100k) spatial units; synthesis specialization.

  18. DeepBench RNN: GRU-2816, batch=1, 71B OPs/serve

      Device                     Node         Latency   Effective TFLOPS   Utilization
      BW-NPU (Stratix 10, FP8)   Intel 14nm   2 ms      35.9               74.8%

      CNN: ResNet-50, batch=1, 7.7B OPs/serve

      Device                     Node         Latency   Effective TFLOPS   Utilization
      BW-NPU (Arria 10, FP11)    TSMC 20nm    1.64 ms   4.7                66%
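
A quick consistency check on the Effective TFLOPS column: 71 billion OPs served in 2 ms is 71e9 / 0.002 ≈ 35.5 TFLOPS, in line with the reported 35.9 once the latency is taken as rounded; 7.7e9 / 0.00164 ≈ 4.7 TFLOPS matches exactly.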

  19. https://github.com/Azure/aml-real-time-ai
