Convolutional Neural Networks (CNNs) Recurrent Neural Networks (RNNs)
[Figure: Project Brainwave stack — a pretrained DNN model mapped onto scalable DNN hardware (L1 scalar processor, L0 matrix-vector (M*V) processors), exposed as a Neural Processing Unit microservice]
- Extract parallelism from a single thread of execution
- Achieve high utilization without batching
- Scale to O(100k) spatial units
- Synthesis specialization
Serial dependence
[Figure: batched RNNs — scaling annotations from O(0) through O(N) to O(N^2)]
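A rough way to see why batch-1 RNN serving is hard for conventional accelerators: matrix-vector multiply reads each weight once per request, so throughput is bandwidth-bound unless requests are batched, and batching adds queueing latency. A toy roofline calculation (the bandwidth and precision numbers below are illustrative assumptions, not Brainwave figures):

```python
def mv_bound_tflops(bandwidth_gbs, bytes_per_weight, batch):
    """Bandwidth-bound throughput ceiling for a matrix-vector workload.

    Each weight element read from memory contributes 2*batch FLOPs
    (one multiply and one add per request in the batch).
    """
    weights_per_sec = bandwidth_gbs * 1e9 / bytes_per_weight
    return 2 * batch * weights_per_sec / 1e12

# Toy numbers (assumed): 80 GB/s memory, 2-byte weights.
print(mv_bound_tflops(80, 2, 1))   # 0.08 TFLOPS at batch 1
print(mv_bound_tflops(80, 2, 32))  # 2.56 TFLOPS at batch 32
```

The ceiling grows linearly with batch size, which is why conventional designs batch; pinning all weights on-chip removes the bandwidth term and lets batch-1 requests run at high utilization.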
void LSTM(int steps) {
  for (int t = 0; t < steps; t++) {
    v_rd(NetQ);               // dequeue input vector x_t
    v_wr(InitialVrf, xt);
    v_rd(InitialVrf, xt);
    mv_mul(Wf);               // Wf * x_t
    vv_add(bf);
    v_wr(AddSubVrf, xWf);
    v_rd(InitialVrf, h_prev);
    mv_mul(Uf);               // Uf * h_{t-1}
    vv_add(xWf);
    v_sigm();                 // forget gate f_t
    vv_mul(c_prev);
    v_wr(AddSubVrf, ft_mod);
    v_rd(InitialVrf, h_prev);
    mv_mul(Uc);
    vv_add(xWc);
    v_tanh();                 // cell candidate
    vv_mul(it);
    vv_add(ft_mod);           // c_t = f_t*c_{t-1} + i_t*candidate
    v_wr(MultiplyVrf, c_prev);
    v_wr(InitialVrf, ct);
    v_rd(InitialVrf, ct);
    v_tanh();
    vv_mul(ot);               // h_t = o_t * tanh(c_t)
    v_wr(InitialVrf, h_prev);
    v_wr(NetQ);               // enqueue output h_t
  }
}
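The instruction stream above is a hand-scheduled form of the standard LSTM gate equations on the NPU's vector/matrix ISA. For reference, a minimal NumPy sketch of the same per-timestep math (this is the textbook recurrence, not the Brainwave ISA; all names below are illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h, c, W, U, b):
    """One LSTM timestep. W, U, b are dicts keyed by gate: i, f, o, c."""
    i = sigmoid(W["i"] @ x + U["i"] @ h + b["i"])  # input gate
    f = sigmoid(W["f"] @ x + U["f"] @ h + b["f"])  # forget gate (Wf/Uf path)
    o = sigmoid(W["o"] @ x + U["o"] @ h + b["o"])  # output gate
    g = np.tanh(W["c"] @ x + U["c"] @ h + b["c"])  # cell candidate (Uc path)
    c_new = f * c + i * g                          # the vv_mul/vv_add chain
    h_new = o * np.tanh(c_new)                     # final tanh and vv_mul(ot)
    return h_new, c_new
```

Note the serial dependence: h_new and c_new feed the next timestep, so the steps cannot be parallelized across time — parallelism must come from within each matrix-vector product.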
[Figure: matrix-vector multiply (MVM) unit — a dispatcher deals matrix rows to four parallel lanes of chained multiply-accumulate (×, +) dot-product units; a top-level scheduler coordinates the scalar instruction processor and the MVM scheduler]
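The MVM unit extracts parallelism by dealing matrix rows to independent dot-product lanes, each built from chained multiply-accumulate units. A NumPy sketch of that row-wise decomposition (illustrative only; lane count and round-robin assignment are assumptions):

```python
import numpy as np

def mvm_by_lanes(W, x, lanes=4):
    """Matrix-vector multiply as independent dot-product lanes.

    Rows of W are dealt round-robin to `lanes` parallel units, mirroring
    a dispatcher feeding a spatial array of multiply-accumulate chains.
    """
    y = np.empty(W.shape[0])
    for lane in range(lanes):                  # each lane runs in parallel in hardware
        for r in range(lane, W.shape[0], lanes):
            y[r] = np.dot(W[r], x)             # one multiply-accumulate chain
    return y

W = np.arange(16.0).reshape(4, 4)
x = np.ones(4)
assert np.allclose(mvm_by_lanes(W, x), W @ x)
```

Because every row's dot product is independent, utilization scales with the number of lanes even at batch=1 — no cross-request batching is needed to fill the array.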
DeepBench RNN: GRU-2816, batch=1, 71B OPs/serve

  Device                      Node         Latency   Effective TFLOPS   Utilization
  BW-NPU (Stratix 10, FP8)    Intel 14nm   2 ms      35.9               74.8%

CNN: ResNet-50, batch=1, 7.7B OPs/serve

  Device                      Node         Latency   Effective TFLOPS   Utilization
  BW-NPU (Arria 10, FP11)     TSMC 20nm    1.64 ms   4.7                66%
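"Effective TFLOPS" here is simply operations per serve divided by serving latency. A quick check of the reported numbers (the small RNN mismatch comes from rounding of the 71B op count):

```python
# Effective throughput = OPs per serve / latency
rnn = 71e9 / 2e-3 / 1e12      # GRU-2816, batch=1 -> ~35.5 TFLOPS (reported: 35.9)
cnn = 7.7e9 / 1.64e-3 / 1e12  # ResNet-50, batch=1 -> ~4.7 TFLOPS
print(round(rnn, 1), round(cnn, 1))
```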
https://github.com/Azure/aml-real-time-ai