Professor Media IC & System Lab Graduate Institute of - PowerPoint PPT Presentation

Shao-Yi Chien, Professor, Media IC & System Lab, Graduate Institute of Electronics Engineering, National Taiwan University. Outline: AI edge (distributed intelligence); tensor transform for memory-efficient operations.


  1. Shao-Yi Chien (簡韶逸) Professor Media IC & System Lab Graduate Institute of Electronics Engineering National Taiwan University

  2. Outline — AI edge: distributed intelligence — Tensor transform for memory-efficient operations — Implementation results — Conclusion

  3. Internet-of-AI-Things (figure labels: AI, IoT, Big Data)

  4. Where Should Computing be Located? — Data from Internet: big data — Data from IoT: ultra-big data! — AI on the cloud? — AI on the edge? (figure labels: Cloud Servers, Aggregator, Smart Devices)

  5. Distributed Intelligence. Tiers: Sensor, Aggregator/Gateway (AI Edge), Cloud. Data from each sensor: Large, Small (data filtering process). Semantic level: High, Low (context inferring process). Hardware labels: Light-Weight Learning/Recognition Engine; Cloud Servers; HSA, NPU, DSP; CPU/GPU/FPGA; Neural Processors.

  6. Deep Learning Ecosystem: memory efficiency is the most important target for optimization

  7. Unroll: Fast and Simple

  8. Formulation of Unrolling
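The slide body did not survive extraction. As a minimal sketch of what unrolling usually means for a 2-D convolution (often called im2col), the pure-Python example below gathers each sliding window into one flat row, so the convolution becomes one inner product per row. The names `unroll_2d` and `conv_by_unroll` are illustrative assumptions, not taken from the slides.

```python
def unroll_2d(image, kh, kw):
    """Gather every kh x kw window of `image` (a list of rows) into one flat row."""
    h, w = len(image), len(image[0])
    rows = []
    for y in range(h - kh + 1):
        for x in range(w - kw + 1):
            rows.append([image[y + dy][x + dx]
                         for dy in range(kh) for dx in range(kw)])
    return rows


def conv_by_unroll(image, kernel):
    """Valid-mode 2-D correlation (no kernel flip, as in CNNs) via unrolled rows."""
    kh, kw = len(kernel), len(kernel[0])
    flat_kernel = [v for row in kernel for v in row]
    # One inner product per unrolled row.
    return [sum(a * b for a, b in zip(row, flat_kernel))
            for row in unroll_2d(image, kh, kw)]


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
result = conv_by_unroll(image, kernel)
print(result)  # [6, 8, 12, 14]
```

The output is a flat list in row-major order of the output positions. Reshaping it to 2 x 2 gives the usual feature map.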

  9. Unroll: More than Conv.
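The examples on this slide were lost in extraction. As an assumption-based illustration of the title, the sketch below reuses one set of unrolled rows for three different per-row operations: a weighted sum (convolution-like), a median filter, and max pooling. The helper names `unroll_1d` and `apply_per_row` are hypothetical.

```python
def unroll_1d(signal, window):
    """Return every length-`window` sliding slice of `signal` as a row."""
    return [signal[i:i + window] for i in range(len(signal) - window + 1)]


def apply_per_row(rows, row_fn):
    """Apply the same function independently to every unrolled row."""
    return [row_fn(row) for row in rows]


signal = [3, 1, 4, 1, 5, 9, 2]
rows = unroll_1d(signal, 3)

weights = [1, 2, 1]
smoothed = apply_per_row(rows, lambda r: sum(a * b for a, b in zip(r, weights)))
median = apply_per_row(rows, lambda r: sorted(r)[len(r) // 2])
max_pool = apply_per_row(rows, max)

print(smoothed)  # [9, 10, 11, 20, 25]
print(median)    # [3, 1, 4, 5, 5]
print(max_pool)  # [4, 4, 5, 9, 9]
```

Only the per-row function changes between the three results. The unroll pattern stays the same.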

  10. Unrolling: Where and Who? — Where is the unrolling operation employed? — Everywhere in optimized parallel computing systems! — CPU, GPU, DSP, VPU, ASIC — Who executes unrolling in a system? — General-purpose processors: the software developers need to handle it — VPU and ASIC: it is embedded in the hardware for specific applications

  11. Problem of Unrolling (figure label: Main memory)
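The figure for this slide is missing. One commonly cited cost of unrolling, assumed here to be what the slide refers to, is that a fully materialized unrolled matrix in main memory is much larger than the input, because overlapping windows duplicate elements. The sketch below estimates that growth. The function name and the 224 x 224 x 3 input size are illustrative assumptions.

```python
def unrolled_elements(h, w, c, kh, kw, stride=1):
    """Element count of a fully materialized unrolled (im2col) matrix, no padding."""
    out_h = (h - kh) // stride + 1
    out_w = (w - kw) // stride + 1
    # One row per output position, one column per kernel tap and channel.
    return out_h * out_w * kh * kw * c


h, w, c = 224, 224, 3
original = h * w * c
unrolled = unrolled_elements(h, w, c, kh=3, kw=3)
ratio = unrolled / original

print(original)         # 150528
print(unrolled)         # 1330668
print(round(ratio, 2))  # 8.84
```

For a 3 x 3 kernel at stride 1 the ratio approaches 9, which is the number of windows covering each interior pixel.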

  12. Unroll is a Fast Blackbox (figure labels: Main memory, Unroll Blackbox, Processors)

  13. Efficient Blackbox: Unroll as Late as Possible

  14. Naïve Unrolling

  15. Unroll at Shared Memory

  16. Unroll Upon Computation
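The slide body is not recoverable. One plausible reading of the title is that the unrolled matrix is never stored: each (row, column) index of the virtual unrolled matrix is mapped back to an input coordinate at the moment it is consumed. The sketch below illustrates that idea on the CPU in pure Python. It is an assumption-based illustration with a hypothetical function name, not the authors' GPU kernel.

```python
def conv_unroll_on_the_fly(image, kernel):
    """Valid-mode 2-D correlation that indexes a virtual unrolled matrix lazily."""
    h, w = len(image), len(image[0])
    kh, kw = len(kernel), len(kernel[0])
    out_w = w - kw + 1
    num_rows = (h - kh + 1) * out_w  # rows of the virtual unrolled matrix
    out = []
    for row_id in range(num_rows):
        y, x = divmod(row_id, out_w)      # output position for this row
        acc = 0
        for col_id in range(kh * kw):
            dy, dx = divmod(col_id, kw)   # kernel tap for this column
            # Read straight from the original image; nothing is copied.
            acc += image[y + dy][x + dx] * kernel[dy][dx]
        out.append(acc)
    return out


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, 1]]
lazy_result = conv_unroll_on_the_fly(image, kernel)
print(lazy_result)  # [6, 8, 12, 14]
```

The extra memory is a single accumulator per output, at the cost of recomputing the index mapping for every access.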

  17. Useful Unrolling Framework Requires — Formulation of unrolling — Build algorithms by unrolling — DNN — CV, ML — … — Memory efficient unrolling — GPUs — ASICs

  18. UMI (Unrolled Memory Inner-Products) Operator — You simply write code for — Describing the unroll pattern and — Defining what to do for each row. — An efficient blackbox makes your code fast.
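The two things the user writes, the unroll pattern and the per-row action, can be sketched as a small interface. The following is a hypothetical pure-Python analogue and not the actual UMI API, which is a CUDA operator. `UnrollPattern`, `umi_like`, and their parameters are names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Sequence


@dataclass
class UnrollPattern:
    """Hypothetical 1-D unroll description: window length and step between rows."""
    window: int
    stride: int = 1


def umi_like(data: Sequence[float],
             pattern: UnrollPattern,
             row_fn: Callable[[Iterable[float]], float]) -> List[float]:
    """Feed each unrolled row to `row_fn` without storing the unrolled matrix."""
    out = []
    last_start = len(data) - pattern.window
    for start in range(0, last_start + 1, pattern.stride):
        # The row is a generator, so elements are read on demand.
        row = (data[start + k] for k in range(pattern.window))
        out.append(row_fn(row))
    return out


data = [1, 2, 3, 4, 5, 6]
sums = umi_like(data, UnrollPattern(window=2, stride=2), sum)
maxes = umi_like(data, UnrollPattern(window=3), max)
print(sums)   # [3, 7, 11]
print(maxes)  # [3, 4, 5, 6]
```

The caller never touches indexing or memory layout. In this sketch that separation is what would let a tuned backend replace the loop without changing user code.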

  19. Memory Efficient Unrolling — Smooth dataflow must consider: 1. DRAM reuse 2. Bank conflict — Both can be analyzed by the formula:
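The formula itself did not survive extraction and is not reconstructed here. As a separate, standard illustration of what a bank conflict is, the sketch below counts how many threads of a group hit the same shared-memory bank when thread `t` accesses word address `t * stride`. The 32-thread, 32-bank configuration and the function name are assumptions chosen to resemble common GPU hardware.

```python
from collections import Counter


def bank_conflict_degree(stride, num_threads=32, num_banks=32):
    """Largest number of threads mapped to one bank for a strided access."""
    banks = Counter((t * stride) % num_banks for t in range(num_threads))
    return max(banks.values())


print(bank_conflict_degree(1))   # 1  (conflict-free)
print(bank_conflict_degree(2))   # 2  (2-way conflict)
print(bank_conflict_degree(32))  # 32 (fully serialized)
print(bank_conflict_degree(33))  # 1  (padding by one word removes the conflict)
```

A degree of 1 means all accesses can be served at once. Higher degrees mean the accesses to that bank are serialized.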

  20. UMI: Experimental Results — UMI blackbox compared against baselines: OpenCV, Parboil, and Caffe — CUDA version is available on GitHub — Code reduction: 2x to 4x — Speed-up: 1.4x to 26x — Hardware implementation is coming soon. Ref: Y. S. Lin, W. C. Chen and S. Y. Chien, "Unrolled Memory Inner-Products: An Abstract GPU Operator for Efficient Vision-Related Computations," ICCV 2017.

  21. ASIC Design — TAU: 32-core parallel processor — Scaled up linearly

  22. Conclusion — AI edge: distributed intelligence — Memory access optimization is the key to efficient CNN computing — Unrolling plays an important role in memory optimization, which can also benefit other operations — An unrolling framework, tensor transform for memory-efficient operations, is developed to decouple unrolling operations — Implementation results: code reduction 2x to 4x; speed-up 1.4x to 26x

  23. Using UMI Operator is…
