Ara Design and implementation of a 1GHz+ 64-bit RISC-V Vector - PowerPoint PPT Presentation

Ara: Design and implementation of a 1GHz+ 64-bit RISC-V Vector Processor in 22 nm FD-SOI
Matheus CAVALCANTE (PhD Student, ETH Zurich), Fabian SCHUIKI, Florian ZARUBA, Michael SCHAFFNER, Luca BENINI
2 October 2019

[Figure: Ariane block diagram. Ariane at 1 GHz, 2 DP-GFLOPS, 8 GB/s; I$ and D$ with 64-bit instruction and data ports to a 64-bit interconnect.]

[Figure: Ariane and Ara block diagram. Ariane (1 GHz, 2 DP-GFLOPS, 8 GB/s; I$, D$, MMU; 64-bit instruction and data ports) is coupled to Ara (1 GHz, 8 DP-GFLOPS, 16 GB/s; 128-bit data port) through an instruction queue and an ACK/TRAP interface. Both attach to a 128-bit interconnect.]

Memory Bandwidth and Performance: Rooflines
- Arithmetic intensity: operations per byte, i.e. the data reuse of an algorithm
  - One FMA is two operations
- Memory-bound and compute-bound regions
  - Peak performance per memory width ratio
- Ara targets 0.5 DP-FLOP/B
  - Memory bandwidth scales with the number of FMAs
[Figure: roofline plot of performance against arithmetic intensity, with memory-bound and compute-bound regions.]
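
The roofline bound above can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names are hypothetical, and the peak and bandwidth figures are the ones shown on the Ariane and Ara block diagrams (2 DP-GFLOPS at 8 GB/s, and 8 DP-GFLOPS at 16 GB/s).

```python
def attainable_gflops(intensity_flop_per_byte, peak_gflops, bandwidth_gb_s):
    """Roofline bound: the lower of the compute roof and the memory roof."""
    memory_roof = bandwidth_gb_s * intensity_flop_per_byte
    return min(peak_gflops, memory_roof)


def ridge_point(peak_gflops, bandwidth_gb_s):
    """Arithmetic intensity (FLOP/B) at which a kernel stops being memory-bound."""
    return peak_gflops / bandwidth_gb_s


# Figures taken from the block-diagram slides (hypothetical constant names).
ARIANE_PEAK_GFLOPS, ARIANE_BW_GB_S = 2.0, 8.0
ARA_PEAK_GFLOPS, ARA_BW_GB_S = 8.0, 16.0

if __name__ == "__main__":
    print("Ara ridge point (FLOP/B):", ridge_point(ARA_PEAK_GFLOPS, ARA_BW_GB_S))
    for intensity in (0.125, 0.5, 2.0):
        bound = attainable_gflops(intensity, ARA_PEAK_GFLOPS, ARA_BW_GB_S)
        print(f"intensity {intensity} FLOP/B -> at most {bound} DP-GFLOPS")
```

With these numbers the ridge point for Ara is 8 / 16 = 0.5 DP-FLOP/B, which matches the target stated on the slide.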

Ara: High-performance vector processor
- GlobalFoundries' GF22 FD-SOI process
- Work started with my Master's thesis
- First presented at the 1st RISC-V Summit, last year
- Will be open-sourced in 2019 within the PULP Platform (as usual!)
- Snapshot of the current development
  - Challenges we faced
  - Results we achieved
  - Insights we gained

RISC-V Vector Extension
- RISC-V "V" Extension: "Cray-like" vector-SIMD approach
- Ara is based on version 0.5
  - Work is under way to update it to the latest version of the spec (0.7)
  - Open-sourcing later this year
- Not fully compliant
  - Limited support for fixed-point and vector atomics (not our focus)
  - Limited support for type promotions (e.g., 64b ← 8b + 8b) because of the hardware cost

State of the art
- Fujitsu's A64FX
  - Based on ARM SVE
  - 2.7 DP-TFLOPS in a 7 nm process
- Hwacha
  - Vector-fetch architecture
  - More complex: the vector unit fetches its own instructions and threads can diverge
  - Predecessor to RISC-V "V", with its own ISA
  - A later version should be compliant with the vector extension
  - 64 DP-GFLOPS in TSMC 16 nm
  - 40 DP-GFLOPS/W in a 28 nm process

Microarchitecture

Ara with N identical lanes
- Memory width W
  - Keep the peak performance per memory width at 0.5 DP-FLOP/B
- Vector instruction dispatching
  - Ara executes instructions non-speculatively
  - The sequencer acknowledges instructions as soon as they are deemed "safe"
- Identical lanes
  - Each lane holds part of the computing units and part of the Vector Register File (VRF): scalability!

Lane microarchitecture
- Multibanked Vector Register File
  - Sustains high throughput without multiple ports
  - Requires a VRF arbiter (banking conflicts)
  - Word width: 64 bits (aka operand width)
- Operand queues
  - Needed to sustain maximum throughput for the lock-step operation of the FUs, while hiding the latency caused by banking conflicts in the VRF

Trans-precision functional units
- The FPU can handle 1 x 64b, 2 x 32b, 4 x 16b, and 8 x 8b operations per cycle
  - The FMA is pipelined (5 cycles) to meet the fmax constraint
  - Design by Stefan Mach et al.
- Idea embedded in the ISA
  - A CSR holds the "standard element width" of the vectors
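
The relation between element width and per-cycle throughput can be sketched as follows. This is an illustrative model only: the function names are hypothetical, and it assumes that the 64-bit datapath is fully packed with elements of the selected width and that every element performs one FMA (two operations) per cycle.

```python
DATAPATH_BITS = 64  # operand width of one lane, as stated on the lane slide
SUPPORTED_WIDTHS = (64, 32, 16, 8)


def elements_per_cycle(element_width_bits):
    """Number of elements packed into one 64-bit operand."""
    if element_width_bits not in SUPPORTED_WIDTHS:
        raise ValueError(f"unsupported element width: {element_width_bits}")
    return DATAPATH_BITS // element_width_bits


def peak_ops_per_cycle_per_lane(element_width_bits):
    """Peak operations per cycle for one lane, counting one FMA as two operations."""
    return 2 * elements_per_cycle(element_width_bits)


if __name__ == "__main__":
    for width in SUPPORTED_WIDTHS:
        print(f"{width:2d}b: {elements_per_cycle(width)} elements/cycle, "
              f"{peak_ops_per_cycle_per_lane(width)} operations/cycle/lane")
```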

Performance Evaluation

Main kernel under evaluation: MATMUL
- DP-MATMUL: n x n double-precision matrix multiplication, C ← AB + C
- 32n^2 bytes of memory transfers and 2n^3 operations
  - Arithmetic intensity of n/16 FLOP/B
- Compute-bound on Ara for n > 8
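
The arithmetic behind the n > 8 threshold can be checked with a short sketch. The byte and operation counts are the ones given on the slide; the function names are hypothetical, and the 0.5 DP-FLOP/B ridge point is Ara's stated target.

```python
ARA_RIDGE_FLOP_PER_BYTE = 0.5  # Ara's target peak performance per memory width


def matmul_bytes(n):
    """Memory traffic of the n x n DP-MATMUL, as stated on the slide: 32 n^2 bytes."""
    return 32 * n ** 2


def matmul_operations(n):
    """Operation count of the n x n DP-MATMUL: 2 n^3 (one FMA is two operations)."""
    return 2 * n ** 3


def arithmetic_intensity(n):
    """Operations per byte; simplifies to n / 16."""
    return matmul_operations(n) / matmul_bytes(n)


def is_compute_bound(n, ridge=ARA_RIDGE_FLOP_PER_BYTE):
    """A kernel is compute-bound when its intensity exceeds the ridge point."""
    return arithmetic_intensity(n) > ridge


if __name__ == "__main__":
    for n in (4, 8, 9, 16, 256):
        print(n, arithmetic_intensity(n), is_compute_bound(n))
```

Setting n/16 > 0.5 gives n > 8, so n = 8 sits exactly on the ridge point and n = 9 is the first compute-bound size.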

Up to 98% efficiency on MATMUL (always?)

Efficiency drop to 49% for a 16x16 MATMUL

    vld   vB, 0(a0)
    ld    t0, 0(a1)
    add   a1, a1, a2
    vins  vA, t0, zero
    vmadd vC0, vA, vB, vC0
    ld    t0, 0(a1)
    add   a1, a1, a2
    vins  vA, t0, zero
    vmadd vC1, vA, vB, vC1
    ld    t0, 0(a1)
    add   a1, a1, a2
    vins  vA, t0, zero
    vmadd vC2, vA, vB, vC2
    ...

- vmadds are issued at best every four cycles
  - Ariane is a single-issue core
- If the vmadd takes less than four cycles to execute, the FPUs starve waiting for instructions
  - This translates to the "issue rate" boundary on the roofline plot
- The vector processor becomes more and more like an array processor
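
The issue-rate limit can be expressed as a first-order upper bound. This sketch is an assumption-laden model, not the authors' evaluation: it assumes each vmadd operates on a vector of n elements (one row of the matrix), that each lane completes one FMA per cycle, and that one vmadd is issued every four cycles. It ignores pipeline latency, memory accesses, and start-up effects, so it bounds the FPU utilization from above and does not by itself reproduce the measured 49%.

```python
def issue_bound_utilization(n, lanes, issue_interval_cycles=4):
    """Upper bound on FPU utilization imposed by the vmadd issue rate.

    Each vmadd contributes 2 * n operations and at most one vmadd is issued
    every `issue_interval_cycles` cycles, while the lanes can together
    retire 2 * lanes operations per cycle.
    """
    issued_ops_per_cycle = 2 * n / issue_interval_cycles
    peak_ops_per_cycle = 2 * lanes
    return min(1.0, issued_ops_per_cycle / peak_ops_per_cycle)


if __name__ == "__main__":
    for lanes in (2, 4, 8, 16):
        for n in (16, 64, 256):
            bound = issue_bound_utilization(n, lanes)
            print(f"lanes={lanes:2d} n={n:3d} utilization bound={bound:.2f}")
```

Under these assumptions the bound only bites when n is smaller than four times the lane count, which is the regime the slide describes: short vectors on many lanes leave the FPUs waiting for the scalar core.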

Ara: 4 lanes
- GF 22FDX, 1.25 GHz implementation (TT, 0.80 V, 25 °C)
[Figure: floorplan showing Lane 0 to Lane 3, Ariane, the SLDU, the front-end, and the VLSU.]

Figures of merit
- Clock frequency: 1.25 GHz (nominal), 0.92 GHz (worst)
- Area: 3430 kGE (0.68 mm^2)
- 256x256 MATMUL
  - Performance: 9.80 DP-GFLOPS
  - Power: 259 mW
  - Efficiency: 38 DP-GFLOPS/W
[Figure: area breakdown.]

Ara's scalability
- Each lane is almost independent
  - Contains part of the VRF and an FMA unit
- Scalability limitations
  - VLSU and SLDU: need to communicate with all lanes, writing to all VRF banks
- Instance with 16 lanes achieves
  - 1.04 GHz (nominal), 0.78 GHz (worst)
  - 10.7 MGE (2.13 mm^2)
  - 32.4 DP-GFLOPS
  - 40.8 DP-GFLOPS/W
[Figure: floorplan with the SLDU, VLSU, and Ariane labelled.]
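
The reported figures for the 4-lane and 16-lane instances can be cross-checked with a few lines of arithmetic. This is a sketch under one assumption: peak performance is taken as one FMA (two operations) per lane per cycle at the nominal clock. The function names are hypothetical; all input numbers come from the slides.

```python
def peak_gflops(lanes, freq_ghz):
    """Peak DP-GFLOPS, assuming one FMA (two operations) per lane per cycle."""
    return 2 * lanes * freq_ghz


def utilization(measured_gflops, lanes, freq_ghz):
    """Fraction of the assumed peak reached by the measured performance."""
    return measured_gflops / peak_gflops(lanes, freq_ghz)


def gflops_per_watt(measured_gflops, power_mw):
    """Energy efficiency from measured performance and power in milliwatts."""
    return measured_gflops / (power_mw / 1000.0)


if __name__ == "__main__":
    # 4-lane instance: 1.25 GHz nominal, 9.80 DP-GFLOPS, 259 mW.
    print("4 lanes, peak:", peak_gflops(4, 1.25))
    print("4 lanes, utilization:", utilization(9.80, 4, 1.25))
    print("4 lanes, DP-GFLOPS/W:", gflops_per_watt(9.80, 259))
    # 16-lane instance: 1.04 GHz nominal, 32.4 DP-GFLOPS.
    print("16 lanes, peak:", peak_gflops(16, 1.04))
    print("16 lanes, utilization:", utilization(32.4, 16, 1.04))
```

Under this assumption the 4-lane instance reaches about 98% of its 10 DP-GFLOPS peak, consistent with the efficiency quoted earlier, and the 16-lane instance reaches about 97% of its peak.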

More details?
- More details available in the arXiv paper
  - Ara: A 1 GHz+ Scalable and Energy-Efficient RISC-V Vector Processor with Multi-Precision Floating Point Support in 22 nm FD-SOI
  - arxiv.org/abs/1906.00478
- Open-sourcing within the PULP Platform
  - Planned for before the end of this year!
- Contact me at matheusd at iis.ee.ethz.ch :)
