ARM FPUs: Low Latency is Low Energy
David Lutz


Every computer has a power budget

device          total power budget   screen size
simple phone    3W                   3"
smartphone      5W                   4-5"
tablet          15W                  10"
laptop          35W                  13"
supercomputer   20 megawatts         -

• Power limited by heat generated
• Performance increases over time, but power budget does not
• Active research area: how to get more performance within a power budget

Low Latency is Low Energy

[Figure: power plotted against time; the enclosed area is energy]

• Energy = Power * Time
• Datapaths consume little power on out-of-order cores
• Current ARM FPUs consume about 7% of "big" core power running DAXPY
• Decreasing latency can decrease time
• Energy savings is not just datapath energy

Typical 5-cycle FMA

• All 3 operands needed at the beginning of the operation
• Sum of 4 products: s = a*x + b*y + c*z + d*w

instruction   cycles   stages
fmul s,a,x    1-5      M M M M M
fma  s,b,y    6-10     F F F F F
fma  s,c,z    11-15    F F F F F
fma  s,d,w    16-20    F F F F F

ARM 6-cycle FMA with separate multiply and add

• 3-cycle multiply followed by 3-cycle add
• Note that a single FMA is slower
• Sum of 4 products: s = a*x + b*y + c*z + d*w

Pipeline diagram, cycles 1 to 13; stages per instruction:

instruction   stages
fmul s,a,x    M1 M2 M3
fma  s,b,y    M1 M2 M3 A1 A2 A3
fma  s,c,z    M1 M2 M3 A1 A2 A3
fma  s,d,w    M1 M2 M3 A1 A2 A3
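The 20-cycle and 13-cycle totals on the two pipeline slides can be reproduced with a small scheduling sketch. This is an illustration only, not a model of any real ARM core: the function names are hypothetical, and the split-pipeline version assumes that one instruction issues per cycle and that an FMA's add stage waits only for its own multiply and for the previous result.

```python
def fused_chain_cycles(n_ops, latency=5):
    """Cycles for a dependent chain when every op needs all operands at issue.

    Each operation must wait for the previous result, so latencies simply add.
    """
    return n_ops * latency


def split_chain_cycles(n_ops, mul_latency=3, add_latency=3):
    """Cycles for fmul followed by (n_ops - 1) FMAs with separate multiply and add.

    Assumption (not from the slides): one instruction issues per cycle, and the
    multiply half of an FMA does not depend on the running sum.
    """
    done = mul_latency  # fmul issues in cycle 1 and finishes in cycle 3
    for i in range(1, n_ops):
        issue = i + 1                        # assumed issue cycle of this FMA
        mul_done = issue + mul_latency - 1   # last cycle of its multiply
        add_start = max(mul_done, done) + 1  # add needs product and prior sum
        done = add_start + add_latency - 1   # last cycle of its add
    return done


if __name__ == "__main__":
    print(fused_chain_cycles(4))   # fused 5-cycle FMA chain
    print(split_chain_cycles(4))   # 3-cycle multiply + 3-cycle add chain
```

Under these assumptions the sum of four products takes 20 cycles with the fused 5-cycle unit and 13 cycles with the split unit, even though a single split FMA (6 cycles) is slower than a single fused one (5 cycles).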

3-cycle multiplier

Inputs: opa[63:0], opb[63:0]. Output: product[63:0].

• V1
  - normalization
  - Booth encoding
• V2
  - Booth mux
  - 18:2 reduction
  - compute shift, round, mask
• V3
  - add and round (2)
  - subnormal shift
  - select specials

[Block diagram. Labeled elements by stage. V1: CLZ on siga and sigb; 0-63 bit left shifts producing normalized siga and normalized 3x siga; radix 8 Booth encoder producing BM[17:0]; computed exponent. V2: Booth 8 mux; 3:2 compressors (18->12->8->6->4->3->2) producing D[105:0] and E[105:0]; shift, round, and mask generation. V3: 3:2 stages with round inputs; sum[105:0] and ovfl sum; 0-63 bit right shifts; last bit and flags; select between rounded sum, rounded ovfl sum, and special, giving product[63:0].]
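Two numbers in the multiplier diagram can be checked with a short sketch: the 18 Booth digits (BM[17:0]) for a 53-bit double-precision significand, and the 18->12->8->6->4->3->2 compressor sequence. The code below uses textbook radix-8 Booth recoding and a plain "group rows in threes" 3:2 reduction; both function names are hypothetical and the actual ARM encoding may differ in detail.

```python
def booth_radix8_digits(multiplier, width=53):
    """Recode an unsigned multiplier into signed radix-8 digits in [-4, 4].

    Textbook recoding: digit_i = -4*b[3i+2] + 2*b[3i+1] + b[3i] + b[3i-1].
    Digits of +/-3 are why a precomputed 3x multiple of the multiplicand
    is needed; the other multiples are shifts and negations.
    """
    n_digits = width // 3 + 1  # 53 bits -> 18 digits
    digits = []
    prev = 0  # implicit zero bit to the right of bit 0
    for i in range(n_digits):
        group = (multiplier >> (3 * i)) & 0b111
        b0, b1, b2 = group & 1, (group >> 1) & 1, (group >> 2) & 1
        digits.append(-4 * b2 + 2 * b1 + b0 + prev)
        prev = b2
    return digits


def csa_levels(rows):
    """Row counts after each level of 3:2 compression.

    Every full group of three rows becomes two; leftover rows pass through.
    """
    levels = [rows]
    while rows > 2:
        rows = 2 * (rows // 3) + rows % 3
        levels.append(rows)
    return levels
```

For any 53-bit significand `m`, `sum(d * 8**i for i, d in enumerate(booth_radix8_digits(m)))` gives back `m`, and `csa_levels(18)` walks through the same six reduction levels the diagram labels.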

3-cycle adder

Inputs: opa sources[63:0], opb sources[116:0]. Output: sum[63:0].

• V1
  - compare/swap
  - 4xLZA
  - compute exponent
  - compute lshift, rshift
• V2
  - left and right shift
  - select
  - 3:2 for rounding
• V3
  - add and round
  - select

[Block diagram. Labeled elements by stage. V1: opa_mux and opb_mux; comparison; LZAs and exponent compares; 4:1 select; larger, shift1, rshift1; lshift[6:0], exp_diff; opl[106:0] and ops[106:0]. V2: right shift and left shifts; 3:1 selects (ls, rs, subnormal); 3:2 FA with round1 and round0, producing c1[107:0], s1[107:0], c0[107:0], s0[107:0]. V3: add1 and add0; overflow and overflow2; 4:1 select with specials, giving sum[63:0].]

Faster FPU = higher performance and lower energy

• Suppose lower latency FPU is 15% faster than higher latency FPU
• Takes 1/1.15 = .87 of the time to complete SpecFP

             time   FP power   non-FP power   energy = time * power
Slower FPU   1      7          93             1.0 * (7+93) = 100
Faster FPU   0.87   p          93             .87 * (p+93) = .87p + 80.9

• New scheme lower energy if 100 > .87p + 80.9
  - if p < 22
  - if p < 3 times slower FPU power
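The break-even arithmetic on this slide can be written as a small function. This is only a restatement of the slide's energy = time * power model; the function name and parameter names are hypothetical, and power is in the slide's relative units (slower FPU = 7, rest of the core = 93).

```python
def break_even_fp_power(speedup=1.15, fp_power=7.0, non_fp_power=93.0):
    """Largest FP power p for which the faster FPU still uses less energy.

    Baseline energy is 1.0 * (fp_power + non_fp_power). The faster design
    runs for 1/speedup of the time, so it wins while
        (1/speedup) * (p + non_fp_power) < baseline energy.
    """
    baseline_energy = 1.0 * (fp_power + non_fp_power)
    faster_time = 1.0 / speedup
    return baseline_energy / faster_time - non_fp_power


if __name__ == "__main__":
    p_max = break_even_fp_power()
    print(p_max, p_max / 7.0)  # threshold, and its ratio to the slower FPU's power
```

With the unrounded time 1/1.15 the threshold is about 22 units, roughly three times the slower FPU's 7 units, which agrees with the slide's rounded figures.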

Faster FPU can lead to lower area

• Fewer (flip)flops vs. more logic
• Where is the area going?

Strategy for out-of-order cores

• Do the execution as quickly as possible to save energy
• Be suspicious of slower execution, e.g.
  - double pumped multipliers
  - slower dividers
• Execution units are where you want to spend power

Conclusions

• Low execution latency has an outsized effect on performance
• Low latency can improve area
• Low latency is low energy
