Vector Processors CS252 • I nit ially developed f or super -comput ing applicat ions, Graduate Computer Architecture t oday impor t ant f or mult imedia. Lecture 20 • Vect or pr ocessor s have high- level oper at ions t hat Vector Processing => Multimedia wor k on linear ar r ays of number s: "vect or s" VECTOR SCALAR (N operations) David E. Culler (1 operation) r1 r2 v1 v2 + + Many slides due to Christ of oros E. Kozyrakis r3 v3 vector length add r3, r1, r2 vadd.vv v3, v1, v2 CS252/ Culler Lec 20. 2 4/ 9/ 02 Properties of Vector Processors Styles of Vector Architectures • Single vect or inst r uct ion implies lot s of wor k (loop) • Memory- memor y vect or pr ocessor s – Fewer inst ruct ion f et ches – All vect or operat ions are memory t o memory • Each r esult independent of pr evious r esult • Vect or-r egist er pr ocessor s – Mult iple operat ions can be execut ed in parallel – All vect or operat ions bet ween vect or regist ers (except vect or load and st ore) – Simpler design, high clock rat e – Vect or equivalent of load- st ore archit ect ures – Compiler (programmer) ensures no dependencies – I ncludes all vect or machines since lat e 1980s • Reduces br anches and br anch pr oblems in pipelines – We assume vect or-regist er f or rest of t he lect ure • Vect or inst r uct ions access memor y wit h known pat t ern – Ef f ect ive pref et ching – Amort ize memory lat ency of over large number of element s – Can exploit a high bandwidt h memory syst em – No (dat a) caches required! CS252/ Culler CS252/ Culler Lec 20. 3 Lec 20. 4 4/ 9/ 02 4/ 9/ 02 Historical Perspective Cray- 1 Breakthrough • Mid-60s f ear per f . st agnat es • Fast , simple scalar processor – 80 MHz! • SI MD pr ocessor ar r ays – single-phase, lat ches act ively developed dur ing lat e • Exquisit e elect rical and mechanical design 60’s – mid 70’s • Semiconduct or memory – bit - parallel machines f or image processing • Vect or regist er concept • pepe, st aran, mpp – vast simplif icat ion of inst ruct ion set – wor d- parallel f or scient if ic – r educed necc. memory bandwidt h • I lliac I V • Tight int egrat ion of vect or and scalar • Cr ay develops f ast scalar • P iggy- back of f 7600 st acklib – CDC 6600, 7600 • Lat er vect orizing compilers developed • CDC bet s of vect or s wit h • Owned high- perf ormance comput ing f or a decade St ar-100 – what happened t hen? • Amdahl ar gues against vect or – VLI W compet it ion CS252/ Culler CS252/ Culler Lec 20. 5 Lec 20. 6 4/ 9/ 02 4/ 9/ 02 1
Components of a Vector Processor Cray- 1 Block • Scalar CPU: regist ers, dat apat hs, inst ruct ion f et ch logic • Vect or r egist er Diagram – Fixed lengt h memor y bank holding a single vect or – Typically 8-32 vect or regist ers, each holding 1 t o 8 Kbit s – Has at least 2 r ead and 1 wr it e por t s • Simple 16-bit RR inst r – MM: Can be viewed as array of 64b, 32b, 16b, or 8b element s • 32-bit wit h immed • Vect or f unct ional unit s (FUs) – Fully pipelined, st ar t new oper at ion ever y clock • Nat ur al combinat ions of – Typically 2 t o 8 FUs: int eger and FP scalar and vect or – Mult iple dat apat hs (pipelines) used f or each unit t o process mult iple element s per cycle • Scalar bit- vect or s • Vect or load-st ore unit s (LSUs) mat ch vect or lengt h – Fully pipelined unit t o load or st or e a vect or • Gat her / scat t er M-R – Mult iple element s f et ched/ st or ed per cycle – May have mult iple LSUs • Cond. mer ge • Cross-bar t o connect FUs , LSUs, regist ers CS252/ Culler CS252/ Culler Lec 20. 7 Lec 20. 8 4/ 9/ 02 4/ 9/ 02 Basic Vector I nstructions Vector Memory Operations I nst r. Operands Operat ion Comment • Load/ st or e oper at ions move gr oups of dat a bet ween r egist er s and memor y VADD.VV V1,V2,V3 V1=V2+V3 vect or + vect or • Thr ee t ypes of addr essing VADD.SV V1,R0,V2 V1=R0+V2 scalar + vect or V1=V2xV3 vect or x vect or – Unit st r ide VMUL.VV V1,V2,V3 • Fast est V1=R0xV2 scalar x vect or VMUL.SV V1,R0,V2 – Non- unit (const ant ) st r ide V1=M[R1..R1+63] load, st ride=1 VLD V1,R1 – I ndexed (gat her- scat t er) V1=M[R1..R1+63*R2] load, st ride=R2 VLDS V1,R1,R2 • Vect or equivalent of regist er indirect VLDX V1,R1,V2 V1=M[R1+V2i,i=0..63] indexed("gat her") • Good f or sparse arrays of dat a M[R1..R1+63]=V1 st ore, st ride=1 VST V1,R1 • I ncreases number of programs t hat vect or ize V1=M[R1..R1+63*R2] st ore, st ride=R2 VSTS V1,R1,R2 • compress/ expand variant also V1=M[R1+V2i,i=0..63] indexed(“scat t er") VSTX V1,R1,V2 • Suppor t f or var ious combinat ions of dat a widt hs in memory + all t he regular scalar inst ruct ions (RI SC st yle)… – {.L,.W,.H.,.B} x {64b, 32b, 16b, 8b} CS252/ Culler CS252/ Culler Lec 20. 9 Lec 20. 10 4/ 9/ 02 4/ 9/ 02 Vector Code Example Vector Length • A vect or r egist er can hold some maximum number of Y[0:63] = Y[0:653] + a*X[0:63] element s f or each dat a widt h (maximum vect or lengt h 64 element SAXPY: scalar 64 element SAXPY: vect or or MVL) LD R0,a LD R0,a #load scalar a • What t o do when t he applicat ion vect or lengt h is not ADDI R4,Rx,#512 VLD V1,Rx #load vector X exact ly MVL? loop: LD R2, 0(Rx) VMUL.SV V2,R0,V1 #vector mult • Vect or-lengt h (VL) r egist er cont r ols t he lengt h of any MULTD R2,R0,R2 VLD V3,Ry #load vector Y LD R4, 0(Ry) vect or oper at ion, including a vect or load or st or e VADD.VV V4,V2,V3 #vector add ADDD R4,R2,R4 – E.g. vadd.vv wit h VL=10 is VST Ry,V4 #store vector Y SD R4, 0(Ry) for (I=0; I<10; I++) V1[I]=V2[I]+V3[I] ADDI Rx,Rx,#8 ADDI Ry,Ry,#8 • VL can be anyt hing f r om 0 t o MVL SUB R20,R4,Rx • How do you code an applicat ion wher e t he vect or BNZ R20,loop lengt h is not known unt il r un- t ime? CS252/ Culler CS252/ Culler Lec 20. 11 Lec 20. 12 4/ 9/ 02 4/ 9/ 02 2
Strip Mining Optimization 1: Chaining • Suppose applicat ion vect or lengt h > MVL • Suppose: • St rip mining vmul.vv V1,V2,V3 vadd.vv V4,V1,V5 # RAW hazard – Gener at ion of a loop t hat handles MVL element s per it er at ion – A set operat ions on MVL element s is t ranslat ed t o a single vect or • Chaining inst r uct ion – Vect or regist er (V1) is not as a single ent it y but as a • Example: vect or saxpy of N element s group of individual regist ers – First loop handles (N mod MVL) element s, t he rest handle MVL – P ipeline f orwarding can work on individual vect or element s • Flexible chaining: allow vect or t o chain t o any ot her VL = (N mod MVL); // set VL = N mod MVL act ive vect or oper at ion => mor e r ead/ wr it e por t s for (I=0; I<VL; I++) // 1 st loop is a single set of Y[I]=A*X[I]+Y[I]; // vector instructions Unchained low = (N mod MVL); vadd Cray X-mp vmul VL = MVL; // set VL to MVL introduces for (I=low; I<N; I++) // 2 nd loop requires N/MVL memory chaining vmul Y[I]=A*X[I]+Y[I]; // sets of vector instructions Chained CS252/ Culler CS252/ Culler vadd Lec 20. 13 Lec 20. 14 4/ 9/ 02 4/ 9/ 02 Chaining & Multi- lane Example Optimization 2: Multi- lane I mplementation Scalar LSU FU0 FU1 Pipelined Lane Dat apat h vld vmul.vv Vect or Reg. Par t it ion vadd.vv Funct ional addu Unit Time vld vmul.vv To/ Fr om Memor y Syst em vadd.vv • Element s f or vect or regist ers int erleaved across t he lanes addu • Each lane receives ident ical cont rol • Mult iple element operat ions execut ed per cycle • Modular, scalable design Element Oper at ions: I nst r. I ssue: • No need f or int er - lane communicat ion f or most vect or inst ruct ions • VL=16, 4 lanes, 2 FUs, 1 LSU, chaining -> 12 ops/ cycle CS252/ Culler CS252/ Culler • J ust one new inst r uct ion issued per cycle !!!! Lec 20. 15 Lec 20. 16 4/ 9/ 02 4/ 9/ 02 Two Ways to View Vectorization Optimization 3: Conditional Execution • Suppose you want t o vect orize t his: • I nner loop vect or izat ion (Classic appr oach) for (I=0; I<N; I++) – Think of machine as, say, 32 vect or regist ers each wit h 16 if (A[I]!= B[I]) A[I] -= B[I]; element s • Solut ion: vect or condit ional execut ion – 1 inst ruct ion updat es 32 element s of 1 vect or regist er – Add vect or f lag regist ers wit h single-bit element s – Good f or vect orizing single- dimension arrays or regular kernels (e.g. saxpy) – Use a vect or compare t o set t he a f lag r egist er • Out er loop vect or izat ion (post-CM2) – Use f lag r egist er as mask cont r ol f or t he vect or sub • Addit ion execut ed only f or vect or element s wit h – Think of machine as 16 “virt ual processors” (VP s) cor r esponding f lag element set each wit h 32 scalar regist ers! (- mult it hreaded processor) – 1 inst ruct ion updat es 1 scalar regist er in 16 VPs • Vect or code – Good f or irregular kernels or kernels wit h loop- carried vld V1, Ra dependences in t he inner loop vld V2, Rb vcmp.neq.vv F0, V1, V2 # vector compare • These ar e j ust t wo compiler per spect ives vsub.vv V3, V2, V1, F0 # conditional vadd – The hardware is t he same f or bot h vst V3, Ra – Cray uses vector mask & merge CS252/ Culler CS252/ Culler Lec 20. 17 Lec 20. 18 4/ 9/ 02 4/ 9/ 02 3
Recommend
More recommend