
Itanium Power Programming - Sverre Jarp, CERN openlab



  1. “Itanium Power Programming”. Sverre Jarp, CERN openlab. Summer 2005.

  2. Agenda:
     - Lesson 1: a) Introduction; b) Overview of Architecture and Conventions
     - Lesson 2: a) Standard Instruction Set; b) Our first “real” example
     - Lesson 3: a) Secrets of Speed; b) An improved version of our example
     - Lesson 4: a) Multimedia Instructions; b) A top-notch version of our example
     - Lesson 5: a) Floating-point Instructions; b) Changing our example to handle floating-point
     - Lesson 6: a) Compilers and Assemblers: Peaceful coexistence?; b) Conclusions
     - Appendices

  3. Part 1a: Introduction

  4. Presentation Objectives
     - Offer programmers:
       - Comprehension of the architecture
         - Instruction set and other features
       - Working understanding of Itanium machine code
         - Compiler-generated code
         - Hand-written assembler code
       - Inspiration for writing code
         - Well-targeted assembler routines
         - Highly optimized routines
         - In-line assembly code
         - Full control of architectural features

  5. Part 1b: Overview of Architecture and Conventions

  6. Architectural Highlights
     - (Some of the) Main Innovations:
       - Rich Instruction Set
       - Bundled Execution
       - Predicated Instructions
       - Large Register Files
       - Register Stack
       - Rotating Registers
       - Software Pipelined Loops
       - Control/Data Speculation
       - Cache Control Instructions
       - High-precision Floating-Point

  7. A simple example
     - Lots of details
     - Many questions

         .proc getval
         getval:
               alloc r3=ar.pfs,R_input,R_local,R_output,R_rotating
         (p0)  movl r2=Table           // Base table address
         (p0)  and in0=7,in0           // Choice is 0 - 7
               ;;
         (p0)  shladd r2=in0,3,r2      // Index table
               ;;
         (p0)  ldfd f8=[r2]            // Load value
         (p0)  br.ret.sptk.few rp      // return

     - Callouts on the slide: application registers (ar.pfs), register allocation (alloc), enforced instruction separation (;;), predicated execution ((p0)), branch return (br.ret).
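To make the data flow of `getval` easier to follow, here is a minimal Python sketch of what the routine computes. It is a model, not Itanium code: the table contents, the base address of 0, and the little-endian byte order are assumptions chosen only for illustration.

```python
import struct

# Hypothetical table of eight doubles; the slide does not show its contents.
TABLE = [0.5, 1.5, 2.5, 3.5, 4.5, 5.5, 6.5, 7.5]
MEMORY = struct.pack("<8d", *TABLE)   # assumed little-endian layout
TABLE_BASE = 0                        # assumed address of Table in MEMORY


def getval(in0):
    """Model of the slide's getval routine."""
    index = in0 & 7                        # and in0=7,in0   (choice is 0 - 7)
    address = (index << 3) + TABLE_BASE    # shladd r2=in0,3,r2 (8-byte entries)
    (value,) = struct.unpack_from("<d", MEMORY, address)  # ldfd f8=[r2]
    return value                           # returned in f8
```

The shift by 3 in `shladd` is what turns an index into a byte offset for 8-byte table entries.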

  8. User Register Overview
     - Left-hand column of the slide:
       - 128 Integer Registers
       - 128 Floating Point Registers
       - 64 Predicate Registers
       - 8 Branch Registers
       - 128 Application Registers
       - 5 CPUID Registers
     - Right-hand column of the slide:
       - 16 Kernel Backup Registers
       - 8 Region Registers
       - 128 Control Registers
       - Instruction Pointer
       - NN Debug Breakpoint Registers
       - NN Perf. Mon. Data Registers

  9. IA64 Common Registers
     - Integer registers
       - 128 in total; width is 64 bits + 1 bit (NaT); r0 = 0
       - Integer, Logical and Multimedia data
     - Floating point registers
       - 128 in total; 82 bits wide
       - 17-bit exponent, 64-bit significand
       - f0 = 0.0; f1 = 1.0
       - Significand also used for two SIMD floats
     - Predicate registers
       - 64 in total; 1 bit each (fire/do not fire)
       - p0 = 1 (default value)
     - Branch registers
       - 8 in total; 64 bits wide (for address)

  10. Rotating Registers
     - Upper 75% rotate (when activated):
       - General registers (r32-r127)
       - Floating Point Registers (f32-f127)
       - Predicate Registers (p16-p63)
     - Formula:
       - Virtual Register = Physical Register - Register Rotation Base (RRB)
     - (Figure: register row ... f28 f29 f30 f31 | f32 f33 f34 f35 ... f124 f125 f126 f127)
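The formula on this slide can be sketched in Python for the floating-point file. This is a minimal model under two assumptions that the slide does not state: the arithmetic wraps around within the 96 rotating registers, and each rotation decrements RRB by one. The function names are hypothetical.

```python
FIRST_ROTATING = 32   # f32 is the first rotating FP register
NUM_ROTATING = 96     # f32-f127


def physical_fr(virtual, rrb):
    """Map a virtual FP register number to a physical one.

    Rearranges the slide's formula to physical = virtual + RRB,
    wrapping inside the rotating region (assumed behaviour).
    """
    if virtual < FIRST_ROTATING:
        return virtual            # f0-f31 never rotate
    offset = (virtual - FIRST_ROTATING + rrb) % NUM_ROTATING
    return FIRST_ROTATING + offset


def rotate(rrb):
    """One rotation step: RRB is decremented, modulo the region size (assumed)."""
    return (rrb - 1) % NUM_ROTATING
```

With this model, a value written to f32 before a rotation is seen as f33 afterwards, which is the renaming effect that software-pipelined loops rely on.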

  11. Register Convention
     - Run-time:
       - Branch Registers:
         - B0: Call register [rp]
         - B1-B5: Must be preserved
         - B6-B7: Scratch
       - General Registers:
         - R1: Global Data Pointer [gp]
         - R2-R3: Scratch
         - R4-R7: Must be preserved
         - R8-R11: Procedure Return Values [ret0, ret1, ret2, ..]
         - R12: Stack Pointer [sp]
         - R13: (Reserved as) Thread Pointer
         - R14-R31: Scratch
         - R32-Rxx: Argument Registers [in0, in1, in2, ..]

  12. Register Convention (2)
     - Run-time convention
       - Floating-Point:
         - F2-F5: Preserved
         - F6-F7: Scratch
         - F8-F15: Argument/Return Registers
         - F16-F31: Must be preserved
         - F32-F127: Scratch
       - Predicates:
         - P1-P5: Must be preserved
         - P6-P15: Scratch
         - P16-P63: Must be preserved
       - Additionally:
         - ar.lc: Must be preserved

  13. Register Stack Rules
     - The rotating integer registers serve as a stack
     - Each routine allocates via the “alloc” instruction:
       - Input + Local + Output
       - Up to R_rotate registers, with R_rotate <= R_input + R_local, may rotate (in a multiple of 8 registers)
     - (Figure: register stack frames across calls. Proc A: Local A, Output A. Proc B, called from A: Input B + Local B, Output B. Proc C: further calls. On return to Proc B and then Proc A: Local A, Output A again.)

  14. Instruction Types
     - M: Memory/Move Operations
     - I: Complex Integer/Multimedia Operations
     - A: Simple Integer/Logic/Multimedia Operations
     - F: Floating Point Operations (Normal/SIMD)
     - B: Branch Operations
     - L: Special instructions with 64-bit immediate

  15. Instruction Bundle
     - Bundle as “Packaging entity”:
       - 3 * 41-bit Instruction Slots
       - 5 bits for Template (of instruction types)
         - Typical examples: MFI or MIB
         - Including bit for Instruction Group Separation “S”
     - A bundle is 16B:
       - Basic unit for expressing parallelism
       - The unit that the Instruction Pointer points to
       - The unit you branch to
       - What is actually executed may be less than, equal to, or more than one bundle
     - (Figure: bundle layout, from high bits to low bits: Slot 2 | Slot 1 | Slot 0 | T)
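The bundle arithmetic (3 * 41 + 5 = 128 bits = 16 bytes) can be checked with a short Python sketch. It follows the figure's layout, with the template in the lowest 5 bits and slot 0 directly above it. The function names and the slot values in the usage note are hypothetical placeholders, not real instruction encodings.

```python
TEMPLATE_BITS = 5
SLOT_BITS = 41
SLOT_MASK = (1 << SLOT_BITS) - 1


def pack_bundle(template, slot0, slot1, slot2):
    """Pack a template and three 41-bit slots into one 128-bit integer."""
    if not 0 <= template < (1 << TEMPLATE_BITS):
        raise ValueError("template must fit in 5 bits")
    bundle = template
    for index, slot in enumerate((slot0, slot1, slot2)):
        if not 0 <= slot <= SLOT_MASK:
            raise ValueError("slot must fit in 41 bits")
        bundle |= slot << (TEMPLATE_BITS + index * SLOT_BITS)
    return bundle


def unpack_bundle(bundle):
    """Split a 128-bit bundle back into (template, (slot0, slot1, slot2))."""
    template = bundle & ((1 << TEMPLATE_BITS) - 1)
    slots = tuple(
        (bundle >> (TEMPLATE_BITS + i * SLOT_BITS)) & SLOT_MASK
        for i in range(3)
    )
    return template, slots
```

For example, `unpack_bundle(pack_bundle(0x10, 1, 2, 3))` gives back `(0x10, (1, 2, 3))`, and any packed bundle fits in `to_bytes(16, "little")`.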

  16. Instruction Group Separation (Stop bit)
     - Necessary to avoid “Dependency Violations”
       - For ALL registers: Integer, FP, Predicate, Branch, App., etc.
     - Two out of four possibilities (Forbidden):
       - Read-After-Write (RAW):
         - add r22=1,r21 ; add r23=1,r22 ;;
       - Write-After-Write (WAW):
         - add r22=1,r21 ; add r22=1,r23 ;;
       - Good assemblers will issue necessary warnings!
     - Two out of four (OK):
       - Read-After-Read (RAR):
         - add r22=1,r21 ; add r23=1,r21 ;;
       - Write-After-Read (WAR):
         - add r23=1,r22 ; add r22=1,r21 ;;
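The four cases on this slide can be expressed as a small Python classifier. It is a simplified sketch: each instruction is reduced to a hypothetical `(destination, sources)` pair, and architectural special cases (such as registers with fixed values) are deliberately ignored.

```python
def hazards(first, second):
    """Classify register dependencies between two instructions.

    Each instruction is a (destination, sources) pair, for example
    ("r22", ["r21"]) for `add r22=1,r21`.
    """
    dest1, srcs1 = first
    dest2, srcs2 = second
    found = set()
    if dest1 in srcs2:
        found.add("RAW")      # second reads what first writes
    if dest1 == dest2:
        found.add("WAW")      # both write the same register
    if dest2 in srcs1:
        found.add("WAR")      # second overwrites what first reads
    if set(srcs1) & set(srcs2):
        found.add("RAR")      # both read the same register
    return found


def needs_stop(first, second):
    """True if the pair may not share an instruction group (RAW or WAW)."""
    return bool(hazards(first, second) & {"RAW", "WAW"})
```

Applied to the slide's first pair, `needs_stop(("r22", ["r21"]), ("r23", ["r22"]))` is true, so a `;;` is required between the two adds.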

  17. Conventions
     - Instruction syntax
       - (qp) ops[.comp1] r1 = r2, r3
       - Execution is always right-to-left
         - Result(s) on left-hand side of equal-sign
       - Almost all instructions have a qualifying predicate
       - Many have further completers: unsigned, left, double, etc.
     - Numbering
       - Also right-to-left: bits 7 6 5 4 3 2 1 0 within a byte, 63 down to 0 within a register
     - Immediates
       - Various sizes exist
       - Imm8 (signed immediate: 7 bits plus sign)
       - At execution time, the sign bit is extended all the way to bit 63
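The sign-extension rule for immediates can be illustrated with a short Python helper. This is a generic sketch of two's-complement sign extension to 64 bits; the function name is hypothetical.

```python
MASK64 = (1 << 64) - 1


def sign_extend(value, bits):
    """Extend the sign bit of a `bits`-wide immediate up to bit 63.

    Returns the resulting 64-bit register pattern as an unsigned integer.
    """
    value &= (1 << bits) - 1          # keep only the immediate field
    sign = 1 << (bits - 1)            # position of the sign bit
    return ((value ^ sign) - sign) & MASK64
```

An imm8 of `0x7F` stays `0x7F`, while `0x80` (that is, -128) becomes `0xFFFFFFFFFFFFFF80` in the 64-bit register.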

  18. Part 2a: Standard Instruction Set

  19. The Total Instruction Set
     - Many Instruction Categories:
       - Logical operations (e.g. and)
       - Arithmetic operations (e.g. add)
       - Compare operations
       - Shift operations
       - Branches, including loop control
       - Memory and cache operations
       - Move operations
       - Multimedia operations (e.g. padd)
       - Floating Point operations (e.g. fma)
       - SIMD Floating Point operations (e.g. fpma)
     - See documentation for complete reference set

  20. Arithmetic Operations
     - Instruction format:
       - (qp) ops1 r1 = r2, r3 [,1]
       - (qp) ops2 r1 = immx, r3
       - (qp) ops3 r1 = r2, count2, r3
     - Valid Operations:
       - ops1: add, sub
       - ops2: sub, adds/addl (imm14, imm22)
       - ops3: shladd
     - NB: Integer multiply is an FLP operation
     - Callouts on the slide:
       - x86 Inc/Dec replaced with: (qp) ops r1 = r2, r0, 1
       - Z = Y - imm becomes: (qp) add r1 = -imm, r3
       - Loading an immediate value: (qp) add r1 = imm, r0
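The three idioms in the callouts can be modelled in a few lines of Python. This is a sketch of the arithmetic only, on 64-bit wrap-around integers; the function names are hypothetical, and the 1 to 4 range for the `shladd` count is my reading of the 2-bit `count2` field.

```python
MASK64 = (1 << 64) - 1
R0 = 0   # r0 always reads as zero


def add(r2, r3, plus_one=False):
    """add r1 = r2, r3 [,1] with 64-bit wrap-around."""
    return (r2 + r3 + (1 if plus_one else 0)) & MASK64


def shladd(r2, count, r3):
    """shladd r1 = r2, count, r3: shift r2 left by count, then add r3."""
    if not 1 <= count <= 4:
        raise ValueError("count assumed to be 1..4 (2-bit field)")
    return ((r2 << count) + r3) & MASK64


increment = add(41, R0, plus_one=True)   # x86 inc: add r1 = r2, r0, 1
subtract = add(-5 & MASK64, 12)          # Z = Y - 5: add r1 = -5, r3
load_imm = add(100, R0)                  # load immediate: add r1 = 100, r0
```

Because r0 is hard-wired to zero, a single `add` form covers increment, subtract-immediate, and load-immediate.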

  21. Compare Operations
     - Instruction format:
       - (qp) cmp.crel.ctype p1, p2 = r2, r3
       - (qp) cmp.crel.ctype p1, p2 = imm8, r3
       - (qp) cmp.crel.ctype p1, p2 = r0, r3 (parallel inequality form)
     - Valid Relationships:
       - eq, ne, lt, le, gt, ge, ltu, leu, gtu, geu
     - Types:
       - none, unc, and, or, or.andcm, orcm, andcm, and.orcm
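A compare writes two predicates at once. The Python sketch below models only the `none` and `unc` types and the `lt`/`ltu` relations, as I understand them from the architecture manuals; the parallel types (`and`, `or`, and so on) are left out, and all function names are hypothetical.

```python
MASK64 = (1 << 64) - 1


def to_signed(value):
    """Interpret a 64-bit pattern as a signed integer."""
    value &= MASK64
    return value - (1 << 64) if value >> 63 else value


def lt(a, b):
    return to_signed(a) < to_signed(b)      # signed relation


def ltu(a, b):
    return (a & MASK64) < (b & MASK64)      # unsigned relation


def cmp_write(qp, relation, p1, p2, ctype="none"):
    """Return the new (p1, p2) after a compare.

    qp true:  p1 = relation, p2 = not relation.
    qp false: targets unchanged for "none", both cleared for "unc".
    """
    if qp:
        return relation, not relation
    if ctype == "unc":
        return False, False
    return p1, p2
```

The same bit pattern can compare differently under `lt` and `ltu`: all ones is -1 when signed but the largest value when unsigned.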

  22. Load Operations
     - Standard instructions (the forms with a second operand are always post-modify):
       - (qp) ldsz.ldtype.ldhint r1 = [r3], r2
       - (qp) ldsz.ldtype.ldhint r1 = [r3], imm9
       - (qp) ldffsz.fldtype.ldhint f1 = [r3], r2
       - (qp) ldffsz.fldtype.ldhint f1 = [r3], imm9
     - Valid Sizes:
       - sz: 1/2/4/8 [bytes]. The sign bit is NOT extended for 1/2/4 bytes
       - fsz: s(ingle)/d(ouble)/e(xtended)/8 (as integer). The integer form is used in the case of integer multiply (for instance)
     - Types:
       - s/a/sa/c.nc/c.clr/c.clr.acq/acq/bias
       - Advanced options (not discussed here!)
     - Also “fill” variants
       - More complex usage (see Manuals)
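Two points from this slide, zero extension of short loads and post-modification of the address register, can be sketched in Python. This is a simplified model: memory is a plain `bytes` object, byte order is assumed to be little-endian, and the function name is hypothetical.

```python
MASK64 = (1 << 64) - 1


def load(memory, address, size, post_increment=0):
    """Model of an integer load with optional post-modify.

    Returns (value, new_address). Loads of 1, 2 or 4 bytes are
    zero-extended to 64 bits; the sign bit is not propagated.
    """
    if size not in (1, 2, 4, 8):
        raise ValueError("size must be 1, 2, 4 or 8 bytes")
    raw = memory[address:address + size]
    value = int.from_bytes(raw, "little", signed=False)
    new_address = (address + post_increment) & MASK64   # address updated after the access
    return value, new_address
```

Loading the byte `0xFF` yields 255, not -1, so signed data needs an explicit sign-extension step after the load.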

  23. Branch Operations
     - Several different types:
       - Conditional or Call branches
         - Relative offset (IP-relative) or Indirect (via branch registers)
         - Triggered by predication
       - Return branches
         - Indirect + Qualifying Predicate (QP)
       - Loop controlling branches:
         - Simple Counted Loops (br.cloop)
           - IP-relative with AR.LC
         - Software-pipelined Counted Loop (br.ctop)
           - IP-relative with AR.LC and AR.EC
         - Software-pipelined While Loops (br.wtop)
           - IP-relative with QP and AR.EC

  24. Simple Counted Loop
     - Works as ‘expected’
       - ar.lc counts down the loop (automatically)
       - No need to use a general register

               mov ar.lc=5 ;;              // NB: 6 iterations
         loop: { work }
               ...
               { much more work }
               br.cloop.sptk.few loop ;;

     - Software-pipelined loops are more advanced
       - Use Epilogue Count (as well as Loop Count)
       - ... and Rotating Registers
     - We will deal with such loops later
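The "NB: 6 iterations" remark on the simple counted loop follows from how `br.cloop` tests the counter: the branch is taken while ar.lc is non-zero, and the counter is decremented only when the branch is taken. A minimal Python model (the function name is hypothetical):

```python
def run_cloop(lc):
    """Count how often the loop body runs for an initial ar.lc value.

    Returns (iterations, final_lc).
    """
    iterations = 0
    while True:
        iterations += 1      # { work } ... { much more work }
        if lc != 0:          # br.cloop: taken while ar.lc != 0
            lc -= 1          # ar.lc is decremented on a taken branch
        else:
            break            # ar.lc == 0: fall through
    return iterations, lc
```

So `mov ar.lc=5` gives six passes through the body, and in general the body runs ar.lc + 1 times.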
