32 bit to 64 bit



  1. Porting & Optimising Code: 32-bit to 64-bit
     Matthew Gretton-Dann, Technical Lead - Toolchain Working Group
     Linaro Connect, Dublin, July 2013

  2. A Presentation of Four Parts
     • Register Files
     • Structure Layout & Data Models
     • Atomics
     • Vectorization & Neon Intrinsics
     www.linaro.org

  3. Simplification
     • View for those writing apps
     • No complicated kernel stuff
     • Little Endian

  4. Bias Warning Assembler Compiler

  5. Why 64-bit? Memory

  6. General Purpose Registers – 32-bit ARM
     r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11 r12 r13 (SP) r14 (LR) r15 (PC)

  7. General Purpose Registers
     r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11 r12 r13 (SP) r14 (LR) r15 (PC)

  8. General Purpose Registers
     r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11 r12 r13 (SP) r14 (LR) r15 (PC)
     r16 r17 r18 r19 r20 r21 r22 r23 r24 r25 r26 r27 r28 r29 r30

  9. General Purpose Registers – 64-bit ARM
     r0 r1 r2 r3 r4 r5 r6 r7 r8 r9 r10 r11 r12 r13 r14 r15
     r16 r17 r18 r19 r20 r21 r22 r23 r24 r25 r26 r27 r28 r29 r30 (LR)
     SP PC

  10. General Purpose Registers
      Each register rN spans bit 63 down to bit 0: xN names all 64 bits, wN names the low 32 bits.

  11. General Purpose Registers – Consequences
      • Easier to do 64-bit arithmetic!
      • Less need to spill to the stack
      • Spare registers to keep more temporaries

  12. Structure Layout – 32-bit
        struct foo {
          int32_t a;
          void* p;
          int32_t x;
        };

        Offset  Contents
         0:     a
         4:     p
         8:     x
        12:

  13. Structure Layout – 64-bit
        struct foo {
          int32_t a;
          void* p;
          int32_t x;
        };

        Offset  Contents
         0:     a
         4:     <hole>
         8:     p
        12:     p
        16:     x
        20:

  14. Structure Layout – 64-bit
        struct foo {
          void* p;
          int32_t a;
          int32_t x;
        };

        Offset  Contents
         0:     p
         4:     p
         8:     a
        12:     x
        16:
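
The two 64-bit layouts above can be checked on any given toolchain with `offsetof` and `sizeof`. The following is a minimal sketch, not code from the talk: the names `foo_padded`, `foo_reordered` and `print_layouts` are hypothetical, and the printed numbers depend on the ABI in use (the reported size may also include padding at the end of the structure).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Declaration order from the slides: the pointer sits between two ints. */
struct foo_padded {
    int32_t a;
    void   *p;
    int32_t x;
};

/* Same members with the pointer first, so no hole is needed before it. */
struct foo_reordered {
    void   *p;
    int32_t a;
    int32_t x;
};

/* Moving the pointer to the front should never make the structure larger. */
static_assert(sizeof(struct foo_reordered) <= sizeof(struct foo_padded),
              "reordered layout should not grow");

/* Print the offsets and sizes chosen by the current ABI. */
void print_layouts(void)
{
    printf("padded:    a=%zu p=%zu x=%zu size=%zu\n",
           offsetof(struct foo_padded, a),
           offsetof(struct foo_padded, p),
           offsetof(struct foo_padded, x),
           sizeof(struct foo_padded));
    printf("reordered: p=%zu a=%zu x=%zu size=%zu\n",
           offsetof(struct foo_reordered, p),
           offsetof(struct foo_reordered, a),
           offsetof(struct foo_reordered, x),
           sizeof(struct foo_reordered));
}
```

Calling `print_layouts()` from `main` on a 32-bit and a 64-bit build is a quick way to see whether a structure picked up a hole when ported.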

  15. Brief Aside
      • API: Application Programming Interface
          o Defines the interfaces a programmer may use
          o High level
      • ABI: Application Binary Interface
          o Defines how to call functions, lay out memory &c.
          o Low level

  16. Data Models
        Type (size in bits)  ILP32
        int                  32
        long                 32
        long long            64
        pointer              32

  17. Data Models
        Type (size in bits)  ILP32  LP64
        int                  32     32
        long                 32     64
        long long            64     64
        pointer              32     64

  18. Data Models
        Type (size in bits)  ILP32  LP64  LLP64
        int                  32     32    32
        long                 32     64    32
        long long            64     64    64
        pointer              32     64    64

  19. Data Models
        struct foo {
          int a;
          long l;
          int x;
        };

        Offset  ILP32  LP64    LLP64
         0:     a      a       a
         4:     l      <hole>  l
         8:     x      l       x
        12:            l
        16:            x
        20:
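
One way to see which of these data models a build uses is to compare `sizeof` results. This is a minimal sketch under stated assumptions: the helper names `data_model_name` and `roundtrip_pointer` and the type `foo_fixed` are hypothetical, and the classification assumes 8-bit bytes. It also shows two common porting habits: fixed-width types where a member's width matters, and `uintptr_t` rather than `long` for holding a pointer as an integer (`long` is narrower than a pointer under LLP64).

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* uintptr_t is meant to hold any object pointer; check it is wide enough. */
static_assert(sizeof(uintptr_t) >= sizeof(void *),
              "uintptr_t must be able to hold a pointer");

/* Classify the current build by the widths of int, long and pointers. */
const char *data_model_name(void)
{
    if (sizeof(int) == 4 && sizeof(long) == 4 && sizeof(void *) == 4)
        return "ILP32";
    if (sizeof(int) == 4 && sizeof(long) == 8 && sizeof(void *) == 8)
        return "LP64";
    if (sizeof(int) == 4 && sizeof(long) == 4 &&
        sizeof(long long) == 8 && sizeof(void *) == 8)
        return "LLP64";
    return "other";
}

/* With fixed-width types, l is 64 bits wide under every data model. */
struct foo_fixed {
    int32_t a;
    int64_t l;
    int32_t x;
};

/* Store a pointer in an integer and recover it, without relying on long. */
void *roundtrip_pointer(void *p)
{
    uintptr_t bits = (uintptr_t)p;
    return (void *)bits;
}
```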

  20. That’s It...

  21. One more thing...

  22. One more thing...
      • Remove conditionalisation

  23. Two more things...
      • Remove conditionalisation
      • Add some new load/store semantics

  24. Three more things...
      • Remove conditionalisation
      • Add some new load/store semantics
      • Change the register layout for the floating-point/SIMD registers
          32-bit: q0 overlays d1:d0, which overlay s3:s2:s1:s0
                  q1 overlays d3:d2, which overlay s7:s6:s5:s4

  25. Three more things...
      • Remove conditionalisation
      • Add some new load/store semantics
      • Change the register layout for the floating-point/SIMD registers
          64-bit: s0 and d0 are the low 32 and low 64 bits of v0
                  s1 and d1 are the low 32 and low 64 bits of v1

  26. Four more things...
      • Remove conditionalisation
      • Add some new load/store semantics
      • Change the register layout for the floating-point/SIMD registers
      • Add some more SIMD instructions

  27. Many more things...
      • Remove conditionalisation
      • Add some new load/store semantics
      • Change the register layout for the floating-point/SIMD registers
      • Add some more SIMD instructions
      • ...

  28. Atomics
        #if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
        int AtomicAdd(volatile int* ptr, int increment)
        {
          int temp = increment;
          /* lock xadd leaves the old *ptr in temp and adds temp to *ptr */
          __asm__ __volatile__(
            "lock; xaddl %0,%1"
            : "+r" (temp), "+m" (*ptr)
            : : "memory");
          return temp + increment;
        }
        #else
        /* Every other target: a plain read-modify-write, which is not atomic */
        int AtomicAdd (volatile int* ptr, int increment)
        {
          *ptr += increment;
          return *ptr;
        }
        #endif

  29. Atomics
        type __atomic_add_fetch (type *ptr, type val, int memmodel)
      These built-in functions perform the operation suggested by the name, and return the result of the operation. That is,
        { *ptr op= val; return *ptr; }
      All memory models are valid.

  30. Atomics
        int AtomicAdd(volatile int* ptr, int increment)
        {
          return __atomic_add_fetch (ptr, increment, memmodel);
        }

  31. Atomics
      • There are basically three types of memory model defined by C++11 which GCC’s support is based upon:
          o Sequentially Consistent
          o Acquire/Release
          o Relaxed

  32. Atomics – Sequentially Consistent
        Thread 1:
          a = 1;
          x.store(20);
        Thread 2:
          if (x.load() == 20)
            assert (a == 1);

  33. Atomics – Relaxed
        Thread 1:
          a = 1;
          x.store(20, memory_order_relaxed);
        Thread 2:
          if (x.load(memory_order_relaxed) == 20)
            assert (a == 1);

  34. Atomics – Acquire/Release
        Thread 1:
          x.store (10, memory_order_release);
        Thread 2:
          y.store (20, memory_order_release);
        Thread 3:
          assert (y.load (memory_order_acquire) == 20 &&
                  x.load (memory_order_acquire) == 0)
        Thread 4:
          assert (y.load (memory_order_acquire) == 0 &&
                  x.load (memory_order_acquire) == 10)

  35. Atomics – Sequentially Consistent
        Thread 1:
          x.store (10);
        Thread 2:
          y.store (20);
        Thread 3:
          assert (y.load () == 20 && x.load () == 0)
        Thread 4:
          assert (y.load () == 0 && x.load () == 10)


  37. Atomics – Acquire/Release
        Thread 1:
          a = 1;
          x.store(20, memory_order_release);
        Thread 2:
          if (x.load(memory_order_acquire) == 20)
            assert (a == 1);
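
The acquire/release pairing on this slide can also be written in C. The following is a minimal sketch using C11 `<stdatomic.h>` in place of the C++11 `std::atomic` interface shown in the slides; the function names `writer` and `reader` are hypothetical. The release store to `x` is what makes the earlier plain write to `a` visible to a thread whose acquire load observes the value 20.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

static int a;          /* ordinary, non-atomic data */
static atomic_int x;   /* flag; static storage, so it starts at zero */

/* Thread 1 on the slide: write the data, then publish with a release store. */
void writer(void)
{
    a = 1;
    atomic_store_explicit(&x, 20, memory_order_release);
}

/* Thread 2 on the slide: if the acquire load sees the flag, a must be 1.
 * Returns true when the flag was observed. */
bool reader(void)
{
    if (atomic_load_explicit(&x, memory_order_acquire) == 20) {
        assert(a == 1);
        return true;
    }
    return false;
}
```

In a real program `writer` and `reader` would run on different threads; called in sequence on one thread they simply show the flag being set and then observed.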

  38. Atomics
        int AtomicAdd(volatile int* ptr, int increment)
        {
          return __atomic_add_fetch (ptr, increment, __ATOMIC_SEQ_CST);
        }
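
Where relying on a GCC built-in is undesirable, a similar function can be sketched with the standard C11 `<stdatomic.h>` interface. This is an alternative to the slide's code, not the author's version; the names `atomic_add_c11` and `atomic_add_demo` are hypothetical. Note that `atomic_fetch_add_explicit` returns the value held before the addition, so the increment is added again to match the "add, then fetch" result of `__atomic_add_fetch`.

```c
#include <assert.h>
#include <stdatomic.h>

/* Atomically add increment to *ptr and return the new value. */
int atomic_add_c11(atomic_int *ptr, int increment)
{
    /* fetch_add yields the old value; add increment to get the new one. */
    return atomic_fetch_add_explicit(ptr, increment,
                                     memory_order_seq_cst) + increment;
}

/* Small single-threaded usage check. */
void atomic_add_demo(void)
{
    atomic_int counter;
    atomic_init(&counter, 5);
    assert(atomic_add_c11(&counter, 3) == 8);
    assert(atomic_load(&counter) == 8);
}
```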

  39. And Now For Something Completely Different...
        add:
          vld1.32 {q9}, [r1]!
          vld1.32 {q8}, [r2]!
          vadd.i32 q8, q9, q8
          subs r3, r3, #4
          vst1.32 {q8}, [r0]!
          bne add
          bx lr
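
The hand-written loop on this slide loads four 32-bit integers from each input, adds them, and stores four results per iteration. A rough equivalent can be sketched with NEON intrinsics from `<arm_neon.h>`, which compile for both 32-bit and 64-bit ARM. This is an illustration rather than code from the talk: the names `add_i32` and `add_i32_selfcheck` are hypothetical, and the scalar loop is added so that lengths that are not a multiple of four, and builds without NEON, still work.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#if defined(__ARM_NEON)
#include <arm_neon.h>
#endif

/* a[i] = b[i] + c[i] for i in [0, n), four lanes at a time where possible. */
void add_i32(int32_t *a, const int32_t *b, const int32_t *c, size_t n)
{
    size_t i = 0;
#if defined(__ARM_NEON)
    for (; i + 4 <= n; i += 4) {
        int32x4_t vb = vld1q_s32(b + i);      /* load 4 x int32 from b */
        int32x4_t vc = vld1q_s32(c + i);      /* load 4 x int32 from c */
        vst1q_s32(a + i, vaddq_s32(vb, vc));  /* add lanes, store to a */
    }
#endif
    /* Scalar tail, and the whole loop when NEON is unavailable. */
    for (; i < n; ++i)
        a[i] = b[i] + c[i];
}

/* Small usage check with a length that exercises the scalar tail. */
void add_i32_selfcheck(void)
{
    int32_t b[6] = {1, 2, 3, 4, 5, 6};
    int32_t c[6] = {10, 20, 30, 40, 50, 60};
    int32_t a[6] = {0};
    add_i32(a, b, c, 6);
    for (size_t i = 0; i < 6; ++i)
        assert(a[i] == b[i] + c[i]);
}
```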

  40. Autovectorisation
        void add(int *a, const int *b, const int *c, unsigned n)
        {
          unsigned i;
          for (i = 0; i < n; ++i)
            a[i] = b[i] + c[i];
        }

  41. Autovectorisation
                .cpu generic
                .file "t.c"
                .text
                .align 2
                .global add
                .type add, %function
        add:
                cbz w3, .L1
                add x4, x0, 16
                cmp x1, x4
                add x5, x1, 16
                cset w8, cs
                cmp x0, x5
                cset w7, cs
                add x5, x2, 16
                cmp x2, x4
                cset w6, cs
                cmp x0, x5
                cset w4, cs
                orr w5, w8, w7
                orr w4, w6, w4
                tst w5, w4
                beq .L3
                cmp w3, 5
                bls .L3
                lsr w7, w3, 2
                mov x4, 0
                lsl w6, w7, 2
                mov w5, w4
        .L9:
                add x8, x2, x4
                add x9, x1, x4
                ld1 {v0.4s}, [x8]
                ld1 {v1.4s}, [x9]
                add x8, x0, x4
                add v0.4s, v1.4s, v0.4s
                add w5, w5, 1
                st1 {v0.4s}, [x8]
                cmp w5, w7
                add x4, x4, 16
                bcc .L9
                cmp w3, w6
                beq .L1
                uxtw x5, w6
                lsl x5, x5, 2
                ldr w8, [x1,x5]
                ldr w7, [x2,x5]
                add w4, w6, 1
                add w7, w8, w7
                str w7, [x0,x5]
                cmp w3, w4
                bls .L1
                ubfiz x4, x4, 2, 32
                ldr w7, [x1,x4]
                ldr w5, [x2,x4]
                add w6, w6, 2
                add w5, w7, w5
                str w5, [x0,x4]
                cmp w3, w6
                bls .L1
                uxtw x6, w6
                lsl x6, x6, 2
                ldr w3, [x1,x6]
                ldr w1, [x2,x6]
                add w1, w3, w1
                str w1, [x0,x6]
        .L1:
                ret
        .L3:
                sub w6, w3, #1
                add x6, x6, 1
                lsl x6, x6, 2
                mov x3, 0
        .L11:
                ldr w5, [x1,x3]
                ldr w4, [x2,x3]
                add w4, w5, w4
                str w4, [x0,x3]
                add x3, x3, 4
                cmp x3, x6
                bne .L11
                ret
                .size add, .-add
                .ident "GCC: (GNU) 4.9.0 20130416 (experimental)"
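
The generated code for `add` appears to start with a run of address comparisons before it reaches the `ld1`/`st1` loop, and it keeps a scalar loop at `.L3` as a fallback. A likely reading is that the compiler cannot prove the three arrays do not overlap. As a sketch, assuming the caller can guarantee that they never overlap, the C99 `restrict` qualifier states this in the source and may let the compiler omit those checks; whether it does depends on the compiler and options. The names `add_restrict` and `add_restrict_demo` are hypothetical.

```c
#include <assert.h>

/* Same loop as add(), but the caller promises that a, b and c do not
 * overlap. Passing overlapping arrays here is undefined behaviour. */
void add_restrict(int *restrict a, const int *restrict b,
                  const int *restrict c, unsigned n)
{
    for (unsigned i = 0; i < n; ++i)
        a[i] = b[i] + c[i];
}

/* Small usage check with three distinct arrays. */
void add_restrict_demo(void)
{
    int b[5] = {1, 2, 3, 4, 5};
    int c[5] = {5, 4, 3, 2, 1};
    int a[5] = {0};
    add_restrict(a, b, c, 5);
    for (unsigned i = 0; i < 5; ++i)
        assert(a[i] == 6);
}
```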
