std::map<Code,Performance> myMCU{?}


  1. std::map<Code,Performance> myMCU{?} @DanielPenning The mapping between Code & Performance www.embeff.com

  2. World Map (1459)

  3. World Map (1525)

  4. People admitted they don’t know.

  5. The Beginning of Modern Science
     1. Admit ignorance.
     2. Observations:
        - Measure and gather data.
        - Connect data into comprehensive theories.

  6. Embedded & Ignorance
     [Diagram: Code -> ? -> Performance; influencing factors: Compiler, Compiler Settings, Target Architecture, Target Cache, Target Speed]
     Possibly a highly complex and interdependent mapping!

  7. Consequences
     - Prejudices prevail
     - Mistrust of libraries
     - Low code quality
     - Performance suffers

  8. Let’s admit our ignorance.

  9. Observations in Embedded
     Profiling: Top Down Process.
        - Great to identify bottlenecks.
        - Bad to create specific understanding.
     Build knowledge bottom up:
        - Start with small code blocks.
        - Observe performance.
        - Create heuristics.

  10. Code Performance for armv7m
      - Architecture widely used (Cortex-M3/M4)
      - Provides Data Watchpoint and Trace Unit (DWT)

      CMSIS Register   Description
      DWT_CYCCNT       Cycle Count Register
      DWT_CPICNT       CPI Count Register
      DWT_EXCCNT       Exception Overhead Count Register
      DWT_SLEEPCNT     Sleep Count Register
      DWT_LSUCNT       LSU Count Register
      DWT_FOLDCNT      Folded-instruction Count Register
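
The registers listed on this slide can be pictured as one contiguous block of 32-bit words. The following is a minimal host-side sketch, not code from the talk: the struct `DwtRegisters` and the function `enable_cycle_counter` are hypothetical names, and the offsets and enable bits (TRCENA in DEMCR, CYCCNTENA in DWT_CTRL) are taken from the ARMv7-M architecture as assumptions to check against the reference manual for a concrete device. On real hardware the block sits at 0xE0001000 and CMSIS exposes it as `DWT`.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the start of the DWT register block.
// Offsets are relative to the DWT base address (0xE0001000 on ARMv7-M).
struct DwtRegisters {
    volatile std::uint32_t ctrl;      // 0x00 DWT_CTRL (control, not on the slide)
    volatile std::uint32_t cyccnt;    // 0x04 DWT_CYCCNT
    volatile std::uint32_t cpicnt;    // 0x08 DWT_CPICNT
    volatile std::uint32_t exccnt;    // 0x0C DWT_EXCCNT
    volatile std::uint32_t sleepcnt;  // 0x10 DWT_SLEEPCNT
    volatile std::uint32_t lsucnt;    // 0x14 DWT_LSUCNT
    volatile std::uint32_t foldcnt;   // 0x18 DWT_FOLDCNT
};

static_assert(offsetof(DwtRegisters, cyccnt) == 0x04, "unexpected CYCCNT offset");
static_assert(offsetof(DwtRegisters, foldcnt) == 0x18, "unexpected FOLDCNT offset");

constexpr std::uint32_t kDemcrTrcena = 1u << 24;      // DEMCR.TRCENA
constexpr std::uint32_t kDwtCtrlCyccntena = 1u << 0;  // DWT_CTRL.CYCCNTENA

// Enables the trace block, resets the cycle counter and starts it.
// Both register references are passed in so the sketch also runs on a host.
inline void enable_cycle_counter(DwtRegisters& dwt, volatile std::uint32_t& demcr) {
    demcr = demcr | kDemcrTrcena;
    dwt.cyccnt = 0;
    dwt.ctrl = dwt.ctrl | kDwtCtrlCyccntena;
}
```

Passing the registers as references rather than hard-coding addresses keeps the sketch testable off-target; on a device one would bind them to the memory-mapped addresses instead.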

  11. Measure Cycles
      [Diagram: PC running openocd, connected via JTAG to the STM32F4 and its DWT]

          BKPT                         //< Read CYCCNT
          CodeUnderTest(<Parameter>)
          BKPT                         //< Read CYCCNT
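
A small sketch of the bookkeeping around this measurement, under stated assumptions: `elapsed_cycles`, `debugger_breakpoint` and `run_between_breakpoints` are hypothetical names, not part of the talk's tooling. The debugger is assumed to read CYCCNT at each breakpoint; the difference of the two 32-bit readings is taken with unsigned arithmetic so that a single counter wraparound still gives the right result. The breakpoint instruction is only emitted when compiling for a Thumb target, so the code also builds on a host.

```cpp
#include <cassert>
#include <cstdint>

// Cycles between two CYCCNT snapshots. Unsigned 32-bit subtraction is
// modular, so one wraparound of the counter is handled correctly.
constexpr std::uint32_t elapsed_cycles(std::uint32_t before, std::uint32_t after) {
    return static_cast<std::uint32_t>(after - before);
}

// Halts the core for the attached debugger on a Thumb target; no-op elsewhere.
inline void debugger_breakpoint() {
#if defined(__thumb__)
    __asm volatile("bkpt #0");
#endif
}

// Brackets the code under test with two breakpoints, mirroring the slide.
template <typename Fn>
void run_between_breakpoints(Fn&& code_under_test) {
    debugger_breakpoint();  // debugger reads CYCCNT here
    code_under_test();
    debugger_breakpoint();  // debugger reads CYCCNT again
}
```

For example, readings of 0xFFFFFFF0 before and 0x00000010 after correspond to 0x20 elapsed cycles. The cycles spent entering and leaving the breakpoints are not modelled here.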

  12. Let’s make observations.

  13. Example 1: Basic Optimization

          int square(int x) {
            return x*x;
          }

      Minimal optimization (-Og):

          square(int):
            mul r0, r0, r0
            bx lr

      No optimization (-O0):

          square(int):
            push {r7}
            sub sp, sp, #12
            add r7, sp, #0
            str r0, [r7, #4]
            ldr r3, [r7, #4]
            ldr r2, [r7, #4]
            mul r3, r2, r3
            mov r0, r3
            adds r7, r7, #12
            mov sp, r7
            ldr r7, [sp], #4
            bx lr

      [Bar chart: Cycles (axis 0 to 30) for Minimal (-Og) vs. No (-O0)]

  14. Heuristic #1: The difference between minimal and no optimization is huge.

  15. Example 2: Pipeline

          int DependentOps(int x) {
            int tmp = x/3;
            int tmp2 = x/7;
            return tmp+tmp2;
          }

      -O1:

          DependentOps_O1(int):
            ldr r3, .L2
            smull r2, r3, r3, r0
            asrs r1, r0, #31
            subs r3, r3, r1
            ldr r2, .L2+4
            smull ip, r2, r2, r0
            add r0, r0, r2
            rsb r0, r1, r0, asr #2
            add r0, r0, r3
            bx lr
          .L2:
            .word 1431655766
            .word -1840700269

      -O2:

          DependentOps_O2(int):
            ldr r3, .L3
            ldr r1, .L3+4
            smull r2, r3, r3, r0
            add r3, r3, r0
            asrs r2, r0, #31
            smull r1, r0, r1, r0
            rsb r3, r2, r3, asr #2
            subs r0, r0, r2
            add r0, r0, r3
            bx lr
          .L3:
            .word -1840700269
            .word 1431655766

      [Bar chart: Cycles (axis 10 to 17) for -O1 vs. -O2]
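
Both listings on this slide replace the divisions by 3 and 7 with a multiplication by a constant (the two `.word` values) followed by shifts and a sign correction. The sketch below restates that arithmetic in C++ so it can be checked on a host. The helper names `mulhi`, `div3` and `div7` are made up for illustration, and the code assumes arithmetic right shift of negative values (guaranteed since C++20, and the behaviour of common compilers before that).

```cpp
#include <cassert>
#include <cstdint>

// Upper 32 bits of the signed 64-bit product, i.e. the high result of smull.
inline std::int32_t mulhi(std::int32_t a, std::int32_t b) {
    return static_cast<std::int32_t>((static_cast<std::int64_t>(a) * b) >> 32);
}

// x / 3: high word of 1431655766 * x, minus the sign word (x >> 31 is 0 or -1).
inline std::int32_t div3(std::int32_t x) {
    return mulhi(1431655766, x) - (x >> 31);
}

// x / 7: high word of -1840700269 * x, plus x, shifted right by 2,
// minus the sign word. This mirrors the add / rsb ... asr #2 pair.
inline std::int32_t div7(std::int32_t x) {
    return ((mulhi(-1840700269, x) + x) >> 2) - (x >> 31);
}

// Same result as DependentOps on the slide, without a divide instruction.
inline std::int32_t DependentOpsByMultiply(std::int32_t x) {
    return div3(x) + div7(x);
}
```

Comparing `DependentOpsByMultiply(x)` with `x/3 + x/7` over a range of inputs is a quick way to convince oneself that the compiler's transformation preserves the truncating division.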

  16. Heuristic #2: In low-level assembly, the compiler is probably smarter than you.

  17. Example 3: FPU vs Soft-FPU

          int MultiplyWithPi(int input) {
            return input * 3.14159265359f;
          }

      FPU:

          MultiplyWithPi_FPU(int):
            vmov s15, r0 @ int
            vldr.32 s14, .L3
            vcvt.f32.s32 s15, s15
            vmul.f32 s15, s15, s14
            vcvt.s32.f32 s15, s15
            vmov r0, s15 @ int
            bx lr
          .L3:
            .word 1078530011

      Soft-FPU:

          MultiplyWithPi_SoftFPU(int):
            push {r3, lr}
            bl __aeabi_i2f
            ldr r1, .L4
            bl __aeabi_fmul
            bl __aeabi_f2iz
            pop {r3, pc}
          .L4:
            .word 1078530011
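
Both variants load the same literal, `.word 1078530011`. A short host-side sketch shows where that number comes from: it is the IEEE 754 single-precision bit pattern of the constant in the source. The helper `float_bits` is a hypothetical name, and the sketch assumes a 32-bit IEEE 754 `float`, which holds on the platforms discussed here. The explicit cast in `MultiplyWithPi` only spells out the int -> float -> int conversion chain that the slide's code performs implicitly.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// The function from the slide, with the float-to-int conversion made explicit.
inline int MultiplyWithPi(int input) {
    return static_cast<int>(input * 3.14159265359f);
}

// Returns the raw bit pattern of a float as an unsigned 32-bit integer.
inline std::uint32_t float_bits(float value) {
    static_assert(sizeof(float) == sizeof(std::uint32_t), "expects 32-bit float");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);
    return bits;
}
```

`float_bits(3.14159265359f)` evaluates to 1078530011 (0x40490FDB), matching the literal in both listings. The soft-float version passes that same word to `__aeabi_fmul` in `r1`, while the FPU version loads it into `s14`.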

  18. Example 3: FPU vs Soft-FPU

          int MultiplyWithPi(int input) {
            return input * 3.14159265359f;
          }

      [Chart: Cycles (axis 10 to 110) over Input Value (-9 to 9), series FPU and Soft-FPU]

  19. Heuristic #3: Software-FPU is ~6x slower and not deterministic.

  20. Example 4: CRC Computation
      Cyclic Redundancy Check
        - Direct Computation
        - Lookup-Table
        - Hardware-Support
      Online Benchmarking
        - Execute on real hardware.
        - Technical Preview Stage.
        - https://barebench.com
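
To make the first two variants concrete, here is a minimal sketch of a direct (bit-by-bit) and a lookup-table CRC. The slide does not say which CRC was benchmarked, so the choice of the common reflected CRC-32 (polynomial 0xEDB88320, initial value and final XOR 0xFFFFFFFF) is an assumption, as are the function names. The hardware-supported variant depends on the MCU's CRC peripheral and is not shown.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>

constexpr std::uint32_t kCrc32Poly = 0xEDB88320u;  // reflected CRC-32 polynomial

// One table-free update step: folds 8 bits into the running CRC.
inline std::uint32_t crc32_step(std::uint32_t crc) {
    for (int bit = 0; bit < 8; ++bit) {
        crc = (crc >> 1) ^ ((crc & 1u) ? kCrc32Poly : 0u);
    }
    return crc;
}

// Direct computation: 8 shift/XOR iterations per input byte, no memory table.
inline std::uint32_t crc32_direct(const std::uint8_t* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc = crc32_step(crc ^ data[i]);
    }
    return ~crc;
}

// Precomputes the CRC contribution of every possible byte value (1 KiB).
inline std::array<std::uint32_t, 256> make_crc32_table() {
    std::array<std::uint32_t, 256> table{};
    for (std::uint32_t n = 0; n < 256; ++n) {
        table[n] = crc32_step(n);
    }
    return table;
}

// Lookup-table computation: one table read and one XOR per input byte.
inline std::uint32_t crc32_table(const std::uint8_t* data, std::size_t len) {
    static const std::array<std::uint32_t, 256> table = make_crc32_table();
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc = (crc >> 8) ^ table[(crc ^ data[i]) & 0xFFu];
    }
    return ~crc;
}
```

Both functions return 0xCBF43926 for the ASCII string "123456789", the standard check value for this CRC. The table variant trades 1 KiB of memory for fewer operations per byte, and how that trade plays out on a given MCU is something to measure rather than assume.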

  21. barebench.com - Demo -

  22. Heuristic #4: Performance may be dependent on clock speed.

  23. Heuristic #5: Caching is essential for high clock speeds.

  24. Conclusion
      - Admit lack of knowledge.
      - Measure performance.
      - Use measurements to form heuristics.
      - Share heuristics.
      - Use heuristics instead of prejudices.
      Let’s make embedded systems better!
