

  1. How Computers Work Jakob Stoklund Olesen Apple

  2. How Computers Work • Out of order CPU pipeline • Optimizing for out of order CPUs • Machine trace metrics analysis • Future work

  3. Out of Order CPU Pipeline
     Fetch (with Branch Predictor) → Decode → Rename → Reorder Buffer →
     Scheduler → functional units (Load, ALU, ALU, Br) → Retire

  4. Dot Product
     int dot(int a[], int b[], int n) {
       int sum = 0;
       for (int i = 0; i < n; i++)
         sum += a[i]*b[i];
       return sum;
     }

  5. Dot Product
     loop:
       ldr r3 ← [r0, r6, lsl #2]
       ldr r4 ← [r1, r6, lsl #2]
       mul r3 ← r3, r4
       add r5 ← r3, r5
       add r6 ← r6, #1
       cmp r6, r2
       bne loop

  6. Rename / Reorder Buffer / Retire
     The dot-product loop
       ldr r3 ← [r0, r6, lsl #2]
       ldr r4 ← [r1, r6, lsl #2]
       mul r3 ← r3, r4
       add r5 ← r3, r5
       add r6 ← r6, #1
       cmp r6, r2
       bne loop
     is renamed onto physical registers, and the reorder buffer holds three
     speculated iterations between Rename and Retire:
       p100 ← ldr [p10, p94, lsl #2]
       p101 ← ldr [p11, p94, lsl #2]
       p102 ← mul p100, p101
       p103 ← add p102, p95
       p104 ← add p94, #1
       p105 ← cmp p104, p12
       bne p105, taken
       p106 ← ldr [p10, p104, lsl #2]
       p107 ← ldr [p11, p104, lsl #2]
       p108 ← mul p107, p106
       p109 ← add p108, p103
       p110 ← add p104, #1
       p111 ← cmp p110, p12
       bne p111, taken
       p112 ← ldr [p10, p110, lsl #2]
       p113 ← ldr [p11, p110, lsl #2]
       p114 ← mul p112, p113
       p115 ← add p114, p109
       p116 ← add p110, #1
       p117 ← cmp p116, p12
       bne p117, taken

  7. [Issue diagram: the renamed µops from slide 6 are scheduled onto the Load, ALU, ALU, and Branch units over cycles 1-10. The first iteration's loads p100 and p101 issue in cycles 1 and 2; the mul p102 waits for the loaded values and issues in cycle 6; the accumulating add p103 follows in cycle 9.]

  8. [Issue diagram, continued: the loop-overhead µops proceed in parallel on the second ALU and the branch unit: the index add p104 in cycle 1, the compare p105 in cycle 2, and the bne in cycle 3, without delaying the loads.]

  9. [Issue diagram, continued: the second iteration starts before the first finishes. Its loads p106 and p107 issue in cycles 3 and 4, its overhead µops p110 and p111 in cycles 2 and 3, its mul p108 in cycle 8, and its accumulate p109 in cycle 10.]

  10. [Issue diagram, continued: with the third iteration (p112-p117) and further speculated iterations in flight, several µops issue every cycle; the Load, ALU, ALU, and Branch units stay busy and many independent iterations execute in parallel.]

  11. Throughput • Map µops to functional units • One µop per cycle per functional unit • Multiple ALU functional units • ADD throughput is 1/3 cycle/instruction
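     A minimal C sketch of the throughput point, not taken from the slides and with illustrative names: the three adds below have no dependences on one another, so a core with three ALU units can issue all of them in the same cycle, which is where a 1/3 cycle-per-instruction ADD throughput comes from.

       /* x, y, and z are independent, so they can occupy three ALU slots
        * in the same cycle; only the final reduction is serialized. */
       int sum6(int a, int b, int c, int d, int e, int f) {
           int x = a + b;
           int y = c + d;
           int z = e + f;
           return (x + y) + z;
       }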

  12. Multiply-Accumulate
     loop:
       ldr r3 ← [r0, r6, lsl #2]
       ldr r4 ← [r1, r6, lsl #2]
       mla r5 ← r3, r4, r5
       add r6 ← r6, #1
       cmp r6, r2
       bne loop

  13. [Issue diagram for the mla loop: the loads stream through the Load unit in cycles 1-4 and the loop-overhead adds, compares, and branches fill the other units, but each mla needs the previous iteration's mla result. With a 4-cycle loop-carried dependence, an mla issues only every 4 cycles (cycles 6 and 10 here), so this loop is 2x slower than the mul + add version despite having fewer instructions.]

  14. Pointer Chasing
     int len(node *p) {
       int n = 0;
       while (p)
         p = p->next, n++;
       return n;
     }

  15. Pointer Chasing
     loop:
       ldr r1 ← [r1]
       add r0 ← r0, #1
       cmp r1, #0
       bxeq lr
       b loop

  16. Rename
     The pointer-chasing loop
       loop:
         ldr r1 ← [r1]
         add r0 ← r0, #1
         cmp r1, #0
         bxeq lr
         b loop
     renamed onto physical registers, with three iterations in flight:
       p100 ← ldr [p97]
       p101 ← add p98, #1
       p102 ← cmp p100, #0
       bxeq p102, not taken
       p103 ← ldr [p100]
       p104 ← add p101, #1
       p105 ← cmp p104, #0
       bxeq p105, not taken
       p106 ← ldr [p103]
       p107 ← add p104, #1
       p108 ← cmp p107, #0
       bxeq p108, not taken

  17. [Issue diagram for the renamed pointer-chasing µops: the load p100 and the count add p101 issue in cycle 1, but the compare p102 must wait about 4 cycles for the loaded value and issues in cycle 5, followed by the branch in cycle 6.]

  18. [Issue diagram, continued: only the count add p104 (cycle 2) is independent. The next load p103 needs the pointer produced by p100, so it cannot issue until cycle 5; its compare p105 issues in cycle 9 and its branch in cycle 10. Each iteration is serialized behind the previous load.]

  19. [Issue diagram, continued: the third load p106 issues in cycle 9, and later iterations' adds trickle through the ALUs, but because every load depends on the previous one, only one iteration completes per load latency and most issue slots stay idle.]

  20. Latency • Each µop must wait for operands to be computed • Pipelined units can use multiple cycles per instruction • Load latency is 4 cycles from L1 cache • Long dependency chains cause idle cycles
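     A small C sketch of a latency-bound chain, with illustrative names rather than code from the slides: each iteration's load produces the index for the next load, so with a roughly 4-cycle L1 load latency the loop is limited by the dependency chain, not by the number of functional units.

       /* Same shape as the pointer-chasing example: the next load cannot
        * start until the previous load's value arrives. */
       int follow(const int *next, int start, int steps) {
           int i = start;
           for (int s = 0; s < steps; s++)
               i = next[i];          /* waits on the previous load */
           return i;
       }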

  21. What Can Compilers Do? • Reduce number of µops • Reduce dependency chains to improve instruction-level parallelism • Balance resources: Functional units, architectural registers • Go for code size if nothing else helps

  22. Reassociate • Maximize ILP • Reduce critical path • Beware of register pressure
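     A minimal C sketch of reassociation (function names are illustrative): the first form is a serial chain of three dependent adds; the second pairs independent adds so two of them can execute in parallel, shortening the critical path from three adds to two at the cost of keeping one more value live, which is the register-pressure caveat above.

       int sum_chain(int a, int b, int c, int d) {
           return ((a + b) + c) + d;     /* critical path: 3 dependent adds */
       }

       int sum_reassoc(int a, int b, int c, int d) {
           return (a + b) + (c + d);     /* (a+b) and (c+d) are independent */
       }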

  23. Unroll Loops • Small loops are unrolled by OoO execution • Unroll very small loops to reduce overhead • Unroll large loops to expose ILP by scheduling iterations in parallel • Only helps if iterations are independent • Beware of register pressure
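     A hedged sketch of unrolling the dot-product loop by four with independent partial sums; it assumes n is a multiple of 4 to keep the example short. The four accumulator chains do not depend on each other, so the out-of-order core can overlap them, and the final reduction is itself reassociated.

       int dot_unrolled(const int a[], const int b[], int n) {
           int s0 = 0, s1 = 0, s2 = 0, s3 = 0;
           /* four independent accumulators break the loop-carried chain */
           for (int i = 0; i < n; i += 4) {
               s0 += a[i]     * b[i];
               s1 += a[i + 1] * b[i + 1];
               s2 += a[i + 2] * b[i + 2];
               s3 += a[i + 3] * b[i + 3];
           }
           return (s0 + s1) + (s2 + s3); /* reassociated final reduction */
       }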

  24. Unroll and Reassociate
     Original loop body: a single chain of mla instructions, all accumulating
     into one register (a serial loop-carried dependence).
     Unrolled by 4 with independent accumulators, then reassociated after the
     loop:
       loop:
         mla r1 ← …, r1
         mla r2 ← …, r2
         mla r3 ← …, r3
         mla r4 ← …, r4
       end:
         add r0 ← r1, r2
         add r1 ← r3, r4
         add r0 ← r0, r1

  25. Unroll and Reassociate • Difficult after instruction selection • Handled by the loop vectorizer • Needs to estimate register pressure on IR • MI scheduler can mitigate some register pressure problems

  26. Schedule for OoO • No need for detailed itineraries • New instruction scheduling models • Schedule for register pressure and ILP • Overlap long instruction chains • Keep track of register pressure

  27. If-conversion
     With a branch:
       mov (…) → rdx
       mov (…) → rsi
       lea (rsi, rdx) → rcx
       lea 32768(rsi, rdx) → rsi
       cmp 65536, rsi
       jb end
       test rcx, rcx
       mov -32768 → rcx
       cmovg r8 → rcx
     end:
       mov cx, (…)
     If-converted:
       mov (…) → rdx
       mov (…) → rsi
       lea (rsi, rdx) → rcx
       test rcx, rcx
       lea 32768(rsi, rdx) → rsi
       mov -32768 → rdx
       cmp 65536, rsi
       cmovg r8 → rdx
       cmovnb rdx → rcx
       mov cx, (…)

  28. If-conversion • Reduces branch predictor pressure • Avoids expensive branch mispredictions • Executes more instructions • Can extend the critical path • Includes condition in critical path
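     A source-level C sketch of the trade-off, with illustrative names rather than the code from the x86 example: the branchy form risks a misprediction when the condition is unpredictable; the select form is the kind of code a compiler can turn into a cmov, at the price of always evaluating both inputs and putting the condition on the data-dependence chain.

       int pick_branch(int cond, int a, int b) {
           if (cond > 0)            /* may mispredict on random data */
               return a;
           return b;
       }

       int pick_select(int cond, int a, int b) {
           return cond > 0 ? a : b; /* typically if-converted to a cmov */
       }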

  29. If-conversion
       mov (…) → rdx
       mov (…) → rsi
       lea (rsi, rdx) → rcx
       lea 32768(rsi, rdx) → rsi
       cmp 65536, rsi
       jb end
       test rcx, rcx
       mov -32768 → rcx
       cmovg r8 → rcx
     end:
       mov cx, (…)

  30. If-conversion
       mov (…) → rdx
       mov (…) → rsi
       lea (rsi, rdx) → rcx
       test rcx, rcx
       lea 32768(rsi, rdx) → rsi
       mov -32768 → rdx
       cmp 65536, rsi
       cmovg r8 → rdx
       cmovnb rdx → rcx
       mov cx, (…)

  31. Machine Trace Metrics • Picks a trace of multiple basic blocks • Computes CPU resources used by trace • Computes instruction latencies • Computes critical path and “slack”

  32. Slack [Diagram: a small dependence graph with a cmov, an add, and a mul; the shorter chain finishes 2 cycles before the critical path, i.e. it has 2 cycles of slack.]
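     An illustrative C sketch of how critical path and slack can be computed over a trace. This is not LLVM's actual MachineTraceMetrics interface, just the idea: each instruction gets a depth (longest latency chain from the top of the trace) and a height (longest chain to the bottom); the critical path is the maximum depth + height, and an instruction's slack is how far its own chain falls short of that.

       typedef struct {
           int depth;   /* cycles from the top of the trace to this result */
           int height;  /* cycles from this instruction to the trace's end */
       } TraceInfo;

       static int critical_path(const TraceInfo *t, int n) {
           int cp = 0;
           for (int i = 0; i < n; i++)
               if (t[i].depth + t[i].height > cp)
                   cp = t[i].depth + t[i].height;
           return cp;
       }

       static int slack(const TraceInfo *t, int n, int i) {
           /* cycles instruction i could be delayed without growing the trace */
           return critical_path(t, n) - (t[i].depth + t[i].height);
       }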

  33. Sandy Bridge
     Port 0:   ALU, VecMul, Shuffle, FpDiv, FpMul
     Port 1:   ALU, VecAdd, Shuffle, FpAdd, Blend
     Port 5:   ALU, Branch, Shuffle, VecLogic, Blend
     Port 2+3: Load, Store Address
     Port 4:   Store Data

  34. Throughput [Diagram: the loop's µops (mul, two ldr, three ALU ops, a branch) stacked onto the functional units to count how many issue slots each unit needs per iteration.]

  35. Throughput [Diagram: the same µops grouped by the unit that executes them; the busiest unit sets the resource-limited throughput bound.]

  36. Rematerialization
     Spill and reload:
       mov r1 ← 123
       str r1 → [sp+8]
     loop:
       …
       ldr r1 ← [sp+8]
     Rematerialize instead:
     loop:
       mov r1 ← 123
       …

  37. Rematerialization [Diagram: the same code in the loop's issue schedule; the reload (ldr) occupies a Load-unit slot, while the rematerialized mov of the constant fits into an otherwise idle ALU slot, so it is essentially free.]

  38. Code Motion • Sink code back into loops • Sometimes instructions are free • Use registers to improve ILP

  39. Code Generator
     SelectionDAG
       → Early SSA Optimizations (LICM, CSE, Sinking, Peephole)
       → ILP Optimizations (driven by MachineTraceMetrics)
       → Leaving SSA Form
       → MI Scheduler
       → Register Allocator

  40. IR Optimizers
     Canonicalization
       → Inlining
       → Loop Vectorizer (uses Target Info)
       → Loop Strength Reduction
       → SelectionDAG
