How Computers Work
Jakob Stoklund Olesen, Apple
How Computers Work
• Out of order CPU pipeline
• Optimizing for out of order CPUs
• Machine trace metrics analysis
• Future work
Out of Order CPU Pipeline
[Diagram: Fetch (with Branch Predictor) → Decode → Rename → Reorder Buffer → Scheduler → Load / ALU / ALU / Br functional units → Retire]
Dot Product

int dot(int a[], int b[], int n) {
  int sum = 0;
  for (int i = 0; i < n; i++)
    sum += a[i]*b[i];
  return sum;
}
Dot Product

loop:
  ldr r3 ← [r0, r6, lsl #2]
  ldr r4 ← [r1, r6, lsl #2]
  mul r3 ← r3, r4
  add r5 ← r3, r5
  add r6 ← r6, #1
  cmp r6, r2
  bne loop
Rename

Architectural loop:
loop:
  ldr r3 ← [r0, r6, lsl #2]
  ldr r4 ← [r1, r6, lsl #2]
  mul r3 ← r3, r4
  add r5 ← r3, r5
  add r6 ← r6, #1
  cmp r6, r2
  bne loop

Renamed µops in the Reorder Buffer, from oldest (next to Retire) to newest (just renamed); the second and third iterations are speculated past the predicted branches:
  p100 ← ldr [p10, p94, lsl #2]
  p101 ← ldr [p11, p94, lsl #2]
  p102 ← mul p100, p101
  p103 ← add p102, p95
  p104 ← add p94, #1
  p105 ← cmp p104, p12
  bne p105, taken
  p106 ← ldr [p10, p104, lsl #2]
  p107 ← ldr [p11, p104, lsl #2]
  p108 ← mul p107, p106
  p109 ← add p108, p103
  p110 ← add p104, #1
  p111 ← cmp p110, p12
  bne p111, taken
  p112 ← ldr [p10, p110, lsl #2]
  p113 ← ldr [p11, p110, lsl #2]
  p114 ← mul p112, p113
  p115 ← add p114, p109
  p116 ← add p110, #1
  p117 ← cmp p116, p12
  bne p117, taken
[Scheduling diagrams, built up over four animation steps: the renamed µops from the previous slide are issued to the Load, ALU, ALU and Branch units over cycles 1–10.]
• The two loads p100 and p101 issue in cycles 1 and 2; the dependent mul p102 has to wait until cycle 6, and the accumulate add p103 until cycle 9.
• The independent index, compare and branch µops (p104, p105, bne) issue alongside them in cycles 1–3.
• The loads and ALU µops of the following iterations (p106 onward, and later iterations marked a, b, c) fill the remaining slots, so the units stay busy and iterations execute overlapped.
Throughput
• Map µops to functional units
• One µop per cycle per functional unit
• Multiple ALU functional units
• ADD throughput is 1/3 cycle/instruction
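A rough way to read those numbers (an informal estimate, not from the slides): if three units can execute integer ADDs, as the 1/3 cycle figure implies, then three independent adds can issue every cycle, so N independent adds finish in about N/3 cycles. By the same counting, the dot-product loop issues two loads per iteration onto the single load unit shown in the diagrams, so loads alone cap it at no better than roughly 2 cycles per iteration.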
Multiply-Accumulate

loop:
  ldr r3 ← [r0, r6, lsl #2]
  ldr r4 ← [r1, r6, lsl #2]
  mla r5 ← r3, r4, r5
  add r6 ← r6, #1
  cmp r6, r2
  bne loop
loop:
  ldr r3 ← [r0, r6, lsl #2]
  ldr r4 ← [r1, r6, lsl #2]
  mla r5 ← r3, r4, r5
  add r6 ← r6, #1
  cmp r6, r2
  bne loop

[Scheduling diagram: the loads, adds, compares and branches keep the Load, ALU and Branch units busy in cycles 1–4, but the first mla does not issue until cycle 6 and the next one not until cycle 10.]
4 cycles loop-carried dependence — 2x slower!
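Reading the slide's numbers another way (informal, but consistent with the diagram): every mla consumes the previous iteration's accumulator, so iterations can start no less than 4 cycles apart — the mla latency — no matter how many units are idle. The separate mul + add version only carries the 1-cycle add across iterations and sustains roughly one iteration every 2 cycles, hence the 2x slowdown.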
Pointer Chasing

int len(node *p) {
  int n = 0;
  while (p) {
    p = p->next;
    n++;
  }
  return n;
}
Pointer Chasing

loop:
  ldr r1 ← [r1]
  add r0 ← r0, #1
  cmp r1, #0
  bxeq lr
  b loop
Architectural loop:
loop:
  ldr r1 ← [r1]
  add r0 ← r0, #1
  cmp r1, #0
  bxeq lr
  b loop

Renamed µop stream:
  p100 ← ldr [p97]
  p101 ← add p98, #1
  p102 ← cmp p100, #0
  bxeq p102, not taken
  p103 ← ldr [p100]
  p104 ← add p101, #1
  p105 ← cmp p104, #0
  bxeq p105, not taken
  p106 ← ldr [p103]
  p107 ← add p104, #1
  p108 ← cmp p107, #0
  bxeq p108, not taken
[Scheduling diagrams, built up over three animation steps: the renamed µops are issued to the Load, ALU, ALU and Branch units over cycles 1–10.]
• The first load p100 issues in cycle 1, but p103 needs its result as an address and cannot issue until cycle 5, and p106 not until cycle 9: the loads are serialized 4 cycles apart by the load latency.
• The counter adds (p101, p104, p107), compares and branches issue cheaply on the ALU and Branch units, leaving most of the issue slots empty.
Latency
• Each µop must wait for its operands to be computed
• Pipelined units can take multiple cycles per instruction while still accepting a new µop every cycle
• Load latency is 4 cycles from the L1 cache
• Long dependency chains cause idle cycles
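A quick check against the pointer-chasing diagrams: each ldr produces the address of the next ldr, so consecutive loads cannot issue less than 4 cycles apart (the L1 load latency). That single chain pins the loop at one iteration per 4 cycles, which is exactly where the loads land in the diagrams (cycles 1, 5 and 9).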
What Can Compilers Do?
• Reduce number of µops
• Reduce dependency chains to improve instruction-level parallelism
• Balance resources: functional units, architectural registers
• Go for code size if nothing else helps
Reassociate
• Maximize ILP
• Reduce critical path
• Beware of register pressure
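A small source-level sketch of the idea (illustrative only, not from the slides): reassociating a sum so the inner additions become independent.

/* Serial form: ((a + b) + c) + d is a chain of three dependent adds. */
int sum4_serial(int a, int b, int c, int d) {
  return ((a + b) + c) + d;
}

/* Reassociated: (a + b) and (c + d) are independent and can issue in the
 * same cycle, so the critical path is only two adds -- at the cost of one
 * extra value being live at the same time. */
int sum4_reassoc(int a, int b, int c, int d) {
  return (a + b) + (c + d);
}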
Unroll Loops
• Small loops are unrolled by OoO execution
• Unroll very small loops to reduce overhead
• Unroll large loops to expose ILP by scheduling iterations in parallel
• Only helps if iterations are independent
• Beware of register pressure
Unroll and Reassociate

loop:
  mla r1 ← …, r1
  mla r2 ← …, r2
  mla r3 ← …, r3
  mla r4 ← …, r4
end:
  add r0 ← r1, r2
  add r1 ← r3, r4
  add r0 ← r0, r1

[Diagram: the original loop's single serial mla chain, shown next to the four shorter, independent accumulator chains of the unrolled loop and the final add reduction tree.]
Unroll and Reassociate
• Difficult after instruction selection
• Handled by the loop vectorizer
• Needs to estimate register pressure on IR
• MI scheduler can mitigate some register pressure problems
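At the source level, the unrolled and reassociated loop sketched on the previous slide corresponds roughly to the following (illustrative C only; dot4 and s1..s4 are made-up names, and n is assumed to be a multiple of 4):

int dot4(int a[], int b[], int n) {
  /* Four independent accumulators replace the single loop-carried
   * multiply-accumulate chain with four shorter ones. */
  int s1 = 0, s2 = 0, s3 = 0, s4 = 0;
  for (int i = 0; i < n; i += 4) {
    s1 += a[i+0] * b[i+0];
    s2 += a[i+1] * b[i+1];
    s3 += a[i+2] * b[i+2];
    s4 += a[i+3] * b[i+3];
  }
  /* Reduction tree, matching the adds after the unrolled loop. */
  return (s1 + s2) + (s3 + s4);
}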
Schedule for OoO
• No need for detailed itineraries
• New instruction scheduling models
• Schedule for register pressure and ILP
• Overlap long instruction chains
• Keep track of register pressure
If-conversion

With a conditional branch:
  mov (…) → rdx
  mov (…) → rsi
  lea (rsi, rdx) → rcx
  lea 32768(rsi, rdx) → rsi
  cmp 65536, rsi
  jb end
  test rcx, rcx
  mov -32768 → rcx
  cmovg r8 → rcx
end:
  mov cx, (…)

If-converted (branch replaced by cmovnb):
  mov (…) → rdx
  mov (…) → rsi
  lea (rsi, rdx) → rcx
  test rcx, rcx
  lea 32768(rsi, rdx) → rsi
  mov -32768 → rdx
  cmp 65536, rsi
  cmovg r8 → rdx
  cmovnb rdx → rcx
  mov cx, (…)
If-conversion
• Reduces branch predictor pressure
• Avoids expensive branch mispredictions
• Executes more instructions
• Can extend the critical path
• Includes condition in critical path
If-conversion

  mov (…) → rdx
  mov (…) → rsi
  lea (rsi, rdx) → rcx
  lea 32768(rsi, rdx) → rsi
  cmp 65536, rsi
  jb end
  test rcx, rcx
  mov -32768 → rcx
  cmovg r8 → rcx
end:
  mov cx, (…)
If-conversion

  mov (…) → rdx
  mov (…) → rsi
  lea (rsi, rdx) → rcx
  test rcx, rcx
  lea 32768(rsi, rdx) → rsi
  mov -32768 → rdx
  cmp 65536, rsi
  cmovg r8 → rdx
  cmovnb rdx → rcx
  mov cx, (…)
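A C-level analogue of the cmovg in the listings above (a hypothetical snippet, not the slides' original source): after if-conversion the branch disappears and the result becomes an unconditional select that always depends on the compare.

/* Hypothetical source for: mov -32768 → r; test x, x; cmovg r8 → r */
int select_value(long x, int r8_value) {
  int result = -32768;
  if (x > 0)
    result = r8_value;   /* becomes a cmovg instead of a branch */
  return result;
}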
Machine Trace Metrics
• Picks a trace of multiple basic blocks
• Computes CPU resources used by trace
• Computes instruction latencies
• Computes critical path and “slack”
Slack
[Diagram: a small dependence graph with Cmov, Add and Mul instructions; the chain off the critical path has 2 cycles of slack.]
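To make “critical path” and “slack” concrete, here is a minimal sketch of the computation on a tiny made-up trace (hypothetical data structures, not LLVM's MachineTraceMetrics API):

#include <stdio.h>

/* Each instruction records its result latency and up to two defining
 * instructions it depends on (-1 = no dependence). The trace is listed
 * in program order, which is also a topological order. */
typedef struct {
  const char *name;
  int latency;
  int dep[2];
} Inst;

int main(void) {
  Inst trace[] = {
    {"ldr a", 4, {-1, -1}},
    {"ldr b", 4, {-1, -1}},
    {"mul",   3, { 0,  1}},
    {"add i", 1, {-1, -1}},   /* independent index update */
    {"acc",   1, { 2, -1}},
  };
  int n = sizeof(trace) / sizeof(trace[0]);
  int depth[n], height[n], critical = 0;

  /* Forward pass: depth = earliest cycle an instruction can issue,
   * given the latencies of everything it depends on. */
  for (int i = 0; i < n; i++) {
    depth[i] = 0;
    for (int k = 0; k < 2; k++) {
      int d = trace[i].dep[k];
      if (d >= 0 && depth[d] + trace[d].latency > depth[i])
        depth[i] = depth[d] + trace[d].latency;
    }
  }

  /* Backward pass: height = cycles from issue to the end of the trace,
   * following the users of each result. */
  for (int i = n - 1; i >= 0; i--) {
    height[i] = trace[i].latency;
    for (int j = i + 1; j < n; j++)
      for (int k = 0; k < 2; k++)
        if (trace[j].dep[k] == i && trace[i].latency + height[j] > height[i])
          height[i] = trace[i].latency + height[j];
    if (depth[i] + height[i] > critical)
      critical = depth[i] + height[i];
  }

  /* Slack: how far an instruction can slip without lengthening the
   * critical path. Here only "add i" has slack; the others sit on the
   * ldr -> mul -> acc chain. */
  for (int i = 0; i < n; i++)
    printf("%-6s depth=%2d height=%2d slack=%2d\n", trace[i].name,
           depth[i], height[i], critical - depth[i] - height[i]);
  return 0;
}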
Sandy Bridge
Port 0:   ALU, VecMul, Shuffle, FpDiv, FpMul
Port 1:   ALU, VecAdd, Shuffle, FpAdd, Blend
Port 5:   ALU, Branch, Shuffle, VecLogic, Blend
Port 2+3: Load, Store Address
Port 4:   Store Data
Throughput
[Diagrams, shown in two steps: the dot-product loop's µops — mul, ldr, ldr, add, add, add, br — are binned onto the functional units to estimate the resource-limited cycle count.]
Rematerialization

Spill and reload:
  mov r1 ← 123
  str r1 → [sp+8]
loop:
  …
  ldr r1 ← [sp+8]

Rematerialize:
loop:
  mov r1 ← 123
  …

[Diagram: the loop's µops with the reload (add, add, ldr) next to the rematerialized version (add, add, mov) — the mov occupies a cheap ALU slot instead of the load unit.]
Code Motion
• Sink code back into loops
• Sometimes instructions are free
• Use registers to improve ILP
Code Generator
[Diagram: SelectionDAG → Early SSA Optimizations (LICM, CSE, Sinking, Peephole — ILP optimizations driven by MachineTraceMetrics) → Leaving SSA Form → MI Scheduler → Register Allocator]
IR Optimizers
[Diagram: Canonicalization → Inlining → Loop Vectorizer → Loop Strength Reduction → SelectionDAG, with Target Info feeding the target-aware passes]