Exam Review 2

  1. Exam Review 2

  2. ROB: head/tail. [diagram: reorder buffer with head and tail pointers; each entry has log. reg, prev. phys. reg, ready?, except?, store? fields; alongside a rename map (R0..R4 to physical regs X0..X12) and a physical free list for the next rename] exercise: result of processing the next entry? of processing the rest?

  3. Questions?

  4. vector instructions. register types: scalar, vector, predicate/mask, length. made-up syntax follows:
     @MaskRegister VADD V0, V1, V2   (also written @MaskRegister V0 ← V1 + V2)
     means:
     for (int i = 0; i < MIN(VectorLengthRegister, MaxVectorLength); i += 1) {
         if (MaskRegister[i]) { V0[i] = V1[i] + V2[i]; }
     }

  5. vector exercise:
     void vector_add_one(int *x, int length) {
         for (int i = 0; i < length; ++i) { x[i] += 1; }
     }
     exercise: write as a vector machine program with 64-element vectors, using the vector length register or predicate (mask) registers

  6. vector exercise answer:
     // R1 contains x, R2 contains length
             VL ← R2 MOD 64
     Loop:   IF R2 <= 0, goto End
             V1 ← MEMORY[R1]
             V1 ← V1 + 1
             MEMORY[R1] ← V1
             R1 ← R1 + 4 × VL     // advance x past the elements just processed
             R2 ← R2 − VL
             VL ← 64
             goto Loop
     End:
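For comparison, a scalar C sketch of the same strip-mining idea (the function name and the explicit pointer bookkeeping are mine, not from the slide):

    /* emulates the strip-mined vector loop above: the first chunk covers
       length % 64 elements, every later chunk covers a full 64 */
    void vector_add_one_stripmined(int *x, int length) {
        int vl = length % 64;              /* VL <- R2 MOD 64 */
        while (length > 0) {               /* IF R2 <= 0, goto End */
            for (int i = 0; i < vl; ++i)
                x[i] += 1;                 /* V1 <- MEMORY[R1]; V1 <- V1 + 1; MEMORY[R1] <- V1 */
            x += vl;                       /* R1 <- R1 + 4 * VL */
            length -= vl;                  /* R2 <- R2 - VL */
            vl = 64;                       /* VL <- 64 */
        }
    }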

  7. relaxed memory models ex 1: reasons for reorderings?

  8. relaxed reasons. optimizations to think about: executing loads/stores out of order (if addresses don't conflict); combining two loads to the same address ("load forwarding"); combining a load + store to the same address ("store forwarding"); not waiting for invalidations to be acknowledged (esp. with a non-bus network)

  9. relaxed memory models ex 2. What can happen? X = Y = 0 initially.
     CPU1: R1 ← Y; X ← 1; R2 ← Y; R3 ← X
     CPU2: R4 ← X; X ← 2; Y ← 2
     examples of possible sequential orders? (there are 8) examples of non-sequential orders? what could happen to cause other orders?

  10. possible sequential orders. X = Y = 0; CPU1: R1 ← Y; X ← 1; R2 ← Y; R3 ← X; CPU2: R4 ← X; X ← 2; Y ← 2. [table: example (R1, R2, R3, R4) values for the eight possible sequential outcomes]

  11. non-seq orders. X = Y = 0 (same programs as before).
      R1 = 2 and R3 = 2: example causes: CPU2 doesn't wait for CPU1's invalidate to be acknowledged; reordered stores in CPU2.
      R2 = 2 and R3 = 1 and R4 = 1: example causes: load forwarding (reuse first load); store forwarding (use the value stored to X).
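If you want to poke at this example on real hardware, here is a minimal sketch using C11 relaxed atomics and pthreads, assuming the CPU1/CPU2 instruction split shown above (the thread setup, variable names, and single-run printout are mine):

    #include <stdatomic.h>
    #include <pthread.h>
    #include <stdio.h>

    atomic_int X, Y;                      /* X = Y = 0 initially */
    int R1, R2, R3, R4;

    void *cpu1(void *arg) {               /* R1 <- Y; X <- 1; R2 <- Y; R3 <- X */
        (void)arg;
        R1 = atomic_load_explicit(&Y, memory_order_relaxed);
        atomic_store_explicit(&X, 1, memory_order_relaxed);
        R2 = atomic_load_explicit(&Y, memory_order_relaxed);
        R3 = atomic_load_explicit(&X, memory_order_relaxed);
        return NULL;
    }

    void *cpu2(void *arg) {               /* R4 <- X; X <- 2; Y <- 2 */
        (void)arg;
        R4 = atomic_load_explicit(&X, memory_order_relaxed);
        atomic_store_explicit(&X, 2, memory_order_relaxed);
        atomic_store_explicit(&Y, 2, memory_order_relaxed);
        return NULL;
    }

    int main(void) {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, cpu1, NULL);
        pthread_create(&t2, NULL, cpu2, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);
        printf("R1=%d R2=%d R3=%d R4=%d\n", R1, R2, R3, R4);   /* one observed outcome */
        return 0;
    }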

  12. (HW) transactional memory. what is a transaction? atomic: as if uninterrupted by other things. limitations? I/O; amount of space to store the "transaction log". when is performance good/bad? livelock: transactions abort each other over and over; possibly more "wasted work" under contention (e.g., a short transaction aborts a long one); fairness? overhead to manipulate the transaction log if there are lots of items.
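As a concrete illustration (not from the slides) of the begin/commit/abort machinery, a common hardware-TM pattern is Intel RTM with a fallback lock; the counter, the spinlock, and the abort code 0xff are my own example:

    #include <immintrin.h>    /* _xbegin/_xend/_xabort; compile with -mrtm */
    #include <stdatomic.h>

    long shared_counter;
    atomic_int fallback_lock;                 /* 0 = free, 1 = held */

    static void lock(void)   { while (atomic_exchange(&fallback_lock, 1)) ; }
    static void unlock(void) { atomic_store(&fallback_lock, 0); }

    void increment(void) {
        unsigned status = _xbegin();          /* start a hardware transaction */
        if (status == _XBEGIN_STARTED) {
            if (atomic_load(&fallback_lock))  /* lock is now in our read set, so a   */
                _xabort(0xff);                /* later lock acquisition aborts us    */
            shared_counter += 1;              /* tracked in the transaction's write set */
            _xend();                          /* commit: all or nothing */
        } else {
            /* aborted (conflict, capacity, interrupt, ...): take the fallback lock */
            lock();
            shared_counter += 1;
            unlock();
        }
    }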

  16. Virtual and Physical. [diagram: virtual address = Virtual Page # | offset; physical address = Physical Page # | offset; which bits form the index of the set?] Cache has virtual indexes? Solution #1: disallow overlap. Solution #2: translate first. Solution #3: allow virtual indexes (with overlap).
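A quick worked check of Solution #1 (the 4 KB page, 64 B line, 64-set, 8-way numbers are an assumed example, not from the slides): the set index plus block offset must fit entirely inside the page offset, i.e. line size × number of sets ≤ page size.

    #include <stdio.h>

    int main(void) {
        /* assumed example parameters */
        int page_size = 4096;    /* 12 page-offset bits  */
        int line_size = 64;      /*  6 block-offset bits */
        int num_sets  = 64;      /*  6 set-index bits    */
        int assoc     = 8;

        /* Solution #1 works iff index + offset bits lie within the page offset */
        printf("cache size = %d bytes (%d-way)\n", line_size * num_sets * assoc, assoc);
        printf("index+offset span = %d bytes vs page = %d bytes: %s\n",
               line_size * num_sets, page_size,
               line_size * num_sets <= page_size ? "no overlap, virtual index == physical index"
                                                 : "overlap, index depends on translation");
        return 0;
    }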

  19. Physically Tagged, Virtually Indexed

  20. Plausible splits. [diagram: ways to split the address bits into page #/tag, set index, and offset: the tag may cover only the page # bits, or the page #/tag may extend into the set index]

  21. Virtual and Physical. [diagram: virtual address = Virtual Page # | offset; physical address = Physical Page # | offset; which bits form the index of the set?] Cache has virtual indexes? Solution #1: disallow overlap. Solution #2: translate first. Solution #3: allow virtual indexes (with overlap).

  22. Translate First. [diagram: address → TLB → cache → value; a TLB miss triggers a page table lookup; a cache miss triggers a memory access]
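A rough C sketch of the translate-first flow (the toy 16-entry TLB, the identity page table, and the helper names are hypothetical, purely for illustration):

    #include <stdint.h>
    #include <stdio.h>

    /* toy structures standing in for hardware */
    #define TLB_ENTRIES 16
    struct { uint64_t vpn, ppn; int valid; } tlb[TLB_ENTRIES];

    uint64_t page_table_walk(uint64_t vpn) { return vpn; }   /* pretend identity mapping */

    uint64_t translate(uint64_t vaddr) {
        uint64_t vpn    = vaddr >> 12;                /* assume 4 KB pages */
        uint64_t offset = vaddr & 0xFFF;
        unsigned idx    = vpn % TLB_ENTRIES;

        if (!(tlb[idx].valid && tlb[idx].vpn == vpn)) {       /* TLB miss:            */
            tlb[idx].vpn   = vpn;                             /* walk the page table, */
            tlb[idx].ppn   = page_table_walk(vpn);            /* then refill the TLB  */
            tlb[idx].valid = 1;
        }
        return (tlb[idx].ppn << 12) | offset;                 /* physical address */
    }

    int main(void) {
        /* translate first; only then would the physical address index/tag the cache */
        printf("0x%llx -> 0x%llx\n",
               (unsigned long long)0x12345678, (unsigned long long)translate(0x12345678));
        return 0;
    }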

  23. Virtual Caches. no translation for the entire cache lookup, including tag checking. these exist, but are more complicated: need to handle aliasing (multiple virtual addresses for one physical address). example ways: OS must prevent/manage aliasing; a physical L2 tracks the virtual-to-physical mapping used in L1.

  24. OOO tradeoffs

  25. gem5 pipeline. [diagram: Fetch → Decode → Rename → Issue → Execute → WB → Commit, with an instruction queue, reorder buffer, load queue, store queue, and physical register file]

  26. OOO tradeoffs (1). dependencies plus latency limit performance; diminishing returns from additional computational resources. latencies that can be especially long: cache/memory accesses; branch resolution. speculation helps "cheat" on dependencies: branch prediction; memory reordering (+ check later whether addresses conflict).

  27. OOO tradeoffs (2). limits on the number of instructions "in flight": number of physical registers; size of queues (instruction, load/store); size of the reorder buffer; # of active cache misses.

  28. OOO tradeoffs (3). miscellaneous issues: right types of functional units for the programs? wasted work from frequent "exceptions"? (might include, e.g., a memory ordering error)

  29. OOO tradeoff exercise. what programs will be most affected by a smaller/larger: reorder buffer; instruction queue; number of floating point adders; number of physical registers; number of instructions fetched/decoded/renamed/issued/committed per cycle?

  30. VLIW. longer instruction word: fetch instruction bundles; parallel pipelines with shared registers; specialized pipelines (e.g., Int/Mul ALU 1, Int/Mul ALU 2, Simple ALU, Address ALU + Memory), each running Fetch → Read Regs → execute → Write Back.

  31. VLIW vs OOO. VLIW is like OOO but… instructions are scheduled at compile time, not run time; eliminates OOO scheduling logic/queues; the compiler does dependency detection, including dealing with functional unit latency; possibly eliminates the reorder buffer.

  32. VLIW problems. requires a smart compiler; can't reschedule based on memory latency, etc.; assembly/machine code is tied to a particular HW design.

  33. VLIW exercise:
      int *foo; int *bar;
      ...
      for (int i = 0; i < 1000; ++i) { *foo = *foo * *bar; foo += 1; bar += 1; }
      outline the assembly for a VLIW processor with:
      bundles of two instructions: slot 1: load/store (address is reg+offset) or add/subtract; slot 2: compare-and-branch or multiply or add/subtract
      all instructions take registers or constants; adds can load a constant
      all instructions take two cycles to produce a usable result

  34. VLIW exercise: slow answer
      # R0: foo; R1: bar; R2: i
      # R3: foo temp1; R4: bar temp1
              R2 ← 0           .  NOP
              NOP              .  NOP              // wait for R2
      Loop:   NOP              .  IF R2 ≥ 1000 GOTO End
              R3 ← M[R0+0]     .  NOP              // load *foo
              R4 ← M[R1+0]     .  NOP              // load *bar
              R0 ← R0 + 4      .  NOP              // ++foo
              R1 ← R1 + 4      .  R3 ← R3 × R4     // ++bar; multiply (loads now usable)
              NOP              .  R2 ← R2 + 1      // ++i; wait for ×
              M[R0 − 4] ← R3   .  GOTO Loop        // store result to *foo
      End:

  35. VLIW exercise: faster answer? the NOPs were needed due to instruction delays / lack of work. alternative: unroll the loop several times; move loads/stores between iterations of the loop; eliminate the branch at the beginning.

  36. final notes. a bunch of multiple choice (because I could write it); we have the room until 7:15 PM (will give 2 hours); office hours Friday 10am–12pm / Piazza; super last-minute questions? office hours Monday 1pm–3pm
