
CS356 Unit 12a Processor Hardware Organization BASIC HW Pipelining



  1. 12a.1 CS356 Unit 12a: Processor Hardware Organization, Pipelining

     12a.2 BASIC HW

     12a.3 Logic Circuits
     • ________________ logic
       – Performs a specific function (mapping of ___ input combinations to desired output combinations)
       – No internal state or feedback
         • Given a set of inputs, we will always get the same output after some time (propagation) __________
     • __________ logic (Storage devices)
       – _____________ are the fundamental building blocks
         • Remembers a set of bits for later use
         • Acts like a variable from software
         • Controlled by a "_________" signal
     [Figure: Combinational Logic block (Inputs → Outputs). Outputs depend only on current inputs.]
     [Figure: Sequential Logic block with a register holding "state". Sequential values feed back and provide "memory". Outputs depend on current inputs and previous inputs (previous inputs summarized via state).]

     12a.4 Combinational Logic Gates
     • Main Idea: Circuits called "gates" perform logic operations to produce desired outputs from some digital inputs
     [Figure: Inputs → Logic (usually operations like +, -, *, /, &, |, <<) → Outputs, with an example network of OR, AND, and NOT gates]
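The difference between the two kinds of logic on the Logic Circuits slide can be sketched in software. This is only an illustration under assumptions: `majority`, `Register`, and `Counter` are hypothetical names that do not appear in the slides, and single bits are modeled as the Python integers 0 and 1.

```python
def and_gate(a, b):
    return a & b

def or_gate(a, b):
    return a | b

def not_gate(a):
    return a ^ 1  # flips a single bit (0 <-> 1)

def majority(a, b, c):
    """Combinational: built only from gates, so the output depends
    only on the current inputs (no stored state, no feedback)."""
    return or_gate(or_gate(and_gate(a, b), and_gate(b, c)), and_gate(a, c))

class Register:
    """Storage element: q keeps its value until the next clock edge."""
    def __init__(self, initial=0):
        self.q = initial

    def clock_edge(self, d):
        self.q = d  # capture the D input on the clock edge

class Counter:
    """Sequential: the next value is a function of the stored state
    and the current input, so past inputs influence the output."""
    def __init__(self):
        self.state = Register(0)

    def tick(self, enable):
        next_value = self.state.q + 1 if enable else self.state.q
        self.state.clock_edge(next_value)
        return self.state.q
```

Calling `majority(1, 1, 0)` gives the same answer every time, while two calls to `Counter.tick(1)` return different values because the register remembers what happened before.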

  2. 12a.5 Propagation Delay
     • Main Idea: All digital logic circuits have propagation delay
       – Time it takes for output to change when inputs are changed
     [Figure: example gate network; 4 gate delays for input to propagate to outputs]

     12a.6 Combinational Logic Functions
     • Map input combinations of n-bits to desired m-bit output
     • Can describe function with a truth table and then find its circuit implementation
     [Figure: truth table with columns IN0, IN1, IN2, OUT0, OUT1, next to a logic circuit block with inputs In0, In1, In2]

     12a.7 ALU's
     • Perform a selected operation on two input numbers.
       – FS[5:0] select the desired operation

       Func. Code  Op.            Func. Code  Op.
       00_0000     A SHL B        10_0000     A+B
       00_0010     A SHR B        10_0010     A-B
       00_0011     A SAR B        …           …
       …           …              10_0100     A AND B
       01_1000     A * B          10_0101     A OR B
       01_1001     A * B (uns.)   10_0110     A XOR B
       01_1010     A / B          10_0111     A NOR B
       01_1011     A / B (uns.)   …           …
       …           …              10_1010     A < B

     [Figure: ALU with 32-bit inputs X[31:0] (A) and Y[31:0] (B), carry input C0, 6-bit function select FS[5:0], 32-bit output RES[31:0], and flags OF, ZF, CF. Example: X = 0x80000000, Y = 0x00000001, FS = 100010, OF = __, ZF = __, CF = __]

     12a.8 Sequential Devices (Registers)
     • Registers _______ the D input value when a control input (aka the clock signal) transitions from _____ (clock edge) and stores that value at the Q output until the next clock edge
     • A register is like a ___________ in software. It stores a value for later use.
     • We can choose to only clock the register at "________" times when we want the register to capture a new value (i.e. when it is the __________ of an instruction)
     • Key Idea: Registers __________ data while we operate on those values
     [Figure: Block Diagram of a Register. The clock pulse (positive edge) causes q(t) to sample and hold the current d(t) value until another clock pulse.]
     [Figure: %rax feeding an ALU whose sum is written back to %rax, for the sequence add %rbx,%rax then add %rcx,%rax]
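As a rough illustration of how a function-select input picks an ALU operation, here is a minimal sketch. It assumes the function codes listed on the ALU slide, covers only a subset of them, and reports only the zero flag. The name `alu` and the flag handling are assumptions for illustration, not the course's definition.

```python
MASK32 = 0xFFFFFFFF  # results are truncated to 32 bits, like RES[31:0]

# Function codes taken from the ALU slide's table (subset only).
FS_ADD, FS_SUB = 0b100000, 0b100010
FS_AND, FS_OR, FS_XOR, FS_NOR = 0b100100, 0b100101, 0b100110, 0b100111

def alu(a, b, fs):
    """Return (res, zf) for the operation selected by fs.

    Assumption: only ZF is modeled; OF and CF are left out of this sketch.
    """
    a &= MASK32
    b &= MASK32
    if fs == FS_ADD:
        wide = a + b
    elif fs == FS_SUB:
        wide = a - b
    elif fs == FS_AND:
        wide = a & b
    elif fs == FS_OR:
        wide = a | b
    elif fs == FS_XOR:
        wide = a ^ b
    elif fs == FS_NOR:
        wide = ~(a | b)
    else:
        raise ValueError(f"function code {fs:06b} not modeled in this sketch")
    res = wide & MASK32      # keep the low 32 bits
    zf = int(res == 0)       # zero flag: 1 when the result is all zeros
    return res, zf
```

For example, `alu(5, 3, FS_SUB)` selects A-B, and `alu(0xFFFFFFFF, 1, FS_ADD)` wraps around to a zero result, which sets the zero flag.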

  3. 12a.9 Clock Signal
     • Alternating high/low voltage pulse train
     • Controls the ordering and timing of operations performed in the processor
     • 1 cycle is usually measured from rising edge to rising edge
     • Clock frequency = # of cycles per second (e.g. 2.8 GHz = 2.8 * 10^9 cycles per second)
     [Figure: clock signal alternating between 1 (5V) and 0 (0V), with Op. 1, Op. 2, Op. 3 each taking 1 cycle. 2.8 GHz = 2.8 * 10^9 cycles per second = 0.357 ns/cycle]

     12a.10 FROM X86 TO RISC
     Basic HW organization for a simplified instruction set

     12a.11 From CISC to RISC
     • Complex Instruction Set Computers often have instructions that ______ widely in how much _________ they perform and how much _________ they take to execute
       – Benefit is _________ instructions are needed to accomplish a task
     • Reduced Instruction Set Computers favor instructions that take roughly the _______ time to execute and follow a common _____________ of steps
       – It often requires _______ instructions to describe the overall task (larger code size)
     • See the CISC vs. RISC Equivalents example
     • RISC makes the ___________ design easier, so let's tweak our x86 instructions to be more RISC-like

     CISC vs. RISC Equivalents

       // CISC instruction
       movq 0x40(%rdi, %rsi, 4), %rax

       // RISC Equiv. w/ 1 mem. or ALU op. per instruction
       mov %rsi, %rbx      # use %rbx as a temp.
       shl $2, %rbx        # %rsi * 4
       add %rdi, %rbx      # %rdi + (%rsi*4)
       add $0x40, %rbx     # 0x40 + %rdi + (%rsi*4)
       mov (%rbx), %rax    # %rax = *%rbx

     12a.12 A RISC Subset of x86
     • Split mov instructions that access memory into separate instructions:
       – ld = _____________ from memory
       – st = _____________ to memory
     • Limit ld & st instructions to use at most ____________________________
       – No ld 0x04(%rdi, %rsi, 4), %rax
         • Too much work
       – At most ld ____________, %rax or st %rax, _____________
     • Limit arithmetic & logic instructions to only operate on registers
       – No add (%rsp), %rax since this implicitly accesses (dereferences) memory
       – Only add ______________

       // 3 x86 memory read instructions
       mov (%rdi), %rax            // 1
       mov 0x40(%rdi), %rax        // 2
       mov 0x40(%rdi,%rsi), %rax   // 3

       // Equiv. load sequences
       ld 0x0(%rdi), %rax          // 1
       ld 0x40(%rdi), %rax         // 2
       mov %rsi, %rbx              // 3a
       add %rdi, %rbx              // 3b
       ld 0x40(%rbx), %rax         // 3c

       // 3 x86 memory write instructions
       mov %rax, (%rdi)            // 1
       mov %rax, 0x40(%rdi)        // 2
       mov %rax, 0x40(%rdi,%rsi)   // 3

       // Equiv. store sequences
       st %rax, 0x0(%rdi)          // 1
       st %rax, 0x40(%rdi)         // 2
       mov %rsi, %rbx              // 3a
       add %rdi, %rbx              // 3b
       st %rax, 0x40(%rbx)         // 3c

       // CISC instruction
       add %rax, (%rsp)

       // Equiv. RISC sequence (w/ ld and st)
       ld 0(%rsp), %rbx
       add %rax, %rbx
       st %rbx, 0(%rsp)
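To check that the step-by-step RISC sequence reaches the same address as the single CISC addressing mode `0x40(%rdi, %rsi, 4)`, the arithmetic can be sketched as follows. The function names are hypothetical, and registers are modeled as plain 64-bit integers, which is an assumption of this sketch.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF  # model 64-bit register wraparound

def cisc_effective_address(disp, base, index, scale):
    """Address formed in one step by disp(base, index, scale)."""
    return (disp + base + index * scale) & MASK64

def risc_effective_address(rdi, rsi):
    """Same address built with one ALU operation per instruction."""
    rbx = rsi                       # mov %rsi, %rbx
    rbx = (rbx << 2) & MASK64       # shl $2, %rbx    -> %rsi * 4
    rbx = (rbx + rdi) & MASK64      # add %rdi, %rbx  -> %rdi + (%rsi*4)
    rbx = (rbx + 0x40) & MASK64     # add $0x40, %rbx -> 0x40 + %rdi + (%rsi*4)
    return rbx                      # a final load from (%rbx) would follow
```

With `rdi = 0x1000` and `rsi = 3`, both functions return the same address, which shows the trade: the RISC form needs more instructions, but each one does a single simple operation.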

  4. 12a.13 Processor Block Diagram
     [Figure: five stages, Fetch, Decode, Exec., Mem, WB, at 10 ns each.
      Fetch: PC supplies Addr. to the I-Cache / I-MEM, which returns Data, the Instruction (Machine Code).
      Decode: Decode logic produces Control Signals (e.g. ALU operation, Read/Write D-Cache, etc.); Registers (%rax, %rbx, etc.) (aka RegFile) supply Operands.
      Exec.: ALU produces Zero and Res., the ALU Output (Addr. or Result), and sets Cond. Codes ZF OF CF SF.
      Mem: D-Cache / D-MEM with Addr. and Data.
      WB: Data to write to dest. register.]
     Clock Cycle Time = Sum of delay through worst case pathway = 50 ns

     12a.14 Developing a Processor Organization
     • Identify which hardware components each instruction type would use and in what order: ALU-Type, LD, ST, Jump
     [Figure: components PC, I-Cache, Registers (aka RegFile), ALU, D-Cache (Addr, Data), and a table with one column per instruction type whose rows 1. through 6. are blank]

       ALU-Type         LD                 ST                  JE
       add %rax,%rbx    ld 8(%rax),%rbx    st %rbx, 8(%rax)    je label/displacement

     12a.15 Processor Execution (add)
       add %rax,%rdx    [Machine Code: 48 01 c2]
     [Figure: Fetch, Decode, Exec., Mem, WB datapath. Operands %rax and %rdx are read from the Registers; the ALU output is %rax+%rdx; the data written to the dest. register gives %rdx = %rax+%rdx]

     12a.16 Processor Execution (load)
       ld 0x40(%rbx),%rax    [Machine Code: 48 8b 43 40]
     [Figure: Fetch, Decode, Exec., Mem, WB datapath. Operands %rbx and 0x40 go to the ALU; the ALU output is the addr sent to the D-Cache; the D-Cache returns data; the data written to the dest. register gives %rax = data]
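The four instruction types named above (ALU-type, LD, ST, JE) can be mimicked with a very small interpreter. This is a sketch under several assumptions that are not in the slides: instructions are tuples rather than machine code, the PC counts instructions instead of bytes, the `je` displacement is measured in instructions, only ZF is tracked, and unwritten memory reads as 0. The names `step` and `run` are hypothetical.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def step(state, instr):
    """Execute one instruction against state = {regs, mem, pc, zf}."""
    regs, mem = state["regs"], state["mem"]
    next_pc = state["pc"] + 1              # assumption: PC counts instructions
    op = instr[0]
    if op == "add":                        # ("add", src, dst): dst = dst + src
        _, src, dst = instr
        result = (regs[src] + regs[dst]) & MASK64   # Exec: ALU adds operands
        regs[dst] = result                          # WB: write dest. register
        state["zf"] = int(result == 0)
    elif op == "ld":                       # ("ld", disp, base, dst)
        _, disp, base, dst = instr
        addr = (regs[base] + disp) & MASK64         # Exec: ALU forms address
        regs[dst] = mem.get(addr, 0)                # Mem read, then WB
    elif op == "st":                       # ("st", src, disp, base)
        _, src, disp, base = instr
        addr = (regs[base] + disp) & MASK64         # Exec: ALU forms address
        mem[addr] = regs[src]                       # Mem write, no WB
    elif op == "je":                       # ("je", disp)
        _, disp = instr
        if state["zf"]:
            next_pc += disp                         # taken: PC + displacement
    else:
        raise ValueError(f"unknown instruction {op!r}")
    state["pc"] = next_pc

def run(program, regs, mem):
    """Run until the PC leaves the program; returns the final state."""
    state = {"regs": dict(regs), "mem": dict(mem), "pc": 0, "zf": 0}
    while 0 <= state["pc"] < len(program):
        step(state, program[state["pc"]])
    return state
```

A short usage example: `run([("ld", 0x40, "rbx", "rax"), ("add", "rax", "rdx")], {"rax": 0, "rbx": 0x1000, "rdx": 5}, {0x1040: 7})` loads 7 into `rax` and then adds it into `rdx`.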

  5. 12a.17 Processor Execution (store)
       st %rax,0x40(%rbx)    [Machine Code: 48 89 43 40]
     [Figure: Fetch, Decode, Exec., Mem, WB datapath. Operands %rbx and 0x40 go to the ALU; the ALU output is the addr sent to the D-Cache; %rax is the Data written to the D-Cache]

     12a.18 Processor Execution (branch/jump)
       je L1 (disp. = 0x08)    [Machine Code: 74 08]
     [Figure: Fetch, Decode, Exec., Mem, WB datapath. The PC and the displacement 0x08 go to the ALU, producing PC + 0x08; condition codes shown as ZF OF CF SF = 1 0 1 0]

     12a.19 PIPELINING

     12a.20 Example
       for(i=0; i < 100; i++)
         C[i] = (A[i] + B[i]) / 4;
     [Figure: Memory holding arrays A:, B:, C:, with a counter (Cntr) supplying i to select A[i] and B[i]]
     10 ns per input set = _____ ns total
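The example loop and its timing without pipelining can be sketched as below. Two assumptions are made here that the slide does not state: the array elements are non-negative integers (so Python's `//` matches C integer division), and each input set must finish completely before the next one starts. The function names are hypothetical.

```python
NS_PER_INPUT_SET = 10  # from the slide: 10 ns per input set

def compute_c(a, b):
    """C[i] = (A[i] + B[i]) / 4 for every i.

    Assumption: non-negative integers, so // behaves like C's integer divide.
    """
    return [(x + y) // 4 for x, y in zip(a, b)]

def sequential_time_ns(num_input_sets, ns_per_set=NS_PER_INPUT_SET):
    """Total time when input sets are processed strictly one after another."""
    return num_input_sets * ns_per_set
```

Calling `sequential_time_ns(100)` gives the total for the 100 loop iterations under the one-at-a-time assumption. That total is the baseline that pipelining aims to improve.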
