12.1 CS356 Unit 12
Processor Hardware Organization
Pipelining
12.2 BASIC HW: From combinational to sequential logic
12.3 Logic Circuits
• Combinational Logic
  – Performs a specific function (mapping of 2^n input combinations to desired output combinations)
  – Usually operations like +, -, *, /, &, |, <<
  – No internal state or feedback
  – Outputs depend only on current inputs
  – Given a set of inputs, we will always get the same output after some time (propagation) delay
• Sequential Logic
  – Registers: fundamental building blocks
    • Remember a set of bits for later use
    • Act like a variable from software
    • Controlled by a "clock" signal
  – Sequential values feed back and provide "memory" (the register holds "state")
  – Outputs depend on current inputs and previous inputs (previous inputs summarized by current state)
[Figure: combinational logic block (inputs to outputs); sequential logic block with a register holding state fed back into combinational logic]
12.4 Combinational Logic Gates
• Circuits called gates perform logic operations to produce desired outputs from some digital inputs
[Figure: example circuit built from OR, AND, and NOT gates with sample 0/1 values on each wire]
12.5 Propagation Delay
• All digital logic circuits have propagation delay
  – Time for output to change when inputs are changed
[Figure: multi-level gate circuit; 4 gate delays for input to propagate to outputs]
12.6 Combinational Logic Functions
• Map input combinations of n-bits to desired m-bit output
• Can describe function with a truth table and then find its circuit implementation

  IN0 IN1 IN2 | OUT0 OUT1
   0   0   0  |  0    0
   0   0   1  |  1    0
   …
   1   1   1  |  0    1

[Figure: Logic Circuit block with inputs In0, In1, In2 and outputs Out0, Out1]
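One way to picture a combinational function is as a pure table lookup. The sketch below is illustrative only: `TRUTH_TABLE` and `logic_circuit` are hypothetical names, and only the three rows visible in the truth table above are filled in (the elided rows are deliberately left out).

```python
# Truth table as a lookup: (IN0, IN1, IN2) -> (OUT0, OUT1).
# Only the rows shown on the slide are included; elided rows are not guessed.
TRUTH_TABLE = {
    (0, 0, 0): (0, 0),
    (0, 0, 1): (1, 0),
    (1, 1, 1): (0, 1),
}

def logic_circuit(in0, in1, in2):
    """Return (out0, out1); the result depends only on the current inputs."""
    return TRUTH_TABLE[(in0, in1, in2)]

def num_input_combinations(n):
    """An n-bit input has 2**n combinations, one truth-table row each."""
    return 2 ** n
```

Because there is no stored state, calling `logic_circuit` twice with the same inputs always gives the same outputs, which is the defining property of combinational logic.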
12.7 ALUs
• Perform a selected operation on two input numbers
  – FS[5:0] selects the desired operation

  Func. Code | Op.            Func. Code | Op.
  00_0000    | A SHL B        10_0000    | A+B
  00_0010    | A SHR B        10_0010    | A-B
  00_0011    | A SAR B        …          | …
  …          | …              10_0100    | A AND B
  01_1000    | A * B          10_0101    | A OR B
  01_1001    | A * B (uns.)   10_0110    | A XOR B
  01_1010    | A / B          10_0111    | A NOR B
  01_1011    | A / B (uns.)   …          | …
  …          | …              10_1010    | A < B

[Figure: ALU with 32-bit inputs X[31:0] (A) and Y[31:0] (B), carry-in C0, 6-bit select FS[5:0], and outputs RES[31:0], OF, ZF, CF. Values shown: X = 0x7fffffff, Y = 0x00000001, RES = 0x80000000, OF = 1, ZF = 0, CF = 0, FS = 100010]
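As a rough illustration of how a function select picks an operation, here is a minimal sketch of a 32-bit ALU covering only the add, subtract, and bitwise codes from the table above. The function name `alu` and the flag conventions are assumptions: in particular, CF on subtraction is modeled as an x86-style borrow, which the slide does not specify.

```python
MASK32 = 0xFFFFFFFF

def alu(fs, a, b):
    """Return (res, flags) for the 32-bit operation selected by fs."""
    a &= MASK32
    b &= MASK32
    cf = 0
    of = 0
    if fs == 0b100000:            # A + B
        full = a + b
        res = full & MASK32
        cf = full >> 32           # carry out of bit 31
        # Signed overflow: both operands differ in sign from the result
        of = ((a ^ res) & (b ^ res)) >> 31 & 1
    elif fs == 0b100010:          # A - B
        res = (a - b) & MASK32
        cf = 1 if a < b else 0    # assumption: CF acts as a borrow flag
        # Signed overflow: operands differ in sign and result sign differs from A
        of = ((a ^ b) & (a ^ res)) >> 31 & 1
    elif fs == 0b100100:          # A AND B
        res = a & b
    elif fs == 0b100101:          # A OR B
        res = a | b
    elif fs == 0b100110:          # A XOR B
        res = a ^ b
    elif fs == 0b100111:          # A NOR B
        res = ~(a | b) & MASK32
    else:
        raise ValueError("function code not modeled in this sketch")
    zf = 1 if res == 0 else 0
    return res, {"OF": of, "ZF": zf, "CF": cf}
```

With the A+B code and the operand values 0x7fffffff and 0x00000001, this sketch yields RES = 0x80000000 with OF = 1, ZF = 0, CF = 0: adding 1 to the largest positive 32-bit signed value overflows into the sign bit.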
12.8 Sequential Devices (Registers)
• Registers capture the D input value when a control input (clock signal) transitions from 0 to 1 (clock edge) and store that value at the Q output until the next clock edge
• A register is similar to a variable in software: at the clock edge, it stores a value for later use
• We can choose to only clock the register at "certain" times when we want the register to capture a new value (e.g., when it is the destination of an instruction)
• Key Idea: Registers store data while we operate on those values
[Figure: Block diagram of a register. The clock pulse (positive edge) causes q(t) to sample and hold the current d(t) value until another clock pulse]
[Figure: register %rax feeds an ALU whose sum is written back to %rax, for the sequence add %rbx,%rax then add %rcx,%rax]
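The edge-triggered behavior described above can be sketched in software. This is a simplified model, not a circuit description: the class name `Register`, the `clock` method, and the `enable` argument (standing in for "only clock the register at certain times") are all hypothetical.

```python
class Register:
    """Positive-edge-triggered register: Q changes only on a 0 -> 1 clock edge."""

    def __init__(self, q=0):
        self.q = q
        self._prev_clk = 0

    def clock(self, clk, d, enable=True):
        """Apply clock level clk and input d; return the current Q output."""
        rising_edge = self._prev_clk == 0 and clk == 1
        self._prev_clk = clk
        if rising_edge and enable:
            self.q = d      # sample D and hold it until the next edge
        return self.q


# Usage: model add %rbx,%rax followed by add %rcx,%rax
rax = Register(q=1)
rbx, rcx = 2, 3
rax.clock(0, rax.q + rbx)   # clock low: ALU sum is waiting at D, Q unchanged
rax.clock(1, rax.q + rbx)   # rising edge: %rax captures 1 + 2
rax.clock(0, rax.q + rcx)
rax.clock(1, rax.q + rcx)   # rising edge: %rax captures 3 + 3
```

Between edges the ALU output can change freely without disturbing the stored value, which is what lets the register hold data while the next operation is computed.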
12.9 Clock Signal
• Alternating high/low voltage pulse train
• Controls the ordering and timing of operations performed in the processor
• 1 cycle is usually measured from rising edge to rising edge
• Clock frequency = # of cycles per second (e.g., 2.8 GHz = 2.8 * 10^9 cycles per second = 0.357 ns/cycle)
[Figure: clock signal alternating between 1 (5V) and 0 (0V) and driving the processor; Op. 1, Op. 2, and Op. 3 each occupy 1 cycle]
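The relationship between frequency and cycle time is just a reciprocal. A small sketch of the arithmetic, using a hypothetical helper name `cycle_time_ns`:

```python
def cycle_time_ns(freq_hz):
    """Clock period in nanoseconds for a clock frequency given in Hz."""
    return 1e9 / freq_hz

# 2.8 GHz = 2.8 * 10^9 cycles per second
period = cycle_time_ns(2.8e9)   # roughly 0.357 ns per cycle
```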
12.10 FROM X86 TO RISC: Basic HW organization for a simplified instruction set
12.11 From CISC to RISC
• Complex Instruction Set Computers (CISC) often have instructions that vary widely in how much work they perform and how much time they take to execute
  – Fewer instructions are needed for a task
• Reduced Instruction Set Computers (RISC) favor instructions that take roughly the same time to execute and follow a common sequence of steps
  – More instructions needed, each faster

CISC vs. RISC Equivalents
    // CISC instruction
    movq 0x40(%rdi, %rsi, 4), %rax

    // RISC equivalent with 1 memory or ALU operation per instruction
    mov %rsi, %rbx      # use %rbx as a temp.
    shl $2, %rbx        # %rsi * 4
    add %rdi, %rbx      # %rdi + (%rsi*4)
    add $0x40, %rbx     # 0x40 + %rdi + (%rsi*4)
    mov (%rbx), %rax    # %rax = *%rbx

John Hennessy and David Patterson, ACM Turing Award Lecture, 2017
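To check that the RISC sequence forms the same address as the single CISC instruction, the address arithmetic can be mirrored step by step. This is a sketch with hypothetical function names; it ignores 64-bit wraparound and models only the address calculation, not the memory read.

```python
def cisc_address(rdi, rsi):
    """Address formed by 0x40(%rdi, %rsi, 4): disp + base + index * scale."""
    return 0x40 + rdi + rsi * 4

def risc_address(rdi, rsi):
    """Same address built with one simple ALU operation per step."""
    rbx = rsi            # mov %rsi, %rbx
    rbx = rbx << 2       # shl $2, %rbx      -> %rsi * 4
    rbx = rbx + rdi      # add %rdi, %rbx    -> %rdi + (%rsi*4)
    rbx = rbx + 0x40     # add $0x40, %rbx   -> 0x40 + %rdi + (%rsi*4)
    return rbx
```

The final `mov (%rbx), %rax` then performs the one memory access, so each RISC instruction does either an ALU operation or a memory operation, never both.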
12.12 A RISC Subset of x86
• Split mov instructions that access memory into separate instructions:
  – ld = Load/Read from memory
  – st = Store/Write to memory
• Limit ld & st instructions to use at most indirect w/ displacement
  – No ld 0x04(%rdi, %rsi, 4), %rax
    • Too much work
  – At most ld 0x40(%rdi), %rax or st %rax, 0x40(%rdi)
• Limit arithmetic & logic instructions to only operate on registers
  – No add (%rsp), %rax since this implicitly accesses (dereferences) memory
  – Only add %reg1, %reg2

    // 3 x86 memory read instructions
    mov (%rdi), %rax            // 1
    mov 0x40(%rdi), %rax        // 2
    mov 0x40(%rdi,%rsi), %rax   // 3

    // Equivalent load sequences
    ld 0x0(%rdi), %rax          // 1
    ld 0x40(%rdi), %rax         // 2
    mov %rsi, %rbx              // 3a
    add %rdi, %rbx              // 3b
    ld 0x40(%rbx), %rax         // 3c

    // 3 x86 memory write instructions
    mov %rax, (%rdi)            // 1
    mov %rax, 0x40(%rdi)        // 2
    mov %rax, 0x40(%rdi,%rsi)   // 3

    // Equivalent store sequences
    st %rax, 0x0(%rdi)          // 1
    st %rax, 0x40(%rdi)         // 2
    mov %rsi, %rbx              // 3a
    add %rdi, %rbx              // 3b
    st %rax, 0x40(%rbx)         // 3c

    // CISC instruction
    add %rax, (%rsp)

    // Equivalent RISC sequence with ld / st
    ld 0(%rsp), %rbx
    add %rax, %rbx
    st %rbx, 0(%rsp)
12.13 Developing a Processor Organization
[Figure: PC; I-Cache / I-MEM (Addr., Data); Registers (%rax, %rbx, etc., aka RegFile); ALU (Res., Zero); Cond. Codes; D-Cache / D-MEM (Addr., Data)]

Hardware components used by each instruction type:

  Step | ALU-Type                       | LD                            | ST                         | JE
       | add %rax,%rbx                  | ld 8(%rax),%rbx               | st %rbx, 8(%rax)           | je label/displacement
  1.   | PC: Addr. of Instruc           | PC: Addr. of Instruc          | PC: Addr. of Instruc       | PC: Addr. of Instruc
  2.   | I-Cache: Fetch Instruc         | I-Cache: Fetch Instruc        | I-Cache: Fetch Instruc     | I-Cache: Fetch Instruc
  3.   | Registers: Get %rax,%rbx       | Registers: Get %rax           | Registers: Get %rax / %rbx |
  4.   | ALU: Sum %rax+%rbx             | ALU: Sum %rax+8               | ALU: Sum %rax+8            | ALU: If cond=TRUE, PC = PC+disp.
  5.   | Registers: Save result to %rbx | D-Cache: Read data            | D-Cache: Write %rbx data   |
  6.   |                                | Registers: Save data to %rbx  |                            |
12.14 Processor Block Diagram
[Figure: five-stage datapath. Fetch: PC and I-Cache produce the instruction (machine code). Decode: decode logic and Registers (aka RegFile) produce control signals (e.g., ALU operation, Read/Write D-Cache, etc.) and operands. Exec.: ALU produces the ALU output (address or result) and the ZF, OF, CF, SF flags. Mem: D-Cache (Addr, Data). WB: data to write to the destination register.]
• Stage delays: Fetch 10 ns, Decode 10 ns, Exec. 10 ns, Mem 10 ns, WB 10 ns
• Clock Cycle Time = Sum of delay through worst case pathway = 50 ns
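The 50 ns clock cycle is the sum of the five stage delays, since in this organization an instruction must pass through the whole worst-case path within one cycle. A small sketch of that arithmetic follows; the dictionary and function names are hypothetical.

```python
# Stage delays in nanoseconds, as labeled on the block diagram
STAGE_DELAY_NS = {"Fetch": 10, "Decode": 10, "Exec": 10, "Mem": 10, "WB": 10}

def clock_cycle_ns(stage_delays):
    """Cycle time when one instruction traverses every stage in a single cycle."""
    return sum(stage_delays.values())

def clock_frequency_mhz(cycle_ns):
    """Frequency in MHz for a cycle time in ns (1000 / ns)."""
    return 1000 / cycle_ns
```

A 50 ns cycle corresponds to a 20 MHz clock.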
12.15 Processor Execution (add)
add %rax,%rdx [Machine Code: 48 01 c2]
[Figure: the block diagram from 12.14 (Fetch, Decode, Exec., Mem, WB) annotated for add]
• Registers supply the operands %rax and %rdx
• ALU computes %rax+%rdx
• Result is written to the destination register: %rdx = %rax+%rdx
12.16 Processor Execution (ld)
ld 0x40(%rbx),%rax [Machine Code: 48 8b 43 40]
[Figure: the block diagram from 12.14 (Fetch, Decode, Exec., Mem, WB) annotated for ld]
• Registers supply %rbx; the instruction supplies the displacement 0x40
• ALU computes addr = %rbx + 0x40
• D-Cache reads data at addr
• Data is written to the destination register: %rax = data
12.17 Processor Execution (st)
st %rax,0x40(%rbx) [Machine Code: 48 89 43 40]
[Figure: the block diagram from 12.14 (Fetch, Decode, Exec., Mem, WB) annotated for st]
• Registers supply %rbx and %rax; the instruction supplies the displacement 0x40
• ALU computes addr = %rbx + 0x40
• D-Cache writes %rax at addr
12.18 Processor Execution (je)
je L1 (disp. = 0x08) [Machine Code: 74 08]
[Figure: the block diagram from 12.14 (Fetch, Decode, Exec., Mem, WB) annotated for je]
• Condition codes shown: ZF = 1, OF = 0, CF = 1, SF = 0
• ALU adds the displacement 0x08 to the PC
• PC + 0x08 is fed back to the PC
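The four execution walkthroughs (add, ld, st, je) can be summarized in a small software model. This is a sketch under stated assumptions, not the course's simulator: the `step` function, the tuple encoding of instructions, and the `state` layout are all hypothetical; the PC counts instructions rather than bytes; a taken `je` sets PC = PC + disp as on the slide; and only ZF is tracked.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF

def step(state, instr):
    """Execute one instruction tuple, updating state in place.

    state = {"pc": int, "regs": dict, "mem": dict, "zf": int}
    """
    regs, mem = state["regs"], state["mem"]
    op = instr[0]
    next_pc = state["pc"] + 1                        # assumption: PC counts instructions
    if op == "add":                                  # ("add", src, dst)
        _, src, dst = instr
        result = (regs[src] + regs[dst]) & MASK64    # Exec: ALU sum
        regs[dst] = result                           # WB: save result
        state["zf"] = 1 if result == 0 else 0
    elif op == "ld":                                 # ("ld", disp, base, dst)
        _, disp, base, dst = instr
        addr = (regs[base] + disp) & MASK64          # Exec: ALU forms address
        regs[dst] = mem[addr]                        # Mem read, then WB
    elif op == "st":                                 # ("st", src, disp, base)
        _, src, disp, base = instr
        addr = (regs[base] + disp) & MASK64          # Exec: ALU forms address
        mem[addr] = regs[src]                        # Mem write, no WB
    elif op == "je":                                 # ("je", disp)
        _, disp = instr
        if state["zf"] == 1:                         # taken only if ZF is set
            next_pc = state["pc"] + disp
    else:
        raise ValueError("instruction type not modeled")
    state["pc"] = next_pc


# Usage: the four example instructions from the slides
state = {"pc": 0, "zf": 0,
         "regs": {"rax": 1, "rdx": 2, "rbx": 0x100},
         "mem": {0x140: 7}}
step(state, ("add", "rax", "rdx"))        # %rdx = %rax + %rdx
step(state, ("ld", 0x40, "rbx", "rax"))   # %rax = mem[%rbx + 0x40]
step(state, ("st", "rax", 0x40, "rbx"))   # mem[%rbx + 0x40] = %rax
step(state, ("je", 0x08))                 # not taken here, since ZF is 0
```

Each branch of `step` touches only the components its instruction type needs, which mirrors why ld uses every stage while je never reaches the D-Cache.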
12.19 PIPELINING
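12.20 Example
    for(i=0; i < 100; i++)
        C[i] = (A[i] + B[i]) / 4;
[Figure: a counter (Cntr) supplies i to a memory holding arrays A, B, and C; A[i] and B[i] are read, combined, and the result is stored to C]
• 10 ns per input set = 1000 ns total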
12.21 Pipelining Example
    for(i=0; i < 100; i++)
        C[i] = (A[i] + B[i]) / 4;
• Pipelining refers to insertion of registers to split combinational logic into smaller stages that can be overlapped in time (i.e., create an assembly line)

            Stage 1        Stage 2
  Clock 0   A[0] + B[0]
  Clock 1   A[1] + B[1]    (A[0] + B[0]) / 4
  Clock 2   A[2] + B[2]    (A[1] + B[1]) / 4
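The overlap in the clock-by-clock table can be reproduced with a short simulation. This is a minimal sketch: `pipelined` and `pipe_reg` are hypothetical names, inputs are assumed to be non-negative integers (so Python's `//` matches C's integer division), and the model counts clocks rather than nanoseconds because the slide gives no per-stage delay.

```python
def pipelined(A, B):
    """Two-stage pipeline: stage 1 adds, stage 2 divides by 4.

    Returns (C, clocks), where clocks is the number of clock cycles used.
    """
    n = len(A)
    C = []
    pipe_reg = None     # register between stage 1 and stage 2
    clocks = 0
    i = 0
    while len(C) < n:
        # Stage 2 consumes the sum captured at the previous clock edge
        if pipe_reg is not None:
            C.append(pipe_reg // 4)
        # Stage 1 computes the next sum during the same clock
        pipe_reg = A[i] + B[i] if i < n else None
        i += 1
        clocks += 1
    return C, clocks
```

For 100 input pairs this takes 101 clocks: one clock to fill the pipeline, then one result per clock. Since each stage contains only part of the original logic, each clock can be shorter than the 10 ns needed for the unsplit circuit.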
12.22 Need for Registers • Provides separation between combinational functions – Without registers, fast signals could “catch-up” to data values in the next operation stage