A Low-Latency, Energy-Efficient L1 Cache Based on a Self-Timed Pipeline

Louis-Charles Trudeau 1, Ghyslain Gagnon 1, François Gagnon 1, Claude Thibeault 1, Thomas Awad 2, Doug Morrissey 2
1 École de technologie supérieure, Montréal, Canada
2 Octasic Inc., Montréal, Canada

21st IEEE International Symposium on Asynchronous Circuits and Systems, 2015
Plan

Introduction
  Problematic
  Motivations
  Scope of Work

Cache Implementation
  Architecture and Organization
  Operation

Self-Timed Pipeline Design
  Design Guidelines
  Pipeline Control
  Pipeline Operation

Performance Results

Summary
Research Program

Objective: adapt Octasic's power-efficient asynchronous architecture to a general-purpose processor (ARMv7-A).

Collaborators
Problematic

The current architecture separates the asynchronous CPU from the synchronous L1 memory.
◮ 2-cycle synchronization penalty.
◮ Energy efficiency is suboptimal.

[Figure: block diagram of the Execution Sub-System (execution units 0–7, instruction bus, switch, crossbar, register file) and the Memory Sub-System (program counter, instruction fetch, instruction decoder, branch predictor, load/store unit, L1 instruction memory, L1 data memory).]
Motivations

This work focuses on improving the L1 memory access.

Why Go Asynchronous?
◮ No balanced clock trees. Clocks are point-to-point and skew-insensitive.
◮ No major critical path due to frequency constraints. Fewer large/leaky gates.
◮ Less complex pipeline structure. Only neighboring stages are connected.
Scope of Work

Design an asynchronous cache based on a self-timed pipeline.

Objectives
1. Mitigate CPU ↔ L1 memory access latency.
2. Reduce the average memory access time.
3. Improve the cache energy efficiency.
4. Push the synchronization barrier out to the L2 memory.
L1 Instruction Cache Design

Dual-fetch, 32 kB, 4-way set-associative phased cache.

Synchronous Cache
◮ 5-stage pipeline (hit).
◮ Pipeline stall on miss.
◮ 2-cycle pipeline reinjection following cache fill.

Asynchronous Cache
◮ 4-stage pipeline (hit).
◮ Single-stage stall on miss.
◮ Resource arbitration for concurrent cache fill.

Integration in an ARM-like processor ⇒ Dhrystone & CoreMark (armcc-compiled).
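To make the organization concrete, the sketch below decomposes an address for a 32 kB, 4-way set-associative cache. Only the capacity and associativity come from the slide; the 32-byte line size and the split_address helper are illustrative assumptions.

# Hypothetical geometry: only the 32 kB capacity and 4-way associativity come
# from the slide; the 32-byte line size is an assumed value for illustration.
CACHE_BYTES = 32 * 1024
WAYS        = 4
LINE_BYTES  = 32                                     # assumption
SETS        = CACHE_BYTES // (WAYS * LINE_BYTES)     # 256 sets

OFFSET_BITS = LINE_BYTES.bit_length() - 1            # 5 bits
INDEX_BITS  = SETS.bit_length() - 1                  # 8 bits

def split_address(addr: int):
    """Split a byte address into (tag, set index, line offset)."""
    offset = addr & (LINE_BYTES - 1)
    index  = (addr >> OFFSET_BITS) & (SETS - 1)
    tag    = addr >> (OFFSET_BITS + INDEX_BITS)
    return tag, index, offset

print(split_address(0x00001F40))   # -> (0, 250, 0)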
Cache Pipeline

Shared Resources
◮ Tag Memory
◮ Data Memory
◮ (L2 Memory)

Tasks Partitioning
◮ 6 pipeline stages.
◮ Two-phase handshake protocol.

[Figure: cache pipeline block diagram — PC input, Tag Read, Fwd Addr, Data Read, and Data Out stages delivering Inst[1:0]; Tag RAMs and Data RAMs (4 ways, plus data FFs); tag comparison; Tag Write and Data Write driven by a write controller and the L1/L2 FIFO on the path to L2.]
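The bullets above mention a two-phase handshake protocol; the following behavioural sketch (a software analogue, not the authors' gate-level control) shows the transition-signalling idea: each toggle of req announces one new token, and the sender must wait until ack has toggled to match before sending again. The TwoPhaseChannel class and its method names are illustrative, not from the paper.

# Behavioural model of a two-phase (transition-signalling) handshake between
# neighbouring pipeline stages. Software analogue only; names are illustrative.

class TwoPhaseChannel:
    def __init__(self):
        self.req = 0        # toggled by the sender; one transition = one token
        self.ack = 0        # toggled by the receiver once the token is consumed
        self.data = None

    def send(self, value):
        # The sender may fire only when req == ack (previous token consumed).
        assert self.req == self.ack, "previous transfer not yet acknowledged"
        self.data = value
        self.req ^= 1       # a transition, not a level, means "data valid"

    def receive(self):
        # The receiver may fire only when req != ack (a token is pending).
        assert self.req != self.ack, "no pending transfer"
        value = self.data
        self.ack ^= 1       # the matching transition acknowledges consumption
        return value

channel = TwoPhaseChannel()
channel.send("fetch request for PC 0x1F40")
print(channel.receive())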
Cache Operation

Pipeline Stages
◮ Tag Read
◮ Forward Address
◮ Tag & Data Write
◮ Data Read
◮ Data Output

[Figure: cache pipeline block diagram annotated with the hit path (Tag Read and comparison followed by Data Read and Data Out) and the miss path (toward L2 through the L1/L2 FIFO and write controller, with the fill returning via the Tag and Data Write stages).]
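As a rough software analogue of this operation (tags are read and compared before the data array is touched; a miss forwards the address toward L2 and the returned line is written back through the Tag & Data Write stage), here is a hedged sketch. The PhasedCache class, the l2_read callback, and the random replacement policy are illustrative assumptions, not details from the slides.

import random

# Toy model of a phased cache lookup: tag phase first, data phase only on a
# hit, and a fill from L2 on a miss.

class PhasedCache:
    def __init__(self, sets=256, ways=4, line_bytes=32):
        self.sets, self.ways, self.line_bytes = sets, ways, line_bytes
        self.tags = [[None] * ways for _ in range(sets)]
        self.data = [[None] * ways for _ in range(sets)]

    def lookup(self, addr, l2_read):
        offset = addr % self.line_bytes
        index = (addr // self.line_bytes) % self.sets
        tag = addr // (self.line_bytes * self.sets)

        # Phase 1: Tag Read and comparison (the data array is not touched yet).
        for way in range(self.ways):
            if self.tags[index][way] == tag:
                # Phase 2 on a hit: Data Read -> Data Out.
                return self.data[index][way][offset], "hit"

        # Miss: forward the address to L2, then Tag & Data Write (cache fill).
        line = l2_read(addr - offset, self.line_bytes)
        victim = random.randrange(self.ways)          # assumed replacement policy
        self.tags[index][victim] = tag
        self.data[index][victim] = line
        return line[offset], "miss"

# Example with a fake L2 that returns a synthetic byte pattern:
l2 = lambda base, n: bytes((base + i) & 0xFF for i in range(n))
cache = PhasedCache()
print(cache.lookup(0x1F40, l2))    # first access misses and fills the line
print(cache.lookup(0x1F40, l2))    # the same address now hits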