A Memory System Design Framework: Creating Smart Memories
Amin Firoozshahian, Alex Solomatnikov (Hicamp Systems Inc.)
Ofer Shacham, Zain Asgar, Stephen Richardson, Christos Kozyrakis, Mark Horowitz (Stanford University)
http://www.c2s2.org

An Era of Chip-Multiprocessors…
- Single-thread performance scaling has stopped
- More processor cores on the same die
- Claim:
  - Scale performance
  - Keep design complexity constant
[Figures: IBM Cell, Sun Rock, Intel Nehalem]

Looking a Little More Closely
[Figure: Sun Rock]

Reality…
- Replicated cores
- Incredibly complicated memory system
  - Large amounts of logic
- Innovation is in the memory system
  - Transactions, streaming, fast synchronization, security, etc.
  - Never exactly the same
- Where all the bugs are!

ISA for Memory Systems
- Can we regularize the memory system hardware?
  - “Program” it rather than “Design” it?
- Benefits:
  - Reduce design time
  - Patch errors
  - Run-time tuning
- How can we do this?

Shared Memory System
- Resources:
  - Local memory
    - Data, state bits
  - Interconnect
  - Controllers
- Operations:
  - Probing state bits
  - Track requests
  - Communication
  - Data movements (spill / refill)
[Figure: two processors, each with a cache ($) and a cache controller, connected through an interconnect to memory; labels: miss, Msg]

Streaming Memory System
- Resources:
  - Local memory
  - Interconnect
  - Controllers
- Operations:
  - Communication
  - Data movements
  - Track outstanding transfers
[Figure: processors, each with a local memory and a DMA engine, connected through an interconnect to memory]

Transactional Memory System
- Resources
  - Local memory
    - More state bits
  - Interconnect
  - Controllers
- Operations
  - Data movements
  - State checks / updates
  - Communication
[Figure: two processors, each with a cache ($), an address FIFO, and a commit controller, connected through an interconnect to memory]

Commonalities
- Same resources and operations
- Different in:
  - How the operations are sequenced
  - Interpretation of state bits
- We need:
  - Flexible local storage and interconnect
  - Programmable controllers

Local Memories
- Programmable memory mat
  - Data array
  - State bits
  - PLA logic
  - Comparator
- Accessed by
  - Address, Opcode
- Returns
  - Data, state, compare result
[Figure: memory mat with State and Data arrays, Cmp and Update blocks, Address and Opcode inputs]
[K. Mai et al., “Architecture and Circuit Techniques for a Reconfigurable Memory Block,” IEEE International Solid-State Circuits Conference, February 2004]
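
The memory mat interface listed above (address and opcode in; data, state, and compare result out) can be illustrated with a small software model. This is only a sketch under stated assumptions: the class name `MemoryMat`, the opcode names, the `VALID` bit, and the idea of representing the PLA as a table of state-update functions are all hypothetical and are not taken from the slides or from the cited paper.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional


@dataclass
class MatResult:
    data: int      # word read from the data array
    state: int     # state bits read alongside it (before any update)
    match: bool    # comparator output: stored data == compare value


class MemoryMat:
    """Toy model of a programmable memory mat (hypothetical names throughout)."""

    def __init__(self, entries: int,
                 update_rules: Dict[str, Callable[[int, bool], int]]):
        self.data = [0] * entries
        self.state = [0] * entries
        # Stand-in for the PLA: opcode -> new_state = f(old_state, match)
        self.update_rules = update_rules

    def access(self, address: int, opcode: str, compare_value: int = 0,
               write_data: Optional[int] = None) -> MatResult:
        old_data = self.data[address]
        old_state = self.state[address]
        match = old_data == compare_value
        if write_data is not None:
            self.data[address] = write_data
        rule = self.update_rules.get(opcode)
        if rule is not None:
            self.state[address] = rule(old_state, match)
        return MatResult(old_data, old_state, match)


# Hypothetical configuration: the mat used as a tag array with one valid bit.
VALID = 0b1
rules = {
    "probe": lambda state, match: state,               # read only
    "fill": lambda state, match: state | VALID,        # mark entry valid
    "invalidate": lambda state, match: state & ~VALID, # clear valid bit
}

tags = MemoryMat(entries=4, update_rules=rules)
tags.access(2, "fill", write_data=0x1A)
hit = tags.access(2, "probe", compare_value=0x1A)
miss = tags.access(2, "probe", compare_value=0x2B)
```

Loading a different `rules` table would let the same storage interpret its state bits differently, which is the kind of flexibility the slides ask of local storage.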

Programmable Controllers
- Use an off-the-shelf processor?
  - FLASH, Typhoon, etc.
  - Too slow
  - All the way to the L1 cache interface
- Our approach:
  - Micro-coded engines (functional units)
  - Each class of operations in a separate engine

Programming
- A set of subroutines
  - A set of basic operations
  - Executed in a functional unit
  - Each one calls next
- Link subroutines to each other
[Figure: a message (Msg) passing through Unit 1, Unit 2, and Unit 3]
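
The programming model above, where each subroutine runs in one functional unit and then calls the next, might be sketched as follows. The unit names echo the Tracking, State Update, and Data Movement blocks named on the Organization slide, but the subroutine names, the message fields, and the particular three-step sequence are assumptions made for illustration and do not reproduce the actual Smart Memories microcode.

```python
from collections import deque


# Hypothetical subroutines. Each performs a few basic operations on the
# message and returns (next_unit, next_subroutine, message), or None when done.
def tracking_read_miss(msg):
    msg["mshr_allocated"] = True
    return ("state_update", "check_tags", msg)


def state_check_tags(msg):
    msg["victim_dirty"] = False
    return ("data_movement", "refill_line", msg)


def data_refill_line(msg):
    msg["line_filled"] = True
    return None


# The "program": one subroutine table per functional unit.
UNITS = {
    "tracking": {"read_miss": tracking_read_miss},
    "state_update": {"check_tags": state_check_tags},
    "data_movement": {"refill_line": data_refill_line},
}


def run(units, unit, subroutine, msg):
    """Follow the chain of linked subroutines and record the path taken."""
    trace = []
    work = deque([(unit, subroutine, msg)])
    while work:
        unit, subroutine, msg = work.popleft()
        trace.append((unit, subroutine))
        nxt = units[unit][subroutine](msg)
        if nxt is not None:
            work.append(nxt)
    return trace


message = {"type": "read_miss", "address": 0x40}
trace = run(UNITS, "tracking", "read_miss", message)
```

Changing the memory model then amounts to replacing the tables in `UNITS` and the links between subroutines, not the dispatcher itself.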

Microarchitecture
- A small pipeline
- Configuration (“program”) memories
  - Horizontal micro-code
  - Decide what to do
  - Decide how to proceed
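
A horizontal micro-code word typically carries many control fields side by side. The sketch below splits such a word into the two roles the slide names: fields that decide what to do, and fields that decide how to proceed. The field names, the condition names, and the three-word program are hypothetical; the actual Smart Memories micro-word format is not given in the slides.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class MicroWord:
    actions: Tuple[str, ...]  # "what to do": control signals asserted together
    cond: str                 # "how to proceed": condition selecting the next word
    next_if_true: int
    next_if_false: int        # -1 marks the end of the subroutine


def execute(program: List[MicroWord], conditions: Dict[str, bool],
            start: int = 0) -> List[Tuple[str, ...]]:
    """Step through the configuration memory, collecting issued actions."""
    issued = []
    pc = start
    while pc != -1:
        word = program[pc]
        issued.append(word.actions)
        taken = conditions.get(word.cond, False)
        pc = word.next_if_true if taken else word.next_if_false
    return issued


# Hypothetical configuration memory contents for a cache read.
PROGRAM = [
    MicroWord(("read_tag", "read_state"), "hit", 1, 2),
    MicroWord(("read_data", "reply_to_processor"), "always", -1, -1),
    MicroWord(("allocate_mshr", "send_miss_request"), "always", -1, -1),
]

on_hit = execute(PROGRAM, {"hit": True})
on_miss = execute(PROGRAM, {"hit": False})
```

Because every word names its own successors, no separate sequencing logic is needed beyond the condition select, which fits a small pipeline.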

Organization
[Figure: controller block diagram. Blocks: Tracking, State Update, Data Movement, Line Buffers, MSHR, USHR, DMA, Interrupt, Processor Interface (to/from processors), Network Interface (to/from network); connections to/from local storages]

Read Miss Example
[Figure: the controller block diagram annotated with the path of a read miss. Step labels: Read Miss, Access Tags, Access Data, WB / Miss, Evict Line, Line Read, Spill, Miss]

Programming Complexity
- Cache Coherence
  - Message types received by controller: 6
    - From processor: Cache miss, Upgrade miss, Prefetch
    - From network: Coherence request, Refill, Upgrade
  - Subroutine types in Tracking unit: 11
- Streaming
  - Message types: 5
    - Direct access, Gather, Scatter, Gather reply, Scatter ack.
  - Subroutine types in Tracking unit: 9

Smart Memories
- 8-core CMP system
- ST 90nm-GP CMOS technology
- 5.5 ns cycle time (181 MHz)
- 2.9M gates, 55M transistors
[Figure: die photo, 7.77 mm × 7.77 mm]

Status
- System bring-up
- System configuration
- JTAG tests
- Coherent shared memory tests
- Transactional tests (TCC)
- Streaming tests
- More testing in progress
- Planning for a 32-processor system
[Figure: Test Chip]

Evaluation
- Comparison with a hardwired controller
  - But which one? You would claim I am cheating!
- Compare with an “ideal” controller
  - Assume controller actions occur in zero time
  - Account for external actions
    - Data read/write
    - Message send/receive
  - Gives an upper bound
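
One way to read the "ideal" controller comparison is as a latency model in which each step of an operation has a controller-internal cost and an external cost (data read/write, message send/receive), and the ideal controller sets the internal cost to zero. The sketch below follows that reading. The step costs are invented for illustration and are not measurements from the talk, and the overhead definition (real minus ideal, relative to ideal) is an assumption, since the slides do not state the formula.

```python
from typing import List, Tuple

# Each step: (controller_cycles, external_cycles)
Step = Tuple[int, int]


def latency(steps: List[Step], ideal: bool = False) -> int:
    """Total cycles; an ideal controller performs its own actions in zero time."""
    return sum((0 if ideal else ctrl) + ext for ctrl, ext in steps)


def overhead_percent(real_cycles: float, ideal_cycles: float) -> float:
    """Assumed definition: extra time of the real controller relative to ideal."""
    if ideal_cycles <= 0:
        raise ValueError("ideal_cycles must be positive")
    return 100.0 * (real_cycles - ideal_cycles) / ideal_cycles


# Hypothetical operation: tag check, remote fetch, refill.
steps = [(1, 4), (1, 10), (1, 5)]

real = latency(steps)                 # controller cycles included
ideal = latency(steps, ideal=True)    # external actions only
overhead = overhead_percent(real, ideal)
```

Because no hardwired controller can act in less than zero time, the overhead measured against this ideal bounds the overhead against any hardwired design from above.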

Average Read Latency
[Figure: average read latency in cycles (axis 0 to 9) for a 32-processor system, real controllers vs. ideal controllers. Benchmarks: FFT, Barnes, FMM, 179.art, Bitonic Sort, MP3D, MPEG2 Enc, grouped under Coherent Shared Memory, Streaming, and Transactions]

Execution Time
- Total average overhead: 15%
[Figure: bar chart of average overhead (%), axis 0 to 30, grouped under Coherent Shared Memory, Streaming, and Transactions. Bar labels: 24.29, 20.03, 14.51, 14.14, 10.64, 8.33, 7.58, 6.93, 1.88]

Conclusion
- Strong similarity between memory systems
  - Common resources and operations
- A framework for memory systems design
  - Generate specific “instances”
- Modest performance overhead
  - Compared to ideal systems