


  1. EECS 252 Graduate Computer Architecture
     Lec 23 – Storage Technology
     David Culler
     Electrical Engineering and Computer Sciences
     University of California, Berkeley
     http://www.eecs.berkeley.edu/~culler
     http://www-inst.eecs.berkeley.edu/~cs252

     Classical DRAM Organization (square)
     [Figure: RAM cell array with bit (data) lines and word (row) select lines; each intersection represents a 1-T DRAM cell. A row decoder driven by the row address selects a word line; the column selector and I/O circuits, driven by the column address, deliver the data.]
     • Row and Column Address together:
       – Select 1 bit at a time

     Review: 1-T Memory Cell (DRAM)
     [Figure: 1-T cell with row select line and bit line.]
     • Write:
       – 1. Drive bit line
       – 2. Select row
     • Read:
       – 1. Precharge bit line to Vdd/2
       – 2. Select row
       – 3. Cell and bit line share charges
         » Very small voltage changes on the bit line
       – 4. Sense (fancy sense amp)
         » Can detect changes of ~1 million electrons
       – 5. Write: restore the value
     • Refresh
       – 1. Just do a dummy read to every cell.

     DRAM Capacitors: more capacitance in a small area
     • Trench capacitors:
       – Logic ABOVE capacitor
       – Gain in surface area of capacitor
       – Better Scaling properties
       – Better Planarization
     • Stacked capacitors:
       – Logic BELOW capacitor
       – Gain in surface area of capacitor
       – 2-dim cross-section quite small

     DRAM Read Timing
     [Figure: 256K x 8 DRAM with control inputs RAS_L, CAS_L, WE_L, OE_L, a 9-bit address input A and an 8-bit data port D.]
     • Every DRAM access begins at:
       – The assertion of the RAS_L
       – 2 ways to read: early or late v. CAS
     [Figure: DRAM Read Cycle Time. Waveforms for RAS_L, CAS_L, A (Row Address, Col Address, Junk), WE_L, OE_L and D (High Z, Junk, Data Out), with the Read Access Time and Output Enable Delay marked.]
     • Early Read Cycle: OE_L asserted before CAS_L
     • Late Read Cycle: OE_L asserted after CAS_L

     4 Key DRAM Timing Parameters
     • t_RAC: minimum time from RAS line falling to the valid data output.
       – Quoted as the speed of a DRAM when you buy it
       – A typical 4 Mbit DRAM has t_RAC = 60 ns
       – Speed of DRAM since on purchase sheet?
     • t_RC: minimum time from the start of one row access to the start of the next.
       – t_RC = 110 ns for a 4 Mbit DRAM with a t_RAC of 60 ns
     • t_CAC: minimum time from CAS line falling to valid data output.
       – 15 ns for a 4 Mbit DRAM with a t_RAC of 60 ns
     • t_PC: minimum time from the start of one column access to the start of the next.
       – 35 ns for a 4 Mbit DRAM with a t_RAC of 60 ns
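     As a rough illustration of how the four timing parameters combine, the sketch below estimates when the n-th word becomes valid under two simplified assumptions that are not on the slides: either every word needs a full row access (spaced t_RC apart), or all words sit in one open row and only the column access repeats (spaced t_PC apart). The struct and function names are hypothetical; the numbers are the 4 Mbit figures quoted above.

```c
#include <assert.h>

/* Timing parameters in nanoseconds, named after the slide's symbols. */
typedef struct {
    int t_rac_ns; /* RAS falling to valid data output */
    int t_rc_ns;  /* start of one row access to start of the next */
    int t_cac_ns; /* CAS falling to valid data output */
    int t_pc_ns;  /* start of one column access to start of the next */
} dram_timing;

/* Example part from the slide: 4 Mbit DRAM with t_RAC = 60 ns. */
static const dram_timing EXAMPLE_4MBIT = { 60, 110, 15, 35 };

/* Assumed model: word k starts a new row access at k * t_RC and its
 * data is valid t_RAC later. Returns the time the n-th word is valid. */
int full_cycle_read_ns(const dram_timing *t, int n_words)
{
    assert(t != 0 && n_words >= 1);
    return (n_words - 1) * t->t_rc_ns + t->t_rac_ns;
}

/* Assumed model: the row is opened once; the first word is valid after
 * t_RAC and each further word in that row follows t_PC later. */
int same_row_read_ns(const dram_timing *t, int n_words)
{
    assert(t != 0 && n_words >= 1);
    return t->t_rac_ns + (n_words - 1) * t->t_pc_ns;
}
```

     Under these assumptions, four words take 390 ns with a full row access each and 165 ns when they share a row, which is the gap that the page-mode slides later in the deck exploit.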

  2. Main Memory Performance
     [Figure: CPU and Memory on a time axis, with Access Time and Cycle Time marked.]
     • DRAM (Read/Write) Cycle Time >> DRAM (Read/Write) Access Time
       – about 2:1; why?
     • DRAM (Read/Write) Cycle Time:
       – How frequently can you initiate an access?
       – Analogy: A little kid can only ask his father for money on Saturday
     • DRAM (Read/Write) Access Time:
       – How quickly will you get what you want once you initiate an access?
       – Analogy: As soon as he asks, his father will give him the money
     • DRAM Bandwidth Limitation analogy:
       – What happens if he runs out of money on Wednesday?

     Increasing Bandwidth - Interleaving
     [Figure: Access Pattern without Interleaving. CPU and one Memory: Start Access for D1, D1 available, Start Access for D2.]
     [Figure: Access Pattern with 4-way Interleaving. CPU and Memory Banks 0 to 3: Access Bank 0, Access Bank 1, Access Bank 2, Access Bank 3, then we can access Bank 0 again.]

     Main Memory Performance
     • Simple:
       – CPU, Cache, Bus, Memory same width (32 bits)
     • Wide:
       – CPU/Mux 1 word; Mux/Cache, Bus, Memory N words (Alpha: 64 bits & 256 bits)
     • Interleaved:
       – CPU, Cache, Bus 1 word; Memory N Modules (4 Modules); example is word interleaved

     Main Memory Performance
     • Timing model
       – 1 to send address,
       – 4 for access time, 10 cycle time, 1 to send data
       – Cache Block is 4 words
     • Simple M.P. = 4 x (1 + 10 + 1) = 48
     • Wide M.P. = 1 + 10 + 1 = 12
     • Interleaved M.P. = 1 + 10 + 1 + 3 = 15

                Bank 0   Bank 1   Bank 2   Bank 3
     address       0        1        2        3
                   4        5        6        7
                   8        9       10       11
                  12       13       14       15

     Finding Bank Number and Address within a bank
     Problem: We want to determine the number of banks, N_b, to use and the number of words to store in each bank, W_b, such that:
     • given a word address x, it is easy to find the bank where x will be found, B(x), and the address of x within the bank, A(x).
     • for any address x, B(x) and A(x) are unique.
     • the number of bank conflicts is minimized.

     Avoiding Bank Conflicts
     • Lots of banks

         int x[256][512];
         for (j = 0; j < 512; j = j+1)
             for (i = 0; i < 256; i = i+1)
                 x[i][j] = 2 * x[i][j];

     • Even with 128 banks, since 512 is a multiple of 128, the word accesses conflict
     • SW: loop interchange or declaring array not power of 2 (“array padding”)
     • HW: Prime number of banks
       – bank number = address mod number of banks
       – address within bank = address / number of words in bank
       – modulo & divide per memory access with prime no. banks?
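     The conflict in the loop above can be checked by counting how many distinct banks the inner loop touches. The sketch below is illustrative only: it assumes one array element per word, an array that starts at word address 0, and the slide's rule bank number = address mod number of banks. The helper name banks_touched and the MAX_BANKS limit are hypothetical.

```c
#include <assert.h>

#define MAX_BANKS 1024 /* arbitrary limit for this sketch */

/* Count the distinct banks hit by n_accesses word accesses whose
 * addresses are 0, stride, 2*stride, ... with bank = address mod n_banks. */
int banks_touched(int stride_words, int n_accesses, int n_banks)
{
    unsigned char seen[MAX_BANKS] = {0};
    int count = 0;

    assert(stride_words >= 0 && n_accesses >= 0);
    assert(n_banks > 0 && n_banks <= MAX_BANKS);

    for (int k = 0; k < n_accesses; k++) {
        int bank = (int)(((long long)k * stride_words) % n_banks);
        if (!seen[bank]) {
            seen[bank] = 1;
            count++;
        }
    }
    return count;
}
```

     For x[256][512] walked down a column, the inner loop makes 256 accesses with a stride of 512 words. With 128 banks, banks_touched(512, 256, 128) is 1, so every access lands in the same bank. Padding each row to 513 words gives banks_touched(513, 256, 128) = 128, a prime bank count gives banks_touched(512, 256, 127) = 127, and interchanging the loops (stride 1, 512 accesses) gives banks_touched(1, 512, 128) = 128.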

  3. Finding Bank Number and Address within a bank
     Solution: We will use the following relation to determine the bank number for x, B(x), and the address of x within the bank, A(x):

         B(x) = x MOD N_b
         A(x) = x MOD W_b

     and we will choose N_b and W_b to be co-prime, i.e., there is no prime number that is a factor of both N_b and W_b (this condition is satisfied if we choose N_b to be a prime number that is equal to an integer power of two minus 1, with W_b a power of two).
     We can then use the Chinese Remainder Theorem to show that the pair B(x), A(x) is always unique.

     Fast Bank Number
     • Chinese Remainder Theorem
       As long as two sets of integers a_i and b_i follow these rules:

         b_i = x mod a_i,   0 ≤ b_i < a_i,   0 ≤ x < a_0 × a_1 × a_2 × …

       and a_i and a_j are co-prime if i ≠ j, then the integer x has only one solution (unambiguous mapping):
       – bank number = b_0, number of banks = a_0
       – address within bank = b_1, number of words in bank = a_1
       – N word address 0 to N-1, prime no. banks, words power of 2
     • 3 banks, N_b = 3, and 8 words per bank, W_b = 8.

                              Seq. Interleaved    Modulo Interleaved
     Bank Number:               0    1    2          0    1    2
     Address within Bank:  0    0    1    2          0   16    8
                           1    3    4    5          9    1   17
                           2    6    7    8         18   10    2
                           3    9   10   11          3   19   11
                           4   12   13   14         12    4   20
                           5   15   16   17         21   13    5
                           6   18   19   20          6   22   14
                           7   21   22   23         15    7   23

     Fast Page Mode Operation
     [Figure: DRAM array of N rows x N cols addressed by Row Address and Column Address, feeding an N x M “SRAM” row register with an M-bit output.]
     • Regular DRAM Organization:
       – N rows x N column x M-bit
       – Read & Write M-bit at a time
       – Each M-bit access requires a RAS / CAS cycle
     • Fast Page Mode DRAM
       – N x M “SRAM” to save a row
     • After a row is read into the register
       – Only CAS is needed to access other M-bit blocks on that row
       – RAS_L remains asserted while CAS_L is toggled
     [Figure: timing for the 1st to 4th M-bit access. RAS_L stays asserted, CAS_L toggles, and A carries one Row Address followed by four Col Addresses.]

     Fast Memory Systems: DRAM specific
     • Multiple CAS accesses: several names (page mode)
       – Extended Data Out (EDO): 30% faster in page mode
     • New DRAMs to address gap; what will they cost, will they survive?
       – RAMBUS: startup company; reinvent DRAM interface
         » Each Chip a module vs. slice of memory
         » Short bus between CPU and chips
         » Does own refresh
         » Variable amount of data returned
         » 1 byte / 2 ns (500 MB/s per chip)
       – Synchronous DRAM: 2 banks on chip, a clock signal to DRAM, transfer synchronous to system clock (66 - 150 MHz)
       – Intel claims RAMBUS Direct (16 b wide) is future PC memory
     • Niche memory or main memory?
       – e.g., Video RAM for frame buffers, DRAM + fast serial output

     DRAM History
     • DRAMs: capacity +60%/yr, cost –30%/yr
       – 2.5X cells/area, 1.5X die size in about 3 years
     • ‘98 DRAM fab line costs $2B
       – DRAM only: density, leakage v. speed
     • Rely on increasing no. of computers & memory per computer (60% market)
       – SIMM or DIMM is replaceable unit => computers use any generation DRAM
     • Commodity, second source industry => high volume, low profit, conservative
       – Little organization innovation in 20 years
     • Order of importance: 1) Cost/bit 2) Capacity
       – First RAMBUS: 10X BW, +30% cost => little impact

     DRAM Future: 1 Gbit+ DRAM

                        Mitsubishi       Samsung
     • Blocks           512 x 2 Mbit     1024 x 1 Mbit
     • Clock            200 MHz          250 MHz
     • Data Pins        64               16
     • Die Size         24 x 24 mm       31 x 21 mm
       – Sizes will be much smaller in production
     • Metal Layers     3                4
     • Technology       0.15 micron      0.16 micron
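     Returning to the modulo-interleaved mapping from the Fast Bank Number slide, the sketch below applies B(x) = x MOD N_b and A(x) = x MOD W_b and checks by brute force whether every word gets its own (bank, address) slot. The function names are hypothetical, and the exhaustive search in word_at is only for illustration; it is not the fast hardware scheme the slide has in mind.

```c
#include <assert.h>
#include <stdlib.h>

/* B(x): bank number under modulo interleaving. */
unsigned bank_of(unsigned x, unsigned n_banks)
{
    return x % n_banks;
}

/* A(x): address within the bank under modulo interleaving. */
unsigned addr_in_bank(unsigned x, unsigned words_per_bank)
{
    return x % words_per_bank;
}

/* Returns 1 if words 0 .. N_b*W_b-1 all map to distinct (bank, address)
 * slots, 0 if two words collide, -1 if memory could not be allocated. */
int mapping_is_unique(unsigned n_banks, unsigned words_per_bank)
{
    unsigned total = n_banks * words_per_bank;
    unsigned char *seen;
    int unique = 1;

    assert(n_banks > 0 && words_per_bank > 0);
    seen = (unsigned char *)calloc(total, 1);
    if (seen == NULL)
        return -1;

    for (unsigned x = 0; x < total && unique; x++) {
        unsigned slot = bank_of(x, n_banks) * words_per_bank
                      + addr_in_bank(x, words_per_bank);
        if (seen[slot])
            unique = 0; /* a second word landed in an occupied slot */
        seen[slot] = 1;
    }
    free(seen);
    return unique;
}

/* Inverse lookup by exhaustive search: the word stored at (bank, addr),
 * or -1 if no word in 0 .. N_b*W_b-1 maps there. */
long word_at(unsigned bank, unsigned addr,
             unsigned n_banks, unsigned words_per_bank)
{
    unsigned total = n_banks * words_per_bank;

    for (unsigned x = 0; x < total; x++) {
        if (bank_of(x, n_banks) == bank &&
            addr_in_bank(x, words_per_bank) == addr)
            return (long)x;
    }
    return -1;
}
```

     With N_b = 3 and W_b = 8 the mapping is unique and word_at reproduces the Modulo Interleaved column of the table, for example word_at(1, 0, 3, 8) = 16 and word_at(2, 0, 3, 8) = 8. With N_b = 4 and W_b = 8, which share the factor 2, words 0 and 8 both map to bank 0, address 0, so mapping_is_unique returns 0.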
