Highly-Associative Caches for Low-Power Processors

Motivation
- Caches use 30-60% of processor energy in embedded systems.
    - Example: 43% for StrongArm-1
- Many academic studies on caches:
    - [Albera, Bahar, ’98] – Power and performance trade-offs
    - [Amrutur, Horowitz, ’98, ’00] – Speed and power scaling
    - [Bellas, Hajj, Polychronopoulos, ’99] – Dynamic cache management
    - [Ghose, Kamble, ’99] – Power reduction through sub-banking, etc.
    - [Inoue, Ishihara, Murakami, ’99] – Way-predicting set-associative cache
    - [Kin, Gupta, Mangione-Smith, ’97] – Filter cache
    - [Ko, Balsara, Nanda, ’98] – Multilevel caches for RISC and CISC
    - [Wilton, Jouppi, ’94] – CACTI cache model
- Many industrial low-power processors use CAM (content-addressable memory):
    - ARM3 – 64-way set-associative – [Furber et al. ’89]
    - StrongArm – 32-way set-associative – [Santhanam et al. ’98]
    - Intel XScale – 32-way set-associative – ’01
- CAM: fast and energy-efficient
Talk Outline
- Structural Comparison
- Area and Delay Comparison
- Energy Comparison
- Related work
- Conclusion

Set-Associative RAM-tag Cache
[Figure: set-associative RAM-tag cache. Each way stores Tag, Status, and Data and has its own tag comparator; the address is split into Tag, Index, and Offset.]
- Not energy-efficient
    - All ways are read out
- Two-phase approach
    - More energy-efficient
    - 2X latency
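The trade-off on the RAM-tag slide can be sketched in a few lines of Python. This is a toy model, not the authors' simulator: the class name `RamTagCache`, the default sizes, and the idea of counting array reads and phases as a stand-in for energy and latency are all assumptions made for illustration.

```python
class RamTagCache:
    """Toy set-associative RAM-tag cache (all sizes are illustrative assumptions)."""

    def __init__(self, num_sets=64, ways=4, line_bytes=32):
        self.num_sets = num_sets
        self.ways = ways
        self.line_bytes = line_bytes
        # tags[set][way] holds the stored tag, or None for an invalid line
        self.tags = [[None] * ways for _ in range(num_sets)]

    def split(self, addr):
        """Split an address into (tag, index, offset)."""
        offset = addr % self.line_bytes
        index = (addr // self.line_bytes) % self.num_sets
        tag = addr // (self.line_bytes * self.num_sets)
        return tag, index, offset

    def fill(self, addr, way):
        """Install the line containing addr into the given way of its set."""
        tag, index, _ = self.split(addr)
        self.tags[index][way] = tag

    def lookup(self, addr, two_phase=False):
        """Return (hit, tag_reads, data_reads, phases) for one access."""
        tag, index, _ = self.split(addr)
        hit = tag in self.tags[index]
        tag_reads = self.ways  # every way's tag is read and compared
        if two_phase:
            # Phase 1 reads tags; phase 2 reads only the matching data way.
            data_reads = 1 if hit else 0
            phases = 2
        else:
            # Conventional access: all data ways are read in parallel with tags.
            data_reads = self.ways
            phases = 1
        return hit, tag_reads, data_reads, phases


cache = RamTagCache()
cache.fill(0x1234, way=2)
print(cache.lookup(0x1234))                  # parallel access
print(cache.lookup(0x1234, two_phase=True))  # two-phase access
```

In this model the two-phase access cuts data-array reads from `ways` to at most one, at the cost of a second sequential phase, which mirrors the "more energy-efficient, 2X latency" bullets.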
Set-Associative RAM-tag Sub-bank
[Figure: RAM-tag cache sub-bank. Labels: address decoder, global wordline (gwl), local wordlines (lwl), Tag SRAM cells, Data SRAM cells, offset decoders, sense amps, tag comparator, addr and offset inputs, I/O bus.]
- Not energy-efficient
    - All ways are read out
- Two-phase approach
    - More energy-efficient
    - 2X latency
- Sub-banking
- 1 sub-bank = 1 way
- Low-swing bitlines
    - Only for reads; writes are performed full-swing
- Wordline gating

CAM-tag Cache
[Figure: CAM-tag cache. Each entry stores Tag, Status, and Data and produces its own HIT? signal. Labels: Word, Tag, Bank, Offset.]
- Only one sub-bank activated
- Associativity within sub-bank
- Easy to implement high associativity
CAM-tag Cache Sub-bank
[Figure: CAM-tag cache sub-bank. Labels: global wordline (gwl), local wordlines (lwl), CAM tag array, SRAM cells, offset decoders, sense amps, tag and offset inputs, I/O bus.]
- Only one sub-bank activated
- Associativity within sub-bank
- Easy to implement high associativity

CAM Functionality and Energy Usage
- CAM energy dissipation
    - Search lines
    - Match lines
    - Drivers
[Figure: CAM cell with separate write/search lines and low-swing match line. Labels: SBit, SBit_b, Bit, Bit_b, WL, SRAM, XOR, match, Match, Mismatch. The transistor count of the cell is illegible in the source.]
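The CAM-tag lookup described on these slides can be modelled as follows. This is a minimal sketch under stated assumptions: the class name `CamTagCache`, the default sizes, and the placement of the bank bits directly above the line offset are illustrative choices, not details taken from the slides.

```python
class CamTagCache:
    """Toy CAM-tag cache: one sub-bank is selected, then searched by tag."""

    def __init__(self, num_banks=8, ways=32, line_bytes=32):
        self.num_banks = num_banks
        self.ways = ways
        self.line_bytes = line_bytes
        # banks[bank][row] holds the tag stored in that CAM row, or None
        self.banks = [[None] * ways for _ in range(num_banks)]

    def split(self, addr):
        """Split an address into (tag, bank, offset); field order is assumed."""
        offset = addr % self.line_bytes
        bank = (addr // self.line_bytes) % self.num_banks
        tag = addr // (self.line_bytes * self.num_banks)
        return tag, bank, offset

    def fill(self, addr, row):
        """Install the line containing addr into the given row of its sub-bank."""
        tag, bank, _ = self.split(addr)
        self.banks[bank][row] = tag

    def lookup(self, addr):
        """Return (hit, matching_row, data_reads) for one access."""
        tag, bank, _ = self.split(addr)
        # The tag is broadcast on the search lines of the selected sub-bank;
        # each row's match line stays asserted only if its stored tag matches.
        match_lines = [stored == tag for stored in self.banks[bank]]
        hit = any(match_lines)
        row = match_lines.index(True) if hit else None
        # Only the matching row's wordline fires, so at most one data read.
        data_reads = 1 if hit else 0
        return hit, row, data_reads


cache = CamTagCache()
cache.fill(0x1234, row=5)
print(cache.lookup(0x1234))  # hit in row 5 of the selected sub-bank
print(cache.lookup(0x1334))  # same sub-bank, different tag: miss
```

Associativity here is simply the number of CAM rows per sub-bank, which is why raising it does not add data-array reads in this model: only the selected sub-bank is searched and only a matching row is read.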
CAM-tag Cache Sub-bank Layout
[Figure: layout of a cache sub-bank implemented in a CMOS technology, showing the RAM array and the CAM array. The sub-bank capacity, feature size, and array dimensions are illegible in the source.]
- 10% area overhead over RAM-tag cache

Delay Comparison
[Figure: RAM-tag cache critical path. Labels: Index bits, Global wordline decoding (gwl), Local wordline decoding (lwl), Decoded offset, Tag bits, Tag readout, Tag comp., Data readout, Data out.]
[Figure: CAM-tag cache critical path. Labels: Tag bits, Tag bits broadcasting, Local wordline decoding, gwl, lwl, Tag comp., Decoded offset, Data readout, Data out.]
Hit Energy Comparison
[Figure: bar chart of hit energy per access for the cache, in pJ (y-axis, 0 to 450), versus associativity and implementation (x-axis: 1-way RAM, 2-way RAM, 4-way RAM, 8-way RAM, 8-way CAM, 16-way CAM, 32-way CAM). Series: LZW, ijpeg, pegwit, perl, m88ksim, gcc, Avg.]

Miss Rate Results
[Figure: six panels of miss rate versus associativity (1-way to 64-way) for 8KB and 16KB caches. Panels: LZW, pegwit, ijpeg, perl, gcc, m88ksim.]
Total Access Energy (pegwit)
- Pegwit: high miss rate for high associativity
[Figure: total energy per access for the cache, in pJ (y-axis, 0 to 2500), versus miss energy expressed in multiples of the read access energy (x-axis: 32X, 64X, 128X, 256X, 512X, 1024X; the read width in bits is illegible in the source). Series: 1-RAM, 2-RAM, 4-RAM, 8-RAM, 8-CAM, 16-CAM, 32-CAM.]

Total Access Energy (perl)
- Perl: very low miss rate for high associativity
[Figure: total energy per access for the cache, in pJ (y-axis, 0 to 500), versus miss energy expressed in multiples of the read access energy (x-axis: 32X, 64X, 128X, 256X, 512X, 1024X). Series: 1-RAM, 2-RAM, 4-RAM, 8-RAM, 8-CAM, 16-CAM, 32-CAM.]
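One plausible reading of how the total-access-energy plots combine hit energy and miss rate is sketched below. The formula (hit energy plus miss rate times miss energy) is an assumption, not something stated on the slides, and the function names and all numeric inputs are made up for illustration; none of the values are measurements from the talk.

```python
# Miss-energy multiples taken from the x-axis of the plots.
MISS_MULTIPLES = [32, 64, 128, 256, 512, 1024]


def total_access_energy(hit_energy_pj, miss_rate, read_energy_pj, miss_multiple):
    """Assumed model: E_total = E_hit + miss_rate * (multiple * E_read)."""
    miss_energy_pj = miss_multiple * read_energy_pj
    return hit_energy_pj + miss_rate * miss_energy_pj


def sweep(hit_energy_pj, miss_rate, read_energy_pj):
    """Total energy per access at each miss-energy multiple."""
    return [
        total_access_energy(hit_energy_pj, miss_rate, read_energy_pj, m)
        for m in MISS_MULTIPLES
    ]


# Hypothetical inputs only: a low-associativity design with cheap hits but a
# higher miss rate, versus a high-associativity design with the opposite mix.
low_assoc = sweep(hit_energy_pj=100.0, miss_rate=0.05, read_energy_pj=50.0)
high_assoc = sweep(hit_energy_pj=150.0, miss_rate=0.01, read_energy_pj=50.0)

for multiple, lo, hi in zip(MISS_MULTIPLES, low_assoc, high_assoc):
    print(f"{multiple:>5}X  low-assoc={lo:8.1f} pJ  high-assoc={hi:8.1f} pJ")
```

Under this model, the more expensive a miss becomes, the more a lower miss rate outweighs a higher hit energy, which is the qualitative trend the two benchmark slides contrast.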
Other Advantages of CAM-tag
- Hit signal generated earlier
- Simplifies pipelines
- Simplified store operation
- Wordline only enabled during a hit
- Stores can happen in a single cycle
- No write buffer necessary

Related Work
- CACTI and CACTI2
    - [Wilton and Jouppi, ’94], [Reinman and Jouppi, ’99]
    - Accurate delay and energy estimates
        - Results within 10%
    - Energy estimate not suited for low-power designs
    - Typical low-power features not included in CACTI:
        - Sub-banking
        - Low-swing bitlines
        - Wordline gating
        - Separate CAM search lines
        - Low-swing match lines
    - CACTI energy estimate is 10X greater than our model for one CAM-tag cache sub-bank
        - Our results closely agree with [Amrutur and Horowitz, ’98]
Conclusion
- CAM tags: high performance and low power
- Energy consumption of 32-way CAM < 2-way RAM
- Easy to implement highly-associative tags
- Low area overhead (10%)
- Comparable access delay
- Better CPI by reducing miss rate

Thank You!