
Customized Computing for Power Efficiency

Jason Cong (cong@cs.ucla.edu), UCLA Computer Science Department, http://cadlab.cs.ucla.edu/~cong


  1. Customized Computing for Power Efficiency. Jason Cong, UCLA Computer Science Department.
     There Are Many Options to Improve Performance

  2. Past Alternatives: Frequency Scaling (Source: Shekhar Borkar, Intel)
     Current Alternatives: Parallelization (Source: Shekhar Borkar, Intel)

  3. Multi-core Processors
     • Sun UltraSPARC T2 microprocessor: 8 cores, 64 threads
     • Tilera TILE64 multi-core processor
     Warehouse of Computers
     • IBM BlueGene/L: No. 1 in the newest Top500

  4. But Power Remains a Limiting Factor
     Cost of computing
     • HW acquisition
     • Energy bill
     • Heat removal
     • Space
     • …
     Power Will Be the Driver for Acceptance of Customized Computing
     [Figure: parallelization versus customization. Source: Shekhar Borkar, Intel]

  5. UCLA Experience: Lithography Simulation Acceleration
     • Simulation of the optical imaging process
     • Computationally intensive and quite slow for full-chip simulation
     • Synthesized into a Stratix-II FPGA on the XDI platform using AutoPilot
     • Experiment results [FPGA'2008]
       - 15X speedup using a 5 by 5 partitioning over a 2.2 GHz Opteron with 4 GB RAM
       - Logic utilization around 25K ALUTs (8K of which are used in the interface framework rather than the design)
       - Power consumption less than 15 W in the FPGA, compared with 86 W in the Opteron 248
       - Close to 100X (5.8 x 15) improvement in energy efficiency
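The energy-efficiency figure on this slide follows from energy being power times runtime: the gain is the power ratio multiplied by the speedup. The sketch below redoes that arithmetic from the slide's numbers. The function name is hypothetical, and 15 W is taken as the FPGA power even though the slide only says "less than 15 W", so the result is a lower bound.

```python
def energy_efficiency_gain(speedup, cpu_power_w, fpga_power_w):
    """Energy = power x time, so the gain is (power ratio) x (speedup)."""
    power_ratio = cpu_power_w / fpga_power_w
    return power_ratio * speedup


# Numbers from the slide; 15 W is an upper bound on the FPGA power (assumption).
SPEEDUP = 15
CPU_POWER_W = 86
FPGA_POWER_W = 15

gain = energy_efficiency_gain(SPEEDUP, CPU_POWER_W, FPGA_POWER_W)
print(f"power ratio: {CPU_POWER_W / FPGA_POWER_W:.2f}x")
print(f"energy-efficiency gain: at least {gain:.0f}x")
```

With the 15 W bound the product comes to about 86X. The slide's 5.8 power ratio corresponds to an FPGA power slightly below 15 W, which is consistent with its "close to 100X" summary.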

  6. A Lot More Is Needed for Power-Efficient Customized Computing
     • More power-efficient programmable fabrics
       - Capability to do power gating, voltage and frequency scaling
     • A powerful, fully automated C/C++ to FPGA compiler
       - Taking full advantage of various power optimization options in a transparent way
     • Customization beyond just FPGA fabrics
       - Application-specific instruction-set processors (ASIP)
       - Application-specific processor networks (ASPN)
     • More power-efficient programmable (global) interconnects
       - E.g., RF-interconnects
     RF-Interconnects: Power-Efficient Programmable (Global) Interconnect Solution

  7. Limited RC Wire Bandwidth
     • @ 45nm CMOS technology
       - Data rate: 4 Gbit/s
       - f_T of 45nm CMOS can be as high as 240GHz
       - Baseband signal bandwidth only about 4GHz
       - 98.4% of available bandwidth is wasted
     • Open question: how to take advantage of the full bandwidth of modern CMOS?
     UCLA 90nm CMOS VCO at 324GHz (ISSCC 2008)
     [Figure: measured output spectrum of the 323.5GHz VCO, Pout (dBm) versus frequency (GHz). CMOS VCO designed by Frank Chang's group at UCLA, fabricated in a 90nm process.]
     CMOS voltage-controlled oscillator, measured with a subharmonic mixer and driven with an 80 GHz synthesizer local oscillator. The mixing frequency is f_VCO - 4*f_LO = f_IF, or f_VCO - 4*(80 GHz) = 3.5 GHz, yielding f_VCO = 323.5 GHz.
     [Figure: on-wafer VCO test setup at JPL]
     *Huang, D., LaRocca, T., Chang, M.-C. F., "324GHz CMOS Frequency Generator Using Linear Superposition Technique," IEEE International Solid-State Circuits Conference (ISSCC), 476-477 (Feb 2008), San Francisco, CA
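Both calculations on this slide are short enough to check directly. The sketch below assumes that "available bandwidth" means the 240 GHz f_T figure, which the slide implies but does not state outright; the function names are hypothetical. The second function applies the subharmonic-mixer relation f_VCO = 4*f_LO + f_IF quoted in the caption.

```python
def wasted_fraction(used_ghz, available_ghz):
    """Fraction of the available bandwidth left unused by the baseband signal."""
    return 1.0 - used_ghz / available_ghz


def vco_frequency_ghz(f_lo_ghz, f_if_ghz, harmonic=4):
    """Recover the VCO frequency from the measured IF: f_VCO = harmonic * f_LO + f_IF."""
    return harmonic * f_lo_ghz + f_if_ghz


# Slide values: 4 GHz baseband bandwidth against an f_T of up to 240 GHz
# (treating f_T as the available bandwidth is an assumption).
wasted = wasted_fraction(used_ghz=4.0, available_ghz=240.0)
print(f"wasted bandwidth: {wasted:.1%}")

# Slide values: 80 GHz local oscillator, 3.5 GHz intermediate frequency.
f_vco = vco_frequency_ghz(f_lo_ghz=80.0, f_if_ghz=3.5)
print(f"f_VCO = {f_vco} GHz")
```

Under this assumption the ratio gives roughly 98.3%, in line with the 98.4% quoted on the slide, and the mixer relation reproduces the 323.5 GHz reading.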

  8. Multiband RF-Interconnect
     [Figure: signal spectrum, showing signal power in each frequency band]
     • In TX, each mixer up-converts an individual baseband stream into a specific frequency band (or channel)
     • N different data streams (N=6 in the exemplary figure above) may transmit simultaneously on the shared transmission medium to achieve higher aggregate data rates
     • In RX, individual signals are down-converted by a mixer and recovered after a low-pass filter
     Advantages of RF-Interconnect (RF-I)
     • Latency: speed-of-light data transmission
     • Bandwidth: high aggregate data rate through simultaneous transmissions on multiple bands of RF-modulated signals
     • Area: avoids extensive use of repeaters
     • Energy: low overall energy per bit
     • Reconfigurability: efficient bidirectional and tunable communications via shared on/off-chip transmission lines or off-chip antennas
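The TX/RX description above is frequency-division multiplexing, which can be illustrated with a toy discrete-time model. This is a minimal sketch under stated assumptions, not the circuit from the slides: bits are sent as +1/-1 levels, each carrier completes a whole number of cycles per bit so the bands are orthogonal, and an integrate-and-dump sum over one bit period stands in for the low-pass filter. All names, carrier values, and bit patterns are hypothetical.

```python
import math

SAMPLES_PER_BIT = 64  # time resolution of the toy model (assumption)


def carrier(freq, n):
    """Carrier sample n, with freq given in whole cycles per bit period."""
    return math.cos(2.0 * math.pi * freq * n / SAMPLES_PER_BIT)


def transmit(streams, carrier_freqs):
    """Up-convert each bit stream onto its own carrier and sum on the shared line."""
    num_bits = len(streams[0])
    line = [0.0] * (num_bits * SAMPLES_PER_BIT)
    for bits, freq in zip(streams, carrier_freqs):
        for b, bit in enumerate(bits):
            level = 1.0 if bit else -1.0  # baseband symbol
            for n in range(SAMPLES_PER_BIT):
                line[b * SAMPLES_PER_BIT + n] += level * carrier(freq, n)
    return line


def receive(line, freq):
    """Down-convert with the same carrier, then integrate over each bit period."""
    bits = []
    for b in range(len(line) // SAMPLES_PER_BIT):
        acc = 0.0
        for n in range(SAMPLES_PER_BIT):
            acc += line[b * SAMPLES_PER_BIT + n] * carrier(freq, n)
        bits.append(1 if acc > 0 else 0)
    return bits


# N = 6 streams, as in the slide's figure; the bit patterns are made up.
streams = [
    [1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0],
    [0, 1, 0, 1], [1, 1, 1, 0], [0, 0, 0, 1],
]
carrier_freqs = [4, 6, 8, 10, 12, 14]  # distinct bands, in cycles per bit

line = transmit(streams, carrier_freqs)
recovered = [receive(line, f) for f in carrier_freqs]
print(recovered == streams)
```

Because the carriers are mutually orthogonal over a bit period, the contributions from the other five bands sum to approximately zero at each receiver, so all six streams share one line without interfering.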

  9. Simple RF-I Topology
     [Figure: NoC components (C), each with a Tx/Rx, attached to the RF-I transmission line bundle]
     • Four NoC components
     • Tunable Tx/Rx's
     • Arbitrary topologies
     • Arbitrary bandwidths
     One physical topology can be configured into many virtual topologies
     [Figure: virtual topologies labeled bus, multicast, fully connected, crossbar, pipeline/ring]
     RF-I for Multi-Core On-Chip Communication [HPCA'2008, MICRO'2008]
     • 10x10 mesh of pipelined routers
       - NoC runs at 2GHz
       - XY routing
     • 64 4GHz 3-wide processor cores
       - Labeled aqua
       - 8KB L1 data cache
       - 8KB L1 instruction cache
     • 32 L2 cache banks
       - Labeled pink
       - 256KB each
       - Organized as shared NUCA cache
     • 4 main memory interfaces
       - Labeled green
     • RF-I transmission line bundle
       - Black thick line spanning mesh
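The baseline mesh uses XY (dimension-ordered) routing: a packet first travels along the X dimension until it reaches the destination column, then along Y. The sketch below shows the idea on router coordinates; the (x, y) tuple representation and the function name are assumptions for illustration, not the simulator used in the cited papers.

```python
def xy_route(src, dst):
    """Return the routers visited from src to dst, moving in X first, then in Y."""
    x, y = src
    path = [(x, y)]
    step = 1 if dst[0] > x else -1
    while x != dst[0]:  # resolve the X offset first
        x += step
        path.append((x, y))
    step = 1 if dst[1] > y else -1
    while y != dst[1]:  # then resolve the Y offset
        y += step
        path.append((x, y))
    return path


path = xy_route((0, 0), (3, 2))
print(path)
print("hops:", len(path) - 1)

# Worst case on a 10x10 mesh: opposite corners.
print("corner-to-corner hops:", len(xy_route((0, 0), (9, 9))) - 1)
```

The corner-to-corner case takes 18 hops on a 10x10 mesh, which is the kind of long path a shared transmission line bundle spanning the mesh can bypass.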

  10. RF-I Logical Organization
      • Logically:
        - RF-I behaves as a set of N express channels
        - Each channel is assigned to a src, dest router pair (s, d)
      • Reconfigured by:
        - Remapping shortcuts to match the needs of different applications
      [Figure: two logical configurations, LOGICAL A and LOGICAL B]
      Power Savings
      • We can thin the baseline mesh links
        - From 16B…
        - …to 8B
        - …to 4B
      • RF-I makes up the difference in performance while saving overall power!
        - RF-I provides bandwidth where most necessary
        - Baseline RC wires supply the rest
      • Over 60% power reduction
      [Figure: mesh with link widths of 16, 8, and 4 bytes; annotation: A requires high bandwidth to communicate with B]
      A lot of potential for global interconnects in programmable fabrics
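One way to picture the express channels is as a small set of (s, d) shortcut pairs laid over the mesh, swapped out per application. The sketch below is a simplified model under stated assumptions: a shortcut costs one hop, a route uses at most one shortcut, and mesh hops are counted as Manhattan distance. The two shortcut sets are made up to mirror the LOGICAL A / LOGICAL B idea and are not taken from the slides.

```python
def manhattan(a, b):
    """Mesh hop count between two routers under dimension-ordered routing."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def hops(src, dst, shortcuts):
    """Fewest hops from src to dst using the mesh plus at most one shortcut."""
    best = manhattan(src, dst)  # mesh-only baseline
    for s, d in shortcuts:
        # Walk to the shortcut source, take it (1 hop), walk on to dst.
        best = min(best, manhattan(src, s) + 1 + manhattan(d, dst))
    return best


# Two hypothetical shortcut mappings on a 10x10 mesh.
logical_a = [((0, 0), (9, 9)), ((9, 0), (0, 9))]
logical_b = [((0, 5), (9, 5)), ((5, 0), (5, 9))]

src, dst = (0, 0), (9, 9)
print("mesh only:", hops(src, dst, []))
print("logical A:", hops(src, dst, logical_a))
print("logical B:", hops(src, dst, logical_b))
```

Remapping the shortcut list is the only change between configurations, which is the sense in which the same physical line bundle serves different applications.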
