
LOW-POWER, HIGH-PERFORMANCE, RECONFIGURABLE PROCESSOR USING - PowerPoint PPT Presentation



  1. LOW-POWER, HIGH-PERFORMANCE, RECONFIGURABLE PROCESSOR USING SINGLE-FLUX-QUANTUM CIRCUITS Naofumi Takagi Graduate School of Informatics Kyoto University Kyoto 606-8501, Japan takagi@i.kyoto-u.ac.jp

  2. Our Team. Takagi Group (Kyoto University): Prof. N. Takagi, Prof. K. Takagi. Murakami Group (Kyushu University): Prof. K. Murakami, Prof. K. Inoue, Prof. H. Honda. Yoshikawa Group (Yokohama National University): Prof. N. Yoshikawa, Prof. Y. Yamanashi. Akaike Group (Nagoya University): Prof. H. Akaike, Prof. A. Fujimaki, Prof. M. Tanaka. Nagasawa Group (ISTEC-SRL: Superconductivity Research Laboratory, International Superconductivity Technology Center): Mr. S. Nagasawa, Dr. M. Hidaka.

  3. Aim of the research. Developing basic technologies for energy-efficient, high-performance computers, e.g., a 10 TFLOPS desk-side computer, using superconducting single-flux-quantum (SFQ) circuits and adopting the processor architecture called 'large-scale reconfigurable data-paths (RDPs).'

  4. A 10 TFLOPS Computer. With technologies as of 2006: a parallel computer in 90 nm CMOS. With SFQ (0.5 ~ 0.35 µm): an LSRDP, for which the basic technologies are being developed.

  5. Our approach. Reduction of power consumption of conventional circuit technology. Development of technologies for realizing a high-performance computer using a new low-power circuit technology: SFQ.

  6. Backgrounds. Superconducting Single-Flux-Quantum Circuits: ultra low-power, ultra high-speed. An SFQ is held in a superconductive loop containing a Josephson junction. SFQ pulse: width 2~3 ps, height ~1 mV. Φ0 = h/2e = 2.07 mV·ps.
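The flux-quantum figure on this slide can be checked with a few lines of arithmetic. The sketch below computes Φ0 = h/2e from the exact SI values of h and e, then estimates the average height of a pulse whose time integral is one flux quantum. Treating the pulse as a rectangle of the stated width is an assumption made only for illustration, and the function names are hypothetical.

```python
# Exact SI defining constants (2019 redefinition).
PLANCK_H = 6.62607015e-34            # J*s
ELEMENTARY_CHARGE = 1.602176634e-19  # C


def flux_quantum_wb() -> float:
    """Magnetic flux quantum Phi0 = h / (2e), in webers (V*s)."""
    return PLANCK_H / (2.0 * ELEMENTARY_CHARGE)


def mean_pulse_voltage_mv(width_ps: float) -> float:
    """Average voltage (mV) of a pulse of the given width (ps) whose
    voltage-time area equals one flux quantum.
    Assumption: rectangular pulse shape, for illustration only."""
    phi0_mv_ps = flux_quantum_wb() * 1e15  # 1 V*s = 1e15 mV*ps
    return phi0_mv_ps / width_ps


if __name__ == "__main__":
    print(f"Phi0 = {flux_quantum_wb() * 1e15:.3f} mV*ps")
    for width in (2.0, 3.0):
        print(f"{width} ps pulse -> about {mean_pulse_voltage_mv(width):.2f} mV")
```

With widths of 2 to 3 ps the average comes out somewhat below or around 1 mV, in line with the "~1 mV" quoted on the slide.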

  7. SFQ technologies as of 2006. • Conventional Nb 4-layer 2 µm fabrication process: cell-based design, logic cell library, automatic routing by Josephson transmission line, SFQ-LSIs with more than 10,000 JJs. • Development of a new 1 µm fabrication process: Nb 6 layers, no design environment. • Development of passive transmission line (PTL) technology: high-speed inner-chip data transfer of SFQ pulses over a superconducting micro-strip line.

  8. Reconfigurable Data-Path (RDP) processor. • Reconfigurable data-path: a lot of floating point units (FPUs); reconfigurable operand routing networks (ORNs); dynamic reconfiguration. • Features: reconfiguring the data-path by routing ORNs to fit the processing of a loop in large-scale numerical computation; parallel and pipelined processing; burst input/output data is transferred from/to memory. [Figure: block diagram of a General Purpose Processor (GPP) and main memory attached to the RDP, which consists of rows of processing elements (PEs, each an FPU) alternating with ORNs, fed through I/O ports by a Streaming Memory Access Controller (SMAC).]
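The idea of configuring ORNs so that the data-path matches one loop body can be illustrated with a small functional model. This is a sketch under stated assumptions, not the project's architecture or compiler: the names `PEConfig`, `run_datapath`, `stream` and `AXPB` are hypothetical, each row of PEs simply reads two operands selected from the previous row's outputs, and pipelining and timing are not modelled.

```python
import operator
from typing import Callable, List, Sequence, Tuple

Op = Callable[[float, float], float]
# One PE configuration: (operation, index of left operand, index of right operand).
# The two indices stand in for the ORN setting in front of that PE.
PEConfig = Tuple[Op, int, int]


def passthrough(a: float, b: float) -> float:
    """Forward the left operand unchanged (a PE used as a wire)."""
    return a


def run_datapath(rows: Sequence[Sequence[PEConfig]],
                 inputs: Sequence[float]) -> List[float]:
    """Push one input record through all PE rows, top to bottom."""
    values = list(inputs)
    for row in rows:
        values = [op(values[i], values[j]) for op, i, j in row]
    return values


def stream(rows: Sequence[Sequence[PEConfig]],
           records: Sequence[Sequence[float]]) -> List[List[float]]:
    """Apply the same configured data-path to a burst of records,
    as a streaming memory controller would feed them."""
    return [run_datapath(rows, record) for record in records]


# Configuration for the loop body y = a * x + b, with inputs ordered (a, x, b).
AXPB = [
    [(operator.mul, 0, 1), (passthrough, 2, 2)],  # row 1: a*x, and forward b
    [(operator.add, 0, 1)],                       # row 2: (a*x) + b
]

if __name__ == "__main__":
    print(stream(AXPB, [(2.0, 3.0, 1.0), (0.5, 4.0, -1.0)]))
```

Changing the loop being accelerated means swapping `AXPB` for another table while the rows of PEs stay the same, which is the sense in which the data-path is reconfigured.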

  9. Research subjects. 1. SFQ fabrication process and circuit design environments: (1) Nb multi-layer 1 µm fabrication process (Nagasawa G.); (2) logic cell library for the 1 µm process (Yoshikawa G. and Akaike G.); (3) CAD for SFQ digital circuit design (Takagi G.). 2. SFQ-FPUs and SFQ-RDP prototypes: (1) SFQ-FPUs (Yoshikawa G. and Takagi G.): half-precision FPA and FPM operating at 25 GHz (2 µm process), FPA and FPM operating at 50 GHz (1 µm process); (2) SFQ-RDP prototypes (ALU+ORN) (Akaike G.): 2x2 SFQ-RDP operating at 25 GHz (2 µm process), 4x4 SFQ-RDP operating at 50 GHz (1 µm process). 3. RDP architecture (Murakami G.): RDP architecture, RDP compiler, RDP-oriented algorithms.

  10. Results of the research

  11. 1. Fabrication process and design environment. Development of a Nb 9-layer 1 µm fabrication process. [Figure: layer stack on a Si substrate, bottom to top: M1 (DCP), DC power layer; M2 (GND1); M3 (PTL1), 1st PTL layer; M4 (GND2); M5 (PTL2), 2nd PTL layer; M6 (GND3), main ground plane; M7 (GP), complemented planarization layer; M8 (BAS) and M9 (COU), active layers including JJs and resistors. Nb layers are separated by SiO2 and connected through contacts C1 to C6, GC, RC, JC and BC; labelled Nb and SiO2 layer thicknesses range from 150 nm to 400 nm.] Nb layers M1-M7 are planarized. Cross-sectional SEM photograph: excellent flatness was obtained even though the step edges of several underlying patterns overlap.

  12. Shift registers for evaluation of the Nb 9-layer 1 µm process. Design (16 circuits): • two 16-bit shift registers • two 64-bit shift registers • two 160-bit shift registers • two 640-bit shift registers • four 1280-bit shift registers • four 2560-bit shift registers. 68,990 JJs in total. Chip size: 8.5 mm x 7.0 mm. Measurement results (best chip): only three defects, with correct operation of 13 of 16 circuits; correct operation of all 2560-bit shift registers with 10,281 JJs.

  13. Development of a logic cell library for the 9-layer process. Basic structure of a logic cell: 30 µm x 30 µm, with PTL1 and PTL2 layers, ground contacts and bias pillars. [Figure: a microphotograph of a dffc2 cell.]

  14. A 4x4 switch by the 9-layer 1 µm process and the cell library. [Figure labels: upper PTL, lower PTL, via hole.] • Operation up to 112 GHz (world's highest) • Total power consumption: 660 µW • Total number of JJs: 3362 • The number of vias: 434
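The figures quoted for the 4x4 switch allow a rough per-junction estimate. The sketch below divides the total power evenly over all junctions and over one clock period at the maximum frequency. Both steps are simplifying assumptions (the slide does not say at which frequency the power was measured or how it splits between static bias and switching), so the result is an order-of-magnitude indication only.

```python
TOTAL_POWER_W = 660e-6       # total power consumption quoted on the slide
JUNCTION_COUNT = 3362        # total number of JJs quoted on the slide
MAX_FREQUENCY_HZ = 112e9     # highest operating frequency quoted on the slide


def power_per_junction_w() -> float:
    """Average power per junction, assuming an even split across all JJs."""
    return TOTAL_POWER_W / JUNCTION_COUNT


def energy_per_junction_per_cycle_j() -> float:
    """Average energy per junction per clock period at the maximum frequency.
    Assumption: the quoted power applies at that frequency."""
    return power_per_junction_w() / MAX_FREQUENCY_HZ


if __name__ == "__main__":
    print(f"power per JJ: {power_per_junction_w() * 1e6:.3f} uW")
    print(f"energy per JJ per cycle: {energy_per_junction_per_cycle_j():.2e} J")
```

Under these assumptions the estimate lands near 10^-18 J per junction per cycle.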

  15. Area reduction by 81% compared to the conventional Nb 4-layer 2 µm fabrication process. [Figure: the same functional block laid out in the conventional Nb 4-layer 2 µm process and in the new Nb 9-layer 1 µm process.] Circuit area ratio 1 : 0.19 (81% reduction). (Cell size: 40 µm x 40 µm → 30 µm x 30 µm)

  16. Device density and operating frequency in LSIs. [Figure: device density (Trs/cm²), from 1 to 10^12, versus frequency (GHz), from 0.01 to 1000. Semiconductor devices shown: Si MOSFET, Si bipolar, SiGe HBT, GaAs MESFET, GaAs HBT, GaAs HEMT, InP HBT, InP HEMT, bounded by the limit from power density for CMOS, the limit from power density for compound semiconductors, and the limit from long interconnect delay. SFQ points shown: previous SFQ in USA/Japan, present SFQ in the 2x2 RDP, and the 4x4 SW demonstrated in CREST.] SFQ LSIs developed in this project have reached a region that semiconductors cannot reach.

  17. Energy consumption for a device used in LSIs. [Figure: energy consumption (J), from 10^-23 to 10^-13, versus clock period (ps), from 1 to 10^4. Labelled items: present CMOS; present SFQ in USA; SFQ in Japan before 2005; demonstrated in CREST (2x2 RDP, FPM/FPA, 4x4 SW); primitives that will be demonstrated in CREST; thermal energy at 350 K; thermal energy at 4 K.]
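The two thermal-energy reference lines in this plot can be reproduced from the Boltzmann constant. The sketch below assumes that "thermal energy" means k_B·T, which is the usual convention but is not stated on the slide; the function name is hypothetical.

```python
BOLTZMANN_K = 1.380649e-23  # J/K, exact SI defining constant


def thermal_energy_j(temperature_k: float) -> float:
    """Thermal energy k_B * T in joules (assumed definition of the plot's reference lines)."""
    return BOLTZMANN_K * temperature_k


if __name__ == "__main__":
    for temperature in (4.0, 350.0):
        print(f"k_B*T at {temperature:g} K = {thermal_energy_j(temperature):.2e} J")
    ratio = thermal_energy_j(350.0) / thermal_energy_j(4.0)
    print(f"ratio 350 K / 4 K = {ratio:.1f}")
```

The values fall between 10^-23 and 10^-22 J at 4 K and between 10^-21 and 10^-20 J at 350 K, matching where the slide places the two lines on its energy axis.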

  18. Design flow of SFQ LSIs. Processes unique to SFQ circuits: • sequential circuit synthesis • clock scheduling and distribution • asynchronous logic synthesis. Layout-driven design; verification of pipeline operations; precise timing analysis. [Figure: flow diagram with the blocks Design Entry, Specification & Constraints, Specific Synthesis Subsystems for SFQ Logic Circuits, Logic Synthesis, Technology Library, Logic Netlist, Static Timing Analyzer, Logic Simulator, Constraints & Violations, Timing Verification, P&R (Placement, Routing), Placement Library, Placed Cells & Connections, Layout Viewer, Cell & Wire Geometry, Mask Layout.]

  19. Development of design tools: clock tree synthesis, semi-automatic placement, automatic routing. Designed a sample circuit: • 8-bit carry lookahead adder (8-bit CLA): 158 gates, 9 levels, concurrent-flow clocking, 7092 JJs, 598 PTLs. Verified correct operations.
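For readers unfamiliar with the sample circuit, the logic function of a carry lookahead adder can be sketched in software. This is only a bit-level illustration of the generate/propagate equations; it says nothing about the gate-level structure, clocking or layout of the SFQ circuit on the slide, and the function name `cla_add` is hypothetical.

```python
from typing import Tuple


def cla_add(a: int, b: int, carry_in: int = 0, width: int = 8) -> Tuple[int, int]:
    """Add two `width`-bit integers with carry lookahead.

    Returns (sum modulo 2**width, carry out).
    """
    # Per-bit generate (g) and propagate (p) signals.
    g = [(a >> i) & (b >> i) & 1 for i in range(width)]
    p = [((a >> i) ^ (b >> i)) & 1 for i in range(width)]

    carries = [carry_in & 1]
    for i in range(width):
        # c[i+1] = g[i] | p[i]g[i-1] | ... | p[i]...p[1]g[0] | p[i]...p[0]c[0]
        # Each carry depends only on g, p and c[0], not on a rippled carry.
        carry = 0
        prefix = 1  # running product p[i] & p[i-1] & ... & p[k+1]
        for k in range(i, -1, -1):
            carry |= prefix & g[k]
            prefix &= p[k]
        carry |= prefix & carries[0]
        carries.append(carry)

    total = 0
    for i in range(width):
        total |= (p[i] ^ carries[i]) << i
    return total, carries[width]


if __name__ == "__main__":
    print(cla_add(200, 100))  # 300 does not fit in 8 bits: low byte plus carry out
```

In hardware the inner loop corresponds to AND-OR terms evaluated in parallel, which is what shortens the carry path compared with a ripple-carry adder.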

  20. 2. SFQ-FPUs and SFQ-RDP prototypes. Half-precision FPA and FPM using the 2 µm process. [Figure: chip micrographs of the FPA and the FPM with 1 mm scale bars. Labelled blocks include clock generators, normalizers, shifter of A, shifter of B, adder & subtractor, multiplier of significands, controller, shift register of exponent and sign, and shift registers for confirmation.] Two sets of results are listed: (a) operating frequency 32 GHz, performance 2.6 GFLOPS, 11044 JJs, power consumption 3.5 mW, circuit area 6.22 × 3.78 mm²; (b) operating frequency 20 GHz, performance 1.67 GFLOPS, 10244 JJs, power consumption 3.5 mW, circuit area 5.86 × 5.72 mm².
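Two derived quantities follow directly from the figures quoted for these units: the number of clock cycles per floating-point operation and the energy per operation. The sketch below is plain arithmetic on the slide's numbers. It assumes the quoted power is the chip's own dissipation (cooling overhead is not included), and the labels "a" and "b" simply follow the order in which the two result sets appear.

```python
from typing import Dict, Tuple

# (operating frequency in Hz, performance in FLOPS, power in W), as quoted on the slide.
UNITS: Dict[str, Tuple[float, float, float]] = {
    "a": (32e9, 2.6e9, 3.5e-3),
    "b": (20e9, 1.67e9, 3.5e-3),
}


def cycles_per_operation(frequency_hz: float, flops: float) -> float:
    """Clock cycles spent per floating-point operation."""
    return frequency_hz / flops


def energy_per_operation_j(power_w: float, flops: float) -> float:
    """Energy per floating-point operation (chip dissipation only, no cooling)."""
    return power_w / flops


if __name__ == "__main__":
    for name, (freq, flops, power) in UNITS.items():
        cycles = cycles_per_operation(freq, flops)
        energy = energy_per_operation_j(power, flops)
        print(f"unit {name}: {cycles:.1f} cycles/op, {energy * 1e12:.2f} pJ/op")
```

Both units come out at roughly 12 clock cycles per operation.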

  21. FPA and FPM using the 1 µm process. Block diagram of bit-serial FPM: operation circuit for the significand part (systolic array multiplier), operation circuit for the exponent part, normalizer, shift register for input, shift register for output. Micrograph of 10-bit bit-serial FPM (3.510 mm x 2.16 mm): clock generator, significand processing circuit, exponent processing circuit; circuit area 7.58 mm², junction count 6157 JJs. 50-GHz test results of 4b multiplier: simulation and measurement (figure label: 9%).

  22. 2x3 SFQ-RDP prototype using the 2 µm process. 6 ALUs. Clock frequency: 23 GHz. Junction count: 14040 (world's largest integration scale). Circuit area: 6.84 × 6.72 mm². *SRL Nb 2.5 kA/cm² standard process; CONNECT cell library, in cooperation with SRL, NiCT, NU & YNU.
