ATLAS
A Scalable Emulator for Transactional Parallel Systems
Christos Kozyrakis and Kunle Olukotun
Computer Systems Laboratory
Stanford University
http://tcc.stanford.edu

Motivation
• CMPs are here, but how do we program them?
• Our proposal: transactional programming & execution
  • Programs written as sequences of transactions
  • CMP executes transactions in parallel with optimistic concurrency
  • More details at http://tcc.stanford.edu
• Challenges
  • Explore programming model with large applications & datasets
  • Interactions with operating systems and IO
  • Large-scale transactional architectures (>16 nodes)
• Need a fast, scalable emulator for system-level studies
  • Full-system simulation too slow for our purposes…
C. Kozyrakis, WARFP, Feb. 2005

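To make "transactions executed in parallel with optimistic concurrency" concrete, here is a minimal software sketch. It is not the TCC hardware protocol or the TCC API: the names `Memory`, `Transaction`, and `run` are hypothetical, and conflict detection uses per-address version checks at commit time, a common software stand-in chosen here for brevity.

```python
class Memory:
    """Shared memory with a version counter per address (hypothetical model)."""

    def __init__(self):
        self.data = {}
        self.version = {}

    def read(self, addr):
        # Unwritten addresses read as 0 at version 0.
        return self.data.get(addr, 0), self.version.get(addr, 0)


class Transaction:
    """Buffers writes locally and validates its reads when it commits."""

    def __init__(self, mem):
        self.mem = mem
        self.read_set = {}   # addr -> version seen at first read
        self.write_set = {}  # addr -> speculative value

    def load(self, addr):
        if addr in self.write_set:       # read our own buffered write
            return self.write_set[addr]
        value, ver = self.mem.read(addr)
        self.read_set.setdefault(addr, ver)
        return value

    def store(self, addr, value):
        self.write_set[addr] = value     # invisible to others until commit

    def commit(self):
        # Conflict check: fail if anything we read has been overwritten.
        for addr, ver in self.read_set.items():
            if self.mem.version.get(addr, 0) != ver:
                return False
        # No conflict: publish all buffered writes.
        for addr, value in self.write_set.items():
            self.mem.data[addr] = value
            self.mem.version[addr] = self.mem.version.get(addr, 0) + 1
        return True


def run(mem, body, max_retries=10):
    """Execute body as a transaction, re-executing it after each conflict."""
    for attempt in range(1, max_retries + 1):
        tx = Transaction(mem)
        body(tx)
        if tx.commit():
            return attempt
    raise RuntimeError("transaction did not commit")


def increment(tx):
    tx.store("counter", tx.load("counter") + 1)
```

If two transactions both run `increment` before either commits, the first commit succeeds and the second fails validation; calling `run(mem, increment)` afterwards re-executes it against the updated value.
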
ATLAS Overview
• A multi-board emulator for transactional parallel systems
• Goals
  • 16 to 64 CPUs (8 to 32 boards)
  • 50 to 100MHz
  • Stand-alone full-feature system
    • OS, IDE disks, 100Mb Ethernet, …
• ATLAS architecture space
  • Small, medium, and large-scale CMPs and SMPs
  • UMA and NUMA
  • Flexible transactional memory hierarchy & protocol
  • Flexible network model
  • Flexible clocking, latency, and bandwidth settings
[Figure: system block diagram. Labels: CPU, Transaction Cache, CMP Network, Disk, DRAM, IO/DRAM, Net Control, PCI]

Building Block: Xilinx ML310 Board
• XC2VP30 FPGA features
  • 2 PowerPC 405 cores
  • 2.4Mb dual-ported SRAM
  • 30K logic cells
  • 8 RocketIO 3.125Gbps transceivers
• System features
  • 256MB DDR, 512MB CompactFlash
  • Ethernet, PCI, USB, IDE, …
• Design and development tools
  • Foundation ISE for design entry, synthesis, …
    • For the transactional memory hierarchy and network
  • ChipScope Pro logic analyzer for debugging
  • EDK for system simulation, system SW development, configuration, …
  • MontaVista Linux 3.1 Pro

Example: 2-way bus-based transactional CMP
[Figure: block diagram. Labels: PowerPC 405, OCM, BRAM, PLB, Store Queue, Transaction State, Logic. Legend: BRAM, Logic, Macro]

ATLAS Software Framework
• PowerPC and ML310 features provide rich SW framework
  • Linux OS
    • Port for Xilinx boards available from MontaVista
    • Allows exploration of transactions with IO and scheduling
  • GCC C/C++ software framework
    • TCC API for transactional programming
    • Allows experimentation with wide range of applications
  • Jikes RVM Java framework
    • TCC API for transactional programming
    • Allows exploration of dynamic optimization techniques
• Allows us to focus on parallel programming quickly
  • No need to develop significant infrastructure from scratch
• Gradual path to parallel application development
  • Sequential version of C/C++/Java apps runs immediately

Trade-offs & Scalability
• ATLAS trade-offs
  - Sacrifice some hardware modeling flexibility
    • Simple CPU, SW or coprocessor FPU, bounded on-chip memory
  + Fast hardware prototyping
    • Develop RTL for transactional memory + networking protocol
  + Rich software framework
  + Based on commercial hardware and software
    • Low cost, timely upgrades and improvements
• Scaling
  • Scalability by adding boards (size & performance)
  • Use RocketIO transceivers and Xilinx Aurora protocol
• Limitations
  • 32-bit cores can address up to 4GB of shared memory
  • 8 transceivers per chip ⇒ must synthesize router for >16 CPUs

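The two limitations can be checked with simple arithmetic. The sketch below assumes one FPGA per board, one transceiver per board-to-board link, and a fully connected (point-to-point) topology. The slides do not state the topology, so these are illustrative assumptions, and the function names are hypothetical.

```python
CPUS_PER_BOARD = 2          # two PowerPC 405 cores per XC2VP30
TRANSCEIVERS_PER_CHIP = 8   # RocketIO transceivers per XC2VP30
ADDRESS_BITS = 32           # PowerPC 405 is a 32-bit core


def addressable_bytes(bits=ADDRESS_BITS):
    """Size of the address space reachable with the given pointer width."""
    return 2 ** bits


def links_per_board_full_mesh(boards):
    """Assumption: every board has a direct link to every other board."""
    return boards - 1


def needs_router(cpus):
    """True if a full mesh would need more links than one chip provides."""
    boards = -(-cpus // CPUS_PER_BOARD)   # ceiling division
    return links_per_board_full_mesh(boards) > TRANSCEIVERS_PER_CHIP
```

Under these assumptions an 8-board (16-CPU) system needs 7 links per board and fits within 8 transceivers, while a 16-board (32-CPU) system would need 15 and therefore requires a router synthesized in the FPGA fabric. This is consistent with the slide's >16 CPU threshold if board counts double from one configuration to the next.
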
Summary
• A scalable emulator for transactional parallel systems
  • Based on commercial FPGA chips, boards, and software
  • 32 to 64 CPUs at 50 to 100MHz
    • A 6.4 GIPS emulator at full scale
  • Low cost, fast, flexible
• ATLAS architecture space
  • Large-scale parallel systems with transactional memory support
• ATLAS software space
  • Transactional parallel programming and optimizations
  • Operating systems and IO research
  • Large-scale application development
    • Embedded, server, desktop

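The 6.4 GIPS figure follows from the full-scale configuration if each core retires one instruction per cycle at peak. That rate is an assumption here, not something the slides state, and `peak_gips` is a hypothetical helper.

```python
def peak_gips(cpus, clock_mhz, instr_per_cycle=1):
    """Aggregate peak throughput in billions of instructions per second.

    instr_per_cycle=1 is an assumed peak rate for a scalar core.
    """
    mips = cpus * clock_mhz * instr_per_cycle   # millions of instr/s
    return mips / 1000


full_scale = peak_gips(cpus=64, clock_mhz=100)
small_scale = peak_gips(cpus=32, clock_mhz=50)
```

With 64 CPUs at 100MHz this gives 6.4 GIPS; the low end of the range on the slide, 32 CPUs at 50MHz, gives 1.6 GIPS under the same assumption.
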