System X: Virginia Tech's Supercomputer
“The fastest academic supercomputer”
Project #2, CS466, Fall 2004
By Raj Bharath Swaminathan and Hareesh Nagarajan
{rswamina, hnagaraj}@cs.uic.edu
University of Illinois at Chicago
How was it built?
● Virginia Tech faculty (the Terascale Computing Facility – TCF) worked closely with vendor partners.
● 1100 Power Mac G5s were put into racks and construction began.
● In parallel, device drivers, hand optimization of numerical libraries, and code porting were under way.
● The supercomputer was on paper in February 2003 and was built by September 2003.
● Unfortunately, the system couldn't perform scientific computation because ECC RAM was required and the Power Mac G5 didn't support it. Enter the Xserve G5.
The TCF lab went from looking like this (left) to this (bottom).
Specification
● Nodes: 1100 Apple Xserve G5 2.3 GHz dual-processor cluster nodes (4 GB RAM, 80 GB SATA HD)
  – 4.4 TB (4400 GB) of RAM
  – 88 TB (88000 GB) of HDD
  – 2200 processors
● Primary communication: 24 Mellanox 96-port InfiniBand switches (4X InfiniBand, 10 Gbps)
● Secondary communication: 6 Cisco 4506 Gigabit Ethernet switches
● Cooling: Liebert X-treme Density System cooling
● Software: Mac OS X, MVAPICH, XLC & XLF
● Current Linpack:
  – Rpeak = 20.24 Teraflops
  – Rmax = 12.25 Teraflops
  – Nmax = 620000
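As a sanity check on the Linpack numbers above, here is a minimal sketch in plain C that rederives Rpeak from the figures on this slide (1100 dual-processor nodes at 2.3 GHz). The 4 flops/cycle figure is an assumption taken from the two-FPU fused multiply-add capability described on the floating-point slide later.

```c
#include <stdio.h>

int main(void) {
    /* Numbers from the specification slide: 1100 dual-processor nodes at
       2.3 GHz. Assumption: each PowerPC 970FX core has two FPUs, each
       retiring one fused multiply-add (2 flops) per cycle, i.e. 4
       flops/cycle/processor. */
    double processors      = 1100 * 2;   /* 2200 CPUs               */
    double clock_ghz       = 2.3;        /* cycles per nanosecond   */
    double flops_per_cycle = 4.0;        /* 2 FPUs x 2 flops (FMA)  */

    double rpeak_gflops = processors * clock_ghz * flops_per_cycle;
    printf("Rpeak = %.2f Gflop/s = %.2f Tflop/s\n",
           rpeak_gflops, rpeak_gflops / 1000.0);
    /* Prints 20240.00 Gflop/s = 20.24 Tflop/s, matching the listed Rpeak.
       Rmax (12.25 Tflop/s) is the measured Linpack result, about 60% of peak. */
    return 0;
}
```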
Some facts
● System X comes in at #7 on the top500.org list.
● Each of the 1100 Xserve servers was custom built by Apple.
● $5.8 million price tag ($5.2 million for the initial machines and $600,000 for the Xserve upgrade).
● The new (custom built!) Xserve servers are about 15% faster than the desktop machines --> the new System X operates about 20 percent faster, adding almost 2 teraflops.
● The extra 5-percent performance boost came from optimized software.
● Typically, System X runs several projects simultaneously, each tying up 400 to 500 processors for research into weather and molecular modeling.
PowerPC G5 processor – key features
● Based on IBM's PowerPC 970FX series.
● 64-bit PowerPC architecture.
● Native support for 32-bit applications.
● Front-side bus speed up to 1.25 GHz.
● Superscalar execution core with 12 functional units supporting up to 215 in-flight instructions.
● Uses a dedicated, optimized 128-bit Velocity Engine for accelerated SIMD processing.
● Can address up to 4 TB of RAM.
Specifications
● 90 nm Silicon-on-Insulator (SOI) process with copper interconnects.
● Consumes 42 W of power at 1.3 V.
● Around 58 million transistors.
● Uses a two-level cache.
● Registers:
  – 32 64-bit general-purpose registers
  – 32 64-bit floating-point registers
  – 32 128-bit vector registers
● Eight deep issue queues for the functional units.
● Uses a 16-stage pipeline.
Front-side bus
● It runs at 1/2 the core clock speed, DDR. So for the 2.3 GHz processor, the front-side bus runs at 1.15 GHz DDR.
● The bus is composed of two unidirectional channels, each 32 bits wide; the total theoretical peak bandwidth for the 1.15 GHz bus is close to 10 GB/sec. Dual processors mean twice the bandwidth, i.e. around 20 GB/sec.
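The "close to 10 GB/sec" and "around 20 GB/sec" figures follow directly from the numbers above. A minimal sketch that just redoes the arithmetic, assuming the 1.15 GHz figure is the effective DDR transfer rate per 32-bit channel:

```c
#include <stdio.h>

int main(void) {
    /* Front-side bus of the 2.3 GHz PowerPC 970FX: the effective (DDR)
       transfer rate is half the core clock, and the bus consists of two
       unidirectional 32-bit (4-byte) channels. */
    double bus_rate_gt_s  = 2.3 / 2.0;   /* 1.15 GT/s per channel */
    double bytes_per_xfer = 4.0;         /* 32-bit channel        */
    double channels       = 2.0;         /* one in each direction */

    double per_cpu_gbs = bus_rate_gt_s * bytes_per_xfer * channels;
    printf("Per-processor FSB peak : %.1f GB/s\n", per_cpu_gbs);
    printf("Dual-processor node    : %.1f GB/s\n", per_cpu_gbs * 2.0);
    /* Roughly 9.2 GB/s and 18.4 GB/s, i.e. "close to 10" and "around 20" GB/s. */
    return 0;
}
```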
Cache
● L1 data cache: 32 KB, write-through, 2-way set-associative (see the index sketch after this list).
● L1 instruction cache: 64 KB, direct-mapped.
● L2 cache: 512 KB, 8-way set-associative.
● The L1 cache is parity protected.
● The L2 cache is protected using ECC (error-correcting code) logic.
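To make the "2-way set-associative" wording concrete, here is an illustrative sketch of how an address would map into a 32 KB, 2-way set-associative L1 data cache. The 128-byte line size is an assumption made for the example (it is not stated on this slide).

```c
#include <stdint.h>
#include <stdio.h>

/* Illustrative only: address decomposition for a 32 KB, 2-way
   set-associative cache. Assumed 128-byte lines give
   32768 / (128 * 2) = 128 sets. */
enum { LINE_BYTES = 128, WAYS = 2, CACHE_BYTES = 32 * 1024 };
enum { SETS = CACHE_BYTES / (LINE_BYTES * WAYS) };   /* 128 sets */

static void decompose(uint64_t addr) {
    uint64_t offset = addr % LINE_BYTES;               /* byte within line      */
    uint64_t set    = (addr / LINE_BYTES) % SETS;      /* which set to look in  */
    uint64_t tag    = addr / (LINE_BYTES * (uint64_t)SETS); /* compared per way */
    printf("addr 0x%llx -> tag 0x%llx, set %llu, offset %llu\n",
           (unsigned long long)addr, (unsigned long long)tag,
           (unsigned long long)set, (unsigned long long)offset);
}

int main(void) {
    decompose(0x12345680ULL);
    decompose(0x12345680ULL + CACHE_BYTES / WAYS);  /* same set, different tag */
    return 0;
}
```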
Fetch, Decode & Issue
● Eight instructions per cycle are fetched from the 64 KB instruction cache into an instruction queue.
● Nine pipeline stages are devoted to instruction fetch and decode.
● The "decode, crack, and group formation" phase breaks instructions down into simpler IOPs (internal operations), which resemble RISC instructions.
● Five IOPs are dispatched per clock (4 instructions + 1 branch) in program order to a set of issue queues.
● Out-of-order execution logic pulls instructions from these issue queues to feed the chip's eight functional units.
Branch prediction
● On each instruction fetch, the front end's branch unit scans the eight instructions and picks out up to two branches. Prediction is done using one of two branch prediction schemes:
  1. Standard BHT scheme – 16K entries, 1-bit branch predictor.
  2. Global predictor table scheme – 16K entries. Each entry has an associated 11-bit vector that records the actual execution path taken by the previous 11 fetch groups, plus a 1-bit branch predictor.
● A third 16K-entry table keeps track of which of the two schemes works best for each branch. When each branch is finally evaluated, the processor compares the success of both schemes and records in this selector table which scheme has done the best job so far of predicting the outcome of that particular branch.
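Below is a toy software model of the dual-scheme idea described above: a 1-bit BHT, a global table indexed with an 11-bit path history, and a selector that remembers which scheme last predicted correctly. The table indexing and update policy here are simplifications chosen for illustration, not the 970's actual hardware logic.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define ENTRIES   16384   /* 16K entries per table, as on the slide */
#define HIST_BITS 11      /* path history length, as on the slide   */

static uint8_t  bht[ENTRIES];        /* 1-bit: taken / not taken          */
static uint8_t  global_tab[ENTRIES]; /* 1-bit, indexed with path history  */
static uint8_t  selector[ENTRIES];   /* 0 = trust BHT, 1 = trust global   */
static uint32_t history;             /* last HIST_BITS branch outcomes    */

static bool predict(uint32_t pc) {
    uint32_t li = pc % ENTRIES;
    uint32_t gi = (pc ^ (history << 3)) % ENTRIES;   /* simplistic hash */
    return selector[li] ? global_tab[gi] : bht[li];
}

static void update(uint32_t pc, bool taken) {
    uint32_t li = pc % ENTRIES;
    uint32_t gi = (pc ^ (history << 3)) % ENTRIES;
    bool local_ok  = (bool)bht[li]        == taken;
    bool global_ok = (bool)global_tab[gi] == taken;
    if (local_ok != global_ok)          /* record which scheme won */
        selector[li] = global_ok;
    bht[li]        = taken;
    global_tab[gi] = taken;
    history = ((history << 1) | taken) & ((1u << HIST_BITS) - 1);
}

int main(void) {
    /* Toy loop branch: taken 7 times, then falls through. */
    for (int iter = 0; iter < 3; ++iter)
        for (int i = 0; i < 8; ++i) {
            bool taken = (i != 7);
            printf("predicted %d, actual %d\n", predict(0x4000), taken);
            update(0x4000, taken);
        }
    return 0;
}
```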
Integer unit
● 2 integer units attached to 80 GPRs (32 architectural + 48 rename).
● Simple, non-dependent integer IOPs can issue and finish at a rate of one per cycle. Dependent integer IOPs need 2 cycles.
● Condition register logical unit (CRU): a dedicated unit for handling logical operations on the PowerPC's condition register.
Load Store Unit
● Two identical load-store units execute all of the LOADs and STOREs.
● Dedicated address-generation hardware is part of the load-store units, so address generation takes place as part of the execution phase of the load-store unit's pipeline.
Integer Issue Queue
Floating point unit
● Two identical FPUs, each of which can execute the fastest floating-point instructions in 6 cycles. Single- and double-precision operations take the same amount of time to execute.
● The FPUs are fully pipelined for all operations except floating-point divides.
● 80 total microarchitectural registers, of which 32 are PowerPC architectural registers and the remaining 48 are rename registers.
● The floating-point units can complete both a multiply operation and an add operation as part of the same machine instruction (fused multiply-add), thereby accelerating matrix multiplication, vector dot products, and other scientific computations.
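The fused multiply-add point is easiest to see in a dot product, where every element contributes exactly one multiply-add. A minimal C sketch; the explicit fma() call just makes the pairing visible, since a compiler targeting the G5 would normally emit the PowerPC fmadd instruction on its own.

```c
#include <math.h>
#include <stdio.h>

/* One fused multiply-add per element: on the 970, each of the two FPUs can
   retire one such operation per cycle once the pipeline is full.
   Link with -lm. */
static double dot(const double *x, const double *y, int n) {
    double acc = 0.0;
    for (int i = 0; i < n; ++i)
        acc = fma(x[i], y[i], acc);   /* acc = x[i]*y[i] + acc, one rounding */
    return acc;
}

int main(void) {
    double a[4] = {1, 2, 3, 4}, b[4] = {5, 6, 7, 8};
    printf("dot = %g\n", dot(a, b, 4));   /* prints 70 */
    return 0;
}
```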
Floating point Issue queue
Vector Unit
● Contains 4 fully pipelined vector processing units:
  1. Vector Permute Unit (VPU)
  ● Vector Arithmetic Logic Unit (VALU), made up of:
    2. Vector Simple Integer Unit (VSIU)
    3. Vector Complex Integer Unit (VCIU)
    4. Vector Floating-point Unit (VFPU)
● Up to four vector IOPs per cycle in total can be issued to the two vector issue queues – two IOPs per cycle maximum to the 16-entry VPU queue and two IOPs per cycle maximum to the 20-entry VALU queue.
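A minimal Velocity Engine (AltiVec) sketch of the 128-bit SIMD idea: one vec_madd processes four single-precision values at once. This assumes a compiler with AltiVec support (e.g. gcc -maltivec) and is purely illustrative, not tuned System X code.

```c
#include <altivec.h>
#include <stdio.h>

/* One 128-bit vector multiply-add = four single-precision FMAs at once. */
int main(void) {
    vector float a = {1.0f, 2.0f, 3.0f, 4.0f};
    vector float b = {5.0f, 6.0f, 7.0f, 8.0f};
    vector float c = {0.5f, 0.5f, 0.5f, 0.5f};

    vector float r = vec_madd(a, b, c);   /* r[i] = a[i]*b[i] + c[i] */

    union { vector float v; float f[4]; } out = { r };
    printf("%g %g %g %g\n", out.f[0], out.f[1], out.f[2], out.f[3]);
    return 0;
}
```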
Vector Issue Queue
Conclusion (on the processor; the presentation isn't over!)
● Dual processors provide the high-density power and scalability required by the research and computational clustering environments of System X.
● The PowerPC G5 is designed for symmetric multiprocessing.
● Dual independent front-side buses allow each processor to handle its own tasks at maximum speed with minimal interruption.
● With sophisticated multiprocessing capabilities built in, Mac OS X and Mac OS X Server dynamically manage multiple processing tasks across the two processors. This allows dual PowerPC G5 systems to accomplish up to twice as much as a single-processor system in the same amount of time, without requiring any special optimization of the application.
A brief intro to interconnection networks
● Shared media have disadvantages (collisions).
● Switches allow communication directly from source to destination, without intermediate nodes interfering with the signals.
● A crossbar switch allows any node to communicate with any other node in one pass through the interconnection.
● An Omega interconnection uses less hardware, but contention is more likely. Contention is called blocking (see the routing sketch after this list).
● A fat-tree switch has more bandwidth added higher in the tree to match the requirements of common communication patterns.
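To illustrate the Omega-network point, here is a small sketch of destination-tag routing in an 8-node Omega network; when two routes need the same switch output in the same stage, one must wait, which is the blocking referred to above. The 8-node size and the coding details are chosen for illustration only.

```c
#include <stdio.h>

/* Destination-tag routing in an 8-node Omega network: three stages of 2x2
   switches with a perfect shuffle between stages. At stage s, the s-th most
   significant bit of the destination picks the switch output
   (0 = upper, 1 = lower). */
#define N      8   /* nodes   */
#define STAGES 3   /* log2(N) */

static void route(unsigned src, unsigned dst) {
    unsigned cur = src;
    printf("route %u -> %u:", src, dst);
    for (int s = 0; s < STAGES; ++s) {
        cur = ((cur << 1) | (cur >> (STAGES - 1))) & (N - 1); /* shuffle   */
        unsigned bit = (dst >> (STAGES - 1 - s)) & 1u;        /* tag bit   */
        cur = (cur & ~1u) | bit;                              /* pick port */
        printf(" stage%d@%u", s, cur);
    }
    printf("\n");   /* cur now equals dst */
}

int main(void) {
    route(0, 5);
    route(3, 6);
    return 0;
}
```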
More...
● A storage area network (SAN) that tries to optimize for shorter distances is InfiniBand.
● High-performance clusters such as System X use "fat tree" or Constant Bisectional Bandwidth (CBB) networks to construct large node-count, non-blocking switch configurations.
● Here, integrated crossbars with a relatively low number of ports are used to build a non-blocking switch topology supporting a much larger number of endpoints.
Switches
● Crossbar switch (left); CBB network (below) used in System X.
● P = 96 ports per switch, 24 Mellanox switches: 96/2 × 24 = 1152 ≈ 1100 nodes.
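The port arithmetic on this slide, spelled out as a tiny sketch. The half-down/half-up split per switch is the usual way a CBB/fat-tree fabric keeps full bisection bandwidth; the exact System X wiring was not available to us, as noted on a later slide.

```c
#include <stdio.h>

int main(void) {
    /* 24 Mellanox 96-port switches; assume half of each switch's ports face
       the nodes and half feed the upper tiers of the fat tree. */
    int switches     = 24;
    int ports_per_sw = 96;
    int node_ports   = switches * (ports_per_sw / 2);   /* 24 * 48 */
    printf("node-facing ports = %d (enough for 1100 nodes)\n", node_ports);
    /* Prints 1152. */
    return 0;
}
```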
How does it apply to System X?
● InfiniBand is a switch-based serial I/O interconnect architecture operating at a base speed of 10 Gb/s in each direction per port. It is used in System X.
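Applications on System X would reach this fabric through MPI (MVAPICH appears on the software slide). A minimal, generic MPI ping-pong sketch, not code from System X itself, just to show the programming model that runs over InfiniBand:

```c
#include <mpi.h>
#include <stdio.h>

/* Rank 0 sends an integer to rank 1, which increments it and sends it back.
   Run with e.g. mpirun -np 2 ./pingpong. */
int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int msg = 42;
    if (rank == 0) {
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        MPI_Recv(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 0 got %d back\n", msg);
    } else if (rank == 1) {
        MPI_Recv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        msg += 1;
        MPI_Send(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```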
A cluster making use of an InfiniBand system fabric. Note: we were unable to obtain the exact schema of System X.
The Mellanox Switch
Apple's new liquid cooling system
1. G5 processor at point of contact with the heatsink
2. G5 processor card from IBM
3. Heatsink
4. Cooling fluid output from the radiator to the pump
5. Liquid cooling system pump
6. Pump power cable
7. Cooling fluid radiator input from the G5 processor
8. Radiant grille
9. Airflow direction
More on the cooling system...
1. Liquid cooling system pump
2. G5 processors
3. Radiator output
4. Radiator
5. Pump power cable
6. Radiator input