Introduction to PC-Cluster Hardware I
Russian-German School on High-Performance Computer Systems, 27th June - 6th July, Novosibirsk
Day 1, 27th of June, 2005
HLRS, University of Stuttgart
High Performance Computing Center Stuttgart
Outline
• Motivation
• Hardware Architectures
  – Architectural design of Classic Personal Computers
  – IA32-architecture, Pentium-4 series of processors
  – Pipelining
  – Multiprocessor Architecture
  – Examples
  – Evolution of SuperComputers
Motivation
We need the compute power
• Relevant engineering problems require performance that is orders of magnitude higher than what is available
• CFD: Simulation of turbulence at a reasonable level of resolution
• Combustion: Combination of turbulence simulation and realistic chemical models
• Climate simulation: Requires a resolution that is orders of magnitude higher than today's
• Bioinformatics, Chemistry, ...
How Has Compute Power Been Increasing?
• Moore's law: The performance of a computer doubles every 18 months
• This was realized by:
  – Downsizing the structures on the silicon
  – Increasing the clock frequency
  – Adding functional units
  – Improving the functional units
• Physical limits
  – Speed of light: at a clock rate of 10 GHz, a signal can travel at most 3 cm within one clock tick
  – Cooling (packaging)
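The 3 cm figure follows from dividing the speed of light by the clock rate. The sketch below redoes that arithmetic for a few clock rates; it assumes propagation at the vacuum speed of light, which is an upper bound (signals on a real chip are slower), and the function name is chosen here for illustration only.

```python
# Upper bound on how far a signal can travel during one clock period,
# assuming propagation at the vacuum speed of light (real wires are slower).
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact by definition of the metre


def distance_per_tick_cm(clock_hz: float) -> float:
    """Distance light covers in vacuum during one clock period, in cm."""
    return SPEED_OF_LIGHT_M_PER_S / clock_hz * 100.0


for ghz in (1.0, 3.5, 10.0):
    cm = distance_per_tick_cm(ghz * 1e9)
    print(f"{ghz:5.1f} GHz: {cm:6.2f} cm per clock tick")
```

At 10 GHz the result is just under 3 cm, which is already comparable to the size of a processor package.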
We cannot go on like this
• Surprisingly, it looks like we are already at the physical limit:
  – Intel cancelled the current Pentium IV development line
  – The clock rate can no longer grow by orders of magnitude (7 GHz looks to be the current limit due to leakage current)
• Fast hardware (e.g. ECL or GaAs) has a high power consumption, therefore the potential for higher integration is limited
The processor suppliers announced that future CPUs will have several processors on a die (currently 2 processors / 2 HT). In future, parallel architectures will be essential and everywhere, even at the desk.
Motivation
[Figure labels: Questions, Response]
[Figure labels, in order: Abstract Model, Reality, Physical Model, Mathematical Model, Numerical Scheme, Questions & Response, Application Program, a few parallel Programming Models (e.g. MPI, HPF, OpenMP), Hardware Architecture]
Hardware Architectures
History – Intel Chips of the 4th generation (starting 1972)
• "Highly Integrated Circuits" ("Large Scale Integration", LSI – VLSI). Back then: thousands of transistors per cm²
• In 1971 Intel designs an all-round processor for the Japanese firm Busicom: the 4004

                          4004         Pentium 4
Transistors               2300         42 million
Technology                10 µm        0.13 µm
Frequency                 108 kHz      3.5 GHz
Addressable Memory        640 Byte     4 GB
Width of bus              4 Bit        64 Bit
Performance (instr./s)    0.06 MIPS    3792 MIPS
Die size                  12 mm²       217 mm²
How does a processor work? 1/3
The architecture of a Personal Computer (numbers are theoretical!):
• The processor executes simple commands
• These are read out of memory
• But: Main memory is slow (theor.: 7-8 ns)
• The cache decouples the processor from memory (for well-behaved codes).
• Access to the devices and hard disks is esp. slow (hard disk: ~10 ms)
These are theoretical values only! To memory, you see 1.2 GB/s.
[Figure: block diagram with Processor, Cache, Northbridge, Memory, Graphics Card, Southbridge, Harddisk 1, Harddisk 2 and USB; link bandwidths shown: 12.8 GB/s, 2.13 GB/s, 6.4 GB/s, 320 MB/s, 60 MB/s]
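To get a feeling for what "slow" means here, the latencies on this slide can be converted into processor clock cycles. This is a rough sketch: the 3.5 GHz clock is an assumption taken from the Pentium 4 column of the history table, and the function name is made up for this example.

```python
# Convert access latencies into clock cycles the processor would spend waiting.
# ASSUMPTION: a 3.5 GHz clock, taken from the Pentium 4 figure quoted earlier.
CLOCK_HZ = 3.5e9


def latency_in_cycles(latency_s: float, clock_hz: float = CLOCK_HZ) -> float:
    """Number of clock cycles that elapse during the given latency."""
    return latency_s * clock_hz


latencies = {
    "main memory (theoretical, 8 ns)": 8e-9,
    "hard disk (~10 ms)": 10e-3,
}

for name, seconds in latencies.items():
    print(f"{name}: about {latency_in_cycles(seconds):,.0f} cycles")
```

Under these assumptions a memory access costs a few dozen cycles, while a disk access costs tens of millions, which is why the cache matters so much for well-behaved codes.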
How does a processor work? 2/3
• Instruction Fetch: Fetch the instruction which the PC points to.
• Instruction Decode: Decode the instruction add r3, r1, r2 and load the registers.
• Instruction Execute: The Arithmetic Logic Unit adds up the arguments.
• Write Back: Write the register value.
• Increment the Program Counter.
[Figure: processor block diagram with Register 1, Register 2, Register 3, Stack Pointer, Prog. Counter, ALU, FPU, Memory Management Unit and Cache]
How does a processor work? 3/3
• Instruction Fetch: Fetch the instruction which the PC points to.
• Instruction Decode: Decode the instruction jmp switch (= PC + Offset) and load PC and Offset into the ALU.
• Instruction Execute: The Arithmetic Logic Unit adds PC and Offset.
• Write Back: Write the register value.
• Increment the Program Counter: not necessary here.
[Figure: processor block diagram with Register 1, Register 2, Register 3, Stack Pointer, Prog. Counter, ALU, FPU, Memory Management Unit and Cache]
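The fetch, decode, execute and write-back steps for the add and jmp examples can be sketched as a toy interpreter. This is only an illustration: the two-instruction set, the tuple encoding and the `run` function are hypothetical and have nothing to do with the real IA-32 encoding.

```python
# Toy fetch/decode/execute loop for a hypothetical two-instruction machine.
# ("add", dst, a, b): dst = a + b, then the PC is incremented.
# ("jmp", offset):    PC = PC + offset, no increment afterwards.


def run(program, registers, max_steps=100):
    """Execute the program; return the registers and the list of visited PCs."""
    pc = 0
    trace = []
    for _ in range(max_steps):
        if not 0 <= pc < len(program):
            break                       # ran off the end of the program
        instruction = program[pc]       # Instruction Fetch
        op, *args = instruction         # Instruction Decode
        trace.append(pc)
        if op == "add":
            dst, a, b = args
            result = registers[a] + registers[b]  # Execute: ALU adds arguments
            registers[dst] = result               # Write Back
            pc += 1                               # Increment the PC
        elif op == "jmp":
            (offset,) = args
            pc = pc + offset            # Execute + Write Back: ALU adds PC and offset
        else:
            raise ValueError(f"unknown instruction: {op}")
    return registers, trace


program = [
    ("add", "r3", "r1", "r2"),  # 0: r3 = r1 + r2
    ("jmp", 2),                 # 1: jump to 1 + 2 = 3
    ("add", "r3", "r3", "r3"),  # 2: skipped by the jump
    ("add", "r1", "r3", "r2"),  # 3: r1 = r3 + r2
]
final_registers, visited = run(program, {"r1": 1, "r2": 2, "r3": 0})
print(final_registers, visited)
```

Note that the jump writes its result into the program counter itself, which is why the separate increment step is skipped for it.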
Pentium IV Hyperthreading
Picture of Pentium IV Die
Pentium IV processors
• A jump (backwards?) from Northwood (130 nm) to Prescott (90 nm):
Cache performance comparison
• Comparison of Read/Write Performance of Northwood & Prescott:

                       L1 Read Bandwidth        L2 Read Bandwidth
                       MB/s     Bytes/cycle     MB/s     Bytes/cycle
Northwood 3.06 GHz     23705    7.73            12162    3.97
Prescott 3.20 GHz      23206    7.25            13146    4.11

source: http://www.hardwareanalysis.com
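The Bytes/cycle columns can be reproduced from the MB/s columns by dividing by the clock rate in MHz. The sketch below does this under two assumptions that are not stated on the slide: 1 MB is taken as 10^6 bytes, and the "3.06 GHz" Northwood is taken to run at 3066 MHz.

```python
# Bytes per clock cycle = bandwidth [MB/s] / clock [MHz], with 1 MB = 10**6 bytes.
# ASSUMPTION: the "3.06 GHz" Northwood is clocked at 3066 MHz; Prescott at 3200 MHz.


def bytes_per_cycle(bandwidth_mb_s: float, clock_mhz: float) -> float:
    """Average number of bytes transferred per processor clock cycle."""
    return bandwidth_mb_s / clock_mhz


measurements = [
    # (processor, clock in MHz, L1 read MB/s, L2 read MB/s)
    ("Northwood", 3066, 23705, 12162),
    ("Prescott", 3200, 23206, 13146),
]

for name, mhz, l1, l2 in measurements:
    print(f"{name}: L1 {bytes_per_cycle(l1, mhz):.2f} B/cycle, "
          f"L2 {bytes_per_cycle(l2, mhz):.2f} B/cycle")
```

Expressed per cycle, Prescott's L1 cache delivers fewer bytes than Northwood's even though its clock rate is higher.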
Cache – Functioning of a Cache 1/3
• How is 1 GB of memory mapped into 1 MB of cache?
• The cache is organized in lines: 64 Bytes / line, 16384 lines
• If you load one byte within a cache-line (not yet in cache), the whole line is loaded:
[Figure: processor block diagram (Register 1, Register 2, Register 3, Stack Pointer, Prog. Counter, ALU, FPU, Memory Management Unit) with Cache and Memory; transfer sizes labelled 64 Bytes and 4 Bytes]
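The line count and the "whole line is loaded" behaviour can be checked with a few lines of arithmetic. This is a minimal sketch using the 1 MB cache and 64-byte lines from this slide; the function name is invented for the example.

```python
# Cache-line arithmetic for the example on this slide: 1 MB cache, 64-byte lines.
CACHE_BYTES = 1 * 1024 * 1024
LINE_BYTES = 64

NUM_LINES = CACHE_BYTES // LINE_BYTES  # number of lines the cache can hold


def line_range(address: int) -> tuple:
    """First and last byte address of the cache line containing `address`."""
    base = address - (address % LINE_BYTES)
    return base, base + LINE_BYTES - 1


print("lines in cache:", NUM_LINES)
first, last = line_range(1000)
print(f"loading byte 1000 brings in bytes {first}..{last}")
```

Touching a single byte therefore costs a full 64-byte transfer, which pays off only if neighbouring bytes are used soon afterwards.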
Cache – Functioning of a Cache 2/3
• Associativity of Cache:
  – Direct Mapped Cache: Every cache-line would be hard-allocated to memory – here 16384 memory addresses would share the same cache-line: inefficient.
  – Fully Associative Cache: Any cache-line may store data from any address in memory – this is not feasible in hardware: here 256 address comparators would be needed!
  – N-Way Set Associative: A compromise between the previous two. N parallel comparators are used, i.e. a line in memory may fit into one of N cache-lines.
• Pentium-4 Northwood: 4-Way associativity
• Pentium-4 Prescott: 8-Way associativity (better? slower!)
• If the address is cached in a cache-line: Good.
• If the address is not cached: fetch from memory, expel the "old" cache-line
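The three organisations differ only in how many lines form a set. The sketch below computes the number of sets and parallel comparators for a given cache size, assuming a simple modulo mapping of line numbers to sets and one comparator per way. The cache sizes used are illustrative assumptions (1 MB from the previous slide, and a hypothetical 16 KB cache), not values given on this slide.

```python
# Geometry of direct-mapped, N-way and fully associative caches.
# ASSUMPTIONS: modulo mapping of line numbers to sets, one comparator per way.


def cache_geometry(cache_bytes: int, line_bytes: int, ways: int) -> tuple:
    """Return (lines, sets, comparators) for the given cache organisation."""
    lines = cache_bytes // line_bytes
    if lines % ways != 0:
        raise ValueError("number of lines must be a multiple of the ways")
    return lines, lines // ways, ways


def set_index(address: int, line_bytes: int, sets: int) -> int:
    """Set in which the line containing `address` may be stored."""
    return (address // line_bytes) % sets


MB, KB = 1024 * 1024, 1024
print("1 MB direct mapped :", cache_geometry(1 * MB, 64, ways=1))
print("1 MB 8-way         :", cache_geometry(1 * MB, 64, ways=8))
print("1 MB fully assoc.  :", cache_geometry(1 * MB, 64, ways=16384))
# Hypothetical 16 KB cache: 256 lines, so a fully associative lookup
# would have to compare against 256 tags at once.
print("16 KB fully assoc. :", cache_geometry(16 * KB, 64, ways=256))
```

With direct mapping there is exactly one candidate line per address, with full associativity every line is a candidate, and N-way set associativity limits the comparison to the N lines of one set.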
Cache – Functioning of a Cache 3/3
• Which cache-line (of the N possible) to expel?
• Theory: Expel the one that is least likely (if at all) to be used in future.
• The Pentium-4 uses a pseudo Least-Recently Used (LRU) algorithm:
  – The part of the address information not needed is used for that:
    [Figure: 32-bit address (Addr) with bit positions 31, 15, 5 and 0 marked; label: touched]
• Why is there a separate Instruction Cache?
• The instruction stream has different access characteristics (more locality due to loops, jumps).
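The replacement decision within one set can be illustrated with a small simulation. Note that this sketch implements exact LRU, not the pseudo-LRU scheme of the Pentium 4, whose details are not given here; the class `LRUSet` and its interface are made up for the example.

```python
from collections import OrderedDict


class LRUSet:
    """One set of an N-way set-associative cache with exact LRU replacement.

    This is a simplification: real hardware such as the Pentium 4 uses a
    cheaper pseudo-LRU approximation.
    """

    def __init__(self, ways: int):
        self.ways = ways
        self.lines = OrderedDict()  # tag -> None, least recently used first

    def access(self, tag: int) -> tuple:
        """Access a line; return (hit, evicted_tag_or_None)."""
        if tag in self.lines:
            self.lines.move_to_end(tag)      # mark as most recently used
            return True, None
        evicted = None
        if len(self.lines) >= self.ways:
            evicted, _ = self.lines.popitem(last=False)  # expel the LRU line
        self.lines[tag] = None
        return False, evicted


cache_set = LRUSet(ways=2)
for tag in (1, 2, 1, 3, 2):
    hit, evicted = cache_set.access(tag)
    print(f"tag {tag}: {'hit' if hit else 'miss'}, evicted: {evicted}")
```

In the access sequence above, tag 2 is expelled when tag 3 arrives because tag 1 was touched more recently, and re-loading tag 2 then expels tag 1.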
Dual-Core CPUs
• To speed up computers, the frequency will be less & less important.
• Instead, multiple cores are being employed on the die: e.g. the fastest dual-core chip Intel 840D: two cores, each with two HT.
• All of them share the cache.
[Figure: two Pentium 4 cores (3.4 GHz each) with Memory Controller Hub (MCH), Memory, AGP 4x and I/O Controller Hub; link bandwidths shown: 3.2 GB/s, 1.6 GB/s, 1 GB/s, 266 MB/s, 1.6 GB/s]
Dual-Core CPUs
• AMD Opteron's Hypertransport is a solution for Dual-CPU/Dual-Core SMP-Systems with High Memory IO-Requirements:
[Figure: two Opteron processors (2.6 GHz each), each with its own memory (Mem), linked by Hypertransport (16-Bit, 1 GHz, 8 GB/s); attached: PCI-X Tunnel, Bus conn., PCI-Express, Gigabit Ethernet, SATA Disks, Legacy Peripheral]