
Introduction to PC-Cluster Hardware I
Russian-German School on High-Performance Computer Systems, 27th June - 6th July, Novosibirsk
Day 1, 27th of June, 2005
HLRS, University of Stuttgart
Introduction to PC-Cluster Hardware I Slide 1 High Performance Computing Center Stuttgart

Outline
• Motivation
• Hardware Architectures
  – Architectural design of Classic Personal Computers
  – IA32-architecture, Pentium-4 series of processors
  – Pipelining
  – Multiprocessor Architecture
  – Examples
  – Evolution of SuperComputers

Motivation

We need the compute power
• Relevant engineering problems require performance that is orders of magnitude higher than what is available
• CFD: Simulation of turbulence at a reasonable level of resolution
• Combustion: Combination of turbulence simulation and realistic chemical models
• Climate simulation: Requires a resolution that is orders of magnitude higher than today's
• Bioinformatics, Chemistry, ...

How Has Compute Power Been Increasing?
• Moore's law: The performance of a computer doubles every 18 months
• This was realized by:
  – Downsizing the structures on the silicon
  – Increasing the clock frequency
  – Adding functional units
  – Improving the functional units
• Physical limits
  – Speed of light: at a clock rate of 10 GHz, the signal travel distance within one clock tick is 3 cm
  – Cooling (packaging)
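The two numbers on this slide can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function names are made up for this example, and it takes the vacuum speed of light as an upper bound (signals on a real chip travel slower).

```python
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # vacuum value; an upper bound for on-chip signals


def distance_per_tick_cm(clock_hz: float) -> float:
    """Farthest a signal could travel during one clock period, in centimetres."""
    return SPEED_OF_LIGHT_M_PER_S / clock_hz * 100.0


def moore_growth_factor(years: float, doubling_years: float = 1.5) -> float:
    """Performance growth factor if performance doubles every 18 months."""
    return 2.0 ** (years / doubling_years)


if __name__ == "__main__":
    for ghz in (1.0, 3.5, 10.0):
        print(f"{ghz:5.1f} GHz: {distance_per_tick_cm(ghz * 1e9):5.1f} cm per clock tick")
    print(f"Growth over 15 years: factor {moore_growth_factor(15):.0f}")
```

At 10 GHz the bound comes out at roughly 3 cm, which is the figure quoted on the slide.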

We cannot go on like this
• Surprisingly, it looks like we are already at the physical limit:
  – Intel cancelled the current Pentium IV development line
  – The clock rate can no longer grow by orders of magnitude (7 GHz looks to be the current limit due to leakage current)
• Fast hardware (e.g. ECL or GaAs) has a high power consumption, therefore the potential for higher integration is limited
→ The processor suppliers announced that future CPUs will have several processors on a die (currently 2 processors / 2 HT)
→ In future, parallel architectures will be essential and everywhere, even at the desk.

Motivation
[Figure labels: Questions, Response]

Abstract Model
[Figure labels: Reality, Physical Model, Mathematical Model, Numerical Scheme, Application Program, a few parallel Programming Models (e.g. MPI, HPF, OpenMP), Hardware Architecture; Questions & Response]

Hardware Architectures

History – Intel Chips of the 4th generation (starting 1972)
• "Highly Integrated Circuits" ("Large Scale Integration", LSI – VLSI). Back then: thousands of transistors per cm²
• In 1971, Intel designs an all-round processor for the Japanese firm Busicom: the 4004

                         4004         Pentium 4
Transistors              2300         42 million
Technology               10 µm        0.13 µm
Frequency                108 kHz      3.5 GHz
Addressable memory       640 Byte     4 GB
Width of bus             4 Bit        64 Bit
Performance (instr./s)   0.06 MIPS    3792 MIPS
Die size                 12 mm²       217 mm²

How does a processor work? 1/3
The architecture of a Personal Computer (numbers are theoretical!):
• The processor executes simple commands
• These are read out of memory
• But: main memory is slow (theor.: 7-8 ns)
• The cache decouples the processor from memory (for well-behaved codes).
• Access to the devices and hard disks is especially slow (hard disk: ~10 ms)
These are theoretical values only! In practice, you see about 1.2 GB/s to memory.
[Figure: block diagram with Processor, Cache, Graphics Card, Northbridge, Memory, Southbridge, Harddisk 1, Harddisk 2 and USB; link bandwidths shown: 12.8 GB/s, 2.13 GB/s, 6.4 GB/s, 320 MB/s, 60 MB/s]

How does a processor work? 2/3
• Instruction Fetch: Fetch the instruction which the PC points to.
• Instruction Decode: Decode the instruction add r3, r1, r2 and load the registers.
• Instruction Execute: The Arithmetic Logic Unit adds up the arguments.
• Write Back: Write the register value.
• Increment the Program Counter.
[Figure: processor with Register 1, Register 2, Register 3, Stack Pointer, Prog. Counter, ALU, FPU, Memory Management Unit and Cache]

How does a processor work? 3/3
• Instruction Fetch: Fetch the instruction which the PC points to.
• Instruction Decode: Decode the instruction jmp switch (= PC + Offset). Load PC and Offset into the ALU.
• Instruction Execute: The Arithmetic Logic Unit adds PC and Offset.
• Write Back: Write the register value.
• Increment the Program Counter: not necessary here.
[Figure: processor with Register 1, Register 2, Register 3, Stack Pointer, Prog. Counter, ALU, FPU, Memory Management Unit and Cache]
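The fetch, decode, execute and write-back steps from the last two slides can be sketched as a toy interpreter. This is not a model of the IA32 instruction set: the tuple encoding of instructions, the register names and the function `run` are all assumptions made for this illustration, and the jump offset is taken relative to the address of the jump instruction itself.

```python
def run(program, registers, max_steps=100):
    """Run a toy program and return the register file.

    program   -- list of tuples: ("add", dst, src1, src2) or ("jmp", offset)
    registers -- dict mapping register names to integer values
    """
    pc = 0  # program counter: index of the next instruction
    steps = 0
    while pc < len(program) and steps < max_steps:
        instruction = program[pc]          # instruction fetch
        opcode, *operands = instruction    # instruction decode
        if opcode == "add":
            dst, src1, src2 = operands
            result = registers[src1] + registers[src2]  # execute: ALU adds the arguments
            registers[dst] = result                     # write back the register value
            pc += 1                                     # increment the program counter
        elif opcode == "jmp":
            (offset,) = operands
            pc = pc + offset  # execute: ALU adds PC and offset; no separate increment
        else:
            raise ValueError(f"unknown opcode: {opcode}")
        steps += 1
    return registers


if __name__ == "__main__":
    program = [
        ("add", "r3", "r1", "r2"),  # r3 = r1 + r2
        ("jmp", 2),                 # skip the next instruction
        ("add", "r3", "r3", "r3"),  # never executed
        ("add", "r1", "r3", "r2"),  # r1 = r3 + r2
    ]
    print(run(program, {"r1": 2, "r2": 3, "r3": 0}))
```

The `max_steps` guard is only there so that a backward jump cannot loop forever in the toy.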

Pentium IV Hyperthreading

Picture of Pentium IV Die

Pentium IV processors
• A jump (backwards?) from Northwood (130 nm) to Prescott (90 nm):

Cache performance comparison
• Comparison of Read/Write Performance of Northwood & Prescott:

L1 Read Bandwidth     MB/s     Bytes/cycle
Northwood 3.06 GHz    23705    7.73
Prescott 3.20 GHz     23206    7.25

L2 Read Bandwidth     MB/s     Bytes/cycle
Northwood 3.06 GHz    12162    3.97
Prescott 3.20 GHz     13146    4.11

source: http://www.hardwareanalysis.com

Cache – Functioning of a Cache 1/3
• How is 1 GB memory mapped into 1 MB cache?
• The cache is organized in lines: 64 Bytes / line, 16384 lines
• If you load one byte within a cache-line (not yet in cache), the whole line is loaded:
[Figure: processor (Register 1, Register 2, Register 3, Stack Pointer, Prog. Counter, ALU, FPU, Memory Management Unit), Cache and Memory; transfer sizes shown: 4 Bytes, 64 Bytes]
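The line organisation can be made concrete with a little address arithmetic. The sketch below uses the slide's figures (64 bytes per line, 16384 lines); the function names are invented for this example and do not correspond to any hardware interface.

```python
LINE_SIZE = 64       # bytes per cache line (from the slide)
NUM_LINES = 16384    # cache lines (from the slide)
CACHE_SIZE = LINE_SIZE * NUM_LINES  # 1 MB in total


def split_address(addr: int) -> tuple[int, int]:
    """Split a byte address into (memory line number, offset within the line)."""
    return addr // LINE_SIZE, addr % LINE_SIZE


def loaded_range(addr: int) -> tuple[int, int]:
    """First and last byte address brought into the cache when addr misses."""
    start = addr - addr % LINE_SIZE
    return start, start + LINE_SIZE - 1


if __name__ == "__main__":
    print(f"cache size: {CACHE_SIZE} bytes")
    line, offset = split_address(1000)
    first, last = loaded_range(1000)
    print(f"address 1000 -> line {line}, offset {offset}; loads bytes {first}..{last}")
```

Touching any single byte in the range 960..1023 therefore pulls in all 64 bytes, which is why accesses to neighbouring addresses are cheap once the first one has missed.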

Cache – Functioning of a Cache 2/3
• Associativity of Cache:
  – Direct Mapped Cache: Every cache-line would be hard-wired to fixed memory addresses – here 16384 memory addresses would share the same cache-line: inefficient.
  – Fully Associative Cache: Any cache-line may store data from any address in memory – this is not feasible in hardware: it would need 256 address comparators here!
  – N-Way Set Associative: A compromise between the previous two. N parallel comparators are used, i.e. a line in memory may fit into one of N cache-lines.
• Pentium-4 Northwood: 4-way associativity
• Pentium-4 Prescott: 8-way associativity (better?, slower!)
• If the address is cached in a cache-line: good.
• If the address is not cached: fetch from memory, expel the "old" cache-line

Cache – Functioning of a Cache 3/3
• Which cache-line (of the N possible) to expel?
• Theory: Expel the one that is least likely (if at all) to be used in future.
• The Pentium-4 uses a pseudo Least-Recently Used (LRU) algorithm:
  – The part of the address information not needed is used for that:
    [Figure: address word with bit positions 31, 15, 5 and 0 marked, and a "touched" field]
• Why is there a separate Instruction Cache?
• The instruction stream has different access characteristics (more locality due to loops, jumps).
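To see how an N-way set-associative cache decides between a hit, a fill and an eviction, a small software model can help. Note the substitution: the sketch below implements exact LRU, not the pseudo-LRU scheme of the Pentium-4, whose details the slide does not give. The class name and its parameters are hypothetical.

```python
from collections import OrderedDict


class SetAssociativeCache:
    """Toy N-way set-associative cache with exact LRU replacement."""

    def __init__(self, num_sets: int, ways: int, line_size: int = 64):
        self.num_sets = num_sets
        self.ways = ways
        self.line_size = line_size
        # One ordered dict per set: keys are tags, oldest entry first.
        self.sets = [OrderedDict() for _ in range(num_sets)]
        self.hits = 0
        self.misses = 0

    def access(self, addr: int) -> bool:
        """Access a byte address; return True on a hit, False on a miss."""
        line = addr // self.line_size
        index = line % self.num_sets   # which set the line must go into
        tag = line // self.num_sets    # identifies the line within that set
        entries = self.sets[index]
        if tag in entries:
            entries.move_to_end(tag)   # mark as most recently used
            self.hits += 1
            return True
        self.misses += 1
        if len(entries) >= self.ways:
            entries.popitem(last=False)  # expel the least recently used line
        entries[tag] = True              # fetch the line from memory
        return False


if __name__ == "__main__":
    cache = SetAssociativeCache(num_sets=1, ways=2)
    for addr in (0, 64, 0, 128, 64):
        print(addr, "hit" if cache.access(addr) else "miss")
```

With a single 2-way set, the access sequence 0, 64, 0, 128, 64 gives one hit (the second access to 0); the access to 128 expels line 64, so the final access to 64 misses again.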

Dual-Core CPUs
• To speed up computers, the frequency will be less and less important.
• Instead, multiple cores are being employed on the die: e.g. the fastest dual-core chip Intel 840D: two cores, each with two HT.
• All of them share the cache.
[Figure: two Pentium 4 (3.4 GHz each), Memory Controller Hub (MCH), Memory, AGP 4x, I/O Controller Hub; link bandwidths shown: 3.2 GB/s, 1.6 GB/s, 1 GB/s, 266 MB/s, 1.6 GB/s]

Dual-Core CPUs
• AMD Opteron's Hypertransport is a solution for Dual-CPU/Dual-Core SMP-Systems with High Memory IO-Requirements:
[Figure: two Opterons (2.6 GHz each), each with its own Mem; Hypertransport (16-Bit, 1 GHz, 8 GB/s); PCI-X Tunnel; bus connections to PCI-Express, Gigabit Ethernet, SATA Disks, Legacy Peripheral]
