The Beowulf Cluster at the Center for Computational Mathematics, CU-Denver


  1. The Beowulf Cluster at the Center for Computational Mathematics, CU-Denver • www-math.cudenver.edu/ccm/beowulf • Jan Mandel, CCM Director • Russ Boice, System Administrator • Supported by National Science Foundation Grant DMS-0079719 • CLUE North, September 18, 2003

  2. Overview • Why a Beowulf cluster? • Parallel programming • Some really big clusters • Design constraints and objectives • System hardware and software • System administration • Development tools • The burn-in experience • Lessons learned

  3. Why a Beowulf Cluster? • Parallel supercomputer on the cheap • Take advantage of bulk datacenter pricing • Open source software tools available • Uniform system administration • Looks like one computer from the outside • Better than a network of workstations

  4. Why parallel programming • Speed: Divide problem into parts that can run on different CPUs – Communication between the parts is necessary, and – the art of efficient parallel programming is to minimize the communication • Memory: On a cluster, the memory needed for the problem is distributed between the nodes • But parallel programming is hard!
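
The "divide the problem into parts" idea on slide 4 can be illustrated with a block decomposition. This is a minimal sketch, not code from the talk: the function names `block_range`, `partial_sum`, and `parallel_style_sum` are hypothetical, and the workers are emulated serially in one process, so it shows the pattern rather than a real MPI run. Each worker owns a contiguous slice of the data and does all its work locally. Only one number per worker would have to be communicated.

```python
def block_range(n_items, n_workers, rank):
    """Return the half-open index range [start, stop) owned by worker `rank`."""
    base, extra = divmod(n_items, n_workers)
    # The first `extra` workers take one additional item each.
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def partial_sum(data, start, stop):
    """Local work on one worker; needs no communication."""
    return sum(data[start:stop])


def parallel_style_sum(data, n_workers):
    """Emulate the pattern serially: local work, then combine one value per worker."""
    partials = [
        partial_sum(data, *block_range(len(data), n_workers, rank))
        for rank in range(n_workers)
    ]
    # Combining the partial results is the only step that would need communication.
    return sum(partials)
```

In a real message passing code the final `sum(partials)` would be a reduction across processes. Keeping that step small relative to the local work is the point of the slide.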

  5. Numerical parallel programming software layers • Top: distributed parallel object libraries (PETSc, HYPRE,…), OpenMP, High Performance Fortran (HPF) • Middle: message passing libraries (MPI, PVM), shared memory (hardware, virtual) • Bottom: interconnect hardware drivers (ethernet, SCI, Myrinet…)

  6. Top 500 list • Maintained by www.top500.org • Speed measured in floating point operations per second (FLOPs) • LINPACK benchmark = solving dense square linear systems of algebraic equations by Gaussian elimination www.netlib.org • Published twice a year at the International Supercomputing Conference
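
To make the LINPACK bullet concrete, here is a small sketch of Gaussian elimination with partial pivoting in plain Python, together with the operation count conventionally used to convert a run time into a FLOP rate (2/3 n³ + 2n²). It is illustrative only: the real benchmark uses tuned library code, and the names `solve_dense` and `linpack_flops` are made up for this example.

```python
def solve_dense(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for k in range(n):
        # Partial pivoting: bring the largest entry of column k to the diagonal.
        p = max(range(k, n), key=lambda i: abs(m[i][k]))
        if m[p][k] == 0.0:
            raise ValueError("matrix is singular")
        m[k], m[p] = m[p], m[k]
        # Eliminate column k from the rows below the pivot.
        for i in range(k + 1, n):
            f = m[i][k] / m[k][k]
            for j in range(k, n + 1):
                m[i][j] -= f * m[k][j]
    # Back substitution on the upper triangular system.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def linpack_flops(n):
    """Nominal operation count for an n x n solve: 2/3 n^3 + 2 n^2."""
    return 2.0 * n**3 / 3.0 + 2.0 * n**2
```

Dividing `linpack_flops(n)` by the measured solve time gives the FLOP rate for problem size `n`.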

  7. Source: Jack Dongarra http://www.cs.utk.edu/~dongarra/esc.pdf

  8. Source: Jack Dongarra http://www.cs.utk.edu/~dongarra/esc.pdf

  9. Source: www.top500.org

  10. Design objectives and constraints • Budget $200,000, including 3 year warranty • Maximize computing power in GFLOPs • Maximize communication speed • Maximize memory per node • Run standard MPI codes • Nodes useful as computers in themselves • Use existing application software licenses • Run existing software, porting, development • Remote control of everything, including power • System administration over low bandwidth links

  11. Basic choices • Linux, because – It is free and we have been using it on the whole network already for years – Cluster software runs on Linux – Our applications run on Linux • Thick nodes, with disks and complete Linux, because – Nodes need to be useful for classical computations – Local disks are faster than over network – Tested Scyld (global process space across the cluster), which did not work well at the time – At least we know how to make them run

  12. Interconnects available in early 2001 • 100Mb/s ethernet: slow, high latency • 1Gb/s ethernet: expensive (fibre only), high latency • Myrinet: nominal 2Gb/s duplex, star topology, needs expensive switch • SCI (Dolphin): nominal 10Gb/s, actual 1.6Gb/s duplex, torus topology, no switch, best latency and best price per node. Also promised serial consoles and remote power cycling of individual nodes. • Dolphin and Myrinet avoid TCP/IP stack • Speed limited by the PCI bus - 64bit 66MHz required to get fast communication • Decision: SCI Dolphin Wulfkit with Scali cluster software

  13. x86 CPUs available in early 2001 • Intel PIII: 1GHz, approx. 0.8 GFLOPs – Dual CPUs = best GFLOPs/$ – 64bit 66MHz PCI bus available on server class motherboards – 1U 2CPU possible – Cheap DRAM • Intel P4 1.5GHz – SSE2 = double precision floating point vector processor – Theoretically fast, but no experience with SSE2 at the time – No 64bit 66MHz PCI bus, no dual processors – Rambus memory only, expensive • AMD Athlon – Not available with dual processors – No experience in house • Decision: Dual PIII, server class motherboard, 1U

  14. Disks available in early 2001 • ATA100 – Internal only, 2 devices/bus, no RAID – Simple drives, less expensive • Ultra160 SCSI – Internal/external, RAID – 16bit bus, up to 160MB/s – Up to 16 devices/channel – More intelligence on drive, more expensive – Disk operation interleaving – High-end server class motherboards have SCSI • Decision: Ultra160 SCSI

  15. Remote console management • Goal: manage the cluster from off campus • Considered KVM switches • Solutions exist to convert KVM to a graphics session, but – Required a windoze client – And lots of bandwidth, even DSL may not be enough (bad experience with sluggish VNC even over 10Mb/s) – Would the client run through a firewall? • All we wanted was to convert KVM to a telnet session when the display is in green screen text mode – when we are up and running X we do not need a console … but found no such gadget on the market • Decision: console through serial port and reverse telnet via terminal servers

  16. Purchasing • Bids at internet prices + few % for integration, delivery, installation, tech support, and 3 year warranty • Vendor acts as a single point for all warranties and tech support (usual in the cluster business) • Worked out detailed specs with vendors – DCG, became ATIPA in the process – Paralogic – Finally bought from Western Scientific

  17. The Beowulf Cluster at CCM

  18. Cluster hardware • 36 nodes (Master + 35 slaves) – Dual PIII-933MHz, 2GB memory – Slaves have 18GB IBM Ultrastar Ultra160 SCSI disk, floppy • Master node – mirrored 36GB IBM Ultrastar Ultra160 SCSI disk, CDROM – External enclosure 8*160GB Seagate Barracuda Ultra160 SCSI, PCI RAID card with dual SCSI channels, mirrored & striped – Dual gigabit fiber ethernet – VXA 30 AIT tape library • SCI Dolphin interconnect • Cluster infrastructure – 100Mb/s switch with gigabit fiber uplink to master – 4 APC UPS 2200 with temperature sensors and ethernet – 3 Perle IOLAN+ terminal servers for the serial consoles – 10Mb/s hub for the utility subnet (UPS, terminal servers)

  19. Performance • CPU theoretical ~60 GFLOPs • Actual 38 GFLOPs LINPACK benchmark • Disk array: 4 striped disks@40MB/s on 160MB/s channel=160MB/s theoretical, 100MB/s actual disk array bandwidth • SCI interconnect: 10Gb/s between cards, card to node 528MB/s theoretical (PCI), 220MB/s actual bandwidth, <5µs latency
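
The arithmetic behind the figures on slide 19 can be checked in a few lines. This sketch uses only numbers quoted on the slides. The variable names are illustrative, and the efficiency ratios are derived here rather than stated in the talk.

```python
# Figures quoted on the slides.
PCI_WIDTH_BITS = 64       # 64bit PCI bus
PCI_CLOCK_MHZ = 66        # 66MHz PCI clock
DISKS_IN_STRIPE = 4       # 4 striped disks
DISK_MB_PER_S = 40        # 40MB/s per disk
PEAK_GFLOPS = 60.0        # "CPU theoretical ~60 GFLOPs"
LINPACK_GFLOPS = 38.0     # measured LINPACK result
SCI_ACTUAL_MB_PER_S = 220
DISK_ACTUAL_MB_PER_S = 100

# 8 bytes per bus cycle times 66 million cycles per second.
pci_mb_per_s = PCI_WIDTH_BITS // 8 * PCI_CLOCK_MHZ
# Striping adds up the per-disk rates, capped here by the 160MB/s channel.
stripe_mb_per_s = DISKS_IN_STRIPE * DISK_MB_PER_S

# Derived ratios (not stated on the slide).
linpack_efficiency = LINPACK_GFLOPS / PEAK_GFLOPS
sci_efficiency = SCI_ACTUAL_MB_PER_S / pci_mb_per_s
disk_efficiency = DISK_ACTUAL_MB_PER_S / stripe_mb_per_s

print(f"PCI theoretical: {pci_mb_per_s} MB/s")
print(f"Stripe theoretical: {stripe_mb_per_s} MB/s")
print(f"LINPACK / peak: {linpack_efficiency:.0%}")
print(f"SCI actual / PCI: {sci_efficiency:.0%}")
print(f"Disk actual / stripe: {disk_efficiency:.1%}")
```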

  20. Power and network diagram (labels) • To four 30 Amp 115 VAC circuits • Each UPS supplies two power strips, which then supply nodes and other equipment • 36 nodes, 2 CPU, 2GB RAM each • Master node, Node 1, Node 10, Node 20, Node 30, Node 35 • UPS power • SCI cable interconnect • RS 232 • 100Mb/s Ethernet • Fiber gigabit link • 100Mb/s switch • Three terminal controllers • 10Mb/s hub • To Internet • Fiber gigabit to Internet

  21. SCI Dolphin interconnect topology: 6 x 6 2D torus (M = master)
       1  13  25  31  19   7
       3  15  27  33  21   9
       5  17  29  35  23  11
       4  16  28  34  22  10
       2  14  26  32  20   8
       M  12  24  30  18   6
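
A 2D torus needs no switch because every node links directly to its neighbours, with the edges wrapping around. The sketch below shows that wrap-around arithmetic on grid coordinates for a 6 x 6 torus. It is generic torus geometry, not the Scali or Dolphin routing. It assumes links can be used in both directions, which the slides do not state, and the function names are hypothetical.

```python
SIDE = 6  # 6 x 6 torus, 36 nodes


def torus_neighbors(row, col, side=SIDE):
    """Return the four neighbours of (row, col), wrapping around at the edges."""
    return [
        ((row - 1) % side, col),
        ((row + 1) % side, col),
        (row, (col - 1) % side),
        (row, (col + 1) % side),
    ]


def torus_hops(a, b, side=SIDE):
    """Minimum hop count between two positions, going the shorter way around each ring.

    Assumes bidirectional links (an assumption, not taken from the slides).
    """
    dr = abs(a[0] - b[0])
    dc = abs(a[1] - b[1])
    return min(dr, side - dr) + min(dc, side - dc)
```

Under these assumptions, opposite corners of the grid are only two hops apart, and the farthest pair of nodes is six hops apart.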

  22. Mass storage • Master node: internal 36GB mirrored drives on 160MB/s SCSI • Two 160MB/s SCSI channels to eight 160 GB SCSI drives (with RAID leaves 698 GB actual capacity) • One SCSI channel to two VXA tape drives with 30 tape auto load library (recording rate: 4000 kB/s; 70 GB typical compressed capacity per tape) • Nodes 1 to 35: 18 GB internal SCSI drives

  23. WARNING: EXPLICIT PICTURES OF BIG HARD DRIVES NEXT

  24. The master node

  25. The slave nodes

  26. The back • Serial console cables • Ethernet cables • Keyboard, monitor, and mouse plugged into a slave node • SCI Dolphin cables

  27. The uninterruptible power supplies (UPS), utility hub, two UPS network interface boxes, and temperature sensor on top

  28. Disk array 8*160GB • Tape library • An extra fan to blow in the beast’s face

  29. The beast eats lots of power

  30. Perfectly useless doors that came with the beast but just inhibit air flow. Note the shiny surface of the left door, holes only on the sides.

  31. Ethernet switch • Terminal servers for serial consoles

  32. The disk array and the tape library

  33. The 10Gb/s SCI Dolphin cables are a bit too long.

  34. Office power strips were commandeered to distribute the load between the outlets on the uninterruptible power supplies to avoid tripping the circuit breakers.

  35. The backplane in gory detail

  36. Cluster software • Red Hat Linux 7.2 • Scali cluster software – SCI Dolphin accessible through MPI library only – Management tools, Portable Batch System (PBS) • Portland Group compilers – C, C++, Fortran 90 – High Performance Fortran (HPF) is the easiest way to program the cluster • TotalView debugger by Etnus – Switch between groups of processes in an MPI job, control all processes at once – Alternative: one gdb per process in an xterm window…

  37. Networking • All slaves and one master fiber interface run a local network with NAT • Master is the gateway, only node visible from the outside • Disk array on master shared by NFS with slaves • Utility hub (power supplies, serial consoles) is accessible from the outside • Master runs ntp server – important to keep time in sync • All other protocols pass through master to the outside • FlexLM license server for apps is outside of cluster
