PAD Cluster: An Open, Modular and Low Cost High Performance Computing System
Volnys Borges Bernal, Sergio Takeo Kofuji, Guilherme Matos Sipahi, Marcio Lobo Netto (Laboratório de Sistemas Integráveis, EPUSP)
Alan G. Anderson (Elebra Defesa e Controles Ltda)
Agenda
• Main Objectives
• PAD Cluster Environment
• PAD Cluster Architecture
• Communication Libraries
• System Administrator Tools
• Operator Tools
• User Tools
• Development Environment
PAD Cluster
• Main goals
– Parallel cluster-based computing environment
• Based on commodity components
• High-performance communication medium
• Development environment for Fortran77, Fortran90 & HPF
• MPI interface
• IEEE POSIX UNIX interface
• X-Windows interface
– Initial application:
• RAMS (Regional Atmospheric Modeling System)
– Development: LSI-EPUSP + Elebra, with FINEP support
PAD Cluster
• Characteristics
– Use of high-performance commodity components
– Linux operating system
• Important: integration of
– Hardware components
– Software subsystems
PAD Cluster Environment
• Configuration & Operation
– ClusterMagic: system configuration & replication
– LSF: job scheduling
– Multiconsole, cluster partitioning utilities, monitoring
• User Interface and Utilities
– CDE, Cluster Windows Interface
– PAD-ptools: parallel UNIX utilities
– POSIX Unix interface
• Development Tools
– Compilers: GNU C, C++; Portland F77, F90, HPF
– Tools: Portland profiler and debugger
– Libraries: MPI (MPICH, MPICH-FULL, FULL Myrinet API/BPI); BLAS, LAPACK, BLACS, ScaLAPACK
PAD Cluster Architecture
• System Architecture
– Processing nodes
– Access workstation (connected to the external network)
– Administration workstation (multi-serial connections to the nodes)
– Fast-Ethernet switch
– Myrinet switch
– Synchronization hardware
PAD Cluster Architecture
• Node Architecture (dual-processor node)
– 2 × Intel Pentium II 333 MHz
– RAM
– PCI bridge, LM78 hardware monitor
– Myrinet, SCSI and Fast-Ethernet controllers
Communication Infrastructure
• Primary Network
– Fast-Ethernet
– General-purpose network
• For traditional network services (NFS, DNS, SNMP, XNTP, …)
– Uses the operating system TCP/IP stack
Communication Infrastructure
• High Performance Network
– Myrinet
– Carries application data
– Communication libraries:
• MPICH over the operating system TCP/IP stack
• FULL user-level interface library
• MPICH-FULL user-level interface library
Communication Libraries
• MPICH Library
– MPI over the TCP/IP stack
• FULL Library
– User-level communication library
– Developed at LSI-EPUSP in 1998
– Implementation based on Cornell's U-Net
• MPICH-FULL Library
– User-level communication library
– Internode communication: MPICH + FULL
– Intranode communication: MPICH + shared memory
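The split above can be sketched as a dispatch rule: MPICH-FULL sends intranode messages through shared memory and internode messages through the FULL/Myrinet path. The rank-to-node mapping and function names below are hypothetical, not the actual MPICH-FULL code:

```python
# Illustrative sketch only: how MPICH-FULL chooses a transport per message.
# Assumes contiguous rank placement on dual-processor nodes.

def node_of(rank, procs_per_node=2):
    """Map an MPI rank to its node index."""
    return rank // procs_per_node

def choose_channel(src_rank, dst_rank, procs_per_node=2):
    """Pick the transport used for a message between two ranks."""
    if node_of(src_rank, procs_per_node) == node_of(dst_rank, procs_per_node):
        return "shared-memory"   # intranode: MPICH + shared memory
    return "full/myrinet"        # internode: MPICH + FULL user-level library

# Ranks 0 and 1 share one dual node; rank 2 lives on the next node.
print(choose_channel(0, 1))  # shared-memory
print(choose_channel(0, 2))  # full/myrinet
```

The same split explains the two performance curves on the next slide: the shared-memory path never touches the network.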
Communication Libraries
• MPICH-FULL performance (bandwidth in Mbytes/s vs. message size in bytes; three plots):
– Shared memory: 2 processes on one 333 MHz dual node
– Myrinet: 2 processes (1 per node) on two 333 MHz dual nodes
– Myrinet: 4 processes (2 per node) on two 333 MHz dual nodes
Communication Infrastructure
• Synchronization Hardware
– Supports collective MPI operations
– Implemented in an FPGA
– Interfaces for 8 nodes
– Based on PAPERS
– Operations: barrier, broadcast, allgather, allreduce
– Global wall clock
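As a reminder of what the hardware accelerates, the allreduce collective has these semantics: every node contributes one value, and every node receives the combined result. The software model below is illustrative only; in the PAD cluster this happens in the FPGA, not in software:

```python
# Software model of allreduce semantics for the 8 nodes the
# synchronization hardware connects.  Illustrative sketch, not the
# hardware implementation.
from functools import reduce

def allreduce(values, op=lambda a, b: a + b):
    """Combine one value per node; return the result to all nodes."""
    combined = reduce(op, values)     # reduction phase
    return [combined] * len(values)   # broadcast phase

node_values = [1, 2, 3, 4, 5, 6, 7, 8]   # one contribution per node
print(allreduce(node_values))            # every node sees 36 (SUM)
print(allreduce(node_values, max))       # every node sees 8 (MAX)
```

Doing the combine-and-fan-out step in dedicated hardware removes the log-depth software message exchanges that a cluster would otherwise need for each collective.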
Communication Infrastructure
• Serial Lines
– Connect each node to the administration workstation
– Allow remote console access from the administration workstation
System Administrator Tools
• ClusterMagic
– Two main functions:
• Cluster configuration
• Node replication
– Advantages
• Easy configuration / reconfiguration
• Ensures uniformity
• Fast node replication
System Administrator Tools
• ClusterMagic: Cluster Configuration
– The operator edits a single cluster.conf file; ClusterMagic generates the common and node-specific configuration files:
• hosts, hosts.equiv, rhosts, HOSTNAME
• bootptab, network, ifcfg-eth0, ifcfg-lo, DNS server files, nsswitch.conf, resolv.conf
• fstab, exports, profile, lilo.conf, inittab
• issue, issue.net, motd
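The generation step can be sketched as follows. The cluster.conf layout, host names and addresses here are invented for illustration; they are not the real ClusterMagic format:

```python
# Hypothetical sketch of ClusterMagic-style file generation: one
# cluster description drives both common files (/etc/hosts) and
# per-node files (HOSTNAME).  Names and IPs are made up.
cluster_conf = {
    "domain": "pad.cluster",
    "nodes": {"node00": "10.0.0.10", "node01": "10.0.0.11"},
}

def generate_hosts(conf):
    """Emit one /etc/hosts file shared by every node."""
    lines = ["127.0.0.1 localhost"]
    for name, ip in sorted(conf["nodes"].items()):
        lines.append(f"{ip} {name}.{conf['domain']} {name}")
    return "\n".join(lines) + "\n"

def generate_hostname(conf, node):
    """Emit the per-node HOSTNAME file."""
    return node + "." + conf["domain"] + "\n"

print(generate_hosts(cluster_conf))
print(generate_hostname(cluster_conf, "node00"))
```

Deriving every file from one source is what gives the uniformity guarantee listed on the previous slide: no node can drift from the cluster description.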
System Administrator Tools
• ClusterMagic: Node Replication
– Node installation based on the replication of a “womb node”
– The ClusterMagic replication diskette:
• Boots a small Linux system
• Partitions the disk
• Copies the womb image
• Installs the configuration files
• Initializes the boot sector
– Automatic process
– Takes about 12 minutes
Operator Tools
• Xadmin
– Cluster partitioning
– Remote commands
• Multiconsole
– Node console access
• Job Scheduling
– Job submission
– LSF integrated with cluster partitioning
• Cluster Monitoring
Operator Tools
• Xadmin: Node Partitioning
– The cluster partitioning tool groups nodes N0–N11 into disjoint partitions (e.g., P1, P2, P3)
Operator Tools • Xadmin – Remote Commands
Operator Tools • Multiconsole
Operator Tools • Cluster Monitoring – Java + SNMP agents
User Tools
• PAD-ptools
– Parallel versions of UNIX utilities
– pcp, pls, pcat, …
– Integrated with cluster partitioning
• LSF
– Job submission and control
• mpirun
– MPICH, MPICH-FULL
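The idea behind a tool like pcp can be sketched as a fan-out copy to every node of the current partition, run concurrently. In this illustrative sketch the "nodes" are just local directories and the copies run in threads; the real pcp copies over the network, and the function name and behavior here are assumptions:

```python
# Illustrative pcp-style fan-out: copy one file to every "node"
# concurrently.  Local directories stand in for remote nodes.
import os
import shutil
import tempfile
from concurrent.futures import ThreadPoolExecutor

def pcp(src, node_dirs):
    """Copy src to each node directory in parallel; return destinations."""
    with ThreadPoolExecutor() as pool:
        return list(pool.map(lambda d: shutil.copy(src, d), node_dirs))

base = tempfile.mkdtemp()
src = os.path.join(base, "app.conf")
with open(src, "w") as f:
    f.write("nodes=8\n")
node_dirs = [os.path.join(base, n) for n in ("node00", "node01", "node02")]
for d in node_dirs:
    os.mkdir(d)
dests = pcp(src, node_dirs)
print(len(dests))  # 3
```

Integration with partitioning means the node list fed to such a tool comes from the operator's current partition rather than the whole cluster.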
Development Environment
• Portland
– Fortran77, Fortran90, HPF
– Profiler
– Debugger
• Libraries
– BLAS, BLACS, LAPACK, ScaLAPACK
• TotalView debugger
• VAMPIR profiler
Conclusions
• Complete product system:
– Elebra Vortix Cluster (PAD Cluster)
• www.elebra.com.br/aero
• Several developments:
– Hardware
• Collective operations, synchronization and global clock
– Software
• Communication libraries
• Cluster tools
• Communication drivers
Future Work
• University of São Paulo + Purdue University + University of Pittsburgh
– Hardware for collective operations and synchronization with a 64-bit PCI interface
• University of São Paulo + ICS-FORTH (Greece)
– ATM-like switch at 2.4 Gbps
• University of São Paulo
– New cluster administration, management and security tools
– High-availability database applications
Acknowledgments
• FINEP
• LSI-EPUSP development team
• Elebra development team