HPC Analytics
Dan Stanzione
Fulton High Performance Computing
dstanzi@asu.edu
2/20/05

Theme
• Gap between data and knowledge (as has been discussed here before)
• High Performance Computing continues to exponentially increase our ability to generate data
• This can be an enabler of new science...
• ...but also a huge obstacle
• ...or an excuse not to think

Outline
• How much data is “large”?
• Evolution of system design to deal with large data
• What to do with it all: Analytics

How Much Data?
• What is a “large” dataset nowadays?
• My current machine: 2+ Tflops
• Network bisection bandwidth: ~1 Tb/s
• I/O subsystem writes ~500 MB/s (30 GB/minute)

How Much Data?
• Mars project: ~60 TB
• One ASU faculty member has contacted me about a ~2 petabyte dataset.
• A Chilean observatory can produce more than 1 TB an hour (12 hours of data must be processed before the next pass starts...)
• A potential Australian array telescope would produce multiple EXABYTES per year by 2010.
• Not unique to astronomy...

How Much Data?
• Machines will be constructed in the next 12 months with several tens of thousands of processors (hundreds of TF)
• Network bandwidth >10 TB/sec, which means:
• 1 PB per 2 minutes
• 1 exabyte per 30 hours
• 1 zettabyte during the machine's 3-year lifetime (yottabytes are next, if anyone’s counting...)
• Google has much more computation, much less network/flop
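The round numbers on this slide can be sanity-checked with a few lines of arithmetic. The sketch below assumes decimal (SI) units and a flat 10 TB/s rate; the helper names are illustrative, not from the talk.

```python
# Decimal (SI) byte units; an assumption, the slides do not say which convention is used.
MB, GB, TB, PB, EB, ZB = 1e6, 1e9, 1e12, 1e15, 1e18, 1e21


def seconds_to_move(num_bytes, bytes_per_second):
    """Time needed to move num_bytes at a sustained rate."""
    return num_bytes / bytes_per_second


def bytes_moved(bytes_per_second, seconds):
    """Volume moved at a sustained rate over a time span."""
    return bytes_per_second * seconds


IO_RATE = 500 * MB                    # current machine's I/O subsystem
NETWORK_RATE = 10 * TB                # next-generation network, lower bound
THREE_YEARS_S = 3 * 365 * 24 * 3600   # machine lifetime in seconds

io_gb_per_minute = bytes_moved(IO_RATE, 60) / GB
pb_seconds = seconds_to_move(PB, NETWORK_RATE)
eb_hours = seconds_to_move(EB, NETWORK_RATE) / 3600
lifetime_zb = bytes_moved(NETWORK_RATE, THREE_YEARS_S) / ZB

print(f"I/O subsystem: {io_gb_per_minute:.0f} GB per minute")
print(f"1 PB over the network: {pb_seconds:.0f} seconds")
print(f"1 EB over the network: {eb_hours:.1f} hours")
print(f"3-year lifetime volume: {lifetime_zb:.2f} ZB")
```

Under these assumptions the slide's figures (a petabyte in about two minutes, an exabyte in about 30 hours, a zettabyte over three years) hold up as order-of-magnitude estimates.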

Evolution of Storage Systems
Evolution at all levels:
• RAW/Text Files -> Hierarchical Formats -> Schemas -> Database
• Filesystems -> LVM -> Parallel Filesystems -> Global Name Space/Storage Request Brokers
• Single disk volumes -> RAID 1-5 -> RAID 10 -> Storage Hierarchies

HPC Storage Hierarchy: Basic Beowulf
[Diagram: compute nodes, interconnection network, master node, Internet or internal network]

Tier 1 Storage: In Cluster High Speed Scratch
[Diagram: Beowulf cluster with compute nodes, parallel filesystem I/O nodes, interconnection network, master node, public network]
• Parallel filesystems support this: PVFS, Panasas, Lustre, IBRIX
• MPI I/O is the interface

Tier 2 Storage: Shared Home Directories
[Diagram: Beowulf cluster with compute nodes, parallel filesystem I/O nodes, interconnection network, master node, public network, plus a home directory server (may be direct-attached to master)]

Tier 3 Storage: Campus-Wide Research Storage
[Diagram: Cluster B and Cluster C (each with compute nodes, parallel filesystem I/O nodes, interconnection network, master node, public network) and other research servers (non-cluster), joined by the campus research network to campus storage servers and campus storage mirrors]

We can build Multi-PB Storage Systems - Now What?
• Applications spit out lots of this data (or sensors/sequencers/instruments wrapped in applications).
• Status quo: application codes generate FORTRAN unformatted or ASCII text data to a (multitude of) files
• Some domain exceptions (.pdb, gridgen)

Three problems:
• Too many files (my worst offender has 750,000 - find anything useful in that).
• Files too big (one student generated 700 GB in 18 hours)
• Too many formats (can’t connect weather and ocean, application and visualization).

Things are happening
• Broad domain frameworks taking hold, e.g. ESMF (Earth System Modeling Framework): connect WRF (climate) to ADCIRC (ocean)
• Hierarchical, standard, descriptive data formats
• Broader introduction of metadata is the key...
• This is the right trend, but has costs...

Costs of Frameworks
• Application complexity goes way up
• Conversely, the value of applications written “outside” the community goes way down.
• XML is not the most efficient format in the world...

XML (~100 bytes):
  <particle>
    <coordinates> 10 0 10 </coordinates>
    <velocity> <x> 12 </x> <y> 9 </y> <z> 8 </z> </velocity>
  </particle>

FORTRAN raw (6 bytes):
  0a000a0c0908
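The size gap in the particle example can be reproduced directly. The sketch below assumes each of the six values is stored as one unsigned byte (which is what the hex string 0a000a0c0908 implies) and uses a whitespace-free version of the XML; the variable names are illustrative.

```python
import struct

coordinates = (10, 0, 10)
velocity = (12, 9, 8)

# Raw binary record: six unsigned one-byte integers, no metadata at all.
raw = struct.pack("6B", *coordinates, *velocity)

# Self-describing XML record carrying the same six values.
xml = (
    "<particle>"
    f"<coordinates>{coordinates[0]} {coordinates[1]} {coordinates[2]}</coordinates>"
    "<velocity>"
    f"<x>{velocity[0]}</x><y>{velocity[1]}</y><z>{velocity[2]}</z>"
    "</velocity>"
    "</particle>"
).encode("ascii")

print("raw hex:  ", raw.hex())
print("raw bytes:", len(raw))
print("xml bytes:", len(xml))
print("ratio:     %.1fx" % (len(xml) / len(raw)))
```

The raw record is compact only because the reader must already know the layout; the XML pays for carrying that description with every particle.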

Costs of Frameworks
• Enterprise-class backed-up storage: ~$10,000/terabyte
• Cost of a 10:1 inefficiency on one PB of raw data: $100,000,000 (10 PB stored, versus $10,000,000 for the raw PB)
• In fairness, compressed XML mitigates a fair amount of this...
• ...but an app-specific binary format will always win
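The cost figure follows from simple multiplication. This is a minimal sketch assuming 1 PB = 1,000 TB and the slide's ~$10,000/TB price; the function name is hypothetical.

```python
COST_PER_TB = 10_000   # dollars per terabyte, from the slide
TB_PER_PB = 1_000      # decimal units (an assumption)


def storage_cost(raw_pb, inflation_factor=1.0):
    """Dollars needed to store raw_pb petabytes after format inflation."""
    return raw_pb * TB_PER_PB * inflation_factor * COST_PER_TB


raw_cost = storage_cost(1)             # app-specific binary format
inflated_cost = storage_cost(1, 10)    # 10:1 less efficient format
overhead = inflated_cost - raw_cost    # what the inefficiency itself costs

print(f"raw:      ${raw_cost:,.0f}")
print(f"inflated: ${inflated_cost:,.0f}")
print(f"overhead: ${overhead:,.0f}")
```

The $100,000,000 on the slide is the total bill for the inflated data; the part attributable to the format alone is the difference between the two totals.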

HPC Analytics
• We can build systems, we can make filesystems, we can create well-ordered files
• This can roughly be called “Data Management”
• Well-ordered data is a foundation, but still not knowledge.
• The next phase is the emerging field of Analytics

Analytics
• SC05 - “HPC Analytics Challenge” 11/05: “...showcase innovative techniques of rigorous data analysis...”
• Dept. of Energy - Visual Analytics Center solicitation 10/05.
• PNNL NVAC (Nat’l Visualization and Analytics Center)
• Recommended reading: “Illuminating the Path” - National R&D Agenda in Visual Analytics, http://nvac.pnl.gov/agenda.stm

Analytics, according to the “Path”:
• The science of analytical reasoning
• Visual representations and interaction techniques
• Data representations and transformations
• Production, presentation, and dissemination.

State of Analytics
• At SC05, all five finalists did visualization
• Not an expansive view of analytics...
• One used data mining to produce visualizations
• While much, much quality work has been done in visualization techniques...
• ...visualizations are still used as much for fundraising as science

Of course, *I* wouldn’t use visualizations for this...

HPC Applications

Visualization
• Advancing 3D visualization does add something
• Decision Theater
• Formats are a key to making this routine.
• More tools beyond Excel, Matlab
• Need to accelerate to real-time, “what-if” scenarios
• Hierarchy matters here - don’t render whole earth at 30 cm resolution
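To see why hierarchy matters, it helps to count pixels. The sketch below assumes an Earth surface area of about 5.1e14 square metres, a resolution pyramid in which each level doubles the metres per pixel, and a display budget of two million pixels; none of these specifics come from the talk.

```python
EARTH_SURFACE_M2 = 5.1e14  # approximate total surface area (assumption)


def pixels_at_resolution(area_m2, metres_per_pixel):
    """Number of square pixels needed to cover an area at a given resolution."""
    return area_m2 / metres_per_pixel ** 2


def pick_level(area_m2, base_res_m, display_pixels):
    """Lowest pyramid level whose pixel count fits the display.

    Level 0 is the full-resolution data; each level up doubles the
    metres per pixel and so quarters the pixel count.
    """
    level = 0
    while pixels_at_resolution(area_m2, base_res_m * 2 ** level) > display_pixels:
        level += 1
    return level


full_res_pixels = pixels_at_resolution(EARTH_SURFACE_M2, 0.3)
whole_earth_level = pick_level(EARTH_SURFACE_M2, 0.3, 2e6)
city_block_level = pick_level(100 * 100, 0.3, 2e6)

print(f"whole earth at 30 cm: {full_res_pixels:.2e} pixels")
print("pyramid level for a whole-earth view:", whole_earth_level)
print("pyramid level for a 100 m x 100 m view:", city_block_level)
```

A whole-earth view needs a very coarse level while a city-block view can use the full-resolution data, which is the "multiple views of data" idea in miniature.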

Analytics Beyond Visualization
• Databases are a key to the HPC future - see Dr. Chen’s earlier talk for an excellent introduction
• Large databases of small records: well understood
• Large databases of large, sparse records of ill-conforming data: not understood.
• Experimental management tools increasing in value
• Frameworks for parameter study, goal-directed search

Analytics Beyond Visualization
Two more technologies must be imported from other fields:

Data Mining (database-enabled)
• In large datasets, the trends are the knowledge
• Acxiom is a good model (and, gets them out of the junk mail business).

Search
• One word: Google
• Pre-(multi)-indexing, divided search space
• Search multi-PB space in 0.01 seconds... by using a massive cluster to do most work ahead of time.
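The "pre-indexing, divided search space" idea can be illustrated with a toy sharded inverted index. This is only a sketch of the general technique, not a description of how Google implements it; the shard assignment by document id and the function names are assumptions.

```python
from collections import defaultdict


def build_shards(docs, num_shards):
    """Do the expensive work ahead of time: one inverted index per shard.

    docs maps an integer document id to its text. Each document is
    assigned to a shard by id, so every shard indexes only its own slice.
    """
    shards = [defaultdict(set) for _ in range(num_shards)]
    for doc_id, text in docs.items():
        shard = shards[doc_id % num_shards]
        for word in text.lower().split():
            shard[word].add(doc_id)
    return shards


def search(shards, word):
    """Query time: one cheap lookup per shard, then merge the hits.

    On a real cluster the per-shard lookups would run in parallel on
    separate machines; here they run in a simple loop.
    """
    hits = set()
    for shard in shards:
        hits |= shard.get(word.lower(), set())
    return sorted(hits)


docs = {
    0: "ocean model output",
    1: "weather model run",
    2: "ocean temperature field",
}
shards = build_shards(docs, num_shards=2)
print(search(shards, "ocean"))
print(search(shards, "model"))
```

The query cost depends on the size of the lookup tables rather than on the size of the raw data, which is why most of the work can be moved ahead of time.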

Takeaways:
• Intelligent I/O
• Standard Formats
• Hierarchy - multiple views of data
• Database/Data Mining/Search
• Visualization
All of the above require more sophisticated application codes, more use of tools: Computational Science Literacy
