High Performance In-Situ Visualization on Thousands of GPUs
Jeroen Bédorf, Evghenii Gaburov, Simon Portegies Zwart, Peter Messmer
Leiden Observatory
Ex-situ visualization. Compute machine: simulation, I/O layer, disk I/O software. Storage. Ex-situ visualization machine: disk I/O software, I/O layer, analysis & visualization software.
In-situ visualization. Compute & in-situ visualization machine: simulation, I/O layer, software for analysis & visualization, disk I/O and simulation steering. Storage.
“Hoax object” Discovered at SC14!
Gravitational tree code :: Bonsai
Showcased at GTC12 & SC14. Gordon Bell Prize Finalist (2014).
Features:
• Scales up to 25 Pflops on Titan supercomputer
• Async parallel I/O
• In-situ (parallel) visualization
http://github.com/treecode/Bonsai
Compute & in-situ visualization machine: Bonsai, I/O layer, analysis & visualization and simulation steering software.
In-situ visualization pipeline:
1. Simulation step (80 ms)
2. Data partitioning (50 ms)
3. OpenGL rendering (60 ms)
4. Parallel compositing (50 ms)
Run back to back, the four stages give one displayed frame every 240 ms.
2. Data partitioning
Space Filling Curve (SFC) Domain decomposition in Bonsai
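As a rough illustration of SFC-based decomposition, the Python sketch below sorts 2D points along a Morton (Z-order) curve and cuts the sorted list into contiguous, equal-count pieces. The choice of curve, the 2D setting and the names `morton_key_2d` and `sfc_decompose` are assumptions made for brevity; the slide does not say which curve or balancing rule Bonsai uses.

```python
def morton_key_2d(ix, iy, bits=16):
    """Interleave the bits of two integer grid coordinates (Z-order key)."""
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b)        # x bits go to even positions
        key |= ((iy >> b) & 1) << (2 * b + 1)    # y bits go to odd positions
    return key


def sfc_decompose(points, n_domains, bits=16):
    """Assign points in [0, 1) x [0, 1) to domains along the curve.

    Returns one list of point indices per domain. Each domain is a
    contiguous piece of the curve holding roughly the same number of points.
    """
    scale = 1 << bits
    order = sorted(
        range(len(points)),
        key=lambda i: morton_key_2d(
            int(points[i][0] * scale), int(points[i][1] * scale), bits
        ),
    )
    n = len(order)
    return [order[d * n // n_domains:(d + 1) * n // n_domains]
            for d in range(n_domains)]


if __name__ == "__main__":
    pts = [(0.1, 0.1), (0.9, 0.1), (0.1, 0.9), (0.9, 0.9)]
    print(sfc_decompose(pts, 2))
```

Because every domain is a contiguous range of keys, balancing the load only requires moving the cut positions along the sorted list.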
Rendering along the depth axis: ray casting, sampling data, shading, compositing
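The stages named on this slide can be illustrated for a single ray with the standard front-to-back "over" operator. This is a generic sketch, not Bonsai's renderer: the scalar colors, the `shade` transfer function, the opacity cutoff and the name `cast_ray` are all assumptions.

```python
def shade(density):
    """Hypothetical transfer function: denser samples are brighter and more opaque."""
    d = min(max(density, 0.0), 1.0)
    return d, d  # (color, opacity)


def cast_ray(densities, cutoff=0.999):
    """Composite samples along one ray, ordered from near to far."""
    color, alpha = 0.0, 0.0
    for density in densities:                # sampling data along the ray
        c, a = shade(density)                # shading
        color += (1.0 - alpha) * a * c       # compositing ("over" operator)
        alpha += (1.0 - alpha) * a
        if alpha >= cutoff:                  # ray is opaque, stop early
            break
    return color, alpha


if __name__ == "__main__":
    print(cast_ray([0.5, 0.5]))
```

The result depends on the order of the samples, which is why the partial images from different domains later have to be blended in depth order.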
Recursive multi-section domain decomposition
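A minimal sketch of a two-level multi-section split, assuming 2D points and equal-count cuts: the points are first cut into `nx` slabs along x, then each slab is cut into `ny` pieces along y. The slide names the method but gives no details, so the function name, the 2D setting and the cut order are assumptions.

```python
def multisection(points, nx, ny):
    """Split points into nx * ny rectangular domains with equal counts.

    Returns a list of index lists, ordered slab by slab.
    """
    order_x = sorted(range(len(points)), key=lambda i: points[i][0])
    n = len(order_x)
    domains = []
    for sx in range(nx):
        # First level: a slab holding about n / nx points along x.
        slab = order_x[sx * n // nx:(sx + 1) * n // nx]
        # Second level: cut the slab along y.
        slab.sort(key=lambda i: points[i][1])
        m = len(slab)
        for sy in range(ny):
            domains.append(slab[sy * m // ny:(sy + 1) * m // ny])
    return domains


if __name__ == "__main__":
    grid = [(x, y) for x in range(6) for y in range(6)]
    for d in multisection(grid, 3, 3):
        print(sorted(grid[i] for i in d))
```

With `nx = ny = 3` this produces nine axis-aligned boxes, the same number of domains as in the nine-GPU example on the slides.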
Every new in-situ data update: recursive multi-section / SFC. Heavy on both the CPU and the interconnect.
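Reading this slide as a repartitioning of the particles between the two decompositions on every update, the interconnect cost can be estimated by counting how many particles change owner. The sketch below builds such an exchange matrix; the function names and the per-particle owner lists are hypothetical.

```python
def exchange_matrix(old_owner, new_owner, n_ranks):
    """m[src][dst] = number of particles moving from rank src to rank dst."""
    m = [[0] * n_ranks for _ in range(n_ranks)]
    for src, dst in zip(old_owner, new_owner):
        m[src][dst] += 1
    return m


def particles_on_the_wire(m):
    """Particles that leave their rank, i.e. the off-diagonal entries."""
    return sum(m[i][j] for i in range(len(m)) for j in range(len(m)) if i != j)


if __name__ == "__main__":
    old = [0, 0, 1, 1, 2, 2]   # owner of each particle in the first decomposition
    new = [0, 1, 1, 2, 2, 0]   # owner of each particle in the second decomposition
    m = exchange_matrix(old, new, 3)
    print(m, particles_on_the_wire(m))
```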
GPU-0 GPU-1 GPU-2 GPU-3 GPU-4 GPU-5 GPU-6 GPU-7 GPU-8
Final image
4. Parallel compositing
proc 0 to proc 7; G1, G3, G6, G7
[Figure: pixel map of the final image showing which of G1, G3, G6 and G7 cover each pixel, with the image divided among proc 0 to proc 7]
A bit of math & data exchange is done with a single operation: MPI_Alltoallv(..)
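The "bit of math" could look like the sketch below, which computes the send counts and displacements that one renderer would pass to `MPI_Alltoallv`. It assumes that each process owns an equal-height band of scan-lines and that a renderer's image covers a rectangle of rows; neither detail is stated on the slide, and the function names are hypothetical. No MPI call is made here.

```python
def band(p, nprocs, height):
    """Rows [lo, hi) of the final image owned by process p (assumed layout)."""
    return p * height // nprocs, (p + 1) * height // nprocs


def alltoallv_layout(row0, row1, width, nprocs, height):
    """Send counts and displacements for a footprint covering rows [row0, row1).

    width is the number of pixels per footprint row. The returned lists have
    one entry per destination process, in the shape MPI_Alltoallv expects.
    """
    sendcounts = []
    for p in range(nprocs):
        lo, hi = band(p, nprocs, height)
        overlap = max(0, min(row1, hi) - max(row0, lo))
        sendcounts.append(overlap * width)
    sdispls = [0] * nprocs
    for p in range(1, nprocs):
        sdispls[p] = sdispls[p - 1] + sendcounts[p - 1]
    return sendcounts, sdispls


if __name__ == "__main__":
    counts, displs = alltoallv_layout(15, 35, 10, nprocs=8, height=80)
    print(counts)
    print(displs)
```

Processes whose band does not intersect the footprint get a count of zero, so a single collective call covers every sender and receiver pair.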
P2: blends pixels from G1 & G3
P3: blends pixels from G1, G3 & G6
P4: blends pixels from G3 & G6
Glue scan-lines together with a single operation: MPI_Gather(..)
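The gather step can be mimicked without MPI by concatenating the blended bands in rank order, which is how `MPI_Gather` lays out the receive buffer on the root. `MPI_Gather` takes the same element count from every rank, so the sketch checks for that; the function name and the list-of-rows representation are assumptions.

```python
def gather_scanlines(bands):
    """Concatenate per-process bands of scan-lines into one image.

    bands[p] is the list of blended rows owned by process p.
    """
    rows_per_proc = len(bands[0])
    if any(len(b) != rows_per_proc for b in bands):
        raise ValueError("MPI_Gather needs equal-sized contributions")
    image = []
    for b in bands:          # rank order matches the receive buffer layout
        image.extend(b)
    return image


if __name__ == "__main__":
    bands = [[[p] * 4, [p] * 4] for p in range(8)]   # 8 procs, 2 rows each
    print(len(gather_scanlines(bands)))
```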
Pipelined timeline: the stages run concurrently on successive frames: Simulation step (80 ms), Data partition (50 ms), OpenGL rendering (60 ms), Compositing (50 ms), Display (60 ms).
Serial stages: 4 fps (Display every 240 ms)
Pipelined stages: 16 fps (Display every 60 ms)
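The two frame rates follow from the display intervals shown on the timelines. The short check below only redoes that arithmetic; the stage times and the 60 ms interval are read off the slides, and the helper name `fps` is made up for the example.

```python
STAGES_MS = {
    "simulation step": 80,
    "data partition": 50,
    "OpenGL rendering": 60,
    "compositing": 50,
}


def fps(display_interval_ms):
    """Frames per second for a given time between displayed frames."""
    return 1000.0 / display_interval_ms


# Serial case: a frame is shown only after all four stages have finished.
serial_interval_ms = sum(STAGES_MS.values())

# Pipelined case: display interval as shown on the pipelined timeline.
pipelined_interval_ms = 60

if __name__ == "__main__":
    print(serial_interval_ms, round(fps(serial_interval_ms), 1))
    print(pipelined_interval_ms, round(fps(pipelined_interval_ms), 1))
```

The serial interval comes out at 240 ms, a little over 4 frames per second, and a 60 ms interval gives between 16 and 17 frames per second.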
4 fps (serial stages, Display every 240 ms) versus 15 fps (pipelined stages, Display every 60 ms)
• 16 bit colors
• delegated MPI_Alltoallv with MPI rank placement
• dedicated remote displaying machine to gather final image
• image compression
http://github.com/treecode/Bonsai
• In-situ visualization as I/O workflow (e.g. ADIOS)
• Take advantage of existing software (e.g. ParaView)
• Interoperability with job schedulers (e.g. Slurm)
• More use cases (astro, chem, bio, automotive, aerospace)