A Tropical Semiring Multiple Matrix-Product Library on GPUs: (not just) a step towards RNA-RNA Interaction Computations
HiCOMB 2020: 19th IEEE International Workshop on High Performance Computational Biology
Prerana Ghalsasi (prerana.ghalsasi@colostate.edu), Brandon Gildemaster (brandon.gildemaster@colostate.edu), Sanjay Rajopadhye (sanjay.rajopadhye@colostate.edu)
Overview
• Background / motivation
• Algorithm
• Parallelization
• Memory optimizations
• GPU matrix-matrix multiplication library
• Modified matrix-matrix multiplication library
• Performance results
• Next steps
Background / Motivation
• RNA-RNA Interaction (RRI) plays an important role in biological processes such as gene expression
• Certain classes of RRI are well studied and have been shown to play roles in various diseases; other classes are not as well studied
• Biological function can be interpreted from the interaction structure
• Problem: current tools to predict structure are slow, with O(N^4) space and O(N^6) time complexity
• Goal: utilize the massive parallelism of GPUs for acceleration while managing memory constraints
[Figure: example RNA sequences forming an interaction structure]
Algorithms
• Base pair maximization and free energy minimization
• O(N^6) time and O(N^4) space
• piRNA, BPPart, BPMax
• Much work on single strand folding, little on RRI
Algorithm
[Figure: two interacting RNA strands, with indices i1, j1 on one strand and i2, j2 on the other]
Algorithm
• BPMax maximizes the score of weighted interactions and restricts certain structures
• It fills a 4D dynamic programming table indexed by (i1, j1, i2, j2): a trapezoidal grid of trapezoids
• The full recurrence equation is complex: one O(N^6) term, several O(N^5) terms, and constant lookups
• The double max reduction (boxed in red in the recurrence) is the dominant O(N^6) term and the most important optimization for performance
Algorithm
• Skip the bottom half of each matrix: subsequence [i,j] is the same as subsequence [j,i]
• The top right corner can also be skipped; this is controlled by the window size, which limits the range of intra-RNA interaction
[Figure: memory space vs. the set of points evaluated, with the window size marked]
Parallelization
• Imbalanced workload
• Naive parallelization: all points along a diagonal can be computed in parallel
  – Poor locality; no optimizations such as vectorization
• Key insight: the double max reduction can be cast as a specialized matrix-matrix multiplication
  – Rearrange the order of evaluation and apply memory transformations to the dynamic programming table
[Figure: depiction of naive parallelization; all terms for the red cells are evaluated in parallel]
Double max reduction
• The evaluation of the blue cell is the maximum over the pairwise additions of the row and column of red cells
• Interchanging the j and k loops exploits vectorization on CPUs; this is essentially tropical (max-plus) matrix multiplication (a minimal kernel sketch follows)
• It can be applied to all points in one matrix in parallel, and to all matrices along a diagonal to exploit coarse-grain parallelism
[Figure: the blue cell is the MAX of the element-wise sums of the red row and column]
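A minimal sketch of the max-plus product this reduction corresponds to, assuming square N x N matrices in a flat row-major layout (the kernel name and layout are illustrative assumptions, not the paper's library code):

    #include <math.h>

    // C[i][j] = max(C[i][j], max_k (A[i][k] + B[k][j])) in the max-plus semiring.
    // C is assumed to be pre-initialized (e.g., to -INFINITY or existing DP values).
    __global__ void maxplus_matmul_naive(const float* A, const float* B,
                                         float* C, int N) {
        int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of C
        int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of C
        if (i >= N || j >= N) return;

        float best = -INFINITY;                          // additive identity of max-plus
        for (int k = 0; k < N; ++k)
            best = fmaxf(best, A[i * N + k] + B[k * N + j]);  // "multiply" is +, "add" is max

        C[i * N + j] = fmaxf(C[i * N + j], best);
    }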
Double max reduction
• Imbalanced workload: in the product C = A * B, some cells of C require two max-plus operations while others require only one
[Figure: C = A * B, with cells annotated by whether they require one or two max-plus operations]
Double max reduction
• Pad each matrix with an extra row and column, shifting the cells in each matrix one row to the right
• Initialize the white (padding) cells to the max-plus semiring additive identity (-∞)
• Example: MAX(C[0,3], -∞ + B[0,3]) = C[0,3], so the padding cells never change a result; this avoids thread divergence (a tiny check follows)
[Figure: padded matrices in C = A * B]
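A tiny host-side check of the property the example relies on (illustrative only): -∞ plus any finite value is -∞, and -∞ is neutral under max, so a padded term is a no-op.

    #include <cstdio>
    #include <cmath>

    int main() {
        float c = 7.0f, b = 3.0f;
        // -inf + b is -inf, and max(c, -inf) is c, so the padded cell contributes nothing.
        float padded = fmaxf(c, -INFINITY + b);
        printf("%f == %f\n", padded, c);   // prints 7.000000 == 7.000000
        return 0;
    }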
Thread divergence
• One program counter (PC) per thread warp: the PC loads an instruction and all threads in the warp execute it
• Divergence introduces overhead: threads must be masked (basically turned on/off)
[Figure: e.g., thread 1 in thread block 0 runs 2 iterations while thread 3 runs 1 iteration; image from the NVIDIA Volta architecture whitepaper]
Matrix Multiplication
• Visualizing the iteration space
[Figure: the (i, j, k) iteration space of a matrix-matrix product]
Triangular or Trapezoidal Matrix Multiplication
• Goal: get as close as possible to the triangular iteration space on the left without introducing thread divergence
• Thread divergence happens at the warp level in CUDA: diverging threads in a warp execute different instructions
• Instead, skip computations at the thread-block level (see the sketch below)
• No standard library performs triangular-triangular matrix multiplication, only triangular-square; the full square iteration space is 6x the amount of work
[Figure: triangular vs. square iteration spaces]
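A minimal sketch of block-level skipping, assuming an upper-triangular result, TILE x TILE tiles, and a launch with matching block dimensions (names and layout are illustrative, not the paper's library):

    #include <math.h>

    #define TILE 32

    // A block whose entire output tile lies strictly below the diagonal exits
    // immediately, so no warp inside it ever diverges.  Blocks that straddle the
    // diagonal compute their full tile; the below-diagonal cells they write are
    // simply unused (a little wasted work, but no divergence).
    __global__ void maxplus_upper_tri(const float* A, const float* B,
                                      float* C, int N) {
        int rowBase = blockIdx.y * TILE;
        int colBase = blockIdx.x * TILE;
        if (colBase + TILE - 1 < rowBase) return;    // whole tile is below the diagonal

        int i = rowBase + threadIdx.y;
        int j = colBase + threadIdx.x;
        if (i >= N || j >= N) return;

        float best = -INFINITY;
        for (int k = 0; k < N; ++k)
            best = fmaxf(best, A[i * N + k] + B[k * N + j]);
        C[i * N + j] = fmaxf(C[i * N + j], best);
    }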
Algorithm
• Skip computations at the thread-block level
[Figure: block-level view of the product, carried out in three steps (Step 1, Step 2, Step 3)]
Modifications
• Two memory transformations:
  – (i2, j2) → (i2, j2 - i2)
  – (i1, j1) → (i1 + N - j1, j1)
• Memory footprint drops from N^2 * M^2 to N * M * W^2 cells
  – 102 GB → 10.5 GB for N = M = 400 and W = 128
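A small host-side sketch of these two index mappings and of the footprint arithmetic quoted above, assuming 4-byte table entries (function and field names are illustrative, not the paper's code):

    #include <cstdio>

    struct Idx2 { int a, b; };

    // (i2, j2) -> (i2, j2 - i2): index the inner matrix by diagonal offset,
    // which the window size W keeps bounded.
    Idx2 mapInner(int i2, int j2) { return { i2, j2 - i2 }; }

    // (i1, j1) -> (i1 + N - j1, j1): skew the outer coordinates similarly.
    Idx2 mapOuter(int i1, int j1, int N) { return { i1 + N - j1, j1 }; }

    int main() {
        Idx2 q = mapInner(3, 100);        // stored at (3, 97)
        Idx2 p = mapOuter(2, 398, 400);   // stored at (4, 398)
        printf("inner (3,100)->(%d,%d), outer (2,398)->(%d,%d)\n", q.a, q.b, p.a, p.b);

        // Footprint comparison from the slide, with 4-byte entries.
        double N = 400, M = 400, W = 128;
        printf("dense: %.1f GB, mapped: %.1f GB\n",
               4.0 * N * N * M * M / 1e9,    // N^2 * M^2 cells  -> ~102.4 GB
               4.0 * N * M * W * W / 1e9);   // N * M * W^2 cells -> ~10.5 GB
        return 0;
    }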
Final algorithm
[Figure: the sub-patch of C that a thread block computes; blue and red cells are loaded from global to shared memory during each step (Step 1, Step 2, Step 3), and the computation is performed in shared memory during each step]
GPU Library
• A library call multiplies a column of matrices by another column of matrices in the max-plus semiring (a hypothetical driver sketch follows)
• The full double reduction for the blue/green matrices requires two library calls
[Figure: one call to the GPU library, shown as a column-of-matrices product]
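A hypothetical host-side driver for such a call, one kernel launch per matrix pair; the function name and signature are assumptions for illustration, not the paper's actual API (maxplus_matmul_naive is the illustrative kernel sketched after the "Double max reduction" slide):

    #include <cuda_runtime.h>

    __global__ void maxplus_matmul_naive(const float* A, const float* B,
                                         float* C, int N);   // sketched earlier

    // Multiply a column of matrices {A_m} by a column {B_m}, accumulating into {C_m},
    // all in the max-plus semiring.  dA/dB/dC hold device pointers to each matrix.
    void maxplus_column_product(const float* const* dA, const float* const* dB,
                                float* const* dC, int N, int numMatrices) {
        dim3 block(32, 32);
        dim3 grid((N + block.x - 1) / block.x, (N + block.y - 1) / block.y);
        for (int m = 0; m < numMatrices; ++m)
            maxplus_matmul_naive<<<grid, block>>>(dA[m], dB[m], dC[m], N);
        cudaDeviceSynchronize();   // wait for the whole column to finish
    }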
Max-plus theoretical peak
• Can't utilize FMA or tensor cores

  GPU        Architecture  Memory  Cores  Clock speed  Calculated peak (G ops/s)
  GTX 980    Maxwell       4 GB    2048   1216 MHz     2490
  GTX 1060   Pascal        6 GB    1280   1708 MHz     2184
  Titan V    Volta         12 GB   5120   1455 MHz     7450
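A way to read the calculated peak column (our inference from the numbers in the table, not a statement on the slide): with FMA and tensor cores unusable, the peak corresponds to one semiring operation per CUDA core per cycle, i.e., peak ≈ cores × clock. For example, GTX 980: 2048 × 1.216 GHz ≈ 2490 G ops/s, matching the table; the other rows follow the same formula.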
Library performance
• We developed a square matrix multiplication library which attains close to machine peak, but performs many unnecessary computations
• We also developed a trapezoidal matrix multiplication library which does fewer operations, but introduces some irregularities that affect performance
[Graphs: performance of a single library call on a column of 50 matrices, for the square and the trapezoidal matrix multiplication libraries]
Library performance
• This graph shows effective operations per second: only the operations on cells that matter, divided by runtime
• The previous graph showed performance counting all operations; the effective metric is more specific to BPMax
• When useless computations are ignored (effective ops/second), the trapezoidal library's performance is higher, because it does fewer operations (a hypothetical numeric example follows)
[Figure: effective performance of the two libraries]
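Hypothetical numbers purely for illustration (not measurements from the paper): suppose a square-library call executes 6 G max-plus operations in 3 ms, of which only 1 G fall on cells that matter; its raw throughput is 2000 G ops/s but its effective throughput is about 333 G ops/s. A trapezoidal call that executes just the 1 G useful operations in 2 ms has lower raw throughput (500 G ops/s) yet higher effective throughput (500 G ops/s), which is why the trapezoidal library comes out ahead on this metric.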
Full BPMax performance
• At the time of paper submission we had completed the full implementation of BPMax on a GPU
• CPU experiments ran with the original BPMax implementation (a naive CPU implementation / parallelization); we plan an optimized CPU version for a fairer comparison
• CPU: Intel(R) Xeon(R) E-2278G, 5 GHz max clock speed, 16 cores
• GPU results include data transfer time from CPU to GPU and back
• BPMax currently attains ~0.5 Giga ops/second
Current / future work
• The current library call attains ~10-11% of the GPU's theoretical peak across 3 architectures: room for 10x improvement
• Bottleneck: the memory mappings we implemented introduce thread divergence on memory loads
  – We are exploring alternate strategies that reduce memory requirements without introducing irregularities
• An optimized CPU implementation of BPMax that exploits vectorization / multithreading
Current work: eliminating thread divergence on memory loads
• Problem: the current memory map introduces thread divergence on memory loads, but not at the computation level
• When loading values into shared memory, threads whose values were shifted out by the memory transformations diverge (an illustrative CUDA sketch follows):
    if (value in physical memory) load into shared memory
    else pad with additive identity
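An illustrative sketch of that guarded load, assuming a 32x32 thread block and a row-major tile; this is our reading of the pattern described on the slide, not the paper's actual code:

    #include <math.h>

    __global__ void load_tile(const float* A, int physCols, int rows) {
        __shared__ float tile[32][32];
        int r = threadIdx.y;                             // row inside the tile
        int c = blockIdx.x * blockDim.x + threadIdx.x;   // logical column

        // Threads in the same warp may take different branches here, which is
        // exactly the load-time divergence described above.
        if (r < rows && c < physCols)
            tile[r][threadIdx.x] = A[r * physCols + c];  // backed by physical memory
        else
            tile[r][threadIdx.x] = -INFINITY;            // shifted out: pad with identity
        __syncthreads();
        // ... the max-plus computation would consume the tile here ...
    }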
Current work: possible solution 1
• Pad each matrix out to the next multiple of the thread block dimensions (a rounding sketch follows)
  – In this example the memory allocation is worse simply because the problem size is so small
  – For larger RNA / window sizes it will save memory and eliminate divergence
[Figure: matrices of size 12x12, 12x8, and 12x15 padded with the additive identity]
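A minimal sketch of the rounding step, assuming square thread blocks and -∞ as the additive identity (illustrative only, not the final implementation):

    #include <vector>
    #include <limits>

    // Round each dimension up to the next multiple of the thread block dimension and
    // fill everything with -inf; real cells are written afterwards, so every thread
    // block sees a fully populated tile and no load needs a guard.
    std::vector<float> alloc_padded(int rows, int cols, int blockSize,
                                    int* paddedRows, int* paddedCols) {
        *paddedRows = ((rows + blockSize - 1) / blockSize) * blockSize;
        *paddedCols = ((cols + blockSize - 1) / blockSize) * blockSize;
        return std::vector<float>((size_t)(*paddedRows) * (*paddedCols),
                                  -std::numeric_limits<float>::infinity());
    }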
Current work: possible solution 2
• Allocate memory based on the dimensions of the thread blocks
• This is the minimum memory we can allocate while still avoiding thread divergence, since it is based off the thread block dimensions
[Figure: logical thread block mappings (size is configurable; each color is a thread block) and the corresponding physical memory allocation, shown for 4x4 and 2x2 thread block dimensions; matrix dimensions are based on the RNA size]