High-Performance Machine Learning for Weather Prediction Applications

SLIDE 1

High-Performance Machine Learning for Weather Prediction Applications

Hatem Ltaief Senior Research Scientist Extreme Computing Research Center King Abdullah University of Science and Technology NVIDIA GTC at San Jose, CA May 8-11, 2017

  • H. Ltaief

1 / 35

slide-2
SLIDE 2

Outline

1. Computational Statistics for Climate/Weather Prediction Applications
2. Dense Cholesky-based Matrix Computations
3. Tile Low-Rank Cholesky-based Matrix Approximation
4. KBLAS
5. What’s Next?

SLIDE 3

Computational Statistics for Climate/Weather Prediction Applications


SLIDE 4

Computational Statistics for Climate/Weather Prediction Applications


Applications from climate and weather science often deal with a very large number of measurements, regularly or irregularly located in a geographical region. In geospatial statistics, these data are usually modeled as a realization of a Gaussian spatial random field. This translates into evaluating the log-likelihood function, which involves a large dense (but data-sparse) covariance matrix.
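The likelihood evaluation described above can be sketched in a few lines. This is an illustrative NumPy/SciPy version (assuming a zero-mean field; the function name `gaussian_loglik` is ours, not from the slides), in which a single Cholesky factorization supplies both the quadratic form and the log-determinant:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def gaussian_loglik(y, Sigma):
    """Gaussian log-likelihood of observations y under covariance Sigma:
    -(1/2) * (y^T Sigma^{-1} y + log det Sigma + n log 2*pi),
    evaluated through one Cholesky factorization (the O(n^3/3) bottleneck)."""
    n = y.size
    c, low = cho_factor(Sigma, lower=True)        # Sigma = L L^T
    logdet = 2.0 * np.sum(np.log(np.diag(c)))     # log det Sigma from diag(L)
    quad = y @ cho_solve((c, low), y)             # y^T Sigma^{-1} y via triangular solves
    return -0.5 * (quad + logdet + n * np.log(2.0 * np.pi))
```

Maximizing this function over the covariance parameters is what drives the repeated large Cholesky factorizations discussed in the rest of the talk.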

SLIDE 5

Computational Statistics for Climate/Weather Prediction Applications

Geospatial Statistics: Learning using Cholesky

Multivariate large spatial data sets in climate/weather modeling

(a) Problem Definition. (b) Soil moisture.

Figure: Climate/weather model.

SLIDE 6

Computational Statistics for Climate/Weather Prediction Applications

Geospatial Statistics: Prediction using Schur Complement

$$\begin{pmatrix} Z_1 \\ Z_2 \end{pmatrix} \sim \mathcal{N}_{m+n}\!\left( \begin{pmatrix} \mu_1 \\ \mu_2 \end{pmatrix},\; \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix} \right)$$

$$Z_1 \mid Z_2 \sim \mathcal{N}_m\!\left( \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(Z_2 - \mu_2),\; \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21} \right)$$
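A direct transcription of the conditional-Gaussian (Schur complement) formulas, factoring Σ22 once with Cholesky rather than forming its inverse. This is an illustrative sketch; the function name and argument layout are ours:

```python
import numpy as np
from scipy.linalg import cho_factor, cho_solve

def conditional_gaussian(mu1, mu2, S11, S12, S22, z2):
    """Mean and covariance of Z1 | Z2 = z2 for a joint Gaussian.
    Mean: mu1 + S12 S22^{-1} (z2 - mu2)
    Cov:  S11 - S12 S22^{-1} S21   (the Schur complement of S22)."""
    c = cho_factor(S22, lower=True)               # factor S22 once
    mean = mu1 + S12 @ cho_solve(c, z2 - mu2)     # solve instead of inverting
    cov = S11 - S12 @ cho_solve(c, S12.T)
    return mean, cov
```

The conditional mean is the kriging predictor for the unobserved locations given the measurements.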
SLIDE 7

Dense Cholesky-based Matrix Computations


SLIDE 8

Dense Cholesky-based Matrix Computations

Matrix Form

The Cholesky factorization of an N × N real symmetric positive-definite matrix A has the form A = LL^T, where L is an N × N real lower triangular matrix with positive diagonal elements.
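The definition can be checked numerically in a couple of lines (illustrative only):

```python
import numpy as np

# A small symmetric positive-definite matrix and its Cholesky factor.
# np.linalg.cholesky returns the lower triangular L with positive
# diagonal such that A = L @ L.T.
A = np.array([[4.0, 2.0, 2.0],
              [2.0, 5.0, 3.0],
              [2.0, 3.0, 6.0]])
L = np.linalg.cholesky(A)
```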

SLIDE 9

Dense Cholesky-based Matrix Computations

LAPACK DPOTRF

Figure: Block algorithms. (a) First step: factor a panel, then update the trailing submatrix. (b) Second step. (c) Third step. Previously factored parts are final.

SLIDE 10

Dense Cholesky-based Matrix Computations

PLASMA/CHAMELEON DPOTRF

Figure: Tile Algorithms.
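The tile algorithm's task pattern can be sketched sequentially. Each statement below corresponds to one POTRF/TRSM/SYRK/GEMM tile task that a runtime such as StarPU would schedule as a DAG; this is an illustrative sketch (assuming the matrix size is a multiple of the tile size), not the PLASMA/CHAMELEON implementation:

```python
import numpy as np
from scipy.linalg import solve_triangular

def tile_cholesky(A, nb):
    """Right-looking tile Cholesky of an n x n SPD matrix with tile size nb
    (n must be a multiple of nb). Returns L such that tril(L) @ tril(L).T = A."""
    L = np.tril(A.copy())
    nt = A.shape[0] // nb
    for k in range(nt):
        kk = slice(k * nb, (k + 1) * nb)
        L[kk, kk] = np.linalg.cholesky(L[kk, kk])          # POTRF: diagonal tile
        for i in range(k + 1, nt):
            ii = slice(i * nb, (i + 1) * nb)
            # TRSM: L[i,k] = A[i,k] @ L[k,k]^{-T}
            L[ii, kk] = solve_triangular(L[kk, kk], L[ii, kk].T, lower=True).T
        for i in range(k + 1, nt):
            ii = slice(i * nb, (i + 1) * nb)
            L[ii, ii] -= L[ii, kk] @ L[ii, kk].T           # SYRK: diagonal tile
            for j in range(k + 1, i):
                jj = slice(j * nb, (j + 1) * nb)
                L[ii, jj] -= L[ii, kk] @ L[jj, kk].T       # GEMM: off-diagonal tile
    return L
```

Because each tile task touches only a few tiles, independent tasks across iterations can run concurrently, which is what the tile layout buys over the block algorithm.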

SLIDE 11

Dense Cholesky-based Matrix Computations

Matérn Kernel: θ1

Figure: Estimated θ1 (≈0.6–1.6) as a function of matrix size (20,000–80,000).
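For reference, a Matérn covariance builder. The mapping of (θ1, θ2, θ3) to variance, range, and smoothness is our assumption about the parameterization used in these experiments, and the function name is ours:

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.special import gamma, kv

def matern_cov(X, theta1, theta2, theta3):
    """Matern covariance matrix for locations X (n x d).
    Assumed parameterization: theta1 = variance, theta2 = range,
    theta3 = smoothness nu.
    C(r) = theta1 * 2^(1-nu)/Gamma(nu) * (r/theta2)^nu * K_nu(r/theta2)."""
    r = cdist(X, X)
    nu = theta3
    scaled = r / theta2
    C = np.full_like(r, theta1)                  # C(0) = theta1 on the diagonal
    nz = scaled > 0
    C[nz] = theta1 * (2 ** (1 - nu) / gamma(nu)) * scaled[nz] ** nu * kv(nu, scaled[nz])
    return C
```

With nu = 1/2 this reduces to the exponential kernel θ1·exp(−r/θ2), a handy sanity check.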

SLIDE 12

Dense Cholesky-based Matrix Computations

Matérn Kernel: θ2

Figure: Estimated θ2 (≈0.05–0.20) as a function of matrix size (20,000–80,000).

SLIDE 13

Dense Cholesky-based Matrix Computations

Matérn Kernel: θ3

Figure: Estimated θ3 (≈0.46–0.54) as a function of matrix size (20,000–80,000).

SLIDE 14

Dense Cholesky-based Matrix Computations

Maximum Likelihood Performance on 32 HSW cores + 8 K80 GPUs w/ StarPU

Figure: Chameleon likelihood performance in Gflop/s (≈1,000–5,500) as a function of matrix size (25,000–100,000) on 32 Haswell cores + 8 K80 GPUs.

SLIDE 15

Dense Cholesky-based Matrix Computations

Maximum Likelihood Performance on 20 BDW cores + 8 P100 GPUs w/ StarPU

Figure: Time to solution in seconds (log scale) as a function of matrix size (10,000–100,000) on the DGX-1.

SLIDE 16

Dense Cholesky-based Matrix Computations

Real Datasets w/ Mississippi Basin (Soil Moisture)

SLIDE 17

Dense Cholesky-based Matrix Computations

Covariance Matrix Problems

  • Ubiquitous in computational science and engineering
  • Symmetric, positive-definite matrix structure
  • (Apparently) dense matrices
  • Often data-sparse: parameter correlations decay with distance
  • Hierarchically low rank
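The "data-sparse" point can be made concrete: off-diagonal blocks of a covariance matrix built from a distance-decaying kernel have rapidly decaying singular values, while diagonal blocks stay close to full rank. An illustrative sketch, with an exponential kernel and a tolerance of our choosing:

```python
import numpy as np

def numerical_rank(block, tol=1e-9):
    """Number of singular values above tol relative to the largest."""
    s = np.linalg.svd(block, compute_uv=False)
    return int(np.sum(s > tol * s[0]))

x = np.linspace(0.0, 1.0, 400)
C = np.exp(-np.abs(x[:, None] - x[None, :]))   # dense n x n covariance matrix
diag_rank = numerical_rank(C[:100, :100])      # diagonal block: nearly full rank
far_rank = numerical_rank(C[:100, 300:])       # well-separated block: tiny rank
```

For this 1-D exponential kernel the well-separated block is exactly rank one, since exp(-(x_j - x_i)) = exp(x_i)·exp(-x_j) is an outer product; in higher dimensions the ranks are small rather than one, which is precisely what tile low-rank approximation exploits.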

SLIDE 18

Tile Low-Rank Cholesky-based Matrix Approximation


SLIDE 19

Tile Low-Rank Cholesky-based Matrix Approximation

Matrix Rank X-ray: Hierarchically Low Rank

Figure: Rank map of a 20 × 20 tiling of the covariance matrix. Diagonal tiles are full rank (500), while off-diagonal tile ranks decay with distance from the diagonal (down to ≈27).

SLIDE 20

Tile Low-Rank Cholesky-based Matrix Approximation

Dense Linear Algebra Renaissance

SLIDE 21

Tile Low-Rank Cholesky-based Matrix Approximation

HiCMA DPOTRF

The low-rank tile Cholesky algorithm can be expressed with the following four computational kernels:

  • HCORE DPOTRF: performs the Cholesky factorization of a diagonal (lower triangular) tile. It is similar to DPOTRF, since the diagonal tiles are dense.
  • HCORE DTRSM: applies the update resulting from the factorization of the diagonal tile above it to an off-diagonal low-rank tile, and overwrites it with the final elements of the output matrix: V(i,k) = V(i,k) × D(k,k)^{-1}. The operation is a triangular solve.
  • HCORE DSYRK: applies the updates resulting from the factorization of the low-rank tiles to its left to a diagonal (lower triangular) tile: D(j,j) = D(j,j) − (U(j,k) × V(j,k)^T) × (U(j,k) × V(j,k)^T)^T. The operation is a symmetric rank-k update.
  • HCORE DGEMM: applies the updates resulting from the factorization of the low-rank tiles to its left to an off-diagonal low-rank tile. The operation involves two QR factorizations, one reduced SVD (whose cost depends on the rank and/or the accuracy parameter), and two matrix-matrix multiplications.
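The structure of the low-rank GEMM update (stacked factors, two QR factorizations, one small SVD) can be sketched as follows. This mirrors the kernel description above but is our illustrative version, not HiCMA's implementation; tiles are stored as X = U_X @ V_X.T:

```python
import numpy as np

def lr_gemm(U_C, V_C, U_A, V_A, U_B, V_B, tol=1e-8):
    """C <- C - A @ B.T for low-rank tiles X = U_X @ V_X.T, then recompress.
    Since A @ B.T = U_A (V_A^T V_B) U_B^T, the update has rank r_C + r_A,
    which a truncated SVD reduces back to the accuracy threshold tol."""
    M = V_A.T @ V_B                           # small r_A x r_B coupling matrix
    U = np.hstack([U_C, U_A @ M])             # stacked factors of C - A B^T
    V = np.hstack([V_C, -U_B])
    Qu, Ru = np.linalg.qr(U)                  # two QR factorizations ...
    Qv, Rv = np.linalg.qr(V)
    W, s, Zt = np.linalg.svd(Ru @ Rv.T)       # ... then one small SVD
    k = max(1, int(np.sum(s > tol * s[0])))   # new rank from the threshold
    U_new = Qu @ (W[:, :k] * s[:k])
    V_new = Qv @ Zt[:k].T
    return U_new, V_new
```

The SVD runs on a matrix whose size is the sum of the tile ranks, not the tile size, which keeps the cost proportional to the (small) ranks.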

SLIDE 22

Tile Low-Rank Cholesky-based Matrix Approximation

HiCMA DPOTRF

SLIDE 23

Tile Low-Rank Cholesky-based Matrix Approximation

Tile Low Rank Cholesky: Memory Footprint

Akbudak et al., accepted at ISC17

SLIDE 24

Tile Low-Rank Cholesky-based Matrix Approximation

Dense Linear Algebra Renaissance

SLIDE 25

Tile Low-Rank Cholesky-based Matrix Approximation

KAUST BLAS Poster@GTC17 P7223 (Ali Charara)

SLIDE 26

KBLAS


SLIDE 27

KBLAS

Advanced Batched BLAS Operations: HBLAS

Context: very small sizes!

  • Batch operation executions at each level of the tree
  • Currently fixed sizes (variable sizes still need to be handled)
  • Recursive formulation, stressing register usage
  • Conversion into a batch of large GEMMs
  • Minimized data transfer
  • Enhanced data locality
  • Increased arithmetic intensity
  • State-of-the-art implementations are either not well optimized for this scope or do not support it
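As an analogy for the batched execution model, NumPy's stacked-array interface performs one call over a whole batch of small fixed-size problems, much as batched GPU kernels amortize per-operation launch overhead. An illustrative sketch (not KBLAS code):

```python
import numpy as np

# Batched Cholesky over many small, fixed-size SPD matrices: one call over
# the whole batch instead of thousands of tiny independent factorizations.
rng = np.random.default_rng(0)
batch, n = 2048, 16                              # many tiny problems, fixed size
A = rng.standard_normal((batch, n, n))
S = A @ A.transpose(0, 2, 1) + n * np.eye(n)     # batch of SPD matrices
L = np.linalg.cholesky(S)                        # one batched POTRF-like call
```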

SLIDE 28

KBLAS

Advanced Batched BLAS Operations: HBLAS

HBLAS matrix computations:

  • Level 3 BLAS: SYRK, TRMM, TRSM
  • Factorizations: POTRF
  • Solves: POTRS, POSV, POTRI, POTI

HBLAS matrix compression:

  • Batch QR factorizations
  • Batch SVD

SLIDE 29

KBLAS

Advanced Batched BLAS Operations: HBLAS

Batches of batched kernels:

  • Rec. Batch DPOTRF
  • Rec. Batch DTRSM
  • Rec. Batch DSYRK
  • Rec. Batch DPOTRF

Profiling shows 76% of time is spent in batch DGEMM (MAGMABLAS).
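The recursive formulation can be sketched as follows; the trailing TRSM/SYRK updates carry most of the flops, consistent with the profiling note above. An illustrative sequential version (our names, not the KBLAS kernels):

```python
import numpy as np
from scipy.linalg import solve_triangular

def rec_cholesky(A):
    """Recursive Cholesky: split A into a 2 x 2 block partition and recurse.
    With A = [[A11, A21^T], [A21, A22]]:
      L11 = chol(A11)
      L21 = A21 @ L11^{-T}                  (triangular solve)
      L22 = chol(A22 - L21 @ L21^T)         (GEMM/SYRK-dominated update)
    The trailing update carries most of the flops, which is why the
    recursive batched kernels spend the bulk of their time in batch DGEMM."""
    n = A.shape[0]
    if n <= 8:                                   # base case: unblocked
        return np.linalg.cholesky(A)
    h = n // 2
    L11 = rec_cholesky(A[:h, :h])
    L21 = solve_triangular(L11, A[h:, :h].T, lower=True).T
    L22 = rec_cholesky(A[h:, h:] - L21 @ L21.T)
    L = np.zeros_like(A)
    L[:h, :h], L[h:, :h], L[h:, h:] = L11, L21, L22
    return L
```

Recursion keeps the per-level working sets small (good for registers), while the updates at every level aggregate into large matrix-matrix products.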

SLIDE 30

KBLAS

Performance Results: Batched Level 3 BLAS on NVIDIA K40 GPUs

SLIDE 31

KBLAS

Performance Results: Batched Solves on NVIDIA K40 GPUs

SLIDE 32

KBLAS

Performance Results: Batched Schur Complements on NVIDIA K40 GPUs

Figure: Batched Schur complement performance in GFlop/s as a function of max-N (32–512) for a batch of 2048, comparing KBLAS-vDschur variants P16, P32, P64, and P128.

SLIDE 33

What’s Next?


SLIDE 34

What’s Next?

HiCMA Software Stack

SLIDE 35

Acknowledgments

Students/Collaborators/Vendors

  • Extreme Computing Research Center @ KAUST: S. Abdullah, K. Akbudak, W. Boukaram, A. Charara, G. Chávez, M. Genton, D. Keyes, A. Litvinenko, A. Mikhalev, D. Sukkari, G. Turkiyyah, and Y. Sun
  • Innovative Computing Laboratory @ UTK: PLASMA/MAGMA/PaRSEC teams
  • INRIA/INP/LaBRI Bordeaux, France: Runtime/HiePACS teams
  • Max Planck Institute @ Leipzig, Germany: R. Kriemann
  • KAUST Supercomputing Lab and IT Research Computing support
  • NVIDIA GPU Research Center
  • Intel Parallel Computing Center
  • Cray Center of Excellence
