Towards a multifrontal QR factorization for heterogeneous architectures over runtime systems
Preliminary work on multicore architectures
Florent Lopez
Joint work with IRIT Toulouse, LaBRI / Inria Bordeaux, LIP / Inria Lyon
MUMPS Users Group Meeting, May 29-30, 2013

Context of the work

The Multifrontal method

The multifrontal QR factorization is guided by a graph called elimination tree:
• at each node of the tree, k pivots are eliminated
• each node of the tree is associated with a relatively small dense matrix called frontal matrix (or, simply, front), which contains the k columns related to the pivots and all the other coefficients concerned by their elimination
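As a minimal sketch of what "guided by the elimination tree" means in practice: a front can only be processed after its children, so the tree is traversed bottom-up. The parent-array encoding and the function name `postorder` are assumptions made for this illustration; they are not taken from the slides or from qr_mumps.

```python
from collections import defaultdict

def postorder(parent):
    """Return the nodes of an elimination tree so that children precede parents.

    parent[i] is the parent of node i, or -1 for a root (hypothetical encoding).
    """
    children = defaultdict(list)
    roots = []
    for node, par in enumerate(parent):
        if par == -1:
            roots.append(node)
        else:
            children[par].append(node)

    order = []
    # Each stack entry is (node, expanded); a node is emitted only after
    # all of its children have been emitted.
    stack = [(root, False) for root in reversed(roots)]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            order.append(node)
        else:
            stack.append((node, True))
            for child in reversed(children[node]):
                stack.append((child, False))
    return order

# Toy tree: nodes 0 and 1 are children of 2; nodes 2 and 3 are children of 4.
parent = [2, 2, 4, 4, -1]
order = postorder(parent)
print(order)
```

Any order in which children come before their parent is valid; independent subtrees (here node 3 and the subtree rooted at 2) could be processed concurrently.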

Accelerated architectures

Accelerated architectures including GPUs are extremely popular in the HPC community:
• high density of computational units clocked at low frequencies (wrt the latest generation of CPUs)
  ◦ high potential performance peak for parallel applications
  ◦ limited power consumption

New generation of accelerators: MIC (Many Integrated Core)
• 60 cores (in-order cores derived from Pentium: energy efficient)
• clocked at 1053 MHz
• 240 threads (4 hyper-threading siblings per core)
• advanced VPU per core (512-bit SIMD)

Incompatible programming models ⇒ specific kernels and algorithms to achieve performance

Accelerated architectures (ACC: GPU or MIC)

[Figure: elimination tree next to a heterogeneous platform made of CPUs, accelerators (ACC) and their RAM]

• an extremely heterogeneous workload
• a heterogeneous architecture
• mapping tasks is challenging

• management of data consistency by hand
• architecture-dependent approach
• difficult to maintain

Another option is to exploit the features of a modern runtime system capable of handling the scheduling and the data consistency in a dynamic way.

[Figure: elimination tree, StarPU runtime system (scheduler, DSM, CPU/GPU/SPU drivers), heterogeneous platform]

Runtime systems

Runtime system: abstract layer between application and machine with the following features:
• automatic detection of the task dependencies
• dynamic task scheduling on different types of processing units
• management of multi-versioned tasks (an implementation for each type of processing unit)
• consistency management of manipulated data

This approach is widely adopted in the case of dense linear algebra:
• PLASMA (QUARK)
• DPLASMA (PaRSEC)
• MAGMA-MORSE (StarPU)
• FLAME (SuperMatrix)

It is challenging for complex and irregular problems such as sparse linear algebra (related work on PaStiX with Pierre Ramet).

Objective of the study:
• evaluate the usability and the effectiveness of a general purpose runtime system with a complex and irregular workload such as a sparse factorization
• exploitation of GPU-accelerated architectures

Multifrontal QR factorization on multicores over StarPU

Parallelism and scheduling strategy in qr_mumps

In qr_mumps, node and tree parallelism are exploited consistently by partitioning the frontal matrices and replacing the elimination tree with a DAG.

The scheduler efficiency is constrained by the task search space: the scheduling complexity depends on the number of active fronts and therefore is not very scalable.

Replace the ad hoc scheduler in qr_mumps with a general purpose runtime system.
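To illustrate how partitioning a front turns it into a small DAG of tasks, here is a hedged sketch. It assumes a 1D partition of the front into block columns with one "panel" task per block column and one "update" task per trailing block column; the slides do not spell out the partitioning, and the task names and the function `front_tasks` are purely illustrative.

```python
def front_tasks(nb):
    """Build the task DAG for one front split into `nb` block columns.

    Returns a dict mapping each task to the list of tasks it depends on.
    Task names ("panel", k) and ("update", k, j) are illustrative only.
    """
    deps = {}
    for k in range(nb):
        # Panel k can start once the previous step has updated column k.
        deps[("panel", k)] = [("update", k - 1, k)] if k > 0 else []
        for j in range(k + 1, nb):
            # Updating column j at step k needs panel k and, after the
            # first step, the previous update of the same column.
            d = [("panel", k)]
            if k > 0:
                d.append(("update", k - 1, j))
            deps[("update", k, j)] = d
    return deps

dag = front_tasks(3)
for task, d in dag.items():
    print(task, "<-", d)
```

With `nb` block columns this yields `nb + nb * (nb - 1) / 2` tasks per front, which gives a feel for how quickly the overall DAG grows when many fronts are active.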

The multifrontal QR factorization: StarPU integration

[Figure: a StarPU task bundles CPU, GPU and SPU code, a priority, inputs 1..n and outputs 1..m]

• Depending on the input/output, StarPU detects the dependencies among tasks
• Depending on the availability of resources and the data placement, StarPU decides where to run a task
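The second point can be pictured with a toy placement rule. This is a sketch under stated assumptions, not StarPU's scheduling algorithm: the classes `Codelet` and `Worker` and the function `pick_worker` are hypothetical, and the heuristic (pick the idle worker that already holds most of the data) is only one simple possibility.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Set

@dataclass
class Codelet:
    # One implementation per kind of processing unit, e.g. {"cpu": f, "gpu": g}.
    impls: Dict[str, Callable]

@dataclass
class Worker:
    name: str
    kind: str                      # "cpu" or "gpu"
    busy: bool = False
    resident: Set[str] = field(default_factory=set)  # data already in its memory

def pick_worker(codelet, data, workers):
    """Choose an idle worker that has an implementation, preferring data locality."""
    candidates = [w for w in workers if not w.busy and w.kind in codelet.impls]
    if not candidates:
        return None
    # Fewer missing data items means fewer transfers before the task can run.
    return min(candidates, key=lambda w: len(set(data) - w.resident))

workers = [Worker("cpu0", "cpu"), Worker("gpu0", "gpu", resident={"A"})]
both = Codelet({"cpu": lambda: None, "gpu": lambda: None})
cpu_only = Codelet({"cpu": lambda: None})

print(pick_worker(both, ["A"], workers).name)      # gpu0 already holds A
print(pick_worker(cpu_only, ["A"], workers).name)  # only a CPU version exists
```

The multi-versioned task is what makes the choice possible: a task with only a CPU implementation is never offered to the accelerator, regardless of where its data lives.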

Original sequence:
    fun1(A: inout, B: in)
    fun2(A: inout)
    fun1(C: inout, D: in)

Equivalent StarPU code:
    submit_task(fun1, A: inout, B: in, id = id1)
    submit_task(fun2, A: inout, id = id2)
    declare_dependency(id3 ← id1)
    submit_task(fun1, C: inout, D: in, id = id3)

[Figure: task graph in which id2 depends on id1 through a detected dependency (data hazard on A) and id3 depends on id1 through an explicit dependency]
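The dependency detection in this example can be mimicked with a few lines of bookkeeping. The sketch below is a toy model, not StarPU's API: the class `ToyRuntime` and its methods merely reuse the names from the slide's pseudocode, and it only tracks "in" and "inout" access modes in submission order.

```python
class ToyRuntime:
    """Infers task dependencies from data access modes, in submission order."""

    def __init__(self):
        self.deps = {}          # task id -> set of task ids it must wait for
        self.last_writer = {}   # data name -> id of the last task that wrote it
        self.readers = {}       # data name -> ids that read it since the last write
        self.declared = {}      # task id -> explicitly declared predecessors

    def declare_dependency(self, task_id, on):
        self.declared.setdefault(task_id, set()).add(on)

    def submit_task(self, task_id, accesses):
        """`accesses` maps a data name to "in" or "inout"."""
        deps = set(self.declared.get(task_id, ()))
        for data, mode in accesses.items():
            if data in self.last_writer:
                deps.add(self.last_writer[data])       # access after a write
            if mode == "inout":
                deps |= self.readers.get(data, set())  # write after reads
                self.last_writer[data] = task_id
                self.readers[data] = set()
            else:
                self.readers.setdefault(data, set()).add(task_id)
        deps.discard(task_id)
        self.deps[task_id] = deps

rt = ToyRuntime()
rt.submit_task("id1", {"A": "inout", "B": "in"})
rt.submit_task("id2", {"A": "inout"})
rt.declare_dependency("id3", on="id1")
rt.submit_task("id3", {"C": "inout", "D": "in"})
print(rt.deps)
```

The edge from id1 to id2 is found automatically because both tasks modify A, while the edge from id1 to id3 exists only because it was declared: the two tasks share no data.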

The easy way: replace all the
    call operation1(i1, ..., in, o1, ..., om)
with
    call submit_task(operation1, i1, ..., in, o1, ..., om)
and let StarPU do all the work.

• the DAG may have millions of nodes, which makes the scheduling job too complex and memory consuming
• the scheduling of activation tasks has to be controlled in order to limit the memory consumption

Our approach: we give StarPU a limited view of the DAG; this is achieved by defining tasks that submit other tasks.

The multifrontal QR factorization: dynamic construction of the DAG

The activation tasks, in charge of allocating the memory and preparing the data structures needed for processing a front, are the ideal candidates to submit the numerical tasks.
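A small sequential sketch can show why letting activation tasks submit the numerical tasks keeps the runtime's view of the DAG limited. Everything here is an assumption made for illustration: the function `run`, the task tuples, the fixed number of numerical tasks per front, and the choice to run a front's numerical tasks before the next activation.

```python
from collections import deque

def run(fronts, tasks_per_front):
    """Execute a toy task queue where activation tasks submit numerical tasks.

    `fronts` lists front ids in a valid processing order; `tasks_per_front`
    is the (hypothetical) number of numerical tasks per front.
    Returns the execution trace and the largest queue length observed.
    """
    queue = deque(("activate", f) for f in fronts)
    trace = []
    peak = len(queue)
    while queue:
        task = queue.popleft()
        trace.append(task)
        if task[0] == "activate":
            front = task[1]
            # The front would be allocated here; only now do its numerical
            # tasks become visible to the queue.
            numeric = [("numeric", front, t) for t in range(tasks_per_front)]
            # Place them ahead of the pending activations so that a front
            # is finished before the next one is allocated.
            queue.extendleft(reversed(numeric))
        peak = max(peak, len(queue))
    return trace, peak

trace, peak = run(fronts=[0, 1, 2], tasks_per_front=2)
print(len(trace), peak)
```

In this toy run nine tasks are executed in total, yet the queue never holds more than four at once, because the numerical tasks of a front do not exist until its activation task has run.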
