Towards a multifrontal QR factorization for heterogeneous architectures over runtime systems: preliminary work on multicore architectures
Florent Lopez. Joint work with IRIT Toulouse, LaBRI / Inria Bordeaux, LIP / Inria Lyon
MUMPS Users Group Meeting, May 29-30, 2013
Context of the work
The multifrontal method
The multifrontal QR factorization is guided by a graph called the elimination tree:
• at each node of the tree, k pivots are eliminated
• each node of the tree is associated with a relatively small dense matrix called the frontal matrix (or, simply, front), which contains the k columns related to the pivots and all the other coefficients concerned by their elimination
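The tree-guided processing order described above can be illustrated with a minimal Python sketch. It assumes the elimination tree is given as a hypothetical `parent` array (with -1 marking a root) and only computes an order in which every front is visited after all of its children. It does not assemble or factorize any matrix, and it is not taken from qr_mumps.

```python
from collections import defaultdict

def postorder(parent):
    """Return the nodes of a forest so that every child precedes its parent."""
    children = defaultdict(list)
    roots = []
    for node, p in enumerate(parent):
        if p < 0:
            roots.append(node)
        else:
            children[p].append(node)
    order = []
    # Iterative depth-first traversal: a node is emitted once its subtree is done.
    stack = [(r, False) for r in reversed(roots)]
    while stack:
        node, expanded = stack.pop()
        if expanded:
            order.append(node)
        else:
            stack.append((node, True))
            for c in reversed(children[node]):
                stack.append((c, False))
    return order

# Hypothetical 5-node tree: 0 and 1 are children of 2; 2 and 3 are children of 4.
parent = [2, 2, 4, 4, -1]
order = postorder(parent)
```

Any order with this child-before-parent property is a valid sequential schedule of the fronts; independent subtrees are what a parallel scheduler can process concurrently.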
Accelerated architectures
Accelerated architectures, including GPUs, are extremely popular in the HPC community; compared with the latest generation of CPUs they offer:
• high density of computational units clocked at low frequencies
◦ high potential performance peak for parallel applications
◦ limited power consumption
New generation of accelerators: MIC (Many Integrated Core)
• 60 cores (in-order cores derived from the Pentium: energy efficient)
• clocked at 1053 MHz
• 240 threads (4 hyper-threading siblings per core)
• advanced VPU per core (512-bit SIMD)
Incompatible programming models ⇒ specific kernels and algorithms to achieve performance
Accelerated architectures (ACC: GPU or MIC)
[Figure: elimination tree and heterogeneous platform (CPUs, accelerators (ACC), RAM)]
• an extremely heterogeneous workload
• a heterogeneous architecture
• mapping tasks is challenging
Accelerated architectures (ACC: GPU or MIC)
One option is to map the tasks by hand, which implies:
• management of data consistency by hand
• an architecture-dependent approach
• code that is difficult to maintain
Accelerated architectures (ACC: GPU or MIC)
Another option is to exploit the features of a modern runtime system capable of handling the scheduling and the data consistency in a dynamic way.
[Figure: the elimination tree mapped onto the heterogeneous platform through a runtime system (StarPU) providing a scheduler, a DSM layer and drivers (CPU, GPU, SPU)]
Runtime systems
Runtime system: abstract layer between application and machine with the following features:
• automatic detection of the task dependencies
• dynamic task scheduling on different types of processing units
• management of multi-versioned tasks (an implementation for each type of processing unit)
• consistency management of manipulated data
Runtime systems
Objective of the study: evaluate the usability and the effectiveness of a general-purpose runtime system with a complex and irregular workload such as a sparse factorization
• exploitation of GPU-accelerated architectures
This approach is widely adopted in the case of dense linear algebra:
• PLASMA (QUARK)
• DPLASMA (PaRSEC)
• MAGMA-MORSE (StarPU)
• FLAME (SuperMatrix)
It is challenging for complex and irregular problems such as sparse linear algebra (related work on PaStiX with Pierre Ramet).
Multifrontal QR factorization on multicores over StarPU
Parallelism and scheduling strategy in qr_mumps
In qr_mumps, node and tree parallelism are exploited consistently by partitioning the frontal matrices and replacing the elimination tree with a DAG.
The scheduler efficiency is constrained by the task search space: the scheduling complexity depends on the number of active fronts and is therefore not very scalable.
Replace the ad hoc scheduler in qr_mumps with a general-purpose runtime system.
The multifrontal QR factorization: StarPU integration
[Figure: a StarPU task, made of CPU, GPU and SPU code, a priority, inputs 1..n and outputs 1..m]
• Depending on the input/output, StarPU detects the dependencies among tasks
• Depending on the availability of resources and the data placement, StarPU decides where to run a task
The multifrontal QR factorization: StarPU integration
Original sequence:
    fun1(A: inout, B: in)
    fun2(A: inout)
    fun1(C: inout, D: in)
Equivalent StarPU code:
    submit_task(fun1, A: inout, B: in, id = id1)
    submit_task(fun2, A: inout, id = id2)
    declare_dependency(id3 ← id1)
    submit_task(fun1, C: inout, D: in, id = id3)
[Figure: task graph with a detected dependency (data hazard) id1 → id2 and an explicit dependency id1 → id3]
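The dependency detection in the example above can be mimicked with a small Python model. This is only a sketch of the general idea (inferring edges from the access modes of tasks submitted in sequence), not StarPU's implementation or API; the function `infer_dependencies` and the task tuples are hypothetical names introduced for illustration.

```python
def infer_dependencies(tasks):
    """Infer data-hazard edges from tasks given in submission order.

    tasks: list of (task_id, [(data_name, mode), ...]), mode is "in" or "inout".
    Returns a set of (predecessor, successor) pairs.
    """
    last_writer = {}  # data name -> id of the last task that wrote it
    readers = {}      # data name -> ids of tasks that read it since that write
    edges = set()
    for task_id, accesses in tasks:
        for name, mode in accesses:
            writer = last_writer.get(name)
            if writer is not None and writer != task_id:
                edges.add((writer, task_id))      # access after a write
            if mode == "inout":
                for reader in readers.get(name, []):
                    if reader != task_id:
                        edges.add((reader, task_id))  # write after a read
                last_writer[name] = task_id
                readers[name] = []
            else:
                readers.setdefault(name, []).append(task_id)
    return edges

# The three tasks of the slide, in submission order.
tasks = [
    ("id1", [("A", "inout"), ("B", "in")]),
    ("id2", [("A", "inout")]),
    ("id3", [("C", "inout"), ("D", "in")]),
]
edges = infer_dependencies(tasks)  # detected dependency: id1 -> id2 (hazard on A)
edges.add(("id1", "id3"))          # explicit dependency, declared by the user
```

Only the hazard on A is found automatically; the edge from id1 to id3 touches no common data and therefore has to be declared explicitly, as in the slide.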
The multifrontal QR factorization: StarPU integration
The easy way: replace every
    call operation1(i1, ..., in, o1, ..., om)
with
    call submit_task(operation1, i1, ..., in, o1, ..., om)
and let StarPU do all the work.
The multifrontal QR factorization: StarPU integration
• the DAG may have millions of nodes, which makes the scheduling job too complex and memory consuming
• the scheduling of activation tasks has to be controlled in order to limit the memory consumption
The multifrontal QR factorization: StarPU integration
Our approach: we give StarPU a limited view of the DAG; this is achieved by defining tasks that submit other tasks.
The multifrontal QR factorization: dynamic construction of the DAG
The activation tasks, in charge of allocating the memory and preparing the data structures needed for processing a front, are the ideal candidates to submit the numerical tasks.
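The idea of tasks that submit other tasks can be sketched with a toy, purely sequential Python model. It is an illustration under strong assumptions, not the qr_mumps or StarPU code: the `run` function, the front names and the task names are hypothetical, and dependencies between fronts are ignored. Only activation tasks are submitted up front, and each one submits the numerical tasks of its front when it executes.

```python
from collections import deque

def run(fronts):
    """Execute a toy task queue and report what ran and the peak queue size.

    fronts: dict mapping a front name to the list of its numerical task names.
    """
    # Initial, limited view of the DAG: one activation task per front.
    queue = deque(("activate", f) for f in fronts)
    max_pending = len(queue)
    trace = []
    while queue:
        kind, name = queue.popleft()
        trace.append((kind, name))
        if kind == "activate":
            # A real activation task would allocate the front here;
            # this model only submits the front's numerical tasks.
            queue.extendleft(("numeric", t) for t in reversed(fronts[name]))
        max_pending = max(max_pending, len(queue))
    return trace, max_pending

# Hypothetical fronts and numerical tasks.
fronts = {"f1": ["f1_panel", "f1_update"], "f2": ["f2_panel"]}
trace, max_pending = run(fronts)
```

In this model the queue never holds all five tasks at once, which is the point of the approach: the runtime only sees the numerical tasks of fronts that have already been activated.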