A Tiling Based Programming Model and Its Supportive Tools

Didem Unat, Burak Bastem, Nufail Farooqi
Koç University, Istanbul
Weiqun Zhang, Tan Nguyen, John Shalf, Ann Almgren
Lawrence Berkeley National Laboratory
25 Jan 2016, SPPEXA Symposium, Munich, Germany

Abstract Machine Model (for emerging node architectures)
Download the CAL AMM doc from http://www.cal-design.org/
[Figure: abstract machine model of a node. 3D stacked memory (low capacity, high bandwidth), DRAM (high capacity, low bandwidth), NVRAM, fat cores, thin cores / accelerators, core coherence domain]
"Abstract machine models and proxy architectures for exascale computing", in Proceedings of the 1st International Workshop on Hardware-Software Co-Design for High Performance Computing (Co-HPC '14). IEEE Press

NERSC new system: Cori
• Minimize data movement by respecting the topology and hierarchy
• The new NERSC system, Cori
  – Mesh network w/ quadrants
  – No coherence domains yet
  – Heterogeneous memory subsystem
[Figure: KNL mesh on-chip network]
Move away from a compute-centric to a data-centric programming model

TiDA: Tiling as a Durable Abstraction
• Tiling is a well-known loop transformation for parallelism and data locality
  – Why not elevate it to the programming model?
• TiDA makes tiling part of the data structure declaration
  – Each array is extended with metadata to manage memory affinities
  – Metadata follows the array through the code
[Figure: Boxes 1 to 5, with Box 2 tiled into Tiles (1,1) through (3,2)]

Tiling introduces more parallelism
• OpenMP
  – #pragma omp for is generally used to introduce data parallelism
    • For an N = 128^3 problem on 1000 cores, one level of loop parallelism is not sufficient
  – #pragma omp for collapse(2)
    • The collapse clause doesn't decompose the data space
    • It flattens the iteration space, so tiling has to be introduced manually
• Tiling allows multidimensional decomposition of data
  – Each tile represents an independent unit of work
  – The task scheduler can work at tile granularity
• Multi-level parallelism
  – Coarse-grain parallelism: across tiles
  – Fine-grain parallelism: vectorization, instruction ordering within a tile
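
To make the parallelism argument concrete, here is a minimal C++ sketch of a three-dimensional tile decomposition. It is not TiDA code: the `Tile` struct and the `make_tiles` and `cell_count` helpers are hypothetical names chosen for illustration. For a 128^3 index space the outermost loop alone offers only 128 iterations, while a decomposition into 16 x 16 x 8 tiles yields 1024 independent units of work.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <vector>

// Hypothetical tile descriptor: half-open index range [lo, hi) per dimension.
struct Tile {
    std::array<int, 3> lo;
    std::array<int, 3> hi;
};

// Number of cells covered by one tile.
long cell_count(const Tile& t) {
    return static_cast<long>(t.hi[0] - t.lo[0]) *
           (t.hi[1] - t.lo[1]) * (t.hi[2] - t.lo[2]);
}

// Decompose an n[0] x n[1] x n[2] index space into tiles of at most
// t[0] x t[1] x t[2] cells. Tiles on the upper edges may be smaller.
std::vector<Tile> make_tiles(const std::array<int, 3>& n,
                             const std::array<int, 3>& t) {
    for (int d = 0; d < 3; ++d) assert(n[d] > 0 && t[d] > 0);
    std::vector<Tile> tiles;
    for (int k = 0; k < n[2]; k += t[2]) {
        for (int j = 0; j < n[1]; j += t[1]) {
            for (int i = 0; i < n[0]; i += t[0]) {
                Tile tile;
                tile.lo = {i, j, k};
                tile.hi = {std::min(i + t[0], n[0]),
                           std::min(j + t[1], n[1]),
                           std::min(k + t[2], n[2])};
                tiles.push_back(tile);  // each tile is an independent unit of work
            }
        }
    }
    return tiles;
}
```

A scheduler could then hand out entries of the returned vector to threads or tasks, keeping the loop body itself unchanged.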

Tiling improves data locality
• Horizontal data movement
  – Respect tile topology when placing tiles on the chip
  – If adjacent tiles share much of the data, we need to schedule them to adjacent threads (threads within the same socket)
• Vertical data movement
  – Tiling shrinks the working set size so that it fits in the available cache
[Figure: SMC code with 53 species. Bytes per flop (0.125 to 1) versus tile size (2 to 128) for 64 kB, 256 kB, 1 MB, 4 MB, and unlimited cache]
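
The working-set point can be illustrated with a rough capacity estimate. The sketch below is an assumption-laden model, not the analysis behind the slide's chart: it assumes cubic tiles, double-precision values, one array of `ncomp` components per cell, and a uniform ghost width. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes touched by one cubic tile of `edge` cells per side, padded by `ghost`
// cells on every face, with `ncomp` double-precision values per cell.
// Assumption: a single multi-component array dominates the working set.
std::int64_t working_set_bytes(int edge, int ghost, int ncomp) {
    assert(edge > 0 && ghost >= 0 && ncomp > 0);
    const std::int64_t padded = edge + 2 * ghost;
    return padded * padded * padded * ncomp *
           static_cast<std::int64_t>(sizeof(double));
}

// Largest power-of-two tile edge in [2, 128] whose working set fits in
// `cache_bytes`; returns 0 if even the smallest candidate does not fit.
int largest_fitting_edge(std::int64_t cache_bytes, int ghost, int ncomp) {
    int best = 0;
    for (int edge = 2; edge <= 128; edge *= 2) {
        if (working_set_bytes(edge, ghost, ncomp) <= cache_bytes) best = edge;
    }
    return best;
}
```

With 53 components per cell and no ghost cells, this model puts the largest fitting tile edge at 4 for a 64 kB cache and 16 for a 4 MB cache, which shows how quickly the working set grows with tile size in three dimensions.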

Three Simple Abstractions in TiDA
• Logical Tiles
  – These are logical partitions of data
  – Their size can be different for each loop nest
• Regional Tiles
  – Support NUMA architectures by allocating a group of tiles contiguously in memory
• Tile iterator
  – Hides traversal of tiles from the user
  – Decouples the loop iterations and parallelization from the loop body
  – Can be used for different execution models

Regional Tiles
• Each structured grid is divided into regions and mapped onto a different NUMA node (or cache coherence domain)
  – TiDA uses HWLOC to discover NUMA nodes and distribute regions to different NUMA nodes
• A programmer can set the region geometry using an environment variable or in the program
  – export TiDA_REGIONS=x,y,z
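
As an illustration of the `x,y,z` form of the environment variable, the following C++ sketch parses such a string into three positive integers. It is not TiDA's parser: `parse_region_geometry` and `region_geometry_from_env` are hypothetical helpers, and the fallback of one region per dimension when the variable is unset or malformed is an assumption made here.

```cpp
#include <array>
#include <cassert>
#include <cstdio>
#include <cstdlib>

// Parse "x,y,z" into three positive region counts.
// Returns false (leaving `out` untouched) on any malformed input.
bool parse_region_geometry(const char* text, std::array<int, 3>& out) {
    if (text == nullptr) return false;
    int x = 0, y = 0, z = 0;
    char trailing = '\0';
    // Exactly three conversions means three integers with nothing after z.
    if (std::sscanf(text, "%d,%d,%d%c", &x, &y, &z, &trailing) != 3) return false;
    if (x <= 0 || y <= 0 || z <= 0) return false;
    out = {x, y, z};
    return true;
}

// Read the geometry from the TiDA_REGIONS environment variable.
// Assumption: fall back to a single region per dimension if unset or invalid.
std::array<int, 3> region_geometry_from_env() {
    std::array<int, 3> geometry = {1, 1, 1};
    parse_region_geometry(std::getenv("TiDA_REGIONS"), geometry);
    return geometry;
}
```

Running a program after `export TiDA_REGIONS=2,2,1` would make `region_geometry_from_env()` return `{2, 2, 1}` in this sketch.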

Logical Tiles
• Logical tiles partition a region logically
  – No memory allocation is required
  – Only the way the data is traversed differs
• Designed for improving cache reuse within a NUMA node

Regional Tiles and Ghost Cells
• Regions represent disjoint memory locations
• They introduce ghost cells that keep data needed from other regions
• TiDA provides the fill_tileboundary() routine to update the ghost cells in a program
• The programmer is responsible for where to call this routine, but TiDA handles communication between regions
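
The idea of ghost cells between disjoint regions can be sketched in one dimension. This is an illustrative C++ example only and does not show how TiDA's own routine is implemented: the `Region` struct and `fill_ghosts` function are hypothetical, and the two regions are assumed to be direct neighbors with the same ghost width.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// One 1D region with its own storage, laid out as:
// `ghost` ghost cells, `interior` owned cells, `ghost` ghost cells.
struct Region {
    int interior;
    int ghost;
    std::vector<double> data;

    Region(int interior_cells, int ghost_cells)
        : interior(interior_cells), ghost(ghost_cells),
          data(static_cast<std::size_t>(interior_cells + 2 * ghost_cells), 0.0) {
        assert(ghost_cells >= 0 && interior_cells >= ghost_cells);
    }

    // Access by interior index; valid indices are -ghost .. interior + ghost - 1.
    double& at(int i) { return data[static_cast<std::size_t>(i + ghost)]; }
};

// Exchange boundary data between two adjacent regions, where `right`
// starts immediately after the last owned cell of `left`.
void fill_ghosts(Region& left, Region& right) {
    assert(left.ghost == right.ghost);
    const int g = left.ghost;
    for (int k = 0; k < g; ++k) {
        // Upper ghost cells of `left` copy the first owned cells of `right`.
        left.at(left.interior + k) = right.at(k);
        // Lower ghost cells of `right` copy the last owned cells of `left`.
        right.at(-g + k) = left.at(left.interior - g + k);
    }
}
```

After the exchange, a stencil applied to the owned cells of either region can read its neighbor's boundary values from local memory.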

Dynamic Tile and Static Region Sizes
• Tile Size
  – Tile size is parameterized and local
  – Tile size is dynamic with the help of the tile iterator
    • Requires no reallocation
    • Some loops do not benefit from tiling (element-wise updates, no reuse)
• Region Size
  – Region size is parameterized and global; it can be set at launch time
  – Different arrays can have different region sizes
  – Region size is static; otherwise it would require reallocation of data

Tile Iterator
• Loop traversal construct to abstract the tile traversal order and parallelism
• Applies the loop body on every tile
• Takes tilesize to logically tile the array
  – It creates logical tiles on the fly
• Allows dynamic tile sizes for logical tiling
  – Different tile size per nested loop
We have both C++ and Fortran implementations and API for TiDA
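
A tile iterator of this kind can be sketched in a few lines of C++. The class below is hypothetical and is not TiDA's C++ API; it only mirrors the usage pattern of the deck's Fortran example, where a loop keeps asking for the next tile and then reads its lower and upper bounds. Bounds are inclusive, as in the Fortran loops, and tiles are created on the fly without allocating any data.

```cpp
#include <algorithm>
#include <array>
#include <cassert>

// Hypothetical 2D tile iterator over the inclusive index range [lo, hi].
class TileIterator2D {
public:
    TileIterator2D(std::array<int, 2> lo, std::array<int, 2> hi,
                   std::array<int, 2> tile)
        : lo_(lo), hi_(hi), tile_(tile), next_lo_(lo), tlo_(lo), thi_(lo) {
        assert(lo[0] <= hi[0] && lo[1] <= hi[1]);
        assert(tile[0] > 0 && tile[1] > 0);
    }

    // Advance to the next tile; returns false once all tiles were visited.
    bool next() {
        if (next_lo_[1] > hi_[1]) return false;
        tlo_ = next_lo_;
        // Clip the tile against the upper bounds of the range.
        thi_ = {std::min(tlo_[0] + tile_[0] - 1, hi_[0]),
                std::min(tlo_[1] + tile_[1] - 1, hi_[1])};
        // Step along dimension 0 first, then wrap to the next row of tiles.
        next_lo_[0] += tile_[0];
        if (next_lo_[0] > hi_[0]) {
            next_lo_[0] = lo_[0];
            next_lo_[1] += tile_[1];
        }
        return true;
    }

    const std::array<int, 2>& lower() const { return tlo_; }  // current tile
    const std::array<int, 2>& upper() const { return thi_; }  // current tile

private:
    std::array<int, 2> lo_, hi_, tile_;
    std::array<int, 2> next_lo_;  // lower corner of the tile returned next
    std::array<int, 2> tlo_, thi_;
};
```

Because the tile size is only an argument of the iterator, a second loop nest can construct another `TileIterator2D` over the same range with a different tile size and no reallocation.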

Interaction with AMR libraries
[Figure: data hierarchy. AMR library (e.g. Boxlib, Chombo): AMR Level, Grid. TiDA library: Region, Tile, Cell]
• Regions and tiles are light-weight data structures, compared to Grid
• Construct messages directly from Regions

Building TiDA Arrays
    type(tileArray) :: A, B
    type(absTileArray) :: abstractAB

    ! lb: lower bound, ub: upper bound, regionsizes: integer vector of region sizes
    abstractAB = absTileArray_build(lb, ub, regionsizes, tilesizes)

    A = tilearray_build(abstractAB, ghosts)
    B = tilearray_build(abstractAB, ghosts)
    ...
    call tilearray_destroy(A)
    call tilearray_destroy(B)
We have both C++ and Fortran implementations and API for TiDA

Operation on TiDA Arrays
    !$OMP PARALLEL PRIVATE(ti, tlo, thi, reglo, reghi, i, j, ptrA)
    ! Tiling iterator and its loop
    ti = tileItr_build(abstractAB, logtilesize)
    do while(next_tile(ti))
       ! Get tile and its range
       ptrA => dataptr(A, ti)
       tlo = get_lwb(ti)
       thi = get_upb(ti)

       ! Option 1: process a tile within a loop (original loop nest)
       do j = tlo(2), thi(2)
          do i = tlo(1), thi(1)
             ptrA(i,j) = compute(i,j)
             ...
          end do
       end do

       ! Option 2: process a tile within a function
       reglo = get_lwb(get_region(A, ti))
       reghi = get_upb(get_region(A, ti))
       ! call compute_a_tile(ptrA, tlo, thi, reglo, reghi)
    end do
    !$OMP END PARALLEL
