An Analytical Model for BLIS
Tze Meng Low (1), Francisco D. Igual (2), Tyler M. Smith (3), Enrique Quintana-Ortí (4)
(1) Carnegie Mellon University, (2) Universidad Complutense de Madrid, Spain, (3) The University of Texas at Austin, (4) Universidad Jaume I, Spain
2nd BLIS Retreat, 25-26 September 2015
Background
◮ BLAS-like Library Instantiation Software (BLIS)
◮ Framework for rapidly instantiating BLAS (or BLAS-like) functionality using the GotoBLAS approach
◮ Productivity multiplier for the developer
◮ With BLIS, an expert still has to
◮ Identify parameter values (e.g. block sizes); and
◮ Implement an efficient micro-kernel in assembly (in essence, a series of outer products)
Background
◮ "Is Search Really Necessary to Generate High-Performance BLAS?" [Yotov et al., 2005]
◮ Showed that empirical search in ATLAS can be replaced with simple analytical models
◮ Key differences
◮ ATLAS: scalar instructions; single-level, fully associative cache; compared against ATLAS-generated code (no user kernels)
◮ BLIS: SIMD instructions; hierarchy of set-associative caches; compared against hand-coded implementations
GotoBLAS at a glance
◮ 5 parameters (m_r, n_r, k_c, m_c, and n_c)
[Figure: the GotoBLAS blocking. An m_r × n_r micro-block of C is held in registers; a k_c × n_r micro-panel of B resides in the L1 cache; an m_c × k_c block of A resides in the L2 cache; a k_c × n_c block of B resides in the L3 cache.]
Model Architecture
◮ Vector registers
◮ Each vector register holds N_vec elements.
◮ FMA instructions
◮ Throughput of N_fma per clock cycle.
◮ Instruction latency is given by L_fma.
◮ Caches
◮ All caches are set-associative.
◮ Cache replacement policy is LRU.
◮ Cache line size is the same for all caches.
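The hardware abstraction above can be collected in a small record. This is an illustrative sketch; the field names are not taken from the BLIS source, and the values in the usage line are hypothetical, loosely resembling a Sandy Bridge class core:

```python
from dataclasses import dataclass

@dataclass
class Machine:
    n_vec: int       # elements held by one vector register (N_vec)
    n_fma: int       # FMA instructions issued per clock cycle (N_fma)
    l_fma: int       # FMA instruction latency in cycles (L_fma)
    line_bytes: int  # cache line size, assumed identical at all levels
    l1_bytes: int    # L1 data cache capacity
    l1_ways: int     # L1 associativity (LRU replacement assumed)

# hypothetical parameter values for illustration only
machine = Machine(n_vec=4, n_fma=1, l_fma=8,
                  line_bytes=64, l1_bytes=32 * 1024, l1_ways=8)
```

All the parameter derivations that follow take only these few numbers as input, which is what makes the model purely analytical.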
Parameters: m_r, n_r
◮ Recall:
◮ m_r and n_r determine the size of the micro-block of C
◮ Each element is computed exactly once in each iteration of the micro-kernel
◮ Strategy
◮ Pick the smallest micro-block of C (m_r × n_r) such that no stalls arising from dependencies and instruction latency occur when computing one iteration of the micro-kernel.
Parameters: m_r, n_r
◮ Recall:
◮ Each FMA instruction has a latency of L_fma
◮ N_fma FMA instructions can be issued per clock cycle
◮ Each FMA instruction computes N_vec elements
[Figure: timeline showing N_fma FMA instructions issued per cycle, each completing L_fma cycles later.]
Parameters: m_r, n_r
◮ Minimum size of the micro-block of C:
    m_r × n_r ≥ N_vec · L_fma · N_fma
◮ Ideally,
    m_r ≈ n_r ≈ √(N_vec · L_fma · N_fma)
◮ In practice, m_r (or n_r) must be a multiple of the vector width:
    m_r (or n_r) = ⌈√(N_vec · L_fma · N_fma) / N_vec⌉ · N_vec
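A minimal sketch of this computation follows. Rounding m_r up to a multiple of N_vec is the formula above; taking n_r as the smallest integer that still satisfies the inequality is an assumption of this sketch, and the N_vec, L_fma, N_fma values in the usage line are hypothetical:

```python
import math

def micro_tile(n_vec, l_fma, n_fma):
    # minimum number of elements of C that must be kept in registers
    # to hide FMA latency: m_r * n_r >= N_vec * L_fma * N_fma
    min_elems = n_vec * l_fma * n_fma
    # m_r: square root of the minimum, rounded up to a vector-width multiple
    m_r = math.ceil(math.sqrt(min_elems) / n_vec) * n_vec
    # n_r: smallest integer that still satisfies the inequality (assumption)
    n_r = math.ceil(min_elems / m_r)
    return m_r, n_r

# hypothetical machine: 4-wide vectors, latency 8, one FMA per cycle
m_r, n_r = micro_tile(n_vec=4, l_fma=8, n_fma=1)
# -> m_r = 8, n_r = 4
```

With these illustrative inputs the micro-block holds 32 elements of C, exactly the N_vec · L_fma · N_fma independent accumulations needed to keep the FMA pipeline full.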
Parameters: k_c, m_c, n_c
◮ Recall that k_c, m_c, and n_c are dimensions of the matrices that are kept in different caches
◮ L1: micro-panel of B, k_c × n_r
◮ L2: packed block of A, m_c × k_c
◮ L3 (if available): packed block of B, k_c × n_c
◮ Pick the largest k_c, m_c, and n_c such that the matrices are still kept in their caches
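The capacity constraints listed above can be checked mechanically. This is a simple footprint sketch, not the model's full derivation (which also accounts for associativity, as the later slides show); the cache sizes in the usage line are hypothetical:

```python
def blocks_fit(m_c, n_c, k_c, n_r, s_data, l1, l2, l3):
    # bytes occupied by each packed operand, per the blocking above
    micro_b = k_c * n_r * s_data   # micro-panel of B, kept in L1
    block_a = m_c * k_c * s_data   # packed block of A, kept in L2
    block_b = k_c * n_c * s_data   # packed block of B, kept in L3
    return micro_b <= l1 and block_a <= l2 and block_b <= l3

# hypothetical double-precision example (s_data = 8 bytes) with
# 32 KiB L1, 256 KiB L2, 8 MiB L3
ok = blocks_fit(m_c=96, n_c=4096, k_c=256, n_r=4, s_data=8,
                l1=32 * 1024, l2=256 * 1024, l3=8 * 1024 * 1024)
# -> True
```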
Parameters: k_c, m_c, n_c
◮ Consider the L1 cache:
◮ The same micro-panel of B is used between different invocations of the micro-kernel
◮ Micro-panels of A are used only once
◮ For simplicity, micro-panels of A and B are the same size
◮ Option 1: keep the micro-panels of both A and B in the L1 cache
[Figure: successive micro-panels A_0, A_1, ... accumulate in the cache alongside B and the micro-blocks of C.]
Not an efficient use of cache!
Parameters: k_c, m_c, n_c
◮ Consider the L1 cache:
◮ The same micro-panel of B is used between different invocations of the micro-kernel
◮ Micro-panels of A are used only once
◮ For simplicity, micro-panels of A and B are the same size
◮ Option 2: keep only the current micro-panel of A; each new micro-panel of A evicts the previous one while B stays resident
[Figure: A_1 replaces A_0 in the cache; the micro-panel of B remains in place.]
A larger micro-panel of B can be kept in the L1 cache (larger k_c!)
Parameters: k_c, m_c, n_c
◮ Observation 1: A and B are packed
◮ Elements of A and B are in contiguous memory locations
◮ Observation 2: Caches are set-associative
◮ Cache lines are evicted when all W cache lines in a set are filled
◮ At least one cache line in each set is filled with elements from C
◮ Micro-panels of A and B can therefore fill at most W − 1 cache lines in each set
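The set-associative mapping behind Observation 2 can be illustrated directly. This is a generic physically indexed LRU cache sketch with hypothetical geometry (64 sets, 64-byte lines), not a model of any specific processor:

```python
def cache_set(addr, line_bytes=64, n_sets=64):
    # a set-associative cache chooses the set from the line address
    # modulo the number of sets
    return (addr // line_bytes) % n_sets

# two addresses one "set stride" apart (n_sets * line_bytes = 4096 bytes)
# map to the same set and therefore compete for that set's W ways
stride = 64 * 64
same_set = cache_set(0) == cache_set(stride)
# -> True
```

Because the packed micro-panels occupy contiguous memory, a micro-panel whose size is a multiple of this set stride touches every set the same number of times, which is exactly the property the next slide exploits.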
Parameters: k_c, m_c, n_c
◮ Recall: we want each new micro-panel of A to evict the old micro-panel of A
◮ The starting location of each micro-panel of A must map to the same set
◮ The size of a micro-panel of A must therefore be a multiple (C_Ar) of the number of sets in the cache
◮ C_Ar is the number of cache lines in each set allocated to a micro-panel of A
◮ k_c can then be computed as follows:
    k_c = (C_Ar · N_L1 · C_L1) / (m_r · S_data)
  where N_L1 is the number of sets in the L1 cache, C_L1 is the cache line size, and S_data is the size of one data element.
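The k_c formula above can be sketched in a few lines. C_Ar is taken as an input here rather than derived (how the W − 1 available lines per set are split between A and B is not spelled out on this slide), and the L1 geometry in the usage line is hypothetical:

```python
def compute_kc(c_ar, n_sets, line_bytes, m_r, s_data):
    # k_c = (C_Ar * N_L1 * C_L1) / (m_r * S_data)
    return (c_ar * n_sets * line_bytes) // (m_r * s_data)

# hypothetical L1: 32 KiB, 8-way, 64-byte lines -> 64 sets;
# assume C_Ar = 4 lines per set for A, m_r = 8, double precision (8 bytes)
kc = compute_kc(c_ar=4, n_sets=64, line_bytes=64, m_r=8, s_data=8)
# 4 * 64 * 64 / (8 * 8) = 256
```

With these illustrative inputs the formula yields k_c = 256, in the same range as the values reported for several architectures in the validation tables.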
Validation
◮ Compare parameter values from the model against OpenBLAS and manually optimized BLIS implementations
◮ The model should yield similar (if not identical) parameter values to those in existing implementations, since all three approaches follow the GotoBLAS approach
Validation
◮ Size of the micro-block of C, m_r and n_r

                      OpenBLAS        BLIS         Model
Architecture         m_r    n_r     m_r   n_r    m_r   n_r
Intel Dunnington      4      4       4     4      4     4
Intel SandyBridge     8      4       8     4      8     4
TI C6678              -      -       4     4      4     4
AMD Piledriver       8 (6)  2 (4)    4     6      4     6
Validation
◮ Values of k_c and m_c
◮ n_c not shown because each architecture either had no L3 cache, or varying n_c resulted in minimal performance variation

                       BLIS            Model
Architecture         k_c    m_c      k_c    m_c
Intel Dunnington     256    384      256    384
Intel SandyBridge    256     96      256     96
TI C6678             256    128      256    128
AMD Piledriver       120   1088      128   1792
Conclusion
◮ An analytical model for determining the parameter values required by BLIS
◮ Parameter values that are similar, if not identical, to those in expert-tuned implementations
◮ Result consistent with Yotov et al.: analytical modeling is sufficient for high-performance BLIS
Future Work
◮ Relax assumptions
◮ Include bandwidth considerations
◮ Different cache replacement policies
◮ Complex arithmetic
◮ More complicated linear algebra algorithms (e.g. LAPACK)
◮ Extend model to LAPACK-type algorithms
◮ Can BLIS parameters be used to determine optimal block sizes for LAPACK algorithms?
◮ Hardware co-design
◮ The analytical model for LAP [Pedram et al., 2012] is similar to the analytical model presented here
◮ Could the model be used in cache design / cache replacement policy design?