Does your tool support PAPI SDEs yet?
13th Scalable Tools Workshop
Anthony Danalis, Heike Jagode, Jack Dongarra
Tahoe City, CA, July 28-Aug 1, 2019
Case study: PaRSEC’s task scheduling algorithm
[Diagram: Core 0, Core 1, Core 2, ..., Core N, each with a core-local queue, plus a shared global (overflow) queue]
● Thread-local queues => high locality
● Overflow & work stealing => load balance
Parameter selection
Q1: How long should the local queues be?
A: 4*Core_Count
Q2: Should a thread first steal from a close queue, any queue, or the shared queue?
A: Any local queue (closest to farthest), then the shared queue.
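The two answers above can be sketched in a few lines of C. This is only an illustration, not PaRSEC's scheduler code: the function names, the `SHARED_QUEUE` marker, and the use of ring distance between core ids as a stand-in for "closest to farthest" are all assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical marker for the shared overflow queue. */
#define SHARED_QUEUE (-1)

/* Local queue capacity as stated on the slide: 4 * core count. */
static size_t local_queue_capacity(size_t core_count) {
    return 4 * core_count;
}

/* Fill `order` with the queues thread `self` should try to steal from:
 * every other local queue, nearest first, then the shared queue.
 * `order` needs room for core_count entries.
 * Distance here is ring distance between core ids (an assumption);
 * a real runtime would use the hardware topology instead.
 * Returns the number of entries written. */
static size_t steal_order(int self, int core_count, int *order) {
    assert(core_count > 0 && self >= 0 && self < core_count);
    size_t n = 0;
    for (int d = 1; d <= core_count / 2; d++) {
        int right = (self + d) % core_count;
        int left  = (self - d + core_count) % core_count;
        order[n++] = right;
        if (left != right) {
            order[n++] = left;
        }
    }
    order[n++] = SHARED_QUEUE;  /* last resort: the overflow queue */
    return n;
}
```

With 4 cores, thread 0 would try queues 1, 3, 2 and then the shared queue; with 20 cores each local queue would hold 80 tasks.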
Testing Benchmark
[Diagram: independent fork-join task chains]
● 20 independent fork-join chains x 20 (or 25) tasks per fork.
● Memory-bound kernel with good cache locality.
● 20 cores on the testing node.
Execution time vs Local Queue Length
Execution time vs Local Queue Length (zoom)
Execution time vs Local Queue Length (zoom 2)
Execution time vs Local Queue Length (zoom 3)
Execution time vs Local Queue Length (zoom 4)
Execution time vs Local Queue Length (zoom 5)
Execution time vs Local Queue Length (combined)
Failed Stealing Attempts
L2 Cache Misses (L3 shows the same pattern)
Successful Close Stealing
Successful Close & Far Stealing
Successful Shared Queue Stealing
Successful Local + Shared Queue Stealing
Unanswered questions
Q: So, what causes the bump?
A: I don’t know!
Q: How did you measure all these things?
A: I am glad you asked.
What is missing from current infrastructure?
Events that occurred inside the software stack.
There is no standardized way for a software layer to export information about its behavior such that other, independently developed software layers can read it.
[Diagram labels: HPC Application, Quantum Chemistry Method, Distributed Factorization, Math library, Data Dependency, Task runtime, One Sided Communication, MPI, RDMA completion, Libibverbs]
PAPI Software Defined Events
• De facto standard: SDEs from your library can be read using the standard PAPI_start()/PAPI_stop()/PAPI_read().
• Low overhead: Performance-critical codes can implement SDEs with zero overhead by exporting existing code variables without adding any new instructions in the fast path.
• Rich feature set: PAPI SDE supports counters, groups, recordings, simple statistics, thread safety, and custom callbacks.
The tool infrastructure is already there
Simplest SDE code (library side)

    static long long local_var;

    void small_test_init(void) {
        local_var = 0;
        papi_handle_t *handle = papi_sde_init("TEST");
        papi_sde_register_counter(handle, "Evnt",
                                  PAPI_SDE_RO|PAPI_SDE_DELTA,
                                  PAPI_SDE_long_long, &local_var);
        ...
    }
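To make the idea behind a registered, read-only, delta-mode counter concrete, here is a toy model in plain C. It is not PAPI's implementation and uses none of its API: `toy_counter_t`, `toy_register`, `toy_start`, and `toy_read` are hypothetical names. The sketch assumes that "delta" means the reader sees the change since it started counting, and it shows why the library's fast path needs no extra instructions: the library keeps updating its own variable, and the reader dereferences a stored pointer only when asked.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a registered counter: it only remembers where the
 * library's variable lives and what its value was when counting began. */
typedef struct {
    const long long *source;   /* variable owned by the library */
    long long baseline;        /* value captured at toy_start() */
} toy_counter_t;

/* Registration stores a pointer; the library code itself is unchanged. */
static void toy_register(toy_counter_t *c, const long long *var) {
    assert(c != NULL && var != NULL);
    c->source = var;
    c->baseline = 0;
}

/* Begin a measurement interval by snapshotting the current value. */
static void toy_start(toy_counter_t *c) {
    c->baseline = *c->source;
}

/* Delta semantics: report the change since toy_start(). */
static long long toy_read(const toy_counter_t *c) {
    return *c->source - c->baseline;
}
```

If the library's variable is 10 when counting starts and later reaches 15, a read in this model reports 5 rather than 15.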
SDE code for registering a callback function

    sometype_t *data;

    void small_test_init(void) {
        data = ...
        papi_handle_t *handle = papi_sde_init("TEST");
        papi_sde_register_fp_counter(handle, "Evnt",
                                     PAPI_SDE_RO|PAPI_SDE_DELTA,
                                     PAPI_SDE_long_long,
                                     accessor, data);
        ...
    }
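The callback variant is useful when the value of interest is not stored anywhere and has to be computed on demand. The following toy model shows the shape of such an accessor; it is a sketch under assumptions, not PAPI code. The `toy_*` types and `queue_length` are hypothetical, and the accessor signature (an opaque pointer in, a `long long` out) is chosen for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical accessor type: receives the opaque pointer given at
 * registration time and computes the counter value when called. */
typedef long long (*toy_accessor_fn)(void *param);

/* Toy stand-in for a callback-backed counter. */
typedef struct {
    toy_accessor_fn fn;
    void *param;
} toy_fp_counter_t;

/* Example library state: a queue whose length is never stored directly. */
typedef struct {
    size_t head;
    size_t tail;
} toy_queue_t;

/* Accessor: derive the current queue length from head and tail. */
static long long queue_length(void *param) {
    const toy_queue_t *q = param;
    return (long long)(q->tail - q->head);
}

/* The reader invokes the accessor only when the counter is read. */
static long long toy_fp_read(const toy_fp_counter_t *c) {
    assert(c != NULL && c->fn != NULL);
    return c->fn(c->param);
}
```

The library pays the cost of computing the value only when a tool actually reads the counter, not on every queue operation.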
SDE code for creating a counter (push mode)

    void *counter_handle;

    void small_test_init(void) {
        papi_handle_t *handle = papi_sde_init("TEST");
        papi_sde_create_counter(handle, "Evnt",
                                PAPI_SDE_long_long,
                                &counter_handle);
        ...
    }
SDE code for creating a recorder (push mode)

    void *recorder_handle;

    void small_test_init(void) {
        papi_handle_t *handle = papi_sde_init("TEST");
        papi_sde_create_recorder(handle, "RCRDR",
                                 sizeof(double),
                                 cmpr_func_ptr,
                                 &recorder_handle);
        ...
    }

Resulting events:
sde:::TEST::RCRDR
sde:::TEST::RCRDR:CNT
sde:::TEST::RCRDR:MIN
sde:::TEST::RCRDR:Q1
sde:::TEST::RCRDR:MED
sde:::TEST::RCRDR:Q3
sde:::TEST::RCRDR:MAX
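The MIN, Q1, MED, Q3, and MAX events suggest why a recorder needs an element size and a comparison function: order statistics can be obtained by sorting the recorded values. The sketch below illustrates that idea in plain C for `double` values. It is not PAPI's implementation; `summary_t`, `cmp_double`, and `summarize` are hypothetical names, and picking the element at a fixed fraction of the sorted array is an assumed quartile definition.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Five-number summary of a set of recorded values. */
typedef struct {
    double min, q1, med, q3, max;
} summary_t;

/* Comparison function in the style expected by qsort(). */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Sort a private copy of the values and pick order statistics by index.
 * Returns 0 on success, -1 on bad input or allocation failure. */
static int summarize(const double *vals, size_t n, summary_t *out) {
    if (vals == NULL || out == NULL || n == 0) {
        return -1;
    }
    double *sorted = malloc(n * sizeof *sorted);
    if (sorted == NULL) {
        return -1;
    }
    memcpy(sorted, vals, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_double);

    out->min = sorted[0];
    out->q1  = sorted[(n - 1) / 4];       /* assumed quartile rule */
    out->med = sorted[(n - 1) / 2];
    out->q3  = sorted[3 * (n - 1) / 4];
    out->max = sorted[n - 1];

    free(sorted);
    return 0;
}
```

Because only a comparison function is required, the same approach would work for any fixed-size element type, which matches the recorder taking `sizeof(double)` and `cmpr_func_ptr` as separate arguments.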
SDE code for updating created counters/recorders

    void *counter_handle;
    void *recorder_handle;

    void push_test_dowork(void) {
        double val;
        long long increment = 3;

        val = perform_useful_work();

        papi_sde_inc_counter(counter_handle, increment);
        papi_sde_record(recorder_handle, sizeof(val), &val);
    }
Performance overheads in simple benchmark
Performance overhead in PaRSEC
Performance overhead in HPCG
Performance overhead in HPCG (zoom)
Open Problem for our Community: How do we associate useful context information with SDEs?
What meaningful information should be associated with “TASKS_STOLEN”?
– Code location
– Hardware events (e.g. cache misses)
– Patterns in history (e.g. last task before stealing event)
– Patterns in call-path/stack/originating thread