
Does your tool support PAPI SDEs yet? 13th Scalable Tools Workshop

Anthony Danalis, Heike Jagode, Jack Dongarra. 13th Scalable Tools Workshop, Tahoe City, CA, July 28-Aug 1, 2019.


  1. Does your tool support PAPI SDEs yet? 13th Scalable Tools Workshop. Anthony Danalis, Heike Jagode, Jack Dongarra. Tahoe City, CA, July 28-Aug 1, 2019

  2. Case study: PaRSEC’s task scheduling algorithm. [Diagram: cores 0, 1, 2, …, N, each with a core-local queue, plus a shared global queue (overflow).]

  3. Case study: PaRSEC’s task scheduling algorithm. [Diagram: cores 0, 1, 2, …, N, each with a core-local queue, plus a shared global queue (overflow).] Thread-local queues => high locality. Overflow & work stealing => load balance.

  4. Parameter selection Q1: How long should the local queues be? Q2: Should a thread first steal from a close queue, any queue, or the shared queue?

  5. Parameter selection Q1: How long should the local queues be? A: 4*Core_Count Q2: Should a thread first steal from a close queue, any queue, or the shared queue? A: Any local queue (closest to farthest), then shared queue.
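The stealing order in the answer above (own queue, then other local queues from closest to farthest, then the shared queue) can be sketched as follows. This is a single-threaded illustration, not PaRSEC's implementation: `task_queue_t`, `push_task`, `select_task`, the ring-distance metric in `core_distance`, and the shared queue's capacity are all assumptions made for the example; only the local queue length of `4*Core_Count` comes from the slide.

```c
#include <assert.h>
#include <stdlib.h>

#define NUM_CORES 4
#define QUEUE_CAP (4 * NUM_CORES)   /* slide's answer to Q1: 4*Core_Count */

/* Hypothetical fixed-capacity LIFO of task ids (not PaRSEC's data structure). */
typedef struct {
    int tasks[QUEUE_CAP];
    int count;
} task_queue_t;

static task_queue_t local_queues[NUM_CORES];
static task_queue_t shared_queue;   /* same capacity here only for simplicity */

/* Push onto the core's local queue; spill into the shared queue on overflow.
 * Returns 0 on success, -1 if both queues are full. */
static int push_task(int core, int task) {
    task_queue_t *q = &local_queues[core];
    if (q->count == QUEUE_CAP) q = &shared_queue;
    if (q->count == QUEUE_CAP) return -1;
    q->tasks[q->count++] = task;
    return 0;
}

/* Pop the most recently pushed task; returns 1 on success, 0 if empty. */
static int pop_task(task_queue_t *q, int *task) {
    if (q->count == 0) return 0;
    *task = q->tasks[--q->count];
    return 1;
}

/* Assumed distance metric: ring distance between core ids. */
static int core_distance(int a, int b) {
    int d = abs(a - b);
    return d < NUM_CORES - d ? d : NUM_CORES - d;
}

/* Returns the source of the task: the core id it came from,
 * -1 for the shared queue, or -2 if no task was found anywhere. */
static int select_task(int core, int *task) {
    assert(core >= 0 && core < NUM_CORES);
    if (pop_task(&local_queues[core], task)) return core;
    /* Steal from other local queues, closest first. */
    for (int dist = 1; dist <= NUM_CORES / 2; dist++) {
        for (int victim = 0; victim < NUM_CORES; victim++) {
            if (victim != core && core_distance(core, victim) == dist &&
                pop_task(&local_queues[victim], task))
                return victim;
        }
    }
    /* Last resort: the shared overflow queue. */
    if (pop_task(&shared_queue, task)) return -1;
    return -2;
}
```

A real runtime would need synchronization on every queue; the sketch leaves that out to keep the ordering logic visible.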

  6. Testing Benchmark [Diagram: fork-join chains.] ● 20 independent fork-join chains x 20 (or 25) tasks per fork. ● Memory-bound kernel, with good cache locality. ● 20 cores on testing node.

  7. Execution time vs Local Queue Length

  8. Execution time vs Local Queue Length (zoom)

  9. Execution time vs Local Queue Length (zoom 2)

  10. Execution time vs Local Queue Length (zoom 3)

  11. Execution time vs Local Queue Length (zoom 4)

  12. Execution time vs Local Queue Length (zoom 5)

  13. Execution time vs Local Queue Length (combined)

  14. Failed Stealing Attempts

  15. L2 Cache Misses (L3 show same pattern)

  16. Successful Close Stealing

  17. Successful Close & Far Stealing

  18. Successful Shared Queue Stealing

  19. Successful Local + Shared Queue Stealing

  20. Unanswered questions Q: So, what causes the bump? Q: How did you measure all these things?

  21. Unanswered questions Q: So, what causes the bump? A: I don’t know! Q: How did you measure all these things?

  22. Unanswered questions Q: So, what causes the bump? A: I don’t know! Q: How did you measure all these things? A: I am glad you asked.

  23. What is missing from current infrastructure? Events that occurred inside the software stack. There is no standardized way for a software layer to export information about its behavior such that other, independently developed, software layers can read it. [Diagram labels: HPC Application, Quantum Chemistry Method, Distributed Factorization, Math library, Data Dependency, Task runtime, One Sided Communication, MPI, RDMA completion, Libibverbs.]

  24. PAPI Software Defined Events • De facto standard: SDEs from your library can be read using the standard PAPI_start()/PAPI_stop()/PAPI_read(). • Low overhead: Performance critical codes can implement SDEs with zero overhead by exporting existing code variables without adding any new instructions in the fast path. • Rich feature set: PAPI SDE supports counters, groups, recordings, simple statistics, thread safety, custom callbacks.
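The "zero overhead" point can be illustrated without PAPI at all. The sketch below is a library-free model of the idea, not PAPI's API: a library hands out the address of a counter it already maintains, and a reader computes a delta from a snapshot, so the library's fast path gains no instructions. Every name here (`exported_event_t`, `export_counter`, `find_event`, `start_event`, `read_event`, `steal_task`) is hypothetical, and the code is single-threaded.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_EVENTS 8

/* Hypothetical registry entry: an event name plus the address of a
 * variable that the library already maintains for its own purposes. */
typedef struct {
    const char *name;
    const long long *value;   /* read-only view of the library's variable */
    long long baseline;       /* snapshot taken when measurement starts */
} exported_event_t;

static exported_event_t registry[MAX_EVENTS];
static int registry_len;

/* Library side, called once at initialization. Returns an event id or -1. */
static int export_counter(const char *name, const long long *var) {
    if (registry_len == MAX_EVENTS) return -1;
    registry[registry_len].name = name;
    registry[registry_len].value = var;
    registry[registry_len].baseline = 0;
    return registry_len++;
}

/* Tool side: look an event up by name. Returns an event id or -1. */
static int find_event(const char *name) {
    for (int i = 0; i < registry_len; i++)
        if (strcmp(registry[i].name, name) == 0) return i;
    return -1;
}

/* Tool side: remember the current value so later reads report a delta. */
static void start_event(int id) {
    assert(id >= 0 && id < registry_len);
    registry[id].baseline = *registry[id].value;
}

/* Tool side: change in the exported variable since start_event(). */
static long long read_event(int id) {
    assert(id >= 0 && id < registry_len);
    return *registry[id].value - registry[id].baseline;
}

/* Example library: the counter update is existing bookkeeping, and the
 * fast path contains no call into the registry. */
static long long tasks_stolen;
static void steal_task(void) { tasks_stolen++; }
```

All of the measurement cost sits on the reader's side: the library pays once at registration, and each later read is a pointer dereference and a subtraction.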

  25. The tool infrastructure is already there

  26. The tool infrastructure is already there

  27. Simplest SDE code (library side)

          static long long local_var;

          void small_test_init( void ) {
              local_var = 0;
              papi_handle_t *handle = papi_sde_init("TEST");
              papi_sde_register_counter(handle, "Evnt",
                                        PAPI_SDE_RO|PAPI_SDE_DELTA,
                                        PAPI_SDE_long_long, &local_var);
              ...
          }

  28. SDE code for registering a callback function

          sometype_t *data;

          void small_test_init( void ) {
              data = ...
              papi_handle_t *handle = papi_sde_init("TEST");
              papi_sde_register_fp_counter(handle, "Evnt",
                                           PAPI_SDE_RO|PAPI_SDE_DELTA,
                                           PAPI_SDE_long_long,
                                           accessor, data);
              ...
          }

  29. SDE code for creating a counter (push mode)

          void *counter_handle;

          void small_test_init( void ) {
              papi_handle_t *handle = papi_sde_init("TEST");
              papi_sde_create_counter(handle, "Evnt",
                                      PAPI_SDE_long_long,
                                      &counter_handle);
              ...
          }

  30. SDE code for creating a recorder (push mode)

          void *recorder_handle;

          void small_test_init( void ) {
              papi_handle_t *handle = papi_sde_init("TEST");
              papi_sde_create_recorder(handle, "RCRDR",
                                       sizeof(double),
                                       cmpr_func_ptr,
                                       &recorder_handle);
              ...
          }

  31. SDE code for creating a recorder (push mode). Same code as slide 30, annotated with the resulting event name: sde:::TEST::RCRDR

  32. SDE code for creating a recorder (push mode). Same code as slide 30, annotated with the resulting event names: sde:::TEST::RCRDR, sde:::TEST::RCRDR:CNT

  33. SDE code for creating a recorder (push mode). Same code as slide 30, annotated with the resulting event names: sde:::TEST::RCRDR, sde:::TEST::RCRDR:CNT, sde:::TEST::RCRDR:MIN, sde:::TEST::RCRDR:Q1, sde:::TEST::RCRDR:MED, sde:::TEST::RCRDR:Q3, sde:::TEST::RCRDR:MAX
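The recorder slides list derived events with the suffixes CNT, MIN, Q1, MED, Q3 and MAX, and the recorder is created with a comparison function. The slides do not say how those statistics are computed, so the following is only a sketch of one plausible approach: sort a copy of the recorded values with a `qsort`-style comparator and pick order statistics by index. The names `summary_t`, `cmp_double`, and `summarize`, and the index rule `k*(n-1)/4` with integer division, are assumptions for illustration, not PAPI's implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical container for the six summary values named on the slides. */
typedef struct {
    long long cnt;
    double min, q1, med, q3, max;
} summary_t;

/* qsort-style comparator for doubles, analogous in role to the
 * comparison function pointer passed when the recorder is created. */
static int cmp_double(const void *a, const void *b) {
    double x = *(const double *)a;
    double y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Summarize n recorded values without modifying the input.
 * Returns 0 on success, -1 on empty input or allocation failure.
 * Quartiles use the assumed rule: element at index k*(n-1)/4. */
static int summarize(const double *values, size_t n, summary_t *out) {
    if (n == 0) return -1;
    double *sorted = malloc(n * sizeof *sorted);
    if (sorted == NULL) return -1;
    memcpy(sorted, values, n * sizeof *sorted);
    qsort(sorted, n, sizeof *sorted, cmp_double);

    out->cnt = (long long)n;
    out->min = sorted[0];
    out->q1  = sorted[(n - 1) / 4];
    out->med = sorted[(n - 1) / 2];
    out->q3  = sorted[3 * (n - 1) / 4];
    out->max = sorted[n - 1];

    free(sorted);
    return 0;
}
```

Taking the comparator as a parameter, as the recorder creation call does, would let the same logic handle element types other than `double`; the sketch hard-codes `double` to stay short.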

  34. SDE code for updating created counters/recorders

          void *counter_handle;
          void *recorder_handle;

          void push_test_dowork(void) {
              double val;
              long long increment = 3;

              val = perform_useful_work();
              papi_sde_inc_counter(counter_handle, increment);
              papi_sde_record(recorder_handle, sizeof(val), &val);
          }

  35. Performance overheads in simple benchmark

  36. Performance overhead in PaRSEC

  37. Performance overhead in HPCG

  38. Performance overhead in HPCG (zoom)

  39. Open Problem for our Community: How do we associate useful context information with SDEs? What meaningful information to associate with “TASKS_STOLEN”? – Code location – Hardware events (e.g. cache misses) – Patterns in history (e.g. last task before stealing event) – Patterns in call-path/stack/originating thread
