

1. Time Stamp Synchronization for Event Traces of Large-Scale Message Passing Applications
   D. Becker and F. Wolf, Forschungszentrum Jülich, Central Institute for Applied Mathematics
   R. Rabenseifner, High Performance Computing Center Stuttgart, Department Parallel Computing

2. Outline
   - Introduction
   - Event model and replay-based parallel analysis
   - Controlled logical clock
   - Extended controlled logical clock
   - Timestamp synchronization
   - Conclusion
   - Future work

3. SCALASCA
   - Goal: diagnose wait states in MPI applications on large-scale systems
   - Scalability through parallel analysis of event traces
   - Workflow: execution on parallel machine -> local trace files -> parallel trace analyzer -> trace analysis report

4. Wait States in MPI Applications
   - (a) Late sender
   - (b) Late receiver
   - (c) Late sender / wrong order
   - (d) Wait at n-to-n
   - (Time-line figures for each pattern; event legend: ENTER, EXIT, SEND, RECV, COLLEXIT)

5. Non-Synchronized Clocks
   - Wait-state diagnosis measures temporal displacements between concurrent events
   - Problem: local processor clocks are often non-synchronized
     - Clocks may vary in offset and drift
   - Present approach: linear interpolation
     - Accounts for differences in offset and drift
     - Assumes that drift is not time dependent
   - Inaccuracies and changing drifts can still cause violations of the logical event ordering
     - A synchronization method is required for violations not already covered by linear interpolation (see the sketch below)
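A minimal C++ sketch of such a violation check; the Message struct and names are assumptions for illustration, not the SCALASCA API:

struct Message {
    double send_timestamp;   // timestamp recorded at the sender
    double recv_timestamp;   // timestamp recorded at the receiver
};

// A receive that appears earlier than its matching send plus the
// minimum network latency violates the logical event ordering and
// must be corrected by the timestamp synchronization.
bool violates_clock_condition(const Message& m, double min_latency) {
    return m.recv_timestamp < m.send_timestamp + min_latency;
}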

6. Idea
   - Requirement: realistic message passing codes
     - Different modes of communication (P2P & collective)
     - Large numbers of processes
   - Build on the controlled logical clock by Rolf Rabenseifner
     - Synchronization based on Lamport's logical clock
     - Only P2P communication
     - Sequential program
   - Approach
     - Extend the controlled logical clock to collective operations
     - Define a scalable correction algorithm through parallel replay

7. Event Model
   - An event includes at least a timestamp, a location, and an event type (see the record sketch below)
     - Additional information may be supplied depending on the event type
   - Event types refer to
     - Programming-model-independent events
     - MPI-related events
     - Events internal to the tracing library
   - Event sequences recorded for typical MPI operations (E = Enter, X = Exit, S = Send, R = Receive, CX = Collective Exit):
     - MPI_Send():      E S X
     - MPI_Recv():      E R X
     - MPI_Allreduce(): E CX
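As an illustration of this model (field names are assumptions, not the actual trace format), an event record could look like this in C++:

#include <cstdint>

// Hypothetical event record: every event carries a timestamp, a
// location, and an event type; further attributes depend on the type.
enum class EventType { Enter, Exit, Send, Recv, CollExit };

struct Event {
    double    timestamp;     // time of the event on the local clock
    uint32_t  location;      // process (or thread) that produced the event
    EventType type;          // Enter, Exit, Send, Recv, CollExit
    uint32_t  peer = 0;      // communication partner (Send/Recv only)
    uint32_t  tag  = 0;      // message tag            (Send/Recv only)
};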

8. Replay-Based Parallel Analysis
   - Parallel analysis scheme of the SCALASCA toolset
     - Originally developed to improve scalability on large-scale systems
   - Analyzes separate local trace files in parallel
     - Exploits distributed memory & processing capabilities
     - Keeps the whole trace in main memory
     - Only process-local information is visible to a process
   - Parallel replay of the target application's communication behavior (see the sketch below)
     - Parallel traversal of event streams
     - Analyzes communication with an operation of the same type
     - Exchanges required data at synchronization points of the target application
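A strongly simplified C++ sketch of the replay idea, assuming a hypothetical local event record: each analysis process traverses its own event stream and re-enacts every communication event with an MPI operation of the same type to obtain the remote data it needs (here only the send timestamp).

#include <mpi.h>
#include <vector>

// Hypothetical local event record (see the event-model sketch above).
struct TraceEvent { double timestamp; int type; int peer; };
const int TYPE_SEND = 0, TYPE_RECV = 1;

// Sketch of a parallel replay over the process-local trace.
void replay(const std::vector<TraceEvent>& local_trace, MPI_Comm comm) {
    for (const TraceEvent& e : local_trace) {
        if (e.type == TYPE_SEND) {
            double ts = e.timestamp;   // exchange the send timestamp
            MPI_Send(&ts, 1, MPI_DOUBLE, e.peer, 0, comm);
        } else if (e.type == TYPE_RECV) {
            double send_ts = 0.0;
            MPI_Recv(&send_ts, 1, MPI_DOUBLE, e.peer, 0, comm,
                     MPI_STATUS_IGNORE);
            // ... use send_ts together with e.timestamp, e.g. to
            //     detect a late sender or a clock-condition violation
        }
        // Collective events would be re-enacted with a collective of
        // the same type (see the wait-at-NxN example below).
    }
}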

9. Example: Wait at N x N
   - (Time-line figure: three processes enter the N-to-N operation at different times)
   - Waiting time due to inherent synchronization in N-to-N operations (e.g., MPI_Allreduce)
   - Algorithm (a sketch follows below):
     - Triggered by the collective exit event
     - Determine enter events
     - Determine & distribute the latest enter event (max-reduction)
     - Calculate & store the waiting time
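A minimal sketch of this algorithm during the parallel replay, assuming each analysis process knows its own enter timestamp for the collective; the latest enter timestamp is distributed by a max-reduction and the local waiting time is the difference:

#include <mpi.h>
#include <algorithm>

// Sketch: local waiting time of an N-to-N operation (e.g. MPI_Allreduce).
// 'my_enter' is this process's enter timestamp of the analyzed collective.
double wait_at_nxn(double my_enter, MPI_Comm comm) {
    double latest_enter = 0.0;
    // Re-enact the collective: the max-reduction distributes the
    // timestamp of the latest enter event to all participants.
    MPI_Allreduce(&my_enter, &latest_enter, 1, MPI_DOUBLE, MPI_MAX, comm);
    // Waiting time caused by the last process entering the operation.
    return std::max(0.0, latest_enter - my_enter);
}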

10. Controlled Logical Clock
   - Guarantees Lamport's clock condition
     - Uses happened-before relations to synchronize timestamps
     - A send event is always earlier than the corresponding receive event
   - Scans the event trace for clock condition violations and modifies inexact timestamps
   - Stretches the process-local time axis in the immediate vicinity of the affected event
     - Forward amortization: preserves the length of intervals between local events
     - Backward amortization: smoothes the discontinuity at the affected event

11. Forward Amortization
   - Figure: inconsistent event stream (the receive event R on p1 precedes the corresponding send event S on p0)
   - Figure: corrected and forward amortized event stream (the receive and subsequent events on p1 are shifted to respect the minimum latency)

12. Backward Amortization
   - Figure: forward amortized event stream (jump of Δt at the corrected receive event on p1)
   - Figure: forward and backward amortized event stream (preceding events on p1 are shifted to smooth the jump)

13. Extended Controlled Logical Clock
   - Consider a single collective operation as a composition of many point-to-point communications
   - Distinguish between different types
     - 1-to-N
     - N-to-1
     - N-to-N
   - Determine send and receive events for each type
   - Define happened-before relations based on the decomposition of collective operations

14. Decomposition of Collective Operations
   - 1xN: the root sends data to N processes
   - Nx1: N processes send data to the root
   - NxN: N processes send data to N processes (a mapping sketch follows below)
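A small C++ sketch of how this decomposition could be encoded: which recorded events act as logical sends and receives per process. This mapping is inferred from the slides and should be read as an assumption, not a verbatim definition.

// Sketch: roles of the locally recorded events of a collective
// operation under the decomposition above (inferred mapping).
enum class CollType { OneToN, NToOne, NToN };

struct Roles {
    bool enter_is_send;   // does this process's enter event act as a send?
    bool exit_is_recv;    // does this process's collective exit act as a receive?
};

Roles decompose(CollType type, bool is_root) {
    switch (type) {
    case CollType::OneToN: return { is_root,  !is_root };  // root sends, others receive
    case CollType::NToOne: return { !is_root, is_root  };  // all send, root receives
    case CollType::NToN:   return { true,     true     };  // everyone sends and receives
    }
    return { false, false };
}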

15. Happened-Before Relation
   - Synchronization needs one send event timestamp
   - An operation may have multiple send and receive events
   - Multiple receives are used to synchronize multiple clocks
   - The latest send event is the relevant send event
   - Example: N-to-1 (figure: the N processes send to the root)

16. Forward Amortization
   - The new timestamp LC' is the maximum of (a sketch follows below):
     - max(send event timestamp + minimum latency)
     - the event's own timestamp
     - previous event timestamp + minimum event spacing
     - previous event timestamp + controlled event spacing
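A C++ sketch of this maximum; the control factor gamma for the controlled event spacing and the other parameter names are assumptions made explicit here for illustration:

#include <algorithm>
#include <vector>

// Sketch of the forward-amortization rule: the corrected timestamp LC'
// of an event is the maximum of
//   * each relevant send timestamp + minimum message latency,
//   * the event's own (original) timestamp,
//   * the previous corrected timestamp + minimum event spacing,
//   * the previous corrected timestamp + controlled event spacing
//     (a fraction gamma of the original local interval).
double forward_amortized(double original_ts,
                         double prev_original_ts,
                         double prev_corrected_ts,
                         const std::vector<double>& send_timestamps,
                         double min_latency,
                         double min_event_spacing,
                         double gamma) {  // control factor, 0 < gamma <= 1
    double lc = original_ts;
    for (double s : send_timestamps)
        lc = std::max(lc, s + min_latency);
    lc = std::max(lc, prev_corrected_ts + min_event_spacing);
    lc = std::max(lc, prev_corrected_ts +
                      gamma * (original_ts - prev_original_ts));
    return lc;
}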

17. Controller
   - Approximates the original communication after a clock condition violation
   - Limits the synchronization error
   - Bounds propagation during forward amortization
   - Requires a global view of the trace data

18. Backward Amortization
   - Figure: jump discontinuity at LC_i' due to a clock condition violation; LC_i^b := LC_i' without the jump
   - Earlier events within the amortization interval are advanced toward the corrected value; for send events the ideal correction is bounded by min(LC_k' of the corresponding receive events) - µ
   - Shown: results of the extended controlled logical clock with jump discontinuities, linear interpolation with backward amortization, and piecewise linear interpolation with backward amortization (a sketch follows below)
   - Amortization interval = accuracy
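A simplified C++ sketch of the piecewise backward amortization suggested by the figure, under the assumption that the amortization interval and the jump delta are already known; the additional bound for send events (min(LC_k' of the corresponding receive events) - µ) is omitted here:

#include <vector>

// Sketch: distribute the jump 'delta' introduced at the corrected event
// over the local events inside the amortization interval
// (interval_begin, corrected_event_ts), so that the discontinuity is
// smoothed while the local event order is preserved.
void backward_amortize(std::vector<double>& timestamps,   // sorted, local
                       double interval_begin,
                       double corrected_event_ts,
                       double delta) {
    const double len = corrected_event_ts - interval_begin;
    if (len <= 0.0) return;
    for (double& t : timestamps) {
        if (t > interval_begin && t < corrected_event_ts) {
            // Events closer to the corrected event are advanced more.
            t += delta * (t - interval_begin) / len;
        }
    }
}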

19. Timestamp Synchronization
   - Event tracing of applications running on thousands of processes requires a scalable synchronization scheme
   - The proposed algorithm depends on the accuracy of the original timestamps
   - Two-step synchronization scheme
     - Pre-synchronization: linear interpolation
     - Parallel post-mortem timestamp synchronization: extended controlled logical clock

20. Pre-Synchronization
   - Accounts for differences in offset and drift
   - Assumes that drift is not time dependent
   - Offset measurements at program initialization and finalization
     - Between an arbitrarily chosen master and the worker processes
   - Linear interpolation between these two points (see the sketch below)
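A minimal C++/MPI sketch of such a scheme, assuming a simple ping-pong offset measurement between the master (rank 0) and one worker with symmetric message latency; the offsets measured at initialization and finalization are then interpolated linearly:

#include <mpi.h>

// Sketch: estimate the worker's clock offset relative to the master.
double measure_offset(int worker, MPI_Comm comm) {
    int rank;
    MPI_Comm_rank(comm, &rank);
    double offset = 0.0;
    if (rank == 0) {
        double t_send = MPI_Wtime(), t_worker = 0.0;
        MPI_Send(&t_send, 1, MPI_DOUBLE, worker, 0, comm);
        MPI_Recv(&t_worker, 1, MPI_DOUBLE, worker, 0, comm, MPI_STATUS_IGNORE);
        double t_recv = MPI_Wtime();
        // Worker clock minus master clock at the midpoint of the exchange.
        offset = t_worker - 0.5 * (t_send + t_recv);
    } else if (rank == worker) {
        double t_master = 0.0;
        MPI_Recv(&t_master, 1, MPI_DOUBLE, 0, 0, comm, MPI_STATUS_IGNORE);
        double t_local = MPI_Wtime();
        MPI_Send(&t_local, 1, MPI_DOUBLE, 0, 0, comm);
    }
    return offset;  // only meaningful on the master
}

// Linear interpolation between the offsets measured at initialization
// (t0, offset0) and finalization (t1, offset1), applied to a local
// timestamp t recorded on the worker.
double pre_synchronize(double t, double t0, double offset0,
                       double t1, double offset1) {
    const double drift = (offset1 - offset0) / (t1 - t0);
    return t - (offset0 + drift * (t - t0));
}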

21. Parallel Timestamp Synchronization
   - Extended controlled logical clock
   - Parallel traversal of the event stream
     - Forward amortization
     - Backward amortization
   - Exchange required timestamps at synchronization points
   - Perform clock correction
   - Apply the control mechanism after replaying the communication (see the outline below)
     - Global view of the trace data
     - Multiple passes until the error is below a predefined threshold
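A structural C++ outline of these passes; the three steps are passed in as callables because this only illustrates the iteration, not the actual implementation:

#include <functional>

// Outline: forward and backward replay passes are repeated until the
// control mechanism reports an error below the predefined threshold.
void synchronize_timestamps(const std::function<void()>& forward_pass,
                            const std::function<void()>& backward_pass,
                            const std::function<double()>& controller,
                            double error_threshold) {
    double error = 0.0;
    do {
        forward_pass();        // forward traversal of the event stream
        backward_pass();       // backward traversal of the event stream
        error = controller();  // control mechanism with a global view
    } while (error > error_threshold);
}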

22. Forward Amortization
   - Timestamps exchanged depending on the type of operation:

     Type of operation | Timestamp exchanged             | MPI function
     P2P               | timestamp of send event         | MPI_Send
     1-to-N            | timestamp of root enter event   | MPI_Bcast
     N-to-1            | max(all enter event timestamps) | MPI_Reduce
     N-to-N            | max(all enter event timestamps) | MPI_Allreduce
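An illustrative C++/MPI sketch of how the exchanged value could be obtained per operation type during the forward-amortization replay; parameter names are assumptions, and the P2P case only indicates the value the sender would transfer with MPI_Send:

#include <mpi.h>

enum class OpType { P2P, OneToN, NToOne, NToN };

// Sketch: value exchanged during the forward amortization, following
// the table above. 'my_enter' is the local enter timestamp of the
// analyzed operation, 'send_ts' the timestamp of a P2P send event.
double forward_exchange(OpType type, double send_ts, double my_enter,
                        int root, MPI_Comm comm) {
    double result = 0.0;
    switch (type) {
    case OpType::P2P:      // timestamp of the send event
        result = send_ts;  // transferred to the receiver via MPI_Send
        break;
    case OpType::OneToN:   // timestamp of the root's enter event
        result = my_enter;
        MPI_Bcast(&result, 1, MPI_DOUBLE, root, comm);
        break;
    case OpType::NToOne:   // max(all enter event timestamps), needed at the root
        MPI_Reduce(&my_enter, &result, 1, MPI_DOUBLE, MPI_MAX, root, comm);
        break;
    case OpType::NToN:     // max(all enter event timestamps), needed everywhere
        MPI_Allreduce(&my_enter, &result, 1, MPI_DOUBLE, MPI_MAX, comm);
        break;
    }
    return result;
}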

23. Backward Amortization
   - Timestamps exchanged depending on the type of operation:

     Type of operation | Timestamp exchanged                       | MPI function
     P2P               | timestamp of receive event                | MPI_Send
     1-to-N            | min(all collective exit event timestamps) | MPI_Reduce
     N-to-1            | timestamp of root collective exit event   | MPI_Bcast
     N-to-N            | min(all collective exit event timestamps) | MPI_Allreduce

24. Conclusion
   - The extended controlled logical clock algorithm takes collective communication semantics into account
     - Defined collective send and receive operations
     - Defined collective happened-before relations
   - A parallel implementation design was presented using SCALASCA's parallel replay approach
     - Exploits distributed memory & processing capabilities

25. Future Work
   - Finish the actual implementation
   - Evaluate the algorithm using real message passing codes
   - Extend the algorithm to shared memory programming models
   - Extend the algorithm to one-sided communication

26. Thank you…
   For more information, visit our project home page: http://www.scalasca.org
