Scalable Observation System (SOS) for Scientific Workflows
Project Overview & Discussion
Chad D. Wood
Supervisors: Prof. Allen Malony and Kevin Huck
“So, where is this talk going?”
To advocate and demonstrate a run-time system designed to enable the characterization and analysis of complex scientific workflow performance at scale.
So, What is the Problem?
- It is reasonable to want to see “information” during application execution
- Information could come from the application as well as from the environment in which the application is executing
  - Application: Performance, problem-specific data and metadata, ...
  - Environment: System state, resource usage, runtime properties, ...
- Multiple application components may be running together as a workflow, and higher-level workflow behavior might be interesting
Scientific Workflows
[Diagram: workflow components plotted against compute time. A: Parallel; B: Serial; C1: Irregular; C2: Serial; C1 + C2: Parallel; VIZ; DATA. Legend: unit of work, result]
- Multiple components with data flow
- Complex interactions with dynamic behavior
- Components (or entire flows) may be parallelized differently
- Offline episodic performance analysis has limited benefits
What are the Requirements?
- Scalable
- Portable
- Easy to use
- Multi-purpose
- Multiple information sources
- Operates at the time of application (workflow) execution
- Supports in situ access
- Low overhead and low intrusion
- Ability to allocate additional resources to control overhead
Design Approach
- Base on a model of a “global” information space
- Utilize database technology
- Utilize MPI high-performance communication
- Build on launch support in scheduler
- Allow for additional (dedicated) resource allocation
- Flexible publishing interface
- SOS architecture
SOSflow: HPC Allocations
[Diagram, built up over nine slides: an HPC allocation with nodes labeled Db, Adb, Ai, PE, and SOS]
- SOSflow forms a functional overlay
- In situ daemon with its local database
- SOS lives side by side with your tasks
- Dedicated nodes for aggregate databases
- Dedicated nodes for analytics processing
- Co-located analytics query modules
- Independent ranks of analytics engines
- Analytics modules form independent communication channels
- SOSflow data is continuous and asynchronous
SOSflow: Data Structure
[Diagram: publication handle data structure. Recovered labels: PUBLICATION HANDLE; Source; Metadata; Scope; Layer; Nature; Retain; Buffer; Value; Frequency; State; Class; Type; Semantic; Pattern; Compare; Mood; Relationship; Hints; TIME_STAMP; Time_Start; Time_Stop; Time_Span; Sample; Counter; Log. ENUM values shown: Create_Input, CREATE_OUTPUT, Create_Viz, Exec_Work, Support_Exec, Support_Flow, Control_Flow. Annotation: "About both Pub Handle and Source"]
SOSflow: Data Structure
Every value is conserved, with its full history and evolving metadata
[Diagram: a PUB. HANDLE holding <definitions> and <val_snaps>. Each value snapshot carries val=___, mood, semantic, and three timestamps: time.pack (stored by client), time.send (pushed to daemon), time.recv (injected into db)]
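The snapshot timestamps on this slide can be illustrated with a minimal Python sketch. The `ValSnap` class, its field names, and the sample numbers are hypothetical and chosen only for illustration; this is not the SOSflow data structure itself. It shows how keeping pack, send, and recv times on every snapshot would let a reader compute how long a value waited at each stage.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ValSnap:
    """One snapshot of a published value (hypothetical layout, not the SOS struct)."""
    name: str
    value: float
    time_pack: float  # when the client stored the value locally
    time_send: float  # when the client pushed it to the daemon
    time_recv: float  # when the daemon injected it into its database

    def client_hold(self) -> float:
        """Seconds the value waited in the client before being sent."""
        return self.time_send - self.time_pack

    def end_to_end(self) -> float:
        """Seconds from pack to database insertion."""
        return self.time_recv - self.time_pack


# Every update appends a new snapshot, so the full history is retained.
# The timestamps below are made-up sample values.
history = [
    ValSnap("iteration", 1.0, time_pack=10.0, time_send=10.5, time_recv=10.75),
    ValSnap("iteration", 2.0, time_pack=11.0, time_send=11.5, time_recv=11.625),
]
latencies = [snap.end_to_end() for snap in history]
```

Appending rather than overwriting is what keeps the history intact; the per-snapshot differences are the kind of quantity the evaluation slides ask about when they raise the latency cost of being asynchronous.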
SOSflow: Easy to use API
...
SOSflow: In Situ Socket Communication Protocol
[Diagram: message sequence between a Source (SOS client) and sosd. Messages shown, in order: INIT; GUID_BLOCK; ANNOUNCE (metadata, defs. / structure); PUBLISH; VAL_SNAPS (all pack()’ed values) ... VAL_SNAPS]
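One way to read the message sequence on this slide is as a small state machine. The sketch below is an assumption-laden illustration in Python: the message names come from the slide, but the transition table (including treating VAL_SNAPS as messages that follow a PUBLISH, and allowing repeated publishes) is a guess at the intended ordering, not the documented SOSflow protocol.

```python
# Allowed next messages after each message; an assumed reading of the slide.
TRANSITIONS = {
    "START": {"INIT"},
    "INIT": {"GUID_BLOCK"},
    "GUID_BLOCK": {"ANNOUNCE"},
    "ANNOUNCE": {"PUBLISH"},
    "PUBLISH": {"VAL_SNAPS"},
    "VAL_SNAPS": {"VAL_SNAPS", "PUBLISH"},
}


def is_valid_session(messages):
    """Return True if the message names follow the assumed ordering."""
    state = "START"
    for msg in messages:
        if msg not in TRANSITIONS.get(state, set()):
            return False
        state = msg
    return True


ok = is_valid_session(
    ["INIT", "GUID_BLOCK", "ANNOUNCE", "PUBLISH", "VAL_SNAPS", "VAL_SNAPS"]
)
bad = is_valid_session(["INIT", "PUBLISH"])  # publishes before announcing
```

Under this reading, definitions and structure are announced once, and later traffic carries only value snapshots.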
SOSflow: Distributed Asynchronous Runtime (Simple)
[Diagram: a Source (SOS client) feeds a local sosd with its own DB; the sosd instances feed an AGGREGATE DB]
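The local-to-aggregate flow in this diagram can be sketched with Python's standard `sqlite3` module. The table schema, the `publish` and `sync` helpers, and the node names are hypothetical and exist only for illustration; the slides do not specify which database engine or schema SOSflow uses.

```python
import sqlite3

# Hypothetical schema: one row per published value snapshot.
SCHEMA = "CREATE TABLE vals (node TEXT, name TEXT, value REAL, t REAL)"


def make_db():
    """Create an in-memory database standing in for a daemon's store."""
    db = sqlite3.connect(":memory:")
    db.execute(SCHEMA)
    return db


def publish(local_db, node, name, value, t):
    """Record a value in the on-node database."""
    local_db.execute("INSERT INTO vals VALUES (?, ?, ?, ?)", (node, name, value, t))


def sync(local_db, aggregate_db, since):
    """Copy rows newer than `since` from a local database to the aggregate."""
    rows = local_db.execute(
        "SELECT node, name, value, t FROM vals WHERE t > ?", (since,)
    ).fetchall()
    aggregate_db.executemany("INSERT INTO vals VALUES (?, ?, ?, ?)", rows)
    return len(rows)


local_a, local_b, aggregate = make_db(), make_db(), make_db()
publish(local_a, "node0", "iter", 1.0, 0.5)
publish(local_a, "node0", "iter", 2.0, 1.5)
publish(local_b, "node1", "iter", 1.0, 0.7)

copied = sync(local_a, aggregate, since=0.0) + sync(local_b, aggregate, since=0.0)
total = aggregate.execute("SELECT COUNT(*) FROM vals").fetchone()[0]
```

Because each node writes only to its own store and the copy to the aggregate happens separately, the application never waits on the aggregate database.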
SOSflow: Distributed Asynchronous Runtime
[Diagram: client applications (Source + SOS) on each node talk to a local sosd over a socket transport; each sosd keeps a local (on-node) DB that supports local queries (local_sync); sosd instances forward data over a db transport (cloud_sync) to an analytics node running sosd, sosa, and a helper DB, labeled "massive database of doom"]
SOSflow: Where it Runs
- NERSC
  - Cori
  - Edison
- LLNL
  - CAB
  - Catalyst
- University of Oregon
  - ACISS
- Software:
  - OpenMPI
  - MPICH
  - Slurm
  - PBS
SOSflow: Evaluation
- Experimental Setup
  - Explore performance of work-in-progress implementation
  - Synthetic and real-world cases
  - What is the latency cost of being async?
- Synthetic Sweep of Parameters
  - Iterations: 2 to 10, steps of 2
  - Size: 100 to 500 unique values per pub, steps of 100
  - Delay: 0.5 to 1.0 second, each 0.1 second
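The synthetic sweep above can be enumerated directly. This Python sketch assumes the three ranges are inclusive at both ends and that every combination is run, which the slide implies but does not state outright; the variable names are illustrative.

```python
from itertools import product

iterations = list(range(2, 11, 2))     # 2, 4, 6, 8, 10
sizes = list(range(100, 501, 100))     # 100, 200, 300, 400, 500 values per pub
# Build delays from integers to avoid accumulating floating-point error.
delays = [round(0.5 + 0.1 * i, 1) for i in range(6)]  # 0.5 through 1.0 seconds

# Full grid of (iterations, size, delay) configurations.
sweep = list(product(iterations, sizes, delays))
```

Under these assumptions the grid has 5 × 5 × 6 = 150 configurations.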
[Figure: panels from 2 iterations to 10 iterations; axis ticks 100 to 500; SOS_publish() frequency shown as transparency, 0.5 sec to 1.0 sec (darkest)]
[Figure: panels from 2 iterations to 10 iterations; axis ticks 100 to 500; translucency represents SOS_publish() frequency, 0.5 sec to 1.0 sec (darkest)]
SOSflow: Evaluation
- Real-World Scenario
  - TAU-instrumented LULESH on Cori
  - TAU reports results to SOSflow on a timer
  - LULESH calls SOSflow API directly at each iteration
  - SOSflow gathers metrics from the OS
[Video]
216 Processes
343 Processes
512 Processes
Future Work
- Performance improvements
- Integrate more automatic data gathering for node-level metrics
- Support for deep analytics
- Testing with additional real-world workflows and operating environments