Contacting XD users
● 71 active TG- projects, after removing staff and training accounts
● We have started emailing each project, in reverse chronological order of project end date, about OSG and available user support
● Please try to determine whether a Freshdesk ticket is XSEDE-related and, if so, assign it to me
Large workload problems login01/stash/condor?
● Not many clues yet but…
● 15,000-job workflow fails because the output files of some jobs are missing
  ○ Jobs are marked successful in logs
  ○ Missing output files are not detected until Pegasus tries to run subsequent jobs
● Works on xd-login
● Potential causes
  ○ HTCondor? We are running 8.4.2 on xd-login and 8.2.10 on login01, but I have not had other issues like this with 8.2
  ○ Stash? I tried both stash and stash2, and both had issues. The workload is large, so maybe try a few more times with a synthetic workload?
  ○ Combination of HTCondor and Stash?
  ○ Thoughts? No news yet
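Since the missing outputs only surface when Pegasus starts the next job, one way to narrow this down might be to check for the files right after each job finishes. The following is a minimal sketch, not something we run today: `find_missing_outputs`, its parameters, and the idea of treating zero-byte files as missing are all assumptions made for illustration.

```python
from pathlib import Path


def find_missing_outputs(output_dir, expected_names, allow_empty=False):
    """Return the expected output names that are absent (or empty) in output_dir.

    Hypothetical helper: the directory layout and file names are assumptions,
    not taken from the actual workflow.
    """
    base = Path(output_dir)
    missing = []
    for name in expected_names:
        path = base / name
        if not path.is_file():
            # File was never written (or never made it to the shared filesystem).
            missing.append(name)
        elif not allow_empty and path.stat().st_size == 0:
            # File exists but is empty, which is treated as a failure by default.
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Example: report problems for one job's expected outputs.
    problems = find_missing_outputs(".", ["job_0001.out", "job_0001.dat"])
    for name in problems:
        print(f"missing or empty: {name}")
```

Run as a post-job check, this would at least tell us which jobs lost their output at the time it happened, rather than many jobs later.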
# Hold jobs using absurd amounts of disk (50+ GB) or using more memory than requested.
# Not all of our jobs have RequestMemory defined.
SYSTEM_PERIODIC_HOLD = \
    (JobUniverse == 5) && ( \
        (JobStatus == 1 || JobStatus == 2) && ( \
            (DiskUsage > 50000000) || \
            (ResidentSetSize > 1000*2000 && ifThenElse(isUndefined(RequestMemory), True, \
                ResidentSetSize > 1000*RequestMemory)) \
        ) \
    )

# Report why the stupid thing went on hold.
SYSTEM_PERIODIC_HOLD_REASON = \
    strcat("Job in status ", JobStatus, " put on hold by SYSTEM_PERIODIC_HOLD due to ", \
        ifThenElse(isUndefined(ResidentSetSize) == False && ResidentSetSize > 1000*2000 && \
                ifThenElse(isUndefined(RequestMemory), True, ResidentSetSize > 1000*RequestMemory), \
            strcat("memory usage ", ResidentSetSize), \
            strcat("disk usage ", DiskUsage)), ".")

# Forceful removal of running jobs after 9 days, held jobs after 7 days,
# and anything trying to run more than 10 times (except users with user-level checkpointing).
SYSTEM_PERIODIC_REMOVE = \
    (JobUniverse == 5) && ( \
        (JobStatus == 2 && CurrentTime - EnteredCurrentStatus > 3600*24*9) || \
        (JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*24*7) || \
        ((JobRunCount >= 10) && (Owner =!= "bxie") && (Owner =!= "strolog")) \
    )

# Record why the job was removed.
SYSTEM_PERIODIC_REMOVE_REASON = strcat("Job removed by SYSTEM_PERIODIC_REMOVE due to ", \
    ifThenElse(JobStatus == 2 && CurrentTime - EnteredCurrentStatus > 3600*24*9, \
        "runtime of longer than 9 days", \
        ifThenElse(JobStatus == 5 && CurrentTime - EnteredCurrentStatus > 3600*24*7, \
            "being in hold state for 7 days", \
            "more than 10 restarts") \
    ) )
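To sanity-check the policy without touching the schedd, the same rules can be mirrored in a few lines of Python and run against sample job ads. This is only a sketch under some assumptions: job ads are plain dicts, a missing attribute is treated as "comparison is false" (a simplification of ClassAd undefined semantics), and the function names and the `held_limit_days` parameter (defaulting to the 7 days stated in the comments) are made up for illustration.

```python
DAY = 3600 * 24
EXEMPT_OWNERS = {"bxie", "strolog"}  # users with user-level checkpointing


def should_hold(ad):
    """Mirror of SYSTEM_PERIODIC_HOLD for a job ad given as a dict."""
    if ad.get("JobUniverse") != 5:          # vanilla universe only
        return False
    if ad.get("JobStatus") not in (1, 2):   # idle or running
        return False
    disk_kb = ad.get("DiskUsage", 0)        # KiB
    rss_kb = ad.get("ResidentSetSize", 0)   # KiB
    request_mb = ad.get("RequestMemory")    # MiB, may be undefined
    over_disk = disk_kb > 50_000_000
    # Over 2 GB resident, and over the request when a request exists.
    over_mem = rss_kb > 1000 * 2000 and (
        request_mb is None or rss_kb > 1000 * request_mb
    )
    return over_disk or over_mem


def should_remove(ad, now, held_limit_days=7):
    """Mirror of SYSTEM_PERIODIC_REMOVE; returns the reason text or None."""
    if ad.get("JobUniverse") != 5:
        return None
    status = ad.get("JobStatus")
    age = now - ad.get("EnteredCurrentStatus", now)
    if status == 2 and age > 9 * DAY:
        return "runtime of longer than 9 days"
    if status == 5 and age > held_limit_days * DAY:
        return f"being in hold state for {held_limit_days} days"
    if ad.get("JobRunCount", 0) >= 10 and ad.get("Owner") not in EXEMPT_OWNERS:
        return "more than 10 restarts"
    return None


if __name__ == "__main__":
    sample = {"JobUniverse": 5, "JobStatus": 2, "ResidentSetSize": 3_000_000}
    print("hold:", should_hold(sample))
    print("remove:", should_remove(sample, now=0))
```

Feeding it a handful of hand-written ads (big disk, big memory with and without RequestMemory, long-held jobs, exempt owners) is a cheap way to confirm the thresholds do what the comments say before reconfiguring.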