Investigate if jobs can enter monitoring while in submitting stage #2121

egede · 2023-02-27T10:08:26Z

When a master job is submitting, it can take a very long time to submit the subjobs for certain remote backends (e.g. several hours for around 3000 subjobs). At the moment, the subjobs are not monitored during this period, so any that have already finished go unnoticed, which is effectively dead time in the system. Monitoring during submission would have a further benefit: if a job submission is terminated because the Ganga process is killed, at least the already submitted subjobs would be recoverable. The current policy of failed submissions reverting the job to the new status should probably be changed to make this work.

egede · 2023-05-01T03:41:57Z

@abhijeetsharma200 See further information here

At the moment, the behaviour around submission and monitoring is the following:

On submission, a job is split into subjobs. Then, if keep_going is True, Ganga will attempt to submit all the subjobs, even if there are some failures along the way. The failed submissions will be left in the submitting state.
The overall state of a job is determined from the status of all subjobs. If a single subjob is in submitting, the complete job will be declared as submitting (see full status calculation).
Master jobs in submitting status are not monitored. The consequence is that monitoring will not start until all subjobs are submitted (which can take well over an hour), and if a single subjob submission fails, the job will never be monitored.
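
To make the consequence concrete, here is a minimal sketch of the aggregation rule described above. It is not Ganga's actual status calculation: the function names `master_status` and `is_monitored` are hypothetical, and only the "any subjob in submitting makes the master submitting" rule comes from this discussion. The ordering of the remaining states and the set of monitored states are placeholder assumptions.

```python
# Hypothetical helper names; this does not reproduce Ganga's real code.
MONITORED_STATES = {"submitted", "running"}  # assumption for illustration


def master_status(subjob_statuses):
    """Aggregate subjob statuses into one master status (simplified)."""
    statuses = list(subjob_statuses)
    if not statuses:
        return "new"
    # Rule taken from the discussion: one subjob in "submitting" is enough
    # for the whole job to be declared "submitting".
    if "submitting" in statuses:
        return "submitting"
    # Placeholder ordering for the remaining states (an assumption).
    for candidate in ("running", "submitted", "failed", "completed"):
        if candidate in statuses:
            return candidate
    return "new"


def is_monitored(status):
    """Masters in "submitting" are skipped by monitoring, as described."""
    return status in MONITORED_STATES


# 2999 finished subjobs and one stuck submission: the master is never monitored.
stuck = master_status(["completed"] * 2999 + ["submitting"])
print(stuck, is_monitored(stuck))
```

Under these assumptions, a single subjob left in submitting hides every other subjob from monitoring, which is the problem the proposed changes are meant to address.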

I think we want a few changes in behaviour.

Subjobs that fail to submit should be put into the failed state rather than left in submitting.
We should change the behaviour so that subjobs start to be monitored even while other subjobs are not yet submitted. This code seems to indicate that this is already the case, but I do not think it is. Some careful debugging might be required to understand what actually happens.
The submitting status is a transient status. So if the Ganga process has been killed, then on startup all subjobs in the submitting status should be changed to failed.
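
The startup rule in the last point could look roughly like the sketch below. The `Subjob` class and `recover_after_restart` function are hypothetical stand-ins introduced only for illustration; how Ganga actually loads and updates subjobs from its repository is not shown here.

```python
from dataclasses import dataclass


@dataclass
class Subjob:
    """Hypothetical stand-in for a subjob record; not Ganga's Job class."""
    id: int
    status: str = "new"


def recover_after_restart(subjobs):
    """Mark subjobs left in the transient "submitting" state as "failed".

    Returns the ids that were changed so the caller can log or resubmit them.
    """
    changed = []
    for sj in subjobs:
        if sj.status == "submitting":
            sj.status = "failed"
            changed.append(sj.id)
    return changed


# State as it might look after the process was killed mid-submission.
loaded = [Subjob(0, "completed"), Subjob(1, "submitted"),
          Subjob(2, "submitting"), Subjob(3, "new")]
print(recover_after_restart(loaded))
print([sj.status for sj in loaded])
```

Already submitted subjobs keep their status and remain recoverable, while the interrupted one becomes failed and no longer blocks the master job.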

egede · 2023-05-01T03:53:02Z

I think the first step will be to make a set of tests where you can get subjobs to fail on command and to submit very slowly, as a way of testing whether monitoring starts while submission is still in progress. The TestSubmitter is a dummy backend that can be used for this.
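
As a rough idea of what such a test could exercise, here is a self-contained sketch that does not use Ganga at all. `SlowFlakyBackend`, `FakeSubjob`, `submit_all` and `monitorable` are all hypothetical names; the real TestSubmitter interface is not reproduced here. The `after_each` hook stands in for a monitoring pass that runs between submissions, so the example stays deterministic instead of relying on threads.

```python
import time
from dataclasses import dataclass


@dataclass
class FakeSubjob:
    """Hypothetical subjob record used only in this sketch."""
    id: int
    status: str = "new"


@dataclass
class SlowFlakyBackend:
    """Hypothetical dummy backend: slow, and fails for chosen subjob ids."""
    delay: float = 0.0
    fail_ids: frozenset = frozenset()

    def submit(self, subjob):
        time.sleep(self.delay)  # simulate a slow remote submission
        if subjob.id in self.fail_ids:
            raise RuntimeError(f"subjob {subjob.id} failed on command")


def submit_all(subjobs, backend, keep_going=True, after_each=None):
    """Submit subjobs one by one, calling after_each between submissions."""
    for sj in subjobs:
        sj.status = "submitting"
        try:
            backend.submit(sj)
            sj.status = "submitted"
        except RuntimeError:
            sj.status = "failed"  # proposed behaviour: not left in "submitting"
            if not keep_going:
                break
        if after_each is not None:
            after_each(subjobs)  # stand-in for a monitoring pass


def monitorable(subjobs):
    """Ids a monitoring pass could already pick up."""
    return [sj.id for sj in subjobs if sj.status == "submitted"]


subjobs = [FakeSubjob(i) for i in range(4)]
seen = []
backend = SlowFlakyBackend(delay=0.01, fail_ids=frozenset({2}))
submit_all(subjobs, backend, after_each=lambda sjs: seen.append(monitorable(sjs)))
print(seen)
print([sj.status for sj in subjobs])
```

A real test would replace the hook with the actual monitoring loop and assert that submitted subjobs are picked up before the last submission returns.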

egede added the Core label Feb 27, 2023
