
Job Not Found Error #93

Open

davramov opened this issue Dec 13, 2024 · 3 comments

@davramov
I am encountering an issue where submit_job() raises a Job Not Found error even though the job ID is generated and the job is successfully scheduled and runs on Perlmutter. This complicates an automated workflow I am developing, where we need to wait for the results of one task before starting the next.

From my script job_controller.py:

try:
    logger.info("Submitting reconstruction job script to Perlmutter.")
    job = self.client.perlmutter.submit_job(job_script)
except Exception as e:
    logger.error(f"Failed to submit or complete reconstruction job: {e}")

Error log from the exception:

13:11:27.498 | INFO    | orchestration.flows.bl832.job_controller - Submitting reconstruction job script to Perlmutter.
13:12:03.894 | ERROR   | orchestration.flows.bl832.job_controller - Failed to submit or complete reconstruction job: Job not found: 33821565

It seems like this could arise from one of the SfApiError exceptions raised by the submit_job() function defined in sfapi_client/_sync/compute.py.
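For context, this is a minimal sketch of the kind of client-side workaround referenced in the commit below: retry a job-status check when the API transiently reports "Job not found" shortly after submission. The SfApiError import path, the exact error message match, and the fetch_status callable are assumptions on my part, not part of sfapi_client's documented behavior.

import time
import logging

from sfapi_client.exceptions import SfApiError  # assumed import path

logger = logging.getLogger(__name__)


def wait_for_job(fetch_status, attempts=10, delay=30):
    """Call fetch_status() until it succeeds, tolerating transient
    'Job not found' errors that can occur shortly after submission."""
    for attempt in range(1, attempts + 1):
        try:
            return fetch_status()
        except SfApiError as e:
            # Only retry on the transient "Job not found" case; re-raise
            # anything else, and give up after the last attempt.
            if "Job not found" not in str(e) or attempt == attempts:
                raise
            logger.warning(
                "Job not visible yet (attempt %d/%d): %s", attempt, attempts, e
            )
            time.sleep(delay)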

davramov added a commit to davramov/splash_flows_globus that referenced this issue Dec 13, 2024
…it for the job to complete before moving onto the next step. An OK workaround for this: NERSC/sfapi_client#93
@cjh1
Collaborator

cjh1 commented Dec 16, 2024

@davramov Sorry that you are running into this issue. I believe this is related to a change made to the underlying SF API, to do with caching of job status. While the issues related to this are being resolved, I will cut a release that passes the appropriate parameter when fetching the job status to ask SLURM directly; this should resolve the problem for you.

@cjh1 cjh1 self-assigned this Dec 16, 2024
@cjh1
Collaborator

cjh1 commented Dec 16, 2024

@davramov 0.3.2 now contains the fix.
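For anyone landing here later, a minimal submit-and-wait sketch against sfapi_client >= 0.3.2, following the patterns in the sfapi_client documentation. The job script contents are a placeholder, and Client() assumes API credentials are already configured for your NERSC account.

from sfapi_client import Client
from sfapi_client.compute import Machine

# Placeholder batch script; adjust the SBATCH directives for your workload.
job_script = """#!/bin/bash
#SBATCH -q debug
#SBATCH -C cpu
#SBATCH -t 00:05:00
srun hostname
"""

with Client() as client:
    perlmutter = client.compute(Machine.perlmutter)
    job = perlmutter.submit_job(job_script)
    job.complete()  # block until the job reaches a terminal state
    print(job.state)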

@davramov
Author

@cjh1 Thanks Chris, I appreciate it. Things seem to be working on my end now.
