
NERSC SFAPI Reconstruction Flow #44

Open · davramov wants to merge 17 commits into main

Conversation

@davramov (Contributor) commented Dec 4, 2024

Setting this up as a draft PR for now while I iron out implementation details to match the ALCF workflow.

These two steps are currently working via sfapi and the container Raja built (registry.nersc.gov/als/tomorecon_nersc_mpi_hdf5):

  • Reconstruction
  • Tiff to Zarr

There are still at least a few things that need to be addressed before it is production-ready:

  • Pass reconstruction.py the same file_path that new_832_file_flow and alcf_recon_flow expect (it still uses a legacy version of reconstruction.py that expects an input.txt file)
  • File transfers back to data832
  • Schedule pruning
  • Pytest cases

@dylanmcreynolds (Contributor) left a comment

Nice work.

I added a number of review comments. This is a really nice start, and makes the nersc-specific code easy to see and follow. If my comments were a little picky, it's because the code was easy to follow.

tomography_hpc.py should be split into separate modules, something like tomography_hpc.py, nersc.py, and alcf.py, so that no more code than necessary is imported (e.g. when running an ALCF job, you don't need to import NerscClient).

Also, what about job_controller.py rather than tomography_hpc.py? Since this is under flows/bl832, tomography is implied. And we might someday write job controllers for local (non-HPC) jobs.
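
One way the suggested split could look is a small abstract controller that nersc.py and alcf.py both implement. A minimal sketch in Python, assuming the reconstruct() and build_multi_resolution() method names used later in this thread; everything else is illustrative:

# Sketch of a possible job_controller.py: a facility-agnostic interface
# that concrete NERSC/ALCF controllers implement.
from abc import ABC, abstractmethod


class TomographyHPCController(ABC):
    """Facility-agnostic interface for running tomography jobs on an HPC system."""

    @abstractmethod
    def reconstruct(self, file_path: str) -> bool:
        """Run the tomographic reconstruction for the given raw data file."""
        ...

    @abstractmethod
    def build_multi_resolution(self, file_path: str) -> bool:
        """Convert reconstructed TIFFs into a multi-resolution Zarr store."""
        ...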

orchestration/flows/bl832/tomography_hpc.py — inline review comments (outdated, resolved)
…ig.py, renamed tomography_hpc.py to job_controller.py. Added back nersc.py, which now just uses job_controller (need to add back in the other data transfer/pruning logic, etc). Inject NERSCClient into the nersc controller init. Removed references to export controls. Fixed typing for Config832 references (and pass into the init). Turned create_nersc_client() into a static method in the class. Moved load_dotenv to the top of the module. Consolidated reconstruct() and build_multi_resolution() in the nersc implementation.
@dylanmcreynolds (Contributor) left a comment

This

…c.py. Updated get_controller() in job_controller.py to take in an HPC enum that stores the corresponding Controller implementation for that facility.
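
A hedged sketch of the get_controller() factory described in that commit; the enum members, import paths, and constructor arguments are assumptions based on the surrounding discussion, not the actual code in this PR:

# Sketch only: dispatch from an HPC enum to a concrete controller implementation.
from enum import Enum

from orchestration.flows.bl832.alcf import ALCFTomographyHPCController    # assumed module
from orchestration.flows.bl832.nersc import NERSCTomographyHPCController  # assumed module


class HPC(Enum):
    NERSC = "nersc"
    ALCF = "alcf"


def get_controller(hpc: HPC, client, config):
    """Return the controller implementation for the requested facility."""
    if hpc is HPC.NERSC:
        return NERSCTomographyHPCController(client=client, config=config)
    if hpc is HPC.ALCF:
        return ALCFTomographyHPCController(config=config)
    raise ValueError(f"Unsupported HPC facility: {hpc}")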
The inline comment below was left on this part of the job submission script:

password = os.getenv("NERSC_PASSWORD")

job_script = f"""#!/bin/bash
#SBATCH -q preempt
@davramov (Contributor, Author) commented Dec 11, 2024

Several queue things I have figured out:

  • debug has no minimum time requirement, but using a different queue such as preempt requires allocating a 2-hour minimum for a job (which I think is overkill in our case, since recon should only take 5-10 min); see the job-header sketch after the timing numbers below.
  • Based on testing debug vs preempt, it seems like the debug jobs get picked up much faster (perhaps due to the smaller time constraint, as they are both medium priority).
  • There is also the interactive queue, which has high priority, but only 4 nodes x 2 jobs can be active simultaneously. We would have to modify the job submission script to use salloc.

Just to add some numbers to this after some testing today:

  • jobs in the debug queue have been consistently picked up after about 10 minutes
  • 2 jobs I scheduled this morning in the preempt queue have yet to be picked up (approaching 4 hours...)
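
For illustration, a debug-queue job-script header might look like the sketch below; the account name, walltime, node count, and constraint are assumptions, not the values used in this PR:

# Illustrative only: a job_script header targeting the debug queue instead of preempt.
# The account ("als"), walltime, node count, and constraint are assumptions.
job_script = """#!/bin/bash
#SBATCH -q debug
#SBATCH -A als
#SBATCH -N 1
#SBATCH -C gpu
#SBATCH -t 00:15:00
"""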

@davramov (Author) commented:

Bjoern let us know that we (under the 'als' project) have access to the realtime queue at NERSC, which should reduce latency. I had an issue using that queue with the following error:
Error during job submission or completion: sbatch: error: Batch job submission failed: Invalid qos specification

I opened a ticket and we are still waiting to hear from the help desk about the correct job parameters to use the queue:
https://nersc.servicenowservices.com/sp?sys_id=03d747c187a216101c6cea470cbb35e9&view=sp&id=ticket&table=incident

…ath and run respective tasks. Revert to debug queue (at least for now) since the queue time is orders of magnitude shorter than the preempt queue (want to look into the interactive queue). Moved NERSCTomographyHPCController back into job_controller.py due to circular imports. Fixed Enum implementation.
…ows/bl832/nersc.py. Removed the ALCF controller from job_controller, in anticipation of that being defined in alcf.py when I refactor that code.
@dylanmcreynolds (Contributor) commented:

Now that you have a nice clean NERSCTomographyHPCController, you have something you can easily write unit tests for. I'd recommend mocking NerscClient and testing that the calls to it are as expected. Also, are we still sure that we need NerscClient?

…it for the job to complete before moving onto the next step. An OK workaround for this: NERSC/sfapi_client#93
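
The workaround amounts to polling the submitted job until it reaches a terminal state before the flow moves on. A rough sketch, assuming the job object returned by sfapi_client exposes update() and a state attribute (see NERSC/sfapi_client#93 and the sfapi_client docs for the exact API):

import time

TERMINAL_STATES = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT"}


def wait_for_job(job, poll_seconds: int = 30, timeout_seconds: int = 3600) -> str:
    """Poll a submitted sfapi job until it finishes or the timeout expires."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        job.update()                 # refresh job info via the SF API (assumed method)
        state = str(job.state)
        if any(terminal in state for terminal in TERMINAL_STATES):
            return state
        time.sleep(poll_seconds)
    raise TimeoutError("Job did not reach a terminal state before the timeout")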
@davramov (Author) replied:

> Now that you have a nice clean NERSCTomographyHPCController, you have something you can easily write unit tests for. I'd recommend mocking NerscClient and testing that the calls to it are as expected. Also, are we still sure that we need NerscClient?

Sounds good to me, I will work on adding unit tests. I suspect we can replace references to NerscClient with the official sfapi_client Client class, and capture that logic in create_nersc_client() in NERSCTomographyHPCController. That could be a bit cleaner.
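
A minimal pytest-style sketch of that mocking approach, assuming the controller takes the client and config as constructor arguments (as settled on later in this thread); the import path, the reconstruct() signature, and the call chain asserted here are assumptions:

# test_sfapi_flow.py (sketch): exercise NERSCTomographyHPCController against a mocked client.
from unittest.mock import MagicMock

from orchestration.flows.bl832.nersc import NERSCTomographyHPCController  # assumed import path


def test_reconstruct_submits_job():
    mock_client = MagicMock()   # stands in for the sfapi_client Client
    mock_config = MagicMock()   # stands in for Config832

    controller = NERSCTomographyHPCController(client=mock_client, config=mock_config)
    controller.reconstruct(file_path="dataset.h5")  # signature assumed

    # The controller should have asked the client for a compute site and submitted a job.
    mock_client.compute.assert_called_once()
    mock_client.compute.return_value.submit_job.assert_called_once()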

@dylanmcreynolds (Contributor) left a comment

Let's do the following after discussion:

  • add a big comment to NerscClient saying that it's deprecated and will be removed when we refactor the ptychography code

  • Remove initialization of Config832 and the sfapi Client from NERSCTomographyHPCController, passing them down from flows to tasks as parameters (see the sketch after this list)

  • Create unit tests for the NERSCTomographyHPCController methods
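
A rough sketch of the injection described in the second bullet; the import paths and parameter names are assumptions:

# Sketch: Config832 and the sfapi Client are created by the flow and injected,
# rather than constructed inside the controller. Names here are illustrative.
from sfapi_client import Client

from orchestration.flows.bl832.config import Config832  # assumed import path


class NERSCTomographyHPCController:
    """Both dependencies are required (no longer optional) and owned by the caller."""

    def __init__(self, client: Client, config: Config832) -> None:
        self.client = client
        self.config = config

The flow would then create the sfapi Client once (e.g. via the create_nersc_client() helper mentioned above) and pass it, along with Config832, down to the tasks.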

…l be removed. Removed init of Config832 and sfapi_client from NERSCTomographyHPCController, and instead inject them (no longer optional). Updated unit tests in
@davramov (Author) commented:

  • Added comments in orchestration/nersc.py indicating that NerscClient is deprecated and will be removed.
  • Removed init of Config832 and sfapi_client from NERSCTomographyHPCController, and instead inject them (no longer optional).
  • Updated unit tests in test_sfapi_flow.py

@davramov davramov marked this pull request as ready for review December 20, 2024 17:47