diffBragg hardening #857

irisdyoung
Contributor

A couple of fixes to make diffBragg usable with LZ08 at scale. Mainly, we need to be able to skip over one or two bad files without crashing the whole job. I'm open to discussion on whether these individual failures can be addressed directly, though.

Previously, diffBragg would raise an Exception if any image failed to process for unexpected reasons. In future real-data use cases we will want to log the failure but continue to process the rest of the dataset. The problems encountered for a single image in LZ08 now raise RuntimeErrors, which are caught by a try-except when looping over experiments/reflections/files.
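
As a rough illustration of the skip-and-log pattern described above (not the actual diffBragg code; `load_one`, `load_all`, the `.bad` suffix, and the logger name are hypothetical stand-ins), the loop might look like this:

```python
import logging

LOGGER = logging.getLogger("diffBragg.sketch")  # hypothetical logger name


def load_one(path):
    """Hypothetical per-image loader; raises RuntimeError for a bad file."""
    if path.endswith(".bad"):
        raise RuntimeError("could not read %s" % path)
    return {"path": path, "num_refls": 10}


def load_all(paths):
    """Load every file, logging and skipping those that raise RuntimeError."""
    loaded, skipped = [], []
    for path in paths:
        try:
            loaded.append(load_one(path))
        except RuntimeError as err:
            # Only the anticipated failure type is swallowed here;
            # any other exception still propagates to the caller.
            LOGGER.critical("Skipping %s: %s", path, err)
            skipped.append(path)
    LOGGER.info("Read %d of %d files", len(loaded), len(paths))
    return loaded, skipped
```

Catching only `RuntimeError` keeps the loop tolerant of known per-image problems while leaving genuinely unexpected errors visible.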

Design choices here are guided by the fact that previous Exceptions would kill a single rank and hang the slurm job. By adding the @mpi_abort_on_exception decorator to the function that loops over files, we correctly raise an Exception and kill the slurm job when anything except these RuntimeErrors triggers an Exception, and we preserve the most informative traceback (compared with decorating functions called in the loop).
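
To make the division of labor concrete, here is a minimal sketch of the idea, assuming a decorator that aborts on any `Exception` that escapes the wrapped function. This is not the real `mpi_abort_on_exception` implementation; `abort_on_exception`, the `abort` hook, and `process_files` are hypothetical names, and under MPI the hook could be something like `MPI.COMM_WORLD.Abort` from mpi4py.

```python
import functools
import traceback


def abort_on_exception(abort):
    """Build a decorator that calls `abort(1)` on any unhandled Exception.

    `abort` is a hypothetical hook standing in for an MPI-wide abort.
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            try:
                return func(*args, **kwargs)
            except Exception:
                # Print the traceback from the outermost decorated call,
                # then bring the whole job down instead of hanging.
                traceback.print_exc()
                abort(1)
        return wrapper
    return decorator


calls = []  # records abort codes in place of a real MPI abort


@abort_on_exception(calls.append)
def process_files(paths):
    results = []
    for path in paths:
        try:
            if path == "bad":
                raise RuntimeError("unreadable image")
            if path == "bug":
                raise ValueError("unexpected failure")
            results.append(path)
        except RuntimeError:
            continue  # anticipated per-file failure: skip and move on
    return results
```

With this arrangement, `process_files(["a", "bad", "b"])` skips the bad entry and returns normally, while a `"bug"` entry escapes the inner try-except and triggers the abort hook.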

The problems caught as RuntimeErrors are logged at the "critical" level, with more information in the debug logs where possible. The logs also reflect the number of files that could be read, compared with all files found. Although it's not ideal that the files that couldn't be read are still included in the "work distribution" over ranks (marked as having zero reflections), there is an important sanity check taking stock of the expected images here that relies on those files being listed, and the alternative leaves us much more vulnerable to losing images and having no way to tell. A more invasive refactor could be a better choice at some future date.
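
The bookkeeping described above can be sketched as follows. This is an assumption-laden illustration rather than the code in this PR: `summarize_inputs`, `split_over_ranks`, and the round-robin scheme are hypothetical, chosen only to show unreadable files staying in the work list with zero reflections.

```python
def summarize_inputs(found_paths, num_refls_by_path):
    """Return one (path, num_refls) entry per found file.

    `num_refls_by_path` holds counts only for files that were read
    successfully; files missing from it are kept with zero reflections.
    """
    work = [(p, num_refls_by_path.get(p, 0)) for p in found_paths]
    num_readable = sum(1 for p in found_paths if p in num_refls_by_path)
    # Sanity check: every file that was found is still accounted for.
    assert len(work) == len(found_paths)
    return work, num_readable


def split_over_ranks(work, num_ranks):
    """Round-robin assignment of entries to ranks (illustrative only)."""
    return [work[rank::num_ranks] for rank in range(num_ranks)]
```

Keeping the zero-reflection placeholders means the total number of entries always matches the number of files found, so a silently dropped image would show up as a count mismatch.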

Finally, note that sys.exit() instances aren't safe to use in the functions decorated with @mpi_abort_on_exception. They either need to be replaced with things that aren't recognized as Exceptions or moved outside those functions.
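
For context on why `sys.exit()` interacts badly with exception handlers: it works by raising `SystemExit`, which derives from `BaseException` rather than `Exception`. The sketch below assumes a wrapper with a broad catch; whether the real decorator catches this broadly is an assumption, and `run_with_broad_handler` and the two example functions are hypothetical.

```python
import sys


def run_with_broad_handler(func):
    """Mimic a wrapper that treats anything raised as a failure."""
    try:
        func()
    except BaseException as err:  # assumption: the handler catches broadly
        return "aborted: %s" % type(err).__name__
    return "ok"


def early_exit_with_sys_exit():
    sys.exit(0)  # raises SystemExit, even with a zero exit code


def early_exit_with_return():
    return  # leaves the function without raising anything
```

Under that assumption, an intended clean exit via `sys.exit(0)` is indistinguishable from a failure, whereas a plain `return` is not.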
@irisdyoung irisdyoung requested review from dermen and phyy-nx March 15, 2023 08:28
@irisdyoung irisdyoung self-assigned this Mar 15, 2023
@irisdyoung irisdyoung removed the request for review from phyy-nx March 16, 2023 06:27
@@ -75,7 +76,13 @@ def write_commandline(params):
else:
    mpi_logger.setup_logging_from_params(params)

df = pandas.read_pickle(args.input)
for i in range(3):
Contributor

What failure was occurring here without the wait?

Contributor

Are there two jobs running simultaneously?

Contributor Author

Correct me if I'm wrong -- this is what prep_time is currently addressing? This was a first stab at making it happen by default to wait until files are available, if necessary. Am I in the wrong place?
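
For reference, a wait-until-available check of the kind discussed here could be sketched as below. The helper name `wait_for_file`, the attempt count, and the delay are hypothetical and not taken from this PR.

```python
import os
import time


def wait_for_file(path, attempts=3, delay_s=1.0):
    """Return True once `path` exists, checking up to `attempts` times."""
    for attempt in range(attempts):
        if os.path.exists(path):
            return True
        if attempt < attempts - 1:
            time.sleep(delay_s)  # give the filesystem time to catch up
    return False
```

A caller would read the input only if this returns `True`, and otherwise report that the file never appeared.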

simtbx/diffBragg/hopper_ensemble_utils.py
@@ -567,6 +567,8 @@ def get_roi_background_and_selection_flags(refls, imgs, shoebox_sz=10, reject_ed
MAIN_LOGGER.debug("Number of skipped ROI with negative BGs: %d / %d" % (num_roi_negative_bg, len(rois)))
MAIN_LOGGER.debug("Number of skipped ROI with NAN in BGs: %d / %d" % (num_roi_nan_bg, len(rois)))
MAIN_LOGGER.info("Number of ROIS that will proceed to refinement: %d/%d" % (np.sum(selection_flags), len(rois)))
if np.sum(selection_flags) == 0:
    raise RuntimeError("Can't proceed with zero ROIs")
Contributor

Would it make sense to catch this error in hopper_utils.py / GatherFromExperiment?

try:
    roi_packet = utils.get_roi_background_and_selection_flags(...)
except RuntimeError:
    return False

Something like that. The return value of this method is inspected in command_line/hopper.py.

Contributor Author

Yep, that makes sense.


2 participants