diffBragg hardening #857
base: master
Conversation
Previously, diffBragg would raise an Exception if any image failed to process for unexpected reasons. In future real-data use cases we will want to log this issue but continue to process the rest of the dataset. The problems encountered for a single image in LZ08 are now moved to RuntimeErrors and caught by a try-except when looping over experiments/reflections/files.

Design choices here are guided by the fact that previous Exceptions would kill a single rank and hang the slurm job. By adding the @mpi_abort_on_exception decorator to the function that loops over files, we correctly raise an Exception and kill the slurm job when anything except these RuntimeErrors triggers an Exception, and we preserve the most informative traceback (compared with decorating functions called in the loop).

The problems caught as RuntimeErrors are logged at the "critical" level, with more information in the debug logs where possible. The logs also reflect the number of files that could be read, compared with all files found. Although it's not ideal that the files that couldn't be read are still included in the "work distribution" over ranks (marked as having zero reflections), there is an important sanity check taking stock of the expected images here that relies on those files being listed, and the alternative leaves us much more vulnerable to losing images and having no way to tell. A more invasive refactor could be a better choice at some future date.

Finally, note that sys.exit() instances aren't safe to use in the functions decorated with @mpi_abort_on_exception. They either need to be replaced with things that aren't recognized as Exceptions or moved outside those functions.
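For reference, a minimal sketch of the pattern described above. The decorator and worker names mirror the description, but the bodies are assumptions: the real @mpi_abort_on_exception would call COMM.Abort(1) via mpi4py, whereas this dependency-free sketch simply re-raises after logging.

```python
import logging

MAIN_LOGGER = logging.getLogger("diffBragg.sketch")

def mpi_abort_on_exception(func):
    """Assumed shape of the decorator: log any uncaught exception, then
    abort the whole MPI job (COMM.Abort(1) in the real code; re-raised
    here so this sketch has no mpi4py dependency)."""
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            MAIN_LOGGER.critical("Uncaught exception; aborting job", exc_info=True)
            raise
    return wrapper

@mpi_abort_on_exception
def process_files(files, process_one):
    """Loop over files: per-image RuntimeErrors are logged at the
    critical level and skipped; anything else propagates to the
    decorator and kills the job. process_one is a hypothetical
    per-image worker passed in for illustration."""
    n_read = 0
    for f in files:
        try:
            process_one(f)
            n_read += 1
        except RuntimeError as err:
            MAIN_LOGGER.critical("Could not process %s: %s", f, err)
    # the logs reflect how many files could be read vs. all files found
    MAIN_LOGGER.info("Read %d / %d files", n_read, len(files))
    return n_read
```

Decorating the outermost loop (rather than the functions it calls) is what preserves the full traceback in the abort path.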
@@ -75,7 +76,13 @@ def write_commandline(params):
    else:
        mpi_logger.setup_logging_from_params(params)

    df = pandas.read_pickle(args.input)
    for i in range(3):
What was the failure occurring here without the waiting?
Are there two jobs running simultaneously?
Correct me if I'm wrong -- is this what prep_time is currently addressing? This was a first stab at making the wait for files to become available happen by default, if necessary. Am I in the wrong place?
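As context for the waiting discussed above, a hedged sketch of what polling for a late-arriving file could look like. The helper name and parameters are assumptions, not the PR's actual code; the real loop would then hand the path to pandas.read_pickle.

```python
import os
import time

def wait_for_file(path, attempts=3, wait_s=5):
    """Hypothetical helper: poll for a file that may still be landing
    on a shared filesystem instead of failing on the first look.
    Returns True once the file exists, False after all attempts."""
    for attempt in range(attempts):
        if os.path.exists(path):
            return True
        time.sleep(wait_s)
    return False
```

Returning a boolean (rather than raising) lets the caller decide whether a missing file is a per-image RuntimeError or a fatal condition.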
@@ -567,6 +567,8 @@ def get_roi_background_and_selection_flags(refls, imgs, shoebox_sz=10, reject_ed
    MAIN_LOGGER.debug("Number of skipped ROI with negative BGs: %d / %d" % (num_roi_negative_bg, len(rois)))
    MAIN_LOGGER.debug("Number of skipped ROI with NAN in BGs: %d / %d" % (num_roi_nan_bg, len(rois)))
    MAIN_LOGGER.info("Number of ROIS that will proceed to refinement: %d/%d" % (np.sum(selection_flags), len(rois)))
    if np.sum(selection_flags) == 0:
        raise RuntimeError("Can't proceed with zero ROIs")
Would it make sense to catch this error in hopper_utils.py / GatherFromExperiment:

try:
    roi_packet = utils.get_roi_background_and_selection_flag( ...
except RuntimeError:
    return False

Something like that... the return value of this method is inspected in command_line/hopper.py.
Yep, that makes sense.
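Fleshing out the reviewer's suggestion above, a minimal sketch of the gather method signalling failure through its return value. All names here (gather_from_experiment, get_rois, the packet shape) are illustrative stand-ins for the real hopper_utils.py code, not its actual API.

```python
import logging

MAIN_LOGGER = logging.getLogger("diffBragg.sketch")

def gather_from_experiment(refls, imgs, get_rois):
    """Hedged sketch of GatherFromExperiment: wrap the ROI call in a
    try/except and return False on the zero-ROI RuntimeError, so the
    caller (command_line/hopper.py in the real code) can skip this
    experiment instead of crashing the rank."""
    try:
        roi_packet = get_rois(refls, imgs)
    except RuntimeError:
        MAIN_LOGGER.critical("No usable ROIs for this experiment; skipping")
        return False
    return roi_packet
```

This keeps the zero-ROI case on the same "log and skip" path as the other per-image RuntimeErrors, while unexpected exceptions still propagate to the abort decorator.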
A couple of fixes here to make diffBragg usable with LZ08 at scale. Mainly, we need to be able to skip over one or two bad files without crashing the whole job. Open to discussion if these individual failures are addressable, though.