diffBragg hardening #857
base: master
Conversation
Previously, diffBragg would raise an Exception if any image failed to process for unexpected reasons. In future real-data use cases we will want to log this issue but continue to process the rest of the dataset. The problems encountered for a single image in LZ08 are now moved to RuntimeErrors and caught by a try-except when looping over experiments/reflections/files.

Design choices here are guided by the fact that previous Exceptions would kill a single rank and hang the slurm job. By adding the @mpi_abort_on_exception decorator to the function that loops over files, we correctly raise an Exception and kill the slurm job when anything except these RuntimeErrors triggers an Exception, and we preserve the most informative traceback (compared with decorating functions called in the loop).

The problems caught as RuntimeErrors are logged at the "critical" level, with more information in the debug logs where possible. The logs also reflect the number of files that could be read, compared with all files found. Although it's not ideal that the files that couldn't be read are still included in the "work distribution" over ranks (marked as having zero reflections), there is an important sanity check taking stock of the expected images here that relies on those files being listed, and the alternative leaves us much more vulnerable to losing images and having no way to tell. A more invasive refactor could be a better choice at some future date.

Finally, note that sys.exit() instances aren't safe to use in the functions decorated with @mpi_abort_on_exception. They either need to be replaced with things that aren't recognized as Exceptions or moved outside those functions.
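For reference, a minimal sketch of the pattern described above. The decorator and worker names mirror the description, but the bodies are assumptions: the real @mpi_abort_on_exception would call COMM.Abort(1) via mpi4py, whereas this dependency-free sketch simply re-raises after logging.

```python
import logging

MAIN_LOGGER = logging.getLogger("diffBragg.sketch")

def mpi_abort_on_exception(func):
    """Assumed shape of the decorator: log any uncaught exception, then
    abort the whole MPI job (COMM.Abort(1) in the real code; re-raised
    here so this sketch has no mpi4py dependency)."""
    def wrapper(*args, **kwargs):
        try:
            return func(*args, **kwargs)
        except Exception:
            MAIN_LOGGER.critical("Uncaught exception; aborting job", exc_info=True)
            raise
    return wrapper

@mpi_abort_on_exception
def process_files(files, process_one):
    """Loop over files: per-image RuntimeErrors are logged at the
    critical level and skipped; anything else propagates to the
    decorator and kills the job. process_one is a hypothetical
    per-image worker passed in for illustration."""
    n_read = 0
    for f in files:
        try:
            process_one(f)
            n_read += 1
        except RuntimeError as err:
            MAIN_LOGGER.critical("Could not process %s: %s", f, err)
    # the logs reflect how many files could be read vs. all files found
    MAIN_LOGGER.info("Read %d / %d files", n_read, len(files))
    return n_read
```

Decorating the outermost loop (rather than the functions it calls) is what preserves the full traceback in the abort path.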
@@ -75,7 +76,13 @@ def write_commandline(params):
    else:
        mpi_logger.setup_logging_from_params(params)

    df = pandas.read_pickle(args.input)
    for i in range(3):
What was the failure occurring here without the waiting?
Are there two jobs running simultaneously?
Correct me if I'm wrong -- is this what prep_time is currently addressing? This was a first stab at making the wait for files to become available happen by default, if necessary. Am I in the wrong place?
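As context for the waiting discussed above, a hedged sketch of what polling for a late-arriving file could look like. The helper name and parameters are assumptions, not the PR's actual code; the real loop would then hand the path to pandas.read_pickle.

```python
import os
import time

def wait_for_file(path, attempts=3, wait_s=5):
    """Hypothetical helper: poll for a file that may still be landing
    on a shared filesystem instead of failing on the first look.
    Returns True once the file exists, False after all attempts."""
    for attempt in range(attempts):
        if os.path.exists(path):
            return True
        time.sleep(wait_s)
    return False
```

Returning a boolean (rather than raising) lets the caller decide whether a missing file is a per-image RuntimeError or a fatal condition.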
@@ -567,6 +567,8 @@ def get_roi_background_and_selection_flags(refls, imgs, shoebox_sz=10, reject_ed
    MAIN_LOGGER.debug("Number of skipped ROI with negative BGs: %d / %d" % (num_roi_negative_bg, len(rois)))
    MAIN_LOGGER.debug("Number of skipped ROI with NAN in BGs: %d / %d" % (num_roi_nan_bg, len(rois)))
    MAIN_LOGGER.info("Number of ROIS that will proceed to refinement: %d/%d" % (np.sum(selection_flags), len(rois)))
    if np.sum(selection_flags) == 0:
        raise RuntimeError("Can't proceed with zero ROIs")
Would it make sense to catch this error in hopper_utils.py / GatherFromExperiment:

try:
    roi_packet = utils.get_roi_background_and_selection_flag( ...
except RuntimeError:
    return False

Something like that... the return value of this method is inspected in command_line/hopper.py.
Yep, that makes sense.
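Fleshing out the reviewer's suggestion above, a minimal sketch of the gather method signalling failure through its return value. All names here (gather_from_experiment, get_rois, the packet shape) are illustrative stand-ins for the real hopper_utils.py code, not its actual API.

```python
import logging

MAIN_LOGGER = logging.getLogger("diffBragg.sketch")

def gather_from_experiment(refls, imgs, get_rois):
    """Hedged sketch of GatherFromExperiment: wrap the ROI call in a
    try/except and return False on the zero-ROI RuntimeError, so the
    caller (command_line/hopper.py in the real code) can skip this
    experiment instead of crashing the rank."""
    try:
        roi_packet = get_rois(refls, imgs)
    except RuntimeError:
        MAIN_LOGGER.critical("No usable ROIs for this experiment; skipping")
        return False
    return roi_packet
```

This keeps the zero-ROI case on the same "log and skip" path as the other per-image RuntimeErrors, while unexpected exceptions still propagate to the abort decorator.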
A couple of fixes here to make diffBragg usable with LZ08 at scale. Mainly, we need to be able to skip over one or two bad files without crashing the whole job. Open to discussion if these individual failures are addressable, though.