
Fix for crashes in tracking #676

Merged · 4 commits into star-bnl:main · Apr 10, 2024

Conversation

@genevb (Contributor) commented Apr 2, 2024

On 2016-11-07, Victor committed some code modifications to address three trouble tickets: 664250f

In his commit note, he listed an additional fourth item: "4. Simplified code in refitL()". This change to Sti/StiKalmanTrack.cxx::refitL() did more than simplify: it modified the behavior, though I cannot tell from the comment whether that was intentional. The new behavior appears to allow the refit to continue along a track when invalid track nodes are encountered (the invalid nodes are, I believe, excluded from the refit), whatever the reason a node is considered invalid. Previously, the entire remainder of the track was ignored once an invalid node was encountered.
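
To make the contrast concrete, here is a minimal, hypothetical sketch of the two behaviors; the `TrackNode` struct and the `isValid()`/`refitNode()` helpers are illustrative stand-ins, not the actual Sti interfaces:

```cpp
#include <vector>

// Illustrative stand-ins, not the actual Sti types/functions.
struct TrackNode { bool valid; };
bool isValid(const TrackNode& n) { return n.valid; }
void refitNode(TrackNode& /*n*/) { /* one Kalman refit step */ }

// Pre-2016 behavior: stop at the first invalid node,
// ignoring the entire remainder of the track.
void refitL_old(std::vector<TrackNode>& nodes) {
    for (auto& n : nodes) {
        if (!isValid(n)) break;
        refitNode(n);
    }
}

// Post-2016 behavior: skip invalid nodes and keep refitting
// the rest of the track.
void refitL_new(std::vector<TrackNode>& nodes) {
    for (auto& n : nodes) {
        if (!isValid(n)) continue;
        refitNode(n);
    }
}
```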

Victor's modification introduced a potential crash: when the first node fails the validity checks, the second node is not checked to the same degree and may carry invalid content into the refit. This shows up in logs as:

root4star: .sl73_gcc485/OBJ/StRoot/Sti/StiTrackNodeHelper.cxx:670: int StiTrackNodeHelper::save(): Assertion 'fabs(mPredPars.hz()-mTargetHz)<=1e-10' failed.

(which is a test of the predicted parameters from the previous node versus the current node)
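
For illustration, a minimal sketch of this kind of consistency check; the surrounding function and types are hypothetical, and only the asserted condition is taken from the log above (the interpretation of `hz()` as a magnetic-field-related quantity of the prediction is an assumption):

```cpp
#include <cassert>
#include <cmath>

// Hypothetical stand-in for the predicted-parameter container;
// hz() is assumed to return a field-related quantity of the prediction.
struct PredPars {
    double m_hz;
    double hz() const { return m_hz; }
};

// Sketch of the consistency check that fires in save(): the value carried
// by the parameters predicted from the previous node must agree with the
// value expected at the current node. A first node that bypassed the full
// validity checks can leave these inconsistent, triggering the assert.
void save(const PredPars& mPredPars, double mTargetHz) {
    assert(std::fabs(mPredPars.hz() - mTargetHz) <= 1e-10);
    // ... store the fitted parameters ...
}
```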

The crash happens for only a tiny fraction of tracks, but a single failure among all the tracks in a DAQ file is enough to crash a production job and lose all of its data. It seems particularly common in pp500 data, where large TPC distortion corrections produce more invalid nodes (whether all of them should be invalid is a separate question, not to be answered here), but I have seen this crash occasionally in various other datasets over recent years too. A very rough estimate is that it might cost us ~2% of all our Run 22 pp500 data.

Two candidate fixes to consider:

  1. Revert the code to where it was before Victor's "simplification".
  2. Apply the same validity checks to any potential first used node on a track, not just the initial node.

I ran a track-by-track comparison with each solution on ~200k tracks, and the impacted tracks are very few: ~10 tracks with solution 1, and 1 track with solution 2 (that one track changing by only a single hit). Since solution 2 appears to have essentially no impact other than allowing jobs to run that would otherwise crash, this PR implements it.
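
A minimal sketch of the idea behind solution 2, assuming a hypothetical `passesFirstNodeChecks()` helper that bundles the validity checks previously applied only to the initial node (this is not the actual StiKalmanTrack code):

```cpp
#include <cstddef>
#include <vector>

// Illustrative stand-ins, not the actual Sti types/functions.
struct TrackNode { bool fullyValid; };
bool passesFirstNodeChecks(const TrackNode& n) { return n.fullyValid; }

// Solution 2: every candidate for the first *used* node must pass the
// same validity checks as the initial node, so an unvetted node can
// never seed the refit with invalid content.
std::size_t findFirstUsedNode(const std::vector<TrackNode>& nodes) {
    for (std::size_t i = 0; i < nodes.size(); ++i) {
        if (passesFirstNodeChecks(nodes[i])) return i;  // seeds the refit
    }
    return nodes.size();  // no usable node: abandon the refit
}
```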

@plexoos (Member) left a comment

Looks good!

Sounds like the expected impact on tracks is minimal. Nevertheless, if it's worth patching the currently used libraries with this change, let me know.

My only suggestion is to mention the run/event number where the assert happens (purely for documentation purposes).

@genevb (Contributor, Author) commented Apr 4, 2024

Thanks for the response, @plexoos . I checked the P23if preview production of Run 22 pp500 st_physics data and found that 1125 out of 118751 jobs died due to this assert. That's just under 1%, and a little lower than my first estimate of ~2%, but still notable. I went ahead and documented the number of events that succeeded (according to the log file) before the crash happened for each of these 1125 files here:
~genevb/public/Run22pp500_P23if_st_physics_mHz_crashes_afterNevents.txt

However, with different calibrations in place the next time we produce these files, different hits (and track nodes) may pass or fail the validity checks, so jobs may fail on different events even without modifying this code.

For additional quantification: the P23ie production of Run 19 AuAu200 data died on 23 out of 13006 jobs (~0.18%) due to this crash, and the P23ie production of some Run 20 FXT datasets similarly died on 22 out of 24364 jobs (~0.09%). This supports my earlier point that the crashes can happen in any dataset but are most prevalent in pp500 data.

@klendathu2k (Contributor) left a comment

LGTM.

@genevb merged commit 358a58c into star-bnl:main on Apr 10, 2024
148 checks passed