From 358a58c0172b51f61d4babe21845973f35c0ee18 Mon Sep 17 00:00:00 2001
From: Gene Van Buren <85305093+genevb@users.noreply.github.com>
Date: Wed, 10 Apr 2024 12:25:10 -0400
Subject: [PATCH] Fix for crashes in tracking (#676)

On 2016-11-07, Victor committed some code modifications to address three
trouble tickets: 664250f8272bce312682ce03d17f6f5727b9b42b

In his commit note, he listed an additional 4th item: "4. Simplified
code in refitL()". This particular code change in
[`Sti/StiKalmanTrack.cxx::refitL()`](https://github.com/star-bnl/star-sw/blob/2e0982686339b1162c3a98c2b4fac0f28846ce19/StRoot/Sti/StiKalmanTrack.cxx#L1653)
did a little more than simplify: it modified the behavior, though I
cannot tell if that was intentional from the comment. The new behavior
appears to allow the re-fit to continue along a track when invalid track
nodes are encountered (the invalid nodes are excluded from the refit, I
believe), for whatever reason a node may be considered invalid.
Previously, the entire remainder of the track would be ignored once an
invalid node was encountered.

The modification Victor made introduced a potential way for a job to
crash for the case when the first node fails validity checks, but the
second node doesn't get checked to the same degree and may have invalid
content. This shows up in logs as:

`
root4star: .sl73_gcc485/OBJ/StRoot/Sti/StiTrackNodeHelper.cxx:670: int
StiTrackNodeHelper::save(): Assertion
'fabs(mPredPars.hz()-mTargetHz)<=1e-10' failed.
`

(which is a test of the predicted parameters from the previous node
versus the current node)

The crash happens for only a tiny fraction of tracks, but a single
failure from all the tracks contained in a single DAQ file is enough to
crash a production job and lose all of its data. This particularly seems
to be more common in pp500 data where there are more invalid nodes due
to large TPC distortion corrections (though whether all of them should
be invalid is a separate question not to answer here at this time), but
I have seen this crash in various other datasets from time to time over
recent years too. A very rough estimation is that this might cost us ~2%
of all our Run 22 pp500 data.

Two candidate fixes to consider:

1. Revert the code to where it was before Victor's "simplification".
2. Apply the same validity checks to any potential first used node on a
track, not just the initial node.

I have run a track-by-track comparison with each on ~200k tracks, and
the impacted tracks are very few in number, at the level of ~10 tracks
in solution 1, and 1 track in solution 2 (that one track changing by
only a single hit). Given that solution 2 appears to be very close to
having no impact other than allowing jobs to run that would otherwise
crash, this PR implements that solution.
---
 StRoot/Sti/StiKalmanTrack.cxx | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/StRoot/Sti/StiKalmanTrack.cxx b/StRoot/Sti/StiKalmanTrack.cxx
index 47a29f58213..b1a47c6a98c 100644
--- a/StRoot/Sti/StiKalmanTrack.cxx
+++ b/StRoot/Sti/StiKalmanTrack.cxx
@@ -1657,7 +1657,8 @@ static int nCall=0;nCall++;
 
   StiKTNIterator source;
   StiKalmanTrackNode *pNode = 0,*targetNode;
-  int iNode=0, status = 0,isStarted=0;
+  int iNode=0, status = 0;
+  bool isStarted=false;
   sTNH.setDir(1);
   for (source=rbegin();source!=rend();source++) {
     iNode++;
@@ -1668,15 +1669,15 @@ static int nCall=0;nCall++;
       if ( targetNode->getChi2()>1000) 	targetNode->setInvalid();
       if (!targetNode->isValid()) 	continue;
     }
-    isStarted++;
     sTNH.set(pNode,targetNode);
     status = sTNH.makeFit(0);
-    if (status) continue;
+    if (status) {targetNode->setInvalid();continue;}
     if (!targetNode->isValid()) 	continue;
+    isStarted = true;
     pNode = targetNode;
   }//end for of nodes
 
-    pNode = 0; iNode=0;isStarted=0;
+    pNode = 0; iNode=0;isStarted=false;
   sTNH.setDir(0);
   for (source=begin();source!=end();source++) {
     iNode++;
@@ -1686,11 +1687,11 @@ static int nCall=0;nCall++;
       if ( targetNode->getChi2()>1000) 	targetNode->setInvalid();
       if (!targetNode->isValid()) 	continue;
     }
-    isStarted++;
     sTNH.set(pNode,targetNode);
     status = sTNH.makeFit(1);
-    if (status) 			continue;
+    if (status) {targetNode->setInvalid();continue;}
     if (!targetNode->isValid()) 	continue;
+    isStarted = true;
     pNode = targetNode;
   }//end for of nodes
   return 0;