Concurrent branch processing with aliases #60
Discussed in part in Issue #57.
There are two intertwined features.
The first is to allow the analyzer to track multiple nodes in parallel. So if you do `a.Cut()`, it will apply the cut to every one of the nodes in the dictionary of nodes designated as "active". This might seem silly until you think about the JME modules. If you modify the pt up or down, that changes the acceptance, so you need to either run the script once for each variation or build "branches" of the processing tree in parallel. The new feature allows the user to set up a dict of nodes so that they can perform one action which is automatically translated to all Nodes/RDataFrames. As an example...
This is actually doing 13 "branches" in parallel (nominal, two pt variations for each JES and JER, two pt variations for each JES, JER, JMS, JMR). The `FatJet_pt[0]>400` cut is applied to all 13 in the final line. You may be wondering what the point of the `FatJet_pt` cut is if the pt you care about has now changed. That brings me to the second feature - "aliasing".

The basic idea is that, on the level of a Node, you can designate a "find replace" relationship for subsequent actions in the family tree (so the "find replace" is "passed down" to child Nodes). The most basic example showing the infrastructure would be something like this
For the last line, that would actually print and execute: `FatJet_pt[jetIdx[0]]>400`.

That example is actually kind of dangerous... If you need two different collections (say you also need AK4s), this would mess up the actions with the `Jet` correction. I'm working on a fix involving an associated regex match but you get the idea I hope.

Where this really comes in handy is with the JME modules. Since these affect the pt and mass, you can actually fork the processing tree into the 13 variations (as 13 nodes), each with a different alias for `FatJet_pt`, each pointing to a different "real" name, e.g. `FatJet_pt_JES_AK8PUPPI_up`, `FatJet_pt_JES_AK8PUPPI_down`, etc.

That magically makes this...
turn into
except you can see this is not correct...
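To make the intended behavior concrete (one action fanned out to every active node, with each branch substituting its own alias), here is a minimal plain-Python sketch. It is not the code in this PR: `ToyNode`, `ToyGroup`, `apply_aliases`, the branch keys, and `Jet_pt_nom` are all hypothetical names, and cuts are only recorded as strings rather than run on an RDataFrame. The substitution matches whole identifiers only, which is one possible way to keep an alias for one collection from touching another.

```python
import re


def apply_aliases(expr, aliases):
    """Replace each alias with its real name, matching whole identifiers only."""
    for alias, realname in aliases.items():
        # Lookarounds stop 'Jet_pt' from matching inside 'FatJet_pt'.
        pattern = r"(?<![A-Za-z0-9_])" + re.escape(alias) + r"(?![A-Za-z0-9_])"
        # A function replacement avoids backslash processing of realname.
        expr = re.sub(pattern, lambda match: realname, expr)
    return expr


class ToyNode:
    """Hypothetical stand-in for an analyzer node; it only records cut strings."""

    def __init__(self, name, aliases=None, cuts=None):
        self.name = name
        self.aliases = dict(aliases or {})  # own copy, never shared with a parent
        self.cuts = list(cuts or [])

    def cut(self, expr):
        # The child inherits the aliases, so the "find replace" is passed down.
        resolved = apply_aliases(expr, self.aliases)
        return ToyNode(self.name, self.aliases, self.cuts + [resolved])

    def fork(self, name, realname, alias):
        # New branch that differs from its parent by one extra alias.
        child = ToyNode(name, self.aliases, self.cuts)
        child.aliases[alias] = realname
        return child


class ToyGroup:
    """Hypothetical dict of 'active' nodes; one call is applied to all of them."""

    def __init__(self, nodes):
        self.active = dict(nodes)

    def cut(self, expr):
        self.active = {key: node.cut(expr) for key, node in self.active.items()}
        return self


nominal = ToyNode("nominal")
group = ToyGroup({
    "nominal": nominal,
    "JES_up": nominal.fork("JES_up", "FatJet_pt_JES_AK8PUPPI_up", "FatJet_pt"),
    "JES_down": nominal.fork("JES_down", "FatJet_pt_JES_AK8PUPPI_down", "FatJet_pt"),
})

# One call, three branches, each with its own column name substituted.
group.cut("FatJet_pt[0]>400")
for key, node in group.active.items():
    print(key, "->", node.cuts[-1])

# Whole-identifier matching leaves the AK8 column alone when aliasing an AK4 one.
print(apply_aliases("Jet_pt[0]>30 && FatJet_pt[0]>400", {"Jet_pt": "Jet_pt_nom"}))
```

A plain `str.replace` in the last call would also rewrite the `Jet_pt` inside `FatJet_pt`, which is the kind of cross-collection clash described above.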
This is where I've had to leave off and I've created this PR to store my work. There are two very strange behaviors happening that have me worried about memory management.
1. `SplitOnAlias` ... For some reason I can't explain, the line `newNode.AddAlias(realname,alias)` is adding an alias to `checkpoint`. This ends up breaking the entire splitting algorithm.
2. When I use `PrintNodeTree` to debug, I get an error. In fact, `graph.nodes[node]` is empty, `{}`. When I check the `for` loop that precedes this one, `graph.add_node()` is always adding a node with a name. It's not clear what this empty one is and why it only occurs when the `SplitAsAlias` occurs.
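One Python pattern that produces the first symptom, an alias added to a new node also showing up on the node it was created from, is two objects holding a reference to the same dict. Whether that is what happens in `SplitOnAlias` is an assumption, since the relevant code is not shown here. The sketch below uses a hypothetical `ToyNode` class (not the analyzer's `Node`) to show the effect and the copy that avoids it.

```python
class ToyNode:
    """Hypothetical stand-in for a node that carries an alias dict."""

    def __init__(self, name, aliases=None):
        self.name = name
        # Stores whatever dict it is given, without copying it.
        self.aliases = {} if aliases is None else aliases

    def add_alias(self, realname, alias):
        self.aliases[alias] = realname

    def clone_sharing(self, name):
        # Passes the SAME dict object on to the new node.
        return ToyNode(name, self.aliases)

    def clone_copying(self, name):
        # Gives the new node its own dict with the same entries.
        return ToyNode(name, dict(self.aliases))


# Shared dict: the alias added to the child appears on the checkpoint too.
checkpoint = ToyNode("checkpoint")
shared_child = checkpoint.clone_sharing("JES_up")
shared_child.add_alias("FatJet_pt_JES_AK8PUPPI_up", "FatJet_pt")
print("shared:", checkpoint.aliases)

# Copied dict: the checkpoint is left untouched.
clean_checkpoint = ToyNode("checkpoint")
copied_child = clean_checkpoint.clone_copying("JES_up")
copied_child.add_alias("FatJet_pt_JES_AK8PUPPI_up", "FatJet_pt")
print("copied:", clean_checkpoint.aliases)
```

A quick check for this in the real code would be to compare `id()` of the alias containers on `checkpoint` and `newNode` just before the `AddAlias` call. Separately, if `graph` is a networkx graph (also an assumption), `add_edge` creates any endpoint it has not seen before with an empty attribute dict, which would be consistent with `graph.nodes[node]` returning `{}`.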