Concurrent branch processing with aliases #60

Open
wants to merge 6 commits into
base: master
from

Conversation

lcorcodilos
Owner

Discussed in part in Issue #57.

There are two intertwined features.

The first is to allow the analyzer to track multiple nodes in parallel. If you call a.Cut(), the cut is applied to every node in the dictionary of nodes designated as "active". This might seem silly until you think about the JME modules. If you vary the pt up or down, the acceptance changes, so you need to either run the script once per variation or build "branches" of the processing tree in parallel. The new feature lets the user set up a dict of nodes so that a single action is automatically applied to all of the Nodes/RDataFrames.

As an example...

a = analyzer('../ttbar18_sample.root')
a = AutoJME(a,'FatJet','2018',setAlias=True)
a.Cut('jetPt','FatJet_pt[0]>400')

This is actually processing 13 "branches" in parallel (nominal, two pt variations for each of JES and JER, and two mass variations for each of JES, JER, JMS, and JMR). The FatJet_pt[0]>400 cut is applied to all 13 in the final line. You may be wondering what the point of the FatJet_pt cut is if the pt you care about has now changed. That brings me to the second feature: "aliasing".
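The fan-out described above can be sketched in plain Python. This is only an illustration of the idea, not TIMBER's implementation: `ToyNode`, `ToyAnalyzer`, and the three branch names are hypothetical stand-ins, and no RDataFrame is involved.

```python
class ToyNode:
    """Hypothetical stand-in for a Node; it only records the actions applied to it."""
    def __init__(self, name, actions=()):
        self.name = name
        self.actions = list(actions)

    def cut(self, cut_name, expression):
        # Each action returns a new child node and leaves the parent untouched.
        return ToyNode(f"{self.name}__{cut_name}", self.actions + [expression])


class ToyAnalyzer:
    """Hypothetical analyzer that tracks a dict of active nodes."""
    def __init__(self, active_nodes):
        self.active = dict(active_nodes)

    def cut(self, cut_name, expression):
        # One user call is applied to every active branch.
        self.active = {key: node.cut(cut_name, expression)
                       for key, node in self.active.items()}
        return self.active


# Three illustrative branches instead of the full 13.
branches = {v: ToyNode(v) for v in ("nominal", "JES_up", "JES_down")}
a = ToyAnalyzer(branches)
a.cut("jetPt", "FatJet_pt[0]>400")
names = sorted(node.name for node in a.active.values())
```

After the single `a.cut(...)` call, each entry of `a.active` is a new child node carrying the same cut.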

The basic idea is that, at the level of a Node, you can designate a "find and replace" relationship for subsequent actions in the family tree (so the "find and replace" is passed down to child Nodes). The most basic example showing the infrastructure would be something like this:

a = analyzer('...')
a.Define('jetIdx','functionReturningJetIndexes(FatJets)')
a.AddAlias('[0]','[jetIdx[0]]')
a.Cut('myCut','FatJet_pt[0]>400')

For the last line, that would actually print and execute: FatJet_pt[jetIdx[0]]>400.
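A minimal sketch of that substitution, assuming the alias is applied as a plain string replacement before the expression is handed on. The helper `apply_aliases` is hypothetical and is not part of TIMBER.

```python
def apply_aliases(expression, aliases):
    """Replace every alias key found in the expression with its target string."""
    for find, replace in aliases.items():
        expression = expression.replace(find, replace)
    return expression


# The alias from the example above: '[0]' -> '[jetIdx[0]]'
parent_aliases = {"[0]": "[jetIdx[0]]"}

# A child node inherits a copy of its parent's aliases ("passed down").
child_aliases = dict(parent_aliases)

resolved = apply_aliases("FatJet_pt[0]>400", child_aliases)
print(resolved)  # FatJet_pt[jetIdx[0]]>400
```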

That example is actually kind of dangerous... If you need two different collections (say you also need AK4s), this would mess up the actions with the Jet correction. I'm working on a fix involving an associated regex match but you get the idea I hope.
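To make the danger concrete, here is a sketch comparing a naive replacement with one restricted to a single collection by a regex. This is not the fix mentioned above, only one possible way to scope an alias; `apply_scoped_alias` and the `Jet_pt[0]>30` cut are hypothetical.

```python
import re


def apply_scoped_alias(expression, collection, find, replace):
    """Apply find -> replace only where it directly follows a <collection>_<branch> name."""
    pattern = re.compile(r"\b(" + re.escape(collection) + r"_\w+)" + re.escape(find))
    # A function is used as the replacement so `replace` is inserted literally.
    return pattern.sub(lambda m: m.group(1) + replace, expression)


expr = "FatJet_pt[0]>400 && Jet_pt[0]>30"

# Naive find/replace also rewrites the AK4 index.
naive = expr.replace("[0]", "[jetIdx[0]]")

# Scoped replacement leaves Jet_pt[0] alone.
scoped = apply_scoped_alias(expr, "FatJet", "[0]", "[jetIdx[0]]")

print(naive)   # FatJet_pt[jetIdx[0]]>400 && Jet_pt[jetIdx[0]]>30
print(scoped)  # FatJet_pt[jetIdx[0]]>400 && Jet_pt[0]>30
```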

Where this really comes in handy is with the JME modules. Since these affect the pt and mass, you can actually fork the processing tree into the 13 variations (as 13 nodes), each with a different alias for FatJet_pt and each pointing to a different "real" name, e.g. FatJet_pt_JES_AK8PUPPI_up, FatJet_pt_JES_AK8PUPPI_down, etc.

That magically makes this...

a = AutoJME(a,'FatJet','2018',setAlias=True)
a.Cut('jetPt','FatJet_pt[0]>400')

turn into

Filtering FatJet_pt_JES_AK8PFPuppi__down__jetPt: CalibratedFatJet_pt_JER_FatJet__down[0]>400
Filtering FatJet_mass_JER_FatJet__up__jetPt: CalibratedFatJet_pt_JER_FatJet__down[0]>400
Filtering FatJet_mass_JMS_AK8PFPuppi__nom__jetPt: CalibratedFatJet_pt_JER_FatJet__down[0]>400
...

except you can see this is not correct...

This is where I've had to leave off and I've created this PR to store my work. There are two very strange behaviors happening that have me worried about memory management.

  1. The first is tied to the issue above. In SplitOnAlias...
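def SplitOnAlias(self,aliasTuples,node=None):
    # Default to the analyzer's current active node
    if node is None: node = self.ActiveNode
    newNodes = {}

    # Every variation is cloned from the same starting node
    checkpoint = node
    for t in aliasTuples:
        realname = t[0]
        alias = t[1]
        newNode = checkpoint.Clone(realname, inherit=True)
        newNode.AddAlias(realname,alias)
        newNodes[realname] = newNode

    # The dict of clones becomes the new set of active nodes
    return self.SetActiveNode(newNodes)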

For some reason I can't explain, the line newNode.AddAlias(realname,alias) is adding an alias to checkpoint. This ends up breaking the entire splitting algorithm.
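One possible cause, offered only as a hypothesis since Clone is not shown here: if the clone receives a reference to the parent's alias dict rather than a copy, any alias added to the child also appears on the parent. The toy class below is hypothetical and only demonstrates that Python behavior.

```python
import copy


class AliasNode:
    """Hypothetical node that stores aliases in a dict."""
    def __init__(self, name, aliases=None):
        self.name = name
        self.aliases = {} if aliases is None else aliases

    def add_alias(self, find, replace):
        self.aliases[find] = replace

    def clone_shared(self, name):
        # Hands the same dict object to the child: both nodes mutate one dict.
        return AliasNode(name, self.aliases)

    def clone_copied(self, name):
        # Gives the child its own copy of the inherited aliases.
        return AliasNode(name, copy.copy(self.aliases))


checkpoint = AliasNode("checkpoint")
leaky = checkpoint.clone_shared("variation")
leaky.add_alias("FatJet_pt", "FatJet_pt_up")
leaked = dict(checkpoint.aliases)      # the parent picked up the child's alias

checkpoint2 = AliasNode("checkpoint2")
safe = checkpoint2.clone_copied("variation")
safe.add_alias("FatJet_pt", "FatJet_pt_up")
isolated = dict(checkpoint2.aliases)   # the parent is unchanged
```

If something like this is happening, checking `newNode.<alias dict> is checkpoint.<alias dict>` right after the Clone call (with whatever the real attribute name is) would confirm or rule it out.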

  2. Second, when attempting to call PrintNodeTree to debug, I get the error
Traceback (most recent call last):
  File "JMEexample.py", line 15, in <module>
    a.PrintNodeTree('TestTree.pdf',toSkip=["SubCollDefine"])
  File "/uscms_data/d3/lcorcodi/TIMBERslimming/B2Gworkshop/CMSSW_11_1_4/src/TIMBER/TIMBER/Analyzer.py", line 1251, in PrintNodeTree
    if skip in graph.nodes[node]["type"]:
KeyError: 'type'

In fact, graph.nodes[node] is empty, {}. Checking the for loop that precedes this one, I've confirmed that graph.add_node() always adds a node with a name. It's not clear what this empty node is or why it only occurs after SplitOnAlias is called.
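One thing worth checking, assuming graph is a networkx graph (which the graph.nodes[...] and graph.add_node() calls suggest): add_edge() silently creates any endpoint that is not already in the graph, and that node has an empty attribute dict. The node names below are hypothetical; whether this is what happens in PrintNodeTree is a guess.

```python
import networkx as nx

graph = nx.DiGraph()
graph.add_node("parent", type="Define")

# "child" was never passed to add_node, so add_edge creates it with no attributes.
graph.add_edge("parent", "child")

empty_attrs = dict(graph.nodes["child"])   # {}

# A defensive lookup avoids the KeyError while the stray node is tracked down.
node_type = graph.nodes["child"].get("type", "")
skipped = "SubCollDefine" in node_type
```

If the edges are built from parent/child names, a name mismatch between a cloned node and the name used in the edge would produce exactly this kind of attribute-less node.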

@lcorcodilos lcorcodilos added this to the Beta 2.0 milestone Apr 19, 2021
@lcorcodilos lcorcodilos removed this from the Beta 2.0 milestone Jun 15, 2021