Concurrent branch processing with aliases #60

lcorcodilos · 2021-04-13T13:25:15Z

Discussed in part in Issue #57.

There are two intertwined features.

The first is to allow the analyzer to track multiple nodes in parallel. So if you do a.Cut(), it will apply the cut to every one of the nodes in the dictionary of nodes designated as "active". This might seem silly to do until you think about the JME modules. If you modify the pt up or down, that changes the acceptance, so you need to either run the script once for each variation or build "branches" of the processing tree in parallel. The new feature allows the user to set up a dict of nodes so that they can perform one action which is automatically applied to all Nodes/RDataFrames.
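A rough way to picture the fan-out is sketched below. This is a toy model, not TIMBER's implementation: ToyNode, ToyAnalyzer, and the three variation labels are hypothetical stand-ins. The only behavior taken from the description above is that a single Cut call is applied to every node in the active dict.

```python
class ToyNode:
    """Hypothetical stand-in for a processing-tree node."""

    def __init__(self, name, cuts=()):
        self.name = name
        self.cuts = list(cuts)

    def Cut(self, name, expr):
        # Return a child node that carries the parent's cuts plus the new one.
        return ToyNode(f"{self.name}__{name}", self.cuts + [expr])


class ToyAnalyzer:
    """Hypothetical stand-in for the analyzer; ActiveNode is a node or a dict of nodes."""

    def __init__(self, active):
        self.ActiveNode = active

    def Cut(self, name, expr):
        if isinstance(self.ActiveNode, dict):
            # One user call fans out to every active branch.
            self.ActiveNode = {
                key: node.Cut(name, expr) for key, node in self.ActiveNode.items()
            }
        else:
            self.ActiveNode = self.ActiveNode.Cut(name, expr)
        return self.ActiveNode


# Made-up branch labels, for illustration only.
branches = {label: ToyNode(label) for label in ("nominal", "JES_up", "JES_down")}
a = ToyAnalyzer(branches)
a.Cut("jetPt", "FatJet_pt[0]>400")

for label, node in a.ActiveNode.items():
    print(label, node.name, node.cuts)
```

The user-facing call stays identical whether one node or many are active, which is the point of the feature.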

As an example...

a = analyzer('../ttbar18_sample.root')
a = AutoJME(a,'FatJet','2018',setAlias=True)
a.Cut('jetPt','FatJet_pt[0]>400')

This is actually processing 13 "branches" in parallel (nominal, two pt variations for each of JES and JER, and two mass variations for each of JES, JER, JMS, and JMR). The FatJet_pt[0]>400 cut is applied to all 13 in the final line. You may be wondering what the point of the FatJet_pt cut is if the pt you care about has now changed. That brings me to the second feature - "aliasing".

The basic idea is that, on the level of a Node, you can designate a "find replace" relationship for subsequent actions in the family tree (so the "find replace" is "passed down" to child Nodes). The most basic example showing the infrastructure would be something like this:

a = analyzer('...')
a.Define('jetIdx','functionReturningJetIndexes(FatJets)')
a.AddAlias('[0]','[jetIdx[0]]')
a.Cut('myCut','FatJet_pt[0]>400')

For the last line, that would actually print and execute: FatJet_pt[jetIdx[0]]>400.

That example is actually kind of dangerous. If you need two different collections (say you also need AK4s), this would mess up the actions with the Jet correction. I'm working on a fix involving an associated regex match, but I hope you get the idea.
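To make the danger concrete, here is a minimal sketch comparing a plain string replacement with a replacement that is scoped to one collection by a regex. It is not the fix mentioned above, only one way such a regex match could look. The helper names, the combined cut expression, and the Jet_pt column are assumptions made for illustration.

```python
import re


def apply_alias_naive(expr, find, replace):
    # Rewrites every occurrence of `find`, whichever collection it belongs to.
    return expr.replace(find, replace)


def apply_alias_scoped(expr, collection, find, replace):
    # Rewrites `find` only when it directly follows a column of `collection`,
    # e.g. FatJet_pt[0] but not Jet_pt[0].
    pattern = re.compile(r"\b(" + re.escape(collection) + r"_\w+)" + re.escape(find))
    return pattern.sub(lambda match: match.group(1) + replace, expr)


# Hypothetical cut that touches both the AK8 and the AK4 collection.
expr = "FatJet_pt[0]>400 && Jet_pt[0]>30"

naive = apply_alias_naive(expr, "[0]", "[jetIdx[0]]")
scoped = apply_alias_scoped(expr, "FatJet", "[0]", "[jetIdx[0]]")

print(naive)   # FatJet_pt[jetIdx[0]]>400 && Jet_pt[jetIdx[0]]>30
print(scoped)  # FatJet_pt[jetIdx[0]]>400 && Jet_pt[0]>30
```

The word boundary in the pattern keeps "Jet_pt" from matching as the tail of "FatJet_pt", so only the intended collection is re-indexed.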

Where this really comes in handy is with the JME modules. Since these affect the pt and mass, you can actually fork the processing tree into the 13 variations (as 13 nodes), each with a different alias for FatJet_pt and each pointing to a different "real" name, e.g. FatJet_pt_JES_AK8PUPPI_up, FatJet_pt_JES_AK8PUPPI_down, etc.

That magically makes this...

a = AutoJME(a,'FatJet','2018',setAlias=True)
a.Cut('jetPt','FatJet_pt[0]>400')

turn into

Filtering FatJet_pt_JES_AK8PFPuppi__down__jetPt: CalibratedFatJet_pt_JER_FatJet__down[0]>400
Filtering FatJet_mass_JER_FatJet__up__jetPt: CalibratedFatJet_pt_JER_FatJet__down[0]>400
Filtering FatJet_mass_JMS_AK8PFPuppi__nom__jetPt: CalibratedFatJet_pt_JER_FatJet__down[0]>400
...

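except you can see this is not correct: every branch ends up filtering on the same column (CalibratedFatJet_pt_JER_FatJet__down) instead of its own variation.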

This is where I've had to leave off and I've created this PR to store my work. There are two very strange behaviors happening that have me worried about memory management.

The first is tied to the issue above. In SplitOnAlias...

def SplitOnAlias(self,aliasTuples,node=None):
    if node is None: node = self.ActiveNode
    newNodes = {}

    checkpoint = node
    for t in aliasTuples:
        realname = t[0]
        alias = t[1]
        newNode = checkpoint.Clone(realname, inherit=True)
        newNode.AddAlias(realname,alias)
        newNodes[realname] = newNode

    return self.SetActiveNode(newNodes)

For some reason I can't explain, the line newNode.AddAlias(realname,alias) is adding an alias to checkpoint. This ends up breaking the entire splitting algorithm.
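One possibility that may be worth ruling out (a guess, not a diagnosis of the actual Clone code) is that the clone and its parent hold a reference to the same alias dictionary, so a write through one is visible through the other. The sketch below reproduces that effect with a hypothetical ToyNode class. The method names, the alias direction, and the FatJet_pt_up column are all made up for illustration.

```python
class ToyNode:
    """Hypothetical node that stores its aliases in a dict."""

    def __init__(self, name, aliases=None):
        self.name = name
        self.aliases = {} if aliases is None else aliases

    def add_alias(self, find, replace):
        self.aliases[find] = replace

    def clone_shared(self, name):
        # Hands the parent's dict object itself to the child (no copy).
        return ToyNode(name, aliases=self.aliases)

    def clone_copied(self, name):
        # Gives the child its own copy of the parent's aliases.
        return ToyNode(name, aliases=dict(self.aliases))


shared_parent = ToyNode("checkpoint")
shared_child = shared_parent.clone_shared("pt_up")
shared_child.add_alias("FatJet_pt", "FatJet_pt_up")

copied_parent = ToyNode("checkpoint")
copied_child = copied_parent.clone_copied("pt_up")
copied_child.add_alias("FatJet_pt", "FatJet_pt_up")

print(shared_parent.aliases)  # {'FatJet_pt': 'FatJet_pt_up'}
print(copied_parent.aliases)  # {}
```

If something like the shared case is happening, every clone made from checkpoint would see a single alias map, which would also be consistent with all branches in the printed output above resolving to the same column. Checking whether newNode's alias container `is` the same object as checkpoint's would confirm or exclude this.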

Second, when calling PrintNodeTree to debug, I get the error

Traceback (most recent call last):
  File "JMEexample.py", line 15, in <module>
    a.PrintNodeTree('TestTree.pdf',toSkip=["SubCollDefine"])
  File "/uscms_data/d3/lcorcodi/TIMBERslimming/B2Gworkshop/CMSSW_11_1_4/src/TIMBER/TIMBER/Analyzer.py", line 1251, in PrintNodeTree
    if skip in graph.nodes[node]["type"]:
KeyError: 'type'

In fact, graph.nodes[node] is empty, {}. Looking at the for loop that precedes this one, I've checked that graph.add_node() is always adding a node with a name. It's not clear what this empty one is and why it only occurs when SplitOnAlias is used.
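One thing that may be worth checking, assuming graph is a networkx graph (the graph.nodes[node] and graph.add_node() calls suggest so): add_edge() silently creates any endpoint that has not been added yet, and such a node gets an empty attribute dict. The sketch below shows that behavior and a lookup that tolerates it. The node names and "type" values are made up for illustration, and whether this is what actually happens in PrintNodeTree is not established.

```python
import networkx as nx

graph = nx.DiGraph()
graph.add_node("start", type="Define")
graph.add_node("jetPt", type="Cut")
graph.add_edge("start", "jetPt")

# Hypothetical slip: this edge target was never passed to add_node().
graph.add_edge("start", "orphan")

# Nodes created implicitly by add_edge() carry no attributes at all.
untyped = [name for name, attrs in graph.nodes(data=True) if "type" not in attrs]
print(untyped)  # ['orphan']

# Defensive lookup that tolerates such nodes instead of raising KeyError.
to_skip = ["SubCollDefine"]
kept = [
    name
    for name, attrs in graph.nodes(data=True)
    if not any(skip in attrs.get("type", "") for skip in to_skip)
]
print(kept)  # ['start', 'jetPt', 'orphan']
```

Listing the untyped nodes by name just before the failing loop would show whether the empty node corresponds to one of the clones created by the split.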

lcorcodilos added 6 commits April 10, 2021 20:53

Drop Tpt alpha variation and switch to regular eval

03a90c0

Analyzer.py: double to single underscore

d72d706

Switch to CollectionOrganizer

b8c3c40

Merge branch 'TptChange' into dev

4167a06

Modify CalibratedVars docs

c61ea98

Initial work (not working)

b245886

lcorcodilos added this to the Beta 2.0 milestone Apr 19, 2021

lcorcodilos removed this from the Beta 2.0 milestone Jun 15, 2021
