When applying any new JEC or variations on the jet momentum, the jets may no longer be ordered in pt. Any HEM issue module will have the same effect. It's currently the responsibility of the user to account for this (if at all).
Accounting for the re-ordering for the HEM issue is not so difficult because of the one-time nature of the module. However, doing the variations on the jet pt is frankly a big headache because each variation in the pt is a new re-ordering (meaning a new collection needs to be created for each variation).
Without TIMBER
There are a few ways around this that do not involve TIMBER automation. Their implications vary depending on the analysis.
1. Just ignore the pt being out of order. If the analysis doesn't make any decisions on pt ordering, there's no need to do anything. For this reason, re-ordering should always be optional and not forced (from the TIMBER development perspective).
2. Process the variations in pt one at a time with either (a) or (b) below. Will be less efficient but that may not matter if one is running on condor and the number of jobs is already low.
3. Process the variations in parallel, tracking each variation of ordering manually but building the dataframe actions concurrently. This is more computationally efficient but requires lots of tracking and is prone to error (unless there's a good TIMBER idea... see below).
For (2) and (3), there are two sub-options.
a. If the analysis is only using 1 or 2 jets, one can simply search for the highest pt values in the new pt vector, extract the indices as new column variables, and use these variables everywhere in place of 0 and 1.
b. Use ReorderCollection()[1] to re-order the entire collection and use the new collection for access to variables (with 0 or 1)
Option (b) is more computationally expensive and it doesn't do much to improve the user interface. Option (a) is a bit more error prone since you're dealing with indices, and debugging an indexing issue can be difficult (or the issue can be hard to identify in the first place).
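To make sub-options (a) and (b) concrete, here is a minimal sketch in plain Python, with lists standing in for the per-event jet vectors. The helper names `leading_indices` and `reorder`, the variable names, and the numbers are illustrative assumptions. They are not TIMBER functions, and `reorder` only mimics the idea behind `ReorderCollection()`.

```python
def leading_indices(pts, n=2):
    """Indices of the n highest-pt jets, highest first (sub-option a)."""
    order = sorted(range(len(pts)), key=lambda i: pts[i], reverse=True)
    return order[:n]


def reorder(values, order):
    """Copy of one jet variable rearranged by `order` (sub-option b)."""
    return [values[i] for i in order]


# Hypothetical event: after an "up" variation the second jet becomes leading.
pt_up = [410.0, 455.0, 120.0]
msoftdrop = [80.0, 175.0, 20.0]

# (a) Keep the original vectors and carry two index variables around.
lead, sublead = leading_indices(pt_up)
leading_mass_a = msoftdrop[lead]

# (b) Rebuild every variable of the collection in the new order.
full_order = leading_indices(pt_up, n=len(pt_up))
pt_sorted = reorder(pt_up, full_order)
msoftdrop_sorted = reorder(msoftdrop, full_order)
leading_mass_b = msoftdrop_sorted[0]
```

Both routes pick the same leading jet. Route (a) stores two integers per variation, while route (b) copies every variable of the collection.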
With TIMBER
Indexing > new collections
Learning from the Without TIMBER section, the best option seems to be to build a new set of indices stored in a separate branch and to direct users to these if they want the re-ordered collection. For example, FatJet_pt[0] would become CalibratedFatJet_pt[JES_index[0]] to get the new leading jet (where JES_index is the pt-ordered list of indices for the original collection).
This is lightweight enough that users could choose not to use these values if they don't care about pt ordering (and in that case there would be no computational penalty, since the column would never be used).
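A rough illustration of the index-branch idea, again in plain Python with lists in place of dataframe columns. The variation names, the `<variation>_index` naming, and the numbers are assumptions made for the example, not existing TIMBER branches.

```python
def pt_ordered_indices(pts):
    """Indices of the original collection sorted by descending pt."""
    return sorted(range(len(pts)), key=lambda i: pts[i], reverse=True)


# Hypothetical calibrated pt per variation for one event with three jets.
calibrated_pt = {
    "nom": [450.0, 440.0, 120.0],
    "up": [455.0, 462.0, 126.0],
    "down": [445.0, 418.0, 114.0],
}

# One small index column per variation; the jet collection itself is not copied.
index_columns = {
    f"{variation}_index": pt_ordered_indices(pts)
    for variation, pts in calibrated_pt.items()
}

# Equivalent of CalibratedFatJet_pt[<variation>_index[0]] for the "up" variation.
leading_up_pt = calibrated_pt["up"][index_columns["up_index"][0]]
```

Only the "up" variation swaps the two leading jets here, so only its index column differs from the identity ordering.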
Simultaneous branch action solutions
As an example, we have something like this...
                 base
                  |
                  1
                  |
                  2
                / | \
               /  |  \
              /   |   \
pt          nom   up   down
variation    |    |     |
             3    3     3
identical    |    |     |
actions      4    4     4
3 & 4
where 1, 2, 3, and 4 are some actions on the dataframe and nom, up, down are the three branches of the processing that change the pt.
Option 1
One solution is to have an AnalyzerGroup() class which parallelizes actions on separate branches of the processing tree. The methods of the AnalyzerGroup are the same as the analyzer but just loop over all analyzer objects in the group to perform the action.
Pros: A new class keeps the logic separable.
Cons: All methods would have to be hard coded or a new generic proxy method would need to be written (making subsequent actions look less like actions on a single analyzer object).
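A minimal sketch of the "generic proxy method" route for Option 1, assuming a toy stand-in for the analyzer. `ToyAnalyzer` is hypothetical and only records the calls it receives; the `AnalyzerGroup` shown here is one possible shape for the proposed class, not existing TIMBER code.

```python
class ToyAnalyzer:
    """Stand-in for an analyzer object; it only records the actions it receives."""

    def __init__(self):
        self.actions = []

    def Cut(self, name, condition):
        self.actions.append(("Cut", name, condition))
        return self

    def Define(self, name, expression):
        self.actions.append(("Define", name, expression))
        return self


class AnalyzerGroup:
    """Forwards any method call to every analyzer in the group."""

    def __init__(self, members):
        self._members = dict(members)

    def __getattr__(self, method_name):
        # Only reached for names that are not found on the group itself.
        if method_name.startswith("_"):
            raise AttributeError(method_name)

        def forward(*args, **kwargs):
            return {
                key: getattr(member, method_name)(*args, **kwargs)
                for key, member in self._members.items()
            }

        return forward


group = AnalyzerGroup({key: ToyAnalyzer() for key in ("nom", "up", "down")})
group.Define("lead_pt", "CalibratedFatJet_pt[0]")
results = group.Cut("pt_cut", "lead_pt > 400")
```

Each call returns a dictionary with one result per branch, not a single object, which is the sense in which actions on the group stop looking like actions on one analyzer.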
Option 2
Modify analyzer() to always track multiple dataframes (the base case being the one dataframe to start). Then analyzer.DataFrame would point to a list of RDataFrames (via Nodes), not a single RDataFrame, and every method acting on a Node would actually loop over all Nodes being tracked. There are some potential complications:
The Nodes need to be tracked via a dictionary with unique keys (maybe NodeGroup class?). In fact, you'd most likely need subkeys pointing to information about the branch. Something like
allCurrentNodes = {
    "key1": {
        "node": Node(...),
        "CalibratedFatJet_*[*]": "CalibratedFatJet_*[key1_idx[*]]", # pattern for index substitution for cool idea below
        ...
    }
}
Snapshots would have to be saved to separate TTrees carefully.
It would be reasonable to assume that subsequent splittings could happen and these will also need to be tracked. It's not clear if these should be nested but that would require nesting analyzer objects which would be a more complicated task.
One should be able to remove nodes from the active group being tracked
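The bookkeeping listed above (unique keys, per-key sub-dictionaries, separate output trees, removing nodes) could look roughly like the sketch below. `ToyNode`, `NodeGroup`, and the `Events_<key>` tree naming are hypothetical placeholders chosen for illustration; the real Node class and snapshot logic would differ.

```python
class ToyNode:
    """Stand-in for a Node; it keeps only the list of actions applied so far."""

    def __init__(self, history=()):
        self.history = tuple(history)

    def apply(self, action):
        # Like a dataframe action, this returns a new node and leaves self unchanged.
        return ToyNode(self.history + (action,))


class NodeGroup:
    """Dictionary of tracked nodes with unique keys and per-key metadata."""

    def __init__(self):
        self._entries = {}

    def add(self, key, node, **metadata):
        if key in self._entries:
            raise KeyError(f"key '{key}' is already tracked")
        self._entries[key] = {"node": node, **metadata}

    def remove(self, key):
        """Stop tracking a branch and hand back its current node."""
        return self._entries.pop(key)["node"]

    def apply(self, action):
        """Apply one action to every tracked node."""
        for entry in self._entries.values():
            entry["node"] = entry["node"].apply(action)

    def tree_names(self, base="Events"):
        """One output tree name per key so snapshots do not collide."""
        return {key: f"{base}_{key}" for key in self._entries}

    def node(self, key):
        return self._entries[key]["node"]


nodes = NodeGroup()
for key in ("nom", "up", "down"):
    nodes.add(key, ToyNode(("1", "2")), index_column=f"{key}_idx")
nodes.apply("3")
dropped = nodes.remove("down")
nodes.apply("4")
```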
Pros: No duplicating of functions/methods and probably less code overall needing to be added/changed. Subdictionaries would be powerful for string substitution. Everything shows up nicely in one PrintNodeTree()!
Cons: Lots of string parsing and substitution which is always error prone and can be hard to debug when the print out is lengthy.
Cool idea: Store the "list" of dataframes/Nodes as a dictionary/NodeGroup and use the subkeys to denote the suffix of the ordering indices. Then do an automatic find/replace on action strings with the key and value pairs in the subdictionary so that one could do
a.Cut("...","CalibratedFatJet_pt[0] > 400")
and get back
CalibratedFatJet_pt[key1_idx[0]] > 400
...
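One way the find/replace could be sketched is with a regular expression, under the assumption that the index inside the brackets is a simple expression with no nested brackets. The function name `substitute_indices` and its arguments are hypothetical; only the input and output strings come from the example above.

```python
import re


def substitute_indices(action, collection, index_name):
    """Rewrite collection_var[i] as collection_var[index_name[i]]."""
    pattern = re.compile(r"\b(" + re.escape(collection) + r"_\w+)\[([^\[\]]+)\]")
    return pattern.sub(
        lambda m: f"{m.group(1)}[{index_name}[{m.group(2)}]]", action
    )


original = "CalibratedFatJet_pt[0] > 400"
rewritten = substitute_indices(original, "CalibratedFatJet", "key1_idx")
print(rewritten)  # CalibratedFatJet_pt[key1_idx[0]] > 400
```

Because the pattern refuses brackets inside the index, running it on an already rewritten string leaves it unchanged. The same restriction means an expression such as `CalibratedFatJet_pt[idx[0]]` written by the user would be skipped, which is the kind of string-parsing corner case listed under the cons.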
lcorcodilos changed the title from "JES/R and HEM module pt re-ordering" to "Pt re-ordering for JES/R and HEM modules" on Apr 2, 2021
lcorcodilos changed the title from "Pt re-ordering for JES/R and HEM modules" to "Pt reordering for JES/R and HEM modules" on Apr 2, 2021
lcorcodilos changed the title from "Pt reordering for JES/R and HEM modules" to "P_T reordering for JES/R and HEM modules" on Apr 2, 2021