Releases: lcorcodilos/TIMBER
Beta version 2.0
Change log
This is perhaps a bigger update than I was intending but I think justifies the release of Beta 2.0. It incorporates PRs #54, #58, #59, #61, #62, #64, and #68. Changes are listed in reverse chronological order in each section.
General
- Add option to "Take" with SubCollection() method.
- Remove SubCollection() being made during CalibrateVars().
- Various robustness and consistency changes.
- Reduce jdl_template default CPU and memory requests for condor
- Remove TIMBER collection structs from Prefire_weight for speed
- Drop `createAllCollections` option from `analyzer`
- Add option to save Runs branch in `analyzer.Snapshot()` (default is to save it)
- In `PrintNodeTree()`, drop subcollection definitions by default.
- Deduce extension for image saved by CompareShapes
- Add `item_meta` to `Group` class for lazy template histograms to be possible
- Change Tpt weight module to drop alpha variation since it's only a normalization effect. Switch the `beta` class method to `eval` and drop `corr` (`eval` now does the nominal and variations).
- Change `__` prefix on private variables to `_` for consistency.
- Create CollectionOrganizer and implement it. Does not create any user-facing changes but provides infrastructure for future features.
- Add hardware::Open option (default to "READ") with inTIMBER option for internal and external paths.
- Add hardware::LoadHisto to load histogram into memory with inTIMBER option for internal and external paths.
- Make Correction/ModuleWorker constructor arguments more logical - pass correctly typed variable instead of a string of that variable.
- Add MakeWeightCols() correlation option to correlate uncertainties that are tied together but had to be calculated separately (ex. two top jet tag SFs being applied).
- Remove repeated clang parsing when cloning ModuleWorker/Correction
- Change lhaid type from str to int
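The correlation option for `MakeWeightCols()` can be pictured with a standalone sketch (hypothetical Python, not TIMBER's actual implementation; the function and variation names are invented): two scale factors that share a systematic source should be varied together, while independent ones each get their own up/down columns.

```python
# Illustration of correlated vs. uncorrelated uncertainty variations for
# two scale factors (e.g. two top jet tag SFs applied separately).
def weight_variations(sf_list, correlated):
    """sf_list: list of (nominal, up, down) tuples -> {variation: total weight}."""
    total_nom = 1.0
    for n, u, d in sf_list:
        total_nom *= n
    out = {"nominal": total_nom}
    if correlated:
        # Correlated: both SFs move up (or down) at the same time.
        up = down = 1.0
        for n, u, d in sf_list:
            up *= u
            down *= d
        out["up"], out["down"] = up, down
    else:
        # Uncorrelated: vary one SF while the others stay nominal.
        for i, (n, u, d) in enumerate(sf_list):
            for tag, val in (("up", u), ("down", d)):
                w = 1.0
                for j, (nj, uj, dj) in enumerate(sf_list):
                    w *= val if i == j else nj
                out["sf%d_%s" % (i, tag)] = w
    return out

corr_w = weight_variations([(1.0, 1.1, 0.9), (1.0, 1.2, 0.8)], correlated=True)
```

With `correlated=True`, the two SFs produce a single up/down pair instead of four independent variation columns.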
New Features
- Nodes now have unique hashes which keep them unique in the analyzer so that Nodes of the same name can be tracked. This is useful in the case where the processing has forked and you'd like to keep node naming consistent across processing branches.
- HistGroup.Add() has made the `name` argument optional; if not specified, it will instead derive it from the hist (via `TObject.GetName()`). However, this will initiate the RDataFrame loop!
- Change genEventCount calculation to genEventSumw (for simulation).
- Argument `extraNominal` added to `MakeWeightCols()` which will scale all weight values (ex. xsec*lumi/genEventSumw).
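The effect of `extraNominal` can be pictured with a toy calculation (a hypothetical standalone sketch; the function name and numbers are made up, not TIMBER's code):

```python
# Sketch: every weight column is scaled by a common normalization factor
# such as xsec*lumi/genEventSumw in addition to the per-event corrections.
def normalized_weight(corrections, xsec, lumi, gen_event_sumw):
    extra_nominal = xsec * lumi / gen_event_sumw  # common scale for all columns
    w = extra_nominal
    for c in corrections:
        w *= c
    return w

# e.g. a 10 pb process at 137/fb with two per-event corrections applied
w = normalized_weight([1.05, 0.98], xsec=10.0, lumi=137000.0, gen_event_sumw=1e6)
```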
New Additions
- `MemoryFile` class to store a string in memory to mimic a file one would `open()`.
- `DictToMarkdownTable()` method to convert a python dictionary to a markdown table (uses `MemoryFile`).
- TIMBER/Tools/AutoPU.py added to automate (in pieces or as a whole) the process of making a pileup weight and applying it.
- `Common.GenerateHash()` added for Node hashes.
- `analyzer.GetColumnNames()` returns a list of all column names that currently exist.
- `hardware::MultiHadamardProduct` for non-nested vectors
- Update GenMatching tools to be better optimized and to take advantage of the new AoS feature
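A dictionary-to-markdown conversion in the spirit of `DictToMarkdownTable()` might look like the following (a minimal sketch with an invented signature; TIMBER's actual argument names and table shape may differ):

```python
# Sketch: render a flat python dict as a two-column markdown table.
def dict_to_markdown_table(d, key_header="key", val_header="value"):
    lines = ["| %s | %s |" % (key_header, val_header),
             "|---|---|"]
    for k, v in d.items():
        lines.append("| %s | %s |" % (k, v))
    return "\n".join(lines)

table = dict_to_markdown_table({"ttbar": 831.76, "QCD": 1370.0}, "sample", "xsec")
print(table)
```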
CMS Algos
- Update luminosity golden jsons.
- Add DeepAK8 CSV reader and top tagging SF module (note that there have been crashes in some instances for this module that are currently being studied).
- HEM and Prefire correction modules added.
- Add JME data tarball information in readme
- Add W and top tagging scale factor modules (only tau21 and tau32+subjet btag supported, respectively)
Pileup
- Add WeightCalculatorFromHistogram (from NanoAOD-tools)
- Add C++ pileup module with "auto" mode to grab npvx distribution from memory
- Add pileup data files and information on where they are from (+script to get them)
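The reweighting idea behind these pileup tools can be sketched in a few lines (toy numbers and a hypothetical function, not the TIMBER module): the weight in each vertex-multiplicity bin is the ratio of the normalized data and simulation distributions.

```python
# Sketch: per-bin pileup weight = (normalized data) / (normalized MC).
def pileup_weights(data_hist, mc_hist):
    data_total = float(sum(data_hist))
    mc_total = float(sum(mc_hist))
    return [(d / data_total) / (m / mc_total) if m > 0 else 0.0
            for d, m in zip(data_hist, mc_hist)]

data_npv = [10, 40, 30, 20]   # toy data npv distribution
mc_npv   = [25, 25, 25, 25]   # toy flat MC npv distribution
weights = pileup_weights(data_npv, mc_npv)  # one weight per npv bin
```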
Bug fixes
- Do not try to get Runs branch if it doesn't exist.
- Fix bug when making new collections using CollectionOrg.AddBranch().
- Cleanup plotting in Plot.py to be more consistent and documentation-ready.
- setup.sh had back-ticks that caused unintended executions, and `return` is more suitable than `exit`.
- Return index from Pythonic::InList rather than a bool
- If ModuleWorker looks in a TIMBER .cc for a function (`eval` typically) and can't find it, look for it in the equivalent .h (since that's where templates live)
Beta version 1.4
Change log
The main addition is the new code to handle JME calibrations and uncertainties. This will be covered at the end since it is the most lengthy. First, some more general changes (some of which are from the JME related work).
Collections as arrays of structs (AoS)
First, a "collection" is used here to describe a group of physics objects stored in NanoAOD format. For example, "Electron" is a collection and has attributes "pt", "eta", etc., all of which are stored in branches of the NanoAOD Events tree named `Electron_pt`, `Electron_eta`, etc.
If the user would like to access a collection, they can simply use `<CollectionName>s`, which is built dynamically in the background as an AoS. Keeping the electron example, this means that `Electrons[0]` will return the leading electron in the event with attributes `pt`, `eta`, etc. that can be accessed as `Electrons[0].pt`, `Electrons[0].eta`, etc.
This turns the physics objects into OOP objects, which has various niceties for more complicated algorithms that need to be written in C++ but require an extensive set of arguments. For example, a generator particle matching algorithm would need several attributes from `GenPart` and from the reco object (say, an AK8 jet), which would make the method definition long and its use prone to error (ex. you need to keep all of the arguments in the right order). With the TIMBER collections, you'd only need two arguments - `GenParts` and `FatJets[i]`, where `i` is the index of the jet you want to use in the matching.
The C++ is dynamically written here.
There are several important notes to make about this feature:

- The object (ex. `FatJets`) doesn't exist until TIMBER detects it in a `Cut` or `Define`, AND the underlying struct (`FatJetStruct`) is also not defined until then. So if you'd like to write a C++ method to take one of these collections as input, you'll need to use a C++ template. Please have a look here for an example of this.
- This feature has penalties! Because there is more being compiled and built, memory usage increases (though not beyond values that are reasonable). It also increases processing time substantially. The difference between calculating the mean of `FatJet_pt[0]` and `FatJets[0].pt` is nearly a factor of 6 (for just this action). Thus, the collections should be reserved for use as input to C++ modules that benefit from having fewer positional arguments. There is a more efficient way to handle the construction of the collections (see issue #36) but this is where I leave it for now.
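The SoA-to-AoS transformation described above can be pictured in pure Python (the real version is dynamically generated C++ structs; the class and branch names here are illustrative):

```python
# NanoAOD stores parallel arrays (FatJet_pt, FatJet_eta, ...); the AoS view
# zips them into one object per jet so a C++ module only needs the object.
class Jet:
    def __init__(self, pt, eta):
        self.pt = pt
        self.eta = eta

def build_collection(pt_branch, eta_branch):
    return [Jet(pt, eta) for pt, eta in zip(pt_branch, eta_branch)]

FatJets = build_collection([450.0, 300.0], [0.5, -1.2])
leading_pt = FatJets[0].pt  # instead of FatJet_pt[0]
```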
General
- Created libtimber, which is the shared library of all modules in TIMBER/Framework. This is compiled by the new Makefile during setup.sh and is loaded (if not already) by CompileCpp (ie. using it doesn't require the analyzer() class!). The JME modules do not work standalone, so these are NOT included in the library when outside of CMSSW.
- Closes Issue #37.
- The analyzer silence attribute can be set to silence the print out from Define and Cut calls.
- Added dedicated method `ReorderCollection()`. This is meant to be used when JECs affect the pt of jets and a re-ordering is needed. Note that this is not done by AutoJME.
- Added `ModuleWorker` class to handle all of the functionality shared by `Correction` and the new `Calibration` (both of which now inherit from `ModuleWorker`).
  - Closes Issue #21.
- To `common.h`, added `TempDir` (handles temporary directory storage) and `ReadTarFile` (opens and streams tarball contents - took a day to get working!). This was necessary because the JECs come in tarballs and untarring 800+ files and holding them in TIMBER is undesirable.
  - This required adding libarchive as a dependency. Added to `setup.sh` the recipe (if it doesn't exist) to download and build it inside of `TIMBER/bin`.
- Added `Node.GetBaseNode()` so that the top most parent can be accessed from a Node (ie. outside the analyzer).
- Kept RunChain as attribute of `analyzer` so that contents can be easily accessed.
- Moved `TIMBER/Framework/ExternalTools/` to `TIMBER/Framework/ext/`
- Organized `TIMBER/Framework/src` and `TIMBER/Framework/include` so that declarations and implementations are split (roughly - there are still some outstanding where they make sense).
- Added "Hadamard product" algorithms to `hardware`.
- Add tcsh setup script
- Add GetWeightName to get column name for certain weight
- Add SaveRunChain to save out the Run TTree (with option to merge it with an existing file like a snapshot of Events)
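The "Hadamard product" here is just an element-wise product, as used to combine per-event weight vectors. A standalone sketch (hypothetical Python stand-in for the `hardware` C++ function):

```python
# Element-wise (Hadamard) product of any number of equal-length vectors,
# e.g. multiplying several per-event weight columns together.
def hadamard(*vectors):
    out = list(vectors[0])
    for v in vectors[1:]:
        out = [a * b for a, b in zip(out, v)]
    return out

combined = hadamard([1.0, 2.0, 3.0], [0.5, 0.5, 2.0])  # -> [0.5, 1.0, 6.0]
```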
JME module work
Modified from PR #38
- Write `JetRecalibrator` class which handles the interfacing to the CMSSW-based tools.
- Write `JMEpaths` class which handles the interfacing to the JME txt files.
- Write `JES_weight` class which is the user-facing module to access the corrections (including recalibrations) and uncertainties.
- Write `JMS_weight` class which just accesses a hard-coded table of values.
- (Re)Write `JetSmearer` class which has the algorithms to evaluate the weights to smear jet energy (pt) and jet mass.
- Write `JER_weight` class which uses `JetSmearer` to calculate the per-jet weights to smear the pt distribution.
- Write `JMR_weight` class which uses `JetSmearer` to calculate the per-jet weights to smear the mass distribution.
NOTE 1: The four `J*_weight` classes all have `eval()` functions which return a vector with the length of the number of jets. Each entry in this vector is another vector, {nominal, up, down}, where "up" and "down" are absolute, not relative, weights to apply to the pt and/or mass.
NOTE 2: The JMS_weight is included for completeness but it's terribly inefficient because no calculation is done - it just creates a vector of length nFatJet which stores the same values over and over again. There's almost certainly a better way to do this but that can be put on the to-do. For now, the uniformity between modules is important.
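Applying an `eval()`-style output of per-jet {nominal, up, down} triplets might look like this (a hypothetical Python sketch with invented numbers; TIMBER does this in C++/RDataFrame):

```python
# One {nominal, up, down} weight triplet per jet; the chosen variation is
# multiplied into each jet's pt (the weights are absolute, not relative).
def apply_variation(jet_pts, weights, variation):
    idx = {"nominal": 0, "up": 1, "down": 2}[variation]
    return [pt * w[idx] for pt, w in zip(jet_pts, weights)]

pts = [400.0, 250.0]
jes = [[1.00, 1.02, 0.98],   # triplet for jet 0
       [1.00, 1.03, 0.97]]   # triplet for jet 1
pts_up = apply_variation(pts, jes, "up")
```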
- Added `CalibrateVar()` method which will actually do the multiplication of a variable by the calibration weight (looping through uncertainty variations as well).
- Added `Calibration` class which doesn't do much different from `ModuleWorker` at the moment (it's just not `Correction`).
- Added JME related data files to TIMBER/data/JES and TIMBER/data/JER
Validation
Validation of the new modules was done using the new TIMBER bench. See PR #38 for validation details.
Beta version 1.3
Change log
Collection of changes made from mid-November through December 2020. Highlights are the weight column calculation fix and improved C++ argument matching.
Setup/install
None
Analyzer
- Add `ObjectFromCollection()` method that creates a subcollection but for just one object in the originating collection.
- Fix the correction/weight collection so that only parent nodes are considered in the weight calculation for a node tree. In other words, if the node/processing tree has split, weights calculated in branch A should not affect those in branch B but they should share any weights calculated before the branches diverged.
  - If one has separate branches, each one needs to have the `MakeWeightCols()` method called. By default it is called on `ActiveNode` but it can take other nodes as input. With this in mind, the method also now takes a name to name a group of weights so that duplicate nodes are not created on the separate branches.
  - Changes for this happened in `__checkCorrections()` (traversing up the tree), the `Node` class (add back parent attribute), and `MakeWeightCols()` (the naming).
- Improved C++ argument matching when building a correction.
  - Will now check against active node columns and not just the base node.
  - `Correction.MakeCall()` changed to take a dict as input instead of a list. Keys are the C++ method argument names (as written in the C++ file) and values are the names of the RDataFrame columns that you'd like to use as function arguments. If there are arguments in the C++ method that are not in the dict, TIMBER will automatically try to determine if each matches a column name and will use that when building the call to the C++ method.
- Added `Range()` method to `analyzer` and `Node` classes to select a subset of data to analyze. Docs include a warning to not use this with `ROOT.EnableImplicitMT()`.
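The argument-matching logic described for `Correction.MakeCall()` can be sketched as a toy function (hypothetical standalone Python; the real matching is done via clang parsing of the C++ source):

```python
# Toy version of the matching: C++ argument names not given explicitly in
# the user's dict are matched against known RDataFrame column names.
def build_call(func_args, user_map, columns):
    call = []
    for arg in func_args:
        if arg in user_map:
            call.append(user_map[arg])      # user-specified column
        elif arg in columns:
            call.append(arg)                # auto-matched to an existing column
        else:
            raise KeyError("no column found for argument '%s'" % arg)
    return call

cols = {"FatJet_pt", "FatJet_eta", "nFatJet"}
args = build_call(["pt", "nFatJet"], {"pt": "FatJet_pt"}, cols)
# -> ["FatJet_pt", "nFatJet"]
```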
Tools
- Add s/sqrt(b) plotting to Tools/Plot.py
- Add function `GetStandardFlags()` to return a list of standard MET filter flags. Used as the default `flagList` for `GetFlagString()`.
- Change `cut` and `not` option defaults in TrigTester.py.
- Consolidated Cutflow* functions and added "initial" count to be included when producing the cutflow.
Modules
- Staging JetSmearer.h, JetRecalib.h, fatJetUncertainties.cc, and JetMETinfo.h (includes commented-out bit in common.h)
- Change `Trigger_weight.cc` default plateau to -1
Pythonic.h
- Create `Pythonic` namespace.
- Add header guards.
- Add `IsDir()` and `Execute()` functions.
- Updated naming so all functions are capitalized
- Add EffLoader.cc module #24
Data
None
Testing
- Fix tests so `AddCorrections()` changes work.
- Fix test_Common.py so it works. Add in actual tests for Cutflow* functions.
- Add test for `Range()`
More documentation
- Changed error for multiple nodes of the same name to a warning.
- Add transparent logo.
- Example 1 (`examples/ex1.py`) now includes an example of using `Range()` and explains to not use `ROOT.EnableImplicitMT()`.
Issues that were addressed
- Fix file reading from afs
- Return ActiveNode with SubCollection method.
- When providing a `dict` to `Node.SetChildren()`, the code was checking if the keys of the `dict` were of type `Node`. Fixed to check the `dict` values.
- In C++ modules, switch `int` to `size_t` in `for` loops.
- Fix "Library compiling doesn't play nice with periods in TIMBERPATH"
- Fix TIMBER/data/README.md
- Implement `corr` Correction type deduction
- Fix `CompareShapes()` so that it works with empty bkgs, signals, and colors correctly.
- Allow for default arguments when doing C++ clang parsing and automatic calls to correction methods.
Beta version 1.2
Change log
Combination of #17, #18, and #19.
Setup/install
- Added `boost` dependency information to the main README.md (needed for `LumiFilter.h`)
Modules
- Added GenMatching.h which can be used to reconstruct the entire generator particle decay tree from the mother indexes stored in the NanoAOD. This is useful for traversing the entire decay chain with relative ease. Example added in "How to use GenMatching.h".
- Added LumiFilter.h which can be used in conjunction with the newly added golden JSONs to filter data based on the JSONs.
- Added HistLoader.h which can be used to load in a histogram once before processing, with access to the histogram via the class methods while looping over the RDataFrame entries. The `eval` method returns based on the input axis value and `eval_bybin` returns based on the provided bin number.
- Added TopPt_weight.cc which calculates the top pT correction from the TOP group based on the data/POWHEG+Pythia8 fit. The nominal correction is calculated with the `corr()` method and variations of the constants in the exponential form can be calculated using the `alpha()` and `beta()` methods.
- Added Trigger_weight.cc which uses `HistLoader.h` to load a trigger efficiency histogram. The `eval()` method returns the efficiency for that event (based on the input variable, of course) and calculates the uncertainty as one-half the trigger inefficiency.
- Rename "analyzer" namespace to "hardware" in `common.h`. Done for clarity in the documentation to avoid confusion with the `Analyzer` python namespace (aka Analyzer.py).
- Change `hardware::invariantMass()` argument to be a vector of Lorentz vectors. The invariant mass of all provided vectors is calculated.
- Moved `Framework/src/Collection.cc` to `Framework/include/Collection.h`
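The golden-JSON filtering that LumiFilter.h performs reduces to a simple lookup, sketched here in Python with a toy JSON (hypothetical code, not the C++ module; the run number and ranges are invented):

```python
# A golden JSON maps run numbers to certified [first, last] lumi-section
# ranges; an event is kept only if its (run, lumi) falls in one of them.
golden = {"297050": [[1, 10], [25, 30]]}  # toy stand-in for a golden JSON

def pass_golden(run, lumi, json_dict):
    for lo, hi in json_dict.get(str(run), []):
        if lo <= lumi <= hi:
            return True
    return False

keep = pass_golden(297050, 27, golden)  # lumi 27 is in the [25, 30] range
```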
Testing
- Added a draft of test_modules.py which features an example for TopPt_weight.cc, but it is currently commented out because the test file does not have the `GenPart` collection or the `FatJet` collection (and is also not a ttbar set).
- Added make_test_file.py to make a small testing histogram.
- Added small testing histogram generated by `make_test_file.py`.
Data
- Added golden JSONs for 2017 and 2018 and added info to the README ledger. It seems 2016 does not have a golden JSON anymore (?)
Analyzer
- Add `corr` type for the `Correction()` class. It represents a correction with no uncertainty. The clang parsing CANNOT currently derive it automatically from the C++ script but it can be assigned as the `corrtype` via the argument to the `Correction()` constructor.
- Optimized `MakeTemplateHistos()` to book histogram actions before looping. They previously looped over the dataframe one after the other. This provides a significant speed up.
More documentation
- Added page on how to use GenMatching.h in a custom C++ module with the example of finding how many prongs are merged in a top jet.
- Added docs to `Pythonic.h`
- Added docs to `common.h`
- Added docs to `PDFweight_uncert.cc`
- Added docs to `SJBtag_SF.cc`
- Consolidate the READMEs for sections so the webpage makes more sense.
- Switch to MathJax for formula rendering
Small bug fixes
- More robust python version checking for ASCII encoding in `OpenJSON()`.
- Fix `PrintNodeTree()` for cross-system compatibility. The `networkx` package is used to create the graph, which can be drawn with a number of tools. TIMBER was using `pygraphviz` which, to be installed, needs the development library of `graphviz`. While it's easy to get this on Ubuntu or macOS, it is not available on either the LPC or LXPLUS servers and we can't install it without a bit of a headache. Thus, `pydot` is now used with `networkx` instead since it does not have the same build dependencies. However, the version of `graphviz` on the system (aka `dot`) cannot always write out to modern image formats like PNG. The solution is for TIMBER to attempt to save the requested format and, if it isn't possible, save out the .dot file for later conversion. Instructions on how to convert the .dot to something else locally were added to the FAQ section of the docs.
Beta version 1.1
NOTE: These are copied excerpts from #16
Benchmarks
- Benchmarks 1-9 added in `benchmarks/ex*.py`. Some internal comments included about what was done. The CMS Open Data sample included in the examples/ folder does not have electrons so it was not used for benchmarks 7 or 8 (these need the tester to use their own private file, which this repo does not provide).
- Filled out more of the general testing with pytest.
New to analyzer
- `Close()`: Implemented to safely delete an analyzer instance.
- `__str__`: Implemented to provide an informational printout when `print(<analyzer>)` is called.
- Can specify Node type as a `Cut` and `Define` argument if you have a specific type you'd like to track.
- `SubCollection()`: Creates a named sub-collection based on some discriminant where the sub-collection has all of the same branches as the parent but only includes vector entries that passed the discriminant.

  NOTE: `myColl_var1` is an `RVec` and so `myColl_var1 > 5` returns a vector the same size as `myColl_var1` but filled with bools for each entry. These bools determine which entries of the `RVec`s of the sub-collection branches are made.
```python
a = analyzer(...)
# Say there is a collection "myColl" with branches "myColl_var1", "myColl_var2", "myColl_var3"
a.SubCollection("mySubColl","myColl","myColl_var1 > 5")
# Now there is a new collection "mySubColl" with branches "mySubColl_var1", "mySubColl_var2", "mySubColl_var3"
# which only have values where myColl_var1 > 5
```
- `MergeCollections()`: Creates a new collection which is a merge of all provided collections. The new collection has the variables that are common between the collections being merged.
- `CommonVars()`: Finds the common variables between a set of collections (provided as a list of names).
- `PrintNodeTree()`: Added optional argument `toSkip=[]` which skips plotting any nodes of the types specified by `toSkip`. Note that the function checks for the type in `toSkip` as a substring of the type of the Node. So if you provide `toSkip=["Define"]`, all nodes of type "MergeDefine" and "SubCollDefine" will also be dropped.
  - Also switched to using `networkx` (which uses `pygraphviz`).
- `MakeHistsWithBinning()`: Batch creates histograms at the current `ActiveNode` based on the input `histDict` which is formatted as `{[<column name>]: <binning tuple>}`. The dimensions of the returned histograms are determined from the size of `[<column name>]`.
  - `[<column name>]` is a list of column names that you'd like to plot against each other in [x,y,z] order.
  - `binning_tuple` is the set of arguments that would normally be passed to `TH1`.
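The `histDict` convention might look like the following (a sketch with invented column names; since python lists are unhashable, tuples stand in for the `[<column name>]` keys here, and TIMBER's exact key format may differ):

```python
# Keys list the columns to plot against each other ([x], [x, y], ...);
# values are the usual TH1-style binning tuples. The histogram dimension
# is deduced from the number of columns in the key.
hist_dict = {
    ("lead_pt",): (40, 0, 2000),                            # 1D
    ("lead_pt", "sublead_pt"): (40, 0, 2000, 40, 0, 2000),  # 2D
}

def hist_dimension(columns):
    return len(columns)

dims = [hist_dimension(cols) for cols in hist_dict]
```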
New to Node
- Add "types" to Nodes to denote what was done to produce the Node. Currently used for controlling nodes present in `PrintNodeTree()` output. Current possible types are "Define", "Cut", "MergeDefine", "SubCollDefine", "Correction".
- `Close()`: Implemented to safely delete a Node instance.
- `__str__`: Implemented to provide an informational printout when `print(<node>)` is called.
New to HistGroup
- `Merge()`: Adds together all of the histograms in the group and returns the output histogram.
New to C++ Code
`common.h`

- `transverseMass()` to get the transverse mass of MET + one object. Could be more generalized.
- 2nd constructor for `TLvector()` that takes `RVec`s as arguments rather than floats (returns back an `RVec` of `PtEtaPhiMVector`s)
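For reference, the massless-object transverse mass that a `transverseMass()`-style function computes is mT = sqrt(2 * pt1 * pt2 * (1 - cos(dphi))). A quick sketch (hypothetical Python mirror of the C++; the real function may also handle masses):

```python
import math

# Transverse mass of MET plus one object, assuming both are massless:
# mT^2 = 2 * pt_MET * pt_obj * (1 - cos(dphi)).
def transverse_mass(met_pt, met_phi, obj_pt, obj_phi):
    return math.sqrt(2.0 * met_pt * obj_pt * (1.0 - math.cos(met_phi - obj_phi)))

# Back-to-back object and MET (dphi = pi): mT = 2*sqrt(pt1*pt2)
mt = transverse_mass(100.0, 0.0, 100.0, math.pi)  # -> 200.0
```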
Small bug fixes
- Fix `Common.py` TIMBER imports
- Make `fileName` attribute public (used for the new `__str__` method for printing the `analyzer` object)
- Add `BaseNode` to `AllNodes` for tracking
- Force `BaseNode` to zero children on initialization to avoid memory issues
- Fix `Group` addition
Beta version 1.0
With the most recent changes, I believe we've exited any sort of alpha and so the project will now start doing tags and releases to track development and allow users to check if they have the latest version of TIMBER (or grab a development version).