
Releases: lcorcodilos/TIMBER

Beta version 2.0

15 Jun 21:53

Change log

This is perhaps a bigger update than I was intending but I think it justifies the release of Beta 2.0. It incorporates PRs #54, #58, #59, #61, #62, #64, and #68. Changes are listed in reverse chronological order in each section.

General

  • Add option to "Take" with SubCollection() method.
  • Remove SubCollection() being made during CalibrateVars().
  • Various robustness and consistency changes.
  • Reduce jdl_template default CPU and memory requests for condor
  • Remove TIMBER collection structs from Prefire_weight for speed
  • Drop createAllCollections option from analyzer
  • Add option to save Runs branch in analyzer.Snapshot() (default is to save it)
  • In PrintNodeTree(), drop subcollection definitions by default.
  • Deduce extension for image saved by CompareShapes
  • Add item_meta to Group class for lazy template histograms to be possible
  • Change Tpt weight module to drop alpha variation since it's only a normalization effect. Switch the beta class method to eval and drop corr (eval now does the nominal and variations).
  • Change __ prefix on private variables to _ for consistency.
  • Create CollectionOrganizer and implement it. Does not create any user-facing changes but provides infrastructure for future features.
  • Add hardware::Open option (default to "READ") with inTIMBER option for internal and external paths.
  • Add hardware::LoadHisto to load histogram into memory with inTIMBER option for internal and external paths.
  • Make Correction/ModuleWorker constructor arguments more logical - pass correctly typed variable instead of a string of that variable.
  • Add MakeWeightCols() correlation option to correlate uncertainties that are tied together but had to be calculated separately (ex. two top jet tag SFs being applied).
  • Remove repeated clang parsing when cloning ModuleWorker/Correction
  • Change lhaid type from str to int

New Features

  • Nodes now have unique hashes which keep them unique in the analyzer so that Nodes of the same name can be tracked. This is useful in the case where the processing has forked and you'd like to keep node naming consistent across processing branches.
  • HistGroup.Add() now has an optional name argument; if not specified, the name is derived from the hist (via TObject.GetName()). However, this will initiate the RDataFrame loop!
  • Change genEventCount calculation to genEventSumw (for simulation).
  • Argument extraNominal added to MakeWeightCols() which will scale all weight values (ex. xsec*lumi/genEventSumw).
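The extraNominal scaling amounts to simple per-event arithmetic. A minimal pure-Python sketch of that arithmetic (illustrative only; not TIMBER's actual MakeWeightCols() implementation, and the function name here is hypothetical):

```python
# Hypothetical sketch of the scaling MakeWeightCols() applies when
# extraNominal is provided, e.g. extraNominal = xsec*lumi/genEventSumw.
def scaled_weight(event_weight, xsec, lumi, gen_event_sumw):
    """Scale a per-event weight by xsec*lumi/genEventSumw."""
    extra_nominal = xsec * lumi / gen_event_sumw
    return event_weight * extra_nominal

# For xsec=10, lumi=137, genEventSumw=1370 the extra factor is 1.0,
# so the event weight is unchanged in this toy case.
w = scaled_weight(1.2, xsec=10.0, lumi=137.0, gen_event_sumw=1370.0)
```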

New Additions

  • MemoryFile class to store a string in memory to mimic a file one would open().
  • DictToMarkdownTable() method to convert python dictionary to markdown table (uses MemoryFile).
  • TIMBER/Tools/AutoPU.py added to automate (in pieces or as a whole) the processes of making a pileup weight and applying it.
  • Common.GenerateHash() added for Node hashes.
  • analyzer.GetColumnNames() returns a list of all column names that currently exist.
  • hardware::MultiHadamardProduct for non-nested vectors
  • Update GenMatching tools to be better optimized and to take advantage of new AoS feature
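To illustrate the kind of conversion DictToMarkdownTable() performs, here is a minimal pure-Python sketch (not TIMBER's implementation, which also uses MemoryFile; the function and header names are illustrative):

```python
def dict_to_markdown_table(d, key_header="Key", val_header="Value"):
    # Build a two-column markdown table from a flat dictionary.
    lines = [f"| {key_header} | {val_header} |", "|---|---|"]
    for k, v in d.items():
        lines.append(f"| {k} | {v} |")
    return "\n".join(lines)

table = dict_to_markdown_table({"2017": "Cert_2017.json"})
```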

CMS Algos

  • Update luminosity golden jsons.
  • Add DeepAK8 CSV reader and top tagging SF module (note that there have been crashes in some instances for this module that are currently being studied).
  • HEM and Prefire correction modules added.
  • Add JME data tarball information in readme
  • Add W and top tagging scale factor modules (only tau21 and tau32+subjet btag supported, respectively)

Pileup

  • Add WeightCalculatorFromHistogram (from NanoAOD-tools)
  • Add C++ pileup module with "auto" mode to grab npvx distribution from memory
  • Add pileup data files and information on where they are from (+script to get them)

Bug fixes

  • Do not try to get Runs branch if it doesn't exist.
  • Fix bug when making new collections using CollectionOrg.AddBranch().
  • Cleanup plotting in Plot.py to be more consistent and documentation-ready.
  • setup.sh had back-ticks that caused unintended command execution; also, return is more suitable than exit.
  • Return index from Pythonic::InList rather than a bool
  • If ModuleWorker looks in a TIMBER .cc for a function (eval typically) and can't find it, look for it in the equivalent .h (since that's where templates live)

Beta version 1.4

19 Mar 02:47
c2dcde5

Change log

The main addition is the new code to handle JME calibrations and uncertainties. This will be covered at the end since it is the most lengthy. First, some more general changes (some of which are from the JME related work).

Collections as arrays of structs (AoS)

First, a "collection" is used here to describe a group of physics objects stored in NanoAOD format. For example, "Electron" is a collection and has attributes "pt", "eta", etc., all of which are stored in branches of the NanoAOD Events tree named Electron_pt, Electron_eta, etc.

If the user would like to access a collection, they can simply use <CollectionName>s which is built dynamically in the background as an AoS. Keeping the electron example, this means that Electrons[0] will return the leading electron in the event with attributes pt, eta, etc that can be accessed as Electrons[0].pt, Electrons[0].eta, etc.

This turns the physics objects into OOP objects, which has various niceties for more complicated algorithms that need to be written in C++ but require an extensive set of arguments. For example, a generator-particle matching algorithm would need several attributes from GenPart and from the reco object (say, an AK8 jet), which would make the method definition long and its use prone to error (ex. you need to keep all of the arguments in the right order). With the TIMBER collections, you'd only need two arguments - GenParts and FatJets[i], where i is the index of the jet you want to use in the matching.

The C++ is dynamically written here.

There are several important notes to make about this feature:

  • The object (ex. FatJets) doesn't exist until TIMBER detects it in a Cut or Define AND the underlying struct (FatJetStruct) is also not defined yet. So if you'd like to write a C++ method to take one of these collections as input, you'll need to use a C++ template. Please have a look here for an example of this.
  • This feature has penalties! Because there is more being compiled and built, memory usage increases (though not beyond values that are reasonable). It also increases processing time substantially. The difference between calculating the mean of FatJet_pt[0] and FatJets[0].pt is nearly a factor of 6 (for just this action). Thus, the collections should be reserved for using as input to C++ modules that benefit from having fewer positional arguments. There is a more efficient way to handle the construction of the collections (see issue #36) but this is where I leave it for now.
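The structural idea behind the AoS feature can be illustrated in pure Python (TIMBER generates the equivalent C++ structs automatically; the names below are illustrative only):

```python
from dataclasses import dataclass

# One "struct" per physics object (TIMBER's generated C++ equivalent
# would be something like ElectronStruct).
@dataclass
class Electron:
    pt: float
    eta: float

# Struct-of-arrays: how NanoAOD branches are actually stored.
Electron_pt  = [55.2, 31.0]
Electron_eta = [0.4, -1.2]

# Zipped into an array of structs, mirroring TIMBER's dynamic "Electrons":
Electrons = [Electron(pt, eta) for pt, eta in zip(Electron_pt, Electron_eta)]

leading_pt = Electrons[0].pt  # leading-electron pt, accessed OOP-style
```

The zipping step is exactly the extra work (and the performance penalty) described above: the flat branches already exist, and the AoS view is built on top of them.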

General

  • Created libtimber which is the shared library of all modules in TIMBER/Framework. This is compiled by the new Makefile during setup.sh and is loaded (if not already) by CompileCpp (i.e. using it does not require the analyzer() class!). The JME modules do not work standalone, so they are NOT included in the library when outside of CMSSW.
    • Closes Issue #37.
  • The analyzer silence attribute can be set to silence the print out from Define and Cut calls.
  • Added dedicated method ReorderCollection(). This is meant to be used when JECs affect the pt of jets and a re-ordering is needed. Note that this is not done by AutoJME.
  • Added ModuleWorker class to handle all of the functionality shared by Correction and the new Calibration (both of which now inherit from ModuleWorker).
    • Closes Issue #21.
  • To common.h, added TempDir (handles temporary directory storage) and ReadTarFile (opens and streams tarball contents - took a day to get working!) This was necessary because the JECs come in tarballs and untarring 800+ files and holding them in TIMBER is undesirable.
    • This required adding libarchive as a dependency. Added to setup.sh the recipe (if it doesn't exist) to download and build it inside of TIMBER/bin
  • Added Node.GetBaseNode() so that the topmost parent can be accessed from a Node (ie. outside the analyzer).
  • Kept RunChain as attribute of analyzer so that contents can be easily accessed.
  • Moved TIMBER/Framework/ExternalTools/ to TIMBER/Framework/ext/
  • Organized TIMBER/Framework/src and TIMBER/Framework/include so that declarations and implementations are split (roughly - there are still some exceptions where they make sense).
  • Added to hardware "Hadamard product" algorithms.
  • Add tcsh setup script
  • Add GetWeightName to get column name for certain weight
  • Add SaveRunChain to save out the Run TTree (with option to merge it with an existing file like a snapshot of Events)

JME module work

Modified from PR #38

  • Write JetRecalibrator class which handles the interfacing to the CMSSW based tools.
  • Write JMEpaths class which handles the interfacing to the JME txt files
  • Write JES_weight class which is the user-facing module to access the corrections (including recalibrations) and uncertainties.
  • Write JMS_weight class which just accesses a hard-coded table of values.
  • (Re)Write JetSmearer class which has the algorithms to evaluate the weights to smear jet energy (pt) and jet mass.
  • Write JER_weight class which uses JetSmearer to calculate the weights per-jet to smear the pt distribution.
  • Write JMR_weight class which uses JetSmearer to calculate the weights per-jet to smear the mass distribution.

NOTE 1: The four J*_weight classes all have eval() functions which return a vector with length of the number of jets. Each entry in this vector is another vector, {nominal, up, down} where "up" and "down" are absolute, not relative, weights to apply to the pt and/or mass.
NOTE 2: The JMS_weight is included for completeness but it's terribly inefficient because no calculation is done - it just creates a vector of length nFatJet which stores the same values over and over again. There's almost certainly a better way to do this but that can be put on the to-do. For now, the uniformity between modules is important.

  • Added CalibrateVar() method which will actually do the multiplication of a variable by the calibration weight (looping through uncertainty variations as well).
  • Added Calibration class which doesn't do much different from ModuleWorker at the moment (it's just not Correction).
  • Added JME related data files to TIMBER/data/JES and TIMBER/data/JER
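The {nominal, up, down} shape described in NOTE 1 can be sketched in pure Python; this is only an illustration of how such per-jet weight vectors would be applied to pt (the function name is hypothetical, and CalibrateVar() handles this inside TIMBER):

```python
# Apply one variation of per-jet {nominal, up, down} weights to jet pt.
# The weights are absolute (not relative), per NOTE 1 above.
def apply_variation(jet_pts, weights, variation=0):
    # variation index: 0 = nominal, 1 = up, 2 = down
    return [pt * w[variation] for pt, w in zip(jet_pts, weights)]

jet_pts = [400.0, 250.0]
weights = [[1.02, 1.05, 0.99],   # jet 0: {nominal, up, down}
           [0.98, 1.01, 0.95]]   # jet 1: {nominal, up, down}
nominal_pts = apply_variation(jet_pts, weights, 0)
up_pts      = apply_variation(jet_pts, weights, 1)
```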

Validation

Validation of the new modules was done using the new TIMBER bench. See PR #38 for validation details.

Beta version 1.3

06 Jan 22:53

Change log

Collection of changes made from mid-November through December 2020. Highlights are the weight column calculation fix and improved C++ argument matching.

Setup/install

None

Analyzer

  • Add ObjectFromCollection() method that creates a subcollection but for just one object in the originating collection.
  • Fix the correction/weight collection so that only parent nodes are considered in the weight calculation for a node tree. In other words, if the node/processing tree has split, weights calculated in branch A should not affect those in branch B but they should share any weights calculated before the branches diverged.
    • If one has separate branches, each one needs to have the MakeWeightCols() method called. By default it is called on the ActiveNode but it can take other nodes as input. With this in mind, the method also now takes a name to name a group of weights so that duplicate nodes are not created on the separate branches.
    • Changes for this happened in __checkCorrections() (traversing up the tree), the Node class (add back parent attribute), and MakeWeightCols() (the naming).
  • Improved C++ argument matching when building a correction.
    • Will now check against active node columns and not just the base node.
    • Correction.MakeCall() changed to take a dict as input instead of a list. Keys are the C++ method argument names (as written in the C++ file) and values are the names of the RDataFrame columns that you'd like to use as function arguments. If there are arguments in the C++ method that are not in the dict, TIMBER will automatically try to determine if it matches a column name and will use that when building the call to the C++ method.
  • Added Range() method to analyzer and Node classes to select a subset of data to analyze. Docs include warning to not use this with ROOT.EnableImplicitMT()
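The argument-matching logic described above can be sketched in pure Python (illustrative only, not TIMBER's implementation; the function name is hypothetical): explicit dict entries win, otherwise TIMBER falls back to a column of the same name.

```python
# Sketch of building a C++ call: for each C++ argument name, use the
# user-provided column if given, else auto-match against known columns.
def build_call_args(cpp_arg_names, user_map, known_columns):
    args = []
    for name in cpp_arg_names:
        if name in user_map:
            args.append(user_map[name])   # user-specified column
        elif name in known_columns:
            args.append(name)             # auto-matched by name
        else:
            raise KeyError(f"No match for C++ argument '{name}'")
    return args

cols = {"FatJet_pt", "FatJet_eta", "nFatJet"}
call = build_call_args(["pt", "nFatJet"], {"pt": "FatJet_pt"}, cols)
```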

Tools

  • Add s/sqrt(b) plotting to Tools/Plot.py
  • Add function GetStandardFlags() to return list of standard MET filter flags. Used as default flagList for GetFlagString().
  • Change cut and not option defaults in TrigTester.py.
  • Consolidated Cutflow* functions and added "initial" count to be included when producing the cutflow.

Modules

  • Staging JetSmearer.h, JetRecalib.h, fatJetUncertainties.cc, and JetMETinfo.h (includes commented-out bit in common.h)
  • Change Trigger_weight.cc default plateau to -1

Pythonic.h

  • Create Pythonic namespace.
  • Add header guards.
  • Add IsDir() and Execute() functions.
  • Updated naming so all functions are capitalized
  • Add EffLoader.cc module #24

Data

None

Testing

  • Fix tests so AddCorrections() changes work.
  • Fix test_Common.py so it works. Add in actual tests for Cutflow* functions.
  • Add test for Range()

More documentation

  • Changed error for multiple nodes of the same name to a warning.
  • Add transparent logo.
  • Example 1 (examples/ex1.py) now includes example of using Range() and explains to not use ROOT.EnableImplicitMT().

Issues that were addressed

  • Fix file reading from afs
  • Return ActiveNode with SubCollection method.
  • When providing a dict to Node.SetChildren(), the code was checking if the keys of the dict were of type Node. Fixed to check the dict values.
  • In C++ modules, switch int to size_t in for loops.
  • Fix "Library compiling doesn't play nice with periods in TIMBERPATH"
  • Fix TIMBER/data/README.md
  • Implement corr Correction type deduction
  • Fix CompareShapes() so that it works with empty bkgs, signals, and colors correctly.
  • Allow for default arguments when doing C++ clang parsing and automatic calls to correction methods.

Beta version 1.2

14 Nov 01:50
4a94ff9

Change log

Combination of #17, #18, and #19.

Setup/install

  • Added boost dependency information to main README.md (needed for LumiFilter.h)

Modules

  • Added GenMatching.h which can be used to reconstruct the entire generator particle decay tree from the mother indexes stored in the NanoAOD. This is useful for traversing the entire decay chain with relative ease. Example added in How to use GenMatching.h.
  • Added LumiFilter.h which can be used in conjunction with the newly added golden JSONs to filter data based on the JSONs.
  • Added HistLoader.h which can be used to load in a histogram once before processing, with access to the histogram via the class methods while looping over the RDataFrame entries. The eval module returns based on the input axis value and eval_bybin returns based on the provided bin number.
  • Added TopPt_weight.cc which calculates the top pT correction from the TOP group based on the data/POWHEG+Pythia8 fit. The nominal correction is calculated with the corr() method and variations of the constants in the exponential form can be calculated using the alpha() and beta() methods.
  • Added Trigger_weight.cc which uses HistLoader.h to load a trigger efficiency histogram. The eval() method returns the efficiency for that event (based on the input variable, of course) and calculates the uncertainty as one-half the trigger inefficiency.
  • Rename "analyzer" namespace to "hardware" in common.h. Done for clarity in the documentation to avoid confusion with the Analyzer python namespace (aka Analyzer.py).
  • Change hardware::invariantMass() argument to be a vector of Lorentz vectors. Invariant mass of all provided vectors is calculated.
  • Moved Framework/src/Collection.cc to Framework/include/Collection.h
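The physics behind the new hardware::invariantMass() signature is straightforward: sum the provided four-vectors and take the invariant mass of the result. A pure-Python sketch under that assumption (four-vectors written as (E, px, py, pz) tuples here for simplicity, whereas TIMBER uses PtEtaPhiM Lorentz vectors):

```python
import math

# Invariant mass of the vector sum: M^2 = (sum E)^2 - |sum p|^2
def invariant_mass(vectors):
    E  = sum(v[0] for v in vectors)
    px = sum(v[1] for v in vectors)
    py = sum(v[2] for v in vectors)
    pz = sum(v[3] for v in vectors)
    return math.sqrt(max(E*E - px*px - py*py - pz*pz, 0.0))

# Two back-to-back massless particles with E = 50 each -> M = 100
m = invariant_mass([(50.0, 0.0, 0.0, 50.0), (50.0, 0.0, 0.0, -50.0)])
```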

Testing

  • Added a draft of test_modules.py which features an example for TopPt_weight.cc but it is currently commented out because the test file does not have the GenPart collection or the FatJet collection (and is also not a ttbar set)
  • Added make_test_file.py to make a small testing histogram.
  • Added small testing histogram generated by make_test_file.py.

Data

  • Added golden JSONs for 2017 and 2018 and added info to the README ledger. It seems 2016 does not have a golden JSON anymore (?)

Analyzer

  • Add corr type for Correction() class. It represents a correction with no uncertainty. The clang parsing CANNOT currently derive it automatically from the C++ script but it can be assigned as the corrtype via the argument to Correction() constructor.
  • Optimized MakeTemplateHistos() to book histogram actions before looping. They previously looped over the dataframe one after the other. This provides a significant speed up.

More documentation

  • Added page on how to use GenMatching.h in a custom C++ module with the example of finding how many prongs are merged in a top jet.
  • Added docs to Pythonic.h
  • Added docs to common.h
  • Added docs to PDFweight_uncert.cc
  • Added docs to SJBtag_SF.cc
  • Consolidate the READMEs for sections so the webpage makes more sense.
  • Switch to MathJax for formula rendering

Small bug fixes

  • More robust python version checking for ASCII encoding in OpenJSON().
  • Fix PrintNodeTree() for cross-system compatibility. The networkx package is used to create the graph which can be drawn with a number of tools. TIMBER was using pygraphviz which, to be installed, needs the development library of graphviz. While it's easy to get this on Ubuntu or macOS, it is not available on either LPC or LXPLUS servers and we can't install it without a bit of a headache. Thus, pydot is now used with networkx instead since it does not have the same build dependencies. However, the version of graphviz on the system (aka dot) cannot always write out to modern image formats like PNG. The solution is for TIMBER to attempt to save the requested format and if it isn't possible, save out the .dot file for later conversion. Instructions on how to convert the .dot to something else locally were added to the FAQ section of the docs.

Beta version 1.1

26 Oct 14:38
620df99

NOTE: These are copied excerpts from #16

Benchmarks

  • Benchmarks 1-9 added in benchmarks/ex*.py. Some internal comments included about what was done. The CMS Open Data sample included in the examples/ folder does not have electrons so it was not used for benchmarks 7 or 8 (these need the tester to use their own private file which this repo does not provide).
  • Filled out more of the general testing with pytest.

New to analyzer

  • Close(): Implemented to safely delete an analyzer instance.
  • __str__: Implemented to provide an informational printout when print(<analyzer>) is called.
  • Can specify a Node type as a Cut and Define argument if you have a specific type you'd like to track.
  • SubCollection(): Creates a named sub-collection based on some discriminant where the sub-collection has all of the same branches as the parent but only includes vector entries that passed the discriminant.
    NOTE: myColl_var1 is an RVec and so myColl_var1 > 5 returns a vector the same size as myColl_var1 but filled with bools for each entry. These bools determine which entries of the RVecs of the sub-collection branches are made.
a = analyzer(...)
# Say there is a collection "myColl" with branches "myColl_var1", "myColl_var2", "myColl_var3"
a.SubCollection("mySubColl","myColl","myColl_var1 > 5")
# Now there is a new collection "mySubColl" with branches "mySubColl_var1", "mySubColl_var2", "mySubColl_var3"
# which only have values where myColl_var1 > 5
  • MergeCollections(): Creates a new collection which is a merge of all provided collections. New collection has variables that are common between collections being merged.
  • CommonVars(): Finds the common variables between a set of collections (provided as a list of names).
  • PrintNodeTree(): Added optional argument toSkip=[] which skips plotting any nodes of types specified by toSkip. Note that the function checks for the type in toSkip as a substring of the type of the Node. So if you provide toSkip=["Define"] all nodes of type "MergeDefine" and "SubCollDefine" will also be dropped.
    • Also switched to using networkx (which uses pygraphviz).
  • MakeHistsWithBinning(): Batch creates histograms at the current ActiveNode based on the input histDict which is formatted as {[<column name>]: <binning tuple>}. The dimensions of the returned histograms are determined from the size of [<column name>].
    • [<column name>] is a list of column names that you'd like to plot against each other in [x,y,z] order
    • binning_tuple is the set of arguments that would normally be passed to TH1.
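A hypothetical histDict in the format described above (the column names and binning values are illustrative; tuples stand in for the [<column name>] lists since dict keys must be hashable):

```python
# Keys list the columns to plot (in [x, y] order); values are TH1-style
# binning tuples. The histogram dimension follows from the key length.
hist_dict = {
    ("lead_pt",):            (40, 0.0, 2000.0),                # 1D hist
    ("lead_pt", "lead_eta"): (40, 0.0, 2000.0, 24, -2.4, 2.4)  # 2D hist
}

dims = {cols: len(cols) for cols in hist_dict}
```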

New to Node

  • Add "types" to Nodes to denote what was done to produce the Node. Currently used for controlling nodes present in PrintNodeTree() output. Current possible types are "Define", "Cut", "MergeDefine", "SubCollDefine", "Correction".
  • Close(): Implemented to safely delete a Node instance.
  • __str__: Implemented to provide an informational printout when print(<node>) is called.

New to HistGroup

  • Merge(): Adds together all of the histograms in the group and returns the output histogram.

New to C++ Code

common.h

  • transverseMass() to get transverse mass of MET + one object. Could be more generalized.
  • 2nd constructor for TLvector() that takes RVecs as arguments rather than floats (returns back an RVec of PtEtaPhiMVectors)
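For reference, the standard transverse-mass formula that a function like transverseMass() would implement for MET + one object, sketched in pure Python (this is the textbook formula, not TIMBER's C++; the function name here is illustrative):

```python
import math

# mT^2 = 2 * pt1 * pt2 * (1 - cos(dphi)) for two massless transverse objects
def transverse_mass(met_pt, met_phi, obj_pt, obj_phi):
    dphi = met_phi - obj_phi
    return math.sqrt(2.0 * met_pt * obj_pt * (1.0 - math.cos(dphi)))

# Back-to-back (dphi = pi): mT = sqrt(2*50*50*2) = 100
mt = transverse_mass(50.0, 0.0, 50.0, math.pi)
```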

Small bug fixes

  • Fix Common.py TIMBER imports
  • Make fileName attribute public (used for new __str__ method for printing analyzer object)
  • Add BaseNode to AllNodes for tracking
  • Force BaseNode to zero children on initialization to avoid memory issues
  • Fix Group addition

Beta version 1.0

12 Oct 20:08
e5893d6

With the most recent changes, I believe we've exited any sort of alpha and so the project will now start doing tags and releases to track development and allow users to check if they have the latest version of TIMBER (or grab a development version).