Following is the Google Summer of Code 2021 Final Report on the project "Root Storage of Deep Learning Models in TMVA", conducted under CERN-HSF.
| Student's Name | Sanjiban Sengupta |
|---|---|
| Mentors | Lorenzo Moneta, Sitong An, Anirudh Dagar |
| Organization | ROOT Project (CERN-HSF) |
| Organization Code Repository | https://github.com/root-project/root |
| Project Page | https://summerofcode.withgoogle.com/projects/#5424575602491392 |
| Code Implementations | https://github.com/root-project/root/pulls?q=author:sanjibansg |
| Documentation Blog | https://blog.sanjiban.ml/series/gsoc |
The Toolkit for Multivariate Data Analysis (TMVA) is a sub-module of ROOT which provides a machine learning environment for the training, testing, and evaluation of various multivariate methods, especially those used in High-Energy Physics. Recently, the TMVA team introduced SOFIE (System for Fast Inference code Emit), which provides its own intermediate representation of deep learning models following the ONNX standard. To facilitate the usage, storage, and exchange of these models, this project aimed at developing storage functionality for deep learning models in the `.root` format, popular in the High-Energy Physics community.
- Functionality for serialization of RModel, for storing a trained deep learning model in the `.root` format.
- Functionality for parsing a Keras `.h5` file into an RModel object for generation of inference code.
- Functionality for parsing a PyTorch `.pt` file into an RModel object for generation of inference code.
- Tests and tutorials for the various parsers of TMVA SOFIE's RModel object.
1. Serialization of RModel PR#8666
- Link to blog article: https://blog.sanjiban.ml/root-project-introducing-sofie
- Description: RModel is the primary class defined in SOFIE for storing the configuration and weights of a trained deep learning model, and ROperator is the abstract base class from which the various operators are derived. Following the ONNX standard, each ROperator is responsible for generating the specific inference code that operates on its input tensors and produces the outputs according to the attributes provided. The RModel class had to be made serializable so that it can be saved in the `.root` format.
- Progress
  - Modifying the Data Structures
    - Modifying struct InitializedTensor
    - Modifying class RModel & ROperator
    - Modifying the LinkDef file
  - Adding the Custom Streamer to RModel (see the sketch after this list)
  - Tests
    - Emit files for generating header files
    - Tests for the parser
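Because RModel holds raw weight buffers that ROOT's automatic I/O cannot persist directly, the custom Streamer hooks into reading and writing. The following is a minimal sketch of the general ROOT custom-Streamer pattern together with the LinkDef convention that accompanies it; the member handling in the comments is illustrative, not the actual RModel implementation.

```cpp
// Sketch of the custom-Streamer pattern (illustrative, not the actual RModel
// code). In the LinkDef file, a trailing '-' on the class pragma, e.g.
//    #pragma link C++ class TMVA::Experimental::SOFIE::RModel-;
// tells rootcling not to generate a streamer, and the implementation file
// then provides a custom one:
void RModel::Streamer(TBuffer &R__b)
{
   if (R__b.IsReading()) {
      R__b.ReadClassBuffer(RModel::Class(), this);
      // e.g. rebuild the transient weight buffers of each InitializedTensor
      // from the persisted std::vector members here
   } else {
      // e.g. mirror the raw tensor data into persistable members here
      R__b.WriteClassBuffer(RModel::Class(), this);
   }
}
```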
- Interface

```cpp
// Writing the ROOT file
{
   TFile file("model.root", "CREATE");
   using namespace TMVA::Experimental;
   SOFIE::RModel model = SOFIE::PyKeras::Parse("trained_model_dense.h5");
   model.Write("model");
   file.Close();
}

// Reading the ROOT file
{
   TFile file("model.root", "READ");
   using namespace TMVA::Experimental;
   SOFIE::RModel *model;
   file.GetObject("model", model);
   file.Close();
}
```
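Once read back, the model can emit its inference code. A brief sketch, assuming RModel's Generate() and OutputGenerated() code-emission methods and an illustrative output file name:

```cpp
// Sketch: emitting the inference header from a deserialized model.
// Generate()/OutputGenerated() are assumed here as the code-emission
// entry points; "model.hxx" is an illustrative output name.
TFile file("model.root", "READ");
TMVA::Experimental::SOFIE::RModel *model;
file.GetObject("model", model);
model->Generate();                   // build the inference code in memory
model->OutputGenerated("model.hxx"); // write the generated header to disk
file.Close();
```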
2. Keras Parser for RModel PR#8430
- Link to blog article: https://blog.sanjiban.ml/root-project-keras-parser-for-sofie
- Description: A converter for Keras `.h5` models was required for translating Keras Sequential API and Functional API models into an RModel object for the subsequent generation of inference code.
- Progress
  - Restructured SOFIE to avoid dependency conflicts between different Python libraries
  - Parser function for extracting the model information and weights and instantiating an RModel object
    - Support for Keras Sequential API models
    - Support for Keras Functional API models
    - Supports Dense (with ReLU activation), ReLU, and Permute layers
    - Header file for the function
    - Function implementation
  - Converter function writing the RModel containing the model information into a ROOT file
    - Header file for the function
    - Function implementation
  - Tests
    - Emit files for generating header files
    - Tests for the parser
  - Tutorials
- Interface

```cpp
// The parser returns an RModel object
using namespace TMVA::Experimental::SOFIE;
RModel model = PyKeras::Parse("trained_model_dense.h5");

// The converter writes a ROOT file directly
PyKeras::ConvertToRoot("trained_model_dense.h5");
```
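Putting the pieces together, here is a hedged end-to-end sketch from a Keras file to a compiled inference call; the generated namespace and infer() usage follow SOFIE's usual pattern but are assumptions here:

```cpp
// Sketch: parse a Keras model and emit its inference header.
using namespace TMVA::Experimental::SOFIE;
RModel model = PyKeras::Parse("trained_model_dense.h5");
model.Generate();
model.OutputGenerated("trained_model_dense.hxx");
// The generated header can then be included in client code and its inference
// function called on a raw input buffer; names below are illustrative:
//   #include "trained_model_dense.hxx"
//   std::vector<float> out = TMVA_SOFIE_trained_model_dense::infer(input);
```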
3. PyTorch Parser for RModel PR#8684
- Link to blog article: https://blog.sanjiban.ml/root-project-pytorch-parser-for-sofie
- Description: A converter was required for parsing PyTorch `.pt` models, saved using TorchScript, into an RModel object for the subsequent generation of inference code. The developed functionality requires the shapes of the input tensors and their data type. If not specified, the data type defaults to Float, but the shapes vector is a mandatory parameter.
- Progress
  - Parser function for extracting the model information and weights and instantiating an RModel object
    - Support for the PyTorch nn.Module, nn.Sequential, and nn.ModuleList containers
    - Supports Linear, ReLU, and Transpose layers/operations
    - Supports tensors with dynamic axes
    - Header file for the function
    - Function implementation
  - Converter function writing the RModel containing the model information into a ROOT file
    - Header file for the function
    - Function implementation
  - Tests
    - Emit files for generating header files
    - Tests for the parser
  - Tutorials
- Interface

```cpp
// The parser returns an RModel object
using namespace TMVA::Experimental::SOFIE;

// Build the vector of input shapes
std::vector<size_t> s1{120, 1};
std::vector<std::vector<size_t>> inputShape{s1};
RModel model = PyTorch::Parse("trained_model_dense.pt", inputShape);

// The converter writes a ROOT file directly
PyTorch::ConvertToRoot("trained_model_dense.pt", inputShape);
```
- Tests were built on Google's GTest framework. Python scripts, executed through the C-Python API, generate models and save them; these models are then parsed, and the correctness of the parsers is validated by comparing the outputs of the generated inference code against those of the saved models when called on the same input tensors (see the sketch after this list).
- Simple tutorials were built (PR#8874) showcasing use cases of the parsers, the generation of inference code, and the usage of functions defined in the RModel class.
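A hedged sketch of the parity check described above; the header name, generated namespace, inference signature, and reference values are all illustrative:

```cpp
// Compare the generated inference code against reference outputs
// precomputed by the original framework on the same input tensor.
#include <gtest/gtest.h>
#include <vector>
#include "trained_model_dense.hxx" // generated inference header (assumed name)

TEST(SOFIEParsers, KerasDenseParity)
{
   float input[] = {0.1f, 0.2f, 0.3f, 0.4f};
   // Reference outputs assumed to come from the saved Keras model itself.
   std::vector<float> expected = {0.25f, 0.75f};
   std::vector<float> output = TMVA_SOFIE_trained_model_dense::infer(input);

   ASSERT_EQ(output.size(), expected.size());
   for (size_t i = 0; i < output.size(); ++i)
      EXPECT_NEAR(output[i], expected[i], 1e-5);
}
```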
After implementing the expected deliverables, I started working on the development of the ROOT storage of BDTs. The implementation required developing a class to serve as the primary data structure for holding the model configuration and weights, serializable into the `.root` file; a Parse function for translating a BDT model trained in TMVA and saved in an `.xml` file; and, lastly, a mapping interface to TMVA Tree Inference for generating inference code. The class was initially implemented by Jonas Rembser (https://github.com/guitargeek/tmva-to-xgboost/), and I made further modifications to it.
- Interface

```cpp
// The parser loads the BDT model from the .xml file into a RootStorage::BDT object
TMVA::Experimental::RootStorage::BDT model;
bool usePurity = true;
model.Parse("TMVA_CNN_Classification_BDT.weights.xml", usePurity);
```
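Since the class is serializable, the parsed model can then be written to a `.root` file in the usual way; a short sketch (the file and key names are assumptions):

```cpp
// Sketch: persisting the parsed BDT model (file/key names illustrative).
TFile file("bdt_model.root", "CREATE");
file.WriteObject(&model, "model"); // works for any class with a ROOT dictionary
file.Close();
```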
| Pull Request | PR Number |
|---|---|
| Restructured SOFIE | #8594 |
| Serialisation of RModel | #8666 |
| Modifying AddOutputTensorNameList() | #8640 |
| PyKeras Converter TMVA | #8430 |
| PyTorch Converter TMVA | #8684 |
| Tutorials for RModel Parsers | #8874 |
| Root Storage of BDT | #8873 |
- Documenting the data structures and functions in SOFIE and the parsers using Doxygen.
- Contributing to ROOT & TMVA by implementing, improving, and debugging code.
- Development of the ROOT storage of BDTs
  - Developing the mapping interface for inference code generation from the class RootStorage::BDT
  - Researching the conversion of scikit-learn-based BDT models to the class RootStorage::BDT for subsequent inference
  - Adding tests & tutorials
- Development of ROperators
  - Implementing classes for various ROperators for ONNX & ONNX-ML (a standalone sketch of the idea follows this list)
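For orientation, each operator class is responsible for emitting the C++ code of its own computation. The sketch below shows the general shape such a class takes; it is a standalone illustration of the idea, not the actual ROperator interface:

```cpp
#include <string>
#include <utility>

// Standalone illustration of the operator-emits-its-own-code idea (not the
// actual SOFIE ROperator API): the operator stores its tensor names and
// returns the C++ snippet implementing its computation.
class IdentityOpSketch {
public:
   IdentityOpSketch(std::string input, std::string output)
      : fInput(std::move(input)), fOutput(std::move(output)) {}

   // Emit inference code that copies the input tensor to the output tensor.
   std::string Generate() const
   {
      return "   tensor_" + fOutput + " = tensor_" + fInput + ";\n";
   }

private:
   std::string fInput;  // name of the input tensor
   std::string fOutput; // name of the output tensor
};
```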
The planned goals of the project were successfully implemented. Currently in the experimental stage, SOFIE requires continuous development and holds effective applications in the inference of deep learning models. I wish to keep contributing to the project in the future by implementing functionalities, improving features, and debugging issues. I gained an in-depth understanding of the ROOT project and its applications in High-Energy Physics. While working on the project, I faced numerous challenges but learned how to tackle them. Along the way, I learned about many tools, methods, and concepts for developing robust applications. It was a dream to work with people from the largest particle physics facility in the world; I am blessed to have received the opportunity and guidance, and I sincerely hope to get the chance to work with them again.
First of all, I convey my thanks to Google for organizing this event of massive learning, networking, and open-source software development. I am highly grateful to my mentors Lorenzo Moneta, Sitong An, and Anirudh Dagar, and to CERN-HSF, for providing me the opportunity to work on the project and for all the guidance and help they have given. I am also thankful to TMVA team member Omar Andres Zapata Mesa for his help and support in implementing and debugging the functionalities. Lastly, I thank all the student developers for making this program successful, my friends and seniors for their continuous help and support, and my parents for their belief, guidance, and support in all my endeavors.