Commit
MLflow integration && dev env chapters (#8)
* add mlflow integration

* fix mlflow label

* fix underscore

* remove sample chapter

* remove sample chapter

* fix underscore

* add dev-env

* add dev-env

* pr fixes

* add citations

* citation fixes
avan1235 authored Jun 2, 2021
1 parent 89e8285 commit b29e914
Showing 5 changed files with 262 additions and 14 deletions.
109 changes: 109 additions & 0 deletions chapters/dev-env.tex
@@ -0,0 +1,109 @@
\chapter{Development environment}
\label{chap:devenv}

\section{Development environment overview}

We created a specially configured development environment as part of our project repository to
simplify the setup of the environment needed to run every integration together with our library
and the Nussknacker service. For this purpose, a separate directory named \texttt{dev-environment} was
created which contains all the Docker configurations and scripts required to set up the environment
from scratch.

The base configuration of the environment is the \texttt{.env} file, which defines the environment
constants and makes it easy to change the ports of services and the configurable paths inside the
Docker images. The same file is also loaded by sbt, so the integration tests that use the Docker
services run against exactly the same environment.

Creating a new environment comes down to running a single bash script from the \texttt{dev-environment}
directory; the build process can be adjusted with additional flags passed to the script. It is possible
to start only a single integration environment (e.g. only the MLflow server repository with its services)
or to skip recompiling the library before placing it in the Nussknacker image. These options were added
while new integrations were being developed, because waiting for parts of the environment that were not
needed to test and experiment with a particular integration turned out to be time-consuming. Additionally,
we created an extra bash script for cleaning the environment's cached data, which we discovered takes up
a lot of disk space when many integrations with their environments live in a single project.

Every integration has its own \texttt{docker-compose} configuration file which specifies the Docker services
needed to run its environment. Additionally, there is a separate configuration named \texttt{env} for the
Nussknacker image and its services, where the compiled Prinz library is placed. This approach makes it
easier to set up each environment separately and run only the ones that are needed. However, it also makes
communication between the environments slightly harder: we have to manually create a shared Docker network
to which the environments are attached at creation time, after which they can communicate freely. Moreover,
we decided not to expose every internal port of an integration to the other integrations and the Nussknacker
environment; instead, each integration gets a single proxy server which acts as a barrier between the
integration's implementation details and the outside world, so that the environment behaves more like a
real-life deployment.

The Docker image for each integration needs extra dependencies, which are managed using the \texttt{conda}
environment manager. Installing all of them separately takes a considerable amount of time, so we decided
to build the integration Docker images once and publish them in an external Docker image registry. We chose
the GitHub container registry, which lets us publish the images as part of our open-source project but
unfortunately forces users to log in to GitHub before downloading an image. However, this GitHub policy may
change in the near future, as the community does not seem to like the login requirement and there are many
open discussions on this topic.

\section{Model serving in the environment}

Each integration has its own way of creating models and serving them after the training process:

\begin{itemize}
\item MLflow models are trained after the model registry environment is created and are then served
as Docker services from the same Docker image. Their specification is stored in a separate PostgreSQL
database and the signatures are kept in the S3 storage provided by the Minio Docker image. This setup
makes the prepared environment behave like a production MLflow server, because no local storage is
used for keeping the data.

\item PMML models are exported as XML files during training on image start and served as plain files
by a small HTTP server written in Python. This approach uses the PMML Python dependencies only at
training time, as there is no model management server for the PMML representation. The models have
to be listed in a specific way so that their locations can be discovered automatically: a selector
path, configurable in Prinz, specifies the model references on the main page of the server.

\item The H2O integration sits somewhere between the MLflow and the PMML ones, as the H2O server can
set up a full model registry during training time, list the models and serve them through a REST API.
However, in our case a simpler approach was chosen: after training, the models are saved as standard
MOJO files and then loaded by the Prinz library as local scoring models (a minimal loading sketch is
shown after this list). Here we leave some room for future work, as it is also possible to score the
H2O models on the H2O server side, as in the case of the MLflow registry.
\end{itemize}
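A minimal sketch of such local MOJO scoring is shown below; it uses the publicly documented
\texttt{h2o-genmodel} classes, while the file path, column names and the choice of a regression model
are placeholders, and the actual Prinz code may wrap this differently.

\begin{verbatim}
import hex.genmodel.MojoModel
import hex.genmodel.easy.{EasyPredictModelWrapper, RowData}

object MojoScoringSketch {
  def main(args: Array[String]): Unit = {
    // Load a MOJO file exported during training (the path is a placeholder).
    val mojo = MojoModel.load("/models/h2o-model.zip")
    val model = new EasyPredictModelWrapper(mojo)

    // Build a single input row; the column names must match the training frame.
    val row = new RowData()
    row.put("feature_1", Double.box(1.5))
    row.put("feature_2", "categorical_value")

    // Score locally, without calling any external H2O server; the prediction
    // method depends on the model category (regression is assumed here).
    val prediction = model.predictRegression(row)
    println(s"predicted value: ${prediction.value}")
  }
}
\end{verbatim}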

\section{Proxying the model environments}

For each integration we created a separate proxy server based on the lightweight nginx alpine Docker image,
set up with a single configuration file containing the needed directives. In the case of the MLflow
integration we tried to simulate real-world usage of the MLflow registry by manually configuring response
buffering and setting some custom request headers in the proxy. In the other integrations the served models
(XML and MOJO files) are simply proxied, only changing the port on which they are available to the user.

Furthermore, for each integration the proxy also serves a few static files which contain the data needed
during integration tests. In this role it acts like a simple REST API server, providing extra inputs for
the models in specific phases of the library tests.

\section{Development environment in integration tests}

Each integration runs its integration tests separately, which was one of the main reasons to split the
environment configuration into several files. During the test phase the project root has to be known in
order to locate these configuration files, so before running any integration test the user has to define
the \texttt{REPOSITORY\_ABSOLUTE\_ROOT} environment variable. After this initial configuration, running
the tests from the console is as simple as calling a single sbt command, as the whole environment
configuration is loaded by the sbt plugin from the \texttt{.env} file. However, to run the tests from an
IDE the \texttt{.env} file has to be specified separately in the test run configuration, as for now there
is no way to load this information automatically in the IDE.
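For illustration, locating files relative to this variable in test code could look like the sketch below;
the helper name and the example path are made up and only show the intent of failing fast when the
variable is missing.

\begin{verbatim}
import java.nio.file.{Path, Paths}

// Hypothetical helper used by integration tests to locate the docker-compose
// configuration files relative to the repository root.
object RepositoryRoot {
  def resolve(relative: String): Path = {
    val root = sys.env.getOrElse(
      "REPOSITORY_ABSOLUTE_ROOT",
      sys.error("REPOSITORY_ABSOLUTE_ROOT must be set before running integration tests"))
    Paths.get(root, relative)
  }
}

// Example usage (the path is only an illustration):
// val composeFile = RepositoryRoot.resolve("dev-environment/mlflow/docker-compose.yaml")
\end{verbatim}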

The testing phase covers the integration part itself, i.e. scoring the models available in the integration's
model source and checking their signatures and other parameters. Moreover, there are tests which check the
models' ability to be proxied with an external data source (data coming from outside the Nussknacker
environment); they use the described nginx feature of serving as a simple REST API and a local H2 database
as the source of data for the test runs. All of these tests are wrapped in abstract trait specifications so
that they can be run against every integration with minimal configuration. Running the tests for a specific
integration includes setting up the environment from scratch, but everything is handled by the testcontainers
library, which uses the docker-compose YAML files. This approach ties the source code of our library and its
tests to the environment definition, so during the test run the developer does not need to set up the
environment manually. Additionally, the unit tests which don't use the integration environments just run
plain Scala test code without touching the environment definition, so there is an easy way to run them and
verify specific parts of the code without working with any Docker containers.
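The general shape of the abstract specifications described above is sketched below, using the underlying
Java testcontainers API from ScalaTest; the trait, class and service names are illustrative rather than the
actual Prinz code, and the real project may use the dedicated Scala wrapper (testcontainers-scala) instead.

\begin{verbatim}
import java.io.File

import org.scalatest.BeforeAndAfterAll
import org.scalatest.flatspec.AnyFlatSpec
import org.testcontainers.containers.DockerComposeContainer

// Illustrative base specification: each integration points at its own
// docker-compose file and inherits the environment lifecycle handling.
trait IntegrationEnvSpec extends AnyFlatSpec with BeforeAndAfterAll {
  def composeFile: File

  // Exposed services and wait strategies are omitted in this sketch.
  private lazy val environment = new DockerComposeContainer(composeFile)

  override def beforeAll(): Unit = {
    super.beforeAll()
    environment.start() // sets up the integration environment from scratch
  }

  override def afterAll(): Unit = {
    environment.stop()
    super.afterAll()
  }
}

// A concrete integration only has to provide its own environment definition.
class MlflowIntegrationSpec extends IntegrationEnvSpec {
  override def composeFile: File =
    new File("dev-environment/mlflow/docker-compose.yaml")

  "MLflow environment" should "expose the model registry" in {
    // ... call the registry through the proxy and assert on the response ...
    succeed
  }
}
\end{verbatim}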
132 changes: 132 additions & 0 deletions chapters/mlflow-integration.tex
@@ -0,0 +1,132 @@
\chapter{MLflow integration}
\label{chap:mlflow}

\section{MLflow project overview}

MLflow is an open source project for managing the whole machine learning lifecycle. It covers
experimentation, deployment and reproducibility of models, as well as a central registry for versioning
trained models. It was proposed by the Databricks team, which works with hundreds of companies using
machine learning, as a solution to common production problems encountered when working with machine
learning models and their deployment.\cite{mlflowpage}

The MLflow platform was inspired by existing solutions from Facebook, Google and Uber, which are limited
in that they support only a small set of built-in algorithms or a single library and are tightly coupled
to each company's infrastructure. They make it hard to use new machine learning libraries and to share
the results of the work with an open community.\cite{mlflowarticle}

MLflow was designed to be an open solution in two senses:

\begin{itemize}
\item open source: it is a publicly available open source library that can be easily adapted by users
and extended to the expected form. Additionally, the MLflow format makes it easy to share whole
workflows and models from one organization to another when you wish to share your code with
collaborators.

\item open interface: it is designed to work with many already available tools, machine learning
libraries and algorithms. It is built around REST APIs and readable data formats, where, for example,
a machine learning model can be seen as a single lambda function called on model evaluation. It was
designed to be easy to introduce into existing machine learning projects so that users can benefit
from it immediately.
\end{itemize}

\section{Trained models management}

MLflow allows users to take existing models and easily convert them to its open interface format.
When working with the MLflow framework we used Python to create some machine learning models for test
purposes with the sklearn framework, which is one of the many frameworks supported by MLflow. The
pretrained models were then imported into the MLflow repository, which saved their state in an external
database and in a data storage server responsible for keeping the training artefacts. In practice, the
artefact that was useful in our case was the signature of the model, which is exported in YAML format.

The training process of the models we used was short enough that we could retrain them every time the
MLflow server was created in a clean environment. This approach allowed us to easily debug any
inconsistency in the models' training process, because we could see the difference between the clean
environment and the environment with trained models.

\section{Trained models serving}

MLflow allows deploying the trained models from its registry in a few ways, including:
\begin{itemize}
\item deploying a \texttt{python\_function} model to Microsoft Azure ML
\item deploying a \texttt{python\_function} model to Amazon SageMaker
\item exporting a \texttt{python\_function} model as an Apache Spark UDF
\item building a Docker image with a REST API endpoint serving the model
\item deploying the model as a local REST API server
\end{itemize}

Each of these approaches may be easier or harder to use in practice, depending on the architecture we
already work with and the resources available for the deployment process. Managing a local REST API
server is the easiest solution, but it does not scale with the number of models. When the developers
maintain the environment themselves and it is hard to set up a fresh, clean environment from scratch,
it is better to manage the models in separate Docker containers (whose deployment process can
additionally be easily automated).\cite{mlflowdoc}

In our environment the trained models are served as a local REST API inside a Docker image containing
the MLflow server, with ports exposed for external usage. With this approach we can clean the environment
on every setup of the library's integration tests, download fresh, prepared images and train the prepared
models, which are then stored in a clean MLflow database and S3 storage (provided as separate Docker
images).

\section{MLflow Repository}

The configured MLflow server used in our environment serves as the model registry, which takes care of
model versioning and exposes the basic model information through a REST API. As there is no Scala (or
Java) client for this REST API, we created our own implementation of the provided API using the
high-level HTTP client library \texttt{sttp} written by SoftwareMill and the object serialization
library \texttt{circe} powered by Cats. These two libraries allowed us to create a model corresponding
to the official MLflow documentation which can be easily integrated with the current version of the
MLflow model repository.
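A minimal sketch of such a client call is shown below, assuming the registry is reachable on
\texttt{localhost:5000}; the endpoint path follows our reading of the MLflow REST API documentation and
the case classes are simplified stand-ins for the real Prinz model.

\begin{verbatim}
import io.circe.generic.auto._
import io.circe.parser.decode
import sttp.client3._

// Simplified stand-ins for the classes generated from the MLflow REST API docs.
final case class RegisteredModel(name: String)
final case class RegisteredModelsResponse(registered_models: List[RegisteredModel])

object MlflowClientSketch extends App {
  val backend = HttpURLConnectionBackend()

  // The endpoint path is assumed from the MLflow REST API documentation.
  val response = basicRequest
    .get(uri"http://localhost:5000/api/2.0/mlflow/registered-models/list")
    .send(backend)

  // The registry answers with JSON that is decoded into the case classes above.
  val models: Either[String, RegisteredModelsResponse] =
    response.body.flatMap(json =>
      decode[RegisteredModelsResponse](json).left.map(_.getMessage))

  models.foreach(_.registered_models.foreach(model => println(model.name)))
}
\end{verbatim}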

The data served by the MLflow model registry server is provided in JSON format, for which we managed to
create corresponding models in code using compile-time code analysis with the Scala macros sbt plugin.
In this approach it is enough to create case classes with the proper fields to get automatic conversion
between the received JSON text and the object model in the programming language. The whole process is
based on the \texttt{JsonCodec} annotation, which preprocesses Scala case classes at compile time and
generates the \texttt{Encoder} and \texttt{Decoder} implementations for the annotated class. In the
standard approach the developer would have to write this code manually, translating JSON fields to
object fields and vice versa, while in most cases this can be done automatically.
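For illustration, the pattern looks roughly like the sketch below; the class and field names are examples
rather than the actual Prinz definitions, and the annotation requires macro annotations to be enabled
(the macro paradise plugin or \texttt{-Ymacro-annotations}).

\begin{verbatim}
import io.circe.generic.JsonCodec
import io.circe.parser.decode
import io.circe.syntax._

// The annotation generates implicit Encoder[ModelVersion] and Decoder[ModelVersion]
// at compile time, so no hand-written JSON mapping code is needed.
@JsonCodec
final case class ModelVersion(name: String, version: String, current_stage: String)

object JsonCodecSketch extends App {
  val json = """{"name":"wine-quality","version":"3","current_stage":"Production"}"""

  // Decoding uses the generated Decoder ...
  val parsed = decode[ModelVersion](json)
  println(parsed)

  // ... and encoding uses the generated Encoder.
  parsed.foreach(modelVersion => println(modelVersion.asJson.noSpaces))
}
\end{verbatim}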

\section{MLflow Model and ModelInstance}

The MLflow model is instantiated based on the data received from the MLflow model registry and parsed
from JSON, together with the signature data located in the model's artefact storage. The signature data
is parsed separately from a YAML-formatted file which is also mapped to a \texttt{JsonCodec} model class
with the automatic Encoder/Decoder mechanism.

When the MLflow model is instantiated to allow scoring the served model, the whole conversion of the data
sent to the external model service is done by the MLflow model instance during the \texttt{run} method
invocation. We created a Dataframe abstraction which corresponds to a single input for the model. When
typed input data comes from another Nussknacker service to the Prinz MLflow implementation, it has to be
converted to valid JSON, whose shape can be determined from the signature of the model. At process runtime
the library receives the data in a unified format of type \texttt{Any}, which then has to be recognized
and parsed before the input frame for the model is created. MLflow served models support two ways of
providing JSON data, named split and record orientations, for compatibility and easier work with different
types of input data. In our model we use only a single approach to Dataframe-to-JSON conversion, as the
data received from another service is always given in the same format. The data conversion needed before
sending data to the model and after receiving its response is implemented in the methods of the separate
\texttt{MLFDataConverter} object: they don't depend on the model and exist in the architecture as static
methods responsible only for data conversion for this single integration.
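As an illustration of the split orientation, a conversion along the lines of the sketch below turns a
single input row into the JSON body expected by the served model; the names are ours and the real
\texttt{MLFDataConverter} is more general.

\begin{verbatim}
import io.circe.Json
import io.circe.syntax._

// Hypothetical converter: one input row (column name -> value) becomes the
// "split"-oriented JSON payload {"columns": [...], "data": [[...]]}.
object SplitOrientationSketch {
  def toSplitJson(row: Seq[(String, Double)]): Json = {
    val (columns, values) = row.unzip
    Json.obj(
      "columns" -> columns.asJson,
      "data"    -> Json.arr(values.asJson)
    )
  }

  def main(args: Array[String]): Unit = {
    val row = Seq("fixed_acidity" -> 7.4, "alcohol" -> 9.4)
    println(toSplitJson(row).noSpaces)
    // prints {"columns":["fixed_acidity","alcohol"],"data":[[7.4,9.4]]}
  }
}
\end{verbatim}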

\section{MLflow Model Signature}

The MLflow model signature is not a mandatory part of a model registered in the registry provided by the
MLflow library. In our approach we assume that every user of our library saves the models in the registry
in a way that includes creating the model signature. We found this way of typing the model input and
output to be the easiest one, because the final user of Nussknacker doesn't have to know anything about
the model implementation (when it was logged to the registry with a signature). Additionally, creating
the model with a typed signature ensures that only valid input is accepted by the model (which is not so
obvious, as most of the model management on the MLflow side is done in Python, which has no static
typing). However, creating the signature for MLflow models in Python has a single drawback, i.e. the
model outputs don't have any names and are given as an ordered list in the JSON response. We manually
create additional labels for the model outputs during scoring, naming the consecutive outputs
\texttt{output\_0}, \texttt{output\_1}, etc. Thanks to this approach the final user of the Nussknacker
GUI can configure the model more easily, but still has to know the mapping between the data of interest
and its position in the model's output map.
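The labelling itself is straightforward; a sketch of the idea, with made-up types and names, is shown below.

\begin{verbatim}
// Sketch with made-up names: turn the unnamed, ordered model outputs into a
// map with generated labels output_0, output_1, ... as described above.
object OutputLabellingSketch {
  def labelOutputs(outputs: List[Double]): Map[String, Double] =
    outputs.zipWithIndex.map { case (value, index) =>
      s"output_$index" -> value
    }.toMap

  def main(args: Array[String]): Unit = {
    println(labelOutputs(List(0.12, 3.45)))
    // prints Map(output_0 -> 0.12, output_1 -> 3.45)
  }
}
\end{verbatim}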

It's worth noting that in the case of MLflow models the saved signature is stored on an external data
service such as Amazon S3 storage, which seems to be the default choice of MLflow users. In Prinz we
implemented access to model signatures located only on S3 storage, and we see here room for future
improvements to provide a more generic approach to fetching the signature data.
4 changes: 0 additions & 4 deletions chapters/sample-chapter.tex

This file was deleted.

28 changes: 19 additions & 9 deletions references.bib
@@ -1,12 +1,3 @@
@incollection{turing2009computing,
title={Computing machinery and intelligence},
author={Turing, Alan M},
booktitle={Parsing the turing test},
pages={23--65},
year={2009},
publisher={Springer}
}

@article{srinath2017python,
title={Python--the fastest growing programming language},
author={Srinath, KR},
@@ -16,3 +7,22 @@ @article{srinath2017python
pages={354--357},
year={2017}
}

@misc{mlflowpage,
title = {MLflow--A platform for the machine learning lifecycle},
url = {https://mlflow.org/},
year = {2021}
}

@misc{mlflowdoc,
title = {MLflow documentation--MLflow 1.17.0 documentation},
url = {https://mlflow.org/docs/1.17.0/index.html},
year = {2021}
}

@misc{mlflowarticle,
title = {Introducing MLflow: an Open Source Machine Learning Platform},
url = {https://databricks.com/blog/2018/06/05/introducing-mlflow-an-open-source-machine-learning-platform.html},
year = {2021},
author = {Matei Zaharia}
}
3 changes: 2 additions & 1 deletion thesis.tex
@@ -62,8 +62,9 @@

% Chapters
\input{chapters/introduction.tex}
\input{chapters/sample-chapter.tex}
\input{chapters/architecture-overview.tex}
\input{chapters/mlflow-integration.tex}
\input{chapters/dev-env.tex}
\input{chapters/project-development.tex}
\input{chapters/responsibilities.tex}

