Commit
MLflow integration && dev env chapters (#8)
* add mlflow integration

* fix mlflow label

* fix underscore

* remove sample chapter

* remove sample chapter

* fix underscore

* add dev-env

* add dev-env

* pr fixes

* add citations

* citation fixes
avan1235 authored Jun 2, 2021
1 parent 89e8285 commit b29e914
Showing 5 changed files with 262 additions and 14 deletions.
109 changes: 109 additions & 0 deletions chapters/dev-env.tex
@@ -0,0 +1,109 @@
\chapter{Development environment}
\label{chap:devenv}

\section{Development environment overview}

We created a specially configured development environment as part of our project repository to
simplify the setup of the environment needed to run every integration together with our library
and the Nussknacker service. For this purpose, a separate directory named \texttt{dev-environment} was
created which contains all the Docker configurations and scripts required to set up the environment
from scratch.

The base configuration of the environment is the \texttt{.env} file, which defines the environment
constants and makes it easy to change the ports of services and the configurable paths inside the
Docker images. The same file is also loaded by sbt, so the integration tests that use the Docker
services run against exactly the same environment.

Creating a new environment comes down to running a single bash script from the \texttt{dev-environment}
directory; the build process can be adjusted with additional flags passed to the script. It is possible
to start only a single integration environment (e.g. only the MLflow server repository with its services)
or to skip recompiling the library before placing it in the Nussknacker image. These options were added
while new integrations were being developed, because waiting for parts of the environment that were not
needed to test and experiment with a particular integration turned out to be time-consuming. Additionally,
we created an extra bash script for cleaning the environment's cached data, which we discovered takes up
a lot of disk space when many integrations with their environments live in a single project.

Every integration has its own \texttt{docker-compose} configuration file which specifies the Docker services
needed to run its environment. Additionally, there is a separate configuration named \texttt{env} for the
Nussknacker image and its services, where the compiled Prinz library is placed. This approach makes it
easier to set up each environment separately and run only the ones that are needed. However, it also makes
communication between the environments slightly harder: we have to manually create a shared Docker network
to which the environments are attached at creation time, after which they can communicate freely. Moreover,
we decided not to expose every internal port of an integration to the other integrations and the Nussknacker
environment; instead, each integration gets a single proxy server which acts as a barrier between the
integration's implementation details and the outside world, so that the environment behaves more like a
real-life deployment.

The Docker image for each integration needs extra dependencies, which are managed using the \texttt{conda}
environment manager. Installing all of them separately takes a considerable amount of time, so we decided
to build the integration Docker images once and publish them in an external Docker image registry. We chose
the GitHub container registry, which lets us publish the images as part of our open-source project but
unfortunately forces users to log in to GitHub before downloading an image. However, this GitHub policy may
change in the near future, as the community does not seem to like the login requirement and there are many
open discussions on this topic.

\section{Model serving in the environment}

Each integration has its own way of creating models and serving them after the training process:

\begin{itemize}
\item MLflow models are trained after the model registry environment is created and are then served
as Docker services from the same Docker image. Their specification is stored in a separate PostgreSQL
database and the signatures are kept in the S3 storage provided by the Minio Docker image. This setup
makes the prepared environment behave like a production MLflow server, because no local storage is
used for keeping the data.

\item PMML models are exported as XML files during training on image start and served as plain files
by a small HTTP server written in Python. This approach uses the PMML Python dependencies only at
training time, as there is no model management server for the PMML representation. The models have
to be listed in a specific way so that their locations can be discovered automatically: a selector
path, configurable in Prinz, specifies the model references on the main page of the server.

\item The H2O integration sits somewhere between the MLflow and the PMML ones, as the H2O server can
set up a full model registry during training time, list the models and serve them through a REST API.
However, in our case a simpler approach was chosen: after training, the models are saved as standard
MOJO files and then loaded by the Prinz library as local scoring models (a minimal loading sketch is
shown after this list). Here we leave some room for future work, as it is also possible to score the
H2O models on the H2O server side, as in the case of the MLflow registry.
\end{itemize}
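A minimal sketch of such local MOJO scoring is shown below; it uses the publicly documented
\texttt{h2o-genmodel} classes, while the file path, column names and the choice of a regression model
are placeholders, and the actual Prinz code may wrap this differently.

\begin{verbatim}
import hex.genmodel.MojoModel
import hex.genmodel.easy.{EasyPredictModelWrapper, RowData}

object MojoScoringSketch {
  def main(args: Array[String]): Unit = {
    // Load a MOJO file exported during training (the path is a placeholder).
    val mojo = MojoModel.load("/models/h2o-model.zip")
    val model = new EasyPredictModelWrapper(mojo)

    // Build a single input row; the column names must match the training frame.
    val row = new RowData()
    row.put("feature_1", Double.box(1.5))
    row.put("feature_2", "categorical_value")

    // Score locally, without calling any external H2O server; the prediction
    // method depends on the model category (regression is assumed here).
    val prediction = model.predictRegression(row)
    println(s"predicted value: ${prediction.value}")
  }
}
\end{verbatim}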

\section{Proxying the model environments}

For each integration we created a separate proxy server based on the lightweight nginx alpine Docker image,
set up with a single configuration file containing the needed directives. In the case of the MLflow
integration we tried to simulate real-world usage of the MLflow registry by manually configuring response
buffering and setting some custom request headers in the proxy. In the other integrations the served models
(XML and MOJO files) are simply proxied, only changing the port on which they are available to the user.

Furthermore, for each integration the proxy also serves a few static files which contain the data needed
during integration tests. In this role it acts like a simple REST API server, providing extra inputs for
the models in specific phases of the library tests.

\section{Development environment in integration tests}

Each integration runs its integration tests separately, which was one of the main reasons to split the
environment configuration into several files. During the test phase the project root has to be known in
order to locate these configuration files, so before running any integration test the user has to define
the \texttt{REPOSITORY\_ABSOLUTE\_ROOT} environment variable. After this initial configuration, running
the tests from the console is as simple as calling a single sbt command, as the whole environment
configuration is loaded by the sbt plugin from the \texttt{.env} file. However, to run the tests from an
IDE the \texttt{.env} file has to be specified separately in the test run configuration, as for now there
is no way to load this information automatically in the IDE.
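For illustration, locating files relative to this variable in test code could look like the sketch below;
the helper name and the example path are made up and only show the intent of failing fast when the
variable is missing.

\begin{verbatim}
import java.nio.file.{Path, Paths}

// Hypothetical helper used by integration tests to locate the docker-compose
// configuration files relative to the repository root.
object RepositoryRoot {
  def resolve(relative: String): Path = {
    val root = sys.env.getOrElse(
      "REPOSITORY_ABSOLUTE_ROOT",
      sys.error("REPOSITORY_ABSOLUTE_ROOT must be set before running integration tests"))
    Paths.get(root, relative)
  }
}

// Example usage (the path is only an illustration):
// val composeFile = RepositoryRoot.resolve("dev-environment/mlflow/docker-compose.yaml")
\end{verbatim}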

The testing phase covers the integration part itself, i.e. scoring the models available in the integration's
model source and checking their signatures and other parameters. Moreover, there are tests which check the
models' ability to be proxied with an external data source (data coming from outside the Nussknacker
environment); they use the described nginx feature of serving as a simple REST API and a local H2 database
as the source of data for the test runs. All of these tests are wrapped in abstract trait specifications so
that they can be run against every integration with minimal configuration. Running the tests for a specific
integration includes setting up the environment from scratch, but everything is handled by the testcontainers
library, which uses the docker-compose YAML files. This approach ties the source code of our library and its
tests to the environment definition, so during the test run the developer does not need to set up the
environment manually. Additionally, the unit tests which don't use the integration environments just run
plain Scala test code without touching the environment definition, so there is an easy way to run them and
verify specific parts of the code without working with any Docker containers.
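The general shape of the abstract specifications described above is sketched below, using the underlying
Java testcontainers API from ScalaTest; the trait, class and service names are illustrative rather than the
actual Prinz code, and the real project may use the dedicated Scala wrapper (testcontainers-scala) instead.

\begin{verbatim}
import java.io.File

import org.scalatest.BeforeAndAfterAll
import org.scalatest.flatspec.AnyFlatSpec
import org.testcontainers.containers.DockerComposeContainer

// Illustrative base specification: each integration points at its own
// docker-compose file and inherits the environment lifecycle handling.
trait IntegrationEnvSpec extends AnyFlatSpec with BeforeAndAfterAll {
  def composeFile: File

  // Exposed services and wait strategies are omitted in this sketch.
  private lazy val environment = new DockerComposeContainer(composeFile)

  override def beforeAll(): Unit = {
    super.beforeAll()
    environment.start() // sets up the integration environment from scratch
  }

  override def afterAll(): Unit = {
    environment.stop()
    super.afterAll()
  }
}

// A concrete integration only has to provide its own environment definition.
class MlflowIntegrationSpec extends IntegrationEnvSpec {
  override def composeFile: File =
    new File("dev-environment/mlflow/docker-compose.yaml")

  "MLflow environment" should "expose the model registry" in {
    // ... call the registry through the proxy and assert on the response ...
    succeed
  }
}
\end{verbatim}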
132 changes: 132 additions & 0 deletions chapters/mlflow-integration.tex
@@ -0,0 +1,132 @@
\chapter{MLflow integration}
\label{chap:mlflow}

\section{MLflow project overview}

MLflow is an open source project for managing the whole machine learning lifecycle. It covers
experimentation, deployment and reproducibility of models, as well as a central registry for versioning
trained models. It was proposed by the Databricks team, which works with hundreds of companies using
machine learning, as a solution to common production problems encountered when working with machine
learning models and their deployment.\cite{mlflowpage}

The MLflow platform was inspired by existing solutions from Facebook, Google and Uber, which are limited
in that they support only a small set of built-in algorithms or a single library and are tightly coupled
to each company's infrastructure. They make it hard to use new machine learning libraries and to share
the results of the work with an open community.\cite{mlflowarticle}

MLflow was designed to be an open solution in two senses:

\begin{itemize}
\item open source: it is a publicly available open source library that can be easily adapted by users
and extended to the expected form. Additionally, the MLflow format makes it easy to share whole
workflows and models from one organization to another when you wish to share your code with
collaborators.

\item open interface: it is designed to work with many already available tools, machine learning
libraries and algorithms. It is built around REST APIs and readable data formats, where, for example,
a machine learning model can be seen as a single lambda function called on model evaluation. It was
designed to be easy to introduce into existing machine learning projects so that users can benefit
from it immediately.
\end{itemize}

\section{Trained models management}

MLflow allows users to take existing models and easily convert them to its open interface format.
When working with the MLflow framework we used Python to create some machine learning models for test
purposes with the sklearn framework, which is one of the many frameworks supported by MLflow. The
pretrained models were then imported into the MLflow repository, which saved their state in an external
database and in a data storage server responsible for keeping the training artefacts. In practice, the
artefact that was useful in our case was the signature of the model, which is exported in YAML format.

The training process of the models we used was short enough that we could retrain them every time the
MLflow server was created in a clean environment. This approach allowed us to easily debug any
inconsistency in the models' training process, because we could see the difference between the clean
environment and the environment with trained models.

\section{Trained models serving}

MLflow allows deploying the trained models from its registry in a few ways, including:
\begin{itemize}
\item deploying a \texttt{python\_function} model to Microsoft Azure ML
\item deploying a \texttt{python\_function} model to Amazon SageMaker
\item exporting a \texttt{python\_function} model as an Apache Spark UDF
\item building a Docker image with a REST API endpoint serving the model
\item deploying the model as a local REST API server
\end{itemize}

Each of these approaches may be easier or harder to use in practice, depending on the architecture we
already work with and the resources available for the deployment process. Managing a local REST API
server is the easiest solution, but it does not scale with the number of models. When the developers
maintain the environment themselves and it is hard to set up a fresh, clean environment from scratch,
it is better to manage the models in separate Docker containers (whose deployment process can
additionally be easily automated).\cite{mlflowdoc}

In our environment the trained models are served as a local REST API inside a Docker image containing
the MLflow server, with ports exposed for external usage. With this approach we can clean the environment
on every setup of the library's integration tests, download fresh, prepared images and train the prepared
models, which are then stored in a clean MLflow database and S3 storage (provided as separate Docker
images).

\section{MLflow Repository}

The configured MLflow server used in our environment serves as the model registry, which takes care of
model versioning and exposes the basic model information through a REST API. As there is no Scala (or
Java) client for this REST API, we created our own implementation of the provided API using the
high-level HTTP client library \texttt{sttp} written by SoftwareMill and the object serialization
library \texttt{circe} powered by Cats. These two libraries allowed us to create a model corresponding
to the official MLflow documentation which can be easily integrated with the current version of the
MLflow model repository.
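A minimal sketch of such a client call is shown below, assuming the registry is reachable on
\texttt{localhost:5000}; the endpoint path follows our reading of the MLflow REST API documentation and
the case classes are simplified stand-ins for the real Prinz model.

\begin{verbatim}
import io.circe.generic.auto._
import io.circe.parser.decode
import sttp.client3._

// Simplified stand-ins for the classes generated from the MLflow REST API docs.
final case class RegisteredModel(name: String)
final case class RegisteredModelsResponse(registered_models: List[RegisteredModel])

object MlflowClientSketch extends App {
  val backend = HttpURLConnectionBackend()

  // The endpoint path is assumed from the MLflow REST API documentation.
  val response = basicRequest
    .get(uri"http://localhost:5000/api/2.0/mlflow/registered-models/list")
    .send(backend)

  // The registry answers with JSON that is decoded into the case classes above.
  val models: Either[String, RegisteredModelsResponse] =
    response.body.flatMap(json =>
      decode[RegisteredModelsResponse](json).left.map(_.getMessage))

  models.foreach(_.registered_models.foreach(model => println(model.name)))
}
\end{verbatim}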

The data served by the MLflow model registry server is provided in JSON format, for which we managed to
create corresponding models in code using compile-time code analysis with the Scala macros sbt plugin.
In this approach it is enough to create case classes with the proper fields to get automatic conversion
between the received JSON text and the object model in the programming language. The whole process is
based on the \texttt{JsonCodec} annotation, which preprocesses Scala case classes at compile time and
generates the \texttt{Encoder} and \texttt{Decoder} implementations for the annotated class. In the
standard approach the developer would have to write this code manually, translating JSON fields to
object fields and vice versa, while in most cases this can be done automatically.
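For illustration, the pattern looks roughly like the sketch below; the class and field names are examples
rather than the actual Prinz definitions, and the annotation requires macro annotations to be enabled
(the macro paradise plugin or \texttt{-Ymacro-annotations}).

\begin{verbatim}
import io.circe.generic.JsonCodec
import io.circe.parser.decode
import io.circe.syntax._

// The annotation generates implicit Encoder[ModelVersion] and Decoder[ModelVersion]
// at compile time, so no hand-written JSON mapping code is needed.
@JsonCodec
final case class ModelVersion(name: String, version: String, current_stage: String)

object JsonCodecSketch extends App {
  val json = """{"name":"wine-quality","version":"3","current_stage":"Production"}"""

  // Decoding uses the generated Decoder ...
  val parsed = decode[ModelVersion](json)
  println(parsed)

  // ... and encoding uses the generated Encoder.
  parsed.foreach(modelVersion => println(modelVersion.asJson.noSpaces))
}
\end{verbatim}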

\section{MLflow Model and ModelInstance}

The MLflow model is instantiated based on the data received from the MLflow model registry and parsed
from JSON, together with the signature data located in the model's artefact storage. The signature data
is parsed separately from a YAML-formatted file which is also mapped to a \texttt{JsonCodec} model class
with the automatic Encoder/Decoder mechanism.

When the MLflow model is instantiated to allow scoring the served model, the whole conversion of the data
sent to the external model service is done by the MLflow model instance during the \texttt{run} method
invocation. We created a Dataframe abstraction which corresponds to a single input for the model. When
typed input data comes from another Nussknacker service to the Prinz MLflow implementation, it has to be
converted to valid JSON, whose shape can be determined from the signature of the model. At process runtime
the library receives the data in a unified format of type \texttt{Any}, which then has to be recognized
and parsed before the input frame for the model is created. MLflow served models support two ways of
providing JSON data, named split and record orientations, for compatibility and easier work with different
types of input data. In our model we use only a single approach to Dataframe-to-JSON conversion, as the
data received from another service is always given in the same format. The data conversion needed before
sending data to the model and after receiving its response is implemented in the methods of the separate
\texttt{MLFDataConverter} object: they don't depend on the model and exist in the architecture as static
methods responsible only for data conversion for this single integration.
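As an illustration of the split orientation, a conversion along the lines of the sketch below turns a
single input row into the JSON body expected by the served model; the names are ours and the real
\texttt{MLFDataConverter} is more general.

\begin{verbatim}
import io.circe.Json
import io.circe.syntax._

// Hypothetical converter: one input row (column name -> value) becomes the
// "split"-oriented JSON payload {"columns": [...], "data": [[...]]}.
object SplitOrientationSketch {
  def toSplitJson(row: Seq[(String, Double)]): Json = {
    val (columns, values) = row.unzip
    Json.obj(
      "columns" -> columns.asJson,
      "data"    -> Json.arr(values.asJson)
    )
  }

  def main(args: Array[String]): Unit = {
    val row = Seq("fixed_acidity" -> 7.4, "alcohol" -> 9.4)
    println(toSplitJson(row).noSpaces)
    // prints {"columns":["fixed_acidity","alcohol"],"data":[[7.4,9.4]]}
  }
}
\end{verbatim}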

\section{MLflow Model Signature}

The MLflow model signature is not a mandatory part of a model registered in the registry provided by the
MLflow library. In our approach we assume that every user of our library saves the models in the registry
in a way that includes creating the model signature. We found this way of typing the model input and
output to be the easiest one, because the final user of Nussknacker doesn't have to know anything about
the model implementation (when it was logged to the registry with a signature). Additionally, creating
the model with a typed signature ensures that only valid input is accepted by the model (which is not so
obvious, as most of the model management on the MLflow side is done in Python, which has no static
typing). However, creating the signature for MLflow models in Python has a single drawback, i.e. the
model outputs don't have any names and are given as an ordered list in the JSON response. We manually
create additional labels for the model outputs during scoring, naming the consecutive outputs
\texttt{output\_0}, \texttt{output\_1}, etc. Thanks to this approach the final user of the Nussknacker
GUI can configure the model more easily, but still has to know the mapping between the data of interest
and its position in the model's output map.
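The labelling itself is straightforward; a sketch of the idea, with made-up types and names, is shown below.

\begin{verbatim}
// Sketch with made-up names: turn the unnamed, ordered model outputs into a
// map with generated labels output_0, output_1, ... as described above.
object OutputLabellingSketch {
  def labelOutputs(outputs: List[Double]): Map[String, Double] =
    outputs.zipWithIndex.map { case (value, index) =>
      s"output_$index" -> value
    }.toMap

  def main(args: Array[String]): Unit = {
    println(labelOutputs(List(0.12, 3.45)))
    // prints Map(output_0 -> 0.12, output_1 -> 3.45)
  }
}
\end{verbatim}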

It's worth noting that in the case of MLflow models the saved signature is stored on an external data
service such as Amazon S3 storage, which seems to be the default choice of MLflow users. In Prinz we
implemented access to model signatures located only on S3 storage, and we see here room for future
improvements to provide a more generic approach to fetching the signature data.
4 changes: 0 additions & 4 deletions chapters/sample-chapter.tex

This file was deleted.

28 changes: 19 additions & 9 deletions references.bib
@@ -1,12 +1,3 @@
@incollection{turing2009computing,
title={Computing machinery and intelligence},
author={Turing, Alan M},
booktitle={Parsing the turing test},
pages={23--65},
year={2009},
publisher={Springer}
}

@article{srinath2017python,
title={Python--the fastest growing programming language},
author={Srinath, KR},
@@ -16,3 +7,22 @@ @article{srinath2017python
pages={354--357},
year={2017}
}

@misc{mlflowpage,
title = {MLflow--A platform for the machine learning lifecycle},
url = {https://mlflow.org/},
year = {2021}
}

@misc{mlflowdoc,
title = {MLflow documentation--MLflow 1.17.0 documentation},
url = {https://mlflow.org/docs/1.17.0/index.html},
year = {2021}
}

@misc{mlflowarticle,
title = {Introducing MLflow: an Open Source Machine Learning Platform},
url = {https://databricks.com/blog/2018/06/05/introducing-mlflow-an-open-source-machine-learning-platform.html},
year = {2021},
author = {Matei Zaharia}
}
3 changes: 2 additions & 1 deletion thesis.tex
@@ -62,8 +62,9 @@

% Chapters
\input{chapters/introduction.tex}
\input{chapters/sample-chapter.tex}
\input{chapters/architecture-overview.tex}
\input{chapters/mlflow-integration.tex}
\input{chapters/dev-env.tex}
\input{chapters/project-development.tex}
\input{chapters/responsibilities.tex}

