MLflow integration && dev env chapters (#8)
* add mlflow integration
* fix mlflow label
* fix underscore
* remove sample chapter
* remove sample chapter
* fix underscore
* add dev-env
* add dev-env
* pr fixes
* add citations
* citation fixes
Showing 5 changed files with 262 additions and 14 deletions.
@@ -0,0 +1,109 @@
\chapter{Development environment}
\label{chap:devenv}

\section{Development environment overview}

As part of our project repository we created a preconfigured development environment that
simplifies setting up everything needed to run each integration with our library and the
Nussknacker service. For this purpose a separate directory named \texttt{dev-environment} was
created; it contains all the Docker configurations and scripts required to set up the environment
from scratch.

The base of the environment configuration is the \texttt{.env} file, which defines the environment
constants and makes it easy to change the service ports and the configurable paths used in the
Docker images. The same file is also loaded by sbt, which can therefore work with the same
environment during the integration tests that use the Docker services to test every integration.

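As a rough illustration of how these constants can be consumed on the Scala side, the sketch below
reads a few values from the environment with defaults; the variable names are assumptions made only
for this example and not necessarily the ones defined in the actual \texttt{.env} file.

\begin{verbatim}
// A minimal sketch, assuming hypothetical variable names from the .env file.
object DevEnvConfig {
  private def env(name: String, default: String): String =
    sys.env.getOrElse(name, default)

  // Ports of the dockerized services, overridable through the .env file.
  val mlflowServerPort: Int = env("MLFLOW_SERVER_PORT", "5000").toInt
  val nussknackerPort: Int  = env("NUSSKNACKER_PORT", "8080").toInt
}
\end{verbatim}
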
Creating a new environment comes down to running a single bash script from the \texttt{dev-environment}
directory; additional flags passed to the script configure the build process. It is possible to run
only a single integration environment (e.g. only the MLflow server registry with its environment)
or to skip recompiling the library before placing it in the Nussknacker image.
These improvements were added while we were adding new integrations, because it was frustrating
to spend a lot of time waiting for parts of the environment that were not actually needed to test
and experiment with another integration. Additionally, we created an extra bash script for cleaning
the environment's cached data, which we discovered takes up a lot of disk space when a single
project contains so many integrations with their environments.

Every integration has its own \texttt{docker-compose} configuration file which specifies the Docker services
needed to run that integration's environment. Additionally, there is a separate configuration named \texttt{env}
for the Nussknacker image and its own services, in which the compiled Prinz library is placed. Thanks to
this approach it is easier to set up each environment separately and run only the ones that are needed.
However, it also makes communication between the environments slightly harder: we need
to manually add a Docker network to which the environments are attached at creation
time, after which they can communicate easily. Moreover, we decided not to expose every integration's ports
directly to the other integrations and the Nussknacker environment; instead each integration gets a single proxy
server which acts as a barrier between the integration's implementation details
and the outside world, so the environment behaves more like a real-life scenario.

The Docker image for each integration needs extra dependencies, which are managed with the \texttt{conda}
environment manager. Installing all of them separately takes quite a lot of work,
so we decided to build the integration Docker images once and publish them in an external Docker
image repository. We chose the GitHub container registry, which lets us publish the images
as a part of our open-source project but unfortunately forces the user to log in to GitHub before
downloading an image. However, this GitHub policy may change in the near future, as the community
does not seem to like the login requirement and there are many open discussions on this topic.

\section{Models serving in the environment}

Every integration provides its own way of creating models and serving them after the training process:

\begin{itemize}
\item MLflow models are trained after the model registry environment is created and are then served as
Docker services from the same Docker image. Their specification is saved in a separate PostgreSQL
database and the signatures are kept in the S3 storage provided by the Minio Docker image. This setup
makes the prepared environment behave like a production MLflow server, because no local
storage is used for keeping the data.

\item PMML models are exported as XML files during training on image start and are then served as plain files
by a small HTTP server written in Python. This approach uses the PMML Python dependencies only during
training, as there is no equivalent of a model management server for the PMML model representation.
In our approach the models have to be listed in a specific way so that the library can
automatically find their locations using a selector path, configurable in Prinz, which identifies the
models' references on the main page of the server.

\item The H2O server sits somewhere between the MLflow and PMML integrations, as it sets up a full model
registry server during training and is able to list the models and serve them through
a REST API. However, in our case a simpler approach to H2O models was chosen: after training, the
models are saved as standard MOJO files and then loaded by the Prinz library as local
scoring models (a sketch of such local scoring is shown after this list). Here we leave some room for
future integrations, as it is also possible to score the H2O models on the H2O server side, as in the
case of the MLflow registry.
\end{itemize}

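The snippet below is a rough sketch, not the actual Prinz code, of what local scoring of an exported
MOJO file can look like with the \texttt{h2o-genmodel} Java API; the file path, column name and
prediction type are assumptions made only for this illustration.

\begin{verbatim}
import hex.genmodel.MojoModel
import hex.genmodel.easy.{EasyPredictModelWrapper, RowData}

// Load the MOJO file exported during training and wrap it for easy scoring.
val mojo  = MojoModel.load("/models/gbm_model.zip")   // assumed path
val model = new EasyPredictModelWrapper(mojo)

// Build a single input row; the column name is an assumption for this example.
val row = new RowData()
row.put("age", Double.box(42.0))

// Score locally, without any call to an H2O server.
val prediction = model.predictRegression(row)
println(prediction.value)
\end{verbatim}
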
\section{Proxying models environments}

For each integration we created a separate proxy server based on the lightweight nginx alpine Docker image,
configured with a single configuration file containing the needed settings. In the case of the MLflow
integration we tried to simulate a real-world usage of the MLflow registry by manually setting up the
proxy configuration related to data buffering and by setting some custom request headers.
In the other integrations the served models, i.e. the XML and MOJO files, are only proxied with a change of
the port on which they are available to the user.

Furthermore, for each integration the proxy also serves a few static files containing the data needed
during the integration tests. In this role it acts like a simple REST API server
capable of providing extra inputs for the models in specific phases of the library tests.

\section{Development environment in integration tests}

Each integration runs its integration tests separately, which was one of the main reasons to split
the environment configuration into several files. During the test phase the project
root directory in the filesystem must be known in order to locate these configuration files, so before
running any integration test the user has to define the \texttt{REPOSITORY\_ABSOLUTE\_ROOT} environment
variable. After this initial configuration, running the tests from the console is as simple as calling a
single sbt command, as the whole environment configuration is loaded by the sbt plugin from the
\texttt{.env} file. However, if the tests need to be run from an IDE, the \texttt{.env} file has to be
specified separately in the test run configuration, because for now there is no way to load this
information automatically in the IDE.

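As a simple illustration, test code can resolve an integration's configuration files relative to this
variable; the helper and the directory layout below are assumptions made for this sketch, not the
exact Prinz implementation.

\begin{verbatim}
import java.io.File

object IntegrationFiles {
  // Fail fast when the variable required by the integration tests is missing.
  private val repositoryRoot: String =
    sys.env.getOrElse(
      "REPOSITORY_ABSOLUTE_ROOT",
      throw new IllegalStateException("REPOSITORY_ABSOLUTE_ROOT is not defined"))

  // Assumed directory layout: dev-environment/<integration>/docker-compose.yaml
  def composeFile(integration: String): File =
    new File(s"$repositoryRoot/dev-environment/$integration/docker-compose.yaml")
}
\end{verbatim}
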
The testing phase covers the integration part, i.e. scoring the models available in an integration's
model source and checking their signatures and other parameters. Moreover, there are tests which
check the models' ability to be proxied with an external data source (a source of data outside of the
Nussknacker environment), so they use the described feature of nginx acting as a simple REST API
and use a local H2 database as the source of data for the test run. All of these tests are
wrapped in abstract trait specifications, which allows running them against every integration with minimal
configuration (a sketch of such a specification is shown below). Running the tests for a specific
integration includes setting up the environment from scratch, but everything is handled by the
testcontainers library, which uses the docker-compose YAML files. This approach connects the source
code of our library and its tests with the environment definition, and during the test run the developer
does not have to set up the environment manually. Additionally, the unit tests which don't use the
integrations' environment simply run Scala test code without touching anything from the environment
definition, so there is an easy way to run them and verify specific parts of the code without working
with any Docker containers.
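
The sketch below shows, under the assumption of the \texttt{testcontainers-scala} wrapper and ScalaTest
(exact names and signatures depend on the library version), what such an abstract specification can look
like; it is an illustration rather than the actual Prinz test code.

\begin{verbatim}
import java.io.File
import com.dimafeng.testcontainers.{DockerComposeContainer, ExposedService,
  ForAllTestContainer}
import org.scalatest.flatspec.AnyFlatSpec

// Abstract specification started once per suite; every concrete integration
// only points at its own docker-compose file and the proxy service it exposes.
abstract class IntegrationEnvironmentSpec extends AnyFlatSpec with ForAllTestContainer {

  def composeFile: File      // e.g. the file resolved by IntegrationFiles above
  def proxyService: String   // name of the proxy service from the compose file
  def proxyPort: Int

  override val container: DockerComposeContainer =
    DockerComposeContainer(
      composeFile,
      exposedServices = Seq(ExposedService(proxyService, proxyPort)))
}
\end{verbatim}
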
@@ -0,0 +1,132 @@
\chapter{MLflow integration}
\label{chap:mlflow}

\section{MLflow project overview}

MLflow is an open source project for managing the whole machine learning lifecycle.
It covers experimentation, deployment and reproducibility of models, as well as a central registry
for versioning trained models. It was proposed by the Databricks team, which works with hundreds of
companies using machine learning, as a solution to common production problems encountered when working
with machine learning models and their deployment.\cite{mlflowpage}

The MLflow platform was inspired by existing solutions from Facebook, Google and Uber, which
are limited in that they support only a small set of built-in algorithms or a single library,
and are strictly tied to each company's infrastructure. They do not make it easy to use new
machine learning libraries and share the results of work with an open community.\cite{mlflowarticle}

MLflow was designed to be an open solution in two senses:

\begin{itemize}
\item open source: it is a publicly available open source library that can be easily adapted
by users and extended to the expected form. Additionally, the MLflow format makes it easy to share
whole workflows and models from one organization to another if you wish to share your
code with collaborators.

\item open interface: it is designed to work with many already available tools, machine learning
libraries, implemented libraries and algorithms. It is built using REST APIs and readable data
formats, where for example a machine learning model can be seen as a single lambda function called
on model evaluation. It was designed to be easily introduced into existing machine learning projects
so that the users can benefit from it immediately.
\end{itemize}

\section{Trained models management}

MLflow allows users to take existing models and easily convert them to its open interface format.
When working with the MLflow framework we used the Python language to create some machine
learning models for test purposes using the sklearn framework, which is one of the many frameworks
supported by MLflow. The pretrained models were then imported into the MLflow repository, which saved
their state in an external database and in a data storage server responsible for keeping
the training artefacts. In practice, the artefact data useful in our case was the
signature of the model, which was exported in YAML file format.

The training process of the models we used was short enough that we could train them every time the
MLflow server was created in a clean environment. This approach to creating the environment
allowed us to easily debug any inconsistency in the models' training process, because we could see
the difference between the clean environment and the environment with trained models.

\section{Trained models serving}

MLflow allows deploying the trained models from its registry in a few ways, including:
\begin{itemize}
\item deploying a \texttt{python\_function} model to Microsoft Azure ML
\item deploying a \texttt{python\_function} model to Amazon SageMaker
\item exporting a \texttt{python\_function} model as an Apache Spark UDF
\item exporting a REST API endpoint serving the model as a Docker image
\item deploying the model as a local REST API server
\end{itemize}

Each of these approaches may be easier or harder to deploy in practice, depending on the architecture
we already work with and the resources available for the deployment process. Managing a local
REST API server is the easiest solution, but it does not scale with the number of models. When the
developers take care of the environment themselves, and it is really hard to set up a fresh, clean
environment from scratch, it is better to manage the models using separate Docker containers (whose
deployment process can additionally be easily automated).\cite{mlflowdoc}

In our environment the models are served after training as a local REST API inside a Docker image with
the MLflow server, with their ports exposed for external usage. With this approach we can clean the
environment on every setup of the library's integration tests, download fresh, prepared images and
train the prepared models, which are then stored in a clean MLflow database and S3 storage (prepared
as separate Docker images).

\section{MLflow Repository}

The configured MLflow server used in our environment serves as the model registry, which takes care
of model versioning and serves the basic model information through a REST API. As there is no
Scala (or Java) client implementation for this REST API, we created our own implementation
of the provided API using the high-level HTTP client library \texttt{sttp} written by SoftwareMill
and the object serializer \texttt{circe} powered by Cats. These two libraries allow us to create
a model corresponding to the official MLflow documentation which can be easily integrated with the
current version of the MLflow model repository.

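The sketch below illustrates, under the assumption of sttp~3 together with circe's automatic
derivation, how such a registry call can look; the endpoint path and the response fields follow the
public MLflow REST API documentation only loosely and are meant as an illustration, not as the exact
Prinz client.

\begin{verbatim}
import io.circe.generic.auto._
import sttp.client3._
import sttp.client3.circe._

// Assumed shape of a fragment of the registry response, for illustration only.
final case class RegisteredModel(name: String)
final case class RegisteredModelsResponse(registered_models: List[RegisteredModel])

object MlflowClientSketch {
  private val backend = HttpURLConnectionBackend()

  // Query the registry and decode the JSON body into the case classes above.
  def listModels(serverUrl: String) =
    basicRequest
      .get(uri"$serverUrl/api/2.0/mlflow/registered-models/list")
      .response(asJson[RegisteredModelsResponse])
      .send(backend)
      .body
}
\end{verbatim}
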
The whole data model served by the MLflow model registry server is provided in JSON format, for which
we created corresponding models in code using compile-time code analysis with
the Scala macros sbt plugin. In this approach it was enough to create case classes with the proper
fields to get automatic conversion between the received JSON text data and the object
model in the programming language. The whole process is based on the \texttt{JsonCodec} annotation,
which preprocesses Scala case classes at compile time and generates the Encoder and Decoder
implementations for the annotated class. In the standard approach the developer would have to write
this code manually, merely translating the JSON fields into the object's fields and vice versa, while
in most cases this can be done automatically.

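As a small illustration (with the macro annotations compiler support enabled, e.g.
\texttt{-Ymacro-annotations} on Scala 2.13), a single annotation is enough to derive both codecs; the
fields shown below are an assumed example rather than the exact Prinz model class.

\begin{verbatim}
import io.circe.generic.JsonCodec

// The annotation generates the circe Encoder and Decoder at compile time,
// so no manual mapping between JSON fields and object fields is needed.
@JsonCodec
final case class ModelVersion(name: String, version: String, current_stage: String)
\end{verbatim}
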
\section{MLflow Model and ModelInstance}

An MLflow model is instantiated based on the data received and parsed in JSON format from the MLflow
model registry, together with the signature data located in the model's artefact storage. The signature
data is parsed separately from a YAML-formatted file which is also mapped to a \texttt{JsonCodec} model
class with the automatic Encoder/Decoder mechanism.

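A rough sketch of this step, assuming the \texttt{circe-yaml} parser and a deliberately simplified
signature class (the real MLflow signature format is richer), could look as follows.

\begin{verbatim}
import io.circe.generic.auto._
import io.circe.yaml.parser

// Simplified, assumed shape of the signature data; illustration only.
final case class ModelSignature(inputs: String, outputs: String)

// Parse the YAML text into circe's Json and decode it into the case class.
def parseSignature(yamlText: String): Either[io.circe.Error, ModelSignature] =
  parser.parse(yamlText).flatMap(_.as[ModelSignature])
\end{verbatim}
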
When the MLflow model is instantiated to allow scoring the served model, the whole conversion of the data
sent to the external model service is done by the MLflow model instance during the invocation of the run
method. We created a Dataframe abstraction which corresponds to a single input for the model. When typed
input data comes from another Nussknacker service to the Prinz MLflow implementation, it has to be
converted manually to valid JSON, whose format can be determined based on the signature of the model. At
process runtime the library receives the data in a unified format of type Any, which then has to
be recognized and parsed before creating the input frame for the model. MLflow served models support two
orientations of the provided JSON data, named split and record, for compatibility and easier work with
different types of input data. In our model we use only a single approach to the Dataframe-to-JSON
conversion, as the data received from another service is always given in the same format (a sketch of
this conversion is shown below). The data conversion needed before sending the data to the model and
after receiving the data from the model was implemented in the methods of a separate MLFDataConverter
object; they do not depend on the model and exist in the model architecture as static methods which are
responsible only for data conversion for this single type of implementation.

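A minimal, illustrative converter (not the actual \texttt{MLFDataConverter}) building the
split-orientation payload, assuming purely numeric columns, could look like this.

\begin{verbatim}
import io.circe.Json
import io.circe.syntax._

object DataframeToJsonSketch {
  // Build the split-orientation JSON expected by an MLflow scoring server:
  // {"columns": [...], "data": [[row 1 values], [row 2 values], ...]}
  def toSplitJson(columns: List[String], rows: List[List[Double]]): Json =
    Json.obj(
      "columns" -> columns.asJson,
      "data"    -> rows.asJson)
}
\end{verbatim}
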
\section{MLflow Model Signature}

The MLflow model signature is, strictly speaking, not a required part of a model registered in the
registry provided by the MLflow library. In our approach we assume that every user of our library
saves the models in the registry in a way that includes creating the model signature.
We found this way of typing the model input and output the easiest one, because the final user of
Nussknacker does not have to know anything about the model implementation (as long as it was logged to
the registry with a signature). Additionally, creating the model with a typed signature ensures that
only valid input will be accepted by the model (which is not so obvious, since most of the model
management on the MLflow side is done in Python, which has no static typing). However, creating the
signature for MLflow models in Python has a single drawback, i.e. the model outputs do not have any
names and are given as an ordered list in the JSON response. During scoring we manually create
additional labels for the model outputs, naming the consecutive outputs \texttt{output\_0},
\texttt{output\_1}, etc. (see the sketch below). Thanks to this approach the final user of the
Nussknacker GUI can configure the model more easily, but still has to know the mapping between the
data of interest and its position in the output model map.

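A minimal sketch of this labelling step, with an assumed numeric output type, is shown below.

\begin{verbatim}
// Turn the ordered, unnamed outputs returned by the MLflow server into a map
// keyed by generated labels output_0, output_1, ...
def labelOutputs(outputs: List[Double]): Map[String, Double] =
  outputs.zipWithIndex.map { case (value, idx) => s"output_$idx" -> value }.toMap
\end{verbatim}
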
It is worth noting that in the case of MLflow models the saved model signature is stored on an external
data service such as Amazon S3 storage, which seems to be the default choice of MLflow users.
In Prinz we implemented access only to model signatures located on S3 storage, and here we see
room for future improvements to provide a more generic approach to fetching the signature data.