
MLflow autologging issue # 1618 #2092

Draft: wants to merge 14 commits into base: master
93 changes: 93 additions & 0 deletions docs/userguide/torch_forecasting_models.md
@@ -25,6 +25,7 @@ We assume that you already know about covariates in Darts. If you're new to the
- [Callbacks](#callbacks)
- [Early Stopping](#example-with-early-stopping)
- [Custom Callback](#example-of-custom-callback-to-store-losses)
- [MLflow: train, track and monitor](#example-with-mlflow-autologging)

4. [Performance optimisation section](#performance-recommendations) lists tricks to speed up the computation during training.

@@ -462,6 +463,98 @@ model.fit(...)

*Note* : The callback will give one more element in the `loss_logger.val_loss` as the model trainer performs a validation sanity check before the training begins.

#### Example with MLflow Autologging

Use the MLflow user interface (UI) and autologging to track Darts' PyTorch models.
```python
import pandas as pd
import torchmetrics
from torchmetrics import MeanAbsolutePercentageError
from darts.dataprocessing.transformers import Scaler
from darts.datasets import AirPassengersDataset
from darts.models import NBEATSModel

# read data
series = AirPassengersDataset().load()

# create training and validation sets:
train, val = series.split_after(pd.Timestamp(year=1957, month=12, day=1))

# normalize the time series
transformer = Scaler()
train = transformer.fit_transform(train)
val = transformer.transform(val)

# MLflow setup
# Run this command with the environment activated: mlflow ui --port xxxx (e.g. 5000, 5001, 5002)
# Then copy and paste the URL from the command line into a web browser
import mlflow
from mlflow.data.pandas_dataset import PandasDataset

mlflow.pytorch.autolog(
    log_every_n_epoch=1, log_every_n_step=None,
    log_models=True, log_datasets=True, disable=False,
    exclusive=False, disable_for_unsupported_versions=False,
    silent=False, registered_model_name=None, extra_tags=None,
)

import mlflow.pytorch
from mlflow.client import MlflowClient

model_name = "darts-NBEATS"

with mlflow.start_run(nested=True) as run:

    dataset: PandasDataset = mlflow.data.from_pandas(series.pd_dataframe(), source="AirPassengersDataset")

    # Log the dataset to the MLflow Run. Specify the "training" context to indicate that the
    # dataset is used for model training
    mlflow.log_input(dataset, context="training")

    # Define model hyperparameters to log
    params = {
        "input_chunk_length": 24,
        "output_chunk_length": 12,
        "n_epochs": 500,
        "model_name": "NBEATS_MLflow",
        "log_tensorboard": True,
        "torch_metrics": MeanAbsolutePercentageError(),
        "nr_epochs_val_period": 1,
    }

    # Log hyperparameters
    mlflow.log_params(params)

    # create the model
    model = NBEATSModel(
        **params,
    )

    # use validation dataset
    model.fit(
        series=train,
        val_series=val,
    )

    # predict
    forecast = model.predict(len(val))

    # Default conda environment for the pytorch flavor (this call only returns it)
    mlflow.pytorch.get_default_conda_env()

    # Default pip requirements for the pytorch flavor (this call only returns them)
    mlflow.pytorch.get_default_pip_requirements()

    # Build the model URI for this run
    model_uri = f"runs:/{run.info.run_id}/darts-NBEATS"

    # Save the Darts model to a local path using the sklearn flavor
    model_path = 'nbeats_air_passengers'
    mlflow.sklearn.save_model(model, model_path)

    # Registering model
    mlflow.register_model(model_uri=model_uri, name=model_name)

IMHO, runs currently do not save models as artifacts because there is no call to log_model(). Therefore, model_uri does not point to a valid model. Please correct me if I'm wrong; however, it did not work in my tests. If it works, can you please add an example of how to load it with loaded_model = mlflow.<flavor>.load_model(model_uri=model_uri)?

I will need to have a look at this. My understanding is that the model is saved when using either mlflow.<model_flavor>.log_model() or mlflow.register_model(). Refer to https://mlflow.org/docs/latest/model-registry.html.

As I understand it, you can only (kind of) promote an already saved model to a registered model.

Model
An MLflow Model is created from an experiment or run that is logged with one of the model flavor’s mlflow.<model_flavor>.log_model() methods. Once logged, this model can then be registered with the Model Registry.

Registered Model
An MLflow Model can be registered with the Model Registry. A registered model has a unique name, contains versions, aliases, tags, and other metadata.
https://mlflow.org/docs/latest/model-registry.html#concepts

> As I understand it, you can only (kind of) promote an already saved model to a registered model.
>
> Model
> An MLflow Model is created from an experiment or run that is logged with one of the model flavor’s mlflow.<model_flavor>.log_model() methods. Once logged, this model can then be registered with the Model Registry.
> Registered Model
> An MLflow Model can be registered with the Model Registry. A registered model has a unique name, contains versions, aliases, tags, and other metadata.
> https://mlflow.org/docs/latest/model-registry.html#concepts

Hello @turbotimon,
My plan was to share how to do the first part of issue #1618, as it allows you to track experiments (so you don't get lost when doing more than a few iterations), select the best one, see metrics progress, etc.
I will have a look at what you are suggesting. However, the normal mlflow.python.loadmodel or mlflow.pytorch.loadmodel doesn't seem to work because of the way Darts wraps torch.nn modules. None of the tests I have done so far managed to save and load the model; they only recorded hyperparameters, other torchmetrics-related information, the dataset hash, etc. Thus, I propose to share this now and add saving and loading models later. My research last weekend suggests that the model.py files and common models may need to be rewritten to be compatible with MLflow, but I might be wrong.

Cheers!

@turbotimon turbotimon Feb 21, 2024

@cargecla1 Yes, that's a good idea to split these two things

If using the MLflow model registry for Darts fails completely, you could also mention the workaround I proposed in the issue, which was manually saving/loading the model as an artifact. Something like:

dartsmodel.save("mymodel.pickle")
mlflow.log_artifact("mymodel.pickle")
# later, load artifact from mlflow and do e.g. a dartsmodel=RNNModel.load("mymodel.pickle")

Let me know if I can help with anything!

Hello @turbotimon

After trying to save the Darts model using dartsmodel.save("mymodel.pickle") as suggested, I am getting the following error:

File ~.conda\envs\venv_darts\lib\site-packages\torch\serialization.py:653, in _save(obj, zip_file, pickle_module, pickle_protocol)
    651 pickler = pickle_module.Pickler(data_buf, protocol=pickle_protocol)
    652 pickler.persistent_id = persistent_id
--> 653 pickler.dump(obj)
    654 data_value = data_buf.getvalue()
    655 zip_file.write_record('data.pkl', data_value, len(data_value))

AttributeError: Can't pickle local object 'LayerSummary._register_hook.<locals>.hook'

Would you know why this is happening?

Cheers

Hi @cla-ra3426, sorry I missed your question. No idea, but it must have something to do with Darts itself and not MLflow, I suppose.

> @cargecla1 Yes, that's a good idea to split these two things
> If using the MLflow model registry for Darts fails completely, you could also mention the workaround I proposed in the issue, which was manually saving/loading the model as an artifact. Something like:
>
> dartsmodel.save("mymodel.pickle")
> mlflow.log_artifact("mymodel.pickle")
> # later, load artifact from mlflow and do e.g. a dartsmodel=RNNModel.load("mymodel.pickle")
>
> Let me know if I can help with anything!

Hello @turbotimon, @madtoinou, @dennisbader,

Is it possible to split this issue as discussed above, so we can share how to use, train, track, monitor and save the models using MLflow, leaving loading the model for a later release?

Cheers

Hello gents (@turbotimon, @madtoinou, @dennisbader),

Have you had time to consider my proposal above? "Is it possible to split this issue as discussed above? so we can share how to use, train, track, monitor and save the models using MLFlow? Just leaving loading the model for a later release?"

The only solution I have right now is to train the model with MLflow to track and monitor it, and then retrain it with pure Darts to be able to save it and load it for prediction. The Darts saving method "fails" when run inside an MLflow run, and the MLflow log and save methods don't work with Darts either.

Thank you in advance for your consideration!

Cheers!

Hi @cargecla1,

Sorry for the delay. Loading the model is kind of part of the linked issue, so it would probably be better to also include it in this PR. But if it is too troublesome, we can treat it separately.

@cla-ra3426,

The error you're getting when trying to pickle the model is probably due to the (pytorch-lightning) callbacks. Can you try removing them prior to exporting the model?

I will test this example more thoroughly when I have more time and try to see if I can come up with a solution for the loading.

Hello @madtoinou

I will try this and see how it goes.

Thanks for this suggestion and for considering splitting the problem into two if the above doesn't work.

Hello @madtoinou,

I took your suggestion on board as per above, but that didn't fix the issue with the loading and predicting steps.

Refer to commit f798244

```
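
The `Can't pickle local object` error reported in the review thread can be reproduced without Darts or MLflow. The following is a minimal sketch using only the Python standard library. `make_hook`, `try_pickle`, and the `state` dictionary are hypothetical stand-ins for illustration, not Darts or PyTorch Lightning APIs. It shows why an object holding a function defined inside another function cannot be pickled, and why dropping such hooks before saving avoids that particular error. It does not claim to resolve the loading problem discussed above.

```python
import pickle


def make_hook():
    # A function defined inside another function is a "local object":
    # pickle stores functions by qualified name and cannot look this one up.
    def hook(value):
        return value

    return hook


def try_pickle(obj):
    """Return the pickled bytes, or None if pickling fails."""
    try:
        return pickle.dumps(obj)
    except (AttributeError, pickle.PicklingError) as err:
        print(f"pickling failed: {err}")
        return None


# Hypothetical stand-in for a model's state: plain data plus attached hooks.
state = {"weights": [0.1, 0.2, 0.3], "hooks": [make_hook()]}

# Fails while the locally defined hook is part of the state.
failed = try_pickle(state)

# Dropping the hooks first leaves only plain, picklable data.
state["hooks"] = []
payload = try_pickle(state)
restored = pickle.loads(payload)
```

The exception type is caught broadly because it can differ between Python versions and pickle implementations.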

## Performance Recommendations
This section recaps the main factors impacting the performance when
training and using torch-based models.