Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MLflow autologging issue # 1618 #2092

Draft
wants to merge 14 commits into
base: master
Choose a base branch
from

Conversation

dennisbader
Copy link
Collaborator

@dennisbader dennisbader commented Nov 24, 2023

Addresses #1618.

Summary

Draft PR for MLflow integration example making pytorch calls and activating tensorboard collection on Darts' s pytortch models to do autologging with MLflow and activate MLflow UI.

MLflow integration example making pytorch calls and activating tensorboard collection on Darts' s pytortch models to do autologging with MLflow and activate MLflow UI.
@codecov-commenter
Copy link

codecov-commenter commented Nov 24, 2023

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 93.97%. Comparing base (a0cc279) to head (ee55932).

Current head ee55932 differs from pull request most recent head 609a0d0

Please upload reports for the commit 609a0d0 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #2092      +/-   ##
==========================================
+ Coverage   93.75%   93.97%   +0.21%     
==========================================
  Files         138      136       -2     
  Lines       14352    13647     -705     
==========================================
- Hits        13456    12825     -631     
+ Misses        896      822      -74     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@cargecla1
Copy link

Hello @dennisbader,

As a follow up, is there anything else that I need to do?
What is/are the next step/steps?

Cheers!

@dennisbader
Copy link
Collaborator Author

Hi @cargecla1. No, all good from your side 🚀
Right now our capacity is a bit limited before the holidays, so we'll most likely have time to review this early next year.
Thanks again for the contribution 💯

@cargecla1
Copy link

Hi @cargecla1. No, all good from your side 🚀 Right now our capacity is a bit limited before the holidays, so we'll most likely have time to review this early next year. Thanks again for the contribution 💯

Hello @dennisbader,

Thanks for letting me know.
And thank you for supporting this review.

Merry Xmas!

Copy link

@cargecla1 cargecla1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with this changes, there are as passed on.


with mlflow.start_run(nested=True) as run:

dataset: PandasDataset = mlflow.data.from_pandas(series, source="AirPassengersDataset")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO should call series.pd_dataframe() because in my tests it worked only on DataFrames not Series:
dataset: PandasDataset = mlflow.data.from_pandas(series.pd_dataframe(), source="AirPassengersDataset")

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good suggestion

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added via commit c15eb1e
Thank you!

docs/userguide/torch_forecasting_models.md Show resolved Hide resolved

# Registering model
model_uri = f"runs:/{run.info.run_id}/darts-NBEATS"
mlflow.register_model(model_uri=model_uri, name=model_name)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO runs currently do not save models as artifacts becaus there is no call to log_model(). Therefore model_uri is not pointing to a valid model. Please correct me if I'm wrong, however, it did not work in my tests. If it works, can you pleas add an example how to load with loaded_model = mlflow.<flavor>.load_model(model_uri=model_uri) ?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will need to have a look at this, my understanding is that it saves it both using mlflow.<model_flavor>.log_model() or mlflow.register_model(). Refer to https://mlflow.org/docs/latest/model-registry.html.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As i understand it you can only (kind of) promote a already saved model to a registered model..

Model
An MLflow Model is created from an experiment or run that is logged with one of the model flavor’s mlflow.<model_flavor>.log_model() methods. Once logged, this model can then be registered with the Model Registry.

Registered Model
An MLflow Model can be registered with the Model Registry. A registered model has a unique name, contains versions, aliases, tags, and other metadata.
https://mlflow.org/docs/latest/model-registry.html#concepts

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As i understand it you can only (kind of) promote a already saved model to a registered model..

Model
An MLflow Model is created from an experiment or run that is logged with one of the model flavor’s mlflow.<model_flavor>.log_model() methods. Once logged, this model can then be registered with the Model Registry.
Registered Model
An MLflow Model can be registered with the Model Registry. A registered model has a unique name, contains versions, aliases, tags, and other metadata.
https://mlflow.org/docs/latest/model-registry.html#concepts

Hello @turbotimon ,
My plans was to share how to do the first part of issue # 1618. As it allows you to track the experiments (so one don't get lost if you are doing more of a few iterations), select the best one, see metrics progress, etc.
I will have a look at what you are suggesting. However, the normal mlflow.python.loadmodel or mlflow.pytorch.loadmodel doesn't seems to work because the way Darts is wrapping torch.nn modules. All test I have done so far didn't work to save and load the model. Only to record hyperparameters, other torchmetrics related information, dataset hash, etc. Thus, I propose to share this, and later we can add saving and loading models (but my last weekend research it is pointing out that the model.py files and common models may need to be rewritten to be compatible with MLFlow (I might be wrong).

Cheers!

Copy link
Contributor

@turbotimon turbotimon Feb 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cargecla1 Yes, that's a good idea to split these two things

If using the mlflow model registry for darts completely fails, you could also mentioning the workaround I proposed in the issue. Which was manually saving/loading the model as an artifact. Something like:

dartsmodel.save("mymodel.pickle")
mlflow.log_artifact("mymodel.pickle")
# later, load artifact from mlflow and do e.g. a dartsmodel=RNNModel.load("mymodel.pickle")

Let me know if i can help anything!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @turbotimon

After trying to save the Darts model using dartsmodel.save("mymodel.pickle") as suggested, I am getting the following error:

File ~.conda\envs\venv_darts\lib\site-packages\torch\serialization.py:653, in _save(obj, zip_file, pickle_module, pickle_protocol) 651 pickler = pickle_module.Pickler(data_buf, protocol=pickle_protocol) 652 pickler.persistent_id = persistent_id --> 653 pickler.dump(obj) 654 data_value = data_buf.getvalue() 655 zip_file.write_record('data.pkl', data_value, len(data_value))

AttributeError: Can't pickle local object 'LayerSummary._register_hook..hook'

Would you know why this is happening?

Cheers

Hi @cla-ra3426, sorry i missed your question. No idea.. but must have to do somehting with darts itself and not mlflow i suppose

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cargecla1 Yes, that's a good idea to split these two things
If using the mlflow model registry for darts completely fails, you could also mentioning the workaround I proposed in the issue. Which was manually saving/loading the model as an artifact. Something like:

dartsmodel.save("mymodel.pickle")
mlflow.log_artifact("mymodel.pickle")
# later, load artifact from mlflow and do e.g. a dartsmodel=RNNModel.load("mymodel.pickle")

Let me know if i can help anything!

Hello @turbotimon , @madtoinou , @dennisbader,

Is it possible to split this issue as discussed above? so we can share how to use, train, track, monitor and save the models using MLFlow? Just leaving loading the model for a later release?

Cheers

Hello gents (@turbotimon , @madtoinou , @dennisbader,)

Have you had time to consider my proposal above? "Is it possible to split this issue as discussed above? so we can share how to use, train, track, monitor and save the models using MLFlow? Just leaving loading the model for a later release?"

The only solution I have right now it is to train the model with MLFlow to track and monitor the model and retrain it with pure Darts to be able to save it and load to predict, Darts saving method "fails" when run inside a MLFlow run, and MLFlow log and save methods don't want to work with Darts either.

Thank you in advice for your consideration!

Cheers!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @cargecla1,

Sorry for the delay, loading model is kind-of part of linked issue so it would probably be better to also include it in this PR. But if it too troublesome, we can treat it separately.

@cla-ra3426,

The error you're getting when trying to pickle the model is probably due to the (pytorch-lightning) callbacks. Can you try removing them prior to exporting the model?

I will test this example more thoroughly when I have more time, try to see if I can come up with a solution for the loading.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @madtoinou

I will try this and see how it goes.

Thanks for this suggestion and for considering splitting the problem into two if above doesn't work.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @madtoinou ,

I took your suggestion on board as per above, but that didn't fixed the issue with loading and predicting steps.

Refer to commit f798244

Adding first two suggestions raised in MLflow autologging issue # 1618 unit8co#2092 via comments.
Adding code to save pip requirements and conda environment required to run the model via MLFlow.
Adding a way to track and save model before registering it using MLFlow.
Expanding comment to create a new cell to log the artifact and register the model after a model run id has been created.
docs/userguide/torch_forecasting_models.md Show resolved Hide resolved
Comment on lines 514 to 522
mlflow.log_param("model_type", "Darts_Pytorch_model")
mlflow.log_param("input_chunk_length", 24)
mlflow.log_param("output_chunk_length", 12)
mlflow.log_param("n_epochs", 500)
mlflow.log_param("model_name", 'NBEATS_MLflow')
mlflow.log_param("log_tensorboard", True)
mlflow.log_param("torch_metrics", "torchmetrics.regression.MeanAbsolutePercentageError()")
mlflow.log_param("nr_epochs_val_period", 1)
mlflow.log_param("pl_trainer_kwargs", "{callbacks: [loss_logger]}")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's make this a bit more concise, with a list of parameters to log and access them in the model.model_params, model.pl_module_params or model.trainer_params attributes (potentially after fusing these dictionaries into one).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @madtoinou
I made the code more concise as suggested. Refer to commit f798244

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @madtoinou , @dennisbader ,

I guess you may have been busy with the latest Darts release. Did you have time to look at my latest changes? Does this work to merge the first part of the MLFlow integration as previously discussed?

Cheers!

By the way, nicely done! I need to test the latest release speeds. :)

Updating table of content at the top of the file/page.
Making the log parameters method more concise as suggested.
Copy link
Collaborator

@madtoinou madtoinou left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for making the correction, I still have some minor comments.

@@ -25,6 +25,7 @@ We assume that you already know about covariates in Darts. If you're new to the
- [Callbacks](#callbacks)
- [Early Stopping](#example-with-early-stopping)
- [Custom Callback](#example-of-custom-callback-to-store-losses)
- [MLFlow: train, track and monitor](#example-with-mlflow-autologging)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that the indentation is off, can you shift it to the left once so that it's at the same level as "Callbacks"?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @madtoinou ,

I fixed this via commit b839c67

docs/userguide/torch_forecasting_models.md Outdated Show resolved Hide resolved
cargecla1 and others added 5 commits April 29, 2024 14:23
Fixed indentation that was off.
Adding suggestion, thanks!

Co-authored-by: madtoinou <[email protected]>
Resolving torchmetrics issue.
Updating the MLFlow saving method including tracking uri.
@cargecla1
Copy link

Thank you for making the correction, I still have some minor comments.

Hello @madtoinou ,

I just updated torchmetrics and saving model method to capture your latest suggestions with the **params. Hopefully this is work done now. :)

Cheers!

@cargecla1
Copy link

Thank you for making the correction, I still have some minor comments.

Hello @madtoinou ,

I just updated torchmetrics and saving model method to capture your latest suggestions with the **params. Hopefully this is work done now. :)

Cheers!

Hello @madtoinou, @dennisbader ,

Any updates, comments from my latest updates/commits?

Cheers!

@cargecla1 cargecla1 mentioned this pull request May 14, 2024
@cargecla1
Copy link

Thank you for making the correction, I still have some minor comments.

Hello @madtoinou ,
I just updated torchmetrics and saving model method to capture your latest suggestions with the **params. Hopefully this is work done now. :)
Cheers!

Hello @madtoinou, @dennisbader ,

Any updates, comments from my latest updates/commits?

Cheers!

Hello @madtoinou , @dennisbader ,

Did you have chance to have a look a my latest changes to this PR?, any feedback will be appreciated.

Cheers!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants