log_artifact: external and non-DVC tracked files support #551

Open
shcheklein opened this issue May 1, 2023 · 8 comments
Labels
A: log_artifact Area: `live.log_artifact` p3-nice-to-have question Further information is requested

Comments

@shcheklein
Member

More of a question for now:

  • Should DVC tracking be optional / disabled by default? E.g. if model weights are in Git-lfs? I can still see the model in the registry, but I don't need DVC remote, etc, etc.
  • How should we treat S3 files? dvc.yaml should support them I think. Do we need to create import file .dvc or not?
@shcheklein shcheklein added the question Further information is requested label May 1, 2023
@daavoo
Contributor

daavoo commented May 1, 2023

Feels like the questions are assuming log_artifact is coupled with the model registry but strictly speaking, it is not.

As of today, it is decoupled from an implementation perspective (log_artifact creates the .dvc but it is make_dvcyaml that writes the artifacts section), but I would also like to think that it should not be coupled from a product perspective.

For those scenarios, why would you want to use the log_artifact Python API for registering the model? Is it more convenient than writing the artifacts section in the dvc.yaml or using the UI?

If we still want a Python API, should we make it part of dvc.api? Does it belong in DVCLive logger?

@dberenbaum
Collaborator

  • Should DVC tracking be optional / disabled by default? E.g. if model weights are in Git-lfs? I can still see the model in the registry, but I don't need DVC remote, etc, etc.

The use case I can think of is the Hugging Face integration. Is that what you have in mind?

Would we also make dvc get work with git-lfs? Do you have a use case where model registry is useful without being able to retrieve the artifact?

  • How should we treat S3 files? dvc.yaml should support them I think. Do we need to create import file .dvc or not?

@daavoo If we import the model, isn't that part of the core functionality of log_artifact? I think this is an interesting idea because it helps introduce a way to manage external data, which is a major source of confusion today.

@shcheklein
Member Author

Would we also make dvc get work with git-lfs?

I don't know yet. In the end, people might also decide on their own exactly how to bring the artifact from a commit.

@dberenbaum
Collaborator

I don't know yet. In the end, people might also decide on their own exactly how to bring the artifact from a commit.

Can we come up with a use case where model registry is needed in this scenario?

@shcheklein
Member Author

Can we come up with a use case where model registry is needed in this scenario?

To be honest, I don't see what difference it makes whether it is DVC-tracked or not. All the same scenarios apply, no? Find a specific version of a model (by a tag) and fetch it to deploy. Assign stages, etc. In this case, dvc.yaml helps to see them in the MR and to see some additional metadata.

Could you maybe clarify your question, @dberenbaum?

@dberenbaum
Collaborator

people might also decide on their own exactly how to bring the artifact from a commit

Find a specific version of a model (by a tag) and fetch it to deploy.

How do you envision this workflow if the artifact is managed by git lfs? What commands would I run in my deploy script?

@shcheklein
Member Author

@dberenbaum I'm not that familiar with Git LFS, but from what I remember you could probably manage it with git pull, or, in the case of S3 (e.g. HF does it with Git LFS), even get a link to an artifact. Again, I'm not 100% sure, but I would be surprised if there were a limitation like not being able to fetch a file from a specific revision.

@dberenbaum
Collaborator

There are two mechanisms we could use in dvc for this:

  1. Use dvc import-url --no-download. This already exists and allows the user to still have the option to get/pull the data into the repo later, but it only works for external files (I don't think it will work with git-lfs files for example).
  2. We could easily add some option like dvc add --no-cache which would add cache: false to the resulting .dvc file and work with external files. You can't retrieve the files, but it's simpler and closer to what other loggers provide for external files (and probably simpler for cli users looking to track external files).
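
The two mechanisms above can be put side by side as command lines. This is only a rough sketch, not DVC's implementation: `build_command` is a hypothetical helper, the bucket URL is made up, and `dvc add --no-cache` is just the option proposed in this comment (it does not exist today), whereas `dvc import-url --no-download` is the existing command.

```python
from typing import List


def build_command(mechanism: str, target: str) -> List[str]:
    """Return the argv for one of the two tracking mechanisms discussed above."""
    if mechanism == "import-url":
        # Existing command: records the external URL in a .dvc file without
        # downloading the data, so it can still be pulled into the repo later.
        return ["dvc", "import-url", "--no-download", target]
    if mechanism == "add-no-cache":
        # Proposed only: `--no-cache` is NOT a real `dvc add` flag today.
        # The idea is that it would write `cache: false` to the .dvc file.
        return ["dvc", "add", "--no-cache", target]
    raise ValueError(f"unknown mechanism: {mechanism!r}")


if __name__ == "__main__":
    url = "s3://example-bucket/model.pt"  # made-up URL for illustration
    print(" ".join(build_command("import-url", url)))
    print(" ".join(build_command("add-no-cache", url)))
```

The helper only builds the argument lists and never runs them, so the second branch can be read as a proposal rather than something a current DVC install would accept.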

Neither of these automatically detects whether the files are version-aware today. It would be great if we could add support for that in dvc, since I see it in other loggers, but I can't remember the obstacles to doing it (cc @pmrowla).

  • Should DVC tracking be optional / disabled by default? E.g. if model weights are in Git-lfs? I can still see the model in the registry, but I don't need DVC remote, etc, etc.

Neptune is the only logger I have found that supports tracking local files without uploading them, so I'm not sure it should be a high priority, but it's possible to support it with option 2 above.

How should we expose this functionality in dvclive? Some options:

  1. Use it automatically for external artifacts (Live.log_artifact("s3://...")).
  2. Change Live.log_artifact(cache=False) to use this (we may have to tweak the lightning callback).
  3. Add another arg for it in Live.log_artifact().
  4. Add a separate method like Live.log_url() or Live.log_reference().
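
As a rough illustration of how options 1 and 2 might combine, the sketch below decides whether an artifact path would be tracked by reference or cached as usual. Everything here is an assumption for discussion: `EXTERNAL_SCHEMES`, `is_external`, and `choose_tracking` are hypothetical names, and none of this is DVCLive's actual behavior.

```python
from urllib.parse import urlparse

# Assumed set of URL schemes treated as "external"; not taken from DVC or DVCLive.
EXTERNAL_SCHEMES = {"s3", "gs", "azure", "http", "https", "ssh", "hdfs"}


def is_external(path: str) -> bool:
    """Return True if the path looks like a remote URL rather than a local file."""
    # Checking against an explicit set avoids misreading Windows drive
    # letters (e.g. "C:\\models") as URL schemes.
    return urlparse(path).scheme in EXTERNAL_SCHEMES


def choose_tracking(path: str, cache: bool = True) -> str:
    """Pick a tracking mode for an artifact (hypothetical policy)."""
    if is_external(path):
        # Option 1: external artifacts are tracked by reference automatically.
        return "reference"
    if not cache:
        # Option 2: cache=False on a local file also means "reference only".
        return "reference"
    return "cache"


if __name__ == "__main__":
    for p, c in [("s3://example-bucket/model.pt", True),
                 ("models/model.pt", True),
                 ("models/model.pt", False)]:
        print(p, c, choose_tracking(p, cache=c))
```

Options 3 and 4 would keep the same decision but expose it through a new argument or a separate method instead of inferring it from the path.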

Some other loggers for comparison (note that mlflow does not support this pattern at all AFAICT):


3 participants