Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Download from HuggingFace with private token #390

Merged
merged 16 commits into from
Nov 25, 2024
Merged

Conversation

frostedoyster
Copy link
Collaborator

@frostedoyster frostedoyster commented Nov 13, 2024

As in the title

Contributor (creator of pull-request) checklist

  • Tests updated (for new features and bugfixes)?
  • Documentation updated (for new features)?
  • Issue referenced (for PRs that solve an issue)?

📚 Documentation preview 📚: https://metatrain--390.org.readthedocs.build/en/390/

@abmazitov
Copy link
Contributor

I managed to adapt the PET.load_checkpoint() method, so now it works with the HuggingFace checkpoint that we have. However the mtt export API is still not well-adapted. I keep getting the following error:

Traceback (most recent call last):
  File "/home/mazitov/apps/anaconda3/envs/metatrain/lib/python3.11/site-packages/metatrain/__main__.py", line 105, in main
    export_model(**args.__dict__)
TypeError: export_model() got an unexpected keyword argument 'huggingface_api_token'

which is related to the fact that in the load_model function, we pop the 'huggingface_api_token' key from kwargs, which is the copy of the args.__dict__, so the original dict still has the unexpected key.


Even though loading the model explicitly worked for me, it only worked while loading from the abmazitov/pet-mad and not from the metatensor/pet-mad. Maybe I did something wrong...

@frostedoyster frostedoyster marked this pull request as ready for review November 14, 2024 10:31
docs/src/getting-started/checkpoints.rst Outdated Show resolved Hide resolved
src/metatrain/utils/io.py Show resolved Hide resolved
Comment on lines +137 to +138
# make sure to copy the checkpoint to the current directory
shutil.copy(path, Path.cwd() / str(path).split("/")[-1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dont understand why this is required?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And if you want the file to be directly downloaded to the current directory you can do

path, _ = urlretrieve(
    url=str(path),
    filename=str(Path.cwd() / str(path).split("/")[-1]))

See https://docs.python.org/3/library/urllib.request.html#urllib.request.urlretrieve.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required for fine-tuning. mtt export only makes the .pt available in the current directory, but that's useless for fine-tuning

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change this to use filename in urlretrieve, and add the same INFO message about adding a file to cwd?

@Luthaf
Copy link
Member

Luthaf commented Nov 14, 2024

I like this! Down the line, we could maybe add a way to read the API token from env variables or a file; and use such mechanism for all external integrations (W&B seems like a very logical one to add next).

@PicoCentauri
Copy link
Contributor

If you have a private model on Hugging Face, there are effective third-party tools like wget, curl, or the Hugging Face web interface to download it. Adding a private API interface introduces overhead and complexity to the codebase, and for a private Hugging Face repo, this doesn’t provide significant benefits, as there are accessible alternatives. The presented workaround also introduces an additional dependency, and I’m concerned it will require maintenance soon. We’re moving around URL schemas, and, to my knowledge, Hugging Face doesn’t have a stable schema, right?

As for W&B, it’s a different story, and I fully support adding a functionality, because users can’t access the benefits of W&B without us integrating it directly into the code.

@Luthaf
Copy link
Member

Luthaf commented Nov 15, 2024

If you have a private model on Hugging Face, there are effective third-party tools like wget, curl, or the Hugging Face web interface to download it.

I think having it enabled inside mtt export makes sense: we already download the models ourself in all other cases, no need for an external tool.

One point one which I am not very clear though is can one actually download a private HF model without a token? I.e. just wget https://https://huggingface.co/........./? If so, then I agree there is no need for this code to be added. But I would be very surprised if one can download a private model without some form of authentication.

We’re moving around URL schemas, and, to my knowledge, Hugging Face doesn’t have a stable schema, right?

This is a good point, maybe a better interface here would look like mtt export --mode huggingface --hf-api-token=.... --hf-owner=who/what --hf-file=file-path (very bad syntax, just for the idea. This could also be huggingface://who/what/file-path or huggingface://token@who/what/file-path or whatever).

The presented workaround also introduces an additional dependency, and I’m concerned it will require maintenance soon.

I'm not very concerned by this: one would hope that HF has enough workforce to keep this working without breaking changes, given their size and that this Python module is part of their core offering. It also only touch a relatively small part of our code and is fairly well isolated. But we should add a test on CI to catch any breaking change anyway!

@PicoCentauri
Copy link
Contributor

One point one which I am not very clear though is can one actually download a private HF model without a token? I.e. just wget https://https://huggingface.co/........./? If so, then I agree there is no need for this code to be added. But I would be very surprised if one can download a private model without some form of authentication.

You might be right that this doesn't work. Would be quite some risk if just having the link is enough to fetch a model.

I'm not very concerned by this: one would hope that HF has enough workforce to keep this working without breaking changes, given their size and that this Python module is part of their core offering. It also only touch a relatively small part of our code and is fairly well isolated. But we should add a test on CI to catch any breaking change anyway!

Okay we can see how this evolves. If the maintenance work is too high we can also easily remove the code again. But, yes a test in CI is the minimum we should do.

@frostedoyster
Copy link
Collaborator Author

Just for completeness: yes, public HF models should be available with wget without going through the new HF functionality here, which is just there for private models

Comment on lines +124 to +127
# get repo_id and filename
split_path = path.split("/")
repo_id = f"{split_path[3]}/{split_path[4]}" # org/repo
filename = ""
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of doing string manipulation, should we pass more structured information to this code? Copied from my comment on the overall PR:

maybe a better interface here would look like mtt export --mode huggingface --hf-api-token=.... --hf-owner=who/what --hf-file=file-path (very bad syntax, just for the idea. This could also be huggingface://who/what/file-path or huggingface://token@who/what/file-path or whatever).

I.e. instead of trying to guess the info we need, ask the use about it

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think so because HuggingFace gives you the download link, so that's what most users will use

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asking for the "split information" on the command line is just more work for the user

filename = filename[10:]
path = hf_hub_download(repo_id, filename, token=kwargs["huggingface_api_token"])
# make sure to copy the checkpoint to the current directory
shutil.copy(path, Path.cwd() / str(path).split("/")[-1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be added to the log? Some kind of INFO: downloaded $FILE_NAME from hugingface.co/owner/repo to current directory

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure

Comment on lines +137 to +138
# make sure to copy the checkpoint to the current directory
shutil.copy(path, Path.cwd() / str(path).split("/")[-1])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you change this to use filename in urlretrieve, and add the same INFO message about adding a file to cwd?

@@ -108,3 +110,41 @@ def test_reexport(monkeypatch, tmp_path):
export_model(model_loaded, "exported_new.pt")

assert Path("exported_new.pt").is_file()


def test_private_huggingface(monkeypatch, tmp_path):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will fail for people running the tests locally, unless they have access to the repo & a token set, right?

Could we skip the test if HUGGINGFACE_TOKEN_METATRAIN is unset? And set it on CI/locally as needed. (using a different name in case people are already using HUGGINGFACE_TOKEN for some other reason.

@frostedoyster frostedoyster merged commit 1ffccb4 into main Nov 25, 2024
13 checks passed
@frostedoyster frostedoyster deleted the huggingface-download branch November 25, 2024 14:04
Comment on lines +151 to +152
shutil.copy(path, Path.cwd() / filename)
logger.info(f"Downloaded model from HuggingFace to {filename}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understand the code properly, this will download repo/owner/.../main/some/path/to/file.ckpt to <cwd>/some/path/to/file.ckpt. Is this intended? Otherwise we should be using os.path.basename(filename) as the destination

Comment on lines 157 to 158
shutil.copy(path, Path.cwd() / str(path).split("/")[-1])
logger.info(f"Downloaded model to {str(path).split('/')[-1]}")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
shutil.copy(path, Path.cwd() / str(path).split("/")[-1])
logger.info(f"Downloaded model to {str(path).split('/')[-1]}")
basename = os.path.basename(str(path))
shutil.copy(path, Path.cwd() / basename)
logger.info(f"Downloaded model to ./{basename}")

should be clearer and also work on windows where path are \ separated.

@Luthaf
Copy link
Member

Luthaf commented Nov 25, 2024

@frostedoyster could you NOT merge without approval …

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants