-
Notifications
You must be signed in to change notification settings - Fork 7
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Download from HuggingFace with private token #390
Conversation
I managed to adapt the Traceback (most recent call last):
File "/home/mazitov/apps/anaconda3/envs/metatrain/lib/python3.11/site-packages/metatrain/__main__.py", line 105, in main
export_model(**args.__dict__)
TypeError: export_model() got an unexpected keyword argument 'huggingface_api_token' which is related to the fact that in the Even though loading the model explicitly worked for me, it only worked while loading from the |
# make sure to copy the checkpoint to the current directory | ||
shutil.copy(path, Path.cwd() / str(path).split("/")[-1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont understand why this is required?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And if you want the file to be directly downloaded to the current directory you can do
path, _ = urlretrieve(
url=str(path),
filename=str(Path.cwd() / str(path).split("/")[-1]))
See https://docs.python.org/3/library/urllib.request.html#urllib.request.urlretrieve.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is required for fine-tuning. mtt export
only makes the .pt
available in the current directory, but that's useless for fine-tuning
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you change this to use filename
in urlretrieve
, and add the same INFO message about adding a file to cwd?
I like this! Down the line, we could maybe add a way to read the API token from env variables or a file; and use such mechanism for all external integrations (W&B seems like a very logical one to add next). |
If you have a private model on Hugging Face, there are effective third-party tools like As for W&B, it’s a different story, and I fully support adding a functionality, because users can’t access the benefits of W&B without us integrating it directly into the code. |
I think having it enabled inside One point one which I am not very clear though is can one actually download a private HF model without a token? I.e. just
This is a good point, maybe a better interface here would look like
I'm not very concerned by this: one would hope that HF has enough workforce to keep this working without breaking changes, given their size and that this Python module is part of their core offering. It also only touch a relatively small part of our code and is fairly well isolated. But we should add a test on CI to catch any breaking change anyway! |
You might be right that this doesn't work. Would be quite some risk if just having the link is enough to fetch a model.
Okay we can see how this evolves. If the maintenance work is too high we can also easily remove the code again. But, yes a test in CI is the minimum we should do. |
Just for completeness: yes, public HF models should be available with |
fe7e06e
to
8d132e0
Compare
# get repo_id and filename | ||
split_path = path.split("/") | ||
repo_id = f"{split_path[3]}/{split_path[4]}" # org/repo | ||
filename = "" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of doing string manipulation, should we pass more structured information to this code? Copied from my comment on the overall PR:
maybe a better interface here would look like mtt export --mode huggingface --hf-api-token=.... --hf-owner=who/what --hf-file=file-path
(very bad syntax, just for the idea. This could also be huggingface://who/what/file-path
or huggingface://token@who/what/file-path
or whatever).
I.e. instead of trying to guess the info we need, ask the use about it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think so because HuggingFace gives you the download link, so that's what most users will use
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Asking for the "split information" on the command line is just more work for the user
src/metatrain/utils/io.py
Outdated
filename = filename[10:] | ||
path = hf_hub_download(repo_id, filename, token=kwargs["huggingface_api_token"]) | ||
# make sure to copy the checkpoint to the current directory | ||
shutil.copy(path, Path.cwd() / str(path).split("/")[-1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this be added to the log? Some kind of INFO: downloaded $FILE_NAME from hugingface.co/owner/repo to current directory
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure
# make sure to copy the checkpoint to the current directory | ||
shutil.copy(path, Path.cwd() / str(path).split("/")[-1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you change this to use filename
in urlretrieve
, and add the same INFO message about adding a file to cwd?
@@ -108,3 +110,41 @@ def test_reexport(monkeypatch, tmp_path): | |||
export_model(model_loaded, "exported_new.pt") | |||
|
|||
assert Path("exported_new.pt").is_file() | |||
|
|||
|
|||
def test_private_huggingface(monkeypatch, tmp_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will fail for people running the tests locally, unless they have access to the repo & a token set, right?
Could we skip the test if HUGGINGFACE_TOKEN_METATRAIN
is unset? And set it on CI/locally as needed. (using a different name in case people are already using HUGGINGFACE_TOKEN
for some other reason.
shutil.copy(path, Path.cwd() / filename) | ||
logger.info(f"Downloaded model from HuggingFace to {filename}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If I understand the code properly, this will download repo/owner/.../main/some/path/to/file.ckpt
to <cwd>/some/path/to/file.ckpt
. Is this intended? Otherwise we should be using os.path.basename(filename)
as the destination
shutil.copy(path, Path.cwd() / str(path).split("/")[-1]) | ||
logger.info(f"Downloaded model to {str(path).split('/')[-1]}") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shutil.copy(path, Path.cwd() / str(path).split("/")[-1]) | |
logger.info(f"Downloaded model to {str(path).split('/')[-1]}") | |
basename = os.path.basename(str(path)) | |
shutil.copy(path, Path.cwd() / basename) | |
logger.info(f"Downloaded model to ./{basename}") |
should be clearer and also work on windows where path are \
separated.
@frostedoyster could you NOT merge without approval … |
As in the title
Contributor (creator of pull-request) checklist
📚 Documentation preview 📚: https://metatrain--390.org.readthedocs.build/en/390/