-
Notifications
You must be signed in to change notification settings - Fork 537
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support Remote and HF promptfiles in hf_generate script #786
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The functionality seems good and I'd be happy to merge wrt that. I don't like the invention of a new URI format here tho. We use the standard like s3://
in other places, and making it different here is pretty confusing.
To not break existing functionality, I suggest you leave the file::
thing for local files, and then you just use s3://
etc for remote files. hf://
is the canonical prefix for huggingface things.
) | ||
|
||
local_path = prompt_path.split('/')[-1] | ||
get_file(path=prompt_path, destination=local_path, overwrite=True) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you probably want to make local_path
a tmp location?
|
||
|
||
def load_prompts_from_dataset(dataset_path: str, | ||
prompt_delimiter: Optional[str] = None |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd prefer you just add a new argument to the script for the dataset column name instead of overloading the delimiter param
Small QoL improvement for testing generation. Uses composer's
get_file
to support remote prompt files, and adds a bit of syntax for pointing to huggingface hub datasets as well.-p file::/local/path
was already supported. This adds-p file::s3://remote/path
and-p dataset::mosaicml/some-hub-dataset
. For HF datasets, it defaults to looking for a column namedprompt
but will use any string passed as theprompt_delimiter
as the column name (kind of abuses the API, but it felt understandable)@dakinggg if this is out of scope let me know and I can close this out. Most workloads like this end up going to inference... but I can see it being useful both internally and for customers.
EDIT: updated the syntax so remote files are just
s3://
and HF datasets arehf://
example invocations:
and