Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

remove support for script-based datasets #3046

Merged
merged 4 commits into from
Aug 26, 2024

Conversation

severo
Copy link
Collaborator

@severo severo commented Aug 23, 2024

fixes #3004

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@severo
Copy link
Collaborator Author

severo commented Aug 23, 2024

todo:

fix the worker tests
reenable the concurrency test
ignore .py files instead of DatasetWithScriptNotSupportedError?

removed

hub_responses_external_files
external_files_dataset_builder
hub_public_external_files
dataset_script_with_external_files_path
DATASET_SCRIPT_WITH_EXTERNAL_FILES_CONTENT
DATASET_SCRIPT_WITH_MANUAL_DOWNLOAD
dataset_script_with_manual_download_path
hub_public_manual_download
max_external_data_files
PARQUET_AND_INFO_MAX_EXTERNAL_DATA_FILES
_is_too_big_from_external_data_files
resolve_trust_remote_code
hub_public_legacy_n_configs
hub_public_legacy_configs
dataset_scripts_allow_list
trust_remote_code
dataset_script_with_two_configs_path
dataset_script_with_n_configs_path
is_dataset_script_error
EXTERNAL_DATASET_SCRIPT_PATTERN

code errors to be removed

UnsupportedExternalFilesError -> 1
ExternalFilesSizeRequestError -> 931
ExternalFilesSizeRequestHTTPError -> 182
ExternalFilesSizeRequestConnectionError -> 77
ExternalFilesSizeRequestTimeoutError -> 59
DatasetModuleNotInstalledError -> 643
DatasetScriptError -> 404

@severo
Copy link
Collaborator Author

severo commented Aug 23, 2024

hmmm the CI tests for services/worker fail if run serially, but not individually. Maybe we should create the repos when needed, not for the whole session.

I think it's due, for some reason, to the fact that we already accessed the gated (or the private) dataset, and it's in the cache. If I delete the cache directory for the gated dataset in ~/.cache/huggingface/..., I get the correct behavior (no token = no access). It means that, for some reason, we don't create a specific cache directory for the worker (ConfigSplitNamesJobRunner inherits on ConfigJobRunnerWithDatasetsCache, which means it should use a temporary cache dir, not the system global cache one)

@severo
Copy link
Collaborator Author

severo commented Aug 23, 2024

OK!!!! Not related at all with this PR. The reason is that in the tests, we use the global datasets cache, not the temporary cache, because pre_compute(), which sets the datasets cache directory, is called by job_manager, while the tests call directly the .compute() function.
Not sure how much is needed to fix this, I'll try to do it in another PR, then rebase this one.

@severo severo force-pushed the remove-support-for-script-based-datasets branch from cb9c1d5 to 9ecd26d Compare August 23, 2024 21:26
@severo
Copy link
Collaborator Author

severo commented Aug 23, 2024

fixed in #3047 + rebased

Copy link
Member

@albertvillanova albertvillanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome!!

This simplifies a lot the code and makes it more robust against security threats.

@severo
Copy link
Collaborator Author

severo commented Aug 26, 2024

OK, gogogo

@severo severo merged commit 2db67b3 into main Aug 26, 2024
29 checks passed
@severo severo deleted the remove-support-for-script-based-datasets branch August 26, 2024 08:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Remove allow list for script-based datasets
3 participants