-
Notifications
You must be signed in to change notification settings - Fork 81
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Disable dataset scripts #2001
Disable dataset scripts #2001
Conversation
Codecov ReportAttention:
Additional details and impacted files@@ Coverage Diff @@
## main #2001 +/- ##
==========================================
+ Coverage 86.51% 90.27% +3.76%
==========================================
Files 66 143 +77
Lines 3455 8195 +4740
==========================================
+ Hits 2989 7398 +4409
- Misses 466 797 +331
Flags with carried forward coverage won't be shown. Click here to find out more.
☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to raise the exception in every job that uses the datasets library to run the user script
@@ -88,6 +89,8 @@ | |||
MAX_FILES_PER_DIRECTORY = 10_000 # hf hub limitation | |||
MAX_OPERATIONS_PER_COMMIT = 500 | |||
|
|||
DATASET_SCRIPTS_ALLOW_LIST = ["canonical"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as we will surely apply the restriction to other job runners, move it to a shared file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I used an environment variable that is used in the common config for workers
services/worker/src/worker/utils.py
Outdated
dynamic_modules_path: Optional[str] = None, | ||
) -> None: | ||
for allowed_pattern in allow_list: | ||
if (allowed_pattern == "canonical" and "/" not in name) or fnmatch(name, allowed_pattern): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe a more specific keyword instead of "canonical", as this org exists? (https://huggingface.co/Canonical)
even if it's handled by your code, it would feel less bug-prone.
You can use a forbidden character for example (see https://github.com/huggingface/moon-landing/blob/main/server/lib/Names.ts - internal)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
renamed {{ALL_DATASETS_WITH_NO_NAMESPACE}}
ArgoCD Diff for commit
|
Legend | Status |
---|---|
✅ | The app is synced in ArgoCD, and diffs you see are solely from this PR. |
The app is out-of-sync in ArgoCD, and the diffs you see include those changes plus any from this PR. | |
🛑 | There was an error generating the ArgoCD diffs due to changes in this PR. |
The
config-parquet-and-info
step will now raise aDatasetWithScriptNotSupportedError
for datasets with a script, except those in the allow list. This will prevent users from running arbitrary code using dataset scripts.The error message is shown to the user on the website and it says
The allow list is hardcoded for now:
DATASET_SCRIPTS_ALLOW_LIST = ["canonical"]
The keyword "canonical" means all the datasets without namespaces.
We can add other datasets to the allow list, and it supports
fnmatch
, for example to support all the datasets fromhuggingface
we can addhuggingface/*
to the allow list.cc @severo @XciD