pull: rate limit not handled with HTTP remote #50
Comments
Hi @sisp, thanks for the detailed report. The request makes sense, but we discussed internally that implementing the right solution at the right level might be tricky. In your specific case, what is the request limit on the server?
@daavoo The rate limit is 1000 requests per 15 seconds. To reduce the risk of running into the limit, we use […]. Do you know where the problem would need to be fixed? In […]?
It is unclear if it would be an acceptable or simple change to make in […]. I haven't looked at the details, but it would probably make sense to handle it in […].
Okay. If you don't have the capacity to make this change, I could look into it and attempt a contribution if somebody could review it then.
@sisp You should be able to just run […].
Also, the HTTP remote is very basic, and we really didn't want to try to support all custom HTTP(S) servers with their quirks. Maybe you could consider using some S3-compatible server for handling the data (e.g. MinIO)?
@efiop is right about simplicity and that fetch/pull can be used in a loop. We do, though, have an example (GDrive) where the whole implementation is about 10 lines of code (plus adding a wrapper on top of operations): https://github.com/iterative/PyDrive2/blob/main/pydrive2/test/test_util.py#L46-L54. The problem here is that HTTP is a general remote, and while 429 is the most common code for rate limiting, it may not be the only one. I'm not sure how often servers use 429, whether it is always safe to retry in those cases, etc. @sisp, do you have some information about that? Overall, I feel that it can be a convenience feature if 429 is indeed used very often and it's always safe to retry on it.
There's also a question of whether dvc should respect that at all. I'd say dvc should be aggressive on downloads/uploads and provide means to control that, and it does so via […].
That DX is awful, both locally and especially in a CI job. It's at best a poor workaround.
No, let me elaborate. My DVC remote is GitLab's generic package registry on our corporate self-hosted GitLab instance. This approach has significant advantages over, e.g., S3:
See also a feature request for first-class support of DVC in GitLab that I've submitted.
HTTP 429 looks pretty standard to me. I think it should be supported; custom rate-limit responses don't need to be, though.
That's just a hack and not reliable. See also my remark about DX.
Sounds really great to me! Is the rate limiting the only missing piece for this? Does auth come "automatically" in this case? Do GET and POST work out-of-the-box?
Thanks! Also a good feature request.
Yes, it makes sense. I'm just curious about some real-world stats on this (e.g. in GDrive, as you can see, it's not 429). Overall, again, I think if this gives us better integration with GitLab storage and the scope is not huge, it's worth adding as a convenience feature. I feel we would need to make it optional, though, to be safe.
@sisp, does this work in 3.0? We changed the remote structure from the root […].
Rate limiting and per-file size limit, although the latter is typically not problematic because large files harm deduplication and thus should be avoided.
Yes, auth comes automatically. Only […].
🙈 Oh no! No, this doesn't work because now there are 4 path elements and GitLab's generic package registry supports only 3. Before, the cache structure was mapped to the generic package registry API URL path template
as follows:
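(For illustration only, since the exact mapping isn't reproduced above: GitLab's generic package registry exposes files at `/projects/:id/packages/generic/:package_name/:package_version/:file_name`, so a pre-3.0 cache entry like `ab/cdef0123...` can fill the last path elements, e.g. `:package_version = ab` and `:file_name = cdef0123...`, with the package name fixed in the remote URL. The 3.0 layout produces 4 appended path elements, which no longer fits this template.)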
Is there any way to configure a mapping between the local cache and the remote cache? We'd need to strip […]. Or would you be open to having a new remote plugin for GitLab which would be based on the HTTP remote plugin but could handle peculiarities like this path mapping? I'd contribute it; I really don't want to lose the GitLab integration.
Yes, I think it can work. They all share the same structure, and I think nothing prevents creating a new remote type.
Nice! I'll keep you posted on the DVC GitLab plugin.
Bug Report
Description
`dvc pull` aborts with an error when the rate limit of an HTTP remote kicks in. This is problematic because it's impossible to pull the complete data.

Reproduce
Here is a Starlette-based web app which implements a simple HTTP backend for DVC, including a rate limit (1 request per minute, just to make sure it kicks in) for GET requests. Copy this code to, e.g., `app.py`.
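A minimal sketch of such an app follows; the `storage` directory, the route layout, and the exact 429/`Retry-After` handling are illustrative assumptions rather than the original snippet:

```python
import os
import time

from starlette.applications import Starlette
from starlette.requests import Request
from starlette.responses import FileResponse, Response
from starlette.routing import Route

STORAGE = "storage"      # where uploaded cache files are kept (assumption)
RATE_LIMIT_SECONDS = 60  # allow 1 GET request per minute
_last_get = 0.0


async def get_file(request: Request) -> Response:
    """Serve a cache file, rate-limited to one GET per minute."""
    global _last_get
    now = time.monotonic()
    wait = RATE_LIMIT_SECONDS - (now - _last_get)
    if wait > 0:
        # Rate limit exceeded: reply with 429 and tell the client when to retry.
        return Response(status_code=429, headers={"Retry-After": str(int(wait) + 1)})
    _last_get = now
    path = os.path.join(STORAGE, request.path_params["path"])
    if not os.path.isfile(path):
        return Response(status_code=404)
    return FileResponse(path)


async def head_file(request: Request) -> Response:
    """Existence/size check; not rate-limited."""
    path = os.path.join(STORAGE, request.path_params["path"])
    if not os.path.isfile(path):
        return Response(status_code=404)
    return Response(headers={"Content-Length": str(os.path.getsize(path))})


async def put_file(request: Request) -> Response:
    """Store an uploaded cache file (DVC's HTTP remote uploads via POST or PUT)."""
    path = os.path.join(STORAGE, request.path_params["path"])
    os.makedirs(os.path.dirname(path), exist_ok=True)
    with open(path, "wb") as f:
        async for chunk in request.stream():
            f.write(chunk)
    return Response(status_code=201)


app = Starlette(
    routes=[
        # HEAD route first, since Starlette GET routes also accept HEAD requests.
        Route("/{path:path}", head_file, methods=["HEAD"]),
        Route("/{path:path}", get_file, methods=["GET"]),
        Route("/{path:path}", put_file, methods=["POST", "PUT"]),
    ]
)
```

Starting it with `uvicorn app:app` (as in the steps below) serves it on http://127.0.0.1:8000.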
Example:
Create a new virtual env and activate it:
python -m venv .venv
. .venv/bin/activate
Install needed packages:
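(The exact package list isn't shown here; presumably something like the following is sufficient:)

pip install dvc starlette uvicorn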
Initialize the project with DVC:
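(Presumably:)

git init
dvc init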
Set the local web server as the DVC remote:
dvc remote add -d local http://127.0.0.1:8000
Create a folder `data` and add some files:
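(For example:)

mkdir data
for i in 1 2 3; do echo "file $i" > data/$i.txt; done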
Track those files individually with DVC:
dvc add data/*.txt
In a separate terminal window, activate the virtual env and start the web app:
. .venv/bin/activate
uvicorn app:app
Push the files to the DVC remote:
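dvc push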
Delete the local files and the local DVC cache:
rm -rf data/*.txt .dvc/cache
Pull the files from the DVC remote again and observe this error:
Strangely, when running `dvc pull` again (while the rate limit is still active), another error occurs:

Expected
DVC should inspect the `retry-after` field in the response header and attempt to make requests again after this time has passed.
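For illustration, a retry loop of the kind being requested could look like this (a sketch using aiohttp directly, not DVC's or fsspec's actual code; the function name and defaults are made up):

```python
import asyncio

import aiohttp


async def get_with_retry(session: aiohttp.ClientSession, url: str, max_retries: int = 5) -> bytes:
    """Fetch a URL, sleeping for the server-suggested delay on HTTP 429."""
    for _ in range(max_retries):
        async with session.get(url) as resp:
            if resp.status == 429:
                # Honour Retry-After if present (a real implementation would also
                # handle the HTTP-date form of the header), otherwise back off briefly.
                delay = float(resp.headers.get("Retry-After", 1))
                await asyncio.sleep(delay)
                continue
            resp.raise_for_status()
            return await resp.read()
    raise RuntimeError(f"still rate-limited after {max_retries} attempts: {url}")
```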
Environment information
Output of `dvc doctor`:

Additional Information (if any):
Here is the output of the two `dvc pull` commands with the `--verbose` flag:

It may be possible that the problem needs to be solved within `fsspec`, but I think raising this issue here is a good starting point for the initial discussion, to track it, and to reduce duplicate issues.