Very large files require very large buffers #2185

Open
JustinKyleJames opened this issue Apr 17, 2024 · 5 comments

@JustinKyleJames
Contributor

JustinKyleJames commented Apr 17, 2024

So that we can support retries on part uploads after failures, an entire part must be stored in the circular buffer.

We had a user who attempted to upload a 1.7 PB file. With the limit of 10,000 parts per upload, each part, and therefore each buffer, must be 170 GB. That is an unreasonable amount of memory, especially considering that each streaming thread for each upload has its own buffer.

We should do the following:

  1. When calculating the number of parts from the file size and circular buffer size, check whether we need more than 10,000 parts. (This check should happen regardless, even if all we do is throw an error.)
  2. If we calculate that we would need more than 10,000 parts:
    • Update the part size so that there are 10,000 parts or fewer.
    • Drop the requirement that the full part must be in memory.
    • Drain the circular buffer as we are streaming the bytes from S3.
    • Do not support retries if the part fails.
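
The proposal above can be sketched roughly as follows. This is only an illustration of the arithmetic, not the plugin's actual code: `plan_upload`, `UploadPlan`, and the `retries_supported` flag are hypothetical names, and the sketch assumes the part size normally equals the circular buffer size.

```python
from dataclasses import dataclass

MAX_PARTS = 10_000  # per-upload part limit cited in this issue


@dataclass
class UploadPlan:
    part_size: int           # bytes per part (the last part may be smaller)
    part_count: int          # number of parts needed
    retries_supported: bool  # True only if a whole part fits in the buffer


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling division, safe for very large values."""
    return (a + b - 1) // b


def plan_upload(file_size: int, buffer_size: int,
                max_parts: int = MAX_PARTS) -> UploadPlan:
    """Hypothetical planner: choose a part size that respects max_parts."""
    if file_size <= 0 or buffer_size <= 0:
        raise ValueError("file_size and buffer_size must be positive")

    # Step 1: always compute the part count for the configured buffer size.
    part_count = ceil_div(file_size, buffer_size)
    if part_count <= max_parts:
        # A full part fits in the buffer, so a failed part can be resent.
        return UploadPlan(buffer_size, part_count, retries_supported=True)

    # Step 2: grow the part size until at most max_parts parts are needed.
    # The part no longer fits in the buffer, so it has to be drained while
    # streaming and a failed part cannot be retried from memory.
    part_size = ceil_div(file_size, max_parts)
    part_count = ceil_div(file_size, part_size)
    return UploadPlan(part_size, part_count, retries_supported=False)
```

For the 1.7 PB case with a 64 MiB buffer, `plan_upload(1_700_000_000_000_000, 64 * 1024**2)` yields 10,000 parts of 170 GB each with retries disabled, which matches the figure quoted above.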
@korydraughn
Contributor

It turns out the user was attempting to upload a 1.7 TB file, but that doesn't change the fact that we need to add logic to make sure the total number of parts does not exceed 10,000.

Also, AWS S3 doesn't support objects larger than 5 TB.
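
As a rough illustration of how the two limits interact, the sketch below rejects objects over the size limit and otherwise reports the smallest part size that stays within 10,000 parts. The function name `min_part_size` is hypothetical, and the size limit is assumed here to be 5 TiB (5 × 1024⁴ bytes); the comment above only says "5 TB".

```python
MAX_PARTS = 10_000             # per-upload part limit cited in this issue
MAX_OBJECT_SIZE = 5 * 1024**4  # assumption: the "5 TB" limit read as 5 TiB


def min_part_size(object_size: int) -> int:
    """Hypothetical helper: smallest part size (bytes) within MAX_PARTS."""
    if object_size <= 0:
        raise ValueError("object_size must be positive")
    if object_size > MAX_OBJECT_SIZE:
        raise ValueError("object exceeds the maximum supported object size")
    # Integer ceiling division avoids float rounding on large sizes.
    return (object_size + MAX_PARTS - 1) // MAX_PARTS


# A 1.7 TB file needs parts of at least 170 MB each.
print(min_part_size(1_700_000_000_000))  # 170000000
```

Under these assumptions, a 1.7 PB object would be rejected outright, while the 1.7 TB file needs buffers of about 170 MB per part rather than 170 GB.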

@alanking
Contributor

Is that baked into the protocol, or is that just AWS? Just wondering...

@trel
Member

trel commented Apr 18, 2024

The protocol is what AWS says it is. There is no spec.

@alanking
Contributor

Oh, interesting. Good to know.
