Add R2 fallback for strange 404s #10

Open: 8 commits to be merged into base: main

brandonsturgeon
Member

@brandonsturgeon brandonsturgeon commented Jul 6, 2023

Summary

Cloudflare KV is prone to 404s if the Reader is in a different part of the world than the Writer.

To remedy this, we now also store the data in R2, which is guaranteed to be globally consistent as soon as a write completes.
This way we keep the performance benefits of KV in the common case, and fall back on the reliability of R2 when KV misses.

There are two downsides:

  1. Each Write will be slightly slower, because the data is now written to both KV and R2
  2. Extra cost and setup because we can't automatically configure R2 Object Lifecycles

TODO:

  • Add region Caching for R2 reads
  • Extended stress test

This PR is currently deployed to stg.gmod.express
If you want to test it, you should be able to set express_domain stg.gmod.express.
While it shouldn't be necessary, be sure to grab this Express branch too: CFC-Servers/gm_express#37

Explanation

After a great deal of investigation and talking with Cloudflare support, we identified an issue with KV.

There are TWO "primary KV Stores" globally. One in the US area, one in the EU region.

When you Write to KV, the worker chooses the Store closest to the Writer's region.
The Stores then sync with each other, which can take up to a few seconds.

Bigger data takes longer to sync between the two Stores.

The issue we found: if the Writer Writes to the KV Store that is not the one closest to the Reader, and the Reader requests the data before the Stores have synced, the Reader gets a 404.
Worse yet, because KV responses are cached in each Region, that 404 stays cached for another minute or so, meaning nobody else in that Region can access the data either.

An Example:

  • A server in Vermont, US writes a 6 MB file to KV
  • They Write to the US KV Store because it's closest
  • Writer sends the UUID of the data to the Reader in Italy
  • Reader attempts to retrieve the data from the EU KV Store
  • Reader gets a 404 because the Stores have not synced yet
  • All Readers in the Italy-ish region will get 404s for another minute or so
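
The sequence above can be reproduced with a small toy model. This is only a sketch of the behaviour as described in this PR, not Cloudflare's implementation: the `ToyKV` class, the 2000 ms sync delay, and the 60000 ms miss-cache lifetime are all hypothetical stand-ins for "a few seconds" and "a minute or so".

```typescript
// Toy model only: two regional stores, a delayed cross-region sync,
// and a per-region cache of misses. All names and numbers are assumptions.
type Region = "US" | "EU";

interface PendingSync {
  to: Region;
  key: string;
  value: string;
  at: number; // time (ms) at which the copy arrives in the other store
}

class ToyKV {
  private stores: Record<Region, Map<string, string>> = { US: new Map(), EU: new Map() };
  // Cached 404s per region: key -> time (ms) at which the cached miss expires.
  private missCache: Record<Region, Map<string, number>> = { US: new Map(), EU: new Map() };
  private pending: PendingSync[] = [];
  private syncMs: number;
  private missTtlMs: number;

  constructor(syncMs: number, missTtlMs: number) {
    this.syncMs = syncMs;
    this.missTtlMs = missTtlMs;
  }

  // Write lands in the Writer's store immediately and in the other one later.
  put(region: Region, key: string, value: string, now: number): void {
    this.stores[region].set(key, value);
    const other: Region = region === "US" ? "EU" : "US";
    this.pending.push({ to: other, key, value, at: now + this.syncMs });
  }

  // Returns null to represent a 404.
  get(region: Region, key: string, now: number): string | null {
    // Apply every sync that has completed by `now`.
    this.pending = this.pending.filter((p) => {
      if (p.at > now) return true;
      this.stores[p.to].set(p.key, p.value);
      return false;
    });
    // A cached miss keeps answering 404 until it expires,
    // even if the data has arrived in the meantime.
    const missUntil = this.missCache[region].get(key);
    if (missUntil !== undefined && now < missUntil) return null;

    const value = this.stores[region].get(key);
    if (value === undefined) {
      this.missCache[region].set(key, now + this.missTtlMs);
      return null;
    }
    return value;
  }
}

const kv = new ToyKV(2000, 60000);
kv.put("US", "uuid-1", "payload", 0);               // Vermont writes to the US Store
const tooEarly = kv.get("EU", "uuid-1", 500);       // Italy reads before the sync
const stillCached = kv.get("EU", "uuid-1", 5000);   // synced, but the 404 is cached
const afterExpiry = kv.get("EU", "uuid-1", 61000);  // cached 404 has expired
```

In this model the first two reads miss and only the third returns the payload, which mirrors the last two bullets: the data is present in the EU Store after the sync, yet Readers in that region keep seeing the cached 404.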

Before I understood the root cause, I implemented a number of systems aimed at band-aiding the symptoms.
The biggest and most impactful of those was a "Send Delay" in the addon.
After the addon gets a response from Writing, it waits up to 2 seconds before sending the UUID to the recipient.

This slows the whole system down.

Solution

Instead of relying entirely on KV, we also Write the same key/value to Cloudflare's R2, which is assured to be "Strongly Consistent".
From the Cloudflare R2 Worker API Docs:

"R2 writes are strongly consistent. Once the Promise resolves, all subsequent read operations will see this key value pair globally."

Then, when a Reader gets a 404 from KV, they will also try to read from R2.
If the value has not expired yet, the Reader is guaranteed to get the data from R2 successfully.
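
A minimal sketch of that write-both, read-with-fallback flow might look like the following. It is not the code in this PR: `KVLike`, `R2Like`, `writeBoth`, and `readWithFallback` are hypothetical names, the interfaces are trimmed-down stand-ins for the Workers KV and R2 bindings, and values are treated as strings for simplicity.

```typescript
// Trimmed-down stand-ins for the Workers bindings (assumptions, not the
// full KVNamespace / R2Bucket types).
interface KVLike {
  get(key: string): Promise<string | null>;
  put(key: string, value: string): Promise<void>;
}

interface R2ObjectLike {
  text(): Promise<string>;
}

interface R2Like {
  get(key: string): Promise<R2ObjectLike | null>;
  put(key: string, value: string): Promise<unknown>;
}

// Write the same key/value to both stores. The returned Promise resolves
// only after the R2 write has resolved, so a Reader that is told about the
// key afterwards can rely on R2 having it.
async function writeBoth(kv: KVLike, r2: R2Like, key: string, value: string): Promise<void> {
  await Promise.all([kv.put(key, value), r2.put(key, value)]);
}

// Try KV first for speed. On a miss (null), fall back to R2.
async function readWithFallback(kv: KVLike, r2: R2Like, key: string): Promise<string | null> {
  const fromKv = await kv.get(key);
  if (fromKv !== null) return fromKv;

  const obj = await r2.get(key);
  return obj === null ? null : obj.text();
}
```

Waiting on both writes is what makes each Write slightly slower, as noted in the Summary. A real 404 is still returned when neither store has the key, for example after the value has expired.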

Additional Notes

I will begin working on Express v2 which will use R2 and some clever positioning/routing to improve performance, getting it closer to KV's average read/write times.

Until then, this solution should work fine.

@brandonsturgeon brandonsturgeon self-assigned this Jul 6, 2023
@brandonsturgeon brandonsturgeon added the bug, documentation, and enhancement labels Jul 6, 2023