
Optimize HSH_Lookup #4236

Open
sobolevsv wants to merge 2 commits into master

Conversation

@sobolevsv commented Nov 26, 2024

This PR optimizes the HSH_Lookup function to prevent Varnish Cache from becoming unresponsive for extended periods when handling a large number of bans and variations.

Earlier this year we encountered issues where Varnish would suddenly become unresponsive.

The issues always followed the same pattern. They occurred infrequently—about once a week or less, typically in the early morning on weekends. Before Varnish stopped responding, the n_expired counter would suddenly stop updating. Simultaneously, the rate of bans_tests_tested would increase while bans_tested did not significantly change. Then, the number of threads would increase until it reached the thread pool limit, at which point Varnish stopped accepting new connections.

Initially, I suspected a deadlock. We upgraded from version 5 to 6.6, but this did not resolve the issue. We also increased the thread pool size, which helped once—Varnish managed to unblock itself before the DevOps team restarted it. This indicated that the root cause was not a deadlock.

A thorough investigation revealed the following:

We used a TTL of 900 seconds for cacheable pages and 1 day for non-cacheable pages. Several non-cacheable pages were configured in the VCL file to bypass the cache. For other non-cacheable pages, hit-for-miss was configured with a TTL of 1 day.

After 1 day of uptime, Varnish's cache could contain 500,000–1.5 million objects, many of which were hit-for-miss objects (as the investigation showed).

We use lurker-friendly bans to invalidate objects in the cache. Each ban uses a regex to match objects by their URL. We add around 50k bans per hour. The lurker was unable to keep up with the rate of adding bans due to the large number of objects in the cache. After 1 day of uptime, Varnish could have as many as 300k active bans.

Additionally, we use Vary by User-Agent. While this is not ideal, it is part of our current configuration and has worked for years. As a result, a single page could be accessed by more than a thousand unique User-Agents daily, which creates a long list of variants for a single object in the cache.

Bans were added more or less at a constant rate throughout the day, but traffic significantly decreased during nighttime. Many objects were not requested overnight, meaning they needed to be checked against more bans when traffic resumed if the lurker had not yet processed them.

A rough estimate showed that the lurker could take up to 1 day to check all bans against all objects, so we can ignore its role in the investigation.

In the morning, as traffic increased, Varnish received a request for an object that had not been accessed overnight. This object needed to be checked against a large number of bans. If the requested object was non-cacheable with a TTL of 1 day, it also had to go through thousands of variations to find a match. Recently accessed objects were at the beginning of the list, so the probability of the requested object being near the end of the list was low. What is more likely is that a ban invalidating the requested object was added at the start of the night, followed by many more bans added throughout the night.

During lookup, once an objhead is found, Varnish needs to check all variants against all bans from the start of the ban list to the ban that invalidated the object. This process could take minutes to complete (hundreds of thousands of bans multiplied by thousands of variations). During this operation, the mutex oh->mtx is held, which blocks the expire thread. This explains why n_expired stopped updating. This mutex also blocks new incoming requests for the same object (page). This explains why threads started increasing.
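To put rough numbers on this: with the ~300k active bans and ~1,000 variants per object described above, a single lookup may perform on the order of 300,000 × 1,000 = 3×10^8 ban tests, many of them regex matches; even at a microsecond per test, that is about five minutes spent holding oh->mtx. Below is a condensed sketch of the pre-change control flow (illustrative only; the real HSH_Lookup in cache_hash.c handles many more states, though the helper names are the real ones):

    Lck_Lock(&oh->mtx);
    VTAILQ_FOREACH(oc, &oh->objcs, hsh_list) {
        /* Expensive: walk the ban list from its head down to oc->ban,
         * running regex tests, for EVERY variant in the list, even
         * variants that can never match this request's Vary key. */
        if (BAN_CheckObject(wrk, oc, req))
            continue;        /* banned */
        /* Cheap: compare the stored Vary key against the request. */
        if (ObjHasAttr(wrk, oc, OA_VARY) &&
            !VRY_Match(req, ObjGetAttr(wrk, oc, OA_VARY, NULL)))
            continue;        /* wrong variant */
        break;               /* usable hit */
    }
    Lck_Unlock(&oh->mtx);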

We fixed the configuration—reduced the TTL for hit-for-miss objects and added additional non-cacheable pages to bypass the cache. This helped to reduce the number of objects in the cache so the lurker could now keep up with newly added bans. As a result, the issue with Varnish disappeared. We will also revise the Vary header to populate the cache with fewer object variants.

Although our configuration was suboptimal, it had allowed Varnish to work for years without any issues; even so, critical system software should not stop responding unexpectedly because of a wrong or suboptimal configuration.

To prevent Varnish from becoming stuck unexpectedly for minutes, we need to swap the order of the Vary check and the Ban check. If we first find the object's variant that exactly matches the request and only then check it against all bans, the worker will return the object from the cache (or from backends) with a much smaller latency spike. The worker will hold the mutex for a much shorter time, allowing other requests for the same object to be served. Latency should be a top priority—returning a requested object variant to the client as quickly as possible. Let other object variants expire, or let the lurker clean them up asynchronously.
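In code terms, the proposed change boils down to swapping the two checks inside the critical section, so that the expensive ban walk runs only for the one variant that will actually be delivered (again a condensed, illustrative sketch, not the literal patch):

    Lck_Lock(&oh->mtx);
    VTAILQ_FOREACH(oc, &oh->objcs, hsh_list) {
        /* Cheap Vary comparison first: skip non-matching variants
         * without touching the ban list at all. */
        if (ObjHasAttr(wrk, oc, OA_VARY) &&
            !VRY_Match(req, ObjGetAttr(wrk, oc, OA_VARY, NULL)))
            continue;        /* wrong variant */
        /* The O(bans) walk now runs for one variant only. */
        if (BAN_CheckObject(wrk, oc, req))
            continue;        /* banned */
        break;               /* usable hit */
    }
    Lck_Unlock(&oh->mtx);

With the numbers above, the worst case drops from roughly 3×10^8 ban tests to roughly 3×10^5 while other requests for the same objhead wait on the mutex.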

Fix potential case when varnish stops responding with a large number of bans and variations
@nigoroll (Member) left a comment


As you rightly explained, the root cause of your problems lies with the excessive use of variants and bans. Regarding variants, we introduced vary_notice, which does not avoid the problem, but at least helps awareness of it. Regarding bans, we added ban_cutoff, which, if used, would have avoided the problematic scenario you described.

That said, I can see that the change you propose makes sense, so 👍
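For anyone hitting the same wall before such a change lands, both knobs mentioned above are ordinary runtime parameters; for example (the values here are illustrative, not recommendations):

    varnishadm param.set ban_cutoff 200000
    varnishadm param.set vary_notice 50

Roughly speaking, ban_cutoff caps how far down the ban list objects are tested (objects beyond the cutoff are expunged by the lurker instead), and vary_notice logs a notice once an objhead accumulates more than the given number of variants.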

.gitignore: review comment (outdated, resolved)
@sobolevsv (Author) commented

Thank you for reviewing and approving my proposal, @nigoroll!
Regarding ban_cutoff - yes, I came across this during the investigation. I understood that it could serve as a workaround, but I wanted to identify the root cause. After pinpointing the issue—an excessive number of variations for non-cacheable objects—we addressed it by fixing the configuration, and the issue disappeared. Moving forward, we will consider both ban_cutoff and vary_notice as safety measures.

@nigoroll (Member) commented

@sobolevsv Yes, I agree that you did the right thing by tracking down the root cause.

Can you please squash your commits regarding .gitignore and force-push?
