This PR optimizes the HSH_Lookup function to prevent Varnish Cache from becoming unresponsive for extended periods when handling a large number of bans and object variants.
Earlier this year we encountered issues where Varnish would suddenly become unresponsive.
The issues always followed the same pattern. They occurred infrequently—about once a week or less, typically in the early morning on weekends. Before Varnish stopped responding, the n_expired counter would suddenly stop updating. Simultaneously, the rate of bans_tests_tested would increase while bans_tested did not significantly change. Then, the number of threads would increase until it reached the thread pool limit, at which point Varnish stopped accepting new connections.
Initially, I suspected a deadlock. We upgraded from version 5 to 6.6, but this did not resolve the issue. We also increased the thread pool size, which helped once—Varnish managed to unblock itself before the DevOps team restarted it. This indicated that the root cause was not a deadlock.
A thorough investigation revealed the following:
We used a TTL of 900 seconds for cacheable pages and 1 day for non-cacheable pages. Several non-cacheable pages were configured in the VCL file to bypass the cache. For other non-cacheable pages, hit-for-miss was configured with a TTL of 1 day.
After 1 day of uptime, Varnish's cache could contain 500,000–1.5 million objects, many of which were hit-for-miss objects (as the investigation showed).
We use lurker-friendly bans to invalidate objects in the cache; these bans reference only obj.* properties, so the background ban lurker can evaluate them without a client request. Each ban uses a regex to match objects by their URL, and we add around 50k bans per hour. Due to the large number of objects in the cache, the lurker was unable to keep up with the rate at which bans were added. After 1 day of uptime, Varnish could have as many as 300k active bans.
Additionally, we use Vary on User-Agent. While this is not ideal, it is part of our current configuration and has worked for years. As a result, a single page could be accessed with more than a thousand unique User-Agent values daily, which creates a long list of variants for a single object in the cache.
Bans were added at a more or less constant rate throughout the day, but traffic decreased significantly at night. Many objects were not requested overnight, so if the lurker had not yet processed them, they had to be checked against that many more bans once traffic resumed.
A rough estimate showed that the lurker could take up to 1 day to check all bans against all objects, so we can ignore its role in the investigation.
In the morning, as traffic increased, Varnish received a request for an object that had not been accessed overnight. This object needed to be checked against a large number of accumulated bans. If the requested object was non-cacheable with a TTL of 1 day, the lookup also had to go through thousands of variants to find a match. Recently accessed objects had already been tested against recent bans, so the most likely scenario was that a ban matching the requested object had been added at the start of the night, followed by many more bans throughout the night, all of which had to be traversed before reaching the match.
During lookup, once an objhead is found, Varnish checks every variant against all bans, from the head of the ban list down to the ban that invalidated the object. With hundreds of thousands of bans multiplied by thousands of variants, this could take minutes to complete: 300k bans times 1,000 variants is on the order of 3×10^8 regex evaluations, which amounts to several minutes even at a microsecond per test. The mutex oh->mtx is held during this entire operation, which blocks the expire thread; this explains why n_expired stopped updating. The same mutex also blocks new incoming requests for the same object (page); this explains why the number of threads kept increasing.
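To make the cost concrete, here is a simplified, self-contained C sketch of the pre-patch lookup order. It is illustrative only: the structures and the helpers obj_is_banned and vary_matches_req are hypothetical stand-ins, not Varnish's actual types or functions; only the oh->mtx name follows the source.

```c
#include <stdbool.h>
#include <stddef.h>
#include <pthread.h>

/* Illustrative stand-ins for Varnish's internal structures. */
struct objcore { struct objcore *next; };  /* one cached variant */
struct objhead {
    pthread_mutex_t mtx;                   /* the contended oh->mtx */
    struct objcore *variants;              /* list of Vary'd variants */
};

/* Hypothetical helpers (stubbed so the sketch compiles): a ban check
 * that walks the ban list, costing O(B) regex tests per call, and a
 * cheap Vary header comparison against the current request. */
static bool obj_is_banned(const struct objcore *oc) { (void)oc; return false; }
static bool vary_matches_req(const struct objcore *oc) { (void)oc; return true; }

/* Pre-patch order: every variant is ban-checked before its Vary
 * header is compared, so a cold object with V variants and B
 * unprocessed bans costs O(V * B) while oh->mtx is held. */
static struct objcore *
lookup_before(struct objhead *oh)
{
    struct objcore *oc, *hit = NULL;

    pthread_mutex_lock(&oh->mtx);
    for (oc = oh->variants; oc != NULL; oc = oc->next) {
        if (obj_is_banned(oc))          /* O(B) ban walk for EVERY variant */
            continue;
        if (vary_matches_req(oc)) {     /* cheap header comparison */
            hit = oc;
            break;
        }
    }
    pthread_mutex_unlock(&oh->mtx);
    return (hit);
}
```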
We fixed the configuration: we reduced the TTL for hit-for-miss objects and configured additional non-cacheable pages to bypass the cache. This reduced the number of objects in the cache enough that the lurker could keep up with newly added bans, and the issue disappeared. We will also revise the Vary header to populate the cache with fewer object variants.
Although our configuration was suboptimal, it had allowed Varnish to work for years without any issues, and critical system software should not stop responding unexpectedly because of a wrong or suboptimal configuration.
To prevent Varnish from becoming stuck unexpectedly for minutes, we need to swap the order of the Vary check and the ban check. If we first find the object's variant that exactly matches the request and only then check that variant against the bans, the worker can return the object from the cache (or fetch it from the backend) with a much smaller latency spike. The worker holds the mutex for a much shorter time, allowing other requests for the same object to be served. Latency should be a top priority: return the requested object variant to the client as quickly as possible, and let the other variants expire or let the lurker clean them up asynchronously.
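Using the same illustrative model as the earlier sketch (again hypothetical stand-ins, not the actual patch), the reordered lookup could look like this: the Vary comparison runs first, and only the single matching variant is checked against the ban list, so the time spent under oh->mtx drops from O(V * B) to roughly O(V + B).

```c
#include <stdbool.h>
#include <stddef.h>
#include <pthread.h>

/* Same illustrative stand-ins as in the earlier sketch. */
struct objcore { struct objcore *next; };
struct objhead {
    pthread_mutex_t mtx;
    struct objcore *variants;
};

static bool obj_is_banned(const struct objcore *oc) { (void)oc; return false; }
static bool vary_matches_req(const struct objcore *oc) { (void)oc; return true; }

/* Proposed order: find the exact Vary match first, then run the
 * O(B) ban walk once, for that single matching variant only. */
static struct objcore *
lookup_after(struct objhead *oh)
{
    struct objcore *oc, *hit = NULL;

    pthread_mutex_lock(&oh->mtx);
    for (oc = oh->variants; oc != NULL; oc = oc->next) {
        if (vary_matches_req(oc)) {     /* cheap compare, O(V) in total */
            hit = oc;
            break;
        }
    }
    if (hit != NULL && obj_is_banned(hit))  /* single O(B) ban walk */
        hit = NULL;                         /* banned: treat as a miss */
    pthread_mutex_unlock(&oh->mtx);
    return (hit);
}
```

If the matching variant turns out to be banned, the lookup simply proceeds as a miss; the other stale variants are left for the expire thread or the lurker to clean up asynchronously, in line with the latency priority stated above.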