
cache global data

gaiaops edited this page Oct 24, 2011 · 6 revisions

See: https://github.com/gaiaops/gaia_core_php/blob/master/examples/store/global_data.t

This class demonstrates one of the simplest concepts in caching: take a big chunk of data that doesn't change often and cache it. This takes a lot of load off the database server and distributes it across a pool of memcache servers.

In this example, the call SiteConfig::data() returns key/value pairs from the query:

SELECT name, value FROM config;
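A helper like SiteConfig::data() might look roughly like this (a hypothetical sketch for illustration only, not the actual gaia_core_php implementation; the PDO handle is an assumption): it loads every name/value row into an associative array so the rest of the application can do simple lookups.

```php
// Hypothetical sketch (not the library's code): pull all config rows
// into an associative array of name => value pairs.
class SiteConfig {
    public static function data( PDO $db ) {
        $data = array();
        foreach( $db->query('SELECT name, value FROM config') as $row ) {
            $data[ $row['name'] ] = $row['value'];
        }
        return $data;
    }
}
```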

Often configuration variables are stored in a lookup table to allow developers to decouple configuration changes from the codebase. But if this configuration data is needed on every page request, the database will be hit very hard over and over by this one query. It doesn't matter that the database can cache the query. At some point the volume of connections will saturate the database and cause performance issues.

The first step to alleviating load is to cache the query. The easiest way is to instantiate a Store\APC object and cache the data in a shared memory segment:

$cache = new Store\APC();
$data = $cache->get('global_data');

This is an extremely fast way to retrieve the data and will take the load off your database. However, if you cache data like we do at gaia, at some point you will be storing more data than can fit on a single server. Enter memcached:

$cache = new Store\Memcache();
$cache->addServer('10.0.0.1', 11211);
$cache->addServer('10.0.0.2', 11211);
$cache->addServer('10.0.0.3', 11211);
$data = $cache->get('global_data');

For more on why to use memcache, see Distributed Caching with Memcache. Client-side hashing lets you spread your data evenly across the pool of servers with no duplication of objects in the cache. An additional benefit is that the data exists on only one server, so when the cache is refreshed, all the web servers see that update and don't need to refresh the data themselves. This brings up an interesting problem, though.
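The client-side hashing idea can be sketched as a pure function (an illustration only; real memcache clients typically use consistent hashing so that adding or removing a server remaps as few keys as possible):

```php
// Illustrative modulo hashing: every client with the same server list
// computes the same server for a given key, so each object lives in
// exactly one place in the pool.
function pick_server( $key, array $servers ) {
    return $servers[ crc32( $key ) % count( $servers ) ];
}
```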

The most common approach to refreshing data in the cache is to let the cache expire. The next client to ask for the data sees that it is missing and repopulates it back into the cache.

Here's an example:

$data = $cache->get( $key );
if( ! $data ) {
    // cache miss: rebuild from the database and cache for 60 seconds
    $data = $this->getDataFromDatabase();
    $cache->set( $key, $data, $timeout = 60 );
}

Every 60 seconds the data in the cache will be missing and the client automatically refreshes it from the database.

This approach has the benefit that the cache-refresh logic lives in only one spot. It is easy to understand and maintain, it doesn't rely on cronjobs or other external mechanisms to maintain the data, and if for whatever reason the data gets evicted from the cache, the code will auto-repopulate it. But the strategy has a big problem: when many clients attempt to access a cache key in parallel and the data is missing, there is a race condition known as the 'thundering herd'. All of the clients stampede over each other trying to repopulate the data back into the cache.

When this happens, you will see a flurry of database connections stack up on the database server at regular intervals. Worse, since the query hasn't been run in a while, the query cache or InnoDB buffer pool may not have easy access to the data and may have to hit the disk. If the query performs poorly (often the reason it is cached in the first place), the problem is that much worse: all the clients sit around waiting while the database reads data off the disk and computes the result. In the worst case, the highly parallel stampede of clients can even topple and crash the database server.

The Store\Gate class uses a probabilistic approach to refreshing the cache that avoids the 'thundering herd' problem: it elects just one client to refresh the data periodically. It does this transparently by caching the data forever and holding a soft timeout value in a separate cache key. When the soft timeout is reached, the Gate class tells one client that no data was found, relying on that client to re-populate the cache. It uses other performance tricks, like probabilistic cache refreshing, to avoid the overhead of network mutex locks on the cache key.
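The soft-timeout mechanism can be sketched roughly like this (assumed behavior for illustration, not the actual Store\Gate internals; the ArrayCache class is a stand-in for a real memcache client):

```php
// In-memory stand-in for memcache, just for illustration:
class ArrayCache {
    private $d = array(), $exp = array();
    function get($k)  { return ( isset($this->d[$k]) && ( $this->exp[$k] === 0 || $this->exp[$k] > time() ) ) ? $this->d[$k] : false; }
    function set($k, $v, $ttl = 0) { $this->d[$k] = $v; $this->exp[$k] = $ttl ? time() + $ttl : 0; }
    function add($k, $v, $ttl = 0) { if( $this->get($k) !== false ) return false; $this->set($k, $v, $ttl); return true; }
    function delete($k) { unset($this->d[$k], $this->exp[$k]); }
}

// Rough sketch of the soft-timeout idea. The data itself never expires;
// a second key marks it as fresh. Once that marker expires, the first
// client to win an atomic add() is "elected" to refresh, while everyone
// else keeps serving the stale copy instead of stampeding the database.
function gated_get( $cache, $key, $soft_ttl, $refresh ) {
    $data  = $cache->get( $key );
    $stale = ( $cache->get( $key . ':fresh' ) === false );
    if( $data === false || ( $stale && $cache->add( $key . ':lock', 1, 10 ) ) ) {
        $data = $refresh();
        $cache->set( $key, $data );                    // cache "forever"
        $cache->set( $key . ':fresh', 1, $soft_ttl );  // soft timeout marker
        $cache->delete( $key . ':lock' );
    }
    return $data;
}
```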

Adding Store\Gate to your memcache object constructed earlier is trivial:

$cache = new Store\Gate( $cache );

Now just use the new cache object as you would the normal memcache object.

In addition, this example demonstrates how to set up multiple tiers of caching. We can use APC as our first-layer cache and fall back to memcache if the value isn't in APC. Since the memcache layer is wrapped in Store\Gate, we are protected from the thundering herd hitting the database. If we are worried about a cache server going down periodically, we can keep multiple copies of the data in the cache using Store\Replica, which insulates us from cache server outages and intermittent network problems.
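A two-tier lookup can be sketched with nothing more than get() and set() (the helper function below is an illustration, not a gaia_core_php API; $local would be your Store\APC object and $shared your Store\Gate-wrapped memcache pool):

```php
// Illustration of tiered lookup: check the fast local tier first, fall
// back to the shared tier, and copy the value forward into the local
// tier with a short TTL so each web server only goes to memcache
// occasionally.
function tiered_get( $local, $shared, $key, $local_ttl = 60 ) {
    $data = $local->get( $key );
    if( $data === false ) {
        $data = $shared->get( $key );
        if( $data !== false ) $local->set( $key, $data, $local_ttl );
    }
    return $data;
}
```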

The important thing to take away from this example is this: data that is used heavily in your application and changes infrequently should be cached for as long as possible while staying closely in sync with your database. Store\Gate provides a nice API for reducing the likelihood of the 'thundering herd' problem when the data needs to be refreshed, and Store\Replica keeps several copies of the data in the cache to insulate against cache outages and hotspots.
