status | implementation | status_last_reviewed | status_notes |
---|---|---|---|
superseded | superseded | 2024-03-06 | This is no longer a part of the GOV.UK stack |
We have recently experienced two incidents ( and ) where sections of GOV.UK started returning HTTP 404 responses for content which was previously valid. No automated monitoring picked up these incidents; we would like automated monitoring to catch any recurrence.
The plan is simply to monitor a list of pages on GOV.UK to ensure that they return valid responses. Everything else in this RFC is detail.
A page on GOV.UK which returns an HTTP 2xx response should very rarely start returning an HTTP 404 response. (The only exceptions to this are accidental publishes - see later for discussion of them). Ignoring the accidental publish exceptions, the only valid responses for a page which has been returning an HTTP 2xx are:
- HTTP 2xx or 3xx
- HTTP 410 ("Gone")
If such a page returns any other HTTP 4xx response, we are very likely to have a persistent problem which is immediately affecting users (it won't be hidden by our CDN and caching layers), and requires urgent manual intervention.
If such a page returns an HTTP 5xx response, we probably have a temporary problem; it may be being hidden from users by the CDN caching. We should alert if such responses are unusually frequent.
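As a rough sketch of this classification (the function and category names are illustrative, not taken from any existing GOV.UK code):

```python
# Sketch of how a check might classify the latest status code for a page
# which previously returned HTTP 2xx. Names are illustrative only.

def classify_status(status: int) -> str:
    """Classify the latest status code for a previously-valid page."""
    if 200 <= status < 400:
        return "ok"            # 2xx/3xx: still fine
    if status == 410:
        return "ok"            # 410 Gone: content deliberately removed
    if 400 <= status < 500:
        return "page_alert"    # other 4xx: likely persistent, alert immediately
    return "server_error"      # 5xx: possibly transient; alert only if frequent


assert classify_status(301) == "ok"
assert classify_status(404) == "page_alert"
assert classify_status(503) == "server_error"
```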
There are various possible sources of lists of pages to monitor.
We have access to the logs from our CDN. These contain the following information for each page viewed on the site:
- IP address
- Timestamp
- HTTP method
- Full path (including query parameters)
- HTTP response status
We could process these (perhaps daily) to get a count of the number of times each path was accessed and returned a 2xx status code. The transition-stats repo performs similar analysis on the logs for sites which have been transitioned to GOV.UK.
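As an illustration only, the daily aggregation could look something like the following; the tab-separated field order is an assumption for the sketch, not the real CDN log format:

```python
# Illustrative daily aggregation of CDN logs into per-path 2xx counts.
# The tab-separated field order below is an assumption, not the real format.
import csv
from collections import Counter

def count_successful_requests(log_path: str) -> Counter:
    counts = Counter()
    with open(log_path, newline="") as f:
        reader = csv.reader(f, delimiter="\t")
        for ip, timestamp, method, full_path, status in reader:
            if method == "GET" and status.startswith("2"):
                # Strip query parameters for aggregation purposes.
                path = full_path.split("?", 1)[0]
                counts[path] += 1
    return counts

# counts = count_successful_requests("cdn-logs.tsv")
# counts.most_common(10) would give the day's top pages by successful GETs.
```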
Problems:
- No information on which app owns a path
Note: Some other summary information from this might be of interest to analysts (eg, counts of the number of accesses of particular asset URLs), so we should consider whether there's any easy way to expose such data for other uses.
We currently have a nightly Jenkins job which fetches from Google Analytics (GA) the number of page loads for every page on GOV.UK. This is used to update the "popularity" field in the search index. The code for this is in the search-analytics GitHub repository.
The downloaded data could also be used to populate a list of pages which have received traffic in the last fortnight. GA information includes the response status code for the pages, so could be filtered to only return pages which are currently returning a 2xx response.
The requests made to GA could be extended to also fetch the frontend app serving each page, and a list of the top pages by traffic for all time. (Fetching all pages by all-time traffic is possible, but would produce a heavily sampled response, ie, many pages would be omitted if you asked GA for this directly.)
Problems:
- This won't cover any "assets" URLs (eg, PDFs), since no GA event is triggered for these. (Do we already have separate monitoring for assets working, though?)
- What should be done with query parameters? For most pages we should ignore query parameters, but for some pages (eg, search forms) they are significant, and it would be good to be checking that common search pages work. We may need a whitelist of pages where we preserve query parameters (a normalisation sketch follows this list).
- We have sometimes had pages on the site which ignore all path components after a certain point. This can lead to an arbitrary number of different paths, made up by users, being visited (and getting HTTP 200 responses). This might make the list of all pages visited grow indefinitely. (We should probably fix such cases to return either an error, or a redirect to the canonical URL.)
- Pages which have never been visited won't be represented in this output. (This probably only affects very new pages.)
- There are currently very few people who know the search-analytics code base. (This problem is also an opportunity!)
- Fetching data from GA is mildly tricky (due to quirks of the platform).
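A minimal sketch of the path normalisation implied by the query-parameter point above; the whitelist entries and function name are illustrative:

```python
# Sketch of path normalisation with a whitelist of paths whose query
# parameters are significant. The whitelist entries are examples only.
from urllib.parse import urlsplit, parse_qsl, urlencode

QUERY_PARAM_WHITELIST = {
    "/search",   # example: search results depend on their query string
}

def normalise_path(full_path: str) -> str:
    """Drop query parameters unless the path is whitelisted."""
    parts = urlsplit(full_path)
    if parts.path in QUERY_PARAM_WHITELIST and parts.query:
        # Sort parameters so equivalent URLs collapse to a single entry.
        query = urlencode(sorted(parse_qsl(parts.query)))
        return f"{parts.path}?{query}"
    return parts.path

assert normalise_path("/vat-rates?utm_source=twitter") == "/vat-rates"
assert normalise_path("/search?q=tax&page=2") == "/search?page=2&q=tax"
```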
The search index contains a list of pages on the site. This is used to populate the "sitemap" pages. The list of pages in the search index could be used for the monitoring.
Popularity information is also loaded into the search index nightly, so this could be used to identify the top pages.
Problems:
- Not all documents are in search.
- The frontend app serving a page isn't currently recorded in search. (And it would be awkward to add it currently)
- Search only has a "canonical" URL for documents. Multi-page documents (eg, guidance) often only have the entry page of them represented in search.
- A failure of the search index system which lost some documents from it could plausibly cause some pages on the site to start returning 4xx errors, or erroring. However, this wouldn't be detected if the monitoring used the same list of documents.
The content store apps (perhaps the content-register part of this app cluster) could return a list of routes registered for a particular path.
Problems:
- No information on traffic levels to pages
- A content store item may represent content spread across multiple URLs, and there may be no easy way to get a list of all the URLs at which the item can be accessed
The individual apps could generate lists of their pages to be monitored.
Problems:
- There are a lot of apps, and custom code would need to be written and maintained for each one. This is probably a non-starter.
Fetching data from the CDN logs is probably the simplest way to get good coverage of pages. This might be done using the code in the transition-stats repo, or some other approach.
The result of fetching this data could either be written directly to a file in some shared system (eg, S3, or a git repo) or explicitly copied to all machines which need it to run the monitoring smoke tests. Monitoring apps could then query this list to get the documents to check. If a shared system were used, the list of pages fetched each day could be merged with the list of pages which were already present.
Using a git repository for this monitoring would have the nice property that false positives could be resolved by manually editing the repository to remove incorrect entries. Further, it would produce a history of which pages were live on the site at a particular time.
- Use a git repository, pushing updates to it nightly. I would imagine this git repository containing a single, large CSV file with one line per page on the site, sorted by base_path to keep diffs minimal. It might also contain a separate file listing the top pages by traffic for each frontend app. (A sketch of the nightly merge follows this list.)
- Ensure that the nightly automatic update of this repository performs a "pull" of the repository before updating it, so that it copes with intervening manual edits.
- Have a nagios monitoring check on the timestamp of the last push to this repository, to ensure it's being successfully updated.
- If exceptions to the monitoring are required (eg, paths which we do not want the monitoring to check), record these in the same git repository, such that the monitoring can be entirely configured by editing the repository.
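A minimal sketch of the nightly merge into the repository's CSV file; the file name and the base_path/last_seen column layout are assumptions, not an agreed format:

```python
# Sketch of merging a day's pages into the repository's CSV, keeping one
# row per base_path in sorted order for minimal diffs. File name and
# columns (base_path, last_seen) are illustrative assumptions.
import csv
import os

def merge_pages(csv_path: str, todays_pages: dict) -> None:
    """Merge {base_path: last_seen_date} into the existing CSV file."""
    pages = {}
    if os.path.exists(csv_path):
        with open(csv_path, newline="") as f:
            for row in csv.DictReader(f):
                pages[row["base_path"]] = row["last_seen"]
    pages.update(todays_pages)

    with open(csv_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["base_path", "last_seen"])
        writer.writeheader()
        for base_path in sorted(pages):
            writer.writerow({"base_path": base_path, "last_seen": pages[base_path]})

# merge_pages("pages.csv", {"/vat-rates": "2015-06-01"})
# After this, pull (to pick up manual edits), commit, and push.
```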
Requirements are:
- Identify problems on high traffic pages quickly (within 5 minutes)
- Identify problems on lower traffic pages "eventually" (ideally within a few hours)
- It should be easy to trigger a check of the top pages for a given app (ideally, this would happen automatically after a deploy)
The checks should be run avoiding caches - so we need to ensure that they don't put too high a load on the apps themselves.
Question: should Smokey run these tests? Or should we have a separate persistent app?
Suggestion:
- Smokey runs automatic checks of the top few pages.
- A separate persistent app performs a gradual sampling of pages on the site, reporting problems via nagios. (A sketch of this sampling follows.)
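A rough sketch of what the gradual sampler might look like, using the Python requests library; the pacing, the cache-busting header, and the reporting approach are assumptions rather than agreed behaviour:

```python
# Rough sketch of a gradual sampler. The pause between requests keeps the
# load on origin low; the Cache-Control header is only a hint to caches
# and is an assumption here, not an established GOV.UK convention.
import time
import requests

def sample_pages(base_url: str, paths: list, delay_seconds: float = 1.0) -> list:
    """Fetch each path slowly, returning (path, status) pairs that look bad."""
    problems = []
    for path in paths:
        response = requests.get(
            base_url + path,
            headers={"Cache-Control": "no-cache"},
            allow_redirects=False,
            timeout=10,
        )
        status = response.status_code
        ok = status < 400 or status == 410
        if not ok:
            problems.append((path, status))
        time.sleep(delay_seconds)  # pace requests to avoid overloading apps
    return problems

# problems = sample_pages("https://www.gov.uk", ["/vat-rates", "/bank-holidays"])
# An empty list would map to a nagios OK; otherwise CRITICAL with details.
```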
Some pages are accidentally published, and then have to be reverted. This normally only happens to comply with legal obligations.
We'll make two assumptions about accidentally published pages:
- accidentally published pages will always be reverted within 7 days (which seems reasonable, since there's little point reverting such pages any later, as they'll certainly be in various web archives).
- no accidentally published pages will have received "high" traffic (ie, within the top 100 pages served by their frontend app, say).
To avoid false-positive alerts about such pages, we'll exclude from monitoring any page which was first published within the last 7 days, unless it is in the top 100 pages served by its app. This information is available from analytics.
These assumptions aren't "watertight", but if they are violated the result would only be false-positive alerts. We'd probably want to know about such pages anyway, and I think the assumptions will very rarely be wrong, so this is unlikely to be a problem in practice. We can adjust the thresholds if it does turn out to be one.
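As a concrete reading of that exclusion rule (the thresholds come from the assumptions above; the function and field names are illustrative):

```python
# Illustrative predicate for the exclusion rule above. The 7-day and
# top-100 thresholds come from this section; the function itself is a
# sketch, not existing monitoring code.
from datetime import date, timedelta

def should_monitor(first_published: date, traffic_rank: int, today: date) -> bool:
    """Monitor a page unless it might be a recent accidental publish."""
    recently_published = today - first_published <= timedelta(days=7)
    high_traffic = traffic_rank <= 100  # rank within its frontend app
    return high_traffic or not recently_published

# A page published 3 days ago, ranked 5000th in its app: excluded for now.
assert should_monitor(date(2015, 6, 10), 5000, date(2015, 6, 13)) is False
# The same page, 10 days after publishing: monitored.
assert should_monitor(date(2015, 6, 10), 5000, date(2015, 6, 20)) is True
```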