Proposed ADR: Separate changes to models from the corresponding schema migrations #4736
Comments
Sentry Issue: JOB-SERVER-MG
Importance of deployment strategy to this practice

I think good practice here depends upon our deployment strategy. I don't know what our OpenCodelists deployment strategy is; similar considerations are possibly relevant there.

Job Server's deployment strategy is, I think, to signal the old container to shut down once the new container is accepting connections. In-flight requests to the old container may fail if the new container has modified the database schema and the application accesses the database in a manner that is invalid under the new schema.

An alternate deployment strategy is to signal the existing container and wait for it to stop serving requests before starting the new container, avoiding these issues completely. This trades some uptime for simpler development and deployment. The existing container could either be stopped or, better, put into a "maintenance" state, serving only a static page with a message to that effect and minimal DB dependencies. I don't know how easy this strategy is to implement in Dokku.

I hope Job Runner would be robust to short Job Server outages; I imagine they would be on the order of minutes at most. This also affects human users. I would like to at least consider updating or changing the deployment strategy before changing the PR policy under our current deployment strategy.

Under our current deployment strategy

Django doesn't require model-to-database consistency; errors are raised at the database layer only when a query actually conflicts with the schema. Some model changes therefore likely won't trigger errors under our deployment strategy. Adding a table or column is safe, since old code won't query them. Making a field nullable is also safe, as it won't cause database access failures and will only affect application code if non-null values are required.

Problems arise when a table or column is removed and old code tries to access it, or when a field is made non-nullable and old code attempts to insert null values. For these cases, we could deploy the application change first, in a separate PR before the migration PR.

Renaming a table or column is more complex. A good approach is to use three PRs: first, a PR with models and migrations to create the new table or column and replicate existing data; then a PR updating the application code to use the new table or column; and finally, a PR to remove the old field (see the sketch below). I'm not sure there's a way to avoid the inconsistency without that.
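To make the three-PR rename concrete, here is a minimal sketch of the first PR's migration, assuming a hypothetical rename of `MyModel.old_name` to `MyModel.new_name`; the app label, model, field type, and dependency are placeholders, not real job-server code.

```python
# PR 1 of 3: add the new column and replicate existing data into it, leaving
# the old column in place so old and new application code both keep working.
# All names here (myapp, MyModel, old_name, new_name) are hypothetical.
from django.db import migrations, models


def copy_old_to_new(apps, schema_editor):
    MyModel = apps.get_model("myapp", "MyModel")
    # Backfill in bulk; writes to the new column are handled by the
    # application code change in PR 2.
    MyModel.objects.update(new_name=models.F("old_name"))


class Migration(migrations.Migration):
    dependencies = [("myapp", "0001_initial")]  # placeholder dependency

    operations = [
        migrations.AddField(
            model_name="mymodel",
            name="new_name",
            field=models.CharField(max_length=255, null=True),
        ),
        migrations.RunPython(copy_old_to_new, migrations.RunPython.noop),
    ]
```

PR 2 would then switch the application code to `new_name`, and PR 3 would drop `old_name` once no deployed code references it.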
Where to document any decision

I agree we should document any policies or strong opinions on organizing PRs with migrations; I recommend adding this to

I'm not sure documenting this in an ADR is necessary. ADRs are more valuable for architectural decisions with broad, long-term impacts. This decision is not as foundational; it's relatively easy to adjust and doesn't impact other parts of the system. Our overall deployment strategy, for example, is architectural, as it's harder to change and can have multiple downstream effects (such as this Issue).
I have been following this discussion, and I think the above is the key thing to sort out.
I think it is more than just accepting connections; it's returning expected status codes from certain endpoints. Or at least, it can be configured that way per application. That gives a slightly higher guarantee of actually being up.
Specifically, a key part of the deployment strategy is currently that we run migrations as part of the production docker entrypoint. It is not a separate step. This is simple, and has mostly worked, but it has several drawbacks, which have bitten us on occasion.
This is potentially an option, and I think a very sensible one, as we don't have strong uptime requirements. Dokku has a maintenance mode you can enable that may help with this. It might also be good to separate starting the container from running the migrations, as it gives us more control over timings and reduces the maintenance window, e.g. the sketch below.
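As a hedged illustration only of what that separation might look like (not our actual deploy process), assuming the Dokku maintenance plugin is installed, `migrate` has been removed from the entrypoint, and "job-server" is the app name:

```python
# A hypothetical deploy sequence, sketched in Python. Assumes Dokku with the
# maintenance plugin installed and an entrypoint that no longer runs migrate.
# The app name "job-server" is a placeholder.
import subprocess

APP = "job-server"


def dokku(*args: str) -> None:
    """Run a Dokku command on the host, failing loudly on errors."""
    subprocess.run(["dokku", *args], check=True)


dokku("maintenance:on", APP)   # serve a static maintenance page; no DB traffic
dokku("ps:rebuild", APP)       # start new containers (entrypoint skips migrate)
dokku("run", APP, "python", "manage.py", "migrate")  # migrate as its own step
dokku("maintenance:off", APP)  # resume serving against the new schema
```

Running `migrate` as its own step, while the app is shielded by the maintenance page, is what shrinks the window in which requests can race the schema change.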
[snip] I agree with Mike that Django's tight coupling between model code and schema makes this harder than it needs to be. It is possible to write model/schema changes carefully in a series of deployed steps such that no downtime is required, but it is a lot of extra work, and still easy to get wrong. And IMO we don't have the uptime requirements to justify the extra cost. I think that aiming for a deploy with a short downtime window is probably a reasonable compromise.
Thanks for the extra detail about how the deployment works and the thinking-through of the alternate deployment strategy, Simon, very useful.
We talked in the 19 Nov planning meeting about containers overlapping during deployment. I think extending the overlap period was suggested, to give old requests more time to complete before migrations are executed in the new container. I think this may not help under our current strategy, due to Simon's comment:
As soon as a new container's entrypoint starts up with breaking migrations, there's a race between old requests completing and the database schema being modified. I think we have to consider, as discussed above, one of: changing the deployment strategy somewhat; splitting PRs with breaking migrations; or accepting the failed requests.
As every Djangonaut knows, changes to models prompt schema migrations. Indeed, it's common to combine both in the same PR. Doing so, however, can cause the database schema to be inconsistent WRT the models, albeit briefly, as the `migrate` management command is run by the container's entrypoint script (`docker/entrypoints/prod.sh`) before the container accepts connections. The result is a `ProgrammingError`.

Take this `ProgrammingError` as an example. 5f43027 removed `User.is_active`. A query that referenced `jobserver_user.is_active` was executed by the old container after `migrate` was run by the new container. The `ProgrammingError` seemed urgent, and took a non-negligible amount of time to resolve, but it wasn't urgent after all.[1] Sentry reported a single event; Honeycomb reported that whilst one call returned 500, subsequent calls returned 200.

Separating changes to models from the corresponding schema migrations would mean that the models would be inconsistent WRT the database schema, albeit briefly. It would entail more careful scheduling of PRs. But it wouldn't require many -- if any -- changes to our tooling, as `just check-migrations` doesn't seem to be called from CI.

If we choose to separate changes to models from the corresponding schema migrations, then we should record our decision.
Footnotes
[1] https://bennettoxford.slack.com/archives/C03D1G814BA/p1731328692937349