
Proposed ADR: Separate changes to models from the corresponding schema migrations #4736

Open
iaindillingham opened this issue Nov 11, 2024 · 5 comments
Labels
deck-scrubbing Tech debt or other between-initiative tidy-up work

Comments

@iaindillingham
Member

As every Djangonaut knows, changes to models prompt schema migrations. Indeed, it's common to combine both in the same PR. Doing so, however, can briefly leave the database schema inconsistent WRT the models, because the migrate management command is run by the new container's entrypoint script (docker/entrypoints/prod.sh) before that container accepts connections, while the old container is still serving requests against the old models. The result is a ProgrammingError.

Take this ProgrammingError as an example. 5f43027 removed User.is_active. A query that referenced jobserver_user.is_active was executed by the old container after migrate was run by the new container. The ProgrammingError seemed urgent, and took a non-negligible amount of time to resolve, but it wasn't urgent after all.1 Sentry reported a single event; Honeycomb reported that whilst one call returned 500, subsequent calls returned 200.
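For illustration, a minimal sketch of that pattern. The field and table names come from the report above, but the migration body is just what `makemigrations` would typically generate for such a change, not the actual contents of 5f43027:

```python
# Sketch only: dependencies and migration numbering are placeholders.
from django.db import migrations


class Migration(migrations.Migration):

    dependencies = [("jobserver", "00XX_previous")]

    operations = [
        # Drops the jobserver_user.is_active column as soon as the new
        # container's entrypoint runs `migrate`.
        migrations.RemoveField(model_name="user", name="is_active"),
    ]

# The old container still has is_active on its User model, so its next ORM
# query SELECTs jobserver_user.is_active and the database raises something like:
#   django.db.utils.ProgrammingError: column jobserver_user.is_active does not exist
```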

Separating changes to models from the corresponding schema migrations would mean that the models would be inconsistent WRT the database schema, albeit briefly. It would entail more careful scheduling of PRs. But it wouldn't require many -- if any -- changes to our tooling, as `just check-migrations` doesn't seem to be called from CI.

If we choose to separate changes to models from the corresponding schema migrations, then we should record our decision.

Footnotes

  1. https://bennettoxford.slack.com/archives/C03D1G814BA/p1731328692937349


sentry-io bot commented Nov 11, 2024

Sentry Issue: JOB-SERVER-MG

@mikerkelly
Contributor

mikerkelly commented Nov 13, 2024

Importance of deployment strategy to this practice

I think good practice here depends upon our deployment strategy.

I don't know what our OpenCodelists deployment strategy is. Similar considerations are possibly relevant there.

Job Server's deployment strategy is, I think, to signal the old container to shut down once the new container is accepting connections. In-flight requests to the old container may fail if the new container has modified the database schema and the application accesses the database in a manner invalid under the new schema.

An alternate deployment strategy is to signal and wait for the existing container to stop serving requests before starting the new container, avoiding these issues completely. This trades some uptime for simpler development and deployment. The existing container could either be stopped or, better, put into a "maintenance" state, serving only a static page with a message to that effect and minimal DB dependencies. I don't know how easy this strategy is to implement in Dokku. I hope Job Runner would be robust to short Job Server outages. I imagine they would be on the order of minutes at most. This also affects human users.

I would like to at least consider updating or changing the deployment strategy before changing the PR policy under our current deployment strategy.

Under our current deployment strategy

Django doesn't require model-to-database consistency; database-layer errors only surface when a query actually hits the mismatch. Some model changes likely won't trigger errors under our deployment strategy. Adding a table or column is safe since old code won't query them. Making a field nullable is also safe, as it won't cause database access failures and will only impact application code if non-null values are required.
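As a hedged illustration of those "safe" cases, a migration whose operations only add or relax things (the model and field names here are hypothetical, not real job-server models):

```python
from django.db import migrations, models


class Migration(migrations.Migration):

    dependencies = [("jobserver", "00XX_previous")]

    operations = [
        # Adding a nullable column: old code never SELECTs or writes it,
        # so the model change and migration can ship together.
        migrations.AddField(
            model_name="report",
            name="archived_at",
            field=models.DateTimeField(null=True),
        ),
        # Relaxing NOT NULL: old code keeps writing non-null values and
        # keeps working; only new application code can rely on nulls.
        migrations.AlterField(
            model_name="report",
            name="notes",
            field=models.TextField(null=True),
        ),
    ]
```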

Problems arise when a table or column is removed and old code tries to access it, or when a field is made non-nullable and old code attempts to insert null values. For these cases, we could deploy the application change first in a separate PR before the migration PR.
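A hedged sketch of that two-PR split for a column removal (the model and field are illustrative, not taken from job-server):

```python
# PR 1 (deployed first): delete the field from the model and any code that
# reads it, but ship no migration, so the column stays in the database.
from django.db import migrations, models


class Report(models.Model):  # hypothetical model, for illustration only
    title = models.TextField()
    # legacy_flag = models.BooleanField(null=True)  # deleted from the model only;
    # nullable, so inserts from the new code that omit it still succeed.


# PR 2 (deployed once PR 1 is live everywhere): the migration that actually
# drops the now-unused column.
class Migration(migrations.Migration):

    dependencies = [("jobserver", "00XX_previous")]

    operations = [
        migrations.RemoveField(model_name="report", name="legacy_flag"),
    ]
```

Between the two deploys, a check like `manage.py makemigrations --check` would flag the missing migration, which is presumably why it matters (as noted above) whether `just check-migrations` runs in CI.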

Renaming a table or column is more complex. A good approach is to use three PRs: first, a PR with models and migrations to create the new table or column and replicate existing data; then a PR updating the application code to use the new table/column; and finally, a PR to remove the old field. I'm not sure there's a way to avoid inconsistency without something like that.
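A rough sketch of the first of those three PRs, assuming a column rename from `old_name` to `new_name` on a hypothetical `Thing` model:

```python
from django.db import migrations, models


def copy_forwards(apps, schema_editor):
    # Backfill the new column from the old one so both are readable while
    # old and new application code overlap.
    Thing = apps.get_model("jobserver", "Thing")
    Thing.objects.update(new_name=models.F("old_name"))


class Migration(migrations.Migration):

    dependencies = [("jobserver", "00XX_previous")]

    operations = [
        migrations.AddField(
            model_name="thing",
            name="new_name",
            field=models.TextField(null=True),
        ),
        migrations.RunPython(copy_forwards, migrations.RunPython.noop),
    ]
```

Any writes that land on the old column after this backfill would also need handling (e.g. writing to both columns) until the second PR switches the application code over.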

Where to document any decision

If we choose to separate changes to models from the corresponding schema migrations, then we should record our decision.

I agree we should document any policies or strong opinions on organizing PRs with migrations. I recommend adding this to DEVELOPERS.md -- it’s easily discoverable, and new developers will likely be directed to it. The rationale should be briefly documented there and in the commit message, with reference to the deployment strategy.

I'm not sure documenting this in an ADR is necessary. ADRs are more valuable for architectural decisions with broad, long-term impacts. This decision is not as foundational; it’s relatively easy to adjust and doesn’t impact other parts of the system. Our overall deployment strategy, for example, is architectural, as it’s harder to change and can have multiple downstream effects (such as this Issue).

@bloodearnest
Member

Importance of deployment strategy to this practice

I think good practice here depends upon our deployment strategy.

I have been following this discussion, and I think the above is the key thing to sort out.

Job Server's deployment strategy is, I think, to signal the old container to shut down once the new container is accepting connections.

I think it is more than just accepting connections, it's returning expected codes from certain endpoints. Or at least, it can be configured that way per application. That gives a slightly higher guarantee of actually being up.

In-flight requests to the old container may fail if the new container has modified the database schema and the application accesses the database in a manner invalid under the new schema.

Specifically, a key part of the deployment strategy is currently that we run migrations as part of the production docker entrypoint. It is not a separate step.

This is simple, and has mostly worked, but it has several drawbacks, which have bitten us on occasion.

  • just bringing up the new version will implicitly migrate, possibly breaking the currently running version straight away.
  • if the deploy fails, the db may still have been migrated, and the current version possibly broken and needing manual intervention (either deploying forward or potentially rolling back the migration, but generally the former is easier).

An alternate deployment strategy is to signal and wait for the existing container to stop serving requests before starting the new container, avoiding these issues. This trades some uptime for simpler development and deployment. The existing container could either be stopped or, better, put into a "maintenance" state, serving only a static page with a message to that effect and minimal DB dependencies. I don't know how easy this strategy is to implement in Dokku. I would like to at least consider this option.

This is potentially an option, and I think a very sensible one, as we don't have strong uptime requirements. Dokku has a maintenance mode you can enable that may help with this.

It might also be good to separate starting the container from running the migrations, as it gives us more control over timings and reduces the maintenance window, e.g.:

  1. deploy new image
  2. basic check that it's up (full check may require migrations)
  3. enable maintenance mode
  4. run migrations
  5. full check
  6. switch version
  7. disable maintenance mode
  8. gracefully remove old version

Under our current deployment strategy

[snip]

I agree with Mike that Django's tight coupling between model code and schema makes this harder than it needs to be. It is possible to write model/schema changes carefully in a series of deployed steps such that no downtime is required, but it is a lot of extra work, and still easy to get wrong. And IMO we don't have the uptime requirements to justify the extra cost.

I think that aiming for a deploy with a short downtime window is probably a reasonable compromise.

@mikerkelly
Contributor

Thanks for the extra detail about how the deployment works and the thinking-through of the alternate deployment strategy, Simon, very useful.

@mikerkelly
Contributor

mikerkelly commented Nov 19, 2024

We talked in the 19 Nov planning meeting about containers overlapping during deployment. I think extending the overlap period was suggested to give old requests more time to complete before migrations execute in the new container. I think this may not help under our current strategy, due to Simon's comment:

Specifically, a key part of the deployment strategy is currently that we run migrations as part of the production docker entrypoint. It is not a separate step.

As soon as a new container's entrypoint starts up with breaking migrations, there's a race between old requests completing and the database schema being modified. I think we have to choose, as discussed above, one of: changing the deployment strategy somewhat; splitting PRs with breaking migrations; or accepting the failed requests.

@lucyb lucyb added the deck-scrubbing Tech debt or other between-initiative tidy-up work label Nov 28, 2024