Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Arthurnn/deadlocks #353

Closed
wants to merge 10 commits into from
Closed

Arthurnn/deadlocks #353

wants to merge 10 commits into from

Conversation

greysteil
Copy link
Contributor

@greysteil greysteil commented Jul 3, 2019

Adds a commit to #350 (I'm not a maintainer, so couldn't do it from within that PR).

@greysteil
Copy link
Contributor Author

@lawrencejones my first commit fixes the .or( problems, but it looks like this PR has two additional problems:

  • Getting type casting right when using a case statement is tricky for older combinations of Rails and postgres.
  • We seem to be failing a uniqueness constraint on MySQL, even when the type casting works. See this failure for an example.

I'm sure the first issue is solvable. I'm a little more concerned about the later.

I'm not going to have more time to work on this - writing code isn't officially part of my job at GitHub and I'm not well placed to debug things further here :-(

@lawrencejones
Copy link
Contributor

Totally understand. I am Keen to give this time, just oversubscribed for this week, and I'd imagine all GC engineers are too. Will give a call round and see if anyone has some free time but otherwise next Monday is the most likely time to get back to this.

Also, this PR was useful regardless. And who knows, by next week maybe rails 6 will be officially released and we can forget about senile rails versions!

@lawrencejones
Copy link
Contributor

Hey @greysteil, pretty sure I have a fix for this.

Can you give me write permission on your fork for the time being? I cba creating yet another PR haha

@greysteil
Copy link
Contributor Author

Yep - you should have it!

arthurnn and others added 10 commits July 9, 2019 18:52
StatesMan will update the old transitions to `most_recent:
nil` and then insert the new transition as `most_recent: true`.
However that can cause deadlocks on MySQL when running concurrent state
transitions.
On the update, MySQL will hold a gap lock to the unique indexes there are, so
other transactions cannot insert. After two updates on the most_recent
and two inserts, the second insert will fail with a deadlock.
For more explanation see the PR description that includes this commit.
#350

@arthurnn opened #350 to reduce the impact of Statesman taking gap locks
when using MySQL. The crux of the issue is that MySQL's implementation
of REPEATABLE READ can take wide locks when an update touches no rows,
which happens frequently on the first transition of Statesman.

By first creating the new transition, we can avoid issuing an update
that will take the large gap lock. This order of queries meant we added
an additional query to the transition step which could impact people who
rely on low-latency Statesman transitions.

This commit is another take on the same approach that collapses two
queries into one, by taking the update of the old and new transition's
most_recent column and updating them together. It's slightly janky but
if robust, would be a good alternative to avoid additional latency.
This commit first refactors the construction of transition SQL
statements to use Arel. This is safer and hopefully more readable than
constructing large SQL statements using strings.

It also fixes a bug with transition updates where MySQL would throw
index violations. This was caused by MySQL validating index constraints
across a partially applied update, where Statesman would set the latest
transition's most_recent = 't' before unsetting the previous.
rails/rails@7508284

When merged with Rails core, Arel removed an initialisation parameter
that was unused in all but a few of the Arel features. For now, provide
a helper local to the ActiveRecord adapter that can construct Manager
classes independent of the API change.
When constructing Arel nodes, we don't need to be quoting the values.
This happens automatically via the Arel#to_sql interface, so quoting
them and passing them into Arel causes them to be quoted twice.
Where the parameters of exec_update were not defaulted to sensible
values, and must be supplied.
@lawrencejones
Copy link
Contributor

Ok! So what is currently in this PR will resolve the problems we were seeing.

Unfortunately it breaks many a test. However the failures should be exclusively on Rails 4, which we intend to deprecate as soon as Rails 6 is fully released, so I don't think this is a big deal.

The biggest piece of work required around this change is performance testing. If we deploy this and verify the performance is unchanged/improved, we can go ahead and merge to master in anticipation of the Rails 6 release and get this out in the wild.

I'm going to ping the GC Banking teams to ask them to trial this branch to confirm no performance regressions. Hopefully we'll have an answer by EoW and can make a decision.

@lawrencejones
Copy link
Contributor

Checking in on this: payments-service has some test failures that I've not yet had time to look into. Rounded up some volunteers (big-up @thom-oman!) from our Banking teams to double check the performance though, so this should be quite quick provided I can find some space to fix-up the tests.

@jurre
Copy link
Contributor

jurre commented Apr 23, 2020

Hi @lawrencejones, we're currently running into these deadlocks at GitHub, would it help if I carve out some time this week to fix up the tests?

@thom-oman
Copy link
Contributor

Hey @jurre sorry for the delays on this, both me and Lawrence haven't been able to devote any time to work on this. I managed to get all tests passing on a branch of mine last September but if I recall correctly there was some clean up to do, as well making sure this doesn't cause any performance regressions.

I will spend some time later today seeing what needs to be done but unfortunately I won't be able to dedicate much time to it until next week.

@jurre
Copy link
Contributor

jurre commented Apr 23, 2020

Thanks @thom-oman! Please let me know if there's anything I can do to help ❤️

@thom-oman
Copy link
Contributor

Quick update, good news is that I think we're very close. All tests were already passing, minus some PG related failures, those should be fixed in #398.

I've opened (yet another) PR here, will try to get some more GC eyes on it as soon as we can, would appreciate any comments from your side too. Hopefully we can get that merged pretty soon and finally get this issue resolved 😅

@thom-oman
Copy link
Contributor

Today I'm going to see if it breaks anything in our monolith, but besides that, we should be good to merge pretty soon 👍

@jurre
Copy link
Contributor

jurre commented Apr 29, 2020

Awesome, thanks so much @thom-oman 🥇

@lawrencejones
Copy link
Contributor

Superseded by #399. Thanks all for the help, and sorry for the (very large) delay.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants