-
Notifications
You must be signed in to change notification settings - Fork 162
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Arthurnn/deadlocks #353
Arthurnn/deadlocks #353
Conversation
@lawrencejones my first commit fixes the
I'm sure the first issue is solvable. I'm a little more concerned about the later. I'm not going to have more time to work on this - writing code isn't officially part of my job at GitHub and I'm not well placed to debug things further here :-( |
Totally understand. I am Keen to give this time, just oversubscribed for this week, and I'd imagine all GC engineers are too. Will give a call round and see if anyone has some free time but otherwise next Monday is the most likely time to get back to this. Also, this PR was useful regardless. And who knows, by next week maybe rails 6 will be officially released and we can forget about senile rails versions! |
Hey @greysteil, pretty sure I have a fix for this. Can you give me write permission on your fork for the time being? I cba creating yet another PR haha |
Yep - you should have it! |
StatesMan will update the old transitions to `most_recent: nil` and then insert the new transition as `most_recent: true`. However that can cause deadlocks on MySQL when running concurrent state transitions. On the update, MySQL will hold a gap lock to the unique indexes there are, so other transactions cannot insert. After two updates on the most_recent and two inserts, the second insert will fail with a deadlock. For more explanation see the PR description that includes this commit.
#350 @arthurnn opened #350 to reduce the impact of Statesman taking gap locks when using MySQL. The crux of the issue is that MySQL's implementation of REPEATABLE READ can take wide locks when an update touches no rows, which happens frequently on the first transition of Statesman. By first creating the new transition, we can avoid issuing an update that will take the large gap lock. This order of queries meant we added an additional query to the transition step which could impact people who rely on low-latency Statesman transitions. This commit is another take on the same approach that collapses two queries into one, by taking the update of the old and new transition's most_recent column and updating them together. It's slightly janky but if robust, would be a good alternative to avoid additional latency.
This commit first refactors the construction of transition SQL statements to use Arel. This is safer and hopefully more readable than constructing large SQL statements using strings. It also fixes a bug with transition updates where MySQL would throw index violations. This was caused by MySQL validating index constraints across a partially applied update, where Statesman would set the latest transition's most_recent = 't' before unsetting the previous.
rails/rails@7508284 When merged with Rails core, Arel removed an initialisation parameter that was unused in all but a few of the Arel features. For now, provide a helper local to the ActiveRecord adapter that can construct Manager classes independent of the API change.
When constructing Arel nodes, we don't need to be quoting the values. This happens automatically via the Arel#to_sql interface, so quoting them and passing them into Arel causes them to be quoted twice.
Where the parameters of exec_update were not defaulted to sensible values, and must be supplied.
Ok! So what is currently in this PR will resolve the problems we were seeing. Unfortunately it breaks many a test. However the failures should be exclusively on Rails 4, which we intend to deprecate as soon as Rails 6 is fully released, so I don't think this is a big deal. The biggest piece of work required around this change is performance testing. If we deploy this and verify the performance is unchanged/improved, we can go ahead and merge to master in anticipation of the Rails 6 release and get this out in the wild. I'm going to ping the GC Banking teams to ask them to trial this branch to confirm no performance regressions. Hopefully we'll have an answer by EoW and can make a decision. |
Checking in on this: payments-service has some test failures that I've not yet had time to look into. Rounded up some volunteers (big-up @thom-oman!) from our Banking teams to double check the performance though, so this should be quite quick provided I can find some space to fix-up the tests. |
Hi @lawrencejones, we're currently running into these deadlocks at GitHub, would it help if I carve out some time this week to fix up the tests? |
Hey @jurre sorry for the delays on this, both me and Lawrence haven't been able to devote any time to work on this. I managed to get all tests passing on a branch of mine last September but if I recall correctly there was some clean up to do, as well making sure this doesn't cause any performance regressions. I will spend some time later today seeing what needs to be done but unfortunately I won't be able to dedicate much time to it until next week. |
Thanks @thom-oman! Please let me know if there's anything I can do to help ❤️ |
Quick update, good news is that I think we're very close. All tests were already passing, minus some PG related failures, those should be fixed in #398. I've opened (yet another) PR here, will try to get some more GC eyes on it as soon as we can, would appreciate any comments from your side too. Hopefully we can get that merged pretty soon and finally get this issue resolved 😅 |
Today I'm going to see if it breaks anything in our monolith, but besides that, we should be good to merge pretty soon 👍 |
Awesome, thanks so much @thom-oman 🥇 |
Superseded by #399. Thanks all for the help, and sorry for the (very large) delay. |
Adds a commit to #350 (I'm not a maintainer, so couldn't do it from within that PR).