-
Notifications
You must be signed in to change notification settings - Fork 161
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Inverse the order of writes on create transition #350
Conversation
StatesMan will update the old transitions to `most_recent: nil` and then insert the new transition as `most_recent: true`. However that can cause deadlocks on MySQL when running concurrent state transitions. On the update, MySQL will hold a gap lock to the unique indexes there are, so other transactions cannot insert. After two updates on the most_recent and two inserts, the second insert will fail with a deadlock. For more explanation see the PR description that includes this commit.
b181139
to
753649d
Compare
@lawrencejones / @Sinjo - thought you might be interested in this one. |
It's great what you find when you start running this stuff on different db engines! I hadn't considered MySQL and it's default of repeatable reads with this type of operation but it makes sense it won't be healthy for throughput. I think this probably makes sense but it does add a query to the transition operation which will have a measurable impact on people relying on low latency transitions. I'll put some time aside today to consider the full scope of that impact and give this a proper review. |
sqlite, postgres and mysql all support upsert, at least in recent versions.
Could we check for this and use it to update both of the columns
simultaneously?
We'd keep the current proposal as a fallback to maintain compatibility.
…On Thu, 13 Jun 2019, 06:00 Lawrence Jones, ***@***.***> wrote:
It's great what you find when you start running this stuff on different db
engines! I hadn't considered MySQL and it's default of repeatable reads
with this type of operation but it makes sense it won't be healthy for
throughput.
I think this probably makes sense but it does add a query to the
transition operation which will have a measurable impact on people relying
on low latency transitions. I'll put some time aside today to consider the
full scope of that impact and give this a proper review.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#350?email_source=notifications&email_token=AAFC7WXBUYGMUQTTIPXSST3P2HH6NA5CNFSM4HXRCJY2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGODXSQP4Y#issuecomment-501549043>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAFC7WQEVRJJUK7KXNQAG6DP2HH6NANCNFSM4HXRCJYQ>
.
|
Using upsert is going to be really difficult through the AR layer- I think you'd need to insert the new transition, catch the conflict and write a fallback that both updates the conflicting transition and retries your initial insert. You'd also catch genuine race conflicts that should throw exceptions too, the sort you'd normally want to recover from with Doing a general tidy-up of the statesman now to get a local development environment going, then I'm going to breakdown what SQL is actually being run here to be clear on how the behaviour will change. Will drop an update here when I've got something! |
gocardless#350 @arthurnn opened gocardless#350 to reduce the impact of Statesman taking gap locks when using MySQL. The crux of the issue is that MySQL's implementation of REPEATABLE READ can take wide locks when an update touches no rows, which happens frequently on the first transition of Statesman. By first creating the new transition, we can avoid issuing an update that will take the large gap lock. This order of queries meant we added an additional query to the transition step which could impact people who rely on low-latency Statesman transitions. This commit is another take on the same approach that collapses two queries into one, by taking the update of the old and new transition's most_recent column and updating them together. It's slightly janky but if robust, would be a good alternative to avoid additional latency.
Hey @arthurnn, I had a think about this and have put together arthurnn#1 that I'd love your feedback on. It's in the same spirit as your PR but removes the additional query which could be nice for those who care. The description should explain everything but let me know if I can be clearer somewhere. |
@lawrencejones Pretty cool what you did on arthurnn#1, I like the approach of reusing the I am glad to merge your PR into my branch, so it will update this PR. just let me know. @danwakefield I thought about using an upsert, but we still need to update the old transitions to change the most_recent, thus I could not find a working solution using it. |
So because of But the |
I suspect @greysteil has already mentioned this, but we use statesman at GoCardless for a load of our application code. In several places we transactionally transition many models and while the database queries are super-fast, the latency to the database dominates the query runtime. Consequently we'd much prefer to avoid adding extra hops if possible!
Let's do it! Once we have that PR merged, we're happy to try this out in our staging environment to double check everything is sound. If things look good we'll |
Lock-lesser transitions without additional queries
Interesting note on the db latency. 👍 merged. |
will fix rubocop. one min |
|
A note to say I'm not ignoring this, just busy today. Will get to fixing this next week. |
A note to say I didn't get the time to look at things this week but intend to next. Hope it's not impacting anyone! |
@lawrencejones gentle reminder on this one. @arthurnn is now on paternity leave so won't be responsive anymore 😢 |
Ahh that's really annoying, I missed getting to this before going on holiday last week. My general plan was to identify an alternative to using Arel OR, or maybe just waiting for Rails 6 to be released at which point we can deprecate support for older versions that don't support that functionality. I think the approach here (as in, the mechanics of how the SQL now work) is the correct one, just need to find a way to make tests pass. I'm going to be pressed for time this week (it's roadmap week internally) so would welcome any adjustments you'd want to make @greysteil? |
# previous transitions. | ||
def update_most_recents(most_recent_id) | ||
transitions = transitions_for_parent | ||
last_or_current = transitions.where(id: most_recent_id).or( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Equivalent SQL should be:
transitions.where("#{transition_class.table_name}.id = #{most_recent_id} OR #{transition_class.table_name}.most_recent = #{::ActiveRecord::Base.connection.quote(true)}")
The interpolation should be safe, too, since everything's from statesman internals.
We use a similar approach in Statesman::Adapters::ActiveRecordQueries
.
Can PR it if you're 👍 on the approach.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Absolutely- please do just add it to this PR!
Slightly concerned about table_name
(if the table name is hyphenated, then I think this will break depending on what database you use?) but if we use it elsewhere then it's unlikely to be something people rely on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Think we use it elsewhere, but no reason not to quote the shit out of it 🤷♂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Better to use AREL in this case, to be more safe and database agnostic.
#350 @arthurnn opened #350 to reduce the impact of Statesman taking gap locks when using MySQL. The crux of the issue is that MySQL's implementation of REPEATABLE READ can take wide locks when an update touches no rows, which happens frequently on the first transition of Statesman. By first creating the new transition, we can avoid issuing an update that will take the large gap lock. This order of queries meant we added an additional query to the transition step which could impact people who rely on low-latency Statesman transitions. This commit is another take on the same approach that collapses two queries into one, by taking the update of the old and new transition's most_recent column and updating them together. It's slightly janky but if robust, would be a good alternative to avoid additional latency.
# indexes. By doing the conditioning on the column, rather than Rails' opinion of | ||
# whether the database supports partial indexes, we're robust to DBs later adding | ||
# support for partial indexes. | ||
def not_most_recent_value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What do you think about change this method to
def not_most_recent_value
false if transition_class.columns_hash["most_recent"].null == false
end
Hey hey, we're still running into this issue, @arthurnn has switched teams but I'm happy to help wherever I can to get this over the line. |
#350 @arthurnn opened #350 to reduce the impact of Statesman taking gap locks when using MySQL. The crux of the issue is that MySQL's implementation of REPEATABLE READ can take wide locks when an update touches no rows, which happens frequently on the first transition of Statesman. By first creating the new transition, we can avoid issuing an update that will take the large gap lock. This order of queries meant we added an additional query to the transition step which could impact people who rely on low-latency Statesman transitions. This commit is another take on the same approach that collapses two queries into one, by taking the update of the old and new transition's most_recent column and updating them together. It's slightly janky but if robust, would be a good alternative to avoid additional latency.
#350 @arthurnn opened #350 to reduce the impact of Statesman taking gap locks when using MySQL. The crux of the issue is that MySQL's implementation of REPEATABLE READ can take wide locks when an update touches no rows, which happens frequently on the first transition of Statesman. By first creating the new transition, we can avoid issuing an update that will take the large gap lock. This order of queries meant we added an additional query to the transition step which could impact people who rely on low-latency Statesman transitions. This commit is another take on the same approach that collapses two queries into one, by taking the update of the old and new transition's most_recent column and updating them together. It's slightly janky but if robust, would be a good alternative to avoid additional latency.
* Inverse the order of writes on statesman transitions StatesMan will update the old transitions to `most_recent: nil` and then insert the new transition as `most_recent: true`. However that can cause deadlocks on MySQL when running concurrent state transitions. On the update, MySQL will hold a gap lock to the unique indexes there are, so other transactions cannot insert. After two updates on the most_recent and two inserts, the second insert will fail with a deadlock. For more explanation see the PR description that includes this commit. * Make sure the initial most_recent state is false/nil * Lock-lesser transitions without additional queries #350 @arthurnn opened #350 to reduce the impact of Statesman taking gap locks when using MySQL. The crux of the issue is that MySQL's implementation of REPEATABLE READ can take wide locks when an update touches no rows, which happens frequently on the first transition of Statesman. By first creating the new transition, we can avoid issuing an update that will take the large gap lock. This order of queries meant we added an additional query to the transition step which could impact people who rely on low-latency Statesman transitions. This commit is another take on the same approach that collapses two queries into one, by taking the update of the old and new transition's most_recent column and updating them together. It's slightly janky but if robust, would be a good alternative to avoid additional latency. * fmt * Replace Arel#or with raw SQL * Attempt to fix typecasting * Fix MySQL issue with ordered updates This commit first refactors the construction of transition SQL statements to use Arel. This is safer and hopefully more readable than constructing large SQL statements using strings. It also fixes a bug with transition updates where MySQL would throw index violations. This was caused by MySQL validating index constraints across a partially applied update, where Statesman would set the latest transition's most_recent = 't' before unsetting the previous. * Smooth-over breaking API change in Arel rails/rails@7508284 When merged with Rails core, Arel removed an initialisation parameter that was unused in all but a few of the Arel features. For now, provide a helper local to the ActiveRecord adapter that can construct Manager classes independent of the API change. * Handle transitions differently for MySQL To avoid a gap lock with MySQL, first insert the new transition with most_recent=false and then flip the values for the latest two transitions in order. * MySQL specific gaplock protection is configurable Instead of depending on the name of the ActiveRecord adapter to identify if the database is MySQL, gaplock protection is configurable, as suggested by @jurre on #399 * Loosen dependency on pg gem This reflects a recent change in Rails (rails/rails@592358e) Co-authored-by: Arthur Neves <[email protected]> Co-authored-by: Lawrence Jones <[email protected]> Co-authored-by: Grey Baker <[email protected]>
I think this has being merged in master already right? |
@arthurnn yes, and we released it yesterday. this should've been closed, will close now. |
* Inverse the order of writes on statesman transitions StatesMan will update the old transitions to `most_recent: nil` and then insert the new transition as `most_recent: true`. However that can cause deadlocks on MySQL when running concurrent state transitions. On the update, MySQL will hold a gap lock to the unique indexes there are, so other transactions cannot insert. After two updates on the most_recent and two inserts, the second insert will fail with a deadlock. For more explanation see the PR description that includes this commit. * Make sure the initial most_recent state is false/nil * Lock-lesser transitions without additional queries gocardless#350 @arthurnn opened gocardless#350 to reduce the impact of Statesman taking gap locks when using MySQL. The crux of the issue is that MySQL's implementation of REPEATABLE READ can take wide locks when an update touches no rows, which happens frequently on the first transition of Statesman. By first creating the new transition, we can avoid issuing an update that will take the large gap lock. This order of queries meant we added an additional query to the transition step which could impact people who rely on low-latency Statesman transitions. This commit is another take on the same approach that collapses two queries into one, by taking the update of the old and new transition's most_recent column and updating them together. It's slightly janky but if robust, would be a good alternative to avoid additional latency. * fmt * Replace Arel#or with raw SQL * Attempt to fix typecasting * Fix MySQL issue with ordered updates This commit first refactors the construction of transition SQL statements to use Arel. This is safer and hopefully more readable than constructing large SQL statements using strings. It also fixes a bug with transition updates where MySQL would throw index violations. This was caused by MySQL validating index constraints across a partially applied update, where Statesman would set the latest transition's most_recent = 't' before unsetting the previous. * Smooth-over breaking API change in Arel rails/rails@7508284 When merged with Rails core, Arel removed an initialisation parameter that was unused in all but a few of the Arel features. For now, provide a helper local to the ActiveRecord adapter that can construct Manager classes independent of the API change. * Handle transitions differently for MySQL To avoid a gap lock with MySQL, first insert the new transition with most_recent=false and then flip the values for the latest two transitions in order. * MySQL specific gaplock protection is configurable Instead of depending on the name of the ActiveRecord adapter to identify if the database is MySQL, gaplock protection is configurable, as suggested by @jurre on gocardless#399 * Loosen dependency on pg gem This reflects a recent change in Rails (rails/rails@592358e) Co-authored-by: Arthur Neves <[email protected]> Co-authored-by: Lawrence Jones <[email protected]> Co-authored-by: Grey Baker <[email protected]>
Context
Dependabot is using statesman, and we are running the Dependabot service using MySQL at GitHub.
After some load testing the system, we found out some increasing number of deadlocks coming from states man. This fixed the problem for us.
This is a lock-free solution for #228. Probably also fixes #206
Short problem description
When we were trying to call
ar_model.transition_to!
, we would get a Deadlock error if the system was under pressure. (lots of concurrent transitions happening at the same time)Detailed problem and root cause analysis
Some context
When doing a
transition_to!
statesman will do, mainly, two operations in the database, inside the same transaction:Also, the transitions table usually have two unique indexes,
(parent_id,sort_key)
and(parent_id,most_recent)
, the second index is there to guarantee we only have one most_recent transition for the parent entry. Thus we need the update before the insert.Problem
MySQL default transaction isolation level is
REPEATABLE READ
and that's what we use at Dependabot/GitHub. It states:What means is if in a loop, we do the same SELECT within a db transaction, even if other clients have committed data between the runs, it will, ALWAYS, see the first results it saw the first time.
That is also respected when doing writes (Update/Delete). so
UPDATE transitions SET most_recent=NULL where parent_id=1 AND most_recent=true;
will put an exclusive lock on the row for that unique index, as it cannot allow other transactions to change the data. However, if there is no row to update, MySQL will have to create a next-key lock, to also prevent others from changing this data, as the transaction's view of the snapshot didn't include any rows, as the UPDATE returned 0 rows changed.The issue is, two independent transactions can put the next-key lock on the table (even if they reference different columns values on the index), and when it tries to do the next operation, the insert, it will lock, waiting for the next-key lock to be released. Therefore, if two updates happen first, and two inserts after, they will deadlock. (for better understanding see Simulation 1)
However, if the Update would have changed at least one row, there would be no locking, as for REPEATABLE READ that would be our current snapshot of the data. Thus no need for a gap locking.
See
This is also explained on this 2003 issue, by the people that wrote InnoDB, and on a few articles: 1 2
Solution
There are a few different ways to fix this. The easiest is just to (1)retry on deadlock, another one,(2) would be to change the transaction isolation level for that transaction to READ COMMITTED, which means SELECTS within the transaction are not consistent, and will always return committed data, thus, MySQL will not create the gap lock, as it is safe to update without the lock.
The problem of 1 is that we would still need to limit the number of retries, and eventually can still have the deadlocks, and number 2 would require us to always use ROW based replication as statement base replication wants REPEATABLE READ. [Also, not advised to change the transaction isolation level inside a library]
Instead, we can tackle the root problem: First, I thought in doing a select, and if transition data had to be updated do the update. However, that is race prone.
The actual fix is, re-order the operations, and always do the insert before, so we know the next Update will at least touch one row, and make our latest transition most_recent after. Thus, no next-key lock will happen and less contention will be created.
To simulate this:
Simulation 1 (SQL)
A and B are different MySQL clients:
cc @greysteil @hmarr