Mysql2::Error: Deadlock when attempting to lock a job #63
I get this also with 4.0.0.beta2.
+1 .. Seems to happen to us when creating a delayed job from within another running job. Not all the time, of course, just very intermittently.
+1
Same here.
+1 delayed_job (3.0.5), mysql (5.5.27)
Can we help with anything to solve this? Does anyone have a patch or hints? We are restarting the workers every 10 minutes because of this.
@philister We ended up forking and changing the logic to use a row-level lock. I suspect the original implementation went to lengths to avoid using a lock, but in the end I'll accept a performance penalty for more reliability. You can find our fork here: https://github.com/doxo/delayed_job_active_record Note that we do still get occasional deadlocks which result in restarts, but they are rare in comparison to what we were getting before. It would be even better if the gem could detect the deadlock and retry, but getting the root cause out of the exception raised by the MySQL adapter is not straightforward, and possibly version dependent. So I opted to live with the occasional deadlocks for now.
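A minimal sketch of what a row-level-lock reservation like the one described above could look like, using ActiveRecord's pessimistic locking. This is not the doxo fork's actual code; worker_name and the exact conditions are illustrative.

Delayed::Job.transaction do
  # SELECT ... FOR UPDATE on one candidate row, so concurrent workers queue up
  # on the row lock instead of racing on index/gap locks.
  job = Delayed::Job.where(failed_at: nil, locked_at: nil)
                    .where("run_at <= ?", Time.now)
                    .order(:priority, :run_at)
                    .lock(true)
                    .first
  # worker_name is a stand-in for however the worker identifies itself.
  job.update_columns(locked_at: Time.now, locked_by: worker_name) if job
end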
@gorism Thanks a lot!
@gorism Great fix! Thanks! -- The original DJ workers were dying after a deadlock. I'm happy to see my workers still alive after some time.
Edit: see followup at #63 (comment); the new queries may not be a net gain, even without deadlocks. We were bitten by this as well during an upgrade. I set up a concurrency test on my dev machine using two workers and a self-replicating job. The job simply creates two more of itself, with semi-random priority and run_at values. This setup reliably repros the deadlock within seconds. The fix I've worked out locally is to replace the default index on (priority, run_at) with one that also includes locked_at. Hope this helps. 🔒🔒
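A rough reconstruction of that kind of repro job, assuming a plain Ruby payload object (the class name and value ranges are made up):

class SelfReplicatingJob
  # Each run enqueues two more copies of itself with semi-random priority and
  # run_at values, so a couple of workers stay busy and keep contending for
  # the delayed_jobs table.
  def perform
    2.times do
      Delayed::Job.enqueue SelfReplicatingJob.new,
                           priority: rand(10),
                           run_at: rand(5).seconds.from_now
    end
  end
end

Seed one job and run two or more workers against a throwaway database; with the stock index the deadlock reportedly shows up within seconds.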
+20 Thanks a bunch, cainlevy. This did the trick. Great work.
Finally, this did it for us, too. Thanks @cainlevy (other solutions didn't work in our case).
Upgraded from 0.3.3 to 0.4.4 and we're experiencing deadlock issues as well. We're running a total of 10 workers and usually they manage to keep the job queue down to 1000 or so. On MySQL 5.1. @cainlevy's solution didn't work for us :-(.
What do you get from an explain on a query with the same …?
I don't think we have the … Hopefully you can reproduce the deadlock safely with a handful of workers running a self-replicating job. Let us know if you find anything more!
I added …
Adding locked_at to the index (as described by @cainlevy) has the side effect for me that only a few (2-5) of my 10 workers are working at the same time. Maybe this affects the performance of the jobs table?
Thanks @cainlevy. Adding locked_at to the index seems to have solved the problem for me as well. I did it yesterday, amidst the chaos of these deadlock errors being thrown all over the place in a queue of 100k jobs. As soon as this change was made, those deadlock issues were gone.
+1 Thanks @cainlevy, works great.
Is everyone that this is working for on MySQL 5.5+?
MySQL 5.5.x. If my fix isn't working for someone, I'd recommend following the steps I described earlier to set up a repro someplace safe to experiment. It's the only way we're going to learn more.
I'll publish my test setup in a moment. In the meantime: could not repro on 5.6.13.
My experience with this patch: it does stop the deadlocks, but my DJ performance doesn't improve once the deadlocks are gone. However, removing the MySQL optimizations makes performance jump from about 1 job/second/worker to about 10 jobs/second/worker. Seeing if upgrading to 5.6 will help one way or the other.
I had originally thought some of the fixes above fixed my deadlock issues, but apparently I still got them. I think using multiple processes will cause race conditions with the DB, so I built a gem that is thread-based: a manager thread pulls jobs from the DB and distributes them to a thread pool of workers. I would totally welcome contributors to http://github.com/zxiest/delayed_job_active_record_threaded
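A condensed sketch of that manager-plus-thread-pool idea, not the gem's actual code; reserve_next_job_from_db is a hypothetical helper standing in for the single reader that talks to the database:

require "thread"

queue = SizedQueue.new(100)          # bounded hand-off between reader and workers

manager = Thread.new do
  loop do
    job = reserve_next_job_from_db   # hypothetical: the only code that locks rows
    job ? queue.push(job) : sleep(1)
  end
end

workers = Array.new(8) do
  Thread.new do
    while (job = queue.pop)
      job.invoke_job                 # runs the payload's perform, as delayed_job does
    end
  end
end

(workers + [manager]).each(&:join)

Since only one thread touches the delayed_jobs table, workers never compete for row or index locks.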
Part of the minority as well :) @albus522 "This issue primarily comes up when running a lot of really fast jobs against a mysql database, at which point why are the jobs being delayed at all." One of the reasons I use ActiveRecord-backed DJ a lot: it helps me build robust distributed applications. I call remote services with DJs, not to delay them, but to replace the network calls with "INSERT INTO" statements, which are part of the local DB transaction. The DJ then takes care of doing the network call (and only this). I hope you guys don't mind that I plug my blog here, but those who are interested can read more here: http://blog.seiler.cc/post/112879365221/using-a-background-job-queue-to-implement-robust
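A tiny illustration of that pattern; Order, order_params and NotifyFulfillmentService are made-up names:

Order.transaction do
  order = Order.create!(order_params)             # local write
  # The enqueue is just an INSERT INTO delayed_jobs, committed or rolled back
  # together with the order; the worker makes the HTTP call later.
  Delayed::Job.enqueue NotifyFulfillmentService.new(order.id)
end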
Minority here as well. We are using @csmuc's monkey patch, which got rid of the deadlocks, but then our select times got crazy slow (~200k rows, 2-3 seconds, MySQL 5.6.22, AWS RDS db.m1.small). After monkeying around with indexes we found a pretty weird combination that seems to work well: add_index :delayed_jobs, [:locked_by, :locked_at, :run_at] # has to be in this order. Somehow this causes MySQL to do an index_merge with a sort_union and brings the select times down to a couple ms. Disclaimer: we haven't used this in production, but I'll let you all know how it goes...
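For anyone who wants to try it, that index as a Rails migration (a sketch; verify the query plan with EXPLAIN against your own data and MySQL version before relying on it):

class AddLockingIndexToDelayedJobs < ActiveRecord::Migration
  def change
    # Column order matters for the index_merge / sort_union plan described above.
    add_index :delayed_jobs, [:locked_by, :locked_at, :run_at],
              name: "index_delayed_jobs_on_locked_by_locked_at_run_at"
  end
end

On Rails 5+ the superclass needs a version tag, e.g. ActiveRecord::Migration[5.0].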
Same problem here, even with a relatively small number of jobs (~500 recurring hourly - some small, some large), 3 servers, 6 workers each, and only ~2500 rows in the DJ table. I even see the deadlocks if only one server is running. These deadlocks seem to cause our DB CPU to hit and maintain ~100% utilization in perpetuity. This is obviously a big problem... Is there any sort of consensus on which fork or monkey patch solves the problem? @albus522? @whitslar how have those indexes been working for you over time with @csmuc's monkey patch?
The other week we tried out the MySQL version for a couple of hours in production (by removing my monkey patch) and almost immediately experienced deadlocks. I also would like to get rid of the monkey patch by having a configuration option. I guess the chances are low to get that merged into this repo, so I'm also interested in moving to a fork as well.
They've been working great. They've been in production for about 4 weeks now and we have had no issues. No deadlocks, and the select times are great. Before the indexes, our production DB (db.m3.xlarge, MySQL 5.6.22, ~200k rows, 9 workers) had been running at 70-80% CPU pretty regularly, and now it's steady at 5-10%.
PR to add a configuration option: #111
Ok, this is a huge gotcha. I just traced some major DB issues down to this problem here. Long story short, I just found a 47GB mysql-slow query log that completely filled up my hard drive and took down my app. This is one of the many side effects I've been noticing since I added multiple app servers. I only have 3 application servers and 1 worker per server, and this is consistently happening. Is there a fix for this in master, or do I need to use a fork? I'm really not super good with DB tuning, so I'm not sure I follow all of the info posted above, but maybe this can add to the discussion. This is what MySQL is flooding my error log with:
I think this is a different problem in a way, because it is a replication-specific problem, but I believe it is something that should contribute to the decision made here.
I'm also part of the growing minority of people who are experiencing this issue on real-life production setups. Deadlocks kill delayed_job processes and we end up having far fewer workers after spikes of jobs. We're considering monkey-patching, as the different forks do not contain the latest additions to delayed_job, or just moving to Resque.
There is an open PR to add a config option: #111
Wow. I've been using the …
Good news: the configuration option from #111 has been merged.
Awesome, now you can choose between a slower option and a fundamentally flawed option (see my above comments to see why it is fundamentally flawed). Not sure why I'm still subscribed to this thread... unsubscribing. Good luck, all.
Ahh, I dunno, the original locking mechanism plus the indexes I described above completely fixes the issue for us. And if I were to bet on it, it would fix the root issue for everyone else here, provided they're on MySQL >= 5.6 (there were some weird things happening with index merge/sort unions on 5.5, but I haven't tested, so I can't confirm, mainly because it only rears its head under a decent-sized load). Has anyone else been using the indexes I described? Either way, very happy to have PR #111 merged!
@whitslar Haven't tried it in production, but the indexes actually made the queries a bit slower on my dev box (MySQL 5.6.13, 2k records). Maybe I'll spend some time and play around with other indexes.
Not sure what I'm doing wrong, but I tried the master branch with the default lock method (legacy) and my logs were immediately flooded with deadlock errors:
I put this in my Gemfile: … and this in config/initializers/delayed_job_config.rb: … Any ideas why this is still a problem, or perhaps got worse, when I tried the fix? Thanks
Because that method was rarely better, which is why it isn't the default.
So, bottom line: delayed_job will not work with MySQL and multiple workers?
@smikkelsen The SQL must look something like this: …
My Rails initializer:
Delayed::Backend::ActiveRecord.configure do |config|
  config.reserve_sql_strategy = :default_sql
end
I successfully use it in production (MySQL 5.6, 8 workers) and ran into deadlocks constantly when using the other SQL strategy (which also littered the MySQL log files with replication warnings).
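For context, roughly what the two reservation strategies boil down to, as I understand the gem; these are simplified shapes, not the literal SQL it generates:

# :optimized_sql (the MySQL default): a single UPDATE that scans the index
# under gap/next-key locks, which is where concurrent workers deadlock.
OPTIMIZED_SQL_SHAPE = <<-SQL
  UPDATE delayed_jobs
     SET locked_at = NOW(), locked_by = 'host:pid'
   WHERE failed_at IS NULL
     AND run_at <= NOW()
     AND (locked_at IS NULL OR locked_at < NOW() - INTERVAL 4 HOUR)
   ORDER BY priority ASC, run_at ASC
   LIMIT 1
SQL

# :default_sql: read a handful of candidate ids first, then claim one row by
# primary key, moving on to the next candidate if another worker won the race.
DEFAULT_SQL_SHAPE = <<-SQL
  SELECT id FROM delayed_jobs
   WHERE failed_at IS NULL AND run_at <= NOW()
     AND (locked_at IS NULL OR locked_at < NOW() - INTERVAL 4 HOUR)
   ORDER BY priority ASC, run_at ASC
   LIMIT 5;

  UPDATE delayed_jobs
     SET locked_at = NOW(), locked_by = 'host:pid'
   WHERE id = ? AND (locked_at IS NULL OR locked_at < ?)
SQL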
Running a ton of small jobs asynchronously is a perfectly reasonable thing to do, as is running multiple workers to offload the actual work. The queueing mechanism should work regardless... period. It shouldn't matter whether there are 5 big slow jobs or 10,000 fast ones.
@brettwgreen It is also reasonable to assume you optimize for your primary use case. No tool will ever work ideally for all situations.
Having a similar issue with the latest version (4.1.3) of the gem. I've used the following workaround to fix the issue:
# config/initializers/delayed_job_active_record.rb
# workaround based on https://github.com/collectiveidea/delayed_job_active_record/pull/91
#
module Delayed
  module Backend
    module ActiveRecord
      class Job < ::ActiveRecord::Base
        def save(*)
          retries = 0
          begin
            super
          rescue ::ActiveRecord::Deadlocked => e
            if retries < 100
              logger.info "ActiveRecord::Deadlocked rescued: #{e.message}"
              logger.info 'Retrying...'
              retries += 1
              retry
            else
              raise
            end
          end
        end
      end
    end
  end
end
I just ran into a version of this issue. We're on MySQL 5.7. For me the deadlocks were occurring between the DJ worker queries that search for work (optimized SQL) and the INSERTs from the app that enqueue a job. Both queries require a lock on the index, but I concluded there is no real deadlock scenario; it just took too dang long to release the lock on the index, even on a nearly empty delayed_jobs table. We have around 12 workers. What fixed it for us is we ran …
We've just upgraded from 0.3.3 to 0.4.4, to resolve the race condition problem when running multiple workers. We're now seeing occasional MySQL deadlocks, which are unhandled exceptions and end up killing the worker.
For example:
It would seem that, at the very least, this should be handled.
In case it matters, we're using delayed_job 3.0.4.