Something is unstable about latest master #839
What do the logs say after it exits?
What kind of errors are the jobs raising?
I believe the error is being caught and re-raised. It is a database error related to being unable to find a document (it is raised by the Mongoid library). The process exits after the error is logged. I also have a custom lifecycle callback registered. Note that when I start Delayed Job, a few jobs do complete successfully, but then it hits a job with the error condition and exits. Rolling back to 4.0.6 fixes the issue. I have not had time to debug further, but wanted to first raise a red flag that something significant has changed related to error catching, and it bit me in production.
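To illustrate the behavior being described, here is a minimal pure-Ruby simulation (this is not delayed_job's actual code; `FakeNotFoundError`, `run_jobs`, and the `rescue_job_errors` flag are all invented for illustration) of how a worker loop behaves when a per-job error is swallowed versus when it leaks out and takes down the whole loop:

```ruby
# Hypothetical stand-in for a Mongoid "document not found" error.
class FakeNotFoundError < StandardError; end

# Simulated worker loop. When rescue_job_errors is true, a failing job is
# logged and skipped; when false, the error escapes and kills the loop,
# which is analogous to the worker process exiting.
def run_jobs(jobs, rescue_job_errors:)
  completed = []
  jobs.each do |job|
    begin
      raise FakeNotFoundError, "document not found" if job == :bad
      completed << job
    rescue FakeNotFoundError
      raise unless rescue_job_errors # leaked error ends the whole loop
      # otherwise: error is "logged" and the worker moves on
    end
  end
  completed
end

run_jobs([:a, :bad, :b], rescue_job_errors: true)  # => [:a, :b]

begin
  run_jobs([:a, :bad, :b], rescue_job_errors: false)
rescue FakeNotFoundError
  # with the stricter behavior, the worker process would exit here
end
```

The point of the sketch is only that the same job error produces "skip and continue" under one policy and "process exit" under the other, which matches the regression being reported.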
Have you checked for deadlock errors in the logs? That could sometimes be the reason.
@vijayms I doubt this is a deadlock, as the process is exiting. It seems to be a case of an error within a job leaking out and causing a crash.
@johnnyshields when a deadlock is encountered, it will raise an exception |
That is 10 deadlocks in a retry loop. Basically that job is never going to complete, so the only thing we can do is exit. This is usually caused by creating a ton of very fast jobs; DJ was never intended to delay things that would execute quickly inline. Other possible causes are a stuck worker that has obtained DB locks on things it shouldn't have, or DB-level operations that lock the table. Examples of locking DB-level operations are backups (DB dumps), OPTIMIZE, and VACUUM commands.
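The retry-then-exit behavior described above can be sketched in a few lines of plain Ruby (this is an assumed simplification, not the library's exact code; `DeadlockError` and `with_deadlock_retry` are invented names):

```ruby
# Hypothetical stand-in for a DB driver's deadlock exception.
class DeadlockError < StandardError; end

MAX_ATTEMPTS = 10

# Retry the yielded operation on deadlock up to MAX_ATTEMPTS times;
# a persistent deadlock is re-raised, and the worker would then exit.
def with_deadlock_retry(max_attempts: MAX_ATTEMPTS)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue DeadlockError
    retry if attempts < max_attempts
    raise # 10 deadlocks in a row: give up and let the process exit
  end
end

# A transient deadlock (fails twice, succeeds on the third try) is absorbed:
with_deadlock_retry { |n| raise DeadlockError if n < 3; n } # => 3
```

A deadlock that clears after a retry or two never reaches the caller; only one that survives all ten attempts escapes, which is the "never going to complete" case.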
@albus522 The deadlock happens not at the table level but at the index level (collectiveidea/delayed_job_active_record#63 (comment)). We have to retry once for the transaction to succeed. And this deadlock condition doesn't require tons of jobs: when one worker tries to delete a job while another reserves a job, and both hit the same index page, a deadlock will occur. This should not count as a fatal exception when we can recover from it; it should only count when the retry also fails. It would be better to retry only once or a few times. That would solve the problem.
We were bitten by this as well during an upgrade. I set up a concurrency test on my dev machine using two workers and a self-replicating job: the job simply creates two more of itself, with semi-random priority and run_at values. This setup reliably reproduces the deadlock within seconds.

The output of `SHOW ENGINE INNODB STATUS` says the contention is between the UPDATE query now used to reserve a job and the DELETE query used to clean up a finished job. Surprising! Apparently the DELETE query first acquires a lock on the primary index (id) and then on the secondary index (priority, run_at). But the UPDATE query is using the (priority, run_at) index to scan the table, and is trying to grab primary key locks as it goes. Eventually the UPDATE and DELETE queries each grab one of the two locks for a given row, try to acquire the other, and deadlock. MySQL resolves this by killing the UPDATE, which crashes the worker.

The fix I've worked out locally is to replace the index on (priority, run_at) with an index on (priority, run_at, locked_by). This completely stabilizes my concurrency test! My theory is that it allows the UPDATE query's scan to skip over rows held by workers, which takes it out of contention with the DELETE query. Hope this helps.
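For reference, the index swap described above could look roughly like this in MySQL DDL. This is a sketch, not an official migration: it assumes delayed_job's default table name (`delayed_jobs`) and default index name (`delayed_jobs_priority`); check `SHOW CREATE TABLE delayed_jobs` against your actual schema before running anything.

```sql
-- Assumed names; verify against your schema first.
ALTER TABLE delayed_jobs DROP INDEX delayed_jobs_priority;
CREATE INDEX delayed_jobs_priority
    ON delayed_jobs (priority, run_at, locked_by);
```

Running this inside a Rails migration (via `remove_index`/`add_index`) would achieve the same schema change while keeping `schema.rb` in sync.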
Resolved by #846. Please reopen if that doesn’t fix your issue. |
@johnnyshields thanks for your reply. In my case deadlock was the problem. |
I tried upgrading from 4.0.6 to latest master and the process keeps exiting in production after errors within the jobs. I don't know the root cause yet.