Something is unstable about latest master #839
What do the logs say after it exits?
What kind of errors are the jobs raising?
I believe the error is being caught and re-raised. It is a database error related to being unable to find a document (it is raised by the Mongoid library). The process exits after the error is logged. I also have a custom lifecycle callback registered. Note that when I start Delayed Job, a few jobs do complete successfully, but then it hits a job with the error condition and exits. Rolling back to 4.0.6 fixes the issue. I have not had time to debug further, but wanted to first raise a red flag that something significant has changed related to error catching, and it bit me in production.
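To illustrate the behavior being described, here is a minimal pure-Ruby simulation (this is not delayed_job's actual code; `FakeNotFoundError`, `run_jobs`, and the `rescue_job_errors` flag are all invented for illustration) of how a worker loop behaves when a per-job error is swallowed versus when it leaks out and takes down the whole loop:

```ruby
# Hypothetical stand-in for a Mongoid "document not found" error.
class FakeNotFoundError < StandardError; end

# Simulated worker loop. When rescue_job_errors is true, a failing job is
# logged and skipped; when false, the error escapes and kills the loop,
# which is analogous to the worker process exiting.
def run_jobs(jobs, rescue_job_errors:)
  completed = []
  jobs.each do |job|
    begin
      raise FakeNotFoundError, "document not found" if job == :bad
      completed << job
    rescue FakeNotFoundError
      raise unless rescue_job_errors # leaked error ends the whole loop
      # otherwise: error is "logged" and the worker moves on
    end
  end
  completed
end

run_jobs([:a, :bad, :b], rescue_job_errors: true)  # => [:a, :b]

begin
  run_jobs([:a, :bad, :b], rescue_job_errors: false)
rescue FakeNotFoundError
  # with the stricter behavior, the worker process would exit here
end
```

The point of the sketch is only that the same job error produces "skip and continue" under one policy and "process exit" under the other, which matches the regression being reported.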
Have you checked for deadlock errors in the logs? That could sometimes be the reason.
@vijayms I doubt this is a deadlock, as the process is exiting. It seems to be a case of an error within a job leaking out and causing a crash.
@johnnyshields when a deadlock is encountered, it will raise an exception |
That is 10 deadlocks in a retry loop. Basically that job is never going to complete, so the only thing we can do is exit. This is usually caused by creating a ton of very fast jobs; DJ was never intended to delay things that would execute quickly inline. Other possible causes are a stuck worker that has obtained DB locks on things it shouldn't have, or DB-level operations that lock the table. Examples of locking DB-level operations are backups (DB dumps), OPTIMIZE, and VACUUM commands.
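The retry-then-exit behavior described above can be sketched in a few lines of plain Ruby (this is an assumed simplification, not the library's exact code; `DeadlockError` and `with_deadlock_retry` are invented names):

```ruby
# Hypothetical stand-in for a DB driver's deadlock exception.
class DeadlockError < StandardError; end

MAX_ATTEMPTS = 10

# Retry the yielded operation on deadlock up to MAX_ATTEMPTS times;
# a persistent deadlock is re-raised, and the worker would then exit.
def with_deadlock_retry(max_attempts: MAX_ATTEMPTS)
  attempts = 0
  begin
    attempts += 1
    yield attempts
  rescue DeadlockError
    retry if attempts < max_attempts
    raise # 10 deadlocks in a row: give up and let the process exit
  end
end

# A transient deadlock (fails twice, succeeds on the third try) is absorbed:
with_deadlock_retry { |n| raise DeadlockError if n < 3; n } # => 3
```

A deadlock that clears after a retry or two never reaches the caller; only one that survives all ten attempts escapes, which is the "never going to complete" case.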
@albus522 The deadlock happens not at the table level but at the index level (collectiveidea/delayed_job_active_record#63 (comment)). We have to retry once for the transaction to succeed. And this deadlock condition doesn't require tons of jobs: when one worker tries to delete a job while another reserves a job, and both hit the same index page, a deadlock will occur. This should not count as a fatal exception when we can recover from it; it should only count when the retry also fails. It would be better to retry only once or a few times. That would solve the problem.
We were bitten by this as well during an upgrade. I set up a concurrency test on my dev machine using two workers and a self-replicating job: the job simply creates two more of itself, with semi-random priority and run_at values. This setup reliably reproduces the deadlock within seconds.

The output of `SHOW ENGINE INNODB STATUS` says the contention is between the UPDATE query now used to reserve a job and the DELETE query used to clean up a finished job. Surprising! Apparently the DELETE query first acquires a lock on the primary index (id) and then on the secondary index (priority, run_at). But the UPDATE query is using the (priority, run_at) index to scan the table, and is trying to grab primary key locks as it goes. Eventually the UPDATE and DELETE queries each grab one of the two locks for a given row, try to acquire the other, and deadlock. MySQL resolves this by killing the UPDATE, which crashes the worker.

The fix I've worked out locally is to replace the index on (priority, run_at) with an index on (priority, run_at, locked_by). This completely stabilizes my concurrency test! My theory is that it allows the UPDATE query's scan to skip over rows held by workers, which takes it out of contention with the DELETE query. Hope this helps.
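For reference, the index swap described above could look roughly like this in MySQL DDL. This is a sketch, not an official migration: it assumes delayed_job's default table name (`delayed_jobs`) and default index name (`delayed_jobs_priority`); check `SHOW CREATE TABLE delayed_jobs` against your actual schema before running anything.

```sql
-- Assumed names; verify against your schema first.
ALTER TABLE delayed_jobs DROP INDEX delayed_jobs_priority;
CREATE INDEX delayed_jobs_priority
    ON delayed_jobs (priority, run_at, locked_by);
```

Running this inside a Rails migration (via `remove_index`/`add_index`) would achieve the same schema change while keeping `schema.rb` in sync.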
Resolved by #846. Please reopen if that doesn’t fix your issue. |
@johnnyshields thanks for your reply. In my case deadlock was the problem. |
I tried upgrading from 4.0.6 to latest master and the process keeps exiting in production after errors within the jobs. I don't know the root cause yet.