The timeout problem in watchbot 4 #203
Comments
oooh that heartbeat solution is really a good one. The watcher can do that for sure. We would want to change the watchbot options a little though. You'd give the caller the option to specify a max job duration. This is how long the watcher would let the worker run. It wouldn't be tied to the SQS message timeout, except that the watcher would keep extending the timeout until either (a) the worker was done, or (b) it hit that optional configured timeout. I think this'd be killer.
Note: I think that there is a limit to how long you can defer visibility. 12 hours, as I recall.
The other advantage to implementing it this way, over the heartbeat solution, would be reducing the number of API calls we make to the change-visibility-timeout API.
If you have an option called … With the heartbeat solution I was outlining, I imagined that by default, someone would not set … Then, the … In other words:
The last situation to consider would be what to do if a worker takes more than the maximum visibility timeout: do we cut it off, or do we let the work get duplicated?
p.s. SQS has no rate limits O_O
The purpose I was imagining was an estimate for the …
This clarifies a lot. I see what you mean now, and I'm more convinced by what you've outlined. With this additional information, it also seems like the … Given that we have this …
For smaller jobs, the end of the job will delete the message from the queue and the container, nullifying the need for a heartbeat and ignoring the …
For this situation, could we delete the message from the queue when the MaxVisibilityTimeout is reached and, if the worker fails to finish, send a fresh message to the queue (using the received SQS message)? That would reset the visibility timeout. We will be messing with the …
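A rough sketch of that delete-and-resend idea, assuming the AWS SDK for JavaScript v3; the queue URL, receipt handle, and message body are placeholders, and note the re-sent message is a brand-new message (new message ID, receive count reset):

```typescript
import { SQSClient, DeleteMessageCommand, SendMessageCommand } from '@aws-sdk/client-sqs';

const sqs = new SQSClient({});

// If the worker is still running as the maximum visibility timeout approaches,
// delete the original message and enqueue a copy of its body, so the in-flight
// clock effectively starts over on a fresh message.
async function recycleMessage(queueUrl: string, receiptHandle: string, body: string): Promise<void> {
  await sqs.send(new DeleteMessageCommand({ QueueUrl: queueUrl, ReceiptHandle: receiptHandle }));
  await sqs.send(new SendMessageCommand({ QueueUrl: queueUrl, MessageBody: body }));
}
```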
Here's a hand-crafted expression of being mind-blown: #til :party_duck:
I would suggest we pull from the queue with a 3 min visibility timeout, always. Then, every 2 minutes the watcher increases the timeout in 3 minute increments until you reach …
What this does is prevent you from having to wait a long time to retry a message if, by some stroke of bad luck, the watcher gets killed. In that case, you have to wait for whatever visibility timeout the watcher asked for. If you originally asked for a long time, then you have to wait that long to retry. Bumping the visibility timeout in relatively small increments makes sure that, in the case of a message pulled from the queue that never gets a response, you can retry soon.
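A minimal sketch of that incremental heartbeat, assuming the AWS SDK for JavaScript v3; `queueUrl`, `workerDone`, and `maxJobDuration` are hypothetical and not watchbot's actual option names:

```typescript
import {
  SQSClient,
  ReceiveMessageCommand,
  ChangeMessageVisibilityCommand
} from '@aws-sdk/client-sqs';

const sqs = new SQSClient({});
const queueUrl = 'https://sqs.us-east-1.amazonaws.com/123456789012/my-queue'; // placeholder

// Always pull with a short, fixed 3-minute visibility timeout.
async function receiveOne() {
  const res = await sqs.send(new ReceiveMessageCommand({
    QueueUrl: queueUrl,
    MaxNumberOfMessages: 1,
    WaitTimeSeconds: 20,
    VisibilityTimeout: 180
  }));
  return res.Messages?.[0];
}

// Every 2 minutes, push the visibility timeout another 3 minutes into the
// future, until the worker finishes or the (hypothetical) maxJobDuration
// in seconds is reached.
async function heartbeat(receiptHandle: string, workerDone: () => boolean, maxJobDuration: number): Promise<void> {
  let elapsed = 0;
  while (elapsed < maxJobDuration) {
    await new Promise((resolve) => setTimeout(resolve, 120 * 1000)); // wait 2 minutes
    elapsed += 120;
    if (workerDone()) return; // nothing left to extend
    // ChangeMessageVisibility is relative to the time of the call, so this
    // keeps a rolling 3-minute window of invisibility.
    await sqs.send(new ChangeMessageVisibilityCommand({
      QueueUrl: queueUrl,
      ReceiptHandle: receiptHandle,
      VisibilityTimeout: 180
    }));
  }
}
```

Because each extension only reaches 3 minutes past the most recent call, a watcher that dies mid-job leaves at most a few minutes before the message becomes visible again and can be retried.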
Timeout problem should be addressed with #230
We missed a key fact in the docs when pushing this out:
This max is not per extension of the timeout; it's the total. If the heartbeat tries to extend the timeout any further than this, you receive:
Right now watchbot 4 is not compatible with messages that take longer than 12 hours to process. They will return to visibility and get picked up by a new worker while still being processed.
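For illustration, a heartbeat that respects this would have to track total time since the message was received and stop extending before crossing the cap (a minimal sketch; 43,200 seconds is the documented SQS maximum):

```typescript
// The 12-hour SQS maximum visibility timeout is cumulative from when the
// message was received, not per ChangeMessageVisibility call.
const MAX_TOTAL_VISIBILITY_SECONDS = 12 * 60 * 60; // 43,200

// Hypothetical guard checked before each extension attempt.
function canExtend(secondsSinceReceive: number, extensionSeconds: number): boolean {
  return secondsSinceReceive + extensionSeconds <= MAX_TOTAL_VISIBILITY_SECONDS;
}
```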
The most straightforward option to consider is to make 12 hours a hard job duration limit. If a job gets to 12 hours, the child process gets killed and the job will be retried. That sucks, but it's better than a 5-minute Lambda hard limit. Anything else is quite challenging without a source of truth about what is being processed.
Your single source of truth about what is in flight has to (a) track SQS message IDs that are being processed, and (b) be able to be updated with the most recent receipt handle that's valid for each message. It's really quite the mess.
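As a toy illustration only (not watchbot's implementation), such a source of truth might look like a map keyed by message ID whose value is refreshed on every receive, since only the most recently issued receipt handle can be used to delete or extend the message; a real version would need shared, durable storage:

```typescript
// Maps SQS MessageId -> the most recent ReceiptHandle that is still valid.
const inFlight = new Map<string, string>();

// Record a message when a worker starts processing it.
function startProcessing(messageId: string, receiptHandle: string): void {
  inFlight.set(messageId, receiptHandle);
}

// Each time the message is received again (including duplicate receives),
// keep only the newest handle; older handles stop working for
// DeleteMessage / ChangeMessageVisibility.
function updateReceiptHandle(messageId: string, receiptHandle: string): void {
  if (inFlight.has(messageId)) inFlight.set(messageId, receiptHandle);
}

// Drop the entry once the job completes and the message is deleted.
function finishProcessing(messageId: string): void {
  inFlight.delete(messageId);
}
```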
@rclark I'm just curious how you verified that a message returns to the queue when the 12-hour visibility timeout is exceeded. Could you please let me know?
Hi @KeithYJohnson -- that is something we entirely rely on SQS to manage for us. Check out their docs here.
Thanks! I did notice that; I'm dealing with a similar issue and just wanted to verify that someone has observed the message returning to the queue when exceeding that hard limit. I suppose I should just stop being paranoid and trust the docs :)
Background
With the distributed nature of watchbot 4's polling, we resurface an old problem: receiving messages that are still being processed but have exceeded their visibility timeouts and been returned to the queue.
From @rclark:
Important reading on duplicate receives: #175
Further reading: https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html
Next actions
We should draft up a list of potential options to mitigate this situation. @rclark’s solution in watchbot 3 is outlined here: #176. SQS suggests a “heartbeat” system in https://docs.aws.amazon.com/AWSSimpleQueueService/latest/SQSDeveloperGuide/sqs-visibility-timeout.html:
We also have some interesting methods of implementing hard timeouts on containers by killing sub-processes if we wanted to go in a more brutal direction.
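A rough sketch of that more brutal option, assuming a Node worker started via child_process; the command, arguments, and 12-hour figure are placeholders:

```typescript
import { spawn } from 'child_process';

// Run the worker command with a hard wall-clock limit. If it is still
// running when the limit expires, kill it so the message can be retried.
function runWithHardTimeout(command: string, args: string[], maxJobDurationMs: number): void {
  const child = spawn(command, args, { stdio: 'inherit' });

  const timer = setTimeout(() => {
    console.error(`worker exceeded ${maxJobDurationMs}ms, killing it`);
    child.kill('SIGKILL');
  }, maxJobDurationMs);

  child.on('exit', (code, signal) => {
    clearTimeout(timer);
    console.log(`worker exited with code=${code} signal=${signal}`);
  });
}

// Example (hypothetical): allow at most 12 hours.
runWithHardTimeout('npm', ['run', 'worker'], 12 * 60 * 60 * 1000);
```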
The floor is open for other suggestions from @mapbox/platform-engine-room. Ideally we can get rolling on some implementations and tests soon.
cc/ @emilymcafee @rclark