
SuspendCampaign TSA action and retry_interval can produce unwanted results #322

Open
lianatech-jutaky opened this issue Dec 5, 2024 · 2 comments
Labels
bug Something isn't working

Comments

@lianatech-jutaky

What Operating System are you seeing this problem on?

Debian 12

What Hardware is this system running?

Various x86_64 systems

KumoMTA version

Observed with 2024.10.21.081529.1b8e8a8a

Did you try the latest release to see if the issue is better (or worse!) than your current version?

No, and I'll explain why below

Describe the bug

We use brief five-minute campaign suspensions, triggered often and with low thresholds, to manage deliverability and, in effect, to simulate campaign-based message rate reductions.
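For context, the kind of TSA rule involved looks roughly like the sketch below. This assumes the documented shaping.toml automation syntax; the domain, regex, and duration are illustrative placeholders, not our real configuration:

```toml
# Hypothetical shaping.toml automation rule: briefly suspend the
# offending campaign when the provider signals campaign-level pushback.
[["example.com".automation]]
regex = "messages from this campaign temporarily deferred"
action = "SuspendCampaign"
# Deliberately short suspension; the point is a quick, frequent nudge
duration = "5 minutes"
```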

This setup has one nasty downside: the internal transient failures that KumoMTA generates while a scheduled queue is suspended obey the exponential retry interval increase.

I believe this behavior was labeled as a "fix" in 2024.11.08-d383b033:

In addition, suspensions will now always respect the normal exponential backoff retry schedule instead of clumping together when the suspension expires.

The side effect is that, in practice, KumoMTA can balloon that five-minute suspension into hours, depending on how the individual emails hammer against the internal suspension.

Currently we have mitigated this with a rather small max_retry_interval, but this is not a good solution, as it also affects emails that are genuinely experiencing external transient failures.
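The workaround looks roughly like this sketch (the interval values are illustrative, and the clamp is applied globally here for simplicity):

```lua
-- Sketch of the current mitigation: clamp the exponential backoff so
-- that suspended messages become eligible again soon after the
-- suspension ends. The downside is that this clamp also applies to
-- messages experiencing genuine external transient failures, which
-- would otherwise be allowed to back off much further.
kumo.on('get_queue_config', function(domain, tenant, campaign)
  return kumo.make_queue_config {
    retry_interval = '2 minutes',
    -- A small cap keeps post-suspension delays short, but it is a
    -- blunt instrument that affects every queue
    max_retry_interval = '6 minutes',
  }
end)
```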

To Reproduce

No response

Configuration

Not really a technical bug. More of a mailing policy issue.

Expected Behavior

I'd like the campaign suspension to be, in practice, as exact as what I describe in the TSA action. In my opinion, emails should be retried once the internal suspension is lifted and the scheduled queue is released from suspension. Let max_message_rate dictate how fast they go out after the queue is unsuspended, not the retry interval, which in my opinion should only ever be a reaction to an external response.

Currently the scheduled queue just sits idle until the lengthy retry intervals are met for individual emails, and the retry interval can increase very fast.
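Expressed as a config sketch, the desired model would pace delivery after release with something like the following. This assumes a rate limit can be set at the queue level, and the rate value is purely illustrative:

```lua
-- Sketch of the desired behavior: once the suspension lifts, outbound
-- pacing is governed by a message rate limit rather than by each
-- message's accumulated retry backoff.
kumo.on('get_queue_config', function(domain, tenant, campaign)
  return kumo.make_queue_config {
    -- Throttle how quickly the released queue drains, instead of
    -- letting exponential per-message backoff stretch a five-minute
    -- suspension into hours
    max_message_rate = '100/s',
  }
end)
```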

Anything else?

No response

@lianatech-jutaky lianatech-jutaky added the bug Something isn't working label Dec 5, 2024
@wez
Collaborator

wez commented Dec 5, 2024

FWIW, we don't recommend suspensions that are shorter than the retry interval, because very short suspensions are unlikely to be especially effective, and you'll encounter the situation you described. You might consider throttling the message rate instead, but you'd need:

In this particular scenario, we chose to respect the exponential backoff because it is otherwise very easy to accumulate large queues that retry very frequently in the face of persistent issues that repeatedly trigger the suspension rules. That is undesirable because the continual frequent retries impose increased and sustained I/O and memory pressure. Another side effect of rigidly following the suspension expiration time is a "hammer" effect: very large portions of the queue, or perhaps the entire queue, suddenly become ready at the same time, and all of that I/O and memory is needed at once.

One of the options we'd considered as part of making suspensions respect the retry schedule was a "smart" option where we'd consider the difference between the suspension end time and the next retry window and pick whichever one felt "more reasonable". The reasonableness here is something that would benefit from operator input to tune, as some will want more aggressive options than others.

I think we need a bit more feedback on the various cases before we dive in to make changes here.

@lianatech-jutaky
Author

I would accept implementation of #299 as a solution for this issue.
