[BUG] Even with flow control enabled, carbon aggregator drops incoming metrics when output queues are full #956
I'll submit patches for the non-aggregator fixes.
I've used a smaller overage for this setting, as the cache is a much slower process than the relay, so hopefully its size is much higher.
Wow, that's really impressive investigation and fix, much appreciated!
There is a config setting, `USE_FLOW_CONTROL`, which defaults to enabled. The stated behaviour is that, when enabled, carbon will pause reading metrics when its sink is full. Indeed, carbon does pause receiving metrics, but it also drops any "in-flight" metrics.

Test setup
Set up Carbon to not relay aggregator misses, and to use line protocol (so it's easy to test from bash). Also set up `a.*` to aggregate every 3 seconds, and carbon to spit out stats every 1 second, just to avoid waiting around. And verify that `USE_FLOW_CONTROL` is `True`, so that data shouldn't be dropped when buffers fill.

Fire up a socket to look at what comes out:
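The original listener command isn't preserved in this copy of the issue. A minimal stand-in sketch, assuming the aggregator forwards to `127.0.0.1:2004` using the plain-text line protocol so the forwarded datapoints arrive as readable lines:

```python
# Hypothetical stand-in for the elided listener: accept one connection from the
# aggregator and print whatever it forwards, counting newline-terminated datapoints.
# The 127.0.0.1:2004 destination and line-protocol output are assumptions.
import socket

srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
srv.bind(("127.0.0.1", 2004))
srv.listen(1)
conn, _ = srv.accept()
received = 0
while True:
    chunk = conn.recv(65536)
    if not chunk:
        break
    received += chunk.count(b"\n")
    print(chunk.decode(errors="replace"), end="")
print("datapoints received:", received)
```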
Fire up an aggregator:
Send 30k unaggregated metrics:
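The exact command is also lost; a sketch of an equivalent sender, assuming the aggregator's line receiver listens on `127.0.0.1:2023` and that names under `b.` don't match any aggregation rule:

```python
# Hypothetical sender: 30k distinct datapoints whose names miss the a.* rule.
# Port 2023 (the usual aggregator LINE_RECEIVER_PORT) is an assumption.
import socket
import time

now = int(time.time())
lines = "".join(f"b.test.{i} 1 {now}\n" for i in range(30000))
with socket.create_connection(("127.0.0.1", 2023)) as s:
    s.sendall(lines.encode())
```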
This fairly consistently gives me stats like:
(If I make a longer metric name, the count drops, so this looks related to a buffer size at some point in the pipeline.)
Send 80k aggregated metrics:
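Again a sketch rather than the original command, under the same port assumption; here every name matches the `a.*` rule, so all 80k datapoints pass through the aggregator's buffers:

```python
# Hypothetical sender: 80k distinct datapoints that all match the a.* aggregation rule.
import socket
import time

now = int(time.time())
lines = "".join(f"a.test.{i} 1 {now}\n" for i in range(80000))
with socket.create_connection(("127.0.0.1", 2023)) as s:
    s.sendall(lines.encode())
```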
This gives much less consistent output, but does consistently drop data anyway. The timestamps are later than in the unaggregated case, so the dropping is at aggregation time, not production time:
Send 1 aggregated metric with 80k datapoints:
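And the corresponding sketch for the single-name case, which only ever produces one buffered series on the aggregator side (the repeated timestamp is an assumption about how the original test was run):

```python
# Hypothetical sender: 80k datapoints for a single aggregated name.
import socket
import time

now = int(time.time())
lines = "".join(f"a.test.single {i} {now}\n" for i in range(80000))
with socket.create_connection(("127.0.0.1", 2023)) as s:
    s.sendall(lines.encode())
```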
This never drops anything, even with very, very large counts.
There are a few problem cases:

- With `FORWARD_ALL`, acting as a relay, when the `MAX_QUEUE_SIZE` is hit, the `cacheFull` event is called. However, this apparently only pauses reading from the socket, but doesn't pause processing of already-read data. This means it turns on flow control, but drops a bunch of metrics on the floor immediately. This bug probably affects the relay service, too.
- The aggregator buffers as many `IntervalBuffer` pairs as it can, then aggregates out the intervals when their `LoopingCall`s tick over. The aggregation process is not paused at all when the buffer fills, which means, again, metrics are just dropped on the floor when the output becomes full. Realistically, the problem here is that the aggregator's state isn't properly limiting its count of `IntervalBuffer` pairs to `MAX_QUEUE_SIZE` -- if it ever exceeds this value, it is likely to drop data on the next aggregation tick, so should be considered independently full.

Inspecting the code, the same logic exists in `_MetricCache.store` -- it relies on `events.cacheFull()` to prevent any more calls, and will drop any datapoints after this until the queue drains somewhat.
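A schematic illustration of the pattern described above (not carbon's actual code): the full-queue event only pauses the readers, so anything already past the reader still hits the size check and is silently dropped.

```python
# Schematic only -- names and structure are illustrative, not carbon's real classes.
class OutputQueue:
    def __init__(self, max_queue_size, on_full):
        self.queue = []
        self.max_queue_size = max_queue_size
        self.on_full = on_full  # an events.cacheFull()-style callback that pauses receivers

    def store(self, datapoint):
        if len(self.queue) >= self.max_queue_size:
            self.on_full()   # flow control engages here...
            return           # ...but this in-flight datapoint is dropped anyway
        self.queue.append(datapoint)
```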
Expected behavior

With `USE_FLOW_CONTROL = True`, no metrics should ever be dropped.

Suggestion
I propose that:

- `USE_FLOW_CONTROL` should be properly implemented for daemons other than the cache. Fixing this reduces the number of drops in my example from 14500 to 6332, but not to zero. It's also very simple to do.
- `MAX_QUEUE_SIZE` should be (allowed to be?) taken as a soft limit which causes the input streams to pause, with the daemon allowed to exceed this limit, perhaps by some factor, once the inputs are paused -- or it should soft-cap below `MAX_QUEUE_SIZE`. Perhaps `QUEUE_LOW_WATERMARK_PCT` should be lowered, and `QUEUE_HIGH_WATERMARK_PCT` should be added, with the queue considered full above this but not actually dropping data (sketched at the end of this issue). Experimentation suggests `QUEUE_LOW_WATERMARK_PCT = 0.6` and `QUEUE_HIGH_WATERMARK_PCT = 0.7` are sufficient to avoid metric dropping on the default `MAX_QUEUE_SIZE = 10000`. Setting 0.6/0.8 was not sufficient. Alternatively, `HARD_MAX_QUEUE_SIZE_PCT = 1.25` (taking `MAX_QUEUE_SIZE` as the high watermark) and the default `QUEUE_LOW_WATERMARK_PCT = 0.8` was fine, too.
- The cache could presumably get the same treatment around `MAX_CACHE_SIZE`.

These will fix the relay behaviour.
- The aggregator's `IntervalBuffer` count should be capped to some function of `MAX_QUEUE_SIZE`, or some new setting, and behave like a full queue when it exceeds this. Flow control might not be the correct solution here; the aggregation buffer count can never be "caught up" but is instead a function of the number of active metrics.
- Perhaps the limit should be on the number of `IntervalBuffer`s and/or the total number of datapoints shared between all of these.

The fix for the aggregation case is difficult, as the code currently only understands when its sole sink (the relay output, in this case) is blocked. The logic needs to understand that it can be paused for multiple reasons. These might best be moved to a distinct ticket.
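A sketch of the watermark idea from the suggestion list, using the proposed (not yet existing) `QUEUE_LOW_WATERMARK_PCT` / `QUEUE_HIGH_WATERMARK_PCT` / `HARD_MAX_QUEUE_SIZE_PCT` semantics: pause inputs at the high watermark, resume at the low watermark, and only drop once a hard cap above `MAX_QUEUE_SIZE` is exceeded.

```python
# Sketch of the proposed soft/hard limit behaviour; the setting names and the
# 0.8 / 1.0 / 1.25 factors mirror the values discussed above, not shipped carbon code.
class WatermarkQueue:
    def __init__(self, max_queue_size, pause_inputs, resume_inputs,
                 low_pct=0.8, high_pct=1.0, hard_max_pct=1.25):
        self.queue = []
        self.low = int(max_queue_size * low_pct)             # resume inputs below this
        self.high = int(max_queue_size * high_pct)           # pause inputs at/above this
        self.hard_max = int(max_queue_size * hard_max_pct)   # only drop above this
        self.pause_inputs = pause_inputs
        self.resume_inputs = resume_inputs
        self.paused = False

    def store(self, datapoint):
        if len(self.queue) >= self.hard_max:
            return False  # genuinely full: drop (rare if inputs were paused in time)
        self.queue.append(datapoint)
        if not self.paused and len(self.queue) >= self.high:
            self.paused = True
            self.pause_inputs()  # soft limit: stop reading, keep accepting in-flight data
        return True

    def drain(self, n):
        taken, self.queue = self.queue[:n], self.queue[n:]
        if self.paused and len(self.queue) <= self.low:
            self.paused = False
            self.resume_inputs()
        return taken
```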