The error handling code in HttpInlinerSender/QueueableSender can lead to a huge number of writes being pushed to Influx in certain situations.
More specifically, here is what happens:
Assume our service is producing a measure that gets rejected (in our case we were trying to write an int value to a field with type float). Then every time HttpInlinerSender::doSend is called, the entire batch is rejected. The problem starts once the measures queue is full: from that point on, every call to QueueableSender.send tries to send the entire batch (5000 items). If that batch keeps getting rejected, then on every reporting interval, for every metric, the library tries to send the whole batch again.
Some back-of-the-envelope numbers: assuming a reporting interval of 10 seconds and 10 metrics, we would expect to see (5000 measures * 10 metrics) / 10 seconds = 5000 measurements/second pushed to Influx. That is a huge write load. Even if your Influx cluster can handle it OK, it still introduces quite a bit of resource overhead for the service doing the reporting.
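To make the flood mechanism concrete, here is a minimal sketch of the pattern described above. The names (BatchingSender, trySend, MAX_MEASURES) are invented for illustration and are not the library's actual classes; it just shows how a bounded queue that is only cleared after a successful send ends up re-sending its full contents on every reporting tick once the server starts rejecting the batch.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Illustration only: a simplified stand-in for the behavior described above.
// Once the queue holds MAX_MEASURES items and the server keeps answering 400,
// every call re-sends the whole batch and nothing is ever cleared.
public class BatchingSender {
    private static final int MAX_MEASURES = 5000;
    private final Deque<String> measures = new ArrayDeque<>();

    public void send(String measure) {
        if (measures.size() < MAX_MEASURES) {
            measures.add(measure);
        }
        // Flush attempt: the batch is only cleared on success.
        if (measures.size() >= MAX_MEASURES && trySend(measures)) {
            measures.clear();
        }
    }

    // Stand-in for the HTTP write; returns false to model a permanent 400.
    private boolean trySend(Deque<String> batch) {
        return false;
    }
}
```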
So there are two distinct issues here:
First, the writes that failed did in fact reach Influx correctly; the server just returned a 400. However, on a non-200/204 response code, HttpInlinerSender.java:81 tries to log the following: LOGGER.info("failed to send {} Measures to {}://{}:{}, HTTP CODE received: {}\n", measures.size(), writeURL.getProtocol(), writeURL.getHost(), writeURL.getPort(), responseCode, Miscellaneous.readFrom(con.getInputStream()));
In our case the con.getInputStream() call was failing (presumably the connection had been closed on the server's end), which sent execution into the catch block, so the measures were never cleared. This is probably most easily fixed by just removing the Miscellaneous.readFrom(con.getInputStream()) argument (especially since its result does not seem to be logged anyway: the format string has no placeholder for it). I will submit a pull request for this once we test it out internally.
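If keeping the error body in the log is still useful, an alternative to removing the read entirely would be to take it from getErrorStream() instead (HttpURLConnection.getInputStream() throws an IOException for HTTP error responses like the 400 seen here, while getErrorStream() returns the error body or null) and to swallow any read failure. This is only a sketch under those assumptions, not the library's code:

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.nio.charset.StandardCharsets;

// Sketch of a read that cannot derail the send path: the error body comes
// from getErrorStream() (never getInputStream() on an error response) and any
// failure while reading is swallowed, so the surrounding code can still log
// the status code and clear the batch.
final class ErrorBody {
    static String of(HttpURLConnection con) {
        InputStream err = con.getErrorStream(); // may be null
        if (err == null) {
            return "";
        }
        try (InputStream in = err) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[4096];
            for (int n; (n = in.read(buf)) != -1; ) {
                out.write(buf, 0, n);
            }
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (IOException ignored) {
            return ""; // never let a failed read hide the real status code
        }
    }
}
```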
Second, I am less sure whether this one is actually an issue: the catch block causes the measure batch to never be cleared. If there is another scenario (outside the one described above) in which the request is still sent to Influx but the library also catches an IOException, we would expect the same behavior. If that is possible, it is probably worth adding something that expires failed measurements after a number of failed sends (see the sketch after this paragraph).
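As a rough sketch of that expiry idea (the class and constant names here are invented and not part of the library), the sender could count consecutive failed sends and drop the batch once a threshold is crossed:

```java
import java.util.ArrayList;
import java.util.List;

// Illustration of the expiry idea mentioned above; FailureBoundedQueue and
// MAX_FAILED_SENDS are invented for this sketch and do not exist in the lib.
final class FailureBoundedQueue<T> {
    private static final int MAX_FAILED_SENDS = 3;

    private final List<T> measures = new ArrayList<>();
    private int consecutiveFailures = 0;

    void add(T measure) {
        measures.add(measure);
    }

    // Called after each send attempt with whether the server accepted the batch.
    void onSendResult(boolean accepted) {
        if (accepted || ++consecutiveFailures >= MAX_FAILED_SENDS) {
            // Either the batch went through, or it has failed too many times
            // and is dropped instead of being re-sent forever.
            measures.clear();
            consecutiveFailures = 0;
        }
    }
}
```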
Thanks for reporting.
For your PR, don't forget that for transient failures (network issue, unreachable server, HTTP status 5xx) we should retry.
Maybe the best option would be to allow the use of a custom HTTP client that could carry several policies (circuit breaker, exponential retry, deadline, ...). All of these features are outside the scope of this lib and should be delegated. IMHO a sidecar process should take care of this (e.g. Envoy), but allowing a custom HTTP client could be fine.
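As a rough sketch of what such a hook could look like (InfluxHttpClient and WriteResult are invented names, not an existing API of this lib), the reporter would depend only on a small interface, and users could plug in an implementation that wraps retries, a circuit breaker, or deadlines around the actual HTTP write:

```java
// Hypothetical extension point, for illustration only.
public interface InfluxHttpClient {

    enum WriteResult {
        ACCEPTED,          // e.g. 204: batch can be cleared
        REJECTED,          // 4xx: batch will never be accepted, drop it
        TRANSIENT_FAILURE  // network error / 5xx: keep the batch and retry later
    }

    // Writes one line-protocol payload; the implementation decides how to
    // retry, break the circuit, or enforce a deadline before answering.
    WriteResult write(String lineProtocolPayload);
}
```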
Anyway, every PR is welcome, and do not hesitate to ping me if I take too long to react.
It seems that I am facing the same issue (not entirely sure though). Is there a workaround for this?
In our case InfluxDB went down for some reason and our application kept hammering it with metrics that were rejected while it was down. Once InfluxDB came back up, every request since then has returned a 400. Any help would be appreciated!
Hey - we have been running with an internally patched version that fixes the first issue for a little while now; I'll submit the MR for it today.
The only workaround we found was to resolve the issue that was causing the metrics to be rejected (in our case a service was trying to publish a float to an int field) and then restart the services that had the issue. Keep in mind you will lose any metrics that were queued up if they were never successfully written.
tcdevoe pushed a commit to tcdevoe/metrics-influxdb that referenced this issue on Apr 25, 2018.