-
Notifications
You must be signed in to change notification settings - Fork 598
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Discussion: deprecate the old user-error reporting via Prometheus #15192
Comments
compute_error_count
) via Prometheus
Are we going to encourage users (especially open-source deployments) to analyze the logs on their own with their logging pipelines? I'm concerned that some users may be proficient in metrics tools like Grafana but pay less attention to logging. BTW, I think it might be a good time to expose and advertise the JSON logging format to open-source users, rather than gating itself behind the |
I think this is probably because we only provide Prometheus + Grafana in our out-of-box Kubernetes deployment. If Loki is included, things might be different. |
Besides, today I just talked with @xiangjinwu, and he changed my mind. Rather than removing the Prometheus user-error reporting code completely, we might simplify it by only recording the error type, such as The previous design of Prometheus user-error reporting was becoming too complex because we want to record some fine-grain information, e.g. which expression fails or what the error message is. By combining structured logging and Prometheus metrics, we can just record a |
+1. At the very least we don't need error message in the metrics for cloud deployments.
However, should we make the change to avoid showing error message in the prom metric's label after we have provided out-of-box persistent structure logging solution for user? IIRC, the original motivation for using label to store error message is to make it easier for us and user to find out what the errors are when the logs are not persistent or lost in their environment. Maybe not applied to all kinds of error metrics, but at least the sink error message in the sink error label has helped us identify sink issues quickly in such caess. |
Previously, we chose to report streaming user errors via Prometheus. Particularly, a counter metric
compute_error_count
was introduced with an error type label. Details here: #7831.However, the usage is a bit counter-intuitive, so not all connectors are well integrated with it. IMO, it's not well maintained.
Now we have introduced structured logging, and it appears to be a better way than that. It's easier and also more natural for this case.
I'd like to deprecate the old one and completely move to structured logging.
UPDATE: See #15192 (comment)
The text was updated successfully, but these errors were encountered: