Discussion: persist system events and query via SQL #13267

Closed · zwang28 opened this issue Nov 6, 2023 · 3 comments

Comments

zwang28 (Contributor) commented Nov 6, 2023

Is your feature request related to a problem? Please describe.

I'd like to discuss whether introducing such system events can help us address simple user issues more efficiently, saving time on communication and environment setup.

The system event I'm talking about is essentially a duplicate of the most crucial logs from all worker nodes, with certain amendments and supplements (e.g. for metrics). It is persisted and queryable via SQL. It feels like implementing a logging service inside the kernel 🥵 .

  • The duplication may look unnecessary if a logging service like Loki is already deployed. But if there is none, logs may be lost, e.g. on pod restart, which makes troubleshooting very difficult.
  • Another scenario where it helps is when we cannot directly access logs during support.
    • For on-premises deployments, I think it's easier for us to request a restricted SQL role than a k8s permission.
    • For community users, dumping diagnostic info via one SQL statement would be more convenient than copying logs around.

Note that system events are not a replacement for logs or metrics, but a way to filter out simple cases in advance.
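To make this concrete, here is a hypothetical sketch of what "queryable via SQL" could look like. The table name `event_log` and its columns are illustrative assumptions, not an existing catalog:

```sql
-- Hypothetical sketch: find out what triggered recent recoveries
-- without touching any pod. All names here are illustrative.
SELECT timestamp, worker_id, event_type, info
FROM event_log
WHERE event_type IN ('RECOVERY', 'BARRIER_COLLECTION_FAILURE')
  AND timestamp > now() - interval '1 day'
ORDER BY timestamp DESC;
```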

Below are some system events I think would be helpful; a sketch of a possible schema follows the list.

  • Alarms for unhealthy metrics, e.g. barrier latency, storage IO latency.
  • Cause of barrier collection failure. That's equivalent to the cause of the first actor failure among all compute nodes.
  • Start_time, end_time, result of a recovery.
  • Modification of system parameters.
  • Errors, warnings, panics.
  • Worker node options, environment variables, and configuration.
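One possible way to model the events above is a single append-only table, under the assumption that each event carries a timestamp plus a type-specific payload. This is only a sketch; every name here is hypothetical:

```sql
-- Hypothetical schema: one append-only event table in the meta store,
-- with a type-specific JSON payload per event kind.
CREATE TABLE event_log (
    event_id   BIGINT PRIMARY KEY,    -- monotonically increasing id
    timestamp  TIMESTAMPTZ NOT NULL,  -- when the event occurred
    worker_id  INT,                   -- reporting worker, NULL for meta-local events
    event_type VARCHAR NOT NULL,      -- e.g. 'RECOVERY', 'PARAM_CHANGE', 'PANIC'
    info       JSONB NOT NULL         -- type-specific details: error cause, old/new value, ...
);
```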

Describe the solution you'd like

With regards to the implementation,

  • I have thought about making worker nodes persist events locally and gathering them on demand at query time. However, we currently don't assume that a worker node's storage survives, e.g., a pod restart, so this may not work.
  • Alternatively, worker nodes can report events to the meta node, which then persists them in the meta store. This requires carefully choosing which events to report, to limit the data volume; a minimal retention sketch follows.
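Beyond choosing events carefully, the meta node could also enforce a retention policy so the table stays bounded. A minimal sketch, assuming the hypothetical `event_log` table above:

```sql
-- Hypothetical retention policy, run periodically by the meta node:
-- keep only the last 7 days of events so the meta store stays small.
DELETE FROM event_log
WHERE timestamp < now() - interval '7 days';
```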

Describe alternatives you've considered

No response

Additional context

No response

github-actions bot added this to the release-1.5 milestone Nov 6, 2023
zwang28 removed this from the release-1.5 milestone Nov 6, 2023
hzxa21 (Collaborator) commented Nov 7, 2023

Implementation aside, I like the idea of having SQL commands for

  1. Helping users debug data- or environment-related issues. Examples: what error triggered the system recovery? Are there any source/streaming/sink errors? Previously we needed to dig through noisy logs from different nodes to find the relevant information.
  2. Helping us collect the information needed for further debugging. Examples: which executor is slow? What is the shape of the LSM tree? Previously we needed to log onto the pod and run risectl to find the relevant information.

Most of the time, these SQL commands will be run by users, not us, so I suggest keeping them simple (one-liners), as sketched below.
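For instance, each scenario above could reduce to one line; hypothetical sketches reusing the assumed `event_log` table from earlier:

```sql
-- 1. What error triggered the most recent recovery? (names illustrative)
SELECT info FROM event_log WHERE event_type = 'RECOVERY' ORDER BY timestamp DESC LIMIT 1;

-- 2. One-shot dump of recent events for us to inspect, instead of copying logs.
SELECT * FROM event_log WHERE timestamp > now() - interval '1 hour' ORDER BY timestamp;
```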

fuyufjh (Member) commented Nov 8, 2023

Below are some system events I think would be helpful.

  1. Alarms for unhealthy metrics, e.g. barrier latency, storage IO latency.
  2. Cause of barrier collection failure. That's equivalent to the cause of the first actor failure among all compute nodes.
  3. Start_time, end_time, result of a recovery.
  4. Modification of system parameters.
  5. Errors, warnings, panics.
  6. Worker node options, environment variables, and configuration.
  • 2,3,4,6 LGTM
  • 1 might be difficult, because collecting and checking these metrics takes a lot of work. The metrics are currently collected directly by the Prometheus client, which is good, and I don't want to make this part too complicated.
  • Similarly, 5 might be difficult, because doing it generally would mean intruding into the logging logic...

Overall, I think 2 is our major pain point, and I guess it is your motivation, so let's keep it simple for now by recording only these events.

Alternatively, worker nodes can report events to the meta node, which then persists them in the meta store. This requires carefully choosing which events to report, to limit the data volume.

This seems to be the only feasible approach. I don't think it can rely on Hummock, because Hummock may itself be a source of failure. Additionally, per my comments above, the truly critical events mostly happen on the meta node.
