Discussion: persist system events and query via SQL #13267

Closed · zwang28 opened this issue Nov 6, 2023 · 3 comments

Comments

zwang28 (Contributor) commented Nov 6, 2023

Is your feature request related to a problem? Please describe.

I'd like to discuss whether introducing such system events can help us address simple user issues more efficiently, saving time on communication and environment setup.

The system event I'm talking about is essentially a duplicate of the most crucial logs from all worker nodes, with certain amendments and supplements (e.g. for metrics). It is persisted and queryable via SQL. It feels like implementing a logging service inside the kernel 🥵 .

  • The duplication may look unnecessary if a logging service like Loki is already deployed. But if there is none, logs may be lost, e.g. on pod restart, which makes troubleshooting very difficult.
  • Another scenario where it helps is when we cannot directly access logs during support.
    • For on-premises deployments, I think it's easier for us to request a restricted SQL role than a k8s permission.
    • For community users, dumping diagnostic info via one SQL statement would be more convenient than copying logs around.

Note that system events are not a replacement for logs or metrics, but a way to filter out simple cases in advance.
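To make this concrete, here is a hypothetical sketch of what "queryable via SQL" could look like. The table name `event_log` and its columns are illustrative assumptions, not an existing catalog:

```sql
-- Hypothetical sketch: find out what triggered recent recoveries
-- without touching any pod. All names here are illustrative.
SELECT timestamp, worker_id, event_type, info
FROM event_log
WHERE event_type IN ('RECOVERY', 'BARRIER_COLLECTION_FAILURE')
  AND timestamp > now() - interval '1 day'
ORDER BY timestamp DESC;
```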

Below are some system events I think would be helpful; a sketch of a possible schema follows the list.

  • Alarms for unhealthy metrics, e.g. barrier latency, storage IO latency.
  • Cause of barrier collection failure. That's equivalent to the cause of the first actor failure among all compute nodes.
  • Start_time, end_time, result of a recovery.
  • Modification of system parameters.
  • Errors, warnings, panics.
  • Worker node options, environment variables, and configuration.
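One possible way to model the events above is a single append-only table, under the assumption that each event carries a timestamp plus a type-specific payload. This is only a sketch; every name here is hypothetical:

```sql
-- Hypothetical schema: one append-only event table in the meta store,
-- with a type-specific JSON payload per event kind.
CREATE TABLE event_log (
    event_id   BIGINT PRIMARY KEY,    -- monotonically increasing id
    timestamp  TIMESTAMPTZ NOT NULL,  -- when the event occurred
    worker_id  INT,                   -- reporting worker, NULL for meta-local events
    event_type VARCHAR NOT NULL,      -- e.g. 'RECOVERY', 'PARAM_CHANGE', 'PANIC'
    info       JSONB NOT NULL         -- type-specific details: error cause, old/new value, ...
);
```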

Describe the solution you'd like

With regards to the implementation,

  • I have thought about making worker nodes persist events locally and gathering them on demand at query time. However, we currently don't assume that a worker node's storage survives, e.g., a pod restart, so this may not work.
  • Alternatively, worker nodes can report events to the meta node, which then persists them in the meta store. This requires carefully choosing which events to report, to limit the data volume; a minimal retention sketch follows.
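Beyond choosing events carefully, the meta node could also enforce a retention policy so the table stays bounded. A minimal sketch, assuming the hypothetical `event_log` table above:

```sql
-- Hypothetical retention policy, run periodically by the meta node:
-- keep only the last 7 days of events so the meta store stays small.
DELETE FROM event_log
WHERE timestamp < now() - interval '7 days';
```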

Describe alternatives you've considered

No response

Additional context

No response

github-actions bot added this to the release-1.5 milestone Nov 6, 2023
zwang28 removed this from the release-1.5 milestone Nov 6, 2023
hzxa21 (Collaborator) commented Nov 7, 2023

Implementation aside, I like the idea of having SQL commands for

  1. Helping users debug data- or environment-related issues. Examples: what error triggered the system recovery? Are there any source/streaming/sink errors? Previously we needed to dig through noisy logs from different nodes to find the relevant information.
  2. Helping us collect the information needed for further debugging. Examples: which executor is slow? What is the shape of the LSM tree? Previously we needed to log onto the pod and run risectl to find the relevant information.

Most of the time, these SQL commands will be run by users, not us, so I suggest keeping them simple (one-liners), as sketched below.
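For instance, each scenario above could reduce to one line; hypothetical sketches reusing the assumed `event_log` table from earlier:

```sql
-- 1. What error triggered the most recent recovery? (names illustrative)
SELECT info FROM event_log WHERE event_type = 'RECOVERY' ORDER BY timestamp DESC LIMIT 1;

-- 2. One-shot dump of recent events for us to inspect, instead of copying logs.
SELECT * FROM event_log WHERE timestamp > now() - interval '1 hour' ORDER BY timestamp;
```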

fuyufjh (Member) commented Nov 8, 2023

Below are some system events I think would be helpful.

  1. Alarms for unhealthy metrics, e.g. barrier latency, storage IO latency.
  2. Cause of barrier collection failure. That's equivalent to the cause of the first actor failure among all compute nodes.
  3. Start_time, end_time, result of a recovery.
  4. Modification of system parameters.
  5. Errors, warnings, panics.
  6. Worker node options, environment variables, and configuration.
  • 2,3,4,6 LGTM
  • 1 might be difficult, because collecting and checking these metrics takes a lot of work. The metrics are currently collected directly by the Prometheus client, which is good, and I don't want to make this part too complicated.
  • Similarly, 5 might be difficult, because doing it generally would mean intruding into the logging logic...

Overall, I think 2 is our major pain point, and I guess it is your motivation, so let's keep it simple for now by recording only these events.

Alternatively, worker nodes can report events to the meta node, which then persists them in the meta store. This requires carefully choosing which events to report, to limit the data volume.

This seems to be the only feasible approach. I don't think it can rely on Hummock, because Hummock may itself be a source of failure. Additionally, per my comments above, the truly critical events mostly happen on the meta node.
