
Tracking: Telemetry #16332

Open
2 tasks
st1page opened this issue Apr 16, 2024 · 5 comments
Assignees
Labels
type/tracking Tracking issue.
Milestone

Comments

@st1page
Contributor

st1page commented Apr 16, 2024

per-cluster

I believe we do not need to concern ourselves with where exactly the expressions and aggregators appear within the plan; rather, we want to know their usage ratio across the product. What we care about is the exact expressions and aggregators that users write in their SQL, not the optimized or rewritten forms.

  • Count the usage frequency of each type of aggregator used in streaming/batch queries, with statistics aggregated. Please note:
    • Aggregators with and without DISTINCT or FILTER clauses are counted as different types.
    • In RisingWave, aggregators such as AVG, VAR_POP, etc. may be rewritten into other aggregators; we need to count the aggregators before the rewrite.
  • Count the usage frequency of each type of function used in streaming/batch queries, with statistics aggregated.
    • Count the usage before optimizations (such as constant folding) are applied.
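As a sketch of what "counting before the rewrite" could look like, the counter below keys each aggregate call by its name plus DISTINCT/FILTER flags, so each combination is tallied as a distinct aggregator type. All type and field names here are hypothetical illustrations, not RisingWave's actual internals.

```rust
use std::collections::HashMap;

// Hypothetical telemetry key for an aggregate call: the name the user wrote,
// plus whether DISTINCT or FILTER were present -- each combination counts as
// a different aggregator type, per the notes above.
#[derive(Clone, Debug, Hash, PartialEq, Eq)]
struct AggKey {
    name: String,
    distinct: bool,
    has_filter: bool,
}

// Count aggregate usage from the pre-rewrite call list, so that e.g. AVG is
// recorded as AVG rather than as the SUM/COUNT pair it may be rewritten into.
fn count_aggs(calls: &[AggKey]) -> HashMap<AggKey, u64> {
    let mut counts: HashMap<AggKey, u64> = HashMap::new();
    for call in calls {
        *counts.entry(call.clone()).or_insert(0) += 1;
    }
    counts
}
```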

per-streaming job

For workload analysis we need more detailed information than simple grouped counts. For example, whether an aggregation (agg) is placed before or after a join significantly affects the workload. Therefore, we need to maintain a simple plan tree for each streaming job, with each node carrying some telemetry information about itself.

  • For each streaming job, capture the plan tree without detailed information such as expressions (expr).
    • Store attributes for each plan output, including:
      • Whether state is cleaned with a watermark on the join key.
      • Whether state is cleaned with a watermark on an interval condition.
      • Whether it is append-only.
      • (Optional) Whether it is a stream that has been aggregated.
      • (Optional) Whether it has been constrained by a temporal filter.
    • Specifically for joins, include the following:
      • Join type.
      • Whether it involves watermark cleanup.
      • Whether it uses interval join state cleaning.
    • Specifically for aggregations (agg), include the following:
      • The number of materialized input and value states.
      • The number of distinct keys.
      • Whether it involves watermark cleanup.
      • Whether emit-on-window-close (EOWC) is enabled.
  • The rules applied, and the number of times each was applied, in HeuristicOptimizer, which is already maintained in https://github.com/risingwavelabs/risingwave/blob/main/src/frontend/src/optimizer/heuristic_optimizer.rs
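A minimal sketch of the per-node telemetry tree described above (all names are illustrative, not an actual RisingWave schema): each node carries its generic attributes plus join- or agg-specific extras, and children are nested directly. A small recursive query is included to show the kind of question the backend could answer over such a tree.

```rust
// Illustrative per-node telemetry (hypothetical schema, not RisingWave's).
#[derive(Debug)]
enum NodeExtra {
    Join { join_type: String, watermark_cleanup: bool, interval_state_clean: bool },
    Agg { materialized_states: u32, distinct_keys: u32, watermark_cleanup: bool, eowc: bool },
    Other,
}

#[derive(Debug)]
struct PlanNodeTelemetry {
    node_type: String, // e.g. "HashJoin", "HashAgg" -- no exprs captured
    append_only: bool,
    extra: NodeExtra,
    children: Vec<PlanNodeTelemetry>,
}

impl PlanNodeTelemetry {
    // Example query over the tree: count nodes of a given type in the
    // subtree rooted at this node (including the node itself).
    fn count_nodes(&self, ty: &str) -> usize {
        let own = (self.node_type == ty) as usize;
        own + self.children.iter().map(|c| c.count_nodes(ty)).sum::<usize>()
    }
}
```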
@github-actions github-actions bot added this to the release-1.9 milestone Apr 16, 2024
@st1page st1page changed the title Tacking(telemetry): optimizer & plan informations for each streaming job Tacking(telemetry): operators and plan's informations for each streaming job Apr 16, 2024
@st1page
Contributor Author

st1page commented Apr 16, 2024

request for comments c.c. @fuyufjh @tabVersion @chenzl25

@fuyufjh
Member

fuyufjh commented Apr 16, 2024

Keeping the plan tree, rather than simple numeric metrics such as the count of operators, will introduce more complexity in the telemetry backend - now it has to understand the plan tree, and it might need to traverse the tree to get some detailed information. I am not sure how much complexity that is and whether it's worth it.

Let me ask a question. Suppose we have to write some queries to answer "how many joins per query for a specific user", either on the telemetry backend or in some subsequent analysis tool such as Grafana, Metabase, etc. Which one do you prefer: storing the plan tree, or flattened numbers, e.g. the number of HashJoins in a query?

@st1page
Contributor Author

st1page commented Apr 16, 2024

Keeping the plan tree, rather than simple number metrics such as count of operators, will introduce more complexity in telemetry backend - now it has to understand the plan tree, and it might need to traverse through the tree to get some detailed information. I am not sure how much complexity it is and whether it's worth.

Perhaps a better approach would be to flatten the storage of this tree, storing the nodes in an array and using indices for mutual referencing. This way, we can

  • preserve the original data structure when we really need it,
  • store certain statistical data for each operator, and
  • when simpler data is required, quickly obtain it by applying some aggregation on the telemetry backend.
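The flattening idea might look like the sketch below (names are hypothetical): nodes live in one Vec, edges are stored as indices, and a simple question like "how many HashJoins" becomes a flat scan with no tree traversal, while the original structure stays recoverable through the child indices.

```rust
// Sketch of a flattened plan tree: nodes in a Vec, edges as indices.
struct FlatNode {
    node_type: String,
    children: Vec<usize>, // indices into FlatPlan::nodes
}

struct FlatPlan {
    nodes: Vec<FlatNode>,
    root: usize, // index of the root node
}

impl FlatPlan {
    // "Simpler data" falls out of a flat scan -- no traversal needed.
    fn count_of(&self, ty: &str) -> usize {
        self.nodes.iter().filter(|n| n.node_type == ty).count()
    }

    // The original tree shape is still recoverable when we really need it.
    fn children_of(&self, idx: usize) -> &[usize] {
        &self.nodes[idx].children
    }
}
```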


github-actions bot commented Aug 1, 2024

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean.
Don't worry if you think the issue is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄

@fuyufjh fuyufjh changed the title Tacking(telemetry): operators and plan's informations for each streaming job Tacking: Telemetry Aug 2, 2024
@fuyufjh fuyufjh assigned tabVersion and unassigned st1page Aug 2, 2024
@fuyufjh fuyufjh added type/tracking Tracking issue. and removed no-issue-activity labels Aug 2, 2024
@tabVersion
Contributor

A summary of the current status.

The following items will be carried out in this order:

  • Applied rules and the number of times each was applied in HeuristicOptimizer.
    • This one has reached the conclusion that we can deliver the info once the streaming job is up and has its own catalog_id.
  • Capture the plan tree.
    • This item does not have a clear design yet. As described above, the attributes involved come from different steps of the planning stage, and it may need more preparatory work.
