
Tracking: Telemetry #16332

Open
2 tasks
st1page opened this issue Apr 16, 2024 · 5 comments
Assignees
Labels
type/tracking Tracking issue.
Milestone

Comments

@st1page
Contributor

st1page commented Apr 16, 2024

per-cluster

I believe we do not need to concern ourselves with where exactly the expressions and aggregators appear within the plan; rather, we want to know their usage ratio across the product. What we care about is the exact expressions and aggregators that users write in their SQL, not the optimized or rewritten forms.

  • Count the usage frequency of each type of aggregator used in streaming/batch queries, with statistics aggregated. Please note:
    • Aggregators with and without DISTINCT or FILTER clauses are counted as different types.
    • In RisingWave, aggregators such as AVG, VAR_POP, etc. may be rewritten into other aggregators; we need to count the aggregators before the rewrite.
  • Count the usage frequency of each type of function used in streaming/batch queries, with statistics aggregated.
    • Count the usage before optimizations (such as constant folding) are applied.
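As a sketch of what "counting before the rewrite" could look like, the counter below keys each aggregate call by its name plus DISTINCT/FILTER flags, so each combination is tallied as a distinct aggregator type. All type and field names here are hypothetical illustrations, not RisingWave's actual internals.

```rust
use std::collections::HashMap;

// Hypothetical telemetry key for an aggregate call: the name the user wrote,
// plus whether DISTINCT or FILTER were present -- each combination counts as
// a different aggregator type, per the notes above.
#[derive(Clone, Debug, Hash, PartialEq, Eq)]
struct AggKey {
    name: String,
    distinct: bool,
    has_filter: bool,
}

// Count aggregate usage from the pre-rewrite call list, so that e.g. AVG is
// recorded as AVG rather than as the SUM/COUNT pair it may be rewritten into.
fn count_aggs(calls: &[AggKey]) -> HashMap<AggKey, u64> {
    let mut counts: HashMap<AggKey, u64> = HashMap::new();
    for call in calls {
        *counts.entry(call.clone()).or_insert(0) += 1;
    }
    counts
}
```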

per-streaming job

For workload analysis we need more detailed information than simple grouped counts. For example, whether an aggregation (agg) is placed before or after a join significantly affects the workload. Therefore, we need to maintain a simple plan tree for each streaming job, with each node carrying some telemetry information about itself.

  • For each streaming job, capture the plan tree without detailed information such as expressions (expr).
    • Store attributes for each plan output, including:
      • Whether state is cleaned with a watermark on the join key.
      • Whether state is cleaned with a watermark on an interval condition.
      • Whether it is append-only.
      • (Optional) Whether it is a stream that has been aggregated.
      • (Optional) Whether it has been constrained by a temporal filter.
    • Specifically for joins, include the following:
      • Join type.
      • Whether it involves watermark cleanup.
      • Whether it uses interval join state cleaning.
    • Specifically for aggregations (agg), include the following:
      • The number of materialized input and value states.
      • The number of distinct keys.
      • Whether it involves watermark cleanup.
      • Whether emit-on-window-close (EOWC) is enabled.
  • The rules applied, and the number of times each was applied, in HeuristicOptimizer, which is already maintained in https://github.com/risingwavelabs/risingwave/blob/main/src/frontend/src/optimizer/heuristic_optimizer.rs
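A minimal sketch of the per-node telemetry tree described above (all names are illustrative, not an actual RisingWave schema): each node carries its generic attributes plus join- or agg-specific extras, and children are nested directly. A small recursive query is included to show the kind of question the backend could answer over such a tree.

```rust
// Illustrative per-node telemetry (hypothetical schema, not RisingWave's).
#[derive(Debug)]
enum NodeExtra {
    Join { join_type: String, watermark_cleanup: bool, interval_state_clean: bool },
    Agg { materialized_states: u32, distinct_keys: u32, watermark_cleanup: bool, eowc: bool },
    Other,
}

#[derive(Debug)]
struct PlanNodeTelemetry {
    node_type: String, // e.g. "HashJoin", "HashAgg" -- no exprs captured
    append_only: bool,
    extra: NodeExtra,
    children: Vec<PlanNodeTelemetry>,
}

impl PlanNodeTelemetry {
    // Example query over the tree: count nodes of a given type in the
    // subtree rooted at this node (including the node itself).
    fn count_nodes(&self, ty: &str) -> usize {
        let own = (self.node_type == ty) as usize;
        own + self.children.iter().map(|c| c.count_nodes(ty)).sum::<usize>()
    }
}
```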
@github-actions github-actions bot added this to the release-1.9 milestone Apr 16, 2024
@st1page st1page changed the title Tacking(telemetry): optimizer & plan informations for each streaming job Tacking(telemetry): operators and plan's informations for each streaming job Apr 16, 2024
@st1page
Contributor Author

st1page commented Apr 16, 2024

request for comments c.c. @fuyufjh @tabVersion @chenzl25

@fuyufjh
Member

fuyufjh commented Apr 16, 2024

Keeping the plan tree, rather than simple numeric metrics such as the count of operators, will introduce more complexity in the telemetry backend - now it has to understand the plan tree, and it might need to traverse the tree to get some detailed information. I am not sure how much complexity that is and whether it's worth it.

Let me ask a question. Suppose we have to write some queries to answer "how many joins per query for a specific user", either on the telemetry backend or in some subsequent analysis tool such as Grafana, Metabase, etc. Which one do you prefer: storing the plan tree, or flattened numbers, e.g. the number of HashJoins in a query?

@st1page
Contributor Author

st1page commented Apr 16, 2024

Keeping the plan tree, rather than simple number metrics such as count of operators, will introduce more complexity in telemetry backend - now it has to understand the plan tree, and it might need to traverse through the tree to get some detailed information. I am not sure how much complexity it is and whether it's worth.

Perhaps a better approach would be to flatten the storage of this tree, storing the nodes in an array and using indices for mutual referencing. This way, we can

  • preserve the original data structure when we really need it,
  • store certain statistical data for each operator, and
  • when simpler data is required, quickly obtain it by applying some aggregation on the telemetry backend.
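The flattening idea might look like the sketch below (names are hypothetical): nodes live in one Vec, edges are stored as indices, and a simple question like "how many HashJoins" becomes a flat scan with no tree traversal, while the original structure stays recoverable through the child indices.

```rust
// Sketch of a flattened plan tree: nodes in a Vec, edges as indices.
struct FlatNode {
    node_type: String,
    children: Vec<usize>, // indices into FlatPlan::nodes
}

struct FlatPlan {
    nodes: Vec<FlatNode>,
    root: usize, // index of the root node
}

impl FlatPlan {
    // "Simpler data" falls out of a flat scan -- no traversal needed.
    fn count_of(&self, ty: &str) -> usize {
        self.nodes.iter().filter(|n| n.node_type == ty).count()
    }

    // The original tree shape is still recoverable when we really need it.
    fn children_of(&self, idx: usize) -> &[usize] {
        &self.nodes[idx].children
    }
}
```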


github-actions bot commented Aug 1, 2024

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean.
Don't worry if you think the issue is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄

@fuyufjh fuyufjh changed the title Tacking(telemetry): operators and plan's informations for each streaming job Tacking: Telemetry Aug 2, 2024
@fuyufjh fuyufjh assigned tabVersion and unassigned st1page Aug 2, 2024
@fuyufjh fuyufjh added type/tracking Tracking issue. and removed no-issue-activity labels Aug 2, 2024
@tabVersion
Contributor

A summary of the current status.

The following items will be carried out in this order:

  • Applied rules and the number of times each was applied in HeuristicOptimizer.
    • This one has reached the conclusion that we can deliver the info once the streaming job is up and has its own catalog_id.
  • Capture the plan tree.
    • This item does not have a clear design yet. As described above, the attributes involved come from different steps of the planning stage, and it may need more preparatory work.
