Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(telemetry): support report event to telemetry #17486

Merged
merged 22 commits into from
Jul 26, 2024
Merged

Conversation

tabVersion
Copy link
Contributor

@tabVersion tabVersion commented Jun 27, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

impl an event report framework to telemetry, it sends a data point when hitting a specific logic.

The report has the following schema

field name type notes
tracking_id varchar
event_time_sec uint64 The produce time in sec from UNIX time
event_stage i32 (enum) the event happens in which stage, CREATE/UPDATE/DROP/QUERY/RECOVERY
feature_name varchar The name of the feature
connector_name varchar/NULL The connector name if involved
object i32 (enum)/NULL The database object where the event happens, if involved. Accept TABLE/SOURCE/SINK/MV/INDEX
catalog_id i64 table_id/source_id/sink_id
node varchar The node where the event happens, accept FRONTEND/META/COMPUTE/COMPACTOR
attributes jsonb/NULL Other info to be collected
is_test bool If the message is from UT. If true, the report will not goes into storage.

tracing points contained in this pr

  • recovery: normal recovery, failure recovery (meet failure in nodes), ad-hoc recovery (request from the frontend)
  • source/sink: report when one executor builds

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

Signed-off-by: tabVersion <tabvision@bupt.icu>
Signed-off-by: tabVersion <tabvision@bupt.icu>
Signed-off-by: tabVersion <tabvision@bupt.icu>
Signed-off-by: tabVersion <tabvision@bupt.icu>
Copy link
Contributor

@github-actions github-actions bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

license-eye has totally checked 5169 files.

Valid Invalid Ignored Fixed
2211 1 2957 0
Click to see the invalid file list
  • src/stream/src/telemetry.rs

src/stream/src/telemetry.rs Outdated Show resolved Hide resolved
Signed-off-by: tabVersion <tabvision@bupt.icu>
@tabVersion tabVersion requested review from fuyufjh and xxchan June 28, 2024 06:53
Signed-off-by: tabVersion <tabvision@bupt.icu>
Signed-off-by: tabVersion <tabvision@bupt.icu>
static COMPUTE_TELEMETRY_SESSION_ID: OnceLock<String> = OnceLock::new();
static COMPUTE_TELEMETRY_TRACKING_ID: OnceLock<String> = OnceLock::new();

pub fn set_compute_telemetry_tracking_id_and_session_id(tracking_id: String, session_id: String) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm, these 3 functions usually should be part of the TelemetryManager in my mind.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but it needs to pass a telemetry manager instance when calling. Sometimes we want to know whether a corner case got hit, passing a telemetry manager instance can be annoying and involves more code changes.

I plan to simplify the func calling process, encouraging ENG to collect features usage at minimal complexity. Directly calling a func is the simplest way so I have to use a global variable to keep the tracking_id and session_id.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about making it a singleton, such as GLOBAL_METRICS?

static GLOBAL_METRICS: LazyLock<MetricsVec> =
LazyLock::new(|| MetricsVec::new(&GLOBAL_METRICS_REGISTRY));

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For session_id, it varies for different nodes, and each time one node restarts. Putting it as a singleton can break the behavior, resulting in not being aligned with the prev reports.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand the points. 👀

  1. First, why do we need these get_set defined in different modules, but not just in common? Different nodes will just call it at different places. (Like how we share system params for different nodes)
  2. Regardless of what their values are, we just created tracking_id and session_id once in start_telemetry_reporting, so why aren't they singletons? Can you give an example how can we break it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • The tracking ID is unique and immutable across the entire cluster. I support implementing this as a global singleton.
  • The session ID is updated whenever a node is created or restarted. In single-node mode, all nodes run within the same process. Using a singleton in this context might cause these session IDs to overlap, failing to reflect the true state of the cluster.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying that the only problem you are concerned about is single-node mode.

But I'm not sure why do we need to use different session ID for different nodes in single-node mode. What's the purpose of session_id?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It marks how long a node lives, providing an identifier to show what happens in one session. For now, we have no analysis on it, not sure whether we want it in the future.

But I'm not sure why do we need to use different session ID for different nodes in single-node mode.

According to the git history, the logic is impl before introducing single-node mode.
In the prev impl, the session id is generated inside the func. The nodes get different session ids in both single-node mode and distributed mode. I introduce OnceLock to align the behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1 for removing session_id because we don't really have such a purpose now.

If we are worrying about restarts/crashes, consider reporting the events in the system table system_events, where critical events such as cause of recovery are stored.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, the expected behavior is

  • removing the session id part in the events framework, but keep it in stats report, may remove it in future pr.
  • impl the tracking id as a singleton in telemetry_common.

proto/telemetry.proto Outdated Show resolved Hide resolved
Signed-off-by: tabVersion <tabvision@bupt.icu>
Signed-off-by: tabVersion <tabvision@bupt.icu>
Signed-off-by: tabVersion <tabvision@bupt.icu>
@tabVersion tabVersion marked this pull request as ready for review July 2, 2024 06:32
@tabVersion tabVersion requested a review from a team as a code owner July 2, 2024 06:32
Signed-off-by: tabVersion <tabvision@bupt.icu>
src/stream/src/telemetry.rs Outdated Show resolved Hide resolved
proto/telemetry.proto Outdated Show resolved Hide resolved
FRONTEND_TELEMETRY_SESSION_ID.set(session_id).unwrap();
}

fn _get_frontend_telemetry_tracking_id_and_session_id() -> (Option<String>, Option<String>) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
fn _get_frontend_telemetry_tracking_id_and_session_id() -> (Option<String>, Option<String>) {
fn get_frontend_telemetry_tracking_id_and_session_id() -> (Option<String>, Option<String>) {

Are you trying to say #[allow(unused)] or something?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, I am not sure we want such attr. #[allow(dead_code)] cause some trouble in source, and @xxchan complained about it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I complained using #![allow(dead_code)] at the top-level.

  • Using it on individual functions is ok.
  • Whether top-level or function-level, #[expect(dead_code, reason="xxx")] is better. Because it tells the reason, and tells you to remove it when the lint is no longer fullfilled.

static COMPUTE_TELEMETRY_SESSION_ID: OnceLock<String> = OnceLock::new();
static COMPUTE_TELEMETRY_TRACKING_ID: OnceLock<String> = OnceLock::new();

pub fn set_compute_telemetry_tracking_id_and_session_id(tracking_id: String, session_id: String) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about making it a singleton, such as GLOBAL_METRICS?

static GLOBAL_METRICS: LazyLock<MetricsVec> =
LazyLock::new(|| MetricsVec::new(&GLOBAL_METRICS_REGISTRY));

@BugenZhao
Copy link
Member

Hi, could you provide some information in the PR body? Or is this PR not yet ready for review?

proto/telemetry.proto Outdated Show resolved Hide resolved
@tabVersion
Copy link
Contributor Author

Hi, could you provide some information in the PR body? Or is this PR not yet ready for review?

Updated

Signed-off-by: tabVersion <tabvision@bupt.icu>
Signed-off-by: tabVersion <tabvision@bupt.icu>
Signed-off-by: tabVersion <tabvision@bupt.icu>
Signed-off-by: tabVersion <tabvision@bupt.icu>
@fuyufjh fuyufjh self-requested a review July 8, 2024 03:26
Copy link
Member

@fuyufjh fuyufjh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The rest LGTM

let url = (TELEMETRY_REPORT_URL.to_owned() + "/" + TELEMETRY_EVENT_REPORT_TYPE).to_owned();
post_telemetry_report_pb(&url, report_bytes)
.await
.unwrap_or_else(|e| tracing::info!("{}", e))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
.unwrap_or_else(|e| tracing::info!("{}", e))
.unwrap_or_else(|e| tracing::error!("failed to report telemetry events", error = %e.as_report()))

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should not use error level for telemetry, which might make users unhappy. lol (including other existing logs for telemetry, not only here)

From users' point of view, failing to send telemetry report is indeed nothing harmful.

(Perhaps even warn should not be used?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have the same concern, a user wonders a telemetry's warning can affect the system.

Copy link
Member

@xxchan xxchan Jul 8, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I think we are talking about the same user. lol

We can fix it in this PR by convenience


pub fn report_event_common(
event_stage: PbTelemetryEventStage,
feature_name: String,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shall we name it "event_name"? The examples in this PR are normal_recovery, adhoc_recovery, failure_recovery etc. which doesn't seem to be "feature" to me.

Besides, &str seems better, because it implies the event name must be a static string, and dynamic values should be put in attributes.

Suggested change
feature_name: String,
event_name: &str,

tabVersion and others added 4 commits July 25, 2024 01:04
Signed-off-by: tabversion <tabversion@bupt.icu>
Signed-off-by: tabversion <tabversion@bupt.icu>
fix
Signed-off-by: tabversion <tabversion@bupt.icu>
@tabVersion tabVersion added this pull request to the merge queue Jul 26, 2024
Merged via the queue into main with commit 517cde2 Jul 26, 2024
32 of 33 checks passed
@tabVersion tabVersion deleted the tab/telemetry-event branch July 26, 2024 09:16
@tabVersion tabVersion added the need-cherry-pick-release-1.10 Open a cherry-pick PR to branch release-1.10 after the current PR is merged label Jul 29, 2024
tabVersion added a commit that referenced this pull request Jul 29, 2024
Signed-off-by: tabVersion <tabvision@bupt.icu>
Signed-off-by: tabversion <tabversion@bupt.icu>
Co-authored-by: tabversion <tabversion@bupt.icu>
@tabVersion
Copy link
Contributor Author

cherry pick to release-1.10 #17835

github-merge-queue bot pushed a commit that referenced this pull request Jul 29, 2024
Signed-off-by: tabVersion <tabvision@bupt.icu>
Signed-off-by: tabversion <tabversion@bupt.icu>
Co-authored-by: tabversion <tabversion@bupt.icu>
@tabVersion tabVersion mentioned this pull request Aug 6, 2024
2 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ci/run-e2e-single-node-tests need-cherry-pick-release-1.10 Open a cherry-pick PR to branch release-1.10 after the current PR is merged type/feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants