-
Notifications
You must be signed in to change notification settings - Fork 594
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(telemetry): support report event to telemetry #17486
Conversation
Signed-off-by: tabVersion <tabvision@bupt.icu>
Signed-off-by: tabVersion <tabvision@bupt.icu>
Signed-off-by: tabVersion <tabvision@bupt.icu>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
license-eye has totally checked 5169 files.
Valid | Invalid | Ignored | Fixed |
---|---|---|---|
2211 | 1 | 2957 | 0 |
Click to see the invalid file list
- src/stream/src/telemetry.rs
Signed-off-by: tabVersion <tabvision@bupt.icu>
Signed-off-by: tabVersion <tabvision@bupt.icu>
src/stream/src/telemetry.rs
Outdated
static COMPUTE_TELEMETRY_SESSION_ID: OnceLock<String> = OnceLock::new(); | ||
static COMPUTE_TELEMETRY_TRACKING_ID: OnceLock<String> = OnceLock::new(); | ||
|
||
pub fn set_compute_telemetry_tracking_id_and_session_id(tracking_id: String, session_id: String) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmmm, these 3 functions usually should be part of the TelemetryManager in my mind.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, but it needs to pass a telemetry manager instance when calling. Sometimes we want to know whether a corner case got hit, passing a telemetry manager instance can be annoying and involves more code changes.
I plan to simplify the func calling process, encouraging ENG to collect features usage at minimal complexity. Directly calling a func is the simplest way so I have to use a global variable to keep the tracking_id and session_id.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about making it a singleton, such as GLOBAL_METRICS
?
risingwave/src/expr/core/src/expr/expr_udf.rs
Lines 270 to 271 in da28570
static GLOBAL_METRICS: LazyLock<MetricsVec> = | |
LazyLock::new(|| MetricsVec::new(&GLOBAL_METRICS_REGISTRY)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For session_id
, it varies for different nodes, and each time one node restarts. Putting it as a singleton can break the behavior, resulting in not being aligned with the prev reports.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand the points. 👀
- First, why do we need these
get_set
defined in different modules, but not just incommon
? Different nodes will just call it at different places. (Like how we share system params for different nodes) - Regardless of what their values are, we just created
tracking_id
andsession_id
once instart_telemetry_reporting
, so why aren't they singletons? Can you give an example how can we break it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- The tracking ID is unique and immutable across the entire cluster. I support implementing this as a global singleton.
- The session ID is updated whenever a node is created or restarted. In single-node mode, all nodes run within the same process. Using a singleton in this context might cause these session IDs to overlap, failing to reflect the true state of the cluster.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for clarifying that the only problem you are concerned about is single-node mode.
But I'm not sure why do we need to use different session ID for different nodes in single-node mode. What's the purpose of session_id
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It marks how long a node lives, providing an identifier to show what happens in one session. For now, we have no analysis on it, not sure whether we want it in the future.
But I'm not sure why do we need to use different session ID for different nodes in single-node mode.
According to the git history, the logic is impl before introducing single-node mode.
In the prev impl, the session id is generated inside the func. The nodes get different session ids in both single-node mode and distributed mode. I introduce OnceLock
to align the behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for removing session_id
because we don't really have such a purpose now.
If we are worrying about restarts/crashes, consider reporting the events in the system table system_events
, where critical events such as cause of recovery are stored.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, the expected behavior is
- removing the session id part in the events framework, but keep it in stats report, may remove it in future pr.
- impl the tracking id as a singleton in telemetry_common.
Signed-off-by: tabVersion <tabvision@bupt.icu>
Signed-off-by: tabVersion <tabvision@bupt.icu>
src/frontend/src/telemetry.rs
Outdated
FRONTEND_TELEMETRY_SESSION_ID.set(session_id).unwrap(); | ||
} | ||
|
||
fn _get_frontend_telemetry_tracking_id_and_session_id() -> (Option<String>, Option<String>) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fn _get_frontend_telemetry_tracking_id_and_session_id() -> (Option<String>, Option<String>) { | |
fn get_frontend_telemetry_tracking_id_and_session_id() -> (Option<String>, Option<String>) { |
Are you trying to say #[allow(unused)]
or something?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, I am not sure we want such attr. #[allow(dead_code)]
cause some trouble in source, and @xxchan complained about it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I complained using #![allow(dead_code)]
at the top-level.
- Using it on individual functions is ok.
- Whether top-level or function-level,
#[expect(dead_code, reason="xxx")]
is better. Because it tells the reason, and tells you to remove it when the lint is no longer fullfilled.
src/stream/src/telemetry.rs
Outdated
static COMPUTE_TELEMETRY_SESSION_ID: OnceLock<String> = OnceLock::new(); | ||
static COMPUTE_TELEMETRY_TRACKING_ID: OnceLock<String> = OnceLock::new(); | ||
|
||
pub fn set_compute_telemetry_tracking_id_and_session_id(tracking_id: String, session_id: String) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about making it a singleton, such as GLOBAL_METRICS
?
risingwave/src/expr/core/src/expr/expr_udf.rs
Lines 270 to 271 in da28570
static GLOBAL_METRICS: LazyLock<MetricsVec> = | |
LazyLock::new(|| MetricsVec::new(&GLOBAL_METRICS_REGISTRY)); |
Hi, could you provide some information in the PR body? Or is this PR not yet ready for review? |
Updated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The rest LGTM
let url = (TELEMETRY_REPORT_URL.to_owned() + "/" + TELEMETRY_EVENT_REPORT_TYPE).to_owned(); | ||
post_telemetry_report_pb(&url, report_bytes) | ||
.await | ||
.unwrap_or_else(|e| tracing::info!("{}", e)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
.unwrap_or_else(|e| tracing::info!("{}", e)) | |
.unwrap_or_else(|e| tracing::error!("failed to report telemetry events", error = %e.as_report())) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should not use error
level for telemetry, which might make users unhappy. lol (including other existing logs for telemetry, not only here)
From users' point of view, failing to send telemetry report is indeed nothing harmful.
(Perhaps even warn
should not be used?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have the same concern, a user wonders a telemetry's warning can affect the system.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, I think we are talking about the same user. lol
We can fix it in this PR by convenience
src/common/src/telemetry/report.rs
Outdated
|
||
pub fn report_event_common( | ||
event_stage: PbTelemetryEventStage, | ||
feature_name: String, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Shall we name it "event_name"? The examples in this PR are normal_recovery
, adhoc_recovery
, failure_recovery
etc. which doesn't seem to be "feature" to me.
Besides, &str
seems better, because it implies the event name must be a static string, and dynamic values should be put in attributes
.
feature_name: String, | |
event_name: &str, |
Signed-off-by: tabversion <tabversion@bupt.icu>
Signed-off-by: tabVersion <tabvision@bupt.icu> Signed-off-by: tabversion <tabversion@bupt.icu> Co-authored-by: tabversion <tabversion@bupt.icu>
cherry pick to release-1.10 #17835 |
I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.
What's changed and what's your intention?
impl an event report framework to telemetry, it sends a data point when hitting a specific logic.
The report has the following schema
tracking_id
event_time_sec
event_stage
feature_name
connector_name
object
catalog_id
node
attributes
is_test
tracing points contained in this pr
Checklist
./risedev check
(or alias,./risedev c
)Documentation
Release note
If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.