Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(storage): allow dynamic enable/disable minitrace for tiered cache #17433

Closed
wants to merge 15 commits into from

Conversation

MrCroxx
Copy link
Contributor

@MrCroxx MrCroxx commented Jun 25, 2024

Signed-off-by: MrCroxx [email protected]I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Introduce system parameters to dynamically enable/disable and configure minitrace for tiered cache.

dev=> SHOW PARAMETERS;
                  Name                  |       Value        |                       Description                       | Mutable
----------------------------------------+--------------------+---------------------------------------------------------+---------
 # ... ...
  tracing_threshold_tiered_cache_read_ms         | 1000               | Threshold to record tiered cache read with minitrace.   | t
  tracing_threshold_tiered_cache_write_ms        | 1000               | Threshold to record tiered cache write with minitrace.  | t

To enable tracing, set opentelemetry endpoint via env RW_TRACING_ENDPOINT and set system parameter enable_tracing as true.

image

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@MrCroxx MrCroxx self-assigned this Jun 25, 2024
@MrCroxx MrCroxx marked this pull request as ready for review June 25, 2024 06:29
@MrCroxx MrCroxx requested a review from a team as a code owner June 25, 2024 06:29
@MrCroxx MrCroxx requested review from hzxa21 and yezizp2012 June 25, 2024 06:29
@MrCroxx
Copy link
Contributor Author

MrCroxx commented Jun 25, 2024

Discussed offline with @tabVersion and @BugenZhao . The jaeger exporter will be replaced with OpenTelemetry exporter and reuse the env of the distributed tracing.

Copy link
Member

@yezizp2012 yezizp2012 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM for the system parameter part.

@BugenZhao BugenZhao self-requested a review June 25, 2024 09:21
Copy link
Collaborator

@hzxa21 hzxa21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@BugenZhao BugenZhao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and set system parameter enable_tracing as true.

This doesn't seem to be implemented? Shall we also handle the change on system parameter enable_tracing for minitrace?

// Toggle the distributed tracing layer.
if let Some(enabled) = diff.enable_tracing {
toggle_otel_layer(enabled);
}

Comment on lines +607 to +608
optional uint32 tracing_threshold_tiered_cache_read_ms = 18;
optional uint32 tracing_threshold_tiered_cache_write_ms = 19;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Personally I would suggest making them configuration entries (like under [storage.cache.developer]) instead of system parameters, since it looks to be less user-facing and does not require frequent online changes. 😕

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since it looks to be less user-facing and does not require frequent online changes

+1. But the thing is, tail-based tracing is supposed to be enabled for only a few seconds to capture abnormal spans in the current state. Using a config file requires a restart, which may destroy the scene.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we simply use default values that are small enough and rely only on enable_tracing to toggle it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe enable/disable with RPC is better?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate more?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Like we do now to capture await tree and other profiling data, add a RPC function to control the tracing configurations.

service MonitorService {
rpc StackTrace(StackTraceRequest) returns (StackTraceResponse);
rpc Profiling(ProfilingRequest) returns (ProfilingResponse);
rpc HeapProfiling(HeapProfilingRequest) returns (HeapProfilingResponse);
rpc ListHeapProfiling(ListHeapProfilingRequest) returns (ListHeapProfilingResponse);
rpc AnalyzeHeap(AnalyzeHeapRequest) returns (AnalyzeHeapResponse);
rpc GetBackPressure(GetBackPressureRequest) returns (GetBackPressureResponse);
}

.unwrap()
.unwrap();

let exporter = opentelemetry_otlp::new_exporter()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the difference between this exporter with the one in L449? Can we reuse them?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The interface requires a builder. 🥹

let reporter = OpenTelemetryReporter::new(
exporter,
SpanKind::Server,
Cow::Owned(Resource::new([KeyValue::new(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The resource can be reused as well.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The service name should be different from the distributed tracing.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Interesting. Is it because there is currently no way to maintain the parent relationship between tracing spans and minitrace spans? So we'd rather make them not interleaving with each other by assigning a different service name.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Emmm, nop. It is because I think it is not a good idea to assign the distributed tracing and lib-level tracing in the same service name. 👀 Separating them can give a clearer view.

@MrCroxx
Copy link
Contributor Author

MrCroxx commented Jul 4, 2024

Discussed with @BugenZhao offline, I'll switch to rpc with risectl.

@MrCroxx MrCroxx closed this Jul 4, 2024
@MrCroxx MrCroxx deleted the xx/tracing branch July 4, 2024 06:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants