
feat: improve progress msg in SHOW JOBS for source backfill #18925

Merged
merged 3 commits into from
Oct 22, 2024

Conversation

xxchan
Member

@xxchan xxchan commented Oct 16, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Previous discussion here: #18112 (comment)

Previously, the progress in SHOW JOBS was a percentage string. After we introduced different types of backfill, things became trickier:

  • For some types, like SourceBackfill, we cannot even tell the end position, so there is no percentage at all.
  • We may have different types of backfill in one MV; how to display them together is a problem.
  • Snapshot backfill introduced a complex string as progress:
    CreatingStreamingJobStatus::ConsumingSnapshot {
        create_mview_tracker,
        ..
    } => {
        if create_mview_tracker.has_pending_finished_jobs() {
            "Snapshot finished".to_string()
        } else {
            let progress = create_mview_tracker
                .gen_ddl_progress()
                .remove(&self.info.table_fragments.table_id().table_id)
                .expect("should exist");
            format!("Snapshot [{}]", progress.progress)
        }
    }
    CreatingStreamingJobStatus::ConsumingLogStore {
        start_consume_log_store_epoch,
        ..
    } => {
        let max_collected_epoch = max(
            self.barrier_control.max_collected_epoch().unwrap_or(0),
            self.backfill_epoch,
        );
        let lag = Duration::from_millis(
            Epoch(*start_consume_log_store_epoch)
                .physical_time()
                .saturating_sub(Epoch(max_collected_epoch).physical_time()),
        );
        format!(
            "LogStore [remain lag: {:?}, epoch cnt: {}]",
            lag,
            self.barrier_control.inflight_barrier_count()
        )
    }
    CreatingStreamingJobStatus::ConsumingUpstream { .. } => {
        format!(
            "Upstream [unattached: {}, epoch cnt: {}]",
            self.barrier_control.unattached_epochs().count(),
            self.barrier_control.inflight_barrier_count(),
        )
    }
    CreatingStreamingJobStatus::Finishing { .. } => {
        format!(
            "Finishing [epoch count: {}]",
            self.barrier_control.inflight_barrier_count()
        )
    }

The idea of this PR is simple: just report each backfill type's progress separately, and abuse the string.
(Ideas to improve how the progress string looks are welcome.)

match (mv_progress, source_progress) {
    (Some(mv_progress), Some(source_progress)) => {
        format!(
            "MView Backfill: {}, Source Backfill: {}",
            mv_progress, source_progress
        )
    }
    (Some(mv_progress), None) => mv_progress,
    (None, Some(source_progress)) => source_progress,
    (None, None) => "Unknown".to_string(),
}

Example:

┌────┬──────────────────────────────────────────────────────────────────────────────────────┬────────────────────────────────────────────────────────────────┐
│ Id │                                      Statement                                       │                            Progress                            │
├────┼──────────────────────────────────────────────────────────────────────────────────────┼────────────────────────────────────────────────────────────────┤
│ 22 │ CREATE MATERIALIZED VIEW mv AS SELECT * FROM t UNION SELECT v1 FROM s_before_produce │ MView Backfill: 99.99% (3/3), Source Backfill: 0 rows consumed │
└────┴──────────────────────────────────────────────────────────────────────────────────────┴────────────────────────────────────────────────────────────────┘
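For illustration, the combination logic above can be exercised as a standalone function. This is a hypothetical free function (the real code lives in a method and reads `self` fields); it shows how the four cases render:

```rust
// Standalone sketch of the combination logic above (hypothetical free
// function; in the actual code the Options come from self's counters).
fn combine_progress(mv_progress: Option<String>, source_progress: Option<String>) -> String {
    match (mv_progress, source_progress) {
        (Some(mv), Some(source)) => {
            format!("MView Backfill: {}, Source Backfill: {}", mv, source)
        }
        (Some(mv), None) => mv,
        (None, Some(source)) => source,
        (None, None) => "Unknown".to_string(),
    }
}

fn main() {
    // Reproduces the Progress cell from the SHOW JOBS example above.
    println!(
        "{}",
        combine_progress(Some("99.99% (3/3)".into()), Some("0 rows consumed".into()))
    );
}
```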

Part of #18338

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

Member Author

xxchan commented Oct 16, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.


@xxchan xxchan changed the title . feat: Oct 16, 2024
@xxchan xxchan changed the title feat: feat: improve progress msg for source backfill Oct 16, 2024
@xxchan xxchan changed the title feat: improve progress msg for source backfill feat(WIP): improve progress msg for source backfill Oct 16, 2024
@xxchan xxchan changed the title feat(WIP): improve progress msg for source backfill feat(WIP): improve progress msg in SHOW JOBS for source backfill Oct 16, 2024
@xxchan xxchan force-pushed the 09-06-feat_improve_progress_msg_for_source_backfill branch from 20a0b47 to 05b68bc Compare October 17, 2024 14:18
@xxchan xxchan marked this pull request as ready for review October 17, 2024 14:28
@xxchan xxchan changed the title feat(WIP): improve progress msg in SHOW JOBS for source backfill feat: improve progress msg in SHOW JOBS for source backfill Oct 17, 2024
@xxchan xxchan force-pushed the 09-06-feat_improve_progress_msg_for_source_backfill branch 2 times, most recently from c662fa7 to c9d083b Compare October 17, 2024 14:39
@xxchan xxchan force-pushed the 09-06-feat_improve_progress_msg_for_source_backfill branch 3 times, most recently from 3498778 to 4a18292 Compare October 18, 2024 00:40
Comment on lines 441 to 442
// TODO: persist backfilled row count?
let mut total_backfilled_rows: u64 = 0;
Member Author

Follow-up here #18983, where I'm not quite satisfied with the solution. 😕

@xxchan xxchan force-pushed the 09-06-feat_improve_progress_msg_for_source_backfill branch from 4a18292 to 1c50a1b Compare October 18, 2024 03:15
@xxchan xxchan force-pushed the 09-06-feat_improve_progress_msg_for_source_backfill branch from 1c50a1b to eaa34e4 Compare October 18, 2024 07:04
@xxchan xxchan force-pushed the 09-06-feat_improve_progress_msg_for_source_backfill branch from eaa34e4 to 0f6248b Compare October 18, 2024 07:27
Comment on lines +174 to +204
let mv_progress = (mv_count > 0).then_some({
    if self.upstream_mvs_total_key_count == 0 {
        "99.99%".to_string()
    } else {
        let mut progress = self.mv_backfill_consumed_rows as f64
            / (self.upstream_mvs_total_key_count as f64);
        if progress > 1.0 {
            progress = 0.9999;
        }
        format!(
            "{:.2}% ({}/{})",
            progress * 100.0,
            self.mv_backfill_consumed_rows,
            self.upstream_mvs_total_key_count
        )
    }
});
let source_progress = (source_count > 0).then_some(format!(
    "{} rows consumed",
    self.source_backfill_consumed_rows
));
match (mv_progress, source_progress) {
    (Some(mv_progress), Some(source_progress)) => {
        format!(
            "MView Backfill: {}, Source Backfill: {}",
            mv_progress, source_progress
        )
    }
    (Some(mv_progress), None) => mv_progress,
    (None, Some(source_progress)) => source_progress,
    (None, None) => "Unknown".to_string(),
Member Author

core logic here
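The MV side of this logic can be sketched as a standalone function (hypothetical free function; the real code reads `self` fields). Note the two clamps: an unknown upstream key count and an overshooting row count both render as 99.99%, so the display never claims completion before the job actually finishes:

```rust
// Sketch of the mv_progress branch above, with self's fields as parameters.
fn format_mv_progress(consumed_rows: u64, upstream_total_key_count: u64) -> String {
    if upstream_total_key_count == 0 {
        // Upstream key count unknown or empty: avoid dividing by zero.
        return "99.99%".to_string();
    }
    let mut progress = consumed_rows as f64 / upstream_total_key_count as f64;
    if progress > 1.0 {
        // The key count is an estimate, so consumed rows may overshoot it.
        progress = 0.9999;
    }
    format!(
        "{:.2}% ({}/{})",
        progress * 100.0,
        consumed_rows,
        upstream_total_key_count
    )
}

fn main() {
    // Overshoot (5 rows consumed against an estimate of 4) clamps to 99.99%.
    println!("{}", format_mv_progress(5, 4));
}
```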

@xxchan xxchan force-pushed the 09-06-feat_improve_progress_msg_for_source_backfill branch from 7d123f2 to d3ef9bb Compare October 21, 2024 01:47
@kwannoel
Contributor

kwannoel commented Oct 21, 2024

Can we show progress via backfilled offset instead?

I assume offsets are monotonically increasing.

So something like:

average backfilled offset / upstream offset

@xxchan
Member Author

xxchan commented Oct 21, 2024

Can we show progress via backfilled offset instead?

Good point. 🤔

  • If so, we don't need to persist the number of rows any more (feat: persist processed rows for SourceBackfill for SHOW JOBS #18983)
  • We will need a new code path for progress reporting, and cannot rely on the same code as MV backfill.
  • This information is in the state table. (Can we just use the state table for SHOW JOBS, and not report progress at all any more 🤣? (We'd still need to report finish, though.))
    • Maybe we can also add a state tables column in SHOW JOBS for users to check detailed progress.

@xxchan
Member Author

xxchan commented Oct 21, 2024

But there's still a problem: we have per-partition offsets (i.e., in the state table we have detailed per-split progress). How should we aggregate them for users?


So something like:

average backfilled offset / upstream offset

Oh, I see your point here.
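One possible shape of that aggregation, as a rough sketch under the assumption that every split exposes comparable numeric offsets (Kafka-style array indexes; all names here are hypothetical): average the per-split fractions rather than the raw offsets, since splits can have very different sizes.

```rust
// Hypothetical per-split aggregation: each entry is
// (backfilled_offset, upstream_high_watermark) for one split.
fn average_split_progress(splits: &[(u64, u64)]) -> Option<f64> {
    if splits.is_empty() {
        return None;
    }
    let sum: f64 = splits
        .iter()
        .map(|&(backfilled, upstream)| {
            if upstream == 0 {
                1.0 // an empty split is trivially done
            } else {
                (backfilled as f64 / upstream as f64).min(1.0)
            }
        })
        .sum();
    Some(sum / splits.len() as f64)
}

fn main() {
    // One split half done plus one fully done averages to 0.75.
    println!("{:?}", average_split_progress(&[(50, 100), (10, 10)]));
}
```

As the following comment points out, this assumption does not hold for connectors like Kinesis, whose sequence numbers are not dense integers.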

@xxchan
Member Author

xxchan commented Oct 21, 2024

Some points:

  • We don't always have the ability to calculate "Lag size" (or average backfilled offset / upstream offset)

    Note that "offset" may mean very different things in different connectors; unlike the Kafka offset, it is not always as simple as an array index.

    e.g., the Kinesis offset is not consecutive; it looks like "SequenceNumber": "49639543389862895699365693304023657829928072185817071618". It can't be averaged, and we have no idea about the progress just from the sequence number.

    See also https://arc.net/l/quote/rdvyjwyz (Let source backfill finish when there's no data from upstream #18299)

    Therefore, num_backfilled_rows might still be a valuable thing.

  • We actually have a metric on Grafana: "Kafka Consumer Lag Size".

    Not sure whether we can get similar things for other connectors like Kinesis.

  • Most of the time, backfill should finish very quickly, and not many users will care about observability of the backfill progress. (And we do have some advanced tricks for advanced users, as mentioned above.) So I'm thinking it may not be worth the effort to implement this perfectly. 🤪 Of course, we can gradually improve it later.

@kwannoel
Contributor

For the Kinesis offset, I think as long as it is monotonically increasing, we can still use its average.

@xxchan xxchan force-pushed the 09-06-feat_improve_progress_msg_for_source_backfill branch from d3ef9bb to a38b1a4 Compare October 21, 2024 08:48
@xxchan xxchan changed the base branch from main to 10-17-feat_persist_processed_rows_for_sourcebackfill_for_show_jobs October 21, 2024 08:48
@xxchan xxchan force-pushed the 09-06-feat_improve_progress_msg_for_source_backfill branch from a38b1a4 to 5ad1867 Compare October 21, 2024 08:55
Base automatically changed from 10-17-feat_persist_processed_rows_for_sourcebackfill_for_show_jobs to main October 21, 2024 10:02
@graphite-app graphite-app bot requested a review from a team October 21, 2024 10:04
@xxchan xxchan force-pushed the 09-06-feat_improve_progress_msg_for_source_backfill branch from 5ad1867 to 007900f Compare October 21, 2024 14:31
@xxchan
Member Author

xxchan commented Oct 22, 2024

@kwannoel Do you think we can merge it in current form and refine later?

@xxchan xxchan added this pull request to the merge queue Oct 22, 2024
Merged via the queue into main with commit f637d0f Oct 22, 2024
33 of 34 checks passed
@xxchan xxchan deleted the 09-06-feat_improve_progress_msg_for_source_backfill branch October 22, 2024 05:15
xxchan added a commit that referenced this pull request Oct 22, 2024
github-merge-queue bot pushed a commit that referenced this pull request Oct 22, 2024