Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tracking: support partial checkpoint #14041

Open
11 of 25 tasks
wenym1 opened this issue Dec 18, 2023 · 1 comment
Open
11 of 25 tasks

Tracking: support partial checkpoint #14041

wenym1 opened this issue Dec 18, 2023 · 1 comment
Assignees
Milestone

Comments

@wenym1
Copy link
Contributor

wenym1 commented Dec 18, 2023

Original rfc: risingwavelabs/rfcs#84

This rfc involves changes in 3 layers: batch query, streaming job and hummock storage. We are going to implement the whole rfc from bottom layer to upper layer step by step. Since the current global checkpoint is a special case of partial checkpoint, though the some features of partial checkpoint will have been implemented in bottom layer, we can keep the same global checkpoint logic in upper layer before we have implemented logic in upper layer.

After the partial checkpoint is supported, we can then work on partial recovery to isolate the failure from different MVs.

Following are some changes in each layer

hummock Storage

First we will refactor the current code to implement part of the features required by partial checkpoint while remaining the same as current logic. This includes the following:

  • change the hummock version metadata to maintain per table max committed epoch and safe epoch
  • change the process of collecting barrier from two separate one-shot rpc calls to a long existing streaming rpc between CN and meta to report barrier progress.

Meanwhile, we can develop L0 as a log so that we can reuse the data of MV.

streaming job

First we will implement new streaming executor similar to a source executor that consume the logs of upstream MV.

Second we will implement a partial checkpoint manager that comprehend the streaming graph and collect the barriers reported from each MV parallelism and trigger partial checkpoint.

batch query

By now the batch query will use the global max committed epoch as the query epoch. We can implement different query consistency for batch query. Different query consistency means the policy to choose the query epoch of different state table. The default one is to use the global max committed epoch.

partial recovery

First we can support the failure isolation among MVs that are not connected.

Second we can support failure isolation between upstream and downstream MVs .

Development progress:

Ongoing:

All tasks:

@kwannoel
Copy link
Contributor

kwannoel commented Apr 2, 2024

batch query

By now the batch query will use the global max committed epoch as the query epoch. We can implement different query consistency for batch query. Different query consistency means the policy to choose the query epoch of different state table. The default one is to use the global max committed epoch.

For batch query can we use global checkpoint of created stream jobs i.e. excluding those stream jobs which are in creating process. Such that stream jobs being created will not affect freshness of batch query.

@wenym1 wenym1 modified the milestones: release-1.8, release-1.9 Apr 8, 2024
@wenym1 wenym1 modified the milestones: release-1.9, release-1.10 May 14, 2024
@wenym1 wenym1 modified the milestones: release-2.2, release-2.3 Dec 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants