shared source: don't backfill for the first MV #16576
Comments
I came up with a simple solution: sleep 1s after the SourceExecutor is resumed. 🤡

Another simple solution: change the poll strategy in the SourceBackfillExecutor. At the beginning, prefer the backfill side, then switch to preferring the upstream. However, I'm not sure whether the first …
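For illustration, here is a minimal sketch of the second idea using `futures::stream::select_with_strategy`. The toy streams and the fixed poll count are made up; a real executor would flip the preference based on how close backfill is to the upstream offset, not on a counter:

```rust
use futures::executor::block_on;
use futures::stream::{self, select_with_strategy, PollNext, StreamExt};

fn main() {
    block_on(async {
        // Toy stand-ins for the upstream source stream and the backfill stream.
        let upstream = stream::iter(vec!["upstream-1", "upstream-2"]);
        let backfill = stream::iter(vec!["backfill-1", "backfill-2", "backfill-3"]);

        // Prefer the backfill (right) side for the first few polls, then
        // switch to preferring the upstream (left) side.
        let mut polls = 0u32;
        let merged = select_with_strategy(upstream, backfill, move |_: &mut ()| {
            polls += 1;
            if polls <= 3 { PollNext::Right } else { PollNext::Left }
        });

        merged
            .for_each(|item| async move { println!("{item}") })
            .await;
    });
}
```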
@BugenZhao reminded me that even after this problem is solved, some scenarios are still not optimal: e.g., if we create 10 MVs together, we cannot ensure the later MVs catch up faster (and thus share work).
To conclude, the difference does not seem very large: we need to backfill N times for N MVs, and can only share work after reaching steady state.
Some further questions: why can't backfill catch up faster? The relevant check is in risingwave/src/stream/src/executor/source/source_backfill_executor.rs, lines 495 to 502 at b86ffb2.
This might also be the reason why, in the original issue's figure, fragment 1 is faster than fragment 2.

I think the algorithm can be optimized: at the beginning, when backfill is far from the upstream, we don't need to check the upstream's offset at all; we are in a "fast chasing" stage (see the sketch below). But even with this optimization, I think backfill still cannot catch up: if the backfill is fast, the upstream source also has a lot of work to do, so it has to be backpressured or rate limited.

Is it possible to share work for historical data? If so, there are several ways to do it.
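Back to the "fast chasing" stage mentioned above: a minimal sketch, assuming the executor keeps a cached upstream offset updated from barriers or periodic metadata fetches. The names and the threshold are hypothetical, not RisingWave's actual code:

```rust
/// Hypothetical threshold: while backfill lags the upstream by more than
/// this, stay in the "fast chasing" stage and skip per-chunk offset checks.
const CHASING_THRESHOLD: u64 = 100_000;

/// Returns true once backfill is close enough to the upstream that the
/// per-chunk upstream-offset comparison (and the backfill-finished decision)
/// is worth performing.
fn should_check_upstream(backfill_offset: u64, latest_upstream_offset: u64) -> bool {
    latest_upstream_offset.saturating_sub(backfill_offset) <= CHASING_THRESHOLD
}

fn main() {
    // Far behind: keep chasing without comparing offsets per chunk.
    assert!(!should_check_upstream(1_000, 500_000));
    // Nearly caught up: start checking so backfill can finish.
    assert!(should_check_upstream(499_950, 500_000));
}
```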
Do we want to share work for historical data?
Anyway, if such use cases exist, it's of course nice to share work.
I think case B (source starting from latest) is simple and works well in most simple cases.
That's because backfill always starts from the specified offset, and where the upstream starts only affects when backfill finishes. For the source executor, it only changes the starting position; once an offset is persisted in the state table, we will use that instead. So it's also fine.
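A tiny sketch of that resolution order. The function name is illustrative, and the offset is left generic (Kafka offsets are strings in practice):

```rust
/// Illustrative sketch: a persisted state-table offset takes precedence over
/// the offset specified when the job was created.
fn resolve_start_offset<O>(persisted: Option<O>, specified: O) -> O {
    persisted.unwrap_or(specified)
}

fn main() {
    // Fresh job: no persisted state, so start from the specified offset.
    assert_eq!(resolve_start_offset(None, 42), 42);
    // After recovery: the persisted offset wins.
    assert_eq!(resolve_start_offset(Some(100), 42), 100);
}
```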
In most business logic, Kafka historical data does not matter; it is far away from "real-time". But for a streaming database, a more common case is that online data is first written to an OLTP database for online serving and duplicated to Kafka for analysis. In that scenario, this feature is not the most frequently used one, but it is essential.
I have a little concern with the PR's objective: when creating a streaming job with a source, we update …
@tabVersion With #16626, we are not going to implement the original idea of this issue any more. We will backfill for every MV 👀
Previously I thought #16348 was enough to avoid wasted work, but that seems not to be the case.
During this benchmark, we can see that the backfill never catches up with the upstream source before the benchmark finishes.
https://grafana.test.risingwave-cloud.xyz/d/EpkBw5W4k/risingwave-dev-dashboard?orgId=1&var-datasource=Prometheus:%20test-useast1-eks-a&from=1714671911000&to=1714673747000&var-namespace=xxtest1
Actually, for the first MV we can directly skip the backfill stage, since the source is paused before the MV is created. However, to implement this, we need to somehow let the MV know whether it is the first one.
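One hedged sketch of how that might look, assuming a hypothetical `is_first_mv` flag could be plumbed down to the executor (e.g., via a barrier mutation; this is not an actual API):

```rust
/// Simplified backfill state, loosely modeled after the executor's notion of
/// "still backfilling" vs. "backfill finished".
enum BackfillState {
    Backfilling { start_offset: Option<String> },
    Finished,
}

/// Hypothetical: if this MV is the first one on a paused shared source, there
/// is no gap between where the source starts and where the MV starts reading,
/// so we could begin directly in the Finished state and skip backfill.
fn initial_backfill_state(is_first_mv: bool) -> BackfillState {
    if is_first_mv {
        BackfillState::Finished
    } else {
        BackfillState::Backfilling { start_offset: None }
    }
}
```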