-
Notifications
You must be signed in to change notification settings - Fork 590
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Discussion: Handling inconsistent streams #14031
Comments
BTW, the internal state / cache inconsistency could also be led by data corruption in the storage layer, which is even harder to debug. 😕 The factor can be eliminated by using an in-memory storage backend under the testing feature. However, this also implies that it cannot be covered in production. |
Can you elaborate more on how this can catch data corruption on storage layer? Or what kinds of data corruption it can catch? |
IIRC, we once encountered an issue that the file cache was not correctly invalidated after node restart, which caused the read to return totally irrelevant results. As the in-memory state backend is simply enough, I guess we can assume that there's no such issue. |
Add another reason for inconsistent stream: backwards incompatibility. If the user is using stable features, it's likely the bug could be triggered by backwards compatibility after upgrade. |
Another case recently that occurred, which caused hash join state to become inconsistent. A complementary approach suggested by @fuyufjh . First we tolerate the inconsistency to prevent the cluster from crashing, but leave an error log out. For example, if a row is to be inserted into the cache, but it already exists, it means that this row has been seen before, and we should skip over it. It should still not happen though, so we should log an error. |
+1. I have discussed this with multiple people recently, and at least we all agree that we can hardly do anything when meeting this problem. |
I will work on the testing part, fuzzing with checks on the data stream. |
When talking with @stdrc yesterday, we think we can provide an option like "non-strict mode" to warn these inconsistent problems but not panic. As an option, we will always use "strict mode" for testing; while for production deployment, we can decide to enable or disable case by case. Some known places that need to be refactored:
There should be more cases... Need to review code to find out. |
Let me share more info about the three occurrences of the panics:
Updated: there is no temporal filter involved in 1 and 3. The only common stateful operator seems to be |
Suspicious join related PR introduced since v1.4.0: #13214 |
Could because #13351, I will do some invesitagitions |
one prod case of inconsistency: #14197 |
Another idea I have, we can enable the stream consistency check to be toggled at runtime. And it will catch most cases. It can be supported in the same way as described in the issue description, by adding an additional executor, and using a system variable to toggle it. Edit: Edit 2: So it can work, limited by the recent operations. Edit 3: |
This issue has been open for 60 days with no activity. If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the You can also confidently close this issue as not planned to keep our backlog clean. |
After some work in previous quarter, we already have a non-strict mode to allow inconsistent streams in out system. And we now have a description for the related config item in https://docs.risingwave.com/docs/dev/node-specific-configurations/#streaming-configurations. So I guess for now we can close this issue as completed? While we have to acknowledge that the non-strict mode is just to ignore inconsistency. We still don't have a better way to identify inconsistency. We do have many code in several executors that checks the If anyone has any other thoughts, plz feel free to reopen this issue. |
Problems
v1
is a primary key:Solutions
feature
, which adds an extra executor between each executor which stores the state of the stream passing through, and checks for inconsistency. Use it in fuzzing test, e.g. sqlsmith. Then we can find inconsistency bugs. This cannot solve the case in production though.Any other ideas?
The text was updated successfully, but these errors were encountered: