
Discussion: Handling inconsistent streams #14031

Closed
kwannoel opened this issue Dec 18, 2023 · 16 comments

@kwannoel
Contributor

kwannoel commented Dec 18, 2023

Problems

  • For stream executors, the input stream from upstream does not undergo consistency checks.
  • For example, consider the following input, where v1 is a primary key (the same key is inserted twice):
    | Op | v1 | v2 | v3 |
    | +  |  1 |  2 |  3 |
    | +  |  1 |  2 |  3 |

  • The executor may panic only when its internal state / cache becomes inconsistent.
  • The actual source of the inconsistency is upstream.
  • We have to enumerate all upstream executors to discover the actual root cause.
  • This might be fine when the stream graph is small, but when it is large we cannot find the root cause quickly.

Solutions

  1. Introduce a testing feature that adds an extra executor between every pair of executors; it stores the state of the stream passing through and checks it for inconsistency. Use it in fuzzing tests, e.g. sqlsmith, so we can find inconsistency bugs (a rough sketch of such a check is below). This cannot cover the production case, though.
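
A minimal standalone sketch of what such a checking executor could verify. The `Op` enum is simplified (the real op set also has update variants), and `SanityCheckLayer` and its methods are hypothetical names, not the project's actual executor API:

```rust
use std::collections::HashSet;

/// Simplified stream operations; the real `Op` also has update variants.
#[derive(Clone, Copy, Debug)]
enum Op {
    Insert,
    Delete,
}

/// Hypothetical layer sitting between two executors. It mirrors the set of live
/// primary keys and flags chunks that violate the basic invariants: inserting a
/// key that already exists, or deleting a key that does not.
struct SanityCheckLayer {
    live_keys: HashSet<Vec<u8>>, // serialized primary keys currently visible downstream
}

impl SanityCheckLayer {
    fn new() -> Self {
        Self { live_keys: HashSet::new() }
    }

    /// Check a single (op, primary key) pair from a passing chunk.
    fn check(&mut self, op: Op, pk: &[u8]) -> Result<(), String> {
        match op {
            Op::Insert => {
                // `insert` returns false if the key was already present.
                if !self.live_keys.insert(pk.to_vec()) {
                    return Err(format!("duplicate insert for pk {pk:?}"));
                }
            }
            Op::Delete => {
                // `remove` returns false if the key was not present.
                if !self.live_keys.remove(pk) {
                    return Err(format!("delete of non-existent pk {pk:?}"));
                }
            }
        }
        Ok(())
    }
}

fn main() {
    // The problematic input from the issue description: the same pk inserted twice.
    let mut checker = SanityCheckLayer::new();
    assert!(checker.check(Op::Insert, b"1").is_ok());
    assert!(checker.check(Op::Insert, b"1").is_err()); // caught: duplicate insert
}
```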

Any other ideas?

@github-actions github-actions bot added this to the release-1.6 milestone Dec 18, 2023
@BugenZhao
Member

BTW, the internal state / cache inconsistency could also be caused by data corruption in the storage layer, which is even harder to debug. 😕 The factor can be eliminated by using an in-memory storage backend under the testing feature. However, this also implies that it cannot be covered in production.

@kwannoel
Contributor Author

The factor can be eliminated by using an in-memory storage backend under the testing feature.

Can you elaborate on how this can catch data corruption in the storage layer, or what kinds of data corruption it can catch?

@BugenZhao
Member

BugenZhao commented Dec 18, 2023

The factor can be eliminated by using an in-memory storage backend under the testing feature.

Can you elaborate on how this can catch data corruption in the storage layer, or what kinds of data corruption it can catch?

IIRC, we once encountered an issue where the file cache was not correctly invalidated after a node restart, which caused reads to return totally irrelevant results. As the in-memory state backend is simple enough, I guess we can assume there's no such issue there.

@kwannoel
Contributor Author

kwannoel commented Dec 18, 2023

Another reason for inconsistent streams: backwards incompatibility.

If the user is only using stable features, it's likely the bug is triggered by a backwards-incompatible change after an upgrade.

@kwannoel
Contributor Author

kwannoel commented Dec 20, 2023

Another case occurred recently that caused hash join state to become inconsistent.

A complementary approach suggested by @fuyufjh: first, tolerate the inconsistency to prevent the cluster from crashing, but emit an error log.

For example, if a row is to be inserted into the cache but it already exists, it means this row has been seen before, and we should skip it. Since this should never happen, we should still log an error.
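
A minimal sketch of that tolerance strategy, assuming a simple pk-keyed row cache; the type and method names are hypothetical, not the actual hash join cache:

```rust
use std::collections::HashMap;

/// Hypothetical row cache keyed by serialized primary key.
struct TolerantCache {
    rows: HashMap<Vec<u8>, Vec<u8>>, // pk -> serialized row
}

impl TolerantCache {
    fn new() -> Self {
        Self { rows: HashMap::new() }
    }

    /// Insert a row. On a consistent stream the key is never present; if it is,
    /// tolerate it: keep the existing entry, skip the new row, and emit an error
    /// log so the inconsistency stays visible (real code would use the project's
    /// logging facility instead of eprintln!).
    fn insert(&mut self, pk: Vec<u8>, row: Vec<u8>) {
        if self.rows.contains_key(&pk) {
            eprintln!("inconsistent stream: pk {pk:?} inserted twice, skipping");
            return;
        }
        self.rows.insert(pk, row);
    }
}

fn main() {
    let mut cache = TolerantCache::new();
    cache.insert(b"1".to_vec(), b"row-a".to_vec());
    cache.insert(b"1".to_vec(), b"row-a".to_vec()); // tolerated: logged and skipped
    assert_eq!(cache.rows.len(), 1);
}
```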

@fuyufjh
Member

fuyufjh commented Dec 20, 2023

+1. I have discussed this with multiple people recently, and at least we all agree that we can hardly do anything when this problem occurs.

@kwannoel
Contributor Author

I will work on the testing part, fuzzing with checks on the data stream.

@kwannoel kwannoel self-assigned this Dec 20, 2023
@fuyufjh
Member

fuyufjh commented Dec 20, 2023

When talking with @stdrc yesterday, we concluded that we can provide an option like a "non-strict mode" that warns about these inconsistency problems instead of panicking. We will always use "strict mode" for testing, while for production deployments we can decide to enable or disable it case by case.

Some known places that need to be refactored:

  • HashAgg
    src/stream/src/executor/aggregation/agg_group.rs:320:13: row count should be non-negative
    
  • HashJoin
    src/stream/src/executor/managed_state/join/join_entry_state.rs:43:44 unwrap()
    src/stream/src/executor/managed_state/join/join_entry_state.rs:51:13 pk [...] should be in the cache
    
  • Inside MemTable
    • Search for handle_mem_table_error()

There should be more cases... We need to review the code to find them.
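
A minimal sketch of what such a strict / non-strict switch could look like at call sites like the ones above. The flag and function names are hypothetical, and a real implementation would read the mode from configuration rather than a process-wide global:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

/// Whether inconsistency is treated as fatal. Hypothetical: a real system would
/// plumb this through configuration instead of a global flag.
static STRICT_CONSISTENCY: AtomicBool = AtomicBool::new(true);

/// Report an inconsistency: panic in strict mode (testing), log and continue in
/// non-strict mode (production, opt-in).
fn report_inconsistency(msg: &str) {
    if STRICT_CONSISTENCY.load(Ordering::Relaxed) {
        panic!("stream inconsistency: {msg}");
    } else {
        eprintln!("stream inconsistency (ignored in non-strict mode): {msg}");
    }
}

fn main() {
    // Example call site, e.g. where HashAgg would otherwise assert that the
    // row count is non-negative.
    STRICT_CONSISTENCY.store(false, Ordering::Relaxed);
    report_inconsistency("row count became negative, clamping to 0");
}
```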

@hzxa21
Collaborator

hzxa21 commented Dec 20, 2023

When talking with @stdrc yesterday, we concluded that we can provide an option like a "non-strict mode" that warns about these inconsistency problems instead of panicking. We will always use "strict mode" for testing, while for production deployments we can decide to enable or disable it case by case.

Some known places that need to be refactored:

  • HashAgg
    src/stream/src/executor/aggregation/agg_group.rs:320:13: row count should be non-negative
    
  • HashJoin
    src/stream/src/executor/managed_state/join/join_entry_state.rs:43:44 unwrap()
    src/stream/src/executor/managed_state/join/join_entry_state.rs:51:13 pk [...] should be in the cache
    

There should be more cases... We need to review the code to find them.

Let me share more info about the three occurrences of the panics:

  1. join_entry_state.rs:51:13 pk [...] should be in the cache
    • image: nightly-20231123
    • stateful operators involved (no agg):
      • inner join
      • left outer join
  2. agg_group.rs:320:13: row count should be non-negative
    • image v1.5.0
    • stateful operators involved
      • inner join
      • left outer join
      • temporal filter
      • agg
  3. join_entry_state.rs:43:44 unwrap()
    • image v1.4.0
    • stateful operators involved
      • left outer join
      • agg

1 and 2 seem to be double deletes, while 3 seems to be a double insert. I wonder whether this could be caused by the temporal filter emitting a previously emitted row to downstream in some corner cases.

Updated: there is no temporal filter involved in 1 and 3. The only common stateful operator seems to be the left outer join.

@hzxa21
Collaborator

hzxa21 commented Dec 20, 2023

When talking with @stdrc yesterday, we concluded that we can provide an option like a "non-strict mode" that warns about these inconsistency problems instead of panicking. We will always use "strict mode" for testing, while for production deployments we can decide to enable or disable it case by case.
Some known places that need to be refactored:

  • HashAgg
    src/stream/src/executor/aggregation/agg_group.rs:320:13: row count should be non-negative
    
  • HashJoin
    src/stream/src/executor/managed_state/join/join_entry_state.rs:43:44 unwrap()
    src/stream/src/executor/managed_state/join/join_entry_state.rs:51:13 pk [...] should be in the cache
    

There should be more cases... We need to review the code to find them.

Let me share more info about the three occurrences of the panics:

  1. join_entry_state.rs:51:13 pk [...] should be in the cache

    • image: nightly-20231123

    • stateful operators involved (no agg):

      • inner join
      • left outer join
      • temporal filter
  2. agg_group.rs:320:13: row count should be non-negative

    • image v1.5.0

    • stateful operators involved

      • inner join
      • left outer join
      • temporal filter
      • agg
  3. join_entry_state.rs:43:44 unwrap()

    • image v1.4.0

    • stateful operators involved

      • left outer join
      • agg

1 and 2 seem to be double deletes, while 3 seems to be a double insert. I wonder whether this could be caused by the temporal filter emitting a previously emitted row to downstream in some corner cases.

Updated: there is no temporal filter involved in 3. The only common stateful operator seems to be the left outer join.

Suspicious join-related PR introduced since v1.4.0: #13214

@st1page
Contributor

st1page commented Dec 23, 2023

This could be because of #13351; I will do some investigation.

@st1page
Contributor

st1page commented Dec 23, 2023

This could be because of #13351; I will do some investigation.

It doesn't seem so... #14166

@lmatz
Contributor

lmatz commented Dec 26, 2023

One production case of inconsistency: #14197

@kwannoel
Contributor Author

kwannoel commented Jan 9, 2024

Another idea: we can make the stream consistency check toggleable at runtime.
This means that inconsistency bugs which trigger crash loops can be caught and debugged in the user's cluster.

It should catch most cases.

It can be supported in the same way as described in the issue description, by adding an additional executor and using a system variable to toggle it.

Edit:
@fuyufjh raised a good point: historical data would then need to be stored indefinitely to detect this, so it seems it can only be used in testing again.

Edit 2:
From an offline discussion with @st1page: @TennyZhuang proposed storing the most recent operations on a log / local disk, bounded by a window, e.g. the 10,000 most recent operations, and using that for the check (a sketch is below).

So it can work, limited to that window of recent operations.

Edit 3:
Typically these bugs are triggered by very large data volumes.
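
A minimal sketch of the bounded-history check, assuming an in-memory window; the names are hypothetical, and a real version would persist the window to a log or local disk as proposed above:

```rust
use std::collections::{HashMap, VecDeque};

/// Keep only the most recent operations (here: (op, pk) pairs) so the check can
/// run without retaining the full history.
struct RecentOps {
    capacity: usize,
    order: VecDeque<(u8, Vec<u8>)>,        // arrival order, used for eviction
    counts: HashMap<(u8, Vec<u8>), usize>, // occurrences within the window
}

impl RecentOps {
    fn new(capacity: usize) -> Self {
        Self { capacity, order: VecDeque::new(), counts: HashMap::new() }
    }

    /// Record an operation; returns true if the identical (op, pk) pair already
    /// appeared within the retained window, which hints at an inconsistency.
    fn record(&mut self, op: u8, pk: &[u8]) -> bool {
        let key = (op, pk.to_vec());
        let duplicate = self.counts.get(&key).copied().unwrap_or(0) > 0;
        *self.counts.entry(key.clone()).or_insert(0) += 1;
        self.order.push_back(key);
        if self.order.len() > self.capacity {
            // Evict the oldest operation so memory stays bounded.
            if let Some(old) = self.order.pop_front() {
                if let Some(c) = self.counts.get_mut(&old) {
                    *c -= 1;
                    if *c == 0 {
                        self.counts.remove(&old);
                    }
                }
            }
        }
        duplicate
    }
}

fn main() {
    const OP_INSERT: u8 = 1;
    let mut history = RecentOps::new(10_000);
    assert!(!history.record(OP_INSERT, b"1")); // first time: fine
    assert!(history.record(OP_INSERT, b"1"));  // same insert again within the window
}
```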


github-actions bot commented Jul 3, 2024

This issue has been open for 60 days with no activity.

If you think it is still relevant today, and needs to be done in the near future, you can comment to update the status, or just manually remove the no-issue-activity label.

You can also confidently close this issue as not planned to keep our backlog clean.
Don't worry if you think the issue is still valuable to continue in the future.
It's searchable and can be reopened when it's time. 😄

@stdrc
Member

stdrc commented Jul 23, 2024

After some work in the previous quarter, we already have a non-strict mode to allow inconsistent streams in our system, and we now have a description of the related config item at https://docs.risingwave.com/docs/dev/node-specific-configurations/#streaming-configurations. So I guess we can close this issue as completed for now?

That said, we have to acknowledge that the non-strict mode merely ignores inconsistency; we still don't have a better way to identify it. We do have a lot of code in several executors that checks the Ops and data in the input chunk, but not all executors are covered, so in strict mode it is still hard to backtrack from a panic to its origin.
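
For illustration, this is the flavour of per-chunk op check being referred to, written as a standalone sketch. The pairing rule (an UpdateDelete must be immediately followed by its UpdateInsert) is my reading of the chunk convention, so treat both the rule and the function as illustrative rather than the project's actual code:

```rust
/// Simplified op kinds of a stream chunk.
#[derive(Clone, Copy, PartialEq)]
enum Op {
    Insert,
    Delete,
    UpdateDelete,
    UpdateInsert,
}

/// Check the op sequence of one chunk: every UpdateDelete must be immediately
/// followed by an UpdateInsert, and an UpdateInsert must not appear on its own.
fn check_chunk_ops(ops: &[Op]) -> Result<(), String> {
    let mut i = 0;
    while i < ops.len() {
        match ops[i] {
            Op::UpdateDelete => {
                if ops.get(i + 1) != Some(&Op::UpdateInsert) {
                    return Err(format!("UpdateDelete at {i} is not followed by UpdateInsert"));
                }
                i += 2; // consume the pair
            }
            Op::UpdateInsert => return Err(format!("unpaired UpdateInsert at {i}")),
            Op::Insert | Op::Delete => i += 1,
        }
    }
    Ok(())
}

fn main() {
    assert!(check_chunk_ops(&[Op::Insert, Op::UpdateDelete, Op::UpdateInsert]).is_ok());
    assert!(check_chunk_ops(&[Op::UpdateDelete, Op::Delete]).is_err());
}
```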

If anyone has any other thoughts, please feel free to reopen this issue.

@stdrc stdrc closed this as completed Jul 23, 2024
@stdrc stdrc self-assigned this Jul 23, 2024