Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

doc: add doc on rate limit and backpressure #16450

Closed
wants to merge 2 commits into from

Conversation

kwannoel
Copy link
Contributor

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features Sqlsmith: Sql feature generation #7934).
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@github-actions github-actions bot added the component/doc Issues about documentation. label Apr 23, 2024
The data between barriers takes a long time to process.
This can be due to high amplification, UDF latency, or other reasons.

By default, Barriers have an interval of 256ms.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be 1s.

barrier_interval_ms = 1000

However, the channel will not immediately backpressure A and B.
It will only do so once it is full.

This means that the stream chunks between barriers are equivalent to channel capacity.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we process streaming data in batches so there have to be some records buffered in each executor (typically within the chunk builder as the local variables). I suppose they also act as channels in this sense.


### Backfill

In a newly created MV, it will undergo Backfilling, to fill the MV with historical data and fresh data concurrently.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After creating new jobs we'll naturally have more channels to fill in order to make back-pressure take effects. This might also be a reason why we get temporary increase in barrier latency.

@kwannoel kwannoel added this pull request to the merge queue Apr 26, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 26, 2024
This can be due to high amplification, UDF latency, or other reasons.

By default, Barriers have an interval of 1s.
If records between those barrier take longer than 256ms to process, the barrier latency will increase by that difference.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you elaborate more on why 256ms? It seems to have few context here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, didn't get it

Copy link
Member

@stdrc stdrc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM except for some barrier interval mismatches in examples

This can be due to high amplification, UDF latency, or other reasons.

By default, Barriers have an interval of 1s.
If records between those barrier take longer than 256ms to process, the barrier latency will increase by that difference.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, didn't get it

```

The barrier latency for Y will be 0ms, since our barrier interval is 256ms,
and the time between X and Y is less than 256ms.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this example assume barrier_interval to be 256ms? Why not the default value 1000ms?

_example.toml_
```toml
[streaming.developer]
stream_exchange_initial_permits = 2048
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
stream_exchange_initial_permits = 2048
# The initial permits that a channel holds, i.e., the maximum row count can be buffered in the channel.
stream_exchange_initial_permits = 2048


### Considering Amplification and UDF latency

When there's high amplification or UDF latency in some downstream actor, upstream will naturally be backpressured.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Amplification" is not a widely-used term, although we are all know it. Let's add some introduction of it.

Amplification refers to a common phenomenon in streaming. For example, consider the following Join:

   SELECT * FROM users JOIN orders ON orders.user_id = users.id

We know that each user often has multiple orders, and a minority of users may have thousands of orders. If a field in the user table is updated, then reflected in the results of the above JOIN query, there could be thousands of changed rows. Such amplification can have unexpected performance impacts on streaming jobs, and therefore requires special attention.

Copy link
Contributor

github-actions bot commented Jul 3, 2024

This PR has been open for 60 days with no activity.
If it's blocked by code review, feel free to ping a reviewer or ask someone else to review it.
If you think it is still relevant today, and have time to work on it in the near future, you can comment to update the status, or just manually remove the no-pr-activity label.
You can also confidently close this PR to keep our backlog clean. (If no further action taken, the PR will be automatically closed after 7 days. Sorry! 🙏) Don't worry if you think the PR is still valuable to continue in the future. It's searchable and can be reopened when it's time. 😄

Copy link
Contributor

Close this PR as there's no further actions taken after it is marked as stale for 7 days. Sorry! 🙏

You can reopen it when you have time to continue working on it.

@github-actions github-actions bot closed this Jul 12, 2024
@stdrc
Copy link
Member

stdrc commented Jul 26, 2024

Any updates?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
component/doc Issues about documentation. no-pr-activity
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants