
performance regression (for debug & single binary) of backtrace capturing after bumping toolchain on macOS #6131

Closed · Tracked by #6103
xxchan opened this issue Oct 31, 2022 · 24 comments
Labels: type/bug (Something isn't working)

@xxchan
Member

xxchan commented Oct 31, 2022

Describe the bug

I found that risingwave becomes significantly slow on my Mac; e.g., e2e_test/batch/basic/array.slt.part takes 30+ seconds. After a git bisect, I found the regression was introduced by #6025.

To Reproduce

No response

Expected behavior

No response

Additional context

@skyzh can't reproduce it, and CI also works fine, so someone else needs to confirm. Confirmed: it happens in debug & single-binary mode.

@xxchan xxchan added the type/bug Something isn't working label Oct 31, 2022
@github-actions github-actions bot added this to the release-0.1.14 milestone Oct 31, 2022
@skyzh
Contributor

skyzh commented Oct 31, 2022

Can you collect some perf results on that? Thanks!

@chenzl25
Contributor

I also found this regression in debug mode: Sysbench QPS dropped to almost zero. However, everything works well in release mode.

@skyzh
Contributor

skyzh commented Oct 31, 2022

The single binary doesn't work well, while the multi-binary setup looks good. Is there anything that might have gone wrong with the madsim update? cc @wangrunji0408 (there should be no problem with madsim).

@xxchan xxchan changed the title from "performance regression after bumping toolchain" to "performance regression (for debug & single binary) after bumping toolchain" Oct 31, 2022
@skyzh
Contributor

skyzh commented Oct 31, 2022

Probably not; it seems to be a problem with the frontend node.

@skyzh
Contributor

skyzh commented Oct 31, 2022

[screenshot]

And ok_or_else seems to consume a significant amount of CPU time.

@skyzh
Contributor

skyzh commented Oct 31, 2022

[screenshot]

... because the backtrace is slow to create.

@skyzh
Contributor

skyzh commented Oct 31, 2022

In multi-binary:

backtrace captured in 1ms
backtrace captured in 0ms
backtrace captured in 0ms
backtrace captured in 0ms
backtrace captured in 0ms
backtrace captured in 0ms
backtrace captured in 0ms
backtrace captured in 0ms
backtrace captured in 0ms
backtrace captured in 0ms

In single-binary:

backtrace captured in 393ms
backtrace captured in 369ms
backtrace captured in 370ms
backtrace captured in 365ms
backtrace captured in 362ms

Probably a bug with the Rust compiler, but I would still recommend not returning so many errors.

@skyzh
Contributor

skyzh commented Oct 31, 2022

Related function:

    pub fn get_schema_by_name(&self, db_name: &str, schema_name: &str) -> Result<&SchemaCatalog> {
        self.get_database_by_name(db_name)?
            .get_schema_by_name(schema_name)
            .ok_or_else(|| {
                // Time how long building the error takes; this is where the
                // backtrace gets captured.
                let x = Instant::now();
                let err = CatalogError::NotFound("schema", schema_name.to_string()).into();
                println!("backtrace captured in {}ms", x.elapsed().as_millis());
                err
            })
    }
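
For comparison, here is a minimal, self-contained sketch (not RisingWave code; it assumes Rust 1.65+, where `std::backtrace` is stable) that times a single eager backtrace capture outside the catalog path:

    use std::backtrace::Backtrace;
    use std::time::Instant;

    fn main() {
        for _ in 0..5 {
            let start = Instant::now();
            // force_capture always walks the stack, regardless of RUST_BACKTRACE,
            // similar to what eager error construction does on the hot path.
            let bt = Backtrace::force_capture();
            println!(
                "backtrace captured in {}ms ({} lines of rendered output)",
                start.elapsed().as_millis(),
                bt.to_string().lines().count()
            );
        }
    }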

@skyzh
Contributor

skyzh commented Nov 1, 2022

The suspicious thing is the DWARF region in the binary.

On 2022-07-29:

objdump target/debug/risingwave  --dwarf=frame | wc -l
     371

On 2022-10-16:

objdump target/debug/risingwave  --dwarf=frame | wc -l
 14671494

@skyzh
Contributor

skyzh commented Nov 1, 2022

Shall we file a bug report with rust-lang? (Though I believe this is more of an LLVM bug / LLVM-upgrade issue.)

@skyzh
Contributor

skyzh commented Nov 1, 2022

Maybe related: rust-lang/rust#98051

@skyzh
Contributor

skyzh commented Nov 1, 2022

I would suggest keeping this issue open while changing the frontend to return Option instead of an error.

@BugenZhao
Member

I would suggest keeping this issue open while changing the frontend to return Option instead of an error.

I think most of the errors generated in the frontend are for responding to the user, which cannot be avoided.

@fuyufjh
Member

fuyufjh commented Nov 1, 2022

Is it caused by capturing the stacktrace?

@skyzh
Contributor

skyzh commented Nov 1, 2022

I think most of the errors generated in the frontend are for responding to the user, which cannot be avoided.

No. The e2e test cases that timed out should not have any user-facing error. All tables exist.

Is it caused by capturing the stacktrace?

Yes.

@skyzh
Contributor

skyzh commented Nov 1, 2022

There seem to be new features in Rust / LLVM that generate a special kind of DWARF debug info containing stack-unwinding region information. In debug mode it's just slow to unwind.
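
If it helps narrow things down, here is a hedged diagnostic sketch (it assumes the `backtrace` crate as a dependency, which is not necessarily what our error type uses) that separates the stack-walking cost from the symbol-resolution cost:

    use std::time::Instant;

    fn main() {
        let t = Instant::now();
        // Walk the stack only, using the binary's unwind info; no symbolication yet.
        let mut bt = backtrace::Backtrace::new_unresolved();
        println!("unwind only: {}ms", t.elapsed().as_millis());

        let t = Instant::now();
        // Resolve the captured frames to symbols and file/line info on top of that.
        bt.resolve();
        println!("resolve:     {}ms", t.elapsed().as_millis());
    }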

@fuyufjh
Member

fuyufjh commented Nov 2, 2022

It's not a big issue, but it still affects the developer experience 🥲 What is our solution?

  1. Wait for the Rust community to fix it. (We don't know how long that will take.)
  2. Refactor the code that captures the stack trace into the error (a possible shape is sketched below).
  3. Any other ideas?
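
For (2), one possible shape of a cheaper error, sketched with illustrative names (not the actual RisingWave types, and assuming Rust 1.65+ where `std::backtrace` is stable): keep the error type, but use `Backtrace::capture()`, which only walks the stack when `RUST_BACKTRACE` / `RUST_LIB_BACKTRACE` is set, so plain debug runs skip the expensive part.

    use std::backtrace::Backtrace;

    #[derive(Debug)]
    struct CatalogNotFound {
        kind: &'static str,
        name: String,
        // Disabled (and essentially free) unless backtraces are enabled via env vars.
        backtrace: Backtrace,
    }

    impl CatalogNotFound {
        fn new(kind: &'static str, name: impl Into<String>) -> Self {
            Self {
                kind,
                name: name.into(),
                // capture() checks the env vars first, so it avoids the stack walk by default.
                backtrace: Backtrace::capture(),
            }
        }
    }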

@skyzh
Contributor

skyzh commented Nov 2, 2022

Wait for the Rust community to fix it.

I don't think this is something that will be fixed. It looks more like a feature that provides better backtraces.

Refactor the code that captures the stack trace into the error

I think a fix is in progress to change them to the cheaper CatalogError.

@fuyufjh
Member

fuyufjh commented Nov 2, 2022

cc. @BowenXiao1999

@BugenZhao
Member

I've opened an issue for this -> #6157.

@xiangjinwu
Contributor

Backtrace capturing is not only slower, but can also lead to a segfault: #6205

@xxchan
Member Author

xxchan commented Nov 10, 2022

The performance problem is much better now. It took me 200s to run slt-all with the multi-binary build and 250s with the single binary (when the issue was opened, it effectively never finished).

@xiangjinwu
Contributor

Closing this issue in favor of #4077 and #6205.

The slowness is acknowledged in the documentation of Backtrace::capture, and our plan is to avoid unnecessary backtrace captures via #4077.
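
A hedged, self-contained sketch of that "avoid unnecessary capture" direction (illustrative names only, not the actual #4077 change): internal lookups return Option and stay backtrace-free, and the expensive error is built once at the user-facing boundary.

    use std::backtrace::Backtrace;
    use std::collections::HashMap;

    struct Catalog {
        schemas: HashMap<String, u32>, // toy stand-in for SchemaCatalog
    }

    impl Catalog {
        // Hot path: no error type, no backtrace, just an Option.
        fn get_schema(&self, name: &str) -> Option<u32> {
            self.schemas.get(name).copied()
        }
    }

    // Boundary: turn a miss into a user-facing error exactly once.
    fn resolve_schema(catalog: &Catalog, name: &str) -> Result<u32, String> {
        catalog.get_schema(name).ok_or_else(|| {
            let bt = Backtrace::capture(); // only walks the stack if enabled via env vars
            format!("schema {name} not found\n{bt}")
        })
    }

    fn main() {
        let catalog = Catalog {
            schemas: HashMap::from([("public".to_string(), 1)]),
        };
        println!("{:?}", resolve_schema(&catalog, "missing"));
    }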

The segfault is worse and is tracked by #6205.

@xiangjinwu xiangjinwu closed this as not planned Nov 14, 2022
@skyzh
Contributor

skyzh commented Nov 15, 2022

Timeout is a separate issue. Reopening.

@skyzh skyzh reopened this Nov 15, 2022
@skyzh skyzh removed their assignment Nov 26, 2022
@BugenZhao BugenZhao removed this from the release-0.1.15 milestone Dec 19, 2022
@BugenZhao BugenZhao changed the title from "performance regression (for debug & single binary) after bumping toolchain" to "performance regression (for debug & single binary) of backtrace capturing after bumping toolchain on macOS" Dec 19, 2022
@xxchan xxchan closed this as completed Apr 8, 2023