Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

perf(connector): support zero-copy access during parsing #17153

Merged
merged 8 commits into from
Jun 10, 2024

Conversation

BugenZhao
Copy link
Member

@BugenZhao BugenZhao commented Jun 6, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

Add a new method named access_cow to Access trait, allowing implementors to return a borrowed DatumRef, to achieve zero-copy parsing.

Currently only Varchar field in JSON returns a borrowed ScalarRef. This improves the performance by ~11% in micro-benchmark nexmark_integration. Will adopt it more in following PRs.

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • All checks passed in ./risedev check (or alias, ./risedev c)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

Copy link
Member Author

BugenZhao commented Jun 6, 2024

This stack of pull requests is managed by Graphite. Learn more about stacking.

Join @BugenZhao and the rest of your teammates on Graphite Graphite

@@ -423,7 +429,7 @@ impl JsonParseOptions {
.map_err(|_| create_error())?
.into(),
// ---- Varchar -----
(DataType::Varchar , ValueType::String) => value.as_str().unwrap().into(),
(DataType::Varchar, ValueType::String) => return Ok(DatumCow::Ref(Some(value.as_str().unwrap().into()))),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The core change: reduce allocation for a ScalarImpl::Utf8(Box<str>).

@BugenZhao BugenZhao marked this pull request as ready for review June 6, 2024 10:07
@BugenZhao BugenZhao requested review from stdrc, xxchan and xiangjinwu June 6, 2024 10:08
@BugenZhao BugenZhao changed the title refactor(connector): support zero-copy access during parsing perf(connector): support zero-copy access during parsing Jun 6, 2024
Base automatically changed from bz/parser-perf-1 to main June 6, 2024 10:48
@@ -533,7 +539,7 @@ impl JsonParseOptions {
}
&BorrowedValue::Static(simd_json::StaticNode::Null)
});
self.parse(field_value, field_type)
self.parse(field_value, field_type).map(|d| d.to_owned_datum())
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the nexmark benchmark takes a struct as input, while it's not possible to achieve zero-copy when parsing a Struct field (yet), the performance gain in this PR may not be reflected.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's possible?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Need further investigation. 😕

Copy link
Member

@xxchan xxchan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this idea makes sense, and varchar is a commonly used path, although not in benchmark. 👍

Also thanks for cleaning up useless lifetimes.


use super::{Datum, DatumRef, ToDatumRef, ToOwnedDatum};

/// 🐮 A borrowed [`DatumRef`] or an owned [`Datum`].
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you explain in doc why not std::borrow::Cow<Datum> here?

Comment on lines -61 to 62
.take(1024)
.take(16384)
.map(|(i, e)| {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👀

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To minimize the impact of other operations.

path: &[&str],
type_expected: &DataType,
) -> AccessResult<DatumCow<'a>> {
self.access(path, type_expected).map(Into::into)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think mutual-recursive default implementation is awkward. But it's OK to me considering the TODO, and this trait will hardly have new implementors. 🤣

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It'll be soon cleaned up in #17165.

Comment on lines -21 to +24
pub(crate) fn json_object_get_case_insensitive<'a, 'b>(
v: &'b simd_json::BorrowedValue<'a>,
key: &'b str,
) -> Option<&'b BorrowedValue<'a>> {
pub(crate) fn json_object_get_case_insensitive<'b>(
v: &'b simd_json::BorrowedValue<'b>,
key: &str,
) -> Option<&'b BorrowedValue<'b>> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

Comment on lines 314 to 316
impl OpAction for OpActionUpdate {
type Output = (Datum, Datum);
type Output<'a> = (DatumCow<'a>, DatumCow<'a>);

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OpActionUpdate and do_update seems to be dead code

@@ -533,7 +539,7 @@ impl JsonParseOptions {
}
&BorrowedValue::Static(simd_json::StaticNode::Null)
});
self.parse(field_value, field_type)
self.parse(field_value, field_type).map(|d| d.to_owned_datum())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think it's possible?

@BugenZhao BugenZhao added this pull request to the merge queue Jun 10, 2024
Signed-off-by: Bugen Zhao <[email protected]>
Signed-off-by: Bugen Zhao <[email protected]>
Signed-off-by: Bugen Zhao <[email protected]>
Signed-off-by: Bugen Zhao <[email protected]>
Signed-off-by: Bugen Zhao <[email protected]>
Signed-off-by: Bugen Zhao <[email protected]>
Signed-off-by: Bugen Zhao <[email protected]>
@BugenZhao BugenZhao removed this pull request from the merge queue due to a manual request Jun 10, 2024
Copy link
Member Author

Merge activity

  • Jun 10, 12:51 AM EDT: Graphite failed to merge this pull request due to This repository has GitHub's merge queue enabled, which is currently incompatible with Graphite. Try your merge again, or report a bug if you see this consistently.

@BugenZhao BugenZhao added this pull request to the merge queue Jun 10, 2024
Merged via the queue into main with commit 3eaf2fb Jun 10, 2024
29 of 30 checks passed
@BugenZhao BugenZhao deleted the bz/parser-perf-2 branch June 10, 2024 05:21
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants