fix(source): parquet file source use number of rows to determine the end of the file reading #18149

Merged: 14 commits into main from wcy/parquet_source_offset, Aug 29, 2024

@wcy-fdu (Contributor) commented Aug 20, 2024

I hereby agree to the terms of the RisingWave Labs, Inc. Contributor License Agreement.

What's changed and what's your intention?

For the Parquet file source, we read files through a RecordBatchStream, whose offset counts the rows read so far, not bytes. The fetch executor decides whether a file has been read completely by comparing that offset with the file size, which is obtained during the list process and is measured in bytes. Because the two values use different units, the comparison cannot reliably detect the end of a Parquet file.
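
To make the unit mismatch concrete, here is a minimal sketch. It is not the RisingWave code: `is_finished` and `read_all_rows` are hypothetical names, and the check `offset >= size` is an assumption about how such a completion test might look.

```rust
/// Hypothetical stand-in for the fetch-side completion check:
/// the file counts as finished once the offset reaches the recorded size.
fn is_finished(offset: u64, size: u64) -> bool {
    offset >= size
}

/// Advance a row-based offset batch by batch, the way a record-batch reader
/// would, and report the final offset and whether the check ever fired.
fn read_all_rows(total_rows: u64, batch_rows: u64, size: u64) -> (u64, bool) {
    assert!(batch_rows > 0, "batch size must be positive");
    let mut offset = 0u64;
    while offset < total_rows {
        // The last batch may be shorter than `batch_rows`.
        let n = batch_rows.min(total_rows - offset);
        offset += n;
        if is_finished(offset, size) {
            return (offset, true);
        }
    }
    (offset, is_finished(offset, size))
}
```

With an assumed file of 1,000 rows occupying 48,000 bytes, `read_all_rows(1000, 256, 48_000)` ends at offset 1,000 without the check ever firing, while passing the row count as the size, `read_all_rows(1000, 256, 1000)`, fires exactly when the last row has been read.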

This PR changes the semantics of the size of OpendalFsSplit for Parquet files so that it holds the total number of rows in the file. Specifically, each time a file name is obtained and before the OpendalFsSplit is constructed, the Parquet file's metadata is read to obtain the total row count.
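
The split construction described above could look roughly like the following sketch. All names here (`SplitSketch`, `ListedFile`, `FileFormat`, `build_split`) are hypothetical and are not the actual OpendalFsSplit API. The row count is supplied by a caller-provided closure that stands in for reading the Parquet metadata, and keeping the byte length for other formats is an assumption.

```rust
/// Hypothetical file formats; only Parquet gets the row-based size.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FileFormat {
    Parquet,
    Other,
}

/// Hypothetical, simplified split; not the real OpendalFsSplit.
#[derive(Debug, PartialEq, Eq)]
struct SplitSketch {
    name: String,
    offset: u64,
    /// Rows for Parquet files, bytes otherwise (assumption).
    size: u64,
}

/// What the list step is assumed to know about a file.
struct ListedFile {
    name: String,
    byte_len: u64,
}

/// Build a split, asking `parquet_num_rows` for the total row count when the
/// file is Parquet. The closure stands in for reading the file's metadata.
fn build_split(
    file: &ListedFile,
    format: FileFormat,
    parquet_num_rows: impl FnOnce(&str) -> Result<u64, String>,
) -> Result<SplitSketch, String> {
    let size = match format {
        FileFormat::Parquet => parquet_num_rows(&file.name)?,
        FileFormat::Other => file.byte_len,
    };
    Ok(SplitSketch {
        name: file.name.clone(),
        offset: 0,
        size,
    })
}
```

Passing the metadata read in as a closure keeps the sketch independent of any Parquet library and makes the failure path explicit: if the metadata cannot be read, no split is produced.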

Todo:

  • add more comments
  • error handling

Checklist

  • I have written necessary rustdoc comments
  • I have added necessary unit tests and integration tests
  • I have added test labels as necessary. See details.
  • I have added fuzzing tests or opened an issue to track them. (Optional, recommended for new SQL features; see "Sqlsmith: Sql feature generation", #7934.)
  • My PR contains breaking changes. (If it deprecates some features, please create a tracking issue to remove them in the future).
  • All checks passed in ./risedev check (or alias, ./risedev c)
  • My PR changes performance-critical code. (Please run macro/micro-benchmarks and show the results.)
  • My PR contains critical fixes that are necessary to be merged into the latest release. (Please check out the details)

Documentation

  • My PR needs documentation updates. (Please use the Release note section below to summarize the impact on users)

Release note

If this PR includes changes that directly affect users or other significant modifications relevant to the community, kindly draft a release note to provide a concise summary of these changes. Please prioritize highlighting the impact these changes will have on users.

@wcy-fdu wcy-fdu requested a review from a team as a code owner August 20, 2024 16:20
@wcy-fdu wcy-fdu requested a review from MrCroxx August 20, 2024 16:20
@github-actions github-actions bot added the type/fix (Bug fix) label Aug 20, 2024
@wcy-fdu wcy-fdu marked this pull request as draft August 20, 2024 16:21
@wcy-fdu wcy-fdu marked this pull request as ready for review August 21, 2024 14:00
@graphite-app graphite-app bot requested a review from a team August 21, 2024 14:00
@wcy-fdu wcy-fdu requested review from tabVersion and hzxa21 August 22, 2024 10:19
@tabVersion (Contributor) left a comment

LGTM

@hzxa21 (Collaborator) left a comment

Left some minor comments. Rest LGTM.

src/connector/src/source/filesystem/file_common.rs (outdated, resolved)
src/connector/src/parser/parquet_parser.rs (outdated, resolved)
src/connector/src/parser/parquet_parser.rs (outdated, resolved)
src/connector/src/parser/parquet_parser.rs (outdated, resolved)
@wcy-fdu wcy-fdu added the need-cherry-pick-release-1.10 (open a cherry-pick PR to branch release-1.10 after the current PR is merged) and need-cherry-pick-release-2.0 labels Aug 26, 2024
@wcy-fdu wcy-fdu enabled auto-merge August 26, 2024 05:59
@wcy-fdu wcy-fdu disabled auto-merge August 26, 2024 06:32
@wcy-fdu wcy-fdu enabled auto-merge August 26, 2024 07:48
@wcy-fdu wcy-fdu disabled auto-merge August 26, 2024 07:56
@graphite-app graphite-app bot requested a review from a team August 29, 2024 06:28
@wcy-fdu wcy-fdu added this pull request to the merge queue Aug 29, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 29, 2024
@wcy-fdu wcy-fdu added this pull request to the merge queue Aug 29, 2024
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Aug 29, 2024
@wcy-fdu wcy-fdu enabled auto-merge August 29, 2024 07:33
@wcy-fdu wcy-fdu added this pull request to the merge queue Aug 29, 2024
Merged via the queue into main with commit a5cbeb7 Aug 29, 2024
29 of 30 checks passed
@wcy-fdu wcy-fdu deleted the wcy/parquet_source_offset branch August 29, 2024 08:36
github-actions bot pushed a commit that referenced this pull request Aug 29, 2024
github-merge-queue bot pushed a commit that referenced this pull request Aug 30, 2024
Labels: need-cherry-pick-release-1.10, need-cherry-pick-release-2.0, type/fix
4 participants