Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support append data file and add e2e test #349

Merged
merged 22 commits into from
Nov 28, 2024
Merged

Conversation

ZENOTME
Copy link
Contributor

@ZENOTME ZENOTME commented Apr 24, 2024

This PR is complete #345.

  1. It adds the FastAppendAction to commit the data file

The design of this is based on https://github.com/apache/iceberg/blob/main/core/src/main/java/org/apache/iceberg/SnapshotProducer.java.

I implement a SnapshotProduceAction which will accept a Vec<ManifestFile> and Summary to generate a new snapshot and apply the snapshot to the tx.

FastAppendAction will reuse SnapshotProduceAction and have its own interface to process the added data files.

In the future, we can reuse SnapshotProduceAction to implement more append actions with different commit semantics as described in #348.

  1. It init the e2e test for write data file

Please let me know if this design has something that can be improved and other things missed.

@ZENOTME
Copy link
Contributor Author

ZENOTME commented Apr 25, 2024

cc @liurenjie1024 @Fokko @Xuanwo

crates/iceberg/src/io.rs Outdated Show resolved Hide resolved
@Fokko Fokko self-requested a review April 25, 2024 07:22
@Fokko Fokko added this to the 0.3.0 Release milestone Apr 25, 2024
@Fokko Fokko mentioned this pull request Apr 25, 2024
73 tasks
Copy link
Contributor

@marvinlanhenke marvinlanhenke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @ZENOTME for taking a stab at this! I like the e2e_test crate which could be beneficial to the catalog impl crates as well (less duplicate test setup and infr)


version: '3.8'

services:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll guess once this is merged when can reuse this and extend with images for Glue and HiveMetastore as well - this would remove duplicates from the specific catalog impl crates and the datafusion/integration crate and avoids having each crate run & define same testing infr.

crates/e2e_test/tests/append_data_file_test.rs Outdated Show resolved Hide resolved
@ZENOTME
Copy link
Contributor Author

ZENOTME commented May 13, 2024

Hi, I have tried to fix this PR. Some things may not be fixed well now:

  1. feat: support append data file and add e2e test #349 (comment)
    I'm not sure whether my understanding is correct
  2. todo, we can do them in later PR:
  1. feat: support append data file and add e2e test #349 (comment)
    Please let me know if there are other things I miss and need to fix. cc @Fokko

@ZENOTME ZENOTME force-pushed the tx_append branch 2 times, most recently from 60d548d to 6a035a3 Compare May 14, 2024 15:10
@Fokko
Copy link
Contributor

Fokko commented Nov 27, 2024

@ZENOTME thanks for getting back to me here. I don't want to be the one holding this up, so I would suggest that we get this in. There are still some gaps I see because I think it is pretty easy to brick a table by adding some incompatible files. I'll raise new issues for that so we can pick up those independently.

Any concerns @c-thiel @liurenjie1024 @Xuanwo ?

@Fokko Fokko merged commit f3a571d into apache:main Nov 28, 2024
17 checks passed
@Fokko Fokko mentioned this pull request Nov 28, 2024
@@ -122,6 +172,365 @@ impl<'a> Transaction<'a> {
}
}

/// FastAppendAction is a transaction action for fast append data files to the table.
pub struct FastAppendAction<'a> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be better to move this part to a standalone module under transaction module.

fn process_manifeset(&self, manifests: Vec<ManifestFile>) -> Vec<ManifestFile>;
}

struct SnapshotProduceAction<'a> {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Usually an action is a transaction action, I think naming it SnapshotProducer would be enough?

// # TODO
// Fulfill this function
fn summary<OP: SnapshotProduceOperation>(&self, snapshot_produce_operation: &OP) -> Summary {
Summary {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need snapshot summary builder.

@liurenjie1024
Copy link
Contributor

@ZENOTME thanks for getting back to me here. I don't want to be the one holding this up, so I would suggest that we get this in. There are still some gaps I see because I think it is pretty easy to brick a table by adding some incompatible files. I'll raise new issues for that so we can pick up those independently.

Any concerns @c-thiel @liurenjie1024 @Xuanwo ?

Hi, @Fokko sorry for late reply, I took a pr and agree that there are still some improvements, and we can resolve them in follow up issues.

@ZENOTME ZENOTME deleted the tx_append branch December 3, 2024 02:08
shaeqahmed pushed a commit to matanolabs/iceberg-rust that referenced this pull request Dec 9, 2024
* support append data file and add e2e test

* fix typos

* refine append action

* fix cargo sort

* add consistent check for partition value

* generate unique snapshot id

* avoid to set snapshot id for v2

* refine test

* fix unit test

* export ports

* fix None case for parant_snapshot_id

* fix parquect schema check

* refactor append action of transaction

* refine

* refine e2e test

* refine commit uuid

* fix file format field to uppercase in manifest

* refine SnapshotProduceAction

* rename e2e_test to integration_tests

* fix

* use docker-compose.yaml from rest catalog

* fix check

---------

Co-authored-by: ZENOTME <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants