Update #2 (merged)
# Description

To address [CVE-2023-47248](https://nvd.nist.gov/vuln/detail/CVE-2023-47248), propose we include the install of `pyarrow-hotfix`.

# Related Issue(s)

N/A

# Tasks

- [x] Updated `pyproject.toml` for the pyarrow dependency to `pyarrow>=14.0.1`
- [x] Updated `Cargo.toml` to `version=0.13.1`
- [x] Minor updates to `CONTRIBUTING.md` to remove `$` so copy/paste works

# Tests ran

- [x] Wheel built:

```
📦 Built wheel for abi3 Python ≥ 3.8 to /var/folders/vb/c_pm7tr92h155xzpz4b0f1kh0000gn/T/.tmpLPVJTe/deltalake-0.13.1-cp38-abi3-macosx_11_0_arm64.whl
🛠 Installed deltalake-0.13.1
```

- [x] Format passed: `41 files left unchanged`
- [x] Check Python passed: `Success: no issues found in 8 source files`
- [x] Unit tests passed: `234 passed, 4 skipped, 25 deselected in 48.66s`

---------

Co-authored-by: Ion Koutsouris <[email protected]>
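For context, a minimal sketch of how the hotfix package operates for environments that cannot yet upgrade to pyarrow 14.0.1; the import side effect is what patches the vulnerable deserialization path:

```python
# Hedged sketch: pyarrow-hotfix patches CVE-2023-47248 purely by being
# imported, which is why adding it as an install dependency is sufficient.
import pyarrow as pa
import pyarrow_hotfix  # noqa: F401 -- the import itself applies the fix

# After the import, untrusted IPC data no longer triggers the unsafe
# ExtensionType deserialization on affected pyarrow versions.
print(pa.__version__)
```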
# Description

This PR adds CHECK constraints on delta tables. I still have some outstanding work to do on this, but this is my working draft PR for this feature.

```rust
let constraint = table.add_constraint().with_constraint("id", "id < 100");
constraint.await?;
```

# Related Issue(s)

#1881

---------

Co-authored-by: Stephen Carman <[email protected]>
Co-authored-by: scarman-db <[email protected]>
Signed-off-by: Nikolay Ulmasov <[email protected]>
…oming from parquet checkpoints, to prevent a mismatch between tombstone and file paths (e.g. the file path is read from a checkpoint while the tombstone path is read from JSON)
# Description

Enable usage of z-order optimization on columns that have capitalized names.

# Related Issue(s)

- closes #1586
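A short hedged example of what this enables from the Python side; the table path and column names here are hypothetical:

```python
from deltalake import DeltaTable

dt = DeltaTable("path/to/table")  # hypothetical table location
# z-order on columns with capitalized names now works as expected
dt.optimize.z_order(["UserId", "EventTime"])
```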
#1979)

# Description

Triggers metadata retrieval only on the metadata call. This is a better approach; otherwise we would need to refresh the metadata after each method that alters the table config. It is now also only triggered if the user actually wants to retrieve the metadata.

Co-authored-by: Robert Pack <[email protected]>
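A minimal sketch of the new behaviour, assuming a hypothetical table path:

```python
from deltalake import DeltaTable

dt = DeltaTable("path/to/table")  # no metadata retrieval happens here
meta = dt.metadata()              # retrieval is triggered only by this call
print(meta.partition_columns)
```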
# Description

Implements consistent formatting for constraint expressions, so something like `value<1000` is normalized to `value < 1000`.

Also includes drive-by improvements:

1. Test & fix that Datafusion expressions can actually be used when adding a constraint
2. Test & fix that constraints can be added to columns with capitalized names

# Related Issue(s)

- closes #1971
…ersion` (#1968)

# Description

Combines the two functions into one.

# Related Issue(s)

- closes #1910
- closes #1967

---------

Co-authored-by: Robert Pack <[email protected]>
# Description

I've changed the API to consolidate how we use writer properties. You now need to instantiate a WriterProperties class and then pass it to the writer, merge, delete, update, and optimize operations.

```python
wp = WriterProperties(compression='gzip', compression_level=1)
dt.optimize.z_order(['foo'], writer_properties=wp)
```

A potential idea I had is to allow users to set the writer properties on the DeltaTable class once, so the properties can be grabbed from the table class and you don't have to provide them to each method.

---------

Co-authored-by: Robert Pack <[email protected]>
# Description

Currently limited to a single constraint at once. I didn't want to change the function definition for this, so I just check whether more than one constraint is passed and raise an error if so. Once multiple constraints are possible in Rust, I can simply remove this size check.

---------

Co-authored-by: Robert Pack <[email protected]>
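A minimal sketch of the size check described above; the internal binding call is hypothetical:

```python
def add_constraint(self, constraints: dict) -> None:
    # Guard until the Rust side supports multiple constraints at once;
    # removing this check is all that will be needed later.
    if len(constraints) > 1:
        raise ValueError("Only a single constraint can be added at a time.")
    self._table.add_constraints(constraints)  # hypothetical internal call
```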
# Description

The GCS tests seem to be failing sometimes; I looked around and hope this change fixes that.
# Description

This upgrades merge so that it can leverage partitions where specified in the join predicate. There are two ways we can leverage partitions:

1. Static references, i.e. `target.partition = 1`.
2. Inferring from the data, i.e. `source.partition = target.partition`.

In the first case, this implements the logic described in [this comment](https://github.com/delta-io/delta-rs/blob/main/crates/deltalake-core/src/operations/merge.rs#L670). Any predicate mentioning the source that is not covered by (2) is pruned, which will leave predicates on just the target columns (and will be amenable to file pruning).

In the second case, we first construct a version of the predicate with references to the source replaced with placeholders:

```sql
target.partition = source.partition and foo > 42
```

becomes:

```sql
target.partition = $1 and foo > 42
```

We then stream through the source table, gathering the distinct tuples of the mentioned partitions:

```
| partition |
|-----------|
| 1         |
| 5         |
| 7         |
```

and then expand out the sql to take these into account:

```sql
(target.partition = 1 and foo > 42)
or (target.partition = 5 and foo > 42)
or (target.partition = 7 and foo > 42)
```

and insert this filter into the target chain. We also use the same filter to process the file list, meaning we only make remove actions for files that will be targeted by the scan.

I considered whether it would be possible to do this via datafusion sql in a generic manner, for example by first joining against the distinct partitions. I don't think it's possible: because each of the filters on the logical plans is static, there's no opportunity for it to push the distinct partition tuples down into the scan. Another variant would be to make the source and partition tables share the same `output_partitioning` structure, but as far as I can tell you wouldn't be able to make the partitions line up such that you can do the merge effectively and not read the whole table (plus `DeltaScan` doesn't guarantee that one datafusion partition is one DeltaTable partition).

I think the static bit is a no-brainer, but the eager read of the source table may cause issues if the source table is of a similar size to the target table. It may be prudent to hide that part behind a feature flag on the merge, but I would love comments on it.

# Performance

I created a 16GB table locally with 1.25 billion rows over 1k partitions. When updating 1 partition, a full merge takes about 1000 seconds:

```
merge took 985.0801 seconds
merge metrics: MergeMetrics { num_source_rows: 1250000, num_target_rows_inserted: 468790, num_target_rows_updated: 781210, num_target_rows_deleted: 0, num_target_rows_copied: 1249687667, num_output_rows: 1250937667, num_target_files_added: 1001, num_target_files_removed: 1001, execution_time_ms: 983851, scan_time_ms: 0, rewrite_time_ms: 983322 }
```

but with partitioning it takes about 3:

```
merge took 2.6337671 seconds
merge metrics: MergeMetrics { num_source_rows: 1250000, num_target_rows_inserted: 468877, num_target_rows_updated: 781123, num_target_rows_deleted: 0, num_target_rows_copied: 468877, num_output_rows: 1718877, num_target_files_added: 2, num_target_files_removed: 2, execution_time_ms: 2622, scan_time_ms: 0, rewrite_time_ms: 2316 }
```

In practice, the tables I want to use this for are terabytes in size, so using merge is currently impractical. This would be a significant speed boost for them.

# Related Issue(s)

closes #1846

---------

Co-authored-by: Ion Koutsouris <[email protected]>
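From the Python API, a hedged sketch of a merge whose predicate both optimizations can exploit; the table path and column names are hypothetical:

```python
import pyarrow as pa
from deltalake import DeltaTable

dt = DeltaTable("path/to/partitioned/table")  # partitioned on `partition`
source = pa.table({"partition": [1, 5, 7], "id": [10, 11, 12], "foo": [43, 44, 45]})

(
    dt.merge(
        source=source,
        # `target.partition = source.partition` lets merge gather the distinct
        # source partitions and prune target files before scanning
        predicate="target.partition = source.partition AND target.id = source.id",
        source_alias="source",
        target_alias="target",
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)
```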
…es (#1959)

# Description

Delta-rs always uses `item` as the list item name when writing lists. If you read data which was, for example, written by Spark, the item name is `element`; in the current implementation it's not possible to write RecordBatches with a different item name. This leads, for example, to the problem that you can't optimize tables which were written by Spark and contain a list column. In this MR I add a condition which will initiate a cast if the list item name of the record batch differs from the one in the target schema. I have also tried to explain this behaviour in the tests, though unfortunately creating the test data has become complicated (happy to get feedback). This is my first MR in this project.

# Related Issue(s)

https://github.com/delta-io/delta-rs/blob/main/crates/deltalake-core/src/kernel/arrow/mod.rs#L58
https://github.com/delta-io/delta-rs/pull/684/files#r940790524
https://delta-users.slack.com/archives/C013LCAEB98/p1701885637615699

---------

Co-authored-by: Jonas Schmitz <[email protected]>
Co-authored-by: Ion Koutsouris <[email protected]>
Co-authored-by: Robert Pack <[email protected]>
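For illustration, a hedged pyarrow sketch of the name mismatch the cast resolves; the data is made up:

```python
import pyarrow as pa

# Spark-style lists name the item field "element"; delta-rs writes "item"
spark_type = pa.list_(pa.field("element", pa.int64()))
delta_type = pa.list_(pa.field("item", pa.int64()))

arr = pa.array([[1, 2], [3]], type=spark_type)
# The cast keeps the values and only renames the list item field
casted = arr.cast(delta_type)
print(casted.type)  # list<item: int64>
```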
# Description

Fixed a few warnings picked up by lint.

---------

Signed-off-by: Nikolay Ulmasov <[email protected]>
# Description

Implements a new Datafusion node called `MergeBarrier` that determines which files have modifications. For files that do not have modifications, a remove action is no longer created.

# Related Issue(s)

- enhances #850
Signed-off-by: Nikolay Ulmasov <[email protected]>
# Description

This helps to avoid this [error](#1998) since you can now set `large_dtypes=False`. Once there is better type coercion upstream in arrow-rs, this param should be removable completely from the writer and merge operations.
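A short hedged usage sketch; the path and data are hypothetical:

```python
import pyarrow as pa
from deltalake import write_deltalake

data = pa.table({"value": ["a", "b", "c"]})
# With large_dtypes=False the writer no longer coerces to large_string etc.,
# avoiding the coercion error linked above
write_deltalake("path/to/table", data, large_dtypes=False)
```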
# Description

- Exposes the custom_metadata to the pyarrow and rust writers
- Addresses a bug in the create operation: we were not passing the app_metadata to the actual commit

# Related Issue(s)

- closes #1990
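A hedged sketch of the exposed option; the metadata keys here are made up:

```python
import pyarrow as pa
from deltalake import write_deltalake

data = pa.table({"id": [1, 2]})
# custom_metadata ends up in the commit's app metadata in the delta log
write_deltalake("path/to/table", data, custom_metadata={"pipeline": "nightly-etl"})
```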
# Description

Forgot to add WriterProperties to the docs page and to mark a deprecation in the docs.
Again, forgot some docs; added the missing descriptions.
Signed-off-by: Nikolay Ulmasov <[email protected]>
There are a number of changes here to untangle the coupling inside deltalake-core and allow deltalake-aws to be separated properly.
Cloudflare R2 doesn't require the use of an external lock when using conditional operation headers like so:

```rust
let s3 = AmazonS3Builder::from_env()
    .with_url(url)
    .with_region("auto")
    .with_access_key_id(&config.s3_access_key_id)
    .with_secret_access_key(&config.s3_secret_access_key)
    .with_endpoint(&config.s3_endpoint)
    // Allows using the S3 API without an external locking provider since
    // Cloudflare R2 provides atomic Put and Copy.
    .with_config(
        AmazonS3ConfigKey::CopyIfNotExists,
        "header: cf-copy-destination-if-none-match: *".to_string(),
    )
    .build()
```

- https://developers.cloudflare.com/r2/api/s3/extensions/#putobject-1
- https://github.com/apache/arrow-rs/blob/c6ba0f764a9142b74c9070db269de04d2701d112/object_store/src/aws/precondition.rs#L29-L42
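The same configuration can plausibly be expressed from Python via storage_options, since deltalake forwards these keys to object_store; this is a hedged sketch, and the endpoint and bucket values are placeholders:

```python
import pyarrow as pa
from deltalake import write_deltalake

storage_options = {
    "AWS_REGION": "auto",
    "AWS_ENDPOINT_URL": "https://<account-id>.r2.cloudflarestorage.com",  # placeholder
    # Mirrors AmazonS3ConfigKey::CopyIfNotExists above, so no lock client is needed
    "copy_if_not_exists": "header: cf-copy-destination-if-none-match: *",
}
write_deltalake("s3://my-bucket/table", pa.table({"x": [1]}), storage_options=storage_options)
```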
This adds an Integration page to the docs re: Dagster.

---------

Co-authored-by: Matthew Powers <[email protected]>
use `dagster-deltalake-polars` instead of `dagster-polars`
# Description

Release the GIL in deltalake.write_deltalake by wrapping it in py.allow_threads.

# Related Issue(s)

- closes #2234
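A hedged sketch of what releasing the GIL enables; the paths are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
from deltalake import write_deltalake

data = pa.table({"id": [1, 2, 3]})

# With the GIL released inside write_deltalake, these writes can overlap
# on I/O instead of serializing behind the interpreter lock.
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(write_deltalake, f"/tmp/table_{i}", data) for i in range(4)]
    for f in futures:
        f.result()
```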
…ctions (#2271)

# Description

- Always encapsulates column names in backticks in the insert_all and update_all calls
- Added a note that users need to add backticks for special column names
- Removed the bigint cast; this was temporarily needed while we were still relying on a physical plan

# Related Issue(s)

- closes #2230
- closes #2167
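A hedged sketch of the remaining manual case: explicit column maps with special characters still need backticks, while insert_all/update_all now quote automatically. The table path and names are hypothetical:

```python
import pyarrow as pa
from deltalake import DeltaTable

dt = DeltaTable("path/to/table")  # hypothetical existing table
source = pa.table({"id": [1], "date-col": ["2024-03-15"]})

(
    dt.merge(source=source, predicate="t.id = s.id", source_alias="s", target_alias="t")
    # when_not_matched_insert_all backtick-quotes names for you; explicit
    # column maps like this one still need them written out
    .when_not_matched_insert(updates={"id": "s.id", "`date-col`": "s.`date-col`"})
    .execute()
)
```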
# Description

Correct the spelling of `without`.

# Related Issue(s)

N/A

# Documentation

N/A
…2286)

# Description

- Replaces assert and AssertionError with built-in exceptions
- Amended tests to reference the new exception types
- Follows the convention in the file of using built-in exceptions rather than custom exceptions

# Related Issue(s)

Closes #2242
# Description

Refactor the commit function to instead be accessed through a builder interface, to prevent breaking call sites when new commit attributes are added. This also introduces a new public struct called `CommitProperties` to be used by any delta operation and to ensure a consistent interface for end users. With this change we should break usage of `app_metadata` on operations and instead have users use this struct.

# Related Issue(s)

- closes: #2131
…2274)

# Description

Such a small change, but it fixes many issues where parquet files were written with arrow from source data in large dtype format. By default the parquet::ParquetReader decodes the arrow metadata, which in return may give you large dtypes. This would cause issues during a DataFusion parquet scan with a filter, since the filter wouldn't coerce to the large dtypes. Simply disabling the arrow metadata decoding gives us the parquet schema converted to an arrow schema without large types 👯‍♂️

# Related Issue(s)

- closes #1470
…rd link (#1868)

Compatible with writing to local file systems that do not support hard links.

# Description

When we write to the local file system, hard links are sometimes not supported, e.g. by blobfuse, goofys, or s3fs, so handle this for compatibility. It is important to note that blobfuse has another problem: rename reports errors because the file handle was not released before the rename. See here for details: #1765

Arrow-rs is required to cooperate with the modification, for example: https://github.com/GlareDB/arrow-rs/pull/2/files
Because object_store has been upgraded to 0.8 there are a lot of breaking changes, so I haven't changed that one for the time being; will fix it after upgrading to 0.8. #1858

# Related Issue(s)

#1765 #1376