
Update #2

Merged
merged 168 commits into from
Mar 15, 2024

Conversation

JonasDev1
Owner

Description

The description of the main changes of your pull request

Related Issue(s)

Documentation

Blajda and others added 30 commits December 12, 2023 09:03
# Description
To address
[CVE-2023-47248](https://nvd.nist.gov/vuln/detail/CVE-2023-47248),
I propose we update the install to include `pyarrow-hotfix`.
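
For context, the mitigation is applied simply by importing the package after installation; a minimal sketch (`pyarrow-hotfix` patches the vulnerable deserialization path on import):

```python
# pip install pyarrow-hotfix
import pyarrow as pa
import pyarrow_hotfix  # noqa: F401 -- the import itself installs the mitigation

# With the hotfix active, deserializing untrusted Arrow IPC data can no
# longer trigger arbitrary code execution via CVE-2023-47248.
print(pa.__version__)
```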

# Related Issue(s)
N/A

# Tasks
- [x] Updated `pyproject.toml` for the pyarrow dependency to
`pyarrow>=14.0.1`
- [x] Updated `Cargo.toml` to `version=0.13.1`
- [x] Minor updates to `CONTRIBUTING.md` to remove `$` so copy/paste
works.

# Tests ran
- [x] Wheel built
```
📦 Built wheel for abi3 Python ≥ 3.8 to /var/folders/vb/c_pm7tr92h155xzpz4b0f1kh0000gn/T/.tmpLPVJTe/deltalake-0.13.1-cp38-abi3-macosx_11_0_arm64.whl
🛠 Installed deltalake-0.13.1
```
- [x] Format passed
```
41 files left unchanged
```
- [x] Check Python passed
```
Success: no issues found in 8 source files
```   
- [x] Unit tests passed
```
234 passed, 4 skipped, 25 deselected in 48.66s 
```

---------

Co-authored-by: Ion Koutsouris <[email protected]>
# Description
This PR adds CHECK constraints on delta tables. I still have some
outstanding work to do on this, but this is my working draft PR for this
feature.

```rust
let constraint = table.add_constraint().with_constraint("id", "id < 100");
constraint.await?;
```
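
For comparison, a hedged sketch of the equivalent call from the Python bindings, assuming the `alter.add_constraint` API is available in your version (the table path is hypothetical):

```python
from deltalake import DeltaTable

dt = DeltaTable("path/to/table")  # hypothetical table path
# Adds a CHECK constraint; subsequent writes that violate it are rejected.
dt.alter.add_constraint({"id": "id < 100"})
```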

# Related Issue(s)
#1881 

# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: Stephen Carman <[email protected]>
Co-authored-by: scarman-db <[email protected]>
…oming from parquet checkpoints, to prevent tombstone and file paths mismatch (e.g. file path is read from checkpoint while tombstone path is read from JSON)
# Description
Enable usage of z-order optimization on columns that have
capitalization.

# Related Issue(s)
- closes #1586
#1979)

# Description
Triggers metadata retrieval only when the metadata is actually requested.
This is a better approach; otherwise we would need to refresh it after
every method that alters the table config. It is now triggered only when
the user actually wants to retrieve the metadata.
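
As an illustration only (not the actual delta-rs code), the lazy pattern looks roughly like this, with `_fetch_metadata` standing in for the real log read:

```python
class TableHandle:
    """Illustrative sketch of lazy metadata retrieval."""

    def __init__(self, path: str):
        self._path = path
        self._metadata = None  # nothing fetched up front

    def metadata(self) -> dict:
        # Retrieval happens only here, on the metadata call itself, rather
        # than being refreshed after every table-altering operation.
        if self._metadata is None:
            self._metadata = self._fetch_metadata()
        return self._metadata

    def _fetch_metadata(self) -> dict:
        return {"path": self._path}  # placeholder for the real log read
```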

Co-authored-by: Robert Pack <[email protected]>
# Description
Implements consistent formatting for constraint expressions, so that an
expression written with irregular spacing (e.g. `value<1000`) is
normalized to `value < 1000`.


Also includes drive-by improvements:
1. Test & Fix that Datafusion expressions can actually be used when
adding a constraint
2. Test & Fix that constraints can be added to column with
capitalization
 
# Related Issue(s)
- closes #1971
…ersion` (#1968)

# Description
Combines the two functions into one.

# Related Issue(s)
- closes #1910
- closes #1967

---------

Co-authored-by: Robert Pack <[email protected]>
# Description
I've changed the API to consolidate how we use writer properties. You
now need to instantiate a WriterProperties class and then pass it to the
writer, merge, delete, update, and optimize operations.

```python
wp = WriterProperties(compression='gzip', compression_level=1)
dt.optimize.z_order(['foo'], writer_properties=wp)
```

A potential idea I had is to allow users to set the writer properties
once on the DeltaTable class, so the properties can be grabbed from the
table class and you don't have to provide them to each method.

---------

Co-authored-by: Robert Pack <[email protected]>
# Description
Currently limited to a single constraint at a time. I didn't want to
change the function definition for this, so I just raise an error if
more than one constraint is passed. Once we have the multiple
constraints possibility in Rust, I can simply remove this size check.

---------

Co-authored-by: Robert Pack <[email protected]>
# Description

The GCS tests seem to fail intermittently; I looked around and am
hoping this change fixes that.
# Description
This upgrades merge so that it can leverage partitions where specified
in the join predicate. There are two ways we can leverage partitions:

1. Static references, e.g. `target.partition = 1`.
2. Inferring from the data, e.g. `source.partition = target.partition`.

In the first case, this implements the logic described in [this
comment](https://github.com/delta-io/delta-rs/blob/main/crates/deltalake-core/src/operations/merge.rs#L670).
Any predicate mentioning the source that is not covered by (2) is
pruned, which will leave predicates on just the target columns (and will
be amenable to file pruning)

In the second case, we first construct a version of the predicate with
references to source replaced with placeholders:

```sql
target.partition = source.partition and foo > 42
```

becomes:

```sql
target.partition = $1 and foo > 42
```

We then stream through the source table, gathering the distinct tuples
of the mentioned partitions:

```
| partition |
-------------
|       1   |
|       5   |
|       7   |
```

and then expand out the sql to take these into account:

```sql
(target.partition = 1 and foo > 42)
or (target.partition = 5 and foo > 42)
or (target.partition = 7 and foo > 42)
```
And insert this filter into the target chain. We also use the same
filter to process the file list, meaning we only make remove actions for
files that will be targeted by the scan.

I considered whether it would be possible to do this via datafusion sql
in a generic manner, for example by first joining against the distinct
partitions. I don't think it's possible - because each of the filters on
the logical plans are static, there's no opportunity for it to push the
distinct partition tuples down into the scan. Another variant would be
to make it so the source and partition tables share the same
`output_partitioning` structure, but as far as I can tell you wouldn't
be able to make the partitions line up such that you can do the merge
effectively and not read the whole table (plus `DeltaScan` doesn't
guarantee that one datafusion partition is one DeltaTable partition).

I think the static bit is a no-brainer, but the eager read of the source
table may cause issues if the source table is of a similar size to the
target table. It may be prudent to hide that part behind a feature flag
on the merge, but I would love comments on it.
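
To make this concrete, a hedged sketch of how a caller might phrase the join predicate from the Python bindings so partition pruning can kick in (table path and column names are hypothetical):

```python
import pyarrow as pa
from deltalake import DeltaTable

dt = DeltaTable("path/to/partitioned_table")  # hypothetical path
source = pa.table({"partition": [1, 5, 7], "id": [10, 11, 12], "foo": [43, 44, 45]})

(
    dt.merge(
        source=source,
        # Referencing the partition column in the predicate lets merge prune
        # to just the touched partitions instead of rewriting the whole table.
        predicate="target.partition = source.partition AND target.id = source.id",
        source_alias="source",
        target_alias="target",
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)
```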

# Performance

I created a 16GB table locally with 1.25 billion rows over 1k
partitions, and when updating 1 partition a full merge takes 1000-ish
seconds:

```
merge took 985.0801 seconds
merge metrics: MergeMetrics { num_source_rows: 1250000, num_target_rows_inserted: 468790, num_target_rows_updated: 781210, num_target_rows_deleted: 0, num_target_rows_copied: 1249687667, num_output_rows: 1250937667, num_target_files_added: 1001, num_target_files_removed: 1001, execution_time_ms: 983851, scan_time_ms: 0, rewrite_time_ms: 983322 }
```

but with partitioning it takes about 3:
```
merge took 2.6337671 seconds
merge metrics: MergeMetrics { num_source_rows: 1250000, num_target_rows_inserted: 468877, num_target_rows_updated: 781123, num_target_rows_deleted: 0, num_target_rows_copied: 468877, num_output_rows: 1718877, num_target_files_added: 2, num_target_files_removed: 2, execution_time_ms: 2622, scan_time_ms: 0, rewrite_time_ms: 2316 }
```

In practice, the tables I'm wanting to use this for are terabytes in
size so using merge is currently impractical. This would be a
significant speed boost to them.


# Related Issue(s)
closes #1846

---------

Co-authored-by: Ion Koutsouris <[email protected]>
…es (#1959)

# Description
Delta-rs always uses `item` as the list item name when writing lists. If
you read data which was, for example, written by Spark, the item name is
`element`; in the current implementation it's not possible to write
RecordBatches with a different item name. This leads, for example, to the
problem that you can't optimize tables which were written by Spark and
contain a list column.
In this MR I add a condition which will initiate a cast if the list item
name of the record batch differs from the one in the target schema.
I have also tried to explain this behaviour in the tests, but
unfortunately creating the test data has become complicated (happy to
get feedback)

This is my first MR in this project 
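
To illustrate the mismatch this change resolves, a hedged pyarrow sketch (column name and values are made up):

```python
import pyarrow as pa

# Delta-rs defaults to "item" as the list element name when writing...
delta_style = pa.list_(pa.field("item", pa.int64()))
# ...while Spark-written data names the element "element".
spark_style = pa.list_(pa.field("element", pa.int64()))

batch = pa.table({"values": pa.array([[1, 2], [3]], type=spark_style)})
# With this change, writing `batch` casts to the target schema's item name
# instead of failing on the field-name mismatch.
print(batch.schema)
print(delta_style.equals(spark_style))  # False: the element names differ
```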

# Related Issue(s)

https://github.com/delta-io/delta-rs/blob/main/crates/deltalake-core/src/kernel/arrow/mod.rs#L58
https://github.com/delta-io/delta-rs/pull/684/files#r940790524
https://delta-users.slack.com/archives/C013LCAEB98/p1701885637615699

---------

Co-authored-by: Jonas Schmitz <[email protected]>
Co-authored-by: Ion Koutsouris <[email protected]>
Co-authored-by: Robert Pack <[email protected]>
# Description
Fixed a few warnings picked up by lint

---------

Signed-off-by: Nikolay Ulmasov <[email protected]>
# Description
Implements a new Datafusion node called `MergeBarrier` that determines
which files have modifications. For files that do not have modifications
a remove action is no longer created.

# Related Issue(s)
- enhances #850
# Description
This helps to avoid this
[error](#1998), since you can
now set large_dtypes=False.

Also, once arrow-rs upstream has better type coercion, this param
should be removable entirely in the writer and merge
operations.
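
A minimal sketch of the Python-side usage, assuming a table whose target schema uses regular (non-large) types (the path is hypothetical):

```python
import pyarrow as pa
from deltalake import write_deltalake

tbl = pa.table({"name": pa.array(["a", "b"], type=pa.large_string())})
# With large_dtypes=False the writer casts large_* arrow types down to
# their regular counterparts instead of erroring on the mismatch.
write_deltalake("path/to/table", tbl, mode="append", large_dtypes=False)
```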
# Description
- exposes the custom_metadata to the pyarrow and rust writers
- addresses a bug in the create operation: we were not passing the
app_metadata to the actual commit
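
A hedged usage sketch (path and metadata values are hypothetical):

```python
import pyarrow as pa
from deltalake import write_deltalake

tbl = pa.table({"id": [1, 2, 3]})
# custom_metadata is recorded in the commit info of the resulting log entry.
write_deltalake("path/to/table", tbl, custom_metadata={"trigger": "nightly-job"})
```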

# Related Issue(s)
- closes #1990
# Description
Forgot to add WriterProperties to the docs page and to mark a
deprecation in the docs.
Again, forgot some docs; added missing descriptions
There are a number of changes here to untangle the coupling inside of
deltalake-core to allow deltalake-aws to be separated properly
ion-elgreco and others added 23 commits March 6, 2024 07:38
# Description
Spark-scala uses the plural form to construct intervals, so "day**s**"
instead of "day". To keep backwards compatibility, I kept the singular
form as well.

- closes #2180
- closes: #2072
Cloudflare R2 doesn't require the use of an external lock when using conditional operation headers like so:

```rust
let s3 = AmazonS3Builder::from_env()
    .with_url(url)
    .with_region("auto")
    .with_access_key_id(&config.s3_access_key_id)
    .with_secret_access_key(&config.s3_secret_access_key)
    .with_endpoint(&config.s3_endpoint)
    // Allows using S3-API without an external locking provider since Cloudflare R2
    // provides atomic Put and Copy.
    .with_config(
        AmazonS3ConfigKey::CopyIfNotExists,
        "header: cf-copy-destination-if-none-match: *".to_string(),
    )
    .build()
```

- https://developers.cloudflare.com/r2/api/s3/extensions/#putobject-1
- https://github.com/apache/arrow-rs/blob/c6ba0f764a9142b74c9070db269de04d2701d112/object_store/src/aws/precondition.rs#L29-L42
# Description

Update timing fields using the suffix `_ms` to be in milliseconds.

Doing a quick grep through the repo I only found a few instances of this
in the delete.rs file.

# Related Issue(s)

Closes #2256

# Documentation
This adds an Integration page to the docs re: Dagster.

---------

Co-authored-by: Matthew Powers <[email protected]>
Use `dagster-deltalake-polars` instead of `dagster-polars`
# Description
Release the GIL in deltalake.write_deltalake by wrapping it in
py.allow_threads
# Related Issue(s)
- closes #2234
# Documentation
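
With the GIL released during the Rust write, concurrent writes from Python threads can actually overlap; a minimal sketch (paths are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
from deltalake import write_deltalake

tables = {f"path/to/table_{i}": pa.table({"id": [i]}) for i in range(4)}

# write_deltalake now releases the GIL while the Rust writer runs, so these
# writes can proceed in parallel on separate threads.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(write_deltalake, path, tbl) for path, tbl in tables.items()]
    for future in futures:
        future.result()
```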
…ctions (#2271)

# Description
- Always encapsulates column names in backticks in the insert_all and
update_all calls.
- Added a note that users need to add backticks for special column names
themselves in explicit expressions (see the sketch after this list).
- Removed the bigint cast; this was temporarily needed while we were still
relying on a physical plan.
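
A hedged sketch of the backtick quoting in an explicit update expression (the table, source, and `my-col` column are hypothetical):

```python
import pyarrow as pa
from deltalake import DeltaTable

dt = DeltaTable("path/to/table")  # hypothetical
source = pa.table({"my-col": [1], "value": [42]})

(
    dt.merge(
        source=source,
        predicate="target.`my-col` = source.`my-col`",
        source_alias="source",
        target_alias="target",
    )
    # Explicit expressions must backtick-quote special column names;
    # when_matched_update_all / when_not_matched_insert_all now quote
    # every column automatically.
    .when_matched_update(updates={"`my-col`": "source.`my-col`"})
    .execute()
)
```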

# Related Issue(s)
- closes #2230
- closes #2167
…2270)

# Description
We now only read the checkpoint files that match the version in
`_last_checkpoint`.

# Related Issue(s)
- closes #2258
# Description

Correct the spelling of `without`.

# Related Issue(s)

N/A

# Documentation

N/A
…2286)

# Description

- Replaces assert and AssertionError with built-in exceptions
- Amended tests to reference the new exception types
- Follows the file's existing convention of using built-in exceptions
rather than custom exceptions
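
For illustration, the shape of the change (the `mode` check is a hypothetical example, not a specific diff from this PR):

```python
mode = "upsert"  # hypothetical value

# Before: stripped under `python -O`, and raises a bare AssertionError.
# assert mode in ("append", "overwrite"), f"unsupported mode: {mode}"

# After: a built-in exception that always fires and is more descriptive.
if mode not in ("append", "overwrite"):
    raise ValueError(f"unsupported mode: {mode}")
```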

# Related Issue(s)
Closes #2242

# Documentation

<!---
Share links to useful documentation
--->
# Description
Refactor the commit function to instead be accessed through a builder
interface to prevent breaking call sites when new commit attributes are
added.

This also introduces a new public struct called `CommitProperties` to be
used by any delta operation, ensuring a consistent interface for end
users. With this change we intentionally break usage of `app_metadata` on
operations and instead have users go through this struct.
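
The Rust API is the source of truth here, but the builder idea itself can be sketched in a few lines of Python (names are illustrative, not the actual interface):

```python
class CommitBuilder:
    """Illustrative sketch of a commit builder."""

    def __init__(self, actions: list):
        self._actions = actions
        self._app_metadata: dict = {}

    def with_app_metadata(self, metadata: dict) -> "CommitBuilder":
        # New commit attributes become new `with_` methods instead of new
        # positional parameters, so existing call sites keep working.
        self._app_metadata = metadata
        return self

    def build(self) -> dict:
        return {"actions": self._actions, "appMetadata": self._app_metadata}

commit = CommitBuilder(actions=[]).with_app_metadata({"op": "demo"}).build()
```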

# Related Issue(s)
- closes: #2131
…2274)

# Description
Such a small change, but it fixes many issues where parquet files were
written with arrow from source data in large dtype format.

By default, parquet::ParquetReader decodes the arrow metadata, which in
turn may give you large dtypes. This would cause issues during a
DataFusion parquet scan with a filter, since the filter wouldn't coerce
to the large dtypes. Simply disabling the arrow metadata decoding gives
us the parquet schema converted to an arrow schema without large types
👯‍♂️

# Related issue(s)
- closes #1470
…rd link (#1868)

Makes writing compatible with local file systems that do not support hard links.

# Description

When we write to the local file system, hard links are sometimes not
supported, for example on blobfuse, goofys, or s3fs, so handle that case
for compatibility.

It is important to note that there is another problem with blobfuse: it
reports errors on rename, because the file handle was not released
beforehand.
See here for details: #1765

arrow-rs needs a corresponding change, for example:
https://github.com/GlareDB/arrow-rs/pull/2/files
Because object_store has been upgraded to 0.8 there are a lot of
breaking changes, so I haven't changed this one for the time being. Will
fix it after upgrading to 0.8
#1858

# Related Issue(s)

#1765
 
#1376 

# Documentation
@JonasDev1 JonasDev1 merged commit a4d4170 into JonasDev1:main Mar 15, 2024
21 of 22 checks passed

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
