
Update #2

Merged
merged 168 commits into from
Mar 15, 2024

Conversation

JonasDev1
Owner

Description

The description of the main changes of your pull request

Related Issue(s)

Documentation

Blajda and others added 30 commits December 12, 2023 09:03
# Description
To address
[CVE-2023-47248](https://nvd.nist.gov/vuln/detail/CVE-2023-47248),
I propose we update the install to include `pyarrow-hotfix`.
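
For context, the mitigation is applied simply by importing the package after installation; a minimal sketch (`pyarrow-hotfix` patches the vulnerable deserialization path on import):

```python
# pip install pyarrow-hotfix
import pyarrow as pa
import pyarrow_hotfix  # noqa: F401 -- the import itself installs the mitigation

# With the hotfix active, deserializing untrusted Arrow IPC data can no
# longer trigger arbitrary code execution via CVE-2023-47248.
print(pa.__version__)
```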

# Related Issue(s)
N/A

# Tasks
- [x] Updated `pyproject.toml` for the pyarrow dependency to
`pyarrow>=14.0.1`
- [x] Updated `Cargo.toml` to `version=0.13.1`
- [x] Minor updates to `CONTRIBUTING.md` to remove `$` so copy/paste
works.

# Tests ran
- [x] Wheel built
```
📦 Built wheel for abi3 Python ≥ 3.8 to /var/folders/vb/c_pm7tr92h155xzpz4b0f1kh0000gn/T/.tmpLPVJTe/deltalake-0.13.1-cp38-abi3-macosx_11_0_arm64.whl
🛠 Installed deltalake-0.13.1
```
- [x] Format passed
```
41 files left unchanged
```
- [x] Check Python passed
```
Success: no issues found in 8 source files
```   
- [x] Unit tests passed
```
234 passed, 4 skipped, 25 deselected in 48.66s 
```

---------

Co-authored-by: Ion Koutsouris <[email protected]>
# Description
This PR adds CHECK constraints on delta tables. I still have some
outstanding work to do on this, but this is my working draft PR for this
feature.

```rust
let constraint = table.add_constraint().with_constraint("id", "id < 100");
constraint.await?;
```
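
For comparison, a hedged sketch of the equivalent call from the Python bindings, assuming the `alter.add_constraint` API is available in your version (the table path is hypothetical):

```python
from deltalake import DeltaTable

dt = DeltaTable("path/to/table")  # hypothetical table path
# Adds a CHECK constraint; subsequent writes that violate it are rejected.
dt.alter.add_constraint({"id": "id < 100"})
```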

# Related Issue(s)
#1881 

# Documentation

<!---
Share links to useful documentation
--->

---------

Co-authored-by: Stephen Carman <[email protected]>
Co-authored-by: scarman-db <[email protected]>
…oming from parquet checkpoints, to prevent tombstone and file paths mismatch (e.g. file path is read from checkpoint while tombstone path is read from JSON)
# Description
Enable usage of z-order optimization on columns that have
capitalization.

# Related Issue(s)
- closes #1586
#1979)

# Description
Triggers metadata retrieval only when the metadata is actually requested.
This is a better approach; otherwise we would need to refresh it after
every method that alters the table config. It is now triggered only when
the user actually wants to retrieve the metadata.
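
As an illustration only (not the actual delta-rs code), the lazy pattern looks roughly like this, with `_fetch_metadata` standing in for the real log read:

```python
class TableHandle:
    """Illustrative sketch of lazy metadata retrieval."""

    def __init__(self, path: str):
        self._path = path
        self._metadata = None  # nothing fetched up front

    def metadata(self) -> dict:
        # Retrieval happens only here, on the metadata call itself, rather
        # than being refreshed after every table-altering operation.
        if self._metadata is None:
            self._metadata = self._fetch_metadata()
        return self._metadata

    def _fetch_metadata(self) -> dict:
        return {"path": self._path}  # placeholder for the real log read
```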

Co-authored-by: Robert Pack <[email protected]>
# Description
Implements consistent formatting for constraint expressions, so that an
expression written with irregular spacing (e.g. `value<1000`) is
normalized to `value < 1000`.


Also includes drive-by improvements:
1. Test & Fix that Datafusion expressions can actually be used when
adding a constraint
2. Test & Fix that constraints can be added to column with
capitalization
 
# Related Issue(s)
- closes #1971
…ersion` (#1968)

# Description
Combines the two functions into one.

# Related Issue(s)
- closes #1910
- closes #1967

---------

Co-authored-by: Robert Pack <[email protected]>
# Description
I've changed the API to consolidate how we use writer properties. You
now need to instantiate a WriterProperties class and then pass it to the
writer, merge, delete, update, and optimize operations.

```python
wp = WriterProperties(compression='gzip', compression_level=1)
dt.optimize.z_order(['foo'], writer_properties=wp)
```

A potential idea I had is to allow users to set the writer properties
once on the DeltaTable class, so the properties can be grabbed from the
table class and you don't have to provide them to each method.

---------

Co-authored-by: Robert Pack <[email protected]>
# Description
Currently limited to a single constraint at a time. I didn't want to
change the function definition for this, so I just raise an error if
more than one constraint is passed. Once we have the multiple
constraints possibility in Rust, I can simply remove this size check.

---------

Co-authored-by: Robert Pack <[email protected]>
# Description

The GCS tests seem to fail intermittently; I looked around and am
hoping this change fixes that.
# Description
This upgrades merge so that it can leverage partitions where specified
in the join predicate. There are two ways we can leverage partitions:

1. Static references, e.g. `target.partition = 1`.
2. Inferring from the data, e.g. `source.partition = target.partition`.

In the first case, this implements the logic described in [this
comment](https://github.com/delta-io/delta-rs/blob/main/crates/deltalake-core/src/operations/merge.rs#L670).
Any predicate mentioning the source that is not covered by (2) is
pruned, which will leave predicates on just the target columns (and will
be amenable to file pruning)

In the second case, we first construct a version of the predicate with
references to source replaced with placeholders:

```sql
target.partition = source.partition and foo > 42
```

becomes:

```sql
target.partition = $1 and foo > 42
```

We then stream through the source table, gathering the distinct tuples
of the mentioned partitions:

```
| partition |
-------------
|       1   |
|       5   |
|       7   |
```

and then expand out the sql to take these into account:

```sql
(target.partition = 1 and foo > 42)
or (target.partition = 5 and foo > 42)
or (target.partition = 7 and foo > 42)
```
And insert this filter into the target chain. We also use the same
filter to process the file list, meaning we only make remove actions for
files that will be targeted by the scan.

I considered whether it would be possible to do this via datafusion sql
in a generic manner, for example by first joining against the distinct
partitions. I don't think it's possible - because each of the filters on
the logical plans are static, there's no opportunity for it to push the
distinct partition tuples down into the scan. Another variant would be
to make it so the source and partition tables share the same
`output_partitioning` structure, but as far as I can tell you wouldn't
be able to make the partitions line up such that you can do the merge
effectively and not read the whole table (plus `DeltaScan` doesn't
guarantee that one datafusion partition is one DeltaTable partition).

I think the static bit is a no-brainer, but the eager read of the source
table may cause issues if the source table is of a similar size to the
target table. It may be prudent to hide that part behind a feature flag
on the merge, but I would love comments on it.
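
To make this concrete, a hedged sketch of how a caller might phrase the join predicate from the Python bindings so partition pruning can kick in (table path and column names are hypothetical):

```python
import pyarrow as pa
from deltalake import DeltaTable

dt = DeltaTable("path/to/partitioned_table")  # hypothetical path
source = pa.table({"partition": [1, 5, 7], "id": [10, 11, 12], "foo": [43, 44, 45]})

(
    dt.merge(
        source=source,
        # Referencing the partition column in the predicate lets merge prune
        # to just the touched partitions instead of rewriting the whole table.
        predicate="target.partition = source.partition AND target.id = source.id",
        source_alias="source",
        target_alias="target",
    )
    .when_matched_update_all()
    .when_not_matched_insert_all()
    .execute()
)
```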

# Performance

I created a 16GB table locally with 1.25 billion rows over 1k
partitions, and when updating 1 partition a full merge takes 1000-ish
seconds:

```
merge took 985.0801 seconds
merge metrics: MergeMetrics { num_source_rows: 1250000, num_target_rows_inserted: 468790, num_target_rows_updated: 781210, num_target_rows_deleted: 0, num_target_rows_copied: 1249687667, num_output_rows: 1250937667, num_target_files_added: 1001, num_target_files_removed: 1001, execution_time_ms: 983851, scan_time_ms: 0, rewrite_time_ms: 983322 }
```

but with partitioning it takes about 3:
```
merge took 2.6337671 seconds
merge metrics: MergeMetrics { num_source_rows: 1250000, num_target_rows_inserted: 468877, num_target_rows_updated: 781123, num_target_rows_deleted: 0, num_target_rows_copied: 468877, num_output_rows: 1718877, num_target_files_added: 2, num_target_files_removed: 2, execution_time_ms: 2622, scan_time_ms: 0, rewrite_time_ms: 2316 }
```

In practice, the tables I'm wanting to use this for are terabytes in
size so using merge is currently impractical. This would be a
significant speed boost to them.


# Related Issue(s)
closes #1846

---------

Co-authored-by: Ion Koutsouris <[email protected]>
…es (#1959)

# Description
Delta-rs always uses `item` as the list item name when writing lists. If
you read data which was, for example, written by Spark, the item name is
`element`; in the current implementation it's not possible to write
RecordBatches with a different item name. This leads, for example, to the
problem that you can't optimize tables which were written by Spark and
contain a list column.
In this MR I add a condition which will initiate a cast if the list item
name of the record batch differs from the one in the target schema.
I have also tried to explain this behaviour in the tests, but
unfortunately creating the test data has become complicated (happy to
get feedback)

This is my first MR in this project 
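
To illustrate the mismatch this change resolves, a hedged pyarrow sketch (column name and values are made up):

```python
import pyarrow as pa

# Delta-rs defaults to "item" as the list element name when writing...
delta_style = pa.list_(pa.field("item", pa.int64()))
# ...while Spark-written data names the element "element".
spark_style = pa.list_(pa.field("element", pa.int64()))

batch = pa.table({"values": pa.array([[1, 2], [3]], type=spark_style)})
# With this change, writing `batch` casts to the target schema's item name
# instead of failing on the field-name mismatch.
print(batch.schema)
print(delta_style.equals(spark_style))  # False: the element names differ
```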

# Related Issue(s)

https://github.com/delta-io/delta-rs/blob/main/crates/deltalake-core/src/kernel/arrow/mod.rs#L58
https://github.com/delta-io/delta-rs/pull/684/files#r940790524
https://delta-users.slack.com/archives/C013LCAEB98/p1701885637615699

---------

Co-authored-by: Jonas Schmitz <[email protected]>
Co-authored-by: Ion Koutsouris <[email protected]>
Co-authored-by: Robert Pack <[email protected]>
# Description
Fixed a few warnings picked up by lint

---------

Signed-off-by: Nikolay Ulmasov <[email protected]>
# Description
Implements a new Datafusion node called `MergeBarrier` that determines
which files have modifications. For files that do not have modifications
a remove action is no longer created.

# Related Issue(s)
- enhances #850
# Description
This helps to avoid this
[error](#1998), since you can
now set large_dtypes=False.

Also, once arrow-rs upstream has better type coercion, this param
should be removable entirely in the writer and merge
operations.
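
A minimal sketch of the Python-side usage, assuming a table whose target schema uses regular (non-large) types (the path is hypothetical):

```python
import pyarrow as pa
from deltalake import write_deltalake

tbl = pa.table({"name": pa.array(["a", "b"], type=pa.large_string())})
# With large_dtypes=False the writer casts large_* arrow types down to
# their regular counterparts instead of erroring on the mismatch.
write_deltalake("path/to/table", tbl, mode="append", large_dtypes=False)
```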
# Description
- exposes the custom_metadata to the pyarrow and rust writers
- addresses a bug in the create operation: we were not passing the
app_metadata to the actual commit
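
A hedged usage sketch (path and metadata values are hypothetical):

```python
import pyarrow as pa
from deltalake import write_deltalake

tbl = pa.table({"id": [1, 2, 3]})
# custom_metadata is recorded in the commit info of the resulting log entry.
write_deltalake("path/to/table", tbl, custom_metadata={"trigger": "nightly-job"})
```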

# Related Issue(s)
- closes #1990
# Description
Forgot to add WriterProperties to the docs page and to mark a
deprecation in the docs.
Again, forgot some docs; added missing descriptions
There are a number of changes here to untangle the coupling inside of
deltalake-core to allow deltalake-aws to be separated properly
ion-elgreco and others added 23 commits March 6, 2024 07:38
# Description
Spark-scala uses the plural form to construct intervals, so "day**s**"
instead of "day". To keep backwards compatibility, I kept the singular
form as well.

- closes #2180
- closes: #2072
Cloudflare R2 doesn't require the use of an external lock when using conditional operation headers like so:

```rust
let s3 = AmazonS3Builder::from_env()
    .with_url(url)
    .with_region("auto")
    .with_access_key_id(&config.s3_access_key_id)
    .with_secret_access_key(&config.s3_secret_access_key)
    .with_endpoint(&config.s3_endpoint)
    // Allows using S3-API without an external locking provider since Cloudflare R2
    // provides atomic Put and Copy.
    .with_config(
        AmazonS3ConfigKey::CopyIfNotExists,
        "header: cf-copy-destination-if-none-match: *".to_string(),
    )
    .build()
```

- https://developers.cloudflare.com/r2/api/s3/extensions/#putobject-1
- https://github.com/apache/arrow-rs/blob/c6ba0f764a9142b74c9070db269de04d2701d112/object_store/src/aws/precondition.rs#L29-L42
# Description

Update timing fields using the suffix `_ms` to be in milliseconds.

Doing a quick grep through the repo I only found a few instances of this
in the delete.rs file.

# Related Issue(s)

Closes #2256

# Documentation
This adds an Integration page to the docs re: Dagster.

---------

Co-authored-by: Matthew Powers <[email protected]>
Use `dagster-deltalake-polars` instead of `dagster-polars`
# Description
Release the GIL in deltalake.write_deltalake by wrapping it in
py.allow_threads
# Related Issue(s)
- closes #2234
# Documentation
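
With the GIL released during the Rust write, concurrent writes from Python threads can actually overlap; a minimal sketch (paths are hypothetical):

```python
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
from deltalake import write_deltalake

tables = {f"path/to/table_{i}": pa.table({"id": [i]}) for i in range(4)}

# write_deltalake now releases the GIL while the Rust writer runs, so these
# writes can proceed in parallel on separate threads.
with ThreadPoolExecutor() as pool:
    futures = [pool.submit(write_deltalake, path, tbl) for path, tbl in tables.items()]
    for future in futures:
        future.result()
```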
…ctions (#2271)

# Description
- Always encapsulates column names in backticks in the insert_all and
update_all calls.
- Added a note that users need to add backticks for special column names
themselves in explicit expressions (see the sketch after this list).
- Removed the bigint cast; this was temporarily needed while we were still
relying on a physical plan.
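
A hedged sketch of the backtick quoting in an explicit update expression (the table, source, and `my-col` column are hypothetical):

```python
import pyarrow as pa
from deltalake import DeltaTable

dt = DeltaTable("path/to/table")  # hypothetical
source = pa.table({"my-col": [1], "value": [42]})

(
    dt.merge(
        source=source,
        predicate="target.`my-col` = source.`my-col`",
        source_alias="source",
        target_alias="target",
    )
    # Explicit expressions must backtick-quote special column names;
    # when_matched_update_all / when_not_matched_insert_all now quote
    # every column automatically.
    .when_matched_update(updates={"`my-col`": "source.`my-col`"})
    .execute()
)
```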

# Related Issue(s)
- closes #2230
- closes #2167
…2270)

# Description
We now only read the checkpoint files that match the version in
`_last_checkpoint`.

# Related Issue(s)
- closes #2258
# Description

Correct the spelling of `without`.

# Related Issue(s)

N/A

# Documentation

N/A
…2286)

# Description

- Replaces assert and AssertionError with built-in exceptions
- Amended tests to reference the new exception types
- Follows the file's existing convention of using built-in exceptions
rather than custom exceptions
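
For illustration, the shape of the change (the `mode` check is a hypothetical example, not a specific diff from this PR):

```python
mode = "upsert"  # hypothetical value

# Before: stripped under `python -O`, and raises a bare AssertionError.
# assert mode in ("append", "overwrite"), f"unsupported mode: {mode}"

# After: a built-in exception that always fires and is more descriptive.
if mode not in ("append", "overwrite"):
    raise ValueError(f"unsupported mode: {mode}")
```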

# Related Issue(s)
Closes #2242

# Documentation

<!---
Share links to useful documentation
--->
# Description
Refactor the commit function to instead be accessed through a builder
interface to prevent breaking call sites when new commit attributes are
added.

This also introduces a new public struct called `CommitProperties` to be
used by any delta operation, ensuring a consistent interface for end
users. With this change we intentionally break usage of `app_metadata` on
operations and instead have users go through this struct.
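
The Rust API is the source of truth here, but the builder idea itself can be sketched in a few lines of Python (names are illustrative, not the actual interface):

```python
class CommitBuilder:
    """Illustrative sketch of a commit builder."""

    def __init__(self, actions: list):
        self._actions = actions
        self._app_metadata: dict = {}

    def with_app_metadata(self, metadata: dict) -> "CommitBuilder":
        # New commit attributes become new `with_` methods instead of new
        # positional parameters, so existing call sites keep working.
        self._app_metadata = metadata
        return self

    def build(self) -> dict:
        return {"actions": self._actions, "appMetadata": self._app_metadata}

commit = CommitBuilder(actions=[]).with_app_metadata({"op": "demo"}).build()
```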

# Related Issue(s)
- closes: #2131
…2274)

# Description
Such a small change, but it fixes many issues where parquet files were
written with arrow from source data in large dtype format.

By default, parquet::ParquetReader decodes the arrow metadata, which in
turn may give you large dtypes. This would cause issues during a
DataFusion parquet scan with a filter, since the filter wouldn't coerce
to the large dtypes. Simply disabling the arrow metadata decoding gives
us the parquet schema converted to an arrow schema without large types
👯‍♂️

# Related issue(s)
- closes #1470
…rd link (#1868)

Makes writing compatible with local file systems that do not support hard links.

# Description

When we write to the local file system, hard links are sometimes not
supported, for example on blobfuse, goofys, or s3fs, so handle that case
for compatibility.

It is important to note that there is another problem with blobfuse: it
reports errors on rename, because the file handle was not released
beforehand.
See here for details: #1765

arrow-rs needs a corresponding change, for example:
https://github.com/GlareDB/arrow-rs/pull/2/files
Because object_store has been upgraded to 0.8 there are a lot of
breaking changes, so I haven't changed this one for the time being. Will
fix it after upgrading to 0.8
#1858

# Related Issue(s)

#1765
 
#1376 

# Documentation
@JonasDev1 JonasDev1 merged commit a4d4170 into JonasDev1:main Mar 15, 2024
21 of 22 checks passed

ACTION NEEDED

delta-rs follows the Conventional Commits specification for release automation.

The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification.
