
Issue with reading timestamps from spark Delta tables #3155

Closed

kejtos opened this issue Jan 23, 2025 · 1 comment

Labels
bug Something isn't working

Comments

kejtos commented Jan 23, 2025

Environment

Delta-rs version: 0.24.0

Binding: Python

Environment:

  • Cloud provider: Azure Databricks
  • Runtime: DBR 12.2 LTS
  • Driver: Standard_DS3_v2

Bug

What happened:

The following code on Databricks:

from deltalake import DeltaTable

DeltaTable('path', storage_options={'allow_unsafe_rename': 'true'}).to_pyarrow_table()

results in the following error:
ComputeError: ArrowInvalid: Casting from timestamp[ns] to timestamp[us, tz=UTC] would lose data: -number
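
The underlying cast failure can be reproduced in isolation with PyArrow alone (a minimal sketch; the literal 1 below is just a stand-in value):

import pyarrow as pa

# One nanosecond past the epoch cannot be represented exactly in microseconds,
# so Arrow's default safe cast refuses and raises ArrowInvalid.
arr = pa.array([1], type=pa.timestamp('ns'))
try:
    arr.cast(pa.timestamp('us'))
except pa.ArrowInvalid as e:
    print(e)  # Casting from timestamp[ns] to timestamp[us] would lose data: 1

# safe=False truncates the sub-microsecond part instead of raising.
print(arr.cast(pa.timestamp('us'), safe=False))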

What you expected to happen:

Reading the timestamps in their original format, or coercing the cast (I do not need that much precision anyway).

How to reproduce it:
Run any code against a Databricks-written DeltaTable that reads timestamp[ns] columns. E.g.,

from deltalake import DeltaTable

DeltaTable('path', storage_options={'allow_unsafe_rename': 'true'}).to_pyarrow_table()

or

from deltalake import DeltaTable
import duckdb

delta_table = DeltaTable('path', storage_options={'allow_unsafe_rename': 'true'}).to_pyarrow_dataset()
quack = duckdb.arrow(delta_table)
print(quack.select("*"))

or

import polars as pl

delta_table = pl.read_delta('path', storage_options={'allow_unsafe_rename': 'true'}) # use_pyarrow=True did not help

kejtos added the bug (Something isn't working) label Jan 23, 2025
ion-elgreco changed the title from "Issue with reading timestamps from databricks Delta tables" to "Issue with reading timestamps from spark Delta tables" Jan 23, 2025

ion-elgreco (Collaborator) commented Jan 23, 2025

@kejtos this is because Spark still writes timestamps using an old Parquet type (INT96), which is discouraged. You should set a proper Spark config to prevent your tables from getting the wrong timestamp type.

spark.conf.set("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")

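For new writes, this can also be set when the session is built, so every table written by the session stores microsecond-precision timestamps (a minimal sketch; the output path is a placeholder):

from pyspark.sql import SparkSession

# Build a session that writes Parquet timestamps as TIMESTAMP_MICROS
# instead of the legacy INT96 representation.
spark = (
    SparkSession.builder
    .config("spark.sql.parquet.outputTimestampType", "TIMESTAMP_MICROS")
    .getOrCreate()
)

# Tables written by this session are readable by delta-rs without a lossy cast.
spark.sql("SELECT current_timestamp() AS ts").write.format("delta").save("/tmp/example_table")  # placeholder path
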
To read existing INT96 values as microseconds instead, use parquet_read_options:

parquet_read_options: Optional read options for Parquet. Use this to handle INT96 to timestamp conversion for edge cases like 0001-01-01 or 9999-12-31

from deltalake import DeltaTable
from pyarrow.dataset import ParquetReadOptions

DeltaTable(tmp_path).to_pyarrow_dataset(
    parquet_read_options=ParquetReadOptions(coerce_int96_timestamp_unit="us")
)

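Combined with the duckdb repro above, the workaround would look roughly like this (a sketch; 'path' is the same placeholder as in the report):

import duckdb
from deltalake import DeltaTable
from pyarrow.dataset import ParquetReadOptions

# Coerce INT96 values to microseconds at scan time so the
# timestamp[ns] -> timestamp[us, tz=UTC] cast no longer loses data.
dataset = DeltaTable(
    'path', storage_options={'allow_unsafe_rename': 'true'}
).to_pyarrow_dataset(
    parquet_read_options=ParquetReadOptions(coerce_int96_timestamp_unit='us')
)

print(duckdb.arrow(dataset).select("*"))
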
ion-elgreco closed this as not planned Jan 23, 2025