Add arrow cast #962

Open
wants to merge 33 commits into
base: main
Conversation

@kosiew (Contributor) commented Dec 3, 2024

Which issue does this PR close?

Completes a task in #463

Rationale for this change

This PR adds an arrow_cast function to datafusion-python, mirroring DataFusion's arrow_cast scalar function.

What changes are included in this PR?

Functionality:

Implements the arrow_cast function

Tests:

Adds a test case to validate the functionality of arrow_cast.

Are there any user-facing changes?

Yes, this PR adds the arrow_cast function to the API, enabling users to cast columns to a specific Arrow data type.

@timsaucer (Contributor) left a comment

Thank you for the additions. Do you think this is needed, or could we just use Expr.cast?

python/datafusion/__init__.py (outdated, resolved)
python/datafusion/functions.py (resolved)

@kosiew (Contributor Author) commented Dec 12, 2024

> Thank you for the additions. Do you think this is needed, or could we just use Expr.cast?

@timsaucer (Contributor) commented

I did a little testing to see whether this is necessary:

from datafusion import SessionContext, col, lit
import pyarrow as pa
import datetime
ctx = SessionContext()

df = ctx.from_pydict({
    "a": pa.array([1], type=pa.int64()),
    "date": pa.array([datetime.datetime.today()], type=pa.timestamp("us")),
})

df.show()
print(df.schema())

df = (
    df
    .with_column("b", col("a").cast(pa.int8()))
    .with_column("ms_date", col("date").cast(pa.timestamp("ms")))
)

df.show()
print(df.schema())

Produces:

DataFrame()
+---+----------------------------+
| a | date                       |
+---+----------------------------+
| 1 | 2024-12-12T07:28:24.431099 |
+---+----------------------------+
a: int64
date: timestamp[us]
DataFrame()
+---+----------------------------+---+-------------------------+
| a | date                       | b | ms_date                 |
+---+----------------------------+---+-------------------------+
| 1 | 2024-12-12T07:28:24.431099 | 1 | 2024-12-12T07:28:24.431 |
+---+----------------------------+---+-------------------------+
a: int64
date: timestamp[us]
b: int8
ms_date: timestamp[ms]

So maybe we can use the existing method?

@kosiew (Contributor Author) commented Dec 16, 2024

@timsaucer,

Thanks for the detailed example.

The other reason for this PR is to add arrow_cast so that the full set of DataFusion Rust scalar functions is available in datafusion-python (#463).
