[BUG] DataFrame.to_arrow is inconsistent with pa.Table.from_pandas() when preserve_index=True #14159
Labels
0 - Backlog: In queue waiting for assignment
bug: Something isn't working
Python: Affects Python cuDF API.
Comments
rjzamora added the bug and Needs Triage labels on Sep 21, 2023.
GregoryKimball added the 0 - Backlog and Python labels and removed the Needs Triage label on Nov 9, 2023.
wence- added seven commits to wence-/cudf that referenced this issue on Mar 22, 2024, with the commit message:
When preserving the index and we have a RangeIndex, we must materialize it, and write that information in the metadata correctly. - Closes rapidsai#14159
wence- changed the title of this issue on Mar 22, 2024.
rapids-bot bot pushed a commit that referenced this issue on Apr 22, 2024:
Looks like these overrides should be safe to remove now that #14159 is closed out. This should unblock the GPU CI failures we're seeing on Dask with 24.06 in dask/dask#11045.
Authors:
- Charles Blackmon-Luca (https://github.com/charlesbluca)
- Richard (Rick) Zamora (https://github.com/rjzamora)
- GALI PREM SAGAR (https://github.com/galipremsagar)
Approvers:
- Richard (Rick) Zamora (https://github.com/rjzamora)
- GALI PREM SAGAR (https://github.com/galipremsagar)
URL: #15514
Describe the bug
It is my understanding that we want the DataFrame.to_arrow API to be consistent with pa.Table.from_pandas() (when possible). This is currently not the case when DataFrame.index is a RangeIndex and preserve_index=True is specified. In this case, pa.Table.from_pandas() will use the RangeIndex information to produce an explicit "__index_level_0__" column in the output pyarrow Table.
Side Note: Creating an explicit column is the only way to generate a Table schema that will "safely" preserve the index in dask, because that schema may be used to "rebuild" partitions with a different number of rows later on (and so the original start/stop metadata can be "wrong").
Steps/Code to reproduce bug
The schema extracted from pandas will have an "__index_level_0__" column when preserve_index=True, and will store RangeIndex metadata if preserve_index=True or preserve_index=None.
The schema extracted from cudf will never have an explicit index column, and will only store RangeIndex metadata if preserve_index=True.
Expected behavior
I'd like for DataFrame.to_arrow(preserve_index=True) to be consistent with pa.Table.from_pandas(..., preserve_index=True). More specifically, I'd like cudf to produce an explicit column in the pyarrow Table. When the original RangeIndex is un-named, it probably makes sense to call the column "__index_level_0__". However, DataFrame.from_arrow will also need to recognize that "__index_level_0__" should round-trip to an un-named index.
Additional context
This bug complicates #13893, because distributed's "p2p" shuffle now assumes that to_pyarrow_table_dispatch will include the actual data for an index column when preserve_index=True (not just range metadata). We can certainly include a band-aid for this in dask_cudf, but the best long-term fix belongs in cudf.