Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Patched DataFusion, Oct 30, 2024 #48

Draft
wants to merge 10 commits into
base: df-upgrade-base-oct-30
Choose a base branch
from

Conversation

wiedld
Copy link
Collaborator

@wiedld wiedld commented Dec 12, 2024

This takes us up to Oct-30 SHA 4975829

It advances DF by adding these oct-15 to oct-30 commits: 399e840...influxdata:arrow-datafusion:497582906e826d3045e2c7b9d4ddb44ce6d42adf

NOTE: this does not include:

  • the SPM update, which is the last change on oct-30: f23360f
  • the internal use of view types for array record batch processing: 2d7892b

Patches applied

  • Patches already fixed in future upstream commits:

  • Patches still not fixed:

  • Patches specific to our DF branch (no need for upstream fix):

    • bit_length(utf8view) patch: 90423e2
    • we are keeping in the warn logging for the schema mismatch. (Upstream only returns error, or not, based on schema misalignment): our patch 69d6118
    • we are using an extra few bytes for the GroupedHashAggregateStream: 50e1209
  • tech debt to be fixed:

    • we are keeping in the warn logging for the schema mismatch. (Upstream only returns error, or not, based on schema misalignment): our patch 69d6118
    • I reverted/excluded the commit to turn on view types by default (for arrow processing):
      • we had errors for iox tests with existing parquet, which contained string value fields.
        • e.g. end_to_end_cases::cli::write_and_query, end_to_end_cases::cli::table::create_write_and_query.
        • The exact error thrown was an iox schema check. Even if this was updated to accept view types as valid => we had arrow cast errors where it kept attempting ot cast to string array (not view).
          * I can guess this is due to our various iox codepoints where we map String fields to arrow's Utf8. I would rather not muck around with that, unless it's done separately.
          * Additionally, setting schema_force_view_types=false feels fragile since it's nested in a various options. To be on the safe side, I would rather revert this commit for now.

@wiedld wiedld changed the title Patched DataFusion, Oct 15, 2024 Patched DataFusion, Oct 30, 2024 Dec 12, 2024
@wiedld wiedld force-pushed the iox-12627/patched-DF-oct-30 branch 2 times, most recently from 03d928c to 3afc128 Compare December 12, 2024 22:02
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Dec 12, 2024
@wiedld wiedld force-pushed the iox-12627/patched-DF-oct-30 branch from 89ae9bf to 1e9c1f1 Compare December 12, 2024 23:03
@wiedld
Copy link
Collaborator Author

wiedld commented Dec 12, 2024

  • we are using an extra few bytes for the GroupedHashAggregateStream: 50e1209

@alamb note that this is a new commit.

.

@alamb note that this is a newly re-written commit, now that we advanced to the latest (on DF main) version of union_schema(). This is the remaining bit we patch to handle our iox-specific issue.

@wiedld wiedld force-pushed the iox-12627/patched-DF-oct-30 branch from 1e9c1f1 to 50e1209 Compare December 12, 2024 23:31
…improvement for entire ClickBench suite) (apache#13101)"

This reverts commit 2d7892b.
@wiedld
Copy link
Collaborator Author

wiedld commented Dec 13, 2024

In order to get iox CI passing, I decided to back out the commit to turn on view types. See this PR description.

In brief, existing parquet with string fields was failing in our iox tests. IMO, we should try making this change separately; not with 2 weeks of other DF changes. What do you think @alamb ?

@alamb
Copy link
Collaborator

alamb commented Dec 13, 2024

I reverted/excluded the commit to turn on view types by default (for arrow processing):

the internal use of view types for array record batch processing: 2d7892b

I recommend:

  1. Include the code change

  2. Changing the influxdb_iox config to turn off the view types by default (e.g. https://github.com/influxdata/influxdb_iox/blob/496b298676097aaf0eb52f67957809f147ed4a56/datafusion_util/src/config.rs#L25-L29). Specifically set datafusion.execution.parquet.schema_force_view_types to false

  3. File a ticket tracking the work needed to turn it on by default in IOx

The rationale to do this is to keep our fork as close as possible to DataFusion main (and we can then work on using the default value of the datafusion.execution.parquet.schema_force_view_types setting without having to change DataFusion versions too

(we can also do this in a follow on PR)

@wiedld
Copy link
Collaborator Author

wiedld commented Dec 13, 2024

I recommend:

1. Include the code change

2. Changing the influxdb_iox config to turn off the view types by default (e.g. https://github.com/influxdata/influxdb_iox/blob/496b298676097aaf0eb52f67957809f147ed4a56/datafusion_util/src/config.rs#L25-L29). Specifically set `datafusion.execution.parquet.schema_force_view_types` to `false`

3. File a ticket tracking the work needed to turn it on by default in IOx

@alamb -- Agreed that we should file this as a tech debt ticket in iox, to get us onto view types. However, I'm not sure whether our "tech debt state" should be the revert upstream commit -- or figuring out how to turn it off via the config.

Yesterday I tried turning off via the config first, but was still getting CI errors. Here is an example WIP showing that we need more than just setting the config. (Or maybe I'm missing another point of where the config should be set? 🤔 )

Specifically, the CI errors were the bits mentioned in the PR description. It was still attempting to use utf8view types and failing the iox schema check. I could modify the check, but then it failed on the various iox-type to arrow-type mappings. Etc. It became clear that this was a larger thing, to be done in it's own (atomically revertable) pr IMO.

@wiedld
Copy link
Collaborator Author

wiedld commented Dec 13, 2024

Specifically, the CI errors were the bits mentioned in the PR description. It was still attempting to use utf8view types and failing the iox schema check. I could modify the check, but then it failed on the various iox-type to arrow-type mappings. Etc. It became clear that this was a larger thing, to be done in it's own (atomically revertable) pr IMO.

I just realized that this^^ observation may mean that we have an upstream DF bug. Since when turning off view types, we were still getting view types. 🤦🏼‍♀️ (Or I'm missing a config setting somewhere in iox.)

@wiedld
Copy link
Collaborator Author

wiedld commented Dec 13, 2024

And another thing that bugs me in upstream. There are a whole bunch of duplicate places where the ParquetTableOptions are provided in the session. Which is why I'm not sure if I'm (a) still missing a place to set the config (even after changing all of these places), or (b) there is a bug in upstream where DF is still internally trying to use view types when we turned it off. 🤷🏼‍♀️

@alamb
Copy link
Collaborator

alamb commented Dec 14, 2024

And another thing that bugs me in upstream. There are a whole bunch of duplicate places where the ParquetTableOptions are provided in the session.

What I normally do in this case is go start making small PRs upstream to make the situation better

Which is why I'm not sure if I'm (a) still missing a place to set the config (even after changing all of these places), or (b) there is a bug in upstream where DF is still internally trying to use view types when we turned it off. 🤷🏼‍♀️

I likewise don't know -- can you give me a hint about how I can help you debug this (aka how do I reproduce the issue?)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants