Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test: show a mismatch for initcap between Spark and DataFusion #1051

Closed
wants to merge 1 commit into from

Conversation

Blizzara
Copy link
Contributor

@Blizzara Blizzara commented Nov 4, 2024

This PR shows the problem by adding two cases to the test, one using a dash and one using non-ascii letters (from Finnish).

== Results ==
!== Correct Answer - 7 ==       == Spark Answer - 7 ==
 struct<initcap(name):string>   struct<initcap(name):string>
 [James Smith]                  [James Smith]
 [James Smith]                  [James Smith]
![James Ähtäri]                 [James äHtäRi]
 [Michael Rose]                 [Michael Rose]
 [Rames Rose]                   [Rames Rose]
![Robert Rose-smith]            [Robert Rose-Smith]
 [Robert Williams]              [Robert Williams]

Closes #1052

```
== Results ==
!== Correct Answer - 7 ==       == Spark Answer - 7 ==
 struct<initcap(name):string>   struct<initcap(name):string>
 [James Smith]                  [James Smith]
 [James Smith]                  [James Smith]
![James Ähtäri]                 [James äHtäRi]
 [Michael Rose]                 [Michael Rose]
 [Rames Rose]                   [Rames Rose]
![Robert Rose-smith]            [Robert Rose-Smith]
 [Robert Williams]              [Robert Williams]
 ```
@kazuyukitanimura
Copy link
Contributor

#1052 should be already fixed with the DF44 release. Would you like to rebase and re-trigger this test? @Blizzara

@kazuyukitanimura kazuyukitanimura changed the title show a mismatch for initcap between Spark and DataFusion test: show a mismatch for initcap between Spark and DataFusion Jan 9, 2025
@andygrove
Copy link
Member

andygrove commented Jan 10, 2025

I tested this PR locally and the test fails. We will likely need to implement a custom version that matches Spark's logic.

Spark                           Comet
![Robert Rose-smith]            [Robert Rose-Smith]

@kazuyukitanimura
Copy link
Contributor

This issue has been outstanding for a while, we should perhaps disable initcap for now

@parthchandra
Copy link
Contributor

Depending on which 'style' guide you follow initcap behavior could or could not capitalize the second part of a hyphenated word (or name).
I doubt if initcap can even be properly defined for English, let alone all other languages.
I feel it would be absolutely fine to either keep the current behavior (and modify the test) or fall back to Spark, whichever the committers prefer. Custom implementation of initcap would be an impossible task.

@kazuyukitanimura
Copy link
Contributor

FYI @Blizzara I included this test in #1276

@Blizzara
Copy link
Contributor Author

Thanks @kazuyukitanimura !

FWIW, for our internal stuff, I did make a modified copy of DFs initcap, basically just replacing prev_is_alphanumeric = c.is_alphanumeric() with previous_character_space = c == ' ';. I'm sure there are edge cases where it fails to match spark, but for us it felt good enough.

@Blizzara Blizzara closed this Jan 14, 2025
@kazuyukitanimura
Copy link
Contributor

@Blizzara please feel free to reopen once the proper fix is ready

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Initcap behaves differently in Spark and in DataFusion (also Comet)
4 participants