[FEA] Optionally support titlecase for capitalize #14144

Open
revans2 opened this issue Sep 20, 2023 · 7 comments
Assignees
Labels
0 - Backlog In queue waiting for assignment feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS strings strings issues (C++ and Python)

Comments

@revans2
Contributor

revans2 commented Sep 20, 2023

Is your feature request related to a problem? Please describe.
Spark has a method called initcap. We implemented this using strings::capitalize, but recently ran into some problems because initcap converts the first letter to its title-case form, not its uppercase form.

https://unicode.org/faq/casemap_charprop.html#4

Most of the time they are the same, but there are a few cases where they are not, and ß is one of them. I would love an option for capitalize that uses title case instead of upper case. A separate initcap function that uses title case would also be great.
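
For readers unfamiliar with the distinction, here is a minimal sketch in plain Python (not libcudf or Spark code). The digraph ǆ (U+01C6) is an example chosen for illustration and does not appear in the issue; it is one of the letters that has a dedicated title-case form distinct from its uppercase form.

```python
# Illustrative only: plain Python strings, not libcudf or Spark.
small = "\u01c6"            # LATIN SMALL LETTER DZ WITH CARON

upper = small.upper()       # U+01C4: both letters of the digraph capitalized
title = small.title()       # U+01C5: only the first letter capitalized

print(f"upper: U+{ord(upper):04X}  title: U+{ord(title):04X}")

# For most letters the two mappings agree.
print("a".upper() == "a".title())
```

The first line of output shows two different code points, which is the kind of mismatch described above.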

@revans2 revans2 added feature request New feature or request Needs Triage Need team to review and classify labels Sep 20, 2023
@revans2 revans2 added the Spark Functionality that helps Spark RAPIDS label Sep 20, 2023
@davidwendt
Contributor

For reference, here is a list of the suspect characters: https://www.compart.com/en/unicode/category/Lt
It seems cudf::strings::capitalize() should handle these by default (no special option) if possible.

@davidwendt davidwendt self-assigned this Sep 20, 2023
@revans2
Contributor Author

revans2 commented Sep 21, 2023

Great to hear that CUDF will do it by default. I am a little concerned because ß is the one that bit us in our testing, but it does not show up in https://www.compart.com/en/unicode/category/Lt
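
The observation that ß is missing from the Lt list can be checked locally. The following is a rough sketch using Python's standard unicodedata module; the helper name titlecase_letters is hypothetical, and the exact count depends on the Unicode version bundled with the interpreter.

```python
import unicodedata

def titlecase_letters():
    """Return every code point whose general category is Lt (titlecase letter)."""
    return [cp for cp in range(0x110000)
            if unicodedata.category(chr(cp)) == "Lt"]

lt = titlecase_letters()
print(len(lt), "Lt code points in this Python's Unicode database")
print("U+00DF (sharp s) in Lt:", ord("\u00df") in lt)
print("U+01C5 in Lt:", 0x01C5 in lt)
print("category of U+00DF:", unicodedata.category("\u00df"))
```

ß is an ordinary lowercase letter (category Ll), so a fix driven only by the Lt category would not cover it.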

@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment strings strings issues (C++ and Python) libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Sep 27, 2023
@GregoryKimball GregoryKimball moved this to To be revisited in libcudf Nov 27, 2023
@davidwendt
Contributor

davidwendt commented Jan 8, 2024

So the ß character looks to be a separate special case.
The upper-case of ß is actually SS (two capital S's) which the code already supports:

>>> import cudf
>>> s = 'ßeta'
>>> s.upper()
'SSETA'
>>> gs = cudf.Series([s])
>>> gs.str.upper()
0    SSETA

But it looks like when capitalizing ß the second S is not upper-cased in Python:

>>> s.capitalize()
'Sseta'
>>> gs.str.capitalize()
0    SSeta

I've not been able to find documentation on this behavior, so I would be curious to know what Spark expects when capitalizing ß.
I did a quick test with the capitalize() function from org.apache.commons.lang3.StringUtils and got a different result as well. Also, the upperCase() and String.toUpperCase() functions both return SSETA.
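
A possible explanation for the Python result: the Unicode special-casing data (SpecialCasing.txt) gives ß a two-character title-case mapping, "Ss", alongside its uppercase mapping "SS", and since Python 3.8 str.capitalize() title-cases the first character rather than upper-casing it. The sketch below is plain Python; capitalize_sketch is a hypothetical helper that mimics that rule and is not the CPython implementation.

```python
ch = "\u00df"  # LATIN SMALL LETTER SHARP S

# Full (possibly multi-character) case mappings, shown as code points.
print("upper:", [hex(ord(c)) for c in ch.upper()])
print("title:", [hex(ord(c)) for c in ch.title()])

def capitalize_sketch(s):
    """Title-case the first character and lower-case the rest."""
    return s[:1].title() + s[1:].lower()

print(capitalize_sketch("\u00dfeta"))
```

Under this reading, the libcudf output SSeta corresponds to upper-casing the first character, while Python's Sseta corresponds to title-casing it.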

@revans2
Contributor Author

revans2 commented Jan 8, 2024

val df = Seq("ßeta", "Sseta").toDF
df.selectExpr("value", "upper(value)", "lower(value)", "initcap(value)", "lower(upper(value))").show()
+-----+------------+------------+--------------+-------------------+
|value|upper(value)|lower(value)|initcap(value)|lower(upper(value))|
+-----+------------+------------+--------------+-------------------+
| ßeta|       SSETA|        ßeta|          ßeta|              sseta|
|Sseta|       SSETA|       sseta|         Sseta|              sseta|
+-----+------------+------------+--------------+-------------------+

I hope that this helps. Strings in Spark are kind of special, as they wrote their own UTF8String implementation:
upper is UTF8String.toUpperCase,
lower is UTF8String.toLowerCase, and
initcap is UTF8String.toLowerCase.toTitleCase.
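
The table above is consistent with a title-case step that only applies one-to-one character mappings, as Java's Character.toTitleCase does. The following Python sketch emulates that behavior under two assumptions that the thread does not confirm: that Spark's toTitleCase uses only single-character mappings, and that words are separated by a single space. The names simple_title and initcap_sketch are hypothetical.

```python
def simple_title(ch):
    """Approximate a one-to-one title-case mapping.

    Python only exposes the full mapping, so a character whose full
    title-case form expands to several characters is left unchanged.
    """
    t = ch.title()
    return t if len(t) == 1 else ch

def initcap_sketch(s):
    """Lower-case the string, then title-case the first character of each word."""
    out = []
    at_word_start = True
    for ch in s.lower():
        out.append(simple_title(ch) if at_word_start else ch)
        at_word_start = (ch == " ")
    return "".join(out)

# Compare against the initcap(value) column in the table above.
print(initcap_sketch("\u00dfeta") == "\u00dfeta")
print(initcap_sketch("Sseta") == "Sseta")
print(initcap_sketch("hello wORLD"))
```

This reproduces both rows of the initcap(value) column, but it is only an approximation and has not been checked against Spark for other inputs.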

@davidwendt
Contributor

The initcap() result appears to match what I see with org.apache.commons.lang3.StringUtils.capitalize(); both just pass the ß character through unchanged.

I found a few more characters that are not part of the titlecase Unicode definition and behave like ß:

ß     (223) -> SS (83,83)      : Ss (83,115)
և    (1415) -> ԵՒ (1333,1362)  : Եւ (1333,1410)
ﬀ   (64256) -> FF (70,70)      : Ff (70,102)
ﬁ   (64257) -> FI (70,73)      : Fi (70,105)
ﬂ   (64258) -> FL (70,76)      : Fl (70,108)
ﬃ   (64259) -> FFI (70,70,73)  : Ffi (70,102,105)
ﬄ   (64260) -> FFL (70,70,76)  : Ffl (70,102,108)
ﬅ   (64261) -> ST (83,84)      : St (83,116)
ﬆ   (64262) -> ST (83,84)      : St (83,116)
ﬓ   (64275) -> ՄՆ (1348,1350)  : Մն (1348,1398)
ﬔ   (64276) -> ՄԵ (1348,1333)  : Մե (1348,1381)
ﬕ   (64277) -> ՄԻ (1348,1339)  : Մի (1348,1387)
ﬖ   (64278) -> ՎՆ (1358,1350)  : Վն (1358,1398)
ﬗ   (64279) -> ՄԽ (1348,1341)  : Մխ (1348,1389)

The Python (and Pandas) output for capitalize() (which also matches title()) is included above after the colon. Generally, where upper() produces multiple characters, capitalize() (and title()) lower-case every character after the first.
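
A table like the one above can be regenerated with a short scan. This is a sketch in plain Python, assuming the selection rule "uppercase expands to several characters and differs from the title-case form"; that rule, the BMP-only limit, and the function name expanding_title_differs are choices made for illustration, so the scan may report more rows than the ones listed. Only code points are printed, to avoid terminal encoding issues.

```python
def expanding_title_differs(limit=0x10000):
    """Find characters whose upper() expands and differs from title()."""
    found = []
    for cp in range(limit):
        ch = chr(cp)
        up, ti = ch.upper(), ch.title()
        if len(up) > 1 and up != ti:
            found.append((cp, up, ti))
    return found

def codes(s):
    """Render a string as a comma-separated list of decimal code points."""
    return ",".join(str(ord(c)) for c in s)

rows = expanding_title_differs()
for cp, up, ti in rows:
    print(f"{cp:>6} -> ({codes(up)}) : ({codes(ti)})")
```

Each output row has the same shape as the table: the code point, its uppercase expansion, and its title-case expansion.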

But all of these pass through unchanged with org.apache.commons.lang3.StringUtils.capitalize() so I suspect the same pass through result from initcap() for these as well.

Regardless, the libcudf result matches neither, so the inclination is to fix it to match the Python/Pandas result.
I was also able to verify that the C++ Boost Locale library supports these characters and matches the Python results as well.
The boost::locale class is implemented using the ICU library, which provides a rich set of globalization functions for software applications.

@revans2
Contributor Author

revans2 commented Jan 22, 2024

Sorry I have not been following this as closely as I should.

@davidwendt so the proposal is to make the CUDF code match python/pandas, but not Spark?

@sameerz if that is true then we will need to write a custom kernel for initcap for Spark.

@revans2
Contributor Author

revans2 commented Jan 22, 2024

Just FYI: From a Spark perspective I found 265 characters that produce different values between the CPU implementation and the GPU one. Their code points are:

(223, 304, 329, 452, 454, 455, 457, 458, 460, 496, 497, 499, 604, 609, 618, 620, 642, 647, 669, 670, 912, 944, 1011, 1012, 1321, 1323, 1325, 1327, 1415, 4304, 4305, 4306, 4307, 4308, 4309, 4310, 4311, 4312, 4313, 4314, 4315, 4316, 4317, 4318, 4319, 4320, 4321, 4322, 4323, 4324, 4325, 4326, 4327, 4328, 4329, 4330, 4331, 4332, 4333, 4334, 4335, 4336, 4337, 4338, 4339, 4340, 4341, 4342, 4343, 4344, 4345, 4346, 4349, 4350, 4351, 5112, 5113, 5114, 5115, 5116, 5117, 7296, 7297, 7298, 7299, 7300, 7301, 7302, 7303, 7304, 7566, 7830, 7831, 7832, 7833, 7834, 7838, 8016, 8018, 8020, 8022, 8064, 8065, 8066, 8067, 8068, 8069, 8070, 8071, 8080, 8081, 8082, 8083, 8084, 8085, 8086, 8087, 8096, 8097, 8098, 8099, 8100, 8101, 8102, 8103, 8114, 8115, 8116, 8118, 8119, 8130, 8131, 8132, 8134, 8135, 8146, 8147, 8150, 8151, 8162, 8163, 8164, 8166, 8167, 8178, 8179, 8180, 8182, 8183, 8486, 8490, 8491, 42649, 42651, 42900, 42903, 42905, 42907, 42909, 42911, 42933, 42935, 42937, 42939, 42941, 42943, 42947, 43859, 43888, 43889, 43890, 43891, 43892, 43893, 43894, 43895, 43896, 43897, 43898, 43899, 43900, 43901, 43902, 43903, 43904, 43905, 43906, 43907, 43908, 43909, 43910, 43911, 43912, 43913, 43914, 43915, 43916, 43917, 43918, 43919, 43920, 43921, 43922, 43923, 43924, 43925, 43926, 43927, 43928, 43929, 43930, 43931, 43932, 43933, 43934, 43935, 43936, 43937, 43938, 43939, 43940, 43941, 43942, 43943, 43944, 43945, 43946, 43947, 43948, 43949, 43950, 43951, 43952, 43953, 43954, 43955, 43956, 43957, 43958, 43959, 43960, 43961, 43962, 43963, 43964, 43965, 43966, 43967, 64256, 64257, 64258, 64259, 64260, 64261, 64262, 64265, 64266, 64267, 64268, 64269, 64275, 64276, 64277, 64278, 64279)
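
To get a feel for what kinds of characters are in this list, a few of the code points can be looked up by name and general category. This is a small sketch with Python's unicodedata module; the helper name describe and the choice of sample code points are for illustration only.

```python
import unicodedata

def describe(code_points):
    """Return (code point, Unicode name, general category) for each entry."""
    rows = []
    for cp in code_points:
        ch = chr(cp)
        rows.append((cp, unicodedata.name(ch, "<unnamed>"),
                     unicodedata.category(ch)))
    return rows

# A handful of code points taken from the list above.
sample = [223, 304, 452, 4304, 43888, 64256]
for cp, name, cat in describe(sample):
    print(f"{cp:>6}  {cat}  {name}")
```

The sample spans several scripts (Latin, Georgian, Cherokee) and both upper- and lower-case letters, so the differences are not limited to the ß-style expansions discussed earlier.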
