Add utf8.lower & utf8.upper #2164

Open. Wants to merge 2 commits into base: master.

Conversation

Be1zebub
Contributor

@Be1zebub Be1zebub commented Nov 28, 2024

The implementation is based on a mapping table.
This is a compromise: performance and simplicity in exchange for 48 KB of memory.
If that is unacceptable, I suggest implementing this with native functions, but that is beyond the scope of the Lua repo.
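
For readers unfamiliar with the approach, here is a minimal sketch of a mapping-based lowercase function. It is not the code from this PR: the three table entries are a hypothetical excerpt, the local function name `lower` is illustrative, and it assumes a `utf8` library that provides `utf8.codes` and `utf8.char` (as in Lua 5.3+).

```lua
-- Hypothetical excerpt of a simple-case table: codepoint -> lowercase codepoint.
-- A full table would cover every cased character, which is where the memory goes.
local uc_lc = {
  [0x0041] = 0x0061, -- A -> a
  [0x00C9] = 0x00E9, -- É -> é
  [0x0416] = 0x0436, -- Ж -> ж
}

-- Lowercase a UTF-8 string one codepoint at a time.
-- Codepoints without a table entry pass through unchanged.
local function lower(str)
  local out = {}
  for _, cp in utf8.codes(str) do
    out[#out + 1] = utf8.char(uc_lc[cp] or cp)
  end
  return table.concat(out)
end
```

Each character costs one table lookup, which is why this trades memory for speed and simplicity.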

@CornerPin
Contributor

CornerPin commented Nov 28, 2024

Unicode case mappings are not 1:1. One example is the Greek capital sigma Σ, which has two lowercase forms, σ and ς; which one is chosen depends on whether it is at the end of a word (5.18.2 Complications for Case Mapping).

When building uc_lc, conflicting characters will be overwritten in an undefined order. Here's a list of such characters that are in lc_uc:

Φ (U+03A6) -> φ (U+03C6) | ϕ (U+03D5)
Ρ (U+03A1) -> ρ (U+03C1) | ϱ (U+03F1)
I (U+0049) -> i (U+0069) | ı (U+0131)
S (U+0053) -> ſ (U+017F) | s (U+0073)
NJ (U+01CA) -> nj (U+01CC) | Nj (U+01CB)
Ε (U+0395) -> ϵ (U+03F5) | ε (U+03B5)
DZ (U+01F1) -> dz (U+01F3) | Dz (U+01F2)
Ṡ (U+1E60) -> ṡ (U+1E61) | ẛ (U+1E9B)
Β (U+0392) -> β (U+03B2) | ϐ (U+03D0)
Μ (U+039C) -> µ (U+00B5) | μ (U+03BC)
Κ (U+039A) -> ϰ (U+03F0) | κ (U+03BA)
Θ (U+0398) -> θ (U+03B8) | ϑ (U+03D1)
Σ (U+03A3) -> ς (U+03C2) | σ (U+03C3)
Ι (U+0399) -> ͅ (U+0345) | ι (U+1FBE) | ι (U+03B9)
LJ (U+01C7) -> Lj (U+01C8) | lj (U+01C9)
DŽ (U+01C4) -> dž (U+01C6) | Dž (U+01C5)
Π (U+03A0) -> π (U+03C0) | ϖ (U+03D6)
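
To make the overwrite problem concrete, the sketch below inverts a small lowercase-to-uppercase table and collects every candidate instead of letting later entries replace earlier ones. The table is a hand-picked excerpt of the list above, and `invert` is a hypothetical helper, not a function from the PR.

```lua
-- Excerpt of a lowercase -> uppercase table (entries taken from the list above).
local lc_uc = {
  [0x03C3] = 0x03A3, -- σ -> Σ
  [0x03C2] = 0x03A3, -- ς -> Σ
  [0x0073] = 0x0053, -- s -> S
  [0x017F] = 0x0053, -- ſ -> S
  [0x0061] = 0x0041, -- a -> A (no conflict)
}

-- Invert the table, keeping every lowercase candidate per uppercase codepoint.
-- A plain `inverted[uc] = lc` would keep whichever entry pairs() visits last,
-- and pairs() gives no ordering guarantee.
local function invert(map)
  local candidates = {}
  for lc, uc in pairs(map) do
    candidates[uc] = candidates[uc] or {}
    table.insert(candidates[uc], lc)
  end
  for _, list in pairs(candidates) do
    table.sort(list) -- make the result deterministic
  end
  return candidates
end

local candidates = invert(lc_uc)
for uc, list in pairs(candidates) do
  if #list > 1 then
    print(("U+%04X has %d lowercase candidates"):format(uc, #list))
  end
end
```

Picking one candidate explicitly (for example, the one UnicodeData.txt lists as the simple lowercase mapping) would remove the dependence on iteration order.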

The Unicode Standard provides files such as UnicodeData.txt and SpecialCasing.txt that are relevant to these two functions (4.2 Case). The former defines simple 1:1 context-independent case mappings while the latter defines special mappings that require additional context.
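
As a rough illustration of how the simple mappings could be read from UnicodeData.txt: each record is a semicolon-separated line in which field 0 is the codepoint and fields 12 and 13 are the simple uppercase and simple lowercase mappings (counting from 0). The helper names `parse_record` and `simple_case` below are hypothetical.

```lua
-- Split one UnicodeData.txt record into its semicolon-separated fields.
-- Fields are stored with 0-based indices to match the Unicode documentation.
local function parse_record(line)
  local fields, i = {}, 0
  for field in (line .. ";"):gmatch("(.-);") do
    fields[i] = field
    i = i + 1
  end
  return fields
end

-- Convert a hexadecimal field to a number, or nil if the field is empty.
local function hex(s)
  if s == nil or s == "" then
    return nil
  end
  return tonumber(s, 16)
end

-- Return the codepoint, its simple uppercase mapping and its simple lowercase
-- mapping; a mapping is nil when the record does not define one.
local function simple_case(line)
  local f = parse_record(line)
  return hex(f[0]), hex(f[12]), hex(f[13])
end

local cp, upper, lower = simple_case("0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;")
```

Running `simple_case` over every line of the file would produce both directions of the table directly, without inverting one into the other.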

Perhaps a better way to handle this would be something like

  • utf8.lower(str, locale=nil)
  • utf8.upper(str, locale=nil)

where locale=nil will only perform simple mappings? This is how ICU does it.
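
A minimal sketch of what such a signature could look like, under heavy assumptions: the tables hold only a couple of hypothetical entries, the Turkish rule is simplified to a context-free override (the real rule in SpecialCasing.txt carries extra conditions), and nothing here reproduces ICU's actual API. It assumes a `utf8` library with `utf8.codes` and `utf8.char`.

```lua
-- Hypothetical excerpt of the simple, locale-independent mappings.
local simple_lc = {
  [0x0049] = 0x0069, -- I -> i
  [0x0054] = 0x0074, -- T -> t
}

-- Hypothetical per-locale overrides, applied before the simple mappings.
local locale_lc = {
  tr = { [0x0049] = 0x0131 }, -- Turkish: I -> ı (dotless i), simplified
}

-- With locale == nil (or an unknown locale) only the simple mappings apply.
local function lower(str, locale)
  local overrides = (locale and locale_lc[locale]) or {}
  local out = {}
  for _, cp in utf8.codes(str) do
    out[#out + 1] = utf8.char(overrides[cp] or simple_lc[cp] or cp)
  end
  return table.concat(out)
end
```

Callers that never pass a locale keep the fast table-lookup path, and the extra tables only matter for callers that opt in.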

@Be1zebub
Contributor Author

Be1zebub commented Dec 13, 2024

@CornerPin
I don't think it plays a big role.
I just looked at the Godot code and tested a few other projects, and it seems no one cares about special casing.
It also goes beyond a simple and effective implementation; in that case it would be better to make OS bindings.

P.S. Check PHP: they tried to implement this, but their code is incredibly slow and the feature doesn't even work. I think this is good proof of why we shouldn't implement this.


3 participants