Automatic import from CLDR #1107

Open
wants to merge 6 commits into
base: master


c960657 (Contributor) commented Oct 27, 2023

I suggest we fetch translations from CLDR when possible. This reduces the maintenance burden and ensures generally high translation quality, including for locales with few or no active contributors.

This PR adds an automatic importer based on ruby-cldr. The PR implements only the lowest-hanging fruit. It may be possible to fetch even more data from CLDR, but that requires some further fiddling, e.g. with date format patterns.

Obviously, this will introduce some changes to existing translations. These changes mainly fall into these categories:

  1. There may be several valid translations for a given string. Everybody has their own personal preference, but the translations used in CLDR are usually a pretty good choice.
  2. The strings in CLDR are more likely to use periods for abbreviated weekday/month names. This is a matter of taste, but in most cases it is probably linguistically more correct.
  3. CLDR uses non-breaking spaces in formatting patterns to a larger extent than rails-i18n. This is probably fine.
  4. The native en locale in Rails defines some strings in title case, e.g. “Million” and “Bytes”, even though these are often used within a sentence rather than as standalone labels. I would personally add these strings in their natural case, but for now I have converted them using upcase_first to be consistent with Rails. However, this changes the locales that have chosen to deviate from the capitalization in en. Either way, the convention should be consistent across all locales.
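
To make the casing conversion in item 4 concrete, here is a minimal sketch in plain Ruby. The helper name upcase_first_plain and the sample unit names are hypothetical; it only approximates what ActiveSupport's String#upcase_first does and is not the importer's actual code.

```ruby
# Hypothetical stand-in for ActiveSupport's String#upcase_first:
# upper-cases only the first character and leaves the rest untouched.
def upcase_first_plain(str)
  return str if str.empty?

  str[0].upcase + str[1..-1]
end

# Illustrative unit names (hand-written, not read from CLDR data).
units = { "million" => "million", "billion" => "milliard" }

# Apply the conversion to every value, as an importer might do before merging.
converted = units.transform_values { |value| upcase_first_plain(value) }

converted.each { |key, value| puts "#{key}: #{value}" }
```

A locale that deliberately keeps these strings in lower case would be overwritten by such a step, which is the side effect described in item 4.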

I hope these changes are not too controversial. Otherwise, let's discuss how we can adjust the import logic.

How to

To build all files, run these commands:

bin/thor cldr:download
bin/thor cldr:build
bin/thor cldr:dump | bin/i18n-tasks data-merge
bin/thor locales:normalize_all

If you only want to update certain strings, use tree-filter:

bin/thor cldr:dump | bin/i18n-tasks tree-filter -p 'number.human.storage_units.*' | bin/i18n-tasks data-merge

i18n-tasks doesn't seem to handle the locales stored in a subdirectory well. It creates duplicates of the files in rails/locale/iso-639-2 in rails/locale. Delete these manually.
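
The manual cleanup could be assisted by a small script. The following is only a sketch: the method name duplicated_locale_files is hypothetical, and it assumes that each stray copy in rails/locale has the same file name as its original in rails/locale/iso-639-2. It lists candidates and deletes nothing.

```ruby
# Returns the paths in `locale_dir` whose file names also exist in `sub_dir`.
# The default directories are taken from the paths mentioned above; the
# "same file name" matching rule is an assumption.
def duplicated_locale_files(locale_dir = "rails/locale", sub_dir = "rails/locale/iso-639-2")
  sub_names = Dir.glob(File.join(sub_dir, "*.yml")).map { |path| File.basename(path) }

  Dir.glob(File.join(locale_dir, "*.yml"))
     .select { |path| sub_names.include?(File.basename(path)) }
     .sort
end

# Print the candidates so they can be reviewed before being removed by hand.
duplicated_locale_files.each { |path| puts path }
```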

pama (Collaborator) commented Oct 28, 2023

@c960657 I appreciate your effort here, but I'm not inclined to accept your PR.

It will change many i18n files and may break the overall stability.

I'm not confident that the changes introduced are accurate. Based on the pt.yml modifications, I can tell you that in Portugal, we don't use the full stop in the month and day abbreviations. The proposed changes for units are incorrect, as you can see in this commit: 0b8193b

Also, after a quick once-over, I see spaces after commas being removed and other situations that raise my concerns.

c960657 (Contributor, Author) commented Oct 31, 2023

You are right; these changes cannot be committed in bulk. However, I still think the overall idea has some merit.

The tool may be useful for translators, pointing out typos and suggesting alternatives. Any changes can be submitted as separate PRs.

Below are some cherry-picked examples that look like errors at a quick glance (they need further investigation). These are the kinds of issues that this tool will help translators find.

diff --git a/rails/locale/af.yml b/rails/locale/af.yml
index 6529b9a..e91f3e9 100644
--- a/rails/locale/af.yml
+++ b/rails/locale/af.yml
@@ -48,7 +48,7 @@ af:
     - Februarie
     - Maart
     - April
-    - Mai
+    - Mei
     - Junie
     - Julie
     - Augustus
@@ -100,7 +100,7 @@ af:
         one: "%{count} jaar"
         other: "%{count} jare"
     prompts:
-      second: Sekondes
+      second: Sekonde
       minute: Minuut
       hour: Uur
       day: Dag
diff --git a/rails/locale/el.yml b/rails/locale/el.yml
index 448000f..4de13c6 100644
--- a/rails/locale/el.yml
+++ b/rails/locale/el.yml
@@ -22,7 +22,7 @@ el:
     - Φεβ
     - Μαρ
     - Απρ
-    - Μαϊ
+    - Μαΐ
     - Ιουν
     - Ιουλ
     - Αυγ
diff --git a/rails/locale/fr.yml b/rails/locale/fr.yml
index fbdb8c5..6efc02a 100644
--- a/rails/locale/fr.yml
+++ b/rails/locale/fr.yml
@@ -177,11 +177,19 @@ fr:
       decimal_units:
         format: "%n %u"
         units:
-          billion: milliard
-          million: million
+          billion:
+            one: milliard
+            other: milliards
+          million:
+            one: million
+            other: millions
           quadrillion: million de milliards
-          thousand: millier
-          trillion: billion
+          thousand:
+            one: millier
+            other: mille
+          trillion:
+            one: billion
+            other: billions
           unit: ''
       format:
         delimiter: ''
diff --git a/rails/locale/it.yml b/rails/locale/it.yml
index c0066a7..e0a0584 100644
--- a/rails/locale/it.yml
+++ b/rails/locale/it.yml
@@ -100,7 +100,7 @@ it:
         one: "%{count} anno"
         other: "%{count} anni"
     prompts:
-      second: Secondi
+      second: Secondo
       minute: Minuto
       hour: Ora
       day: Giorno

The tool would also be useful for adding initial translations of new strings, assuming that a translation imported directly from CLDR is better than no translation at all.

> I'm not confident that the changes introduced are accurate. Based on the pt.yml modifications, I can tell you that in Portugal, we don't use the full stop in the month and day abbreviations.

I don't know anything about Portuguese, but I can see that both Google Sheets and the native Calendar app on iPhone use a period in Portuguese abbreviations, so I would assume that variant is not completely wrong?

But in any case, some of these strings, date formats in particular, exist in several valid, widespread variants, and for stability reasons I agree we should not change them unnecessarily.

> The proposed changes for units are incorrect, as you can see in this commit: 0b8193b

This problem occurs because Portugal Portuguese has the language code pt_PT in CLDR and pt in rails-i18n. CLDR has the proper translations for pt_PT (see https://github.com/unicode-org/cldr/blob/d17bf3c/common/main/pt_PT.xml#L5412-L5437). We could make a mapping table for the few locales whose codes differ between the two projects.
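
Such a mapping table could look roughly like the sketch below. The constant CLDR_LOCALE_OVERRIDES and the method cldr_locale_for are hypothetical names, and only the pt to pt_PT pair comes from this discussion; any other entry would have to be checked against both projects first.

```ruby
# Hypothetical mapping from rails-i18n locale codes to CLDR locale codes.
# Only "pt" => "pt_PT" is taken from the discussion; nothing else is assumed.
CLDR_LOCALE_OVERRIDES = {
  "pt" => "pt_PT"
}.freeze

# Returns the CLDR locale code to import from for a given rails-i18n locale.
# Locales without an override are passed through unchanged.
def cldr_locale_for(rails_locale)
  code = rails_locale.to_s
  CLDR_LOCALE_OVERRIDES.fetch(code, code)
end

puts cldr_locale_for("pt")
puts cldr_locale_for("fr")
```

Keeping the overrides in one frozen hash means the exceptions stay visible and reviewable in a single place.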

pama (Collaborator) commented Nov 16, 2023

Your code's merit isn't in question. This is more about aligning with the project's philosophy and objectives.

We usually accept contributions as they are, unless someone (including maintainers) raises concerns. That can lead to further discussion or to a request for input from native speakers.

This PR suggests that CLDR translations are preferable to our existing ones and proposes a bulk update, which is why I won't merge it. You're welcome to submit individual PRs for each translation change. In cases where I'm unsure, I'll ask for feedback from native speakers.

However, if you're considering introducing a new tool that compares with the CLDR project's suggestions, I wouldn't oppose its integration. This tool would ideally offer options to:

  • Compare all our translations with those from the CLDR project and show the differences like git diff.
  • Enable users to select specific language translations in the terminal for comparison with their CLDR counterparts.
  • Merge translations by adding missing ones and prioritizing those from the CLDR, and then incorporate these changes into the Rails project, copying the updated file to config/locales (without changing the translations offered by rails-i18n).

It would be a nice tool to have.
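
As a rough illustration of the first option only (comparison, not merging), a diff between two translation trees could be sketched as follows. The method names flatten_keys and compare_translations are hypothetical, and the sample data is hand-written, loosely echoing the it.yml example earlier in this thread; a real tool would load both sides from YAML files.

```ruby
# Flattens a nested translation hash into dotted keys,
# e.g. {"date" => {"day" => "Dag"}} becomes {"date.day" => "Dag"}.
def flatten_keys(tree, prefix = nil)
  tree.each_with_object({}) do |(key, value), flat|
    path = [prefix, key].compact.join(".")
    if value.is_a?(Hash)
      flat.merge!(flatten_keys(value, path))
    else
      flat[path] = value
    end
  end
end

# Reports every CLDR key whose value differs from ours or is missing on our
# side. Keys that exist only in the current translations are not reported.
def compare_translations(current, cldr)
  ours = flatten_keys(current)
  theirs = flatten_keys(cldr)

  theirs.each_with_object({}) do |(key, value), report|
    next if ours[key] == value

    report[key] = { current: ours[key], cldr: value }
  end
end

# Hand-written sample data, not real file contents.
current = { "datetime" => { "prompts" => { "second" => "Secondi", "minute" => "Minuto" } } }
cldr    = { "datetime" => { "prompts" => { "second" => "Secondo", "minute" => "Minuto", "hour" => "Ora" } } }

compare_translations(current, cldr).each do |key, pair|
  puts "#{key}: #{pair[:current].inspect} -> #{pair[:cldr].inspect}"
end
```

For the sample data, second is reported as differing and hour as missing from the current translations, while minute is skipped because both sides agree.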

About the full stop: in Portuguese, abbreviating words usually involves adding a full stop. However, there is at least one exception, when the abbreviation is the last word in a sentence. Unfortunately, Rails i18n doesn't support these kinds of exceptions. IMHO, it's better to omit the full stop in the Portuguese translations and let users manage these specific cases themselves.

Translations are not easy, and we can't always approach them strictly following language rules or international standards. This is especially true in a project like ours that is used in unknown contexts, and we want to keep it flexible.


2 participants