CLDR-11888 Update French speakers #3985

conradarcturus
Contributor

@conradarcturus conradarcturus commented Aug 27, 2024

This change makes multiple updates to the French-speaking population listed in CLDR. Thankfully, much of this data was compiled in a comprehensive meta-study by the International Francophone Organization (OIF), with many collaborators such as the University of Laval. Some estimates come from other sources that I found while vetting the OIF estimates.

Jira Tickets

There are 4 CLDR tickets about French population estimates fixed by this change:

  • Syria CLDR-11885 -- French is no longer an official language of Syria. Unfortunately, I could not find a valid population estimate, so I left the number as-is.
  • Haiti CLDR-11886
  • DRCongo CLDR-11887
  • Djibouti CLDR-11888 -- Note that after this change CLDR will recognize French, not Afar, as the biggest language of Djibouti. It's surprising but backed up by usage data.

Sources

I couldn't accept all data points from that study (as Mark Davis recommended, I sought corroborating sources) -- leaving out Burundi, Cameroon, Mauritius, Germany, Portugal, Belgium, and Andorra because of large, uncertain discrepancies.

Furthermore, I used the Canadian census directly for Canada, and the Swiss census website for Switzerland (updating all of them because it wasn't that much work and the estimates were very different).

I didn't cite every primary source, since the OIF sometimes added up numbers from multiple sources and I didn't have time to thoroughly open every cited census. The Eurostat website contains many surveys that were compiled by the French organization -- the website for one of them is here: https://ec.europa.eu/eurostat/web/microdata/adult-education-survey

Effects

  • Likely subtags (based on the relative population)
    • French is now the largest language in 2 countries
      • Democratic Republic of Congo (replacing Swahili)
      • Djibouti (replacing Afar)
    • French is now no longer the largest language in 3 countries
      • Central African Republic (replaced by Sango)
      • Senegal (replaced by Wolof)
      • Chad (replaced by Arabic)
  • French is no longer considered official for Syria
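
To illustrate what "largest language" means here, the following is a minimal sketch that ranks the languages of one territory by `populationPercent`. It uses the Switzerland entry as it appears in this PR's diff; the function name `rank_languages` is hypothetical, and this is not the actual logic of GenerateLikelySubtags, only an assumption about how relative population could be compared.

```python
import xml.etree.ElementTree as ET

# Switzerland entry copied from this PR's diff (only the two languages shown there).
TERRITORY_XML = """
<territory type="CH" gdp="733800000000" literacyPercent="99" population="8860570">
  <languagePopulation type="de" populationPercent="73" officialStatus="official"/>
  <languagePopulation type="fr" populationPercent="67" officialStatus="official" references="R1335"/>
</territory>
"""


def rank_languages(territory_xml: str) -> list[tuple[str, float]]:
    """Return (language, populationPercent) pairs, largest first.

    Hypothetical helper for illustration; not part of the CLDR tooling.
    """
    territory = ET.fromstring(territory_xml.strip())
    pairs = [
        (lp.get("type"), float(lp.get("populationPercent")))
        for lp in territory.findall("languagePopulation")
    ]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


ranking = rank_languages(TERRITORY_XML)
largest_language = ranking[0][0]
print(ranking)
print(largest_language)
```

With this kind of ranking, a change to a single `populationPercent` value can flip which language comes first for a territory, which is why the population updates in this PR also change likely subtags.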

Many countries lost a lot of French users: Mali, Madagascar, Hungary, Niger, Comoros, Chad, and Central African Republic. Since the prior figures did not have citations and other sources tended to agree with the new ones, I made the change.

Some countries gained a lot: D.R. Congo, Djibouti, Haiti, Lebanon, Switzerland, Morocco, Wallis & Futuna, French Polynesia, Rwanda.

For countries that had large changes I double-checked with other sources that it made sense, especially since it will change likely subtags. There could be literacy gaps favoring French -- if you want me to press on that data I can investigate.

Steps

  • This PR completes 4 tickets.
  • mvn package -DskipTests=true
  • java -jar tools/cldr-code/target/cldr-code.jar ConvertLanguageData
  • java -jar tools/cldr-code/target/cldr-code.jar GenerateLikelySubtags
  • java -jar tools/cldr-code/target/cldr-code.jar GenerateTestData
  • mvn package

Thanks for reviewing this large change -- it was fun to read so many sources, read in French and German, and learn more about countries around the world.

CLDR-11888

ALLOW_MANY_COMMITS=true

@conradarcturus conradarcturus force-pushed the CLDR-11888-Add-L2-French-Speakers-for-Djibouti branch from 838e118 to d6f6b7f Compare August 27, 2024 18:40
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • common/supplemental/likelySubtags.xml is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Member

@macchiati macchiati left a comment

My assumption is that the likelySubtags changes are caused by running the tool. Is that correct?

@@ -566,7 +566,7 @@ not be patched by hand, as any changes made in that fashion may be lost.
<likelySubtag from="pko" to="pko_Latn_KE"/> <!--Pökoot‧?‧? ➡ Pökoot‧Latin‧Kenya-->
<likelySubtag from="pl" to="pl_Latn_PL"/> <!--Polish‧?‧? ➡ Polish‧Latin‧Poland-->
<likelySubtag from="pms" to="pms_Latn_IT"/> <!--Piedmontese‧?‧? ➡ Piedmontese‧Latin‧Italy-->
<likelySubtag from="pnt" to="pnt_Grek_GR"/> <!--Pontic‧?‧? ➡ Pontic‧Greek‧Greece-->
<likelySubtag from="pnt" to="pnt_Cyrl_GR"/> <!--Pontic‧?‧? ➡ Pontic‧Cyrillic‧Greece-->
Member

This seems bizarre, based on the Wikipedia article: I'd expect Grek.

Contributor Author

I believe this is caused by jitter in ConvertLanguageData. pnt currently lists the scripts Grek Cyrl Latn without language population estimates, so it gets re-sorted into alphabetical order (Cyrl Grek Latn) when running ConvertLanguageData, and GenerateLikelySubtags picks up the new order.

I can look into getting language population data to stabilize this and avoid these swaps.

Contributor Author

I grabbed Pontic Greek data per country. It's roughly Grek (200k in Greece), Cyrl (50k in Russia and Ukraine), Latn (5k in Turkey).

However, I guess these lists are not intentionally ordered. It's not correct to assume that the first script listed here is the most common one.
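
The swap discussed in this thread can be reproduced with a small sketch. It assumes, as described above, that scripts without population data fall back to alphabetical order and that the first script in the list wins; the helper `order_scripts` is hypothetical and is not the ConvertLanguageData implementation. The speaker counts are the rough per-script figures quoted in this thread.

```python
# Rough per-script speaker counts for Pontic (pnt) quoted in this thread.
PNT_SCRIPT_SPEAKERS = {"Grek": 200_000, "Cyrl": 50_000, "Latn": 5_000}


def order_scripts(scripts, speakers=None):
    """Order scripts by descending speaker count when known, else alphabetically.

    Hypothetical helper: a sketch of the behaviour described in the thread.
    """
    if speakers is None:
        # No population data: the only stable order available is alphabetical.
        return sorted(scripts)
    # Sort by speaker count (largest first), breaking ties alphabetically.
    return sorted(scripts, key=lambda s: (-speakers.get(s, 0), s))


current = ["Grek", "Cyrl", "Latn"]
alphabetical = order_scripts(current)
by_population = order_scripts(current, PNT_SCRIPT_SPEAKERS)

print(alphabetical)   # Cyrl comes first, matching the unexpected pnt_Cyrl_GR
print(by_population)  # Grek comes first once population data is supplied
```

Supplying population data gives the sort a meaningful key, so the result no longer depends on the incidental order of the list.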

@@ -1979,7 +1979,7 @@ XXX Code for transations where no currency is involved
<language type="mzn" scripts="Arab"/>
<language type="mzn" territories="IR" alt="secondary"/>
<language type="na" scripts="Latn" territories="NR"/>
<language type="nan" scripts="Hans"/>
<language type="nan" scripts="Hans" territories="TW"/>
Member

It seems just slightly odd that these use Simplified in TW.

Contributor Author

It is odd -- it's automatically picked up from ConvertLanguageData though, hmm...

Contributor Author

Ah, it's coming from hak being listed as official for Taiwan without a script specified, and somehow it gets resolved as Hans.

@@ -2735,9 +2735,9 @@ XXX Code for transations where no currency is involved
</territory>
<territory type="CH" gdp="733800000000" literacyPercent="99" population="8860570"> <!--Switzerland-->
<languagePopulation type="de" populationPercent="73" officialStatus="official"/> <!--German-->
<languagePopulation type="fr" populationPercent="67" officialStatus="official" references="R1335"/> <!--French-->
Member

This is very high. While living in Switzerland (Zürich), my experience was that just a small percentage of German Swiss had sufficient French to be comfortable using it in UIs. Many had some years in school, but that didn't translate into adult competency. When German Swiss had conversations with French Swiss, it was far more likely to be in English. Now, it is likely a higher percentage in the cantons nearest (or split across) Romandy, but I just don't see the figures adding up to 67%.

Not at all hard evidence, but an AI responded

"The information regarding the percentage of the Swiss population who speak fluent French or can get by with the language comes from a combination of sources, including:

Swiss Federal Statistical Office: This official government agency collects and publishes data on various aspects of Swiss society, including language use. Their surveys and censuses provide reliable information on the language proficiency of the population.

Academic Research and Publications: Studies conducted by linguists and sociologists specializing in Switzerland often include data on language distribution and proficiency. These studies provide a more nuanced understanding of language use in the country.

Reputable News Outlets and Online Resources: Articles and reports from trusted news sources and websites that focus on Switzerland often cite statistics on language demographics. These sources can help corroborate information from official and academic sources.

While these sources may not always perfectly align in their exact figures, they generally converge on the estimate that approximately 23% of the Swiss population speaks fluent French and roughly 28% can get by with the language, including those who are fluent."

Contributor Author

The French source I used expected 67%, but it was counting L2 speakers, not strictly filtered by literacy. FWIW most of my Swiss friends (from either side of the border) can speak French but I wouldn't consider them a representative sample. I can look for other sources for Switzerland though since they will be very thorough.

@conradarcturus conradarcturus marked this pull request as draft August 28, 2024 16:18
@conradarcturus conradarcturus changed the base branch from main to ddl/v47 August 28, 2024 16:18
@conradarcturus
Contributor Author

Changing the merge target to v47 and also will investigate making the document more stable -- I'll pursue these population count changes later.

@conradarcturus
Contributor Author

> My assumption is that the likelySubtags changes are caused by running the tool. Is that correct?

Yea, from running java -jar tools/cldr-code/target/cldr-code.jar ConvertLanguageData

@srl295
Member

srl295 commented Aug 29, 2024

> Changing the merge target to v47 and also will investigate making the document more stable -- I'll pursue these population count changes later.

you need to also rebase it on v47.

@srl295
Member

srl295 commented Aug 29, 2024

I've updated the v47 branch.

@macchiati
Member

macchiati commented Aug 29, 2024 via email

@srl295
Member

srl295 commented Sep 9, 2024

still needs rebase to reflect the branch change.

@srl295 srl295 deleted the branch unicode-org:main October 25, 2024 16:35
@srl295 srl295 closed this Oct 25, 2024
@srl295 srl295 reopened this Oct 25, 2024
@srl295 srl295 changed the base branch from _ddl/v47 to main October 25, 2024 17:17
@srl295 srl295 force-pushed the CLDR-11888-Add-L2-French-Speakers-for-Djibouti branch from 6549966 to 8b37042 Compare October 25, 2024 17:31
@jira-pull-request-webhook

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

@conradarcturus conradarcturus force-pushed the CLDR-11888-Add-L2-French-Speakers-for-Djibouti branch from 8b37042 to 7c74957 Compare October 29, 2024 17:02
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • common/supplemental/likelySubtags.xml is different
  • common/supplemental/supplementalData.xml is different
  • common/testData/localeIdentifiers/likelySubtags.txt is now changed in the branch
  • common/testData/localeIdentifiers/localeDisplayName.txt is now changed in the branch
  • tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/country_language_population.tsv is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Member

@macchiati macchiati left a comment

The Swiss numbers are very suspect, which makes me doubt the other numbers.

		<languagePopulation type="fr" populationPercent="67" officialStatus="official" references="R1030"/>	<!--French-->

I think this is coming from https://www.francophonie.org/sites/default/files/2021-04/LFDM-20Edition-2019-La-langue-fran%C3%A7aise-dans-le-monde.pdf page 32

It is a bit confusing, having Suisse romande as 81%, but that is 81% of Swiss Romandy, which is ~20-24% of the overall population, or about 19% of the country.

Then it lists Suisse separately as 67%. (Does that mean the whole country? The remainder?) But either way that is far too high.

Take the Swiss government figures:

https://www.bfs.admin.ch/bfs/de/home/statistiken/bevoelkerung/sprachen-religionen/sprachen.html under Regelmässig verwendete Sprachen
("regularly used languages"), under Details, lists French as ~38.8%.

[Aside: under "Learned Languages" the figure for French is 15.0%, and for native French speakers it is 22.80%. So if we count native + "learned" we'd get:

22.8% native
77.2% non-native, and of that, 15% learned French; multiplying, we get about 11.6% (roughly 12%)
total: 34.38%]

So even the higher of these two figures is 38.8%, not 67%.
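
The percentages in this comment can be rechecked with a few lines of arithmetic. This is only a sketch: the variable names are made up for illustration, and it follows the comment's own assumption that the 15% "learned" share applies to the non-native part of the population.

```python
# Figures quoted in this comment (all in percent).
native = 22.8           # native French speakers
learned = 15.0          # listed under "Learned Languages"
regularly_used = 38.8   # listed under "regularly used languages"
oif_figure = 67.0       # figure taken from the OIF report

# Assumption from the comment: "learned" applies only to non-native speakers.
learned_among_non_native = (100 - native) * learned / 100
native_plus_learned = native + learned_among_non_native

# 81% of Suisse romande, which is roughly 20-24% of the overall population.
romandy_rate = 81.0
romandy_low = romandy_rate * 20.0 / 100
romandy_high = romandy_rate * 24.0 / 100

print(round(learned_among_non_native, 2))  # about 11.58
print(round(native_plus_learned, 2))       # about 34.38
print(round(romandy_low, 2), round(romandy_high, 2))  # about 16.2 and 19.44
print(max(native_plus_learned, regularly_used) < oif_figure)  # True
```

Both ways of counting stay well below the 67% in the OIF report, which is the point the comment is making.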

That is much more like what I experienced in Switzerland; many people might have had French in school but are not comfortable using it.

FYI, comparable figures for English are: 45% regularly used.

So I'm inclined to take the figures from https://www.francophonie.org/sites/default/files/2021-04/LFDM-20Edition-2019-La-langue-fran%C3%A7aise-dans-le-monde.pdf with a big grain of salt, and only use them if we have corroborating evidence.

@conradarcturus
Contributor Author

Thanks @macchiati for checking the data -- yea it definitely looks suspect. I was just re-basing this diff since it was on a pretty stale branch. I'll need to interrogate the data better. That Swiss government source looks great.

https://unicode-org.atlassian.net/browse/CLDR-11888 was created to update the French speakers for Djibouti but while I was researching that I found many other Francophone countries that significantly underestimated French populations. Most of those gaps probably come from the number being L1 users but the point of this file is L1+L2 users -- basically how many people in each country could use an interface in this language.

See the original data in:
https://www.francophonie.org/sites/default/files/2021-04/LFDM-20Edition-2019-La-langue-fran%C3%A7aise-dans-le-monde.pdf

mvn package -DskipTests=true
java -jar tools/cldr-code/target/cldr-code.jar ConvertLanguageData
java -jar tools/cldr-code/target/cldr-code.jar GenerateLikelySubtags

CLDR-11888 Redo automated scripts after merge conflicts
This change makes multiple updates to the French-speaking population listed in CLDR. Thankfully, much of this data was compiled in a comprehensive meta-study by the International Francophone Organization (OIF), with many collaborators such as the University of Laval. Some estimates come from other sources that I found while vetting the OIF estimates.

There are 4 CLDR tickets about French population estimates fixed by this change:
*  Syria [CLDR-11885](https://unicode-org.atlassian.net/browse/CLDR-11885) -- French is no longer an official language of Syria. Unfortunately, I could not find a valid population estimate, so I left the number as-is.
*  Haiti [CLDR-11886](https://unicode-org.atlassian.net/browse/CLDR-11886)
*  DRCongo [CLDR-11887](https://unicode-org.atlassian.net/browse/CLDR-11887)
*  Djibouti [CLDR-11888](https://unicode-org.atlassian.net/browse/CLDR-11888) -- Note that after this change CLDR will recognize French, not Afar, as the biggest language of Djibouti. It's surprising but backed up by usage data.

## Sources

I couldn't accept all data points from that study (as Mark Davis recommended, I sought corroborating sources) -- leaving out Burundi, Cameroon, Mauritius, Germany, Portugal, Belgium, and Andorra because of large, uncertain discrepancies.

Furthermore, I used the Canadian census directly for Canada, and the Swiss census website for Switzerland (updating all of them because it wasn't that much work and the estimates were very different).

I didn't cite every primary source since sometimes the OIF added up numbers from multiple sources and I didn't have time to thoroughly open every cited census. The Eurostat website contains many surveys that were compiled by the French organization -- the website for one of them is here https://ec.europa.eu/eurostat/web/microdata/adult-education-survey .

## Effects
* Likely subtags (based on the relative population)
  * French is now the largest language in 2 countries
    * Democratic Republic of Congo (replacing Swahili)
    * Djibouti (replacing Afar)
  * French is now no longer the largest language in 3 countries
    * Central African Republic (replaced by Sango)
    * Senegal (replaced by Wolof)
    * Chad (replaced by Arabic)
* French is no longer considered official for Syria

Many countries lost a lot of French users: Mali, Madagascar, Hungary, Niger, Comoros, Chad, and Central African Republic. Since the prior figures did not have citations and other sources tended to agree with the new ones, I made the change.

Some countries gained a lot: D.R. Congo, Djibouti, Haiti, Lebanon, Switzerland, Morocco, Wallis & Futuna, French Polynesia, Rwanda.

For countries that had large changes I double-checked with other sources that it made sense, especially since it will change likely subtags. There could be literacy gaps favoring French -- if you want me to press on that data I can investigate.

Thanks for reviewing this large change -- it was fun to read so many sources, read in French and German, and learn more about countries around the world.
@conradarcturus conradarcturus marked this pull request as ready for review November 4, 2024 23:55
@conradarcturus conradarcturus force-pushed the CLDR-11888-Add-L2-French-Speakers-for-Djibouti branch from 7c74957 to 420c663 Compare November 5, 2024 00:14
@jira-pull-request-webhook

Notice: the branch changed across the force-push!

  • common/supplemental/likelySubtags.xml is different
  • common/supplemental/supplementalData.xml is different
  • common/testData/localeIdentifiers/likelySubtags.txt is different
  • common/testData/localeIdentifiers/localeDisplayName.txt is no longer changed in the branch
  • tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/country_language_population.tsv is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Member

@macchiati macchiati left a comment

Great work!

@conradarcturus conradarcturus merged commit 78ff1ae into unicode-org:main Nov 5, 2024
9 checks passed