CLDR-11888 Update French speakers #3985

conradarcturus · 2024-08-27T17:27:31Z

This change makes multiple updates to the French-speaking population figures listed in CLDR. Thankfully, much of this data was compiled in a comprehensive meta-study by the International Francophone Organization (OIF), with many collaborators such as the University of Laval. Some estimates come from other sources consulted while vetting the OIF estimates.

Jira Tickets

There are 4 CLDR tickets about French population estimates fixed by this change:

- Syria CLDR-11885 -- French is no longer an official language of Syria. Unfortunately I could not find a valid population estimate, so I left the number as-is.
- Haiti CLDR-11886
- DRCongo CLDR-11887
- Djibouti CLDR-11888 -- Note that after this change it will recognize that French IS the biggest language of Djibouti, not Afar. It's surprising but backed up by usage data.

Sources

I couldn't accept all data points from that study (as Mark Davis recommended, I sought corroborating sources) -- leaving out Burundi, Cameroon, Mauritius, Germany, Portugal, Belgium, and Andorra because of large, uncertain discrepancies.

Furthermore, I used the Canadian census directly for Canada, and the Swiss census website (updating all of them because it wasn't that much and the estimates were very different).

I didn't cite every primary source, since the OIF sometimes added up numbers from multiple sources and I didn't have time to open every cited census thoroughly. The Eurostat website contains many surveys that were compiled by the French organization -- the website for one of them is https://ec.europa.eu/eurostat/web/microdata/adult-education-survey .

Effects

Likely subtags (based on the relative population)
- French is now the largest language in 2 countries
  - Democratic Republic of Congo (replacing Swahili)
  - Djibouti (replacing Afar)
- French is no longer the largest language in 3 countries
  - Central African Republic (replaced by Sango)
  - Senegal (replaced by Wolof)
  - Chad (replaced by Arabic)
French is no longer considered official for Syria
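
To illustrate why a population update can flip the likely subtag for a country, here is a minimal Python sketch. It assumes a simple "largest population wins" rule; the real GenerateLikelySubtags logic is more involved, and the percentages below are made up for illustration, not taken from the data.

```python
def largest_language(percent_by_language):
    """Return the language code with the highest population percentage."""
    return max(percent_by_language, key=percent_by_language.get)

# Hypothetical before/after percentages for one territory; NOT real CLDR data.
before = {"aa": 40.0, "fr": 20.0}
after = {"aa": 40.0, "fr": 50.0}

print(largest_language(before))  # aa
print(largest_language(after))   # fr
```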

Many countries lost a lot of French users: Mali, Madagascar, Hungary, Niger, Comoros, Chad, and Central African Republic. Since the prior figures did not have citations and other sources tended to agree with the new ones, I made the change.

Some countries gained a lot: D.R. Congo, Djibouti, Haiti, Lebanon, Switzerland, Morocco, Wallis & Futuna, French Polynesia, Rwanda.

For countries that had large changes I double-checked with other sources that it made sense, especially since it will change likely subtags. There could be literacy gaps favoring French -- if you want me to press on that data I can investigate.

Steps

Thanks for reviewing this large change -- it was fun to read so many sources, read in French and German, and learn more about countries around the world.

CLDR-11888

ALLOW_MANY_COMMITS=true

jira-pull-request-webhook · 2024-08-27T18:40:21Z

Notice: the branch changed across the force-push!

common/supplemental/likelySubtags.xml is now changed in the branch

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

macchiati

My assumption is that the likelySubtags changes are caused by running the tool. Is that correct?

macchiati · 2024-08-27T20:58:14Z

common/supplemental/likelySubtags.xml

@@ -566,7 +566,7 @@ not be patched by hand, as any changes made in that fashion may be lost.
 		<likelySubtag from="pko" to="pko_Latn_KE"/>		<!--Pökoot‧?‧?	➡ Pökoot‧Latin‧Kenya-->
 		<likelySubtag from="pl" to="pl_Latn_PL"/>		<!--Polish‧?‧?	➡ Polish‧Latin‧Poland-->
 		<likelySubtag from="pms" to="pms_Latn_IT"/>		<!--Piedmontese‧?‧?	➡ Piedmontese‧Latin‧Italy-->
-		<likelySubtag from="pnt" to="pnt_Grek_GR"/>		<!--Pontic‧?‧?	➡ Pontic‧Greek‧Greece-->
+		<likelySubtag from="pnt" to="pnt_Cyrl_GR"/>		<!--Pontic‧?‧?	➡ Pontic‧Cyrillic‧Greece-->


This seems bizarre, based on the Wikipedia article: I'd expect Grek.

I believe this is caused by jitter in ConvertLanguageData. Right now pnt has the scripts Grek Cyrl Latn without language population estimates, so running ConvertLanguageData re-sorts them alphabetically into Cyrl Grek Latn, and GenerateLikelySubtags picks up the new order.

I can look into getting language population data to stabilize this and avoid these swaps.

I grabbed Pontic Greek data per country. It's roughly Grek (200k in Greece), Cyrl (50k in Russia and Ukraine), Latn (5k in Turkey).

However, I guess these lists are not intentionally ordered. It's not correct to assume the first script listed here is the most common one.
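
To make the ordering issue concrete: sorting the script codes alphabetically puts Cyrl first, while sorting by the rough speaker counts above puts Grek first. A small Python sketch; the function names and data layout are illustrative assumptions, not the actual ConvertLanguageData code.

```python
# Rough Pontic (pnt) speaker counts per script, from the estimates above.
pnt_scripts = {"Grek": 200_000, "Cyrl": 50_000, "Latn": 5_000}

def order_alphabetically(scripts):
    """Order script codes alphabetically (what the re-sort effectively does)."""
    return sorted(scripts)

def order_by_population(scripts):
    """Order script codes by population, largest first; ties break alphabetically."""
    return sorted(scripts, key=lambda code: (-scripts[code], code))

print(order_alphabetically(pnt_scripts))  # ['Cyrl', 'Grek', 'Latn']
print(order_by_population(pnt_scripts))   # ['Grek', 'Cyrl', 'Latn']
```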

macchiati · 2024-08-27T21:02:11Z

common/supplemental/supplementalData.xml

@@ -1979,7 +1979,7 @@ XXX Code for transations where no currency is involved
 		<language type="mzn" scripts="Arab"/>
 		<language type="mzn" territories="IR" alt="secondary"/>
 		<language type="na" scripts="Latn" territories="NR"/>
-		<language type="nan" scripts="Hans"/>
+		<language type="nan" scripts="Hans" territories="TW"/>


It seems just slightly odd that these use Simplified in TW.

It is odd -- it's automatically picked up from ConvertLanguageData though, hmm...

Ah, it's coming from hak being listed as official for Taiwan without a script specified, and somehow it's resolved as Hans.

macchiati · 2024-08-27T21:23:05Z

common/supplemental/supplementalData.xml

@@ -2735,9 +2735,9 @@ XXX Code for transations where no currency is involved
 		</territory>
 		<territory type="CH" gdp="733800000000" literacyPercent="99" population="8860570">	<!--Switzerland-->
 			<languagePopulation type="de" populationPercent="73" officialStatus="official"/>	<!--German-->
+			<languagePopulation type="fr" populationPercent="67" officialStatus="official" references="R1335"/>	<!--French-->


This is very high. While living in Switzerland (Zürich), my experience was that just a small percentage of German Swiss had sufficient French to be comfortable using it in UIs. Many had some years in school, but that didn't translate into adult competency. When German Swiss had conversations with French Swiss, it was far more likely to be in English. Now, it is likely a higher percentage in the cantons nearest (or split across) Romandy, but I just don't see the figures adding up to 67%.

Not at all hard evidence, but an AI responded

"The information regarding the percentage of the Swiss population who speak fluent French or can get by with the language comes from a combination of sources, including:

Swiss Federal Statistical Office: This official government agency collects and publishes data on various aspects of Swiss society, including language use. Their surveys and censuses provide reliable information on the language proficiency of the population.

Academic Research and Publications: Studies conducted by linguists and sociologists specializing in Switzerland often include data on language distribution and proficiency. These studies provide a more nuanced understanding of language use in the country.

Reputable News Outlets and Online Resources: Articles and reports from trusted news sources and websites that focus on Switzerland often cite statistics on language demographics. These sources can help corroborate information from official and academic sources.

While these sources may not always perfectly align in their exact figures, they generally converge on the estimate that approximately 23% of the Swiss population speaks fluent French and roughly 28% can get by with the language, including those who are fluent."

The French source I used reported 67%, but it was counting L2 speakers, not strictly filtered by literacy. FWIW most of my Swiss friends (from either side of the border) can speak French, but I wouldn't consider them a representative sample. I can look for other sources for Switzerland though, since they will be very thorough.

conradarcturus · 2024-08-28T16:19:20Z

Changing the merge target to v47 and also will investigate making the document more stable -- I'll pursue these population count changes later.

conradarcturus · 2024-08-28T16:34:12Z

My assumption is that the likelySubtags changes are caused by running the tool. Is that correct?

Yea, from running java -jar tools/cldr-code/target/cldr-code.jar ConvertLanguageData

srl295 · 2024-08-29T16:29:09Z

Changing the merge target to v47 and also will investigate making the document more stable -- I'll pursue these population count changes later.

you need to also rebase it on v47.

srl295 · 2024-08-29T16:32:57Z

I've updated the v47 branch.

macchiati · 2024-08-29T17:26:04Z

I think https://www.francophonie.org/sites/default/files/2021-04/LFDM-20Edition-2019-La-langue-fran%C3%A7aise-dans-le-monde.pdf is overstating the case for French. I've seen the same mistakes show up in other sources over-estimating English competency (e.g. in Mauritius), so francophones are certainly not alone!

Our aim for the language population is to provide two figures: the overall population of reasonably competent L1+L2 speakers, and the "literacy" percentage of those. The 'literacy' percentage should really reflect usage, being a proxy for something like 'weekly active readers of the language'. When a language is written in multiple scripts, it should also reflect the script usage. For example, if the language xx could be written in both Latin and Cyrillic, we'd expect to see xx_Latn and xx_Cyrl (one of them would be just xx, if that script is the default for xx globally).

The 'competency' is very roughly "if the person were literate, could they read and understand an application UI in the language, including help messages, instructions for usage, etc."

Now, we rarely get a lot of detailed information about language capabilities; sometimes the figures we see include just L1, sometimes also L2, and the sources typically don't say which. So we have to use a fair amount of judgement in assessing various reports. For languages that are rarely written, such as Swiss German, in the absence of good information we tend to get a literacy value of 5%.
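
As a hedged illustration of how the two figures combine, the sketch below assumes the reader estimate is simply territory population × speaker percentage × literacy percentage. The function name is hypothetical and the sample inputs are placeholders, not vetted data.

```python
def estimated_readers(territory_population, population_percent, literacy_percent):
    """Approximate number of people who actively read the language.

    population_percent: share of the territory that are competent L1+L2 speakers.
    literacy_percent: share of those speakers who regularly read the language.
    """
    speakers = territory_population * population_percent / 100
    return speakers * literacy_percent / 100

# Placeholder inputs for illustration only (a rarely written language at 5%).
readers = estimated_readers(1_000_000, 60, 5)
print(round(readers))  # 30000
```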

srl295 · 2024-09-09T17:24:50Z

still needs rebase to reflect the branch change.

jira-pull-request-webhook · 2024-10-25T17:31:07Z

Hooray! The files in the branch are the same across the force-push. 😃

~ Your Friendly Jira-GitHub PR Checker Bot

jira-pull-request-webhook · 2024-10-29T17:02:59Z

Notice: the branch changed across the force-push!

common/supplemental/likelySubtags.xml is different
common/supplemental/supplementalData.xml is different
common/testData/localeIdentifiers/likelySubtags.txt is now changed in the branch
common/testData/localeIdentifiers/localeDisplayName.txt is now changed in the branch
tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/country_language_population.tsv is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

macchiati

The Swiss numbers are very suspect, which makes me doubt the other numbers.

		<languagePopulation type="fr" populationPercent="67" officialStatus="official" references="R1030"/>	<!--French-->

I think this is coming from https://www.francophonie.org/sites/default/files/2021-04/LFDM-20Edition-2019-La-langue-fran%C3%A7aise-dans-le-monde.pdf page 32

It is a bit confusing, having Suisse romande as 81%, but that is 81% of Swiss Romandy, which is ~20-24% of the overall population, or about 19% of the country.
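
A quick check of that conversion in Python, assuming the 81% applies uniformly to a Romandy share of 20-24% of the national population (the function name is just for illustration):

```python
def national_share(regional_rate, region_share_of_country):
    """Convert a rate within a region into a percentage of the whole country."""
    return regional_rate * region_share_of_country * 100

for romandy_share in (0.20, 0.24):
    pct = national_share(0.81, romandy_share)
    print(f"Romandy at {romandy_share:.0%} of the population -> {pct:.1f}% of the country")
```

The upper end of the range gives roughly 19%, matching the figure quoted above.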

Then it lists Suisse separately as 67%. (Does that mean the whole country? The remainder?) But either way that is far too high.

Take the Swiss government figures:

https://www.bfs.admin.ch/bfs/de/home/statistiken/bevoelkerung/sprachen-religionen/sprachen.html under "Regelmässig verwendete Sprachen" ("regularly used languages"), under Details, lists French as ~38.8%.

[Aside: Under "Learned Languages", it lists 15.0% for French, and for native French speakers, 22.80%. So if we count native + "learned" we'd get:

22.8% native
77.2% non-native, and of that, 15% learned French; multiplying, we get 11.58% (~12%)
total: 34.38%]

So even the higher of these two figures is 38.8%, not 67%.
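
The arithmetic in the aside can be checked with a few lines of Python, assuming (as the aside does) that the 15% "learned" share applies only to non-native speakers:

```python
native = 22.8          # % native French speakers
learned_rate = 15.0    # % who learned French, applied to non-native speakers only
regularly_used = 38.8  # % who regularly use French
oif_figure = 67.0      # % listed for Switzerland in the OIF report

non_native = 100 - native
learned = non_native * learned_rate / 100
native_plus_learned = native + learned

print(round(learned, 2))              # 11.58
print(round(native_plus_learned, 2))  # 34.38
print(max(native_plus_learned, regularly_used) < oif_figure)  # True
```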

That is much more like what I experienced in Switzerland; many people might have had French in school but are not comfortable using it.

FYI, comparable figures for English are: 45% regularly used.

So I'm inclined to take the figures from https://www.francophonie.org/sites/default/files/2021-04/LFDM-20Edition-2019-La-langue-fran%C3%A7aise-dans-le-monde.pdf with a big grain of salt, and only use them if we have corroborating evidence.

conradarcturus · 2024-10-30T14:32:38Z

Thanks @macchiati for checking the data -- yea it definitely looks suspect. I was just re-basing this diff since it was on a pretty stale branch. I'll need to interrogate the data better. That Swiss government source looks great.

CLDR-11888 Update French speakers

https://unicode-org.atlassian.net/browse/CLDR-11888 was created to update the French speakers for Djibouti, but while I was researching that I found many other Francophone countries where the French-speaking population was significantly underestimated. Most of those gaps probably come from the number being L1 users, but the point of this file is L1+L2 users -- basically how many people in each country could use an interface in this language.

See the original data in: https://www.francophonie.org/sites/default/files/2021-04/LFDM-20Edition-2019-La-langue-fran%C3%A7aise-dans-le-monde.pdf

mvn package -DskipTests=true
java -jar tools/cldr-code/target/cldr-code.jar ConvertLanguageData
java -jar tools/cldr-code/target/cldr-code.jar GenerateLikelySubtags

CLDR-11888 Redo automated scripts after merge conflicts

jira-pull-request-webhook · 2024-11-05T00:14:04Z

Notice: the branch changed across the force-push!

common/supplemental/likelySubtags.xml is different
common/supplemental/supplementalData.xml is different
common/testData/localeIdentifiers/likelySubtags.txt is different
common/testData/localeIdentifiers/localeDisplayName.txt is no longer changed in the branch
tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/country_language_population.tsv is different

View Diff Across Force-Push

~ Your Friendly Jira-GitHub PR Checker Bot

Fixed :D

macchiati

Great work!

conradarcturus requested review from srl295 and macchiati August 27, 2024 17:27

github-actions bot assigned conradarcturus Aug 27, 2024

conradarcturus force-pushed the CLDR-11888-Add-L2-French-Speakers-for-Djibouti branch from 838e118 to d6f6b7f Compare August 27, 2024 18:40

macchiati reviewed Aug 27, 2024

conradarcturus marked this pull request as draft August 28, 2024 16:18

conradarcturus changed the base branch from main to ddl/v47 August 28, 2024 16:18

conradarcturus had a problem deploying to cloudflare September 23, 2024 23:39 — with GitHub Actions Failure

conradarcturus had a problem deploying to cloudflare September 23, 2024 23:40 — with GitHub Actions Failure

srl295 deleted the branch unicode-org:main October 25, 2024 16:35

srl295 closed this Oct 25, 2024

srl295 reopened this Oct 25, 2024

srl295 changed the base branch from _ddl/v47 to main October 25, 2024 17:17

srl295 force-pushed the CLDR-11888-Add-L2-French-Speakers-for-Djibouti branch from 6549966 to 8b37042 Compare October 25, 2024 17:31

conradarcturus force-pushed the CLDR-11888-Add-L2-French-Speakers-for-Djibouti branch from 8b37042 to 7c74957 Compare October 29, 2024 17:02

macchiati previously requested changes Oct 29, 2024

conradarcturus added 2 commits November 4, 2024 15:55

conradarcturus marked this pull request as ready for review November 4, 2024 23:55

conradarcturus requested a review from macchiati November 4, 2024 23:55

CLDR-11888 Fixed some things for tests

420c663

conradarcturus force-pushed the CLDR-11888-Add-L2-French-Speakers-for-Djibouti branch from 7c74957 to 420c663 Compare November 5, 2024 00:14

macchiati approved these changes Nov 5, 2024

conradarcturus merged commit 78ff1ae into unicode-org:main Nov 5, 2024
9 checks passed

srl295 mentioned this pull request Nov 6, 2024

CLDR-11570 Update Ancient Greek writing systems #4058

Merged

1 task
