CLDR-11888 Update French speakers #3985
838e118
to
d6f6b7f
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
My assumption is that the likelySubtags changes are caused by running the tool. Is that correct?
@@ -566,7 +566,7 @@ not be patched by hand, as any changes made in that fashion may be lost. | |||
<likelySubtag from="pko" to="pko_Latn_KE"/> <!--Pökoot‧?‧? ➡ Pökoot‧Latin‧Kenya--> | |||
<likelySubtag from="pl" to="pl_Latn_PL"/> <!--Polish‧?‧? ➡ Polish‧Latin‧Poland--> | |||
<likelySubtag from="pms" to="pms_Latn_IT"/> <!--Piedmontese‧?‧? ➡ Piedmontese‧Latin‧Italy--> | |||
<likelySubtag from="pnt" to="pnt_Grek_GR"/> <!--Pontic‧?‧? ➡ Pontic‧Greek‧Greece--> | |||
<likelySubtag from="pnt" to="pnt_Cyrl_GR"/> <!--Pontic‧?‧? ➡ Pontic‧Cyrillic‧Greece--> |
This seems bizarre, based on the Wikipedia article: I'd expect Grek.
I believe this is caused by jitter in `ConvertLanguageData`. `pnt` has the scripts `Grek Cyrl Latn` right now, without language population estimates. It gets re-sorted into alphabetical `Cyrl Grek Latn` when running `ConvertLanguageData`, and `GenerateLikelySubtags` picks up the new order.
I can look into getting language population data to stabilize this and avoid these swaps.
I grabbed Pontic Greek data per country. It's roughly Grek (200k in Greece), Cyrl (50k in Russia and Ukraine), Latn (5k in Turkey).
However, I guess these lists are not intentionally ordered. It's not correct to assume that the first script listed here is the most common one.
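To illustrate why the `pnt` default flipped, here is a minimal sketch using the rough per-script speaker counts quoted above. The function names are hypothetical and this is not the actual logic of `ConvertLanguageData` or `GenerateLikelySubtags`; it only contrasts taking the first entry of an alphabetically sorted list with choosing by population.

```python
# Rough Pontic Greek (pnt) speaker counts per script, as quoted in this thread.
script_speakers = {"Grek": 200_000, "Cyrl": 50_000, "Latn": 5_000}


def default_script_alphabetical(speakers):
    """Hypothetical: take the first script after an alphabetical re-sort."""
    return sorted(speakers)[0]


def default_script_by_population(speakers):
    """Hypothetical: take the script with the most speakers, ignoring list order."""
    return max(speakers, key=speakers.get)


alphabetical = default_script_alphabetical(script_speakers)
by_population = default_script_by_population(script_speakers)
print(alphabetical, by_population)
```

With population data attached, the choice no longer depends on how the script list happens to be ordered.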
@@ -1979,7 +1979,7 @@ XXX Code for transations where no currency is involved | |||
<language type="mzn" scripts="Arab"/> | |||
<language type="mzn" territories="IR" alt="secondary"/> | |||
<language type="na" scripts="Latn" territories="NR"/> | |||
<language type="nan" scripts="Hans"/> | |||
<language type="nan" scripts="Hans" territories="TW"/> |
It seems just slightly odd that these use Simplified in TW.
It is odd -- it's automatically picked up from `ConvertLanguageData` though, hmm...
Ah, it's coming from `hak` being listed as `official` for Taiwan without a script specified, and somehow it gets resolved as `Hans`.
@@ -2735,9 +2735,9 @@ XXX Code for transations where no currency is involved | |||
</territory> | |||
<territory type="CH" gdp="733800000000" literacyPercent="99" population="8860570"> <!--Switzerland--> | |||
<languagePopulation type="de" populationPercent="73" officialStatus="official"/> <!--German--> | |||
<languagePopulation type="fr" populationPercent="67" officialStatus="official" references="R1335"/> <!--French--> |
This is very high. While living in Switzerland (Zürich), my experience was that just a small percentage of German Swiss had sufficient French to be comfortable using it in UIs. Many had some years in school, but that didn't translate into adult competency. When German Swiss had conversations with French Swiss, it was far more likely to be in English. Now, it is likely a higher percentage in the cantons nearest (or split across) Romandy, but I just don't see the figures adding up to 67%.
Not at all hard evidence, but an AI responded
"The information regarding the percentage of the Swiss population who speak fluent French or can get by with the language comes from a combination of sources, including:
Swiss Federal Statistical Office:
This official government agency collects and publishes data on various aspects of Swiss society, including language use. Their surveys and censuses provide reliable information on the language proficiency of the population.
Academic Research and Publications: Studies conducted by linguists and sociologists specializing in Switzerland often include data on language distribution and proficiency. These studies provide a more nuanced understanding of language use in the country.
Reputable News Outlets and Online Resources: Articles and reports from trusted news sources and websites that focus on Switzerland often cite statistics on language demographics. These sources can help corroborate information from official and academic sources.
While these sources may not always perfectly align in their exact figures, they generally converge on the estimate that approximately 23% of the Swiss population speaks fluent French and roughly 28% can get by with the language, including those who are fluent."
The French source I used expected 67%, but it was counting L2 speakers, not strictly filtered by literacy. FWIW most of my Swiss friends (from either side of the border) can speak French but I wouldn't consider them a representative sample. I can look for other sources for Switzerland though since they will be very thorough.
Changing the merge target to v47 and also will investigate making the document more stable -- I'll pursue these population count changes later. |
Yea, from running |
you need to also rebase it on v47. |
I've updated the v47 branch. |
I think https://www.francophonie.org/sites/default/files/2021-04/LFDM-20Edition-2019-La-langue-fran%C3%A7aise-dans-le-monde.pdf
is overstating the case for French. I've seen the same mistakes show up for
other sources over-estimating English competency (e.g. in Mauritius), so
francophones are certainly not alone!
Our aim for the language population is to provide two figures, the overall
population of reasonably competent L1+L2 speakers, and the "literacy"
percentage of those. The 'literacy' percentage should really reflect usage,
being a proxy for something like 'weekly active readers of the language'.
When a language is written in multiple scripts, then it should also reflect
the script usage. For example, if the language xx could be written in both
Latin and Cyrillic, we'd expect to see xx_Latn and xx_Cyrl (one of them
would be just xx, if that script is the default for xx globally).
The 'competency' is very roughly "if the person were literate, could they
read and understand an application UI in the language, including help
messages, instructions for usage, etc."
Now, we rarely get a lot of detailed information about language
capabilities; sometimes the figures we see include just L1, sometimes also
L2, and the sources typically don't say which. So we have to use a fair
amount of judgement in assessing various reports. For languages that are
rarely written, such as Swiss German, in the absence of good information we
tend to use a literacy value of 5%.
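As a rough sketch of how the two figures combine, the snippet below parses the `CH` fragment from the diff above and derives an estimated speaker count and an estimated reader count. The `estimate_readers` helper is hypothetical, and falling back to the territory-level `literacyPercent` when a language has no value of its own is an assumption made for illustration, not a statement of how the CLDR tooling behaves.

```python
import xml.etree.ElementTree as ET

# Fragment copied from the diff in this PR (comments dropped for brevity).
FRAGMENT = """
<territory type="CH" gdp="733800000000" literacyPercent="99" population="8860570">
  <languagePopulation type="de" populationPercent="73" officialStatus="official"/>
</territory>
"""


def estimate_readers(territory_xml):
    """Return {language: (estimated speakers, estimated readers)}."""
    territory = ET.fromstring(territory_xml)
    population = int(territory.get("population"))
    territory_literacy = float(territory.get("literacyPercent"))
    result = {}
    for entry in territory.findall("languagePopulation"):
        speakers = population * float(entry.get("populationPercent")) / 100
        # Assumption: prefer a per-language literacyPercent if one is present,
        # otherwise fall back to the territory-wide value.
        literacy = float(entry.get("literacyPercent", territory_literacy))
        result[entry.get("type")] = (round(speakers), round(speakers * literacy / 100))
    return result


result = estimate_readers(FRAGMENT)
print(result)
```

A rarely written language would carry its own low literacy value (such as the 5% mentioned above), which shrinks its reader estimate without changing its speaker estimate.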
still needs rebase to reflect the branch change. |
6549966
to
8b37042
Compare
Hooray! The files in the branch are the same across the force-push. 😃 ~ Your Friendly Jira-GitHub PR Checker Bot |
8b37042
to
7c74957
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
The Swiss numbers are very suspect, which makes me doubt the other numbers.
<languagePopulation type="fr" populationPercent="67" officialStatus="official" references="R1030"/> <!--French-->
I think this is coming from https://www.francophonie.org/sites/default/files/2021-04/LFDM-20Edition-2019-La-langue-fran%C3%A7aise-dans-le-monde.pdf page 32
It is a bit confusing, having Suisse romande as 81%, but that is 81% of Swiss Romandy, which is ~20-24% of the overall population, or about 19% of the country.
Then it lists Suisse separately as 67%. (Does that mean the whole country? The remainder?) But either way that is far too high.
Take the Swiss government figures:
https://www.bfs.admin.ch/bfs/de/home/statistiken/bevoelkerung/sprachen-religionen/sprachen.html under Regelmässig verwendete Sprachen
("regularly used languages"), under Details, lists French as ~38.8%.
[Aside: under "Learned Languages" it lists 15.0% for French, and for native French speakers, 22.80%. So if we count native + "learned" we'd get:
22.8% native
77.2% non-native, and of that, 15% learned French; multiplying, we get 11.58% (about 12%)
total: 34.38%]
So even the higher of these two figures is 38.8%, not 67%.
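The arithmetic in the aside can be checked with a few lines. This is only a sketch of the calculation described above; the variable names are made up, and the inputs are the percentages quoted from the Swiss government page and the OIF report.

```python
# Percentages quoted in this comment.
native = 22.8          # native French speakers
learned_rate = 15.0    # share of non-native speakers who learned French
regularly_used = 38.8  # "regularly used languages" figure for French
oif_figure = 67.0      # figure taken from the OIF report

non_native = 100.0 - native
learned = non_native * learned_rate / 100
native_plus_learned = native + learned

print(f"learned: {learned:.2f}%")
print(f"native + learned: {native_plus_learned:.2f}%")
print(f"higher of the two estimates: {max(regularly_used, native_plus_learned):.1f}%")
```

Both estimates derived from the government data sit well below the OIF figure.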
That is much more like what I experienced in Switzerland; many people might have had French in school but are not comfortable using it.
FYI, comparable figures for English are: 45% regularly used.
So I'm inclined to take the figures from https://www.francophonie.org/sites/default/files/2021-04/LFDM-20Edition-2019-La-langue-fran%C3%A7aise-dans-le-monde.pdf with a big grain of salt, and only use them if we have corroborating evidence.
Thanks @macchiati for checking the data -- yea it definitely looks suspect. I was just re-basing this diff since it was on a pretty stale branch. I'll need to interrogate the data better. That Swiss government source looks great. |
CLDR-11888 Update French speakers

https://unicode-org.atlassian.net/browse/CLDR-11888 was created to update the French speakers for Djibouti, but while I was researching that I found many other Francophone countries that significantly underestimated French populations. Most of those gaps probably come from the number being L1 users, but the point of this file is L1+L2 users -- basically how many people in each country could use an interface in this language.

See the original data in: https://www.francophonie.org/sites/default/files/2021-04/LFDM-20Edition-2019-La-langue-fran%C3%A7aise-dans-le-monde.pdf

mvn package -DskipTests=true
java -jar tools/cldr-code/target/cldr-code.jar ConvertLanguageData
java -jar tools/cldr-code/target/cldr-code.jar GenerateLikelySubtags

CLDR-11888 Redo automated scripts after merge conflicts
This change makes multiple updates to the French-speaking population listed in CLDR. Thankfully, much of this data was compiled in a comprehensive meta-study by the International Francophone Organization (OIF), with many collaborators such as the University of Laval. Some estimates come from other sources consulted while vetting the OIF estimates.

There are 4 CLDR tickets about French population estimates fixed by this change:

* Syria [CLDR-11885](https://unicode-org.atlassian.net/browse/CLDR-11885) -- French is no longer an official language of Syria. Unfortunately I could not find a valid population estimate, so I left the number as-is.
* Haiti [CLDR-11886](https://unicode-org.atlassian.net/browse/CLDR-11886)
* DRCongo [CLDR-11887](https://unicode-org.atlassian.net/browse/CLDR-11887)
* Djibouti [CLDR-11888](https://unicode-org.atlassian.net/browse/CLDR-11888) -- Note that after this change it will recognize that French IS the biggest language of Djibouti, not Afar. It's surprising, but backed up by usage data.

## Sources

I couldn't accept all data points from that study (as Mark Davis recommended, I sought corroborating sources) -- leaving out Burundi, Cameroon, Mauritius, Germany, Portugal, Belgium, and Andorra because of large, uncertain discrepancies.

Furthermore, I used the Canadian census directly for Canada, and the Swiss census website for Switzerland (updating all of the Swiss figures because it wasn't that much work and the estimates were very different).

I didn't cite every primary source, since sometimes the OIF added up numbers from multiple sources and I didn't have time to thoroughly open every cited census. The Eurostat website contains many surveys that were compiled by the French organization -- the website for one of them is here: https://ec.europa.eu/eurostat/web/microdata/adult-education-survey .

## Effects

* Likely subtags (based on the relative population)
  * French is now the largest language in 2 countries
    * Democratic Republic of Congo (replacing Swahili)
    * Djibouti (replacing Afar)
  * French is now no longer the largest language in 3 countries
    * Central African Republic (replaced by Sango)
    * Senegal (replaced by Wolof)
    * Chad (replaced by Arabic)
* French is no longer considered official for Syria

Many countries lost a lot of French users: Mali, Madagascar, Hungary, Niger, Comoros, Chad, and Central African Republic. Since the prior figures did not have citations and other sources tended to agree, I made the change.

Some countries gained a lot: D.R. Congo, Djibouti, Haiti, Lebanon, Switzerland, Morocco, Wallis & Futuna, French Polynesia, Rwanda.

For countries that had large changes I double-checked with other sources that it made sense, especially since it will change likely subtags. There could be literacy gaps favoring French -- if you want me to press on that data I can investigate.

Thanks for reviewing this large change -- it was fun to read so many sources, read in French and German, and learn more about countries around the world.
7c74957
to
420c663
Compare
Notice: the branch changed across the force-push!
~ Your Friendly Jira-GitHub PR Checker Bot |
Great work!
CLDR-11888
ALLOW_MANY_COMMITS=true