CLDR-15923 load UN Literacy data from new location #3512

Merged 2 commits on Feb 23, 2024
13 changes: 13 additions & 0 deletions docs/dev/brs/codes/literacy/un-literacy.md
@@ -0,0 +1,13 @@
# UN Literacy Data (CLDR BRS)



1. Go to <https://data.un.org/Data.aspx?d=POP&f=tableCode:31>
2. On the left tab under filters:
- under **Area** choose **Total**
- under **Sex** choose **Both Sexes**
3. Click the **Download** button and choose **XML**
4. Save the resultant XML file as `tools/cldr-code/src/main/resources/org/unicode/cldr/util/data/external/un_literacy.xml`
5. Now you can run `AddPopulationData`

> Note: If the format changes, you'll have to modify the `AddPopulationData.loadUnLiteracy()` method.
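
As a rough illustration of what `AddPopulationData` does with the downloaded file (per the diff below): it picks the latest year for each country, sums the `Literate` and `Illiterate` counts across all age groups, and stores `literate / (literate + illiterate)`, which is a fraction between 0 and 1. The sketch below reproduces only that arithmetic. The class name `LiteracySketch`, the age-group labels, and all the numbers are invented for illustration and are not part of the PR.

```java
import java.util.Map;
import java.util.TreeMap;

public class LiteracySketch {
    /** Sums one literacy category ("Literate", "Illiterate", ...) over all age groups. */
    static long total(Map<String, Map<String, Long>> perAge, String category) {
        long sum = 0;
        for (Map<String, Long> counts : perAge.values()) {
            sum += counts.getOrDefault(category, 0L);
        }
        return sum;
    }

    /** Parses a count, dropping any fractional part ("10.7" becomes 10). */
    static long parseCount(String value) {
        String v = value.trim();
        int dot = v.indexOf('.');
        return Long.parseLong(dot >= 0 ? v.substring(0, dot) : v);
    }

    /** Literacy fraction for the latest year; keys are year strings, sorted as strings. */
    static double literacyFraction(TreeMap<String, Map<String, Map<String, Long>>> perYear) {
        Map<String, Map<String, Long>> latest = perYear.lastEntry().getValue();
        long literate = total(latest, "Literate");
        long illiterate = total(latest, "Illiterate");
        // "Unknown" counts are deliberately left out of the denominator.
        return (double) literate / (literate + illiterate);
    }

    public static void main(String[] args) {
        // Hypothetical data: year -> age group -> category -> count.
        TreeMap<String, Map<String, Map<String, Long>>> perYear = new TreeMap<>();
        perYear.put("2001", Map.of("15 - 19", Map.of("Literate", 50L, "Illiterate", 50L)));
        perYear.put(
                "2011",
                Map.of(
                        "15 - 19",
                        Map.of("Literate", parseCount("90"), "Illiterate", parseCount("10.7")),
                        "20 - 24",
                        Map.of("Literate", 70L, "Illiterate", 30L, "Unknown", 5L)));
        System.out.println(literacyFraction(perYear)); // prints 0.8
    }
}
```

Only the 2011 entry contributes here: 160 literate against 40 illiterate gives 0.8. A country whose latest year has no `Literate` or `Illiterate` rows would divide by zero and yield `NaN`, so a format change that renames those categories is worth checking for.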
@@ -3,14 +3,17 @@
import com.ibm.icu.text.ListFormat;
import com.ibm.icu.text.NumberFormat;
import com.ibm.icu.text.UnicodeSet;
import com.ibm.icu.util.Output;
import com.ibm.icu.util.ULocale;
import java.io.IOException;
import java.text.ParseException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.LinkedList;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;
import java.util.regex.Matcher;
@@ -476,45 +479,55 @@ public boolean handle(String line) {
});
}

private static void loadUnLiteracy() throws IOException {
CldrUtility.handleFile(
"external/un_literacy.csv",
new CldrUtility.LineHandler() {
@Override
public boolean handle(String line) {
// Afghanistan,2000, ,28,43,13,,34,51,18
// "Country or area","Year",,"Adult (15+) literacy rate",,,,,,"
// Youth (15-24) literacy rate",,,,
// ,,,Total,Men,Women,,Total,Men,Women
// "Albania",2008,,96,,97,,95,,99,,99,,99
String[] pieces = splitCommaSeparated(line);
if (pieces.length != 14
|| pieces[1].length() == 0
|| !DIGITS.containsAll(pieces[1])) {
return false;
}
String code =
CountryCodeConverter.getCodeFromName(pieces[0], true, missing);
if (code == null) {
return false;
}
if (!StandardCodes.isCountry(code)) {
if (ADD_POP) {
System.out.println("Skipping UN info for: " + code);
}
return false;
}
String totalLiteracy = pieces[3];
if (totalLiteracy.equals("�")
|| totalLiteracy.equals("…")
|| totalLiteracy.isEmpty()) {
return true;
}
double percent = Double.parseDouble(totalLiteracy);
un_literacy.add(code, percent);
return true;
}
});
static void loadUnLiteracy() throws IOException {
for (final Pair<String, Double> p : getUnLiteracy(null)) {
un_literacy.add(p.getFirst(), p.getSecond());
}
}

/**
* @param hadErr on return, true if there were errs
* @return list of code,percent values
* @throws IOException
*/
static List<Pair<String, Double>> getUnLiteracy(Output<Boolean> hadErr) throws IOException {
List<Pair<String, Double>> result = new LinkedList<>();
UnLiteracyParser ulp;
try {
ulp = new UnLiteracyParser().read();
} catch (Throwable t) {
throw new IOException("Could not read UN data " + UnLiteracyParser.UN_LITERACY, t);
}

for (final Map.Entry<String, UnLiteracyParser.PerCountry> e : ulp.perCountry.entrySet()) {
final String country = e.getKey();
final String latest = e.getValue().latest();
final UnLiteracyParser.PerYear py = e.getValue().perYear.get(latest);

Long literate = py.total(UnLiteracyParser.LITERATE);
Long illiterate = py.total(UnLiteracyParser.ILLITERATE);

String code = CountryCodeConverter.getCodeFromName(country, true, missing);
if (code == null) {
if (hadErr != null) {
hadErr.value = true;
}
continue;
}
if (!StandardCodes.isCountry(code)) {
if (ADD_POP) {
System.out.println("Skipping UN info for: " + code);
}
continue;
}
double total = literate + illiterate;
double percent = ((double) literate) / total;
result.add(Pair.of(code, percent));
}
if (result.isEmpty() && hadErr != null) {
// hadErr may be null (loadUnLiteracy() passes null), so guard before writing
hadErr.value = true;
}
return result;
}

static {
@@ -0,0 +1,224 @@
package org.unicode.cldr.tool;

import com.ibm.icu.number.LocalizedNumberFormatter;
import com.ibm.icu.number.NumberFormatter;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Map.Entry;
import java.util.TreeMap;
import org.unicode.cldr.util.XMLFileReader;
import org.unicode.cldr.util.XPathParts;

public class UnLiteracyParser extends XMLFileReader.SimpleHandler {

private static final String VALUE = "Value";
private static final String RELIABILITY = "Reliability";
private static final String LITERACY = "Literacy";
private static final String YEAR = "Year";
private static final String COUNTRY_OR_AREA = "Country or Area";
private static final String AGE = "Age";
static final String LITERATE = "Literate";
static final String ILLITERATE = "Illiterate";
private static final String UNKNOWN = "Unknown";
private static final String TOTAL = "Total";
// Debug stuff
public static void main(String args[]) {
final UnLiteracyParser ulp = new UnLiteracyParser().read();
for (final Entry<String, PerCountry> e : ulp.perCountry.entrySet()) {
final String country = e.getKey();
final String latest = e.getValue().latest();
final PerYear py = e.getValue().perYear.get(latest);

Long literate = py.total(LITERATE);
Long illiterate = py.total(ILLITERATE);
Long unknown = py.total(UNKNOWN);
Long total = py.total(TOTAL);

System.out.println(
country
+ "\t"
+ latest
+ "\t"
+ literate
+ "/"
+ illiterate
+ ", "
+ unknown
+ " = "
+ total);
if ((literate + illiterate + unknown) != total) {
System.out.println(
"- doesn't add up for "
+ country
+ " - total is "
+ (literate + illiterate + unknown));
}
}
}

int recCount = 0;

// Reading stuff
public static final String UN_LITERACY = "external/un_literacy.xml";

UnLiteracyParser read() {
System.out.println("* Reading " + UN_LITERACY);
new XMLFileReader()
.setHandler(this)
.readCLDRResource(UN_LITERACY, XMLFileReader.CONTENT_HANDLER, false);
// get the final record
handleNewRecord();
LocalizedNumberFormatter nf = NumberFormatter.with().locale(Locale.ENGLISH);
System.out.println(
"* Read "
+ nf.format(recCount)
+ " record(s) with "
+ nf.format(perCountry.size())
+ " region(s) from "
+ UN_LITERACY);
return this;
}

// Parsing stuff
@Override
public void handlePathValue(String path, String value) {
if (!path.startsWith("//ROOT/data/record")) {
return;
}
final String field = XPathParts.getFrozenInstance(path).getAttributeValue(-1, "name");
handleField(field, value);
}

@Override
public void handleElement(CharSequence path) {
if ("//ROOT/data/record".equals(path.toString())) {
handleNewRecord();
}
}

// Data ingestion
final Map<String, String> thisRecord = new HashMap<String, String>();

private void handleField(String field, String value) {
final String old = thisRecord.put(field, value);
if (old != null) {
throw new IllegalArgumentException(
"Duplicate field " + field + ", context: " + thisRecord);
}
}

private void handleNewRecord() {
if (!thisRecord.isEmpty() && validate()) {
recCount++;
handleRecord();
}

thisRecord.clear();
}

boolean validate() {
try {
assertEqual("Area", "Total");
assertEqual("Sex", "Both Sexes");

assertPresent(AGE);
assertPresent(COUNTRY_OR_AREA);
assertPresent(LITERACY);
assertPresent(VALUE);
assertPresent(YEAR);
assertPresent(RELIABILITY);

return true;
} catch (Throwable t) {
final String context = thisRecord.toString();
throw new IllegalArgumentException("While parsing " + context, t);
}
}

void assertPresent(String field) {
String value = get(field);
if (value == null) {
throw new NullPointerException("Missing field: " + field);
} else if (value.isEmpty()) {
throw new NullPointerException("Empty field: " + field);
}
}

void assertEqual(String field, String expected) {
assertPresent(field);
String value = get(field);
if (!value.equals(expected)) {
throw new NullPointerException(
"Expected " + field + "=" + expected + " but got " + value);
}
}

private final String get(String field) {
final String value = thisRecord.get(field);
if (value == null) return value;
return value.trim();
}

private void handleRecord() {
final String country = get(COUNTRY_OR_AREA);
final String year = get(YEAR);
final String age = get(AGE);
final String literacy = get(LITERACY);
final String reliability = get(RELIABILITY);
final PerAge pa =
perCountry
.computeIfAbsent(country, (String c) -> new PerCountry())
.perYear
.computeIfAbsent(year, (String y) -> new PerYear())
.perAge
.computeIfAbsent(age, (String a) -> new PerAge());

if (pa.reliability == null) {
pa.reliability = reliability;
} else if (!pa.reliability.equals(reliability)) {
throw new IllegalArgumentException(
"Inconsistent reliability " + reliability + " for " + thisRecord);
}
final Long old = pa.perLiteracy.put(literacy, getLongValue());
if (old != null) {
System.err.println("Duplicate record " + country + " " + year + " " + age);
}
}

private long getLongValue() {
final String value = get(VALUE);
if (value.contains(
".")) { // yes. some of the data has decimal points. Ignoring the fractional part.
return Long.parseLong(value.split("\\.")[0]);
} else {
return Long.parseLong(value);
}
}

final Map<String, PerCountry> perCountry = new TreeMap<String, PerCountry>();

final class PerCountry {
final Map<String, PerYear> perYear = new TreeMap<String, PerYear>();

public String latest() {
final String y[] = perYear.keySet().toArray(new String[0]);
return y[y.length - 1];
}
}

final class PerYear {
final Map<String, PerAge> perAge = new TreeMap<String, PerAge>();

Long total(String literacy) {
return perAge.values().stream()
.map((pa) -> pa.perLiteracy.getOrDefault(literacy, 0L))
.reduce(0L, (Long a, Long b) -> a + b);
}
}

final class PerAge {
final Map<String, Long> perLiteracy = new TreeMap<String, Long>();
String reliability = null;
}
}
@@ -55,6 +55,15 @@ public class XMLFileReader {
private SimpleHandler simpleHandler;

public static class SimpleHandler {
/**
* called when every new element is encountered, with the full path to the element
* (including attributes). Called on leaf and non-leaf elements.
*
* @param path
*/
public void handleElement(CharSequence path) {}

/** Called with an "xpath" of each leaf element */
public void handlePathValue(String path, String value) {}

public void handleComment(String path, String comment) {}
@@ -416,6 +425,7 @@ public void startElement(
startElements.push(tempPath.toString());
chars.setLength(0); // clear garbage
lastIsStart = true;
simpleHandler.handleElement(tempPath);

Member:

This is a little surprising, that its absence was never a problem before... oh, I think I see, the new SimpleHandler.handleElement doesn't currently do anything, and this is in case it eventually does

@srl295 (Member, Author), Feb 23, 2024:

It's a new (internal) API. There wasn't a way (in this API) to register for notification on a new element. The issue comes because the data is of this form:

<a>
   <b c="3"/>
   <b d="4"/>
</a>
<a>
   <b c="5"/>
   <b d="6"/>
</a>

the XPath-based API would only produce:

//a/b[@c="3"]
//a/b[@d="4"]
 (need notification that we're on a new 'a' here)
//a/b[@c="5"]
//a/b[@d="6"]

since there's no identity on <a> this new API lets me reset the parser. Anyway, that's what it's for.

This isn't needed for CLDR's own data, because the <a> above always has a clear identity <a id="something"/> vs <a id="somethingElse"/> so produces a unique xpath. This minor API update lets me reuse our existing convenience functions without writing something entirely new for this single file.
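
To make the point above concrete, here is a minimal sketch using the JDK's plain SAX API rather than CLDR's `XMLFileReader`. It resets its per-record state whenever a new `<a>` element starts, which is the role `handleElement` plays in this PR. The class name `RecordGroupingSketch` and the `<root>` wrapper (added only so the sample is a well-formed document) are assumptions for illustration.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

public class RecordGroupingSketch {
    /** Groups the attributes of each <b> under the <a> element that contains it. */
    static List<Map<String, String>> group(String xml) throws Exception {
        List<Map<String, String>> records = new ArrayList<>();
        DefaultHandler handler =
                new DefaultHandler() {
                    Map<String, String> current = null;

                    @Override
                    public void startElement(
                            String uri, String localName, String qName, Attributes atts) {
                        if ("a".equals(qName)) {
                            // New <a>: start a fresh record. Without this notification
                            // the <b> entries of adjacent records would run together.
                            current = new LinkedHashMap<>();
                            records.add(current);
                        } else if ("b".equals(qName) && current != null) {
                            for (int i = 0; i < atts.getLength(); i++) {
                                current.put(atts.getQName(i), atts.getValue(i));
                            }
                        }
                    }
                };
        SAXParserFactory.newInstance()
                .newSAXParser()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)), handler);
        return records;
    }

    public static void main(String[] args) throws Exception {
        String xml =
                "<root><a><b c=\"3\"/><b d=\"4\"/></a>"
                        + "<a><b c=\"5\"/><b d=\"6\"/></a></root>";
        System.out.println(group(xml)); // prints [{c=3, d=4}, {c=5, d=6}]
    }
}
```

If the `"a".equals(qName)` branch were removed and a single map reused, the second record's `c` and `d` would overwrite the first's, which is the ambiguity described above.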

@macchiati (Member), Feb 23, 2024:

For normal CLDR there must be a distinguishing attribute for the first a and the second a to be different.

In CLDR, the mechanism for such distinctions is to add an _q distinguishing element. That also preserves order where needed. That could be done here as well, for consistency.

Member Author:

@macchiati this is the internal low-level XMLFileReader interface. _q is handled at the next layer up.

Member Author:

normal CLDR data doesn't go through this interface, see other comment.

}

@Override