Cannot compile ASN1 specs with ISO-8859 (Latin or Western) encoding characters present (often in ASN.1 comments) #17

rmwesley · 2024-10-04T06:02:22Z

I am not asking for a fix. Just explaining an issue.

Here is the command I ran on bash:

$ ./tools/pycrate_asn1compile.py -i DSRC_instances_asn1_specs/EN15509/ -o DSRC_instances_asn1_specs/EN15509 -j
./tools/pycrate_asn1compile.py, args error: unable to read input file DSRC_instances_asn1_specs/EN15509/ISO14906Amd(2014)EfcDsrcGenericv5.asn
'utf-8' codec can't decode byte 0x93 in position 10503: invalid start byte

So the ASN.1 file contains bytes that are not valid UTF-8.

In one of the comments in the ISO14906Amd(2014)EfcDsrcGenericv5.asn file, the words UNIX time are surrounded by left (“ = 0x93) and right (” = 0x94) double quotation marks instead of the plain ASCII quotation mark " (0x22). Strictly speaking, these byte values come from Windows-1252, which is often loosely called Latin1; in ISO-8859-1 proper, 0x93 and 0x94 are control codes. The comment looks like so:
“UNIX time”

These bytes are not valid UTF-8.
We find the same kind of issue in EfcDsrcApplicationv5 and AVIAEINumberingAndDataStructures, with double quotation marks as well as other characters such as single quotation marks and dashes (’ = 0x92 and – = 0x96).
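To see which bytes are responsible before editing a file, something like the following could be used. It is only a sketch: `find_invalid_utf8` is a hypothetical helper, not part of pycrate, and it relies on the `start` attribute of Python's `UnicodeDecodeError` to locate each offending byte.

```python
def find_invalid_utf8(data: bytes):
    """Return a list of (offset, byte_value) for bytes that break UTF-8 decoding."""
    bad = []
    pos = 0
    while pos < len(data):
        try:
            data[pos:].decode("utf-8")
            break  # the rest of the buffer is valid UTF-8
        except UnicodeDecodeError as err:
            offset = pos + err.start      # absolute position of the bad byte
            bad.append((offset, data[offset]))
            pos = offset + 1              # resume scanning after it
    return bad


if __name__ == "__main__":
    # Stand-in for the comment described above, with the 0x93/0x94 quote bytes
    sample = b"-- \x93UNIX time\x94 --"
    for offset, value in find_invalid_utf8(sample):
        print(f"offset {offset}: 0x{value:02x}")
```

For a real spec, the bytes would come from `open(path, "rb").read()`.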

A common recommendation is to manually remove the comments from the ASN.1 specs.
Instead, I will simply replace these characters by hand with their ASCII equivalents so that the text is valid UTF-8, and compile the ASN.1 specs from there.
They are just comments after all...
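The manual replacement could also be scripted. The sketch below assumes the stray bytes are Windows-1252 punctuation (0x91 to 0x94, 0x96, 0x97); the mapping table and the function name `asciify_cp1252_punctuation` are illustrative, not anything pycrate provides. Input that already decodes as UTF-8 is returned untouched, because in valid UTF-8 the same byte values occur as continuation bytes of multi-byte characters.

```python
# Windows-1252 punctuation bytes and their plain ASCII stand-ins (assumed mapping)
CP1252_TO_ASCII = {
    0x91: b"'",  # left single quotation mark
    0x92: b"'",  # right single quotation mark
    0x93: b'"',  # left double quotation mark
    0x94: b'"',  # right double quotation mark
    0x96: b"-",  # en dash
    0x97: b"-",  # em dash
}


def asciify_cp1252_punctuation(data: bytes) -> bytes:
    """Replace CP1252 quote/dash bytes with ASCII, only if data is not valid UTF-8."""
    try:
        data.decode("utf-8")
        return data  # already valid UTF-8: leave it alone
    except UnicodeDecodeError:
        pass
    out = bytearray()
    for b in data:
        out += CP1252_TO_ASCII.get(b, bytes([b]))
    return bytes(out)
```

Other high bytes (accented letters, for example) are passed through unchanged, so a file containing them would still need a proper re-encoding step.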

It is impossible to reliably detect 8-bit encodings programmatically, right? Only if the encoding is kept as metadata or noted down somewhere.
If the encoding could be determined, we could then simply do open("myfile", encoding=determined_encoding).
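Short of real detection, one pragmatic approach is to try a short list of candidate encodings in order. This is a heuristic sketch, not a detector: the candidate list and the names `decode_with_fallback` and `read_text_with_fallback` are assumptions made for illustration. Since `latin-1` maps every byte to a character, it never fails and acts as a last resort.

```python
def decode_with_fallback(data: bytes, encodings=("utf-8", "cp1252", "latin-1")):
    """Return (text, encoding_used) for the first candidate encoding that decodes data."""
    for enc in encodings:
        try:
            return data.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the candidate encodings could decode the input")


def read_text_with_fallback(path, encodings=("utf-8", "cp1252", "latin-1")):
    """Read a file as bytes and decode it with the first encoding that works."""
    with open(path, "rb") as fd:
        return decode_with_fallback(fd.read(), encodings)
```

The returned encoding name could then be passed to `open("myfile", encoding=...)` as suggested above. A successful decode only shows that the bytes are legal in that encoding, not that it is the one the author used.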

Just to note, I downloaded the original ASN.1 specs directly from the official ISO site.
Some of the specifications using ISO-8859-1 (Latin1) encoding are https://standards.iso.org/iso/14906/ed-2/ISO14906Amd(2014)EfcDsrcGenericv5.asn, https://standards.iso.org/iso/14906/ed-2/ISO14906Amd(2014)EfcDsrcApplicationv5.asn and https://standards.iso.org/iso/14816/ISO14816%20ASN.1%20repository/ISO14816_AVIAEINumberingAndDataStructures.asn.


mitshell · 2024-10-08T20:04:33Z

I agree that many ASN.1 specs provided here and there contain misencoded (or sometimes simply invalid) characters. This is generally the result of how the work is organized when building a technical standard or specification: different contributions from different companies and regions of the world are all merged in a big Word document, which then is eventually converted to PDF. This is error prone!

On the other hand, the current pycrate ASN.1 compiler tries to decode any input as UTF-8 and fails if it contains a byte sequence that is not valid UTF-8. What could be done is:

- convert wrongly encoded but meaningful characters to their expected UTF-8 encoding.
- drop invalid bytes when they just break the UTF-8 decoding.

In the end, this could let the compiler accept more ASN.1 specs.
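One possible way to express those two behaviours in Python is a custom decoding error handler registered through `codecs.register_error`. This is a sketch under assumptions, not the change actually made in pycrate: the handler name `cp1252_or_drop` is invented here, and treating undecodable bytes as Windows-1252 is a guess about where such bytes usually come from.

```python
import codecs


def cp1252_or_drop(err):
    """Decoding error handler: reinterpret bad bytes as CP1252, drop the rest."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    pieces = []
    for b in err.object[err.start:err.end]:
        try:
            # Meaningful but wrongly encoded character: convert it
            pieces.append(bytes([b]).decode("cp1252"))
        except UnicodeDecodeError:
            # Byte undefined in CP1252 as well: drop it
            pass
    return "".join(pieces), err.end


codecs.register_error("cp1252_or_drop", cp1252_or_drop)


def lenient_decode(data: bytes) -> str:
    """Decode as UTF-8, falling back per byte to CP1252 or dropping invalid bytes."""
    return data.decode("utf-8", errors="cp1252_or_drop")
```

Once the handler is registered, the same name can be given to the built-in `open`, for example `open(f, 'r', encoding='utf-8', errors='cp1252_or_drop')`. Valid UTF-8 input is unaffected, since the handler only runs on bytes that fail to decode.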

mitshell · 2024-10-08T20:11:43Z

For my own reference, this happens here:

pycrate/tools/pycrate_asn1compile.py

Line 200 in 0b8309e

fd = open(f, 'r', encoding='utf-8')

rmwesley changed the title ~~Cannot compile ASN1 specs with ISO-8859 (Latin or Western) encoding~~ Cannot compile ASN1 specs with ISO-8859 (Latin or Western) encoding characters present (often in ASN.1 comments) Oct 4, 2024

mitshell self-assigned this Oct 8, 2024

mitshell added the enhancement New feature or request label Oct 8, 2024
