Cannot compile ASN1 specs with ISO-8859 (Latin or Western) encoding characters present (often in ASN.1 comments) #17

Open
rmwesley opened this issue Oct 4, 2024 · 2 comments

rmwesley commented Oct 4, 2024

I am not asking for a fix. Just explaining an issue.

Here is the command I ran in bash:

$ ./tools/pycrate_asn1compile.py -i DSRC_instances_asn1_specs/EN15509/ -o DSRC_instances_asn1_specs/EN15509 -j
./tools/pycrate_asn1compile.py, args error: unable to read input file DSRC_instances_asn1_specs/EN15509/ISO14906Amd(2014)EfcDsrcGenericv5.asn
'utf-8' codec can't decode byte 0x93 in position 10503: invalid start byte

So the ASN.1 file contains bytes that are not valid UTF-8.

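In one of the comments in the ISO14906Amd(2014)EfcDsrcGenericv5.asn file, the words UNIX time are surrounded by left (“ = 0x93) and right (” = 0x94) double quotation marks instead of plain ASCII quotation marks (" = 0x22). Strictly speaking, these byte values come from Windows-1252, which is often loosely labeled Latin1; ISO-8859-1 itself assigns control codes to the range 0x80-0x9F. The comment reads like so: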
“UNIX time”

This makes the text invalid UTF-8.
The same kind of issue appears in EfcDsrcApplicationv5 and AVIAEINumberingAndDataStructures, with double quotation marks and with other such characters, such as single quotation marks and dashes (’ = 0x92 and – = 0x96).

A commonly recommended workaround is to strip the comments from the ASN.1 specs by hand.
Instead, I will simply replace these characters by hand with their ASCII equivalents, so that the text is valid UTF-8, and compile the ASN.1 specs from there.

They are just comments after all...
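
To find the characters that need replacing, a small helper can list every byte sequence that is not valid UTF-8, together with its line number. This is only a minimal sketch: `find_invalid_utf8` is a hypothetical name (not part of pycrate), and it expects the file content as bytes.

```python
def find_invalid_utf8(raw):
    """Return a list of (line_number, byte_offset, offending_bytes) tuples,
    one per byte sequence in `raw` that is not valid UTF-8."""
    hits = []
    pos = 0
    while pos < len(raw):
        try:
            # Strict decoding of the remainder; succeeds once nothing invalid is left.
            raw[pos:].decode("utf-8")
            break
        except UnicodeDecodeError as err:
            start = pos + err.start
            end = pos + err.end
            # Line numbers are 1-based; count the newlines before the bad bytes.
            line = raw.count(b"\n", 0, start) + 1
            hits.append((line, start, raw[start:end]))
            pos = end  # resume scanning right after the offending bytes
    return hits
```

Reading the file with `open(path, "rb").read()` and passing the result to this helper should report the same byte offset as the compiler error above (10503 for the first hit in that file).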

It is impossible to reliably detect 8-bit encodings programmatically, right? Only if the encoding is kept as metadata or noted down somewhere.
If the encoding could be determined, we could then simply do open("myfile", encoding=determined_encoding).
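
Short of real detection, one common heuristic is to try strict UTF-8 first and fall back to an 8-bit encoding only when that fails. The sketch below assumes the candidates are Windows-1252 and then Latin1; the function name `decode_with_fallback` and the candidate list are illustrative choices, not something pycrate provides.

```python
def decode_with_fallback(raw, encodings=("utf-8", "cp1252", "latin-1")):
    """Decode `raw` bytes with the first encoding that accepts them.

    Returns (text, encoding_used). This is a guess, not detection:
    an 8-bit codec can accept bytes it was never meant for.
    """
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue  # try the next candidate
    raise ValueError("none of the candidate encodings could decode the input")
```

Because Latin1 maps every byte value to a character, it accepts any input, so it only makes sense as the last candidate. The order matters for the same reason: UTF-8 is strict enough that text which decodes cleanly as UTF-8 is very likely UTF-8.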

Just to note, I downloaded the original ASN.1 specs directly from the official ISO site.
Some of the specifications using ISO-8859-1 (Latin1) encoding are https://standards.iso.org/iso/14906/ed-2/ISO14906Amd(2014)EfcDsrcGenericv5.asn, https://standards.iso.org/iso/14906/ed-2/ISO14906Amd(2014)EfcDsrcApplicationv5.asn and https://standards.iso.org/iso/14816/ISO14816%20ASN.1%20repository/ISO14816_AVIAEINumberingAndDataStructures.asn.

rmwesley changed the title from "Cannot compile ASN1 specs with ISO-8859 (Latin or Western) encoding" to "Cannot compile ASN1 specs with ISO-8859 (Latin or Western) encoding characters present (often in ASN.1 comments)" on Oct 4, 2024

mitshell (Member) commented Oct 8, 2024

I agree that many ASN.1 specs found here and there contain misencoded (or sometimes simply invalid) characters. This is generally a result of how the work is organized when building a technical standard or specification: contributions from different companies and regions of the world are all merged into one big Word document, which is eventually converted to PDF. This is error prone!

On the other hand, the current pycrate ASN.1 compiler tries to decode any input as UTF-8 and fails if it contains a non-UTF-8 byte. What could be done is:

  • convert wrongly encoded but meaningful characters to their expected UTF-8 encoding.
  • drop invalid bytes when they just break the UTF-8 decoding.

In the end, this could let the compiler accept more ASN.1 specs.
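
As a rough illustration of these two options (not how pycrate handles it), both can be combined in a custom codec error handler: bytes that break UTF-8 decoding are reinterpreted as Windows-1252 when that yields a character, and dropped otherwise. The handler name `cp1252_or_drop`, the helper `lenient_decode` and the choice of Windows-1252 as the source encoding are all assumptions made for this sketch.

```python
import codecs

def cp1252_or_drop(err):
    """Error handler for UTF-8 decoding: map stray bytes through cp1252,
    silently dropping any byte that cp1252 leaves undefined."""
    if not isinstance(err, UnicodeDecodeError):
        raise err  # only decoding errors are handled here
    bad = err.object[err.start:err.end]
    # errors="ignore" keeps the bytes cp1252 can map and drops the rest.
    replacement = bad.decode("cp1252", errors="ignore")
    return replacement, err.end  # resume decoding after the bad bytes

# Register the handler under a name usable as the `errors` argument.
codecs.register_error("cp1252_or_drop", cp1252_or_drop)

def lenient_decode(raw):
    """Decode bytes as UTF-8, repairing or dropping the bytes that are invalid."""
    return raw.decode("utf-8", errors="cp1252_or_drop")
```

Once registered, the same handler name can also be passed to the built-in `open` via its `errors` parameter, for example `open(f, 'r', encoding='utf-8', errors='cp1252_or_drop')`. Valid UTF-8 input is unaffected, since the handler only runs on bytes that fail to decode.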

mitshell self-assigned this Oct 8, 2024
mitshell added the enhancement label Oct 8, 2024

mitshell (Member) commented Oct 8, 2024

For my own recollection, this is happening here:

fd = open(f, 'r', encoding='utf-8')
