Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

HTML entity is erroneously converted (legacy numeric character reference) #307

Closed
p3k opened this issue Feb 20, 2024 · 7 comments
Closed
Labels

Comments

@p3k
Copy link

p3k commented Feb 20, 2024

Steps to reproduce the problem (provide example Markdown if applicable):

This endash: –  
should look like this endash: –

Expected behavior:

This endash: –
should look like this endash: –

Actual behavior:

This endash: –
should look like this endash: �

The reference implementation at https://spec.commonmark.org/dingus/ does behave differently, it displays the expected behavior.

Is this an abberration in the Java implementation?

@p3k p3k added the bug label Feb 20, 2024
p3k added a commit to antville/antville that referenced this issue Feb 20, 2024
@robinst
Copy link
Collaborator

robinst commented Mar 2, 2024

Interesting case! So the spec only says this:

Decimal numeric character references consist of &# + a string of 1--7 arabic digits + ;.
A numeric character reference is parsed as the corresponding Unicode character.
Invalid Unicode code points will be replaced by the REPLACEMENT CHARACTER (U+FFFD).
For security reasons, the code point U+0000 will also be replaced by U+FFFD.

And that's what commonmark-java implements. – is 150 in decimal, which is 96 in hex and so corresponds to the Unicode code point U+0096, which is (SPA).

But it turns out that commonmark.js implements this differently. It uses the entities library which has this replacement map: decode_codepoint.ts.

As you can see there, 150 is mapped to 8211, which is U+2013 aka En Dash. (If you use – in commonmark-java, it would work.)


Looking at the HTML spec (which commonmark usually tries to be compatible with), says this:

The numeric character reference forms described above are allowed to reference any code point excluding U+000D CR, noncharacters, and controls other than ASCII whitespace.

(U+0096 is a control character)

And then in the parsing section:

If the number is one of the numbers in the first column of the following table, then find the row with that number in the first column, and set the character reference code to the number in the second column of that row.

And there it is, the replacement table that entities uses (I assume it's the same, I haven't checked).


So, tl;dr:

  • 150 in decimal maps to a control character in Unicode, and the HTML spec uses a different character for it (probably based on some legacy encoding).
  • The spec doesn't mention the mapping
  • commonmark.js implements the mapping
  • commonmark-java doesn't

So technically it's not a bug, but I think it makes sense to follow HTML and commonmark.js here.

Note that GitHub which AFAIK uses cmark doesn't do the same replacement, it renders – as: �
(The code for that is probably here.)

Next steps:

  1. I'll raise an issue in https://github.com/commonmark/commonmark-spec
  2. I'll work on adding the replacement table to commonmark-java

@p3k
Copy link
Author

p3k commented Mar 2, 2024

Wow, thanks for the deep insights @robinst

the HTML spec uses a different character for it (probably based on some legacy encoding)

I probably should have added that the corresponding content was created in the 1990s where encoding HTML entities like this was pretty common. The Markdown environment later grew around it.

So I see this could be kind of way-back backwards-compatibility 👴

And I noticed the issue on GitHub, too ☺️

@robinst
Copy link
Collaborator

robinst commented Mar 5, 2024

I probably should have added that the corresponding content was created in the 1990s

The Markdown content? 😱

I don't know if you've read the discussion on the spec issue, but I'm leaning towards closing this as "not a bug" for now until the spec says something different about numeric character references.

@p3k
Copy link
Author

p3k commented Mar 5, 2024

The Markdown content? 😱

The content at the time was mainly text with sparse HTML and some macro structures processed on the server-side – nothing Markdown should not be able to handle after the server-side processing.

I do not recall whether it was the blogging system (I think it was blogger.com) or the browsers that were not capable of an typographically correct unencoded endash or other non-ASCII characters. Thus, we helped ourselves encoding it manually.

The blog systems came and go, I migrated the content multiple times. But the issue arose when I started using CommonMark (Java) to render the contents.

@robinst
Copy link
Collaborator

robinst commented Mar 5, 2024

I see. Can you migrate the content to replace – with – (or just a literal –)?

@p3k
Copy link
Author

p3k commented Mar 6, 2024

Sure, that is always a possibility, eventually.

But still, CommonMark’s behavior seems to be inconsistent in different implementations, at least when comparing the Java version to the one running the reference implementation.

@robinst
Copy link
Collaborator

robinst commented Mar 6, 2024

But still, CommonMark’s behavior seems to be inconsistent in different implementations, at least when comparing the Java version to the one running the reference implementation.

Yeah. But cmark is also a reference implementation and behaves differently. commonmark-java is implementing the spec as it is currently written, so it's a spec issue which I raised here. (To be super pedantic, commonmark.js is the one currently not following the spec.)

I'm going to close this one, we can reopen once there's a decision on the spec issue. Until then I recommend not using legacy numeric character references. Thanks for bringing this up again!

@robinst robinst closed this as completed Mar 6, 2024
@robinst robinst added spec and removed bug labels Mar 6, 2024
@robinst robinst changed the title HTML entity is erroneously converted HTML entity is erroneously converted (legacy numeric character reference) Mar 6, 2024
p3k added a commit to antville/antville that referenced this issue May 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants