HTML entity is erroneously converted (legacy numeric character reference) #307

p3k · 2024-02-20T18:17:21Z

Steps to reproduce the problem (provide example Markdown if applicable):

This endash: –  
should look like this endash: &#150;

Expected behavior:

This endash: –
should look like this endash: –

Actual behavior:

This endash: –
should look like this endash: �

The reference implementation at https://spec.commonmark.org/dingus/ behaves differently: it displays the expected behavior.

Is this an aberration in the Java implementation?

robinst · 2024-03-02T13:25:06Z

Interesting case! So the spec only says this:

> Decimal numeric character references consist of &# + a string of 1–7 arabic digits + ;.
> A numeric character reference is parsed as the corresponding Unicode character.
> Invalid Unicode code points will be replaced by the REPLACEMENT CHARACTER (U+FFFD).
> For security reasons, the code point U+0000 will also be replaced by U+FFFD.

And that's what commonmark-java implements. &#150; is 150 in decimal, which is 96 in hex and so corresponds to the Unicode code point U+0096, which is a C1 control character (SPA).
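To make the rule quoted from the spec concrete, here is a minimal sketch of decoding a decimal numeric character reference exactly as that text describes. The class and method names are made up for illustration and are not taken from commonmark-java, and treating surrogate code points as invalid is my assumption about what "invalid Unicode code points" covers.

```java
// Hypothetical helper, not commonmark-java's actual code.
final class SpecNumericReference {

    // Decodes the digits of a decimal reference, e.g. "150" for &#150;.
    static String decodeDecimal(String digits) {
        if (!digits.matches("[0-9]{1,7}")) {
            throw new IllegalArgumentException("expected 1 to 7 digits: " + digits);
        }
        int codePoint = Integer.parseInt(digits);
        // U+0000 and invalid code points become U+FFFD.
        // Assumption: surrogates (U+D800..U+DFFF) count as invalid here.
        boolean surrogate = codePoint >= 0xD800 && codePoint <= 0xDFFF;
        if (codePoint == 0 || surrogate || !Character.isValidCodePoint(codePoint)) {
            return "\uFFFD";
        }
        // No legacy remapping: 150 stays U+0096, a control character.
        return new String(Character.toChars(codePoint));
    }

    public static void main(String[] args) {
        int decoded = decodeDecimal("150").codePointAt(0);
        System.out.println(String.format("U+%04X", decoded)); // prints U+0096
    }
}
```

Under this reading, 150 is a valid code point, so it is passed through unchanged rather than replaced.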

But it turns out that commonmark.js implements this differently. It uses the entities library which has this replacement map: decode_codepoint.ts.

As you can see there, 150 is mapped to 8211, which is U+2013 aka En Dash. (If you use &#8211; in commonmark-java, it would work.)

The HTML spec (which CommonMark usually tries to be compatible with) says this:

> The numeric character reference forms described above are allowed to reference any code point excluding U+000D CR, noncharacters, and controls other than ASCII whitespace.

(U+0096 is a control character)

And then in the parsing section:

> If the number is one of the numbers in the first column of the following table, then find the row with that number in the first column, and set the character reference code to the number in the second column of that row.

And there it is, the replacement table that entities uses (I assume it's the same, I haven't checked).
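As a rough illustration of the table lookup described in the HTML parsing section, the sketch below remaps a few code points in the 0x80 to 0x9F range to their Windows-1252 counterparts. It is only an excerpt of the table, the class name is hypothetical, and nothing here is taken from the entities library or from commonmark-java.

```java
import java.util.Map;

// Hypothetical helper showing an excerpt of the HTML replacement table.
final class LegacyReferenceTable {

    // Excerpt only: the HTML spec's table has more rows than these.
    private static final Map<Integer, Integer> REPLACEMENTS = Map.of(
            0x80, 0x20AC,  // euro sign
            0x85, 0x2026,  // horizontal ellipsis
            0x91, 0x2018,  // left single quotation mark
            0x92, 0x2019,  // right single quotation mark
            0x93, 0x201C,  // left double quotation mark
            0x94, 0x201D,  // right double quotation mark
            0x95, 0x2022,  // bullet
            0x96, 0x2013,  // en dash (150 -> 8211, the case in this issue)
            0x97, 0x2014,  // em dash
            0x99, 0x2122); // trade mark sign

    // Returns the remapped code point, or the input if it has no table row.
    static int remap(int codePoint) {
        return REPLACEMENTS.getOrDefault(codePoint, codePoint);
    }

    public static void main(String[] args) {
        System.out.println(remap(150)); // prints 8211
    }
}
```

A parser following HTML would apply such a lookup after parsing the number and before turning it into a character.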

So, tl;dr:

- 150 in decimal maps to a control character in Unicode, and the HTML spec uses a different character for it (probably based on some legacy encoding).
- The CommonMark spec doesn't mention the mapping.
- commonmark.js implements the mapping.
- commonmark-java doesn't.

So technically it's not a bug, but I think it makes sense to follow HTML and commonmark.js here.

Note that GitHub, which AFAIK uses cmark, doesn't do the same replacement; it renders &#150; as: �
(The code for that is probably here.)

Next steps:

- I'll raise an issue in https://github.com/commonmark/commonmark-spec
- I'll work on adding the replacement table to commonmark-java

p3k · 2024-03-02T14:22:16Z

Wow, thanks for the deep insights @robinst

> the HTML spec uses a different character for it (probably based on some legacy encoding)

I probably should have added that the corresponding content was created in the 1990s, when encoding HTML entities like this was pretty common. The Markdown environment later grew around it.

So I see this could be kind of way-back backwards-compatibility 👴

And I noticed the issue on GitHub, too ☺️

robinst · 2024-03-05T11:07:52Z

> I probably should have added that the corresponding content was created in the 1990s

The Markdown content? 😱

I don't know if you've read the discussion on the spec issue, but I'm leaning towards closing this as "not a bug" for now until the spec says something different about numeric character references.

p3k · 2024-03-05T11:15:52Z

> The Markdown content? 😱

The content at the time was mainly text with sparse HTML and some macro structures processed on the server-side – nothing Markdown should not be able to handle after the server-side processing.

I do not recall whether it was the blogging system (I think it was blogger.com) or the browsers that were not capable of a typographically correct unencoded endash or other non-ASCII characters. Thus, we helped ourselves by encoding it manually.

The blog systems came and went, and I migrated the content multiple times. But the issue arose when I started using CommonMark (Java) to render the contents.

robinst · 2024-03-05T22:13:15Z

I see. Can you migrate the content to replace &#150; with &#8211; (or just a literal –)?
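One way such a migration could look is a one-off rewrite of the stored text before it reaches the parser. This is only a sketch: the class name and the tiny mapping are made up for illustration, and it is not the work-around used in antville. It rewrites legacy decimal references it knows about and leaves everything else untouched.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical one-off migration helper for legacy numeric references.
final class LegacyReferenceMigration {

    private static final Pattern DECIMAL_REFERENCE = Pattern.compile("&#([0-9]{1,7});");

    // Assumption: only these legacy values occur in the content.
    private static final Map<Integer, Integer> LEGACY = Map.of(
            133, 8230,  // horizontal ellipsis
            150, 8211,  // en dash
            151, 8212); // em dash

    static String migrate(String text) {
        Matcher matcher = DECIMAL_REFERENCE.matcher(text);
        StringBuilder result = new StringBuilder();
        while (matcher.find()) {
            Integer mapped = LEGACY.get(Integer.parseInt(matcher.group(1)));
            // Keep references without a mapping exactly as they were.
            String replacement = mapped == null ? matcher.group() : "&#" + mapped + ";";
            matcher.appendReplacement(result, Matcher.quoteReplacement(replacement));
        }
        matcher.appendTail(result);
        return result.toString();
    }

    public static void main(String[] args) {
        // prints: This endash: &#8211;
        System.out.println(migrate("This endash: &#150;"));
    }
}
```

Note that a plain regex pass like this also rewrites references inside code spans and code blocks, which may or may not be what you want.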

p3k · 2024-03-06T07:38:13Z

Sure, that is always a possibility, eventually.

But still, CommonMark’s behavior seems to be inconsistent in different implementations, at least when comparing the Java version to the one running the reference implementation.

robinst · 2024-03-06T23:20:07Z

> But still, CommonMark’s behavior seems to be inconsistent in different implementations, at least when comparing the Java version to the one running the reference implementation.

Yeah. But cmark is also a reference implementation and behaves differently. commonmark-java is implementing the spec as it is currently written, so it's a spec issue, which I raised in commonmark/commonmark-spec#765. (To be super pedantic, commonmark.js is the one currently not following the spec.)

I'm going to close this one, we can reopen once there's a decision on the spec issue. Until then I recommend not using legacy numeric character references. Thanks for bringing this up again!

p3k added the bug label Feb 20, 2024

p3k added a commit to antville/antville that referenced this issue Feb 20, 2024

Add work-around for issue with CommonMark erroneously encoding HTML e…

c0cf251

…ntities See <commonmark/commonmark-java#307>

robinst mentioned this issue Mar 3, 2024

Numeric character references: Should HTML spec be followed for codes mapping to control characters commonmark/commonmark-spec#765

Open

robinst closed this as completed Mar 6, 2024

robinst added spec and removed bug labels Mar 6, 2024

robinst changed the title ~~HTML entity is erroneously converted~~ HTML entity is erroneously converted (legacy numeric character reference) Mar 6, 2024

p3k added a commit to antville/antville that referenced this issue May 30, 2024

Add work-around for issue with CommonMark erroneously encoding HTML e…

5d82daf

…ntities See <commonmark/commonmark-java#307>
