HTML entity is erroneously converted (legacy numeric character reference) #307
Comments
Interesting case! So the spec only says this:
And that's what commonmark-java implements. But it turns out that commonmark.js implements this differently. It uses the
As you can see there, 150 is mapped to 8211, which is U+2013 aka En Dash. (If you use
Looking at the HTML spec (which CommonMark usually tries to be compatible with), it says this:
(U+0096 is a control character)
And then in the parsing section:
And there it is, the replacement table that
So, tl;dr:
So technically it's not a bug, but I think it makes sense to follow HTML and commonmark.js here. Note that GitHub, which AFAIK uses cmark, doesn't do the same replacement; it renders
Next steps:
Wow, thanks for the deep insights @robinst
I probably should have added that the corresponding content was created in the 1990s, when encoding HTML entities like this was pretty common. The Markdown environment later grew around it. So I see this could be a kind of way-back backwards compatibility 👴 And I noticed the issue on GitHub, too.
The Markdown content? 😱 I don't know if you've read the discussion on the spec issue, but I'm leaning towards closing this as "not a bug" for now until the spec says something different about numeric character references.
The content at the time was mainly text with sparse HTML and some macro structures processed on the server side, nothing Markdown should not be able to handle after the server-side processing. I do not recall whether it was the blogging system (I think it was blogger.com) or the browsers that could not handle a typographically correct unencoded en dash or other non-ASCII characters. So we helped ourselves by encoding it manually. Blog systems came and went, and I migrated the content multiple times. But the issue arose when I started using CommonMark (Java) to render the contents.
I see. Can you migrate the content to replace
Sure, that is always a possibility, eventually. But still, CommonMark’s behavior seems to be inconsistent across implementations, at least when comparing the Java version to the one running the reference implementation.
Yeah. But cmark is also a reference implementation and behaves differently. commonmark-java is implementing the spec as it is currently written, so it's a spec issue which I raised here. (To be super pedantic, commonmark.js is the one currently not following the spec.) I'm going to close this one; we can reopen once there's a decision on the spec issue. Until then I recommend not using legacy numeric character references. Thanks for bringing this up again!
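For anyone who wants to migrate such content before rendering, here is a minimal sketch of the idea, not something taken from commonmark-java. It rewrites decimal references in the 128 to 159 range to the code points that the HTML spec's replacement table assigns them (the same 150 to 8211 mapping discussed above). The class name `LegacyCharRefs` and the method `rewrite` are hypothetical, the table is only a subset, and hexadecimal references and references inside code spans are deliberately not handled.

```java
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LegacyCharRefs {
    // Subset of the HTML spec's replacement table for the 0x80-0x9F range
    // (these are the Windows-1252 interpretations of those bytes).
    private static final Map<Integer, Integer> REPLACEMENTS = Map.ofEntries(
        Map.entry(0x80, 0x20AC), // euro sign
        Map.entry(0x85, 0x2026), // horizontal ellipsis
        Map.entry(0x91, 0x2018), // left single quotation mark
        Map.entry(0x92, 0x2019), // right single quotation mark
        Map.entry(0x93, 0x201C), // left double quotation mark
        Map.entry(0x94, 0x201D), // right double quotation mark
        Map.entry(0x96, 0x2013), // en dash, i.e. &#150; becomes &#8211;
        Map.entry(0x97, 0x2014)  // em dash
    );

    // Matches decimal numeric character references only (1 to 7 digits).
    private static final Pattern DECIMAL_REF = Pattern.compile("&#([0-9]{1,7});");

    // Rewrites legacy references found in the table; leaves everything else untouched.
    static String rewrite(String markdown) {
        Matcher m = DECIMAL_REF.matcher(markdown);
        StringBuilder out = new StringBuilder();
        while (m.find()) {
            int codePoint = Integer.parseInt(m.group(1));
            Integer mapped = REPLACEMENTS.get(codePoint);
            String replacement = (mapped == null) ? m.group() : "&#" + mapped + ";";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(rewrite("This endash: &#150; should look like an en dash"));
    }
}
```

Running the rewrite once over the stored content keeps the source pure ASCII while making it render the same way in implementations that follow the spec as currently written.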
Steps to reproduce the problem (provide example Markdown if applicable):
This endash: &#150; should look like this endash: –
Expected behavior:
Actual behavior:
The reference implementation at https://spec.commonmark.org/dingus/ behaves differently: it displays the expected behavior.
Is this an aberration in the Java implementation?