-
Notifications
You must be signed in to change notification settings - Fork 215
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Forbidden numeric character references appear in sanitized HTML #223
Comments
Thanks for the detailed report. What about replacing any banned code-point with the replacement character (U+FFFD)? It looks like, from that link, that the set of banned codepoints is UTF-16 surrogate pairs where one or both of the surrogates are specified via reference seem like non-minimal sequence problem. Should those be banned? Replaced with one replacement char or two? If so, we'd need to do some work during lexing. |
A quick test suggests that we don't have to worry about surrogate pairs where one is specified via reference. I ran this:
and go the result "🤡" every time (That's the clown face emoji, by the way). |
Having had a few hours to think about, I think removing the bad characters in the same way we do for isolated surrogates and U+FFFE and U+FFFF is the right thing to do. The "bad characters" with the rationale would be: U+0000 to U+0008 : C0 Control characters, banned by XML. Additionally: U+0085 : Whilst not discouraged by XML, there is no consistent definition of how the character should be handled. Some assume it is badly encoded Windows-1252 and show an ellipsis. Some recognise it is a new line, some just as white space. As there is no clear use for U+0085, I believe any usage of it can be viewed with suspicion, and therefore it should be elided. U+000D : HTML bans this as a numeric escape, but if used as a raw CR it becomes an attack vector in systems where it is treated as a Carriage Return and not a New Line. The carriage return moves the output point back to the start of the line and allows overwriting of data. Whilst not a problem in HTML, this is a problem as a log-file injection. I believe that it would be best to normalize newlines as described here: https://infra.spec.whatwg.org/#normalize-newlines This removes all occurrences of U+000D without changing the meaning of the HTML. So my proposal is to normalize newlines and then remove anything on the list of "bad characters" above. How does that sound as a solution? |
Ok. I guess as long as the way it's done does not affect the HTML renderer then I expect that it won't cause much fallout. Then we can focus testing on token merging conflicts as where <Øscript>
<sØcript ="">
<oØbject>
<a title=Øonclick=alert(1)>
<a href="javascript&Ø#58;alert(1)">a</a> Newline normalizing sounds good. I think a few tests for CRLF vs CR should do fine. It might be worth updating the fuzzer test so that the fragments it splices around include a banned character and a U+D. That's a pretty terrible fuzzer, but it's saved me from releasing bugs into prod before. Thanks for taking this on. |
I have created a pull request for my proposed fix. I have scattered comments on the proposed code changes to explain my thinking. |
Thanks. I see #225 |
The HTML living standard ( https://html.spec.whatwg.org/multipage/syntax.html#character-references ) states:
However, non-characters from the supplemental planes are encoded numerically.
The code:
Produces: ""
I see two possible simple possible solutions, but I am loathe to recommend either one:
First the characters could be elided in line with the elision of U+FFFE and U+FFFF. This produces a strange botch that is not required by the rules of HTML nor XML, and I don't like it.
Alternatively, the character is allowed if it is not numerically escaped and the noncharacters U+FDD0 to U+FDEF are presented unescaped - so consistency with other non-characters would produce legal HTML. However, all other supplemental code points are represented by numeric escapes to avoid corruption when converting between unicode encodings. I am not happy with introducing special cases for these supplemental code points.
More complex solutions would be to introduce a policy for handling of the "discouraged" characters defined in https://www.w3.org/TR/2008/REC-xml-20081126/#charsets.
I am happy to put the time into creating a fix and test cases, but I need guidance as to what is the "correct" solution.
The text was updated successfully, but these errors were encountered: