Fixes in HTML Lexer to support HTML empty comment statements #327
@jfbyers Can you please provide a test case where, with a minimal policy, the vector is not sanitized after sanitization? |
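Hi @subbudvk, with the current implementation the following strings are not properly sanitized:
import org.owasp.html.HtmlPolicyBuilder;
import org.owasp.html.PolicyFactory;

class BugTest
{
    public static void main(String[] args)
    {
        // A plain tag, an empty comment declaration, and an abruptly closed empty comment.
        String[] test = new String[]{"qwe1<img>qwe2", "qwe3<!>qwe4", "qwe5<!-->qwe6"};
        // An empty policy allows no elements, so only the text should survive.
        PolicyFactory sanitizePolicy = new HtmlPolicyBuilder().toFactory();
        for (String s : test)
        {
            // No change listener and no context are needed for this reproduction.
            String safeString = sanitizePolicy.sanitize(s, null, null);
            System.out.println("safeString :" + safeString);
        }
    }
}
This outputs: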
And it should be:
As I mentioned in my first message (apologies if I did not explain myself clearly) this is even more misleading if you use a listener to add tags/attributes in discardedTag() or discardedAttribute() from the string being sanitized e.g string |
Thanks @melloware , can you approve / run the workflows or you want me to add more unit tests? |
No you are good. I don't have commit privs so I can't run your workflow only @mikesamuel has permissions to this repo. |
Which HTML RFC are you quoting? https://html.spec.whatwg.org/multipage/syntax.html#comments seems to disagree. https://html.spec.whatwg.org/multipage/parsing.html#parse-errors does class
Why? |
@mikesamuel : Can you kindly release a new version at least with whatever changes that are already merged to master? |
+1 to @subbudvk just get any new version out there that gets rid of Guava. |
Added support for 2 comment parser errors https://html.spec.whatwg.org/multipage/parsing.html#parse-errors :
|
Hey @mikesamuel We are quoting this RFC: https://www.ietf.org/rfc/rfc1866.txt section 3.2.5.
Based on this, It is true, that based on this, I can build some PoC for both scenarios detailed above, let me know if that is needed or if this is enough information. Thanks!! |
Done |
I don't think any modern browser references that standard and, since it's an RFC, it hasn't changed since it was published in 1995. |
@@ -665,6 +679,8 @@ && canonicalElementName(start + 2, end)
if ('>' == ch) {
  state = State.DONE;
  type = HtmlTokenType.COMMENT;
} else if ('!' == ch) { // --!> is also valid closing sequence
  state = State.COMMENT_DASH_DASH;
This would seem to suggest that <!-- --!-> is a whole comment tag.
iiuc, after <!-- --!- we should be in the comment_dash state.
Perhaps we need ! to transition to COMMENT_DASH_DASH_BANG here, which transitions as does case COMMENT above.
Does this look right?
flowchart TD
BANG -- "-" --> BANG_DASH;
BANG_DASH -- "-" --> COMMENT_DASH_AFTER_BANG;
BANG_DASH -- "else" --> DIRECTIVE;
COMMENT_DASH_AFTER_BANG -- "-" --> COMMENT_DASH_AFTER_BANG;
COMMENT_DASH_AFTER_BANG -- ">" --> DONE;
COMMENT_DASH_AFTER_BANG -- "else" --> COMMENT;
COMMENT -- "-" --> COMMENT_DASH;
COMMENT -- "else" --> COMMENT;
COMMENT_DASH -- "-" --> COMMENT_DASH_DASH;
COMMENT_DASH -- "else" --> COMMENT;
COMMENT_DASH_DASH -- ">" --> DONE;
COMMENT_DASH_DASH -- "-" --> COMMENT_DASH_DASH;
COMMENT_DASH_DASH -- "!" --> COMMENT_DASH_DASH_BANG;
COMMENT_DASH_DASH_BANG -- ">" --> DONE;
COMMENT_DASH_DASH_BANG -- "-" --> COMMENT_DASH;
COMMENT_DASH_DASH_BANG -- "else" --> COMMENT;
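The transitions in the flowchart above can be sketched as a small Java state machine. This is an illustrative sketch only, not the code in HtmlLexer: the class name CommentStateSketch and the method endOfDeclaration are hypothetical. Four transitions are assumptions that the flowchart does not spell out: a `>` directly after `<!` ends the declaration (the `<!>` case from the PR description), any other character after `<!` starts a directive, a directive ends at the next `>`, and any other character in COMMENT_DASH_DASH returns to COMMENT.

```java
/** Illustrative sketch of the flowchart; not the real HtmlLexer. */
final class CommentStateSketch {
  enum State {
    BANG, BANG_DASH, COMMENT_DASH_AFTER_BANG, COMMENT, COMMENT_DASH,
    COMMENT_DASH_DASH, COMMENT_DASH_DASH_BANG, DIRECTIVE, DONE
  }

  /** Returns the state reached from {@code s} after reading {@code ch}. */
  static State next(State s, char ch) {
    switch (s) {
      case BANG:
        if (ch == '-') return State.BANG_DASH;
        if (ch == '>') return State.DONE;      // assumption: "<!>" is complete
        return State.DIRECTIVE;                // assumption: e.g. "<!DOCTYPE"
      case BANG_DASH:
        return ch == '-' ? State.COMMENT_DASH_AFTER_BANG : State.DIRECTIVE;
      case COMMENT_DASH_AFTER_BANG:
        if (ch == '-') return State.COMMENT_DASH_AFTER_BANG;
        if (ch == '>') return State.DONE;      // "<!-->" and "<!--->"
        return State.COMMENT;
      case COMMENT:
        return ch == '-' ? State.COMMENT_DASH : State.COMMENT;
      case COMMENT_DASH:
        return ch == '-' ? State.COMMENT_DASH_DASH : State.COMMENT;
      case COMMENT_DASH_DASH:
        if (ch == '>') return State.DONE;      // "-->"
        if (ch == '-') return State.COMMENT_DASH_DASH;
        if (ch == '!') return State.COMMENT_DASH_DASH_BANG;
        return State.COMMENT;                  // assumption: not in the flowchart
      case COMMENT_DASH_DASH_BANG:
        if (ch == '>') return State.DONE;      // "--!>"
        if (ch == '-') return State.COMMENT_DASH;
        return State.COMMENT;
      case DIRECTIVE:
        return ch == '>' ? State.DONE : State.DIRECTIVE; // assumption
      default:
        return State.DONE;
    }
  }

  /**
   * Returns the index just past the declaration that starts with "<!" at
   * index 0, or -1 if the input does not start with "<!" or never closes it.
   */
  static int endOfDeclaration(String html) {
    if (!html.startsWith("<!")) return -1;
    State s = State.BANG;
    for (int i = 2; i < html.length(); i++) {
      s = next(s, html.charAt(i));
      if (s == State.DONE) return i + 1;
    }
    return -1;
  }

  public static void main(String[] args) {
    String[] inputs = {"<!><img>", "<!--><img>", "<!-- a --!>b", "<!-- --!-> x"};
    for (String input : inputs) {
      System.out.println(input + " -> " + endOfDeclaration(input));
    }
  }
}
```

With these transitions, `<!-- --!->` does not close the comment, because the `-` after `--!` moves to COMMENT_DASH and the following `>` falls back to COMMENT, which matches the concern raised in the review comment.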
Yes, this seems correct. The missing part is the COMMENT_DASH_AFTER_BANG case and the <!> scenario too. But the new COMMENT_DASH_DASH_BANG suggestion seems correct to me.
Thanks @mikesamuel , this makes sense. I implemented your proposed changes and added new tests. All the use cases discussed so far pass the tests. Can you please validate / approve the MR? Thanks!
@jfbyers Changes look good to me, let's see what Mike's thoughts are :)
@subbudvk @mikesamuel are you ok moving forward with this PR ? What are the next steps? Thank you.
Thanks @mikesamuel ! Appreciate your contribution to the open source community. |
About this, please forgive me in advance... I am not super used to referencing in RFCs and/or browsers standards. That being said, in https://html.spec.whatwg.org/multipage/syntax.html#comments they state a comment must start with Not sure if this means this HTML spec is wrong, or it is just not the one followed by browsers. Also the pattern Again, let me know if it is useful to provide a live PoC for this. |
Summary and intro
The HTML lexer fails to detect empty HTML comment declarations, so the HTML that follows them in the input is not recognized as markup.
For an input such as <!><img src=1 onError=alert(1)>, the lexer treats the whole blob as an HTML comment instead of an empty comment declaration (<!>) followed by an img tag, which is what browsers do.
This bug in the lexer provides wrong information to the HtmlChangeListener.
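To make the expected split concrete, here is a minimal sketch that peels a leading empty comment declaration off the input so that the rest can be lexed as ordinary markup. The class EmptyCommentSketch and the helper leadingEmptyCommentLength are hypothetical names used for illustration. They are not part of the sanitizer's API, and the sketch only covers the two patterns discussed in this PR (`<!>` and `<!-->`).

```java
/** Illustrative only; not the sanitizer's lexer. */
final class EmptyCommentSketch {
  /**
   * Returns the length of a leading empty comment declaration
   * ("<!>" or "<!-->"), or 0 if the input does not start with one.
   */
  static int leadingEmptyCommentLength(String html) {
    if (html.startsWith("<!>")) return 3;
    if (html.startsWith("<!-->")) return 5;
    return 0;
  }

  public static void main(String[] args) {
    String input = "<!><img src=1 onError=alert(1)>";
    int n = leadingEmptyCommentLength(input);
    // The comment declaration and the tag are reported separately,
    // instead of the whole input being swallowed as one comment.
    System.out.println("comment: " + input.substring(0, n));
    System.out.println("rest:    " + input.substring(n));
  }
}
```

Once the declaration is split off, the remaining `<img src=1 onError=alert(1)>` is seen as a tag and can be handled by the policy like any other element.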
HTML RFC context
In the HTML RFC, comments are defined as:
This means that <!> is a valid comment declaration with zero comments inside, so the following HTML code would trigger the alert(1) if rendered in a browser:
In addition, browsers also transform the pattern <!--> to <!-- --> (not sure why yet, but tested in all major browsers). This means the following HTML code would also trigger the alert(1) if rendered in a browser:
The bug
The lexer considers neither <!> nor <!--> a valid comment declaration, and it treats the last character of both (>) as still part of the comment.
This means that when the sanitizer reads the input <!><img src=1 onError=alert(1)>, the lexer interprets the whole input as an HTML comment.
However, the expected behavior would be to detect <!> as an HTML comment declaration and <img src=1 onError=alert(1)> as an img HTML tag.
This is easy to see with the HtmlChangeListener class. For example, the following code provides the following output:
The output:
However, the expected correct output should be:
The fix
We added a new state called COMMENT_DASH_AFTER_BANG to the HtmlInputSplitter inside HtmlLexer to handle a dash character after the bang (!-).
We also added special condition checks to the BANG and BANG_DASH states of the lexer's state machine to handle <!> and <!--> comments.
Authors
Bug discovery: Carlos Villa (@carlosvillasanchez)
Code fix: Eduardo Aguado (@jfbyers)