Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Antisamy stripping @ and not encoding if it falls within <> #24

Open
ankitmdesai opened this issue Apr 25, 2018 · 9 comments
Open

Antisamy stripping @ and not encoding if it falls within <> #24

ankitmdesai opened this issue Apr 25, 2018 · 9 comments
Labels

Comments

@ankitmdesai
Copy link

ankitmdesai commented Apr 25, 2018

I am using antisamy 1.5.7.

I saw issue when input was

firstname,lastname<[email protected]> or firstname,lastname<[email protected] testing>

Result after Antisamy scan is same for both above cases

firstname,lastname<name>

I have below directive in policy file

<directive name="onUnknownTag" value="encode"/>

Is there a place in policy file I can update to encode @ when it is within <> ?

@davewichers
Copy link
Collaborator

davewichers commented Mar 28, 2019

I'm investigating this. Using a DOM parser, with these settings, I get only: "firstname,lastname" in the output of .getCleanHTML(). Using a SAX parser, I get: "firstname,lastname&lt;name&gt;&lt;/name&gt;". I'm not sure if this inconsistency in output is a different bug. I'll research that too.

@nahsra
Copy link
Owner

nahsra commented Mar 30, 2019

This use case is not possible with the current architecture. I see the value. I think it’s a candidate for 1.5.9.

@goshantmeher
Copy link

goshantmeher commented May 29, 2019

I have similar kind of issue when i try to scan with text: "hello <hii world", i get the output : "hello".
and scan is not giving any error in CleanResults for this. but with
when i scan text "hello <hi>" it will give output: "hello" with error "[The hi tag was empty, and therefore we could not process it. The rest of the message is intact, and its removal should not have any side effects.]"
is there any workaround to get output "hello <hii world" on scan "hello <hii world".

@log2akshat
Copy link

I have a similar issue, using the Antisamy library with version 1.5.8 and I tried writing the following unit test case:

Input:

<html><head><style>.uegzbq{font-size:22px;}@media not all and (pointer:coarse){.8bsfb:hover{background-color:#056b27;}}.scem3j{font-size:25px;}</style></head><body><div class="uegzbq">First Line</div><br><div class="scem3j">Second Line</div></body></html>

Expected on org.owasp.validator.html.CleanResults.getCleanHTML:

<html><head><style>/*<![CDATA[*/*.uegzbq {
	font-size: 22.0px;
}
@media not all and (pointer:coarse) {
       .8bsfb: hover {
                background-color: #056b27;
        }
}
.scem3j{
       font-size:25px;
}
/*]]>*/</style></head><body><div class="uegzbq">First Line</div><br /><div class="scem3j">Second Line</div></body></html>

Actual:

<html><head><style>/*<![CDATA[*/*.uegzbq {
	font-size: 22.0px;
}
/*]]>*/</style></head><body><div class="uegzbq">First Line</div><br /><div class="scem3j">Second Line</div></body></html>

I can see that in the org.owasp.validator.html.scan.AntiSamyDOMScanner class, I was having the expected string prior to serialization and after the org.apache.xml.serialize.HTMLSerializer has done the serialization to the DocumentFragment Whatever it was after @ symbol got stripped off in the style tag.
I tried upgrading to 1.6.3 but still having the same issue.

@nikowitt
Copy link

nikowitt commented Dec 2, 2021

Also stumbled across this one - are there any plans to work on this in the near future?

@davewichers
Copy link
Collaborator

@spassarop - Is this even possible/reasonable? Or way too hard? I suspect 'too hard'.

@spassarop
Copy link
Collaborator

Some points to cover and clarify to understand the current behavior:

  • AntiSamy parses the input with cyberneko as HTML, so it expects valid HTML to build the internal representation which gets validated. Raw inputs like hello <hii world and firstname,lastname<[email protected]> (provided by @goshantmeher and @nikowitt respectively) are not valid HTML, so the library returns a best effort parse result auto-closing tags and so. Raw text that goes through an HTML parser but is not intended to be HTML should be HTML-encoded first, at least that specific fragments. It is just the nature of the parser, as with any other kind of parser.
  • Using <directive name="onUnknownTag" value="encode"/> does not work in this particular case (the one with [email protected]) because in the validation logic, a prior step is to check if the tag is empty AND if is in the allowed empty tags list. As the tag is not defined, it won't be in the policy, it won't be allowed as empty and therefore removed before even getting to the encoding step. Just to test I added "name" as an allowed empty tag and re-tested, resulting in firstname,lastname&lt;name/&gt;. This takes us to the following point.
  • The cyberneko parser naturally does not take the whole [email protected] as a valid tag name, it internally truncates it when encountering @. Can't do anything about it as far as I know, if the whole string should be interpreted as text, then it should also be HTML-encoded as I said before.
  • The CSS style tag case is a whole different scenario, where the problem is not HTML parsing, but CSS parsing. In the example provided by @log2akshat, the @media rule is not valid for Batik-CSS library (it expects the opening { right after the not word), so it decides to fail and stop parsing the whole CSS block, the HTML-embedded style sheet. Why is that? You may say Batik-CSS is not smart enough (already pointed this out in Embed style sheets after opening embedStyleSheets should not be deleted all. #108) and it a valid point.

In conclusion, there is not much to do because of the underlying libraries that parse HTML and CSS. They just expect text in their respective formats to evaluate and parse them in their target spec, parsing invalid HTML will result in a weird result for sure. If you fear someone may put HTML in a place where it shouldn't the solution is to HTML-encode, not to filter. Maybe the whole string, maybe the fragments that must be HTML-free but get inserted in an HTML template, that depends entirely on the usage context.

The most I can offer here is the allowed empty tag stuff, a logic change to remove them only if they are known tags but not present in that policy fragment definition. I hope all this explanation make the issues clear.

@davewichers
Copy link
Collaborator

@spassarop - You did a lot of analysis on this one. Are there any changes you are comfortable with making that would improve anything?

@spassarop
Copy link
Collaborator

The most I can offer here is the allowed empty tag stuff, a logic change to remove them only if they are known tags but not present in that policy fragment definition.

Maybe this, so tags like <name> which are unknown get a chance to be detected as such and honor the “onUnknownTag” behavior, independently whether they’re empty tags or not. An unknown and empty tag does not get to that point today.

I’m not sure if it does improve something but at least will be consistent with the expected behavior of “onUnknownTag” on policy definition. It can be done for the next version.

The other problems cannot be solved through AntiSamy code. In my opinion they’re due to wrong usage and underlying libraries limitations, as I stated in the previous analysis.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants