Antisamy stripping @ and not encoding if it falls within <> #24

ankitmdesai · 2018-04-25T20:24:57Z

I am using antisamy 1.5.7.

I saw issue when input was

firstname,lastname<[email protected]> or firstname,lastname<[email protected] testing>

Result after Antisamy scan is same for both above cases

firstname,lastname<name>

I have below directive in policy file

<directive name="onUnknownTag" value="encode"/>

Is there a place in policy file I can update to encode @ when it is within <> ?

The text was updated successfully, but these errors were encountered:

davewichers · 2019-03-28T19:26:14Z

I'm investigating this. Using a DOM parser, with these settings, I get only: "firstname,lastname" in the output of .getCleanHTML(). Using a SAX parser, I get: "firstname,lastname<name></name>". I'm not sure if this inconsistency in output is a different bug. I'll research that too.

nahsra · 2019-03-30T01:31:24Z

This use case is not possible with the current architecture. I see the value. I think it’s a candidate for 1.5.9.

goshantmeher · 2019-05-29T10:23:03Z

I have similar kind of issue when i try to scan with text: "hello <hii world", i get the output : "hello".
and scan is not giving any error in CleanResults for this. but with
when i scan text "hello <hi>" it will give output: "hello" with error "[The hi tag was empty, and therefore we could not process it. The rest of the message is intact, and its removal should not have any side effects.]"
is there any workaround to get output "hello <hii world" on scan "hello <hii world".

log2akshat · 2021-05-27T08:17:23Z

I have a similar issue, using the Antisamy library with version 1.5.8 and I tried writing the following unit test case:

Input:

<html><head><style>.uegzbq{font-size:22px;}@media not all and (pointer:coarse){.8bsfb:hover{background-color:#056b27;}}.scem3j{font-size:25px;}</style></head><body><div class="uegzbq">First Line</div><br><div class="scem3j">Second Line</div></body></html>

Expected on org.owasp.validator.html.CleanResults.getCleanHTML:

<html><head><style>/*<![CDATA[*/*.uegzbq {
	font-size: 22.0px;
}
@media not all and (pointer:coarse) {
       .8bsfb: hover {
                background-color: #056b27;
        }
}
.scem3j{
       font-size:25px;
}
/*]]>*/</style></head><body><div class="uegzbq">First Line</div><br /><div class="scem3j">Second Line</div></body></html>

Actual:

<html><head><style>/*<![CDATA[*/*.uegzbq {
	font-size: 22.0px;
}
/*]]>*/</style></head><body><div class="uegzbq">First Line</div><br /><div class="scem3j">Second Line</div></body></html>

I can see that in the org.owasp.validator.html.scan.AntiSamyDOMScanner class, I was having the expected string prior to serialization and after the org.apache.xml.serialize.HTMLSerializer has done the serialization to the DocumentFragment Whatever it was after @ symbol got stripped off in the style tag.
I tried upgrading to 1.6.3 but still having the same issue.

nikowitt · 2021-12-02T07:00:07Z

Also stumbled across this one - are there any plans to work on this in the near future?

davewichers · 2021-12-02T20:05:54Z

@spassarop - Is this even possible/reasonable? Or way too hard? I suspect 'too hard'.

spassarop · 2021-12-04T21:07:44Z

Some points to cover and clarify to understand the current behavior:

AntiSamy parses the input with cyberneko as HTML, so it expects valid HTML to build the internal representation which gets validated. Raw inputs like hello <hii world and firstname,lastname<[email protected]> (provided by @goshantmeher and @nikowitt respectively) are not valid HTML, so the library returns a best effort parse result auto-closing tags and so. Raw text that goes through an HTML parser but is not intended to be HTML should be HTML-encoded first, at least that specific fragments. It is just the nature of the parser, as with any other kind of parser.
Using <directive name="onUnknownTag" value="encode"/> does not work in this particular case (the one with [email protected]) because in the validation logic, a prior step is to check if the tag is empty AND if is in the allowed empty tags list. As the tag is not defined, it won't be in the policy, it won't be allowed as empty and therefore removed before even getting to the encoding step. Just to test I added "name" as an allowed empty tag and re-tested, resulting in firstname,lastname<name/>. This takes us to the following point.
The cyberneko parser naturally does not take the whole [email protected] as a valid tag name, it internally truncates it when encountering @. Can't do anything about it as far as I know, if the whole string should be interpreted as text, then it should also be HTML-encoded as I said before.
The CSS style tag case is a whole different scenario, where the problem is not HTML parsing, but CSS parsing. In the example provided by @log2akshat, the @media rule is not valid for Batik-CSS library (it expects the opening { right after the not word), so it decides to fail and stop parsing the whole CSS block, the HTML-embedded style sheet. Why is that? You may say Batik-CSS is not smart enough (already pointed this out in Embed style sheets after opening embedStyleSheets should not be deleted all. #108) and it a valid point.

In conclusion, there is not much to do because of the underlying libraries that parse HTML and CSS. They just expect text in their respective formats to evaluate and parse them in their target spec, parsing invalid HTML will result in a weird result for sure. If you fear someone may put HTML in a place where it shouldn't the solution is to HTML-encode, not to filter. Maybe the whole string, maybe the fragments that must be HTML-free but get inserted in an HTML template, that depends entirely on the usage context.

The most I can offer here is the allowed empty tag stuff, a logic change to remove them only if they are known tags but not present in that policy fragment definition. I hope all this explanation make the issues clear.

davewichers · 2022-01-08T16:56:59Z

@spassarop - You did a lot of analysis on this one. Are there any changes you are comfortable with making that would improve anything?

spassarop · 2022-01-08T19:39:40Z

The most I can offer here is the allowed empty tag stuff, a logic change to remove them only if they are known tags but not present in that policy fragment definition.

Maybe this, so tags like <name> which are unknown get a chance to be detected as such and honor the “onUnknownTag” behavior, independently whether they’re empty tags or not. An unknown and empty tag does not get to that point today.

I’m not sure if it does improve something but at least will be consistent with the expected behavior of “onUnknownTag” on policy definition. It can be done for the next version.

The other problems cannot be solved through AntiSamy code. In my opinion they’re due to wrong usage and underlying libraries limitations, as I stated in the previous analysis.

davewichers added the bug label Mar 26, 2019

davewichers mentioned this issue Jan 13, 2021

stripNonValidXMLCharacters doesn't work with HTML where html.length() > 1 #34

Open

log2akshat mentioned this issue May 28, 2021

ZCS-10594/TSS-18404: Fix for emails not displaying correctly. Zimbra/antisamy#2

Merged

silentsakky mentioned this issue Nov 23, 2024

ZCS-16214 Zimbra/antisamy#6

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Antisamy stripping @ and not encoding if it falls within <> #24

Antisamy stripping @ and not encoding if it falls within <> #24

ankitmdesai commented Apr 25, 2018 •

edited

Loading

davewichers commented Mar 28, 2019 •

edited

Loading

nahsra commented Mar 30, 2019

goshantmeher commented May 29, 2019 •

edited

Loading

log2akshat commented May 27, 2021

nikowitt commented Dec 2, 2021

davewichers commented Dec 2, 2021

spassarop commented Dec 4, 2021

davewichers commented Jan 8, 2022

spassarop commented Jan 8, 2022

Antisamy stripping @ and not encoding if it falls within <> #24

Antisamy stripping @ and not encoding if it falls within <> #24

Comments

ankitmdesai commented Apr 25, 2018 • edited Loading

davewichers commented Mar 28, 2019 • edited Loading

nahsra commented Mar 30, 2019

goshantmeher commented May 29, 2019 • edited Loading

log2akshat commented May 27, 2021

nikowitt commented Dec 2, 2021

davewichers commented Dec 2, 2021

spassarop commented Dec 4, 2021

davewichers commented Jan 8, 2022

spassarop commented Jan 8, 2022

ankitmdesai commented Apr 25, 2018 •

edited

Loading

davewichers commented Mar 28, 2019 •

edited

Loading

goshantmeher commented May 29, 2019 •

edited

Loading