-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Antisamy removes "margin" attribute when it's value is configured very small decimal number #101
Comments
@spassarop - Can you look into this one too? |
You won't believe it. When you exceed the three significant figures, the value in CSS as a lexical unit is interpreted with an exponential value. What I mean is that This lexical unit stuff is how Batik CSS handles CSS parsing and its way beyond AntiSamy limits. What AntiSamy does is to concatenate the lexical unit properties from Batik CSS API, this is the case for lu.getFloatValue() + lu.getDimensionUnitText() This is The other problem is that even if the regex is changed anticipating that shady/hidden/implicit conversion and the margin keeps its value, it will have that new verified string, so instead of the original So, there lies the reason and extended background of the issue. @davewichers what do you think about all that? As a side note: this also happens on .NET, just can't fight with the underlying parser :'( |
@spassarop - Whatever you want to do here is fine with me. I trust you :-) |
In that case, the only thing I can offer is to label this issue as bug and update the example policies so |
Sounds fine with me. Do you want to implement/push those changes? |
I'm on it. Doing the test cases I found out that there are also problems on Batik CSS with exponential dimensions when they are the input. When using the margin |
Thanks @davewichers for sharing RCA of this issue. As per your latest comment, are you going to provide this by default in the next release? If so, could you please share probable timeline for the same and based on that we would be asking our customer to change the policy to allow exponentials |
@davewichers / @spassarop I did close this issue by mistake so reopening it. |
@bijarniy yes this will be on the next release if it gets merged before that. |
- Update length regexes on example policies. - The tests have comments explaining the bug (#101) and a single case that passes but involves a latent bug.
@davewichers even though a partial solution was provided including exponentials on regex there is no actual solution as we depend on Batick CSS. Should we leave this open potentially forever? |
@davewichers and @spassarop - This really is sounding more like a bug in Batik CSS or possibly a misalignment in the understanding with your users than it is a bug per se in AntiSamy. I would recommend closing it with a 'wontfix' label. (You do have one of those, don't you? I haven't checked.) |
I verified some CSS attributes (width and length) and found that there was a problem of Do we need to push the Batik CSS community to address this issue? |
@spassarop - Would you be willing to open up a discussion with Batik to see what they say? |
@star-R I think the only logical approach at this point is to ask that Batik CSS community to fix it. Write up an example with a test case using Batik CSS directly and show them expected outcome to illustrate exactly what you are asking for / expecting. Because I think the AntiSamy team has already said they are not going to replace Batik CSS or try to write their own CSS parser. Just my $.02, but that was my interpretation based on earlier comments from the AntiSamy contributors. Batik CSS is still being actively supported so if you can show them a place that it not in compliance with W3C, I think you'll have a fair shot of convincing them. But since you are the one who wants this (I doubt anyone else uses such extreme margins), I think you should be the one to file an issue with Batik CSS. |
I agree with Kevin on the wontfix and on the W3C compliance argument. Just to avoid misunderstandings, @LiuXing-R was just helping investigate, @bijarniy was the one who created the issue. I found the JIRA site where they add issues and signed up but there are so many fields to fill that I'll probably be kicked in the head by one of the team if I do it wrong. It would be better to notify somewhere to discuss and make them follow the right process. Apparently it must be done through the mailing list, I subscribed [email protected] to later see how you're supposed to report a bug. Of course anyone else can do it too, in my case if I end doing it I'll comment here. It will take time to elaborate a convincing and complete bug report. |
@spassarop - Any dialog going on with the Batik team? If not, can you push/nudge again to see if you can get them to work on this? |
Damn, I forgot to do this. I started to investigate how to report but somehow I didn't do it. Guess I'll do it in the following days, I'm setting a reminder. |
Created an issue just now: https://issues.apache.org/jira/browse/BATIK-1320 |
I found this issue while browsing Batik's JIRA, although I'm not related to the Batik project (nor the ASF) in any way. I do not think that there is anything that the Batik people (nor any other SAC implementor) could do. The SAC API provides the lu.getFloatValue() + lu.getDimensionUnitText() That is, the The SAC API does not provide any way to determine the exact input text that was used when parsing. Perhaps Antisamy could provide a different kind of check so numbers are checked as numbers and not as strings. Checking numbers with string methods is always going to be error prone. |
Incidentally, I see that you are checking against old CSS 2.1 units. <regexp name="length" value="((-|\+)?0|(-|\+)?([0-9]+(\.[0-9]*)?)(em|ex|px|in|cm|mm|pt|pc))"/> But today's websites use a wider range of values and units, see CSS Values and Units Module Level 4. So if you keep with the regexp approach (for example serializing the |
@carlosame: About the units, the regex was expanded in #104 based on this page. I don't know why some of the units were left out really. We could expand the list even more with absolute and relative units. Regarding the float stuff the matter is complex. AntiSamy does "generic" validations by comparing to values defined in a policy XML. It can be a strict comparison if it's by value or a regexp-based comparison. I agree it's best to use abstract representations whenever it's possible instead of just using strings in general. That's why we use parsers and such utilities and work with objects. However, relying on parsers for this to be done eventually ends up in having a final granular value that we cannot go "deeper" with. In general, for HTML and CSS we use strings as normally the values are just text or number+text. We also stop going deeper when evaluating attributes values or tags' text contents. With CSS we stop on those lexical units for CSS property values. A property value can be really complex as you know, with numbers, units, function calls, etc. That's why we also need the parser to be specific with what it provides so we can validate the values somehow. I'd love we had something more versatile than just I don't really know if an improvement from Batik's side can be made, that's what I want them to tell us. I'm not a Java guy so that's why I seek specific knowledge for this. On AntiSamy .NET I had the same issue and reached the parser creator who fixed it, so now it has the behavior I needed on most cases (there was a details with exponents which I cannot remember now). Don't have much to say. |
I haven't seen that updated regex, I'd just add
The problem are the following lines:
and
If you serialize those You do not need to change anything at Batik, although I'd migrate to a more modern SAC parser like my Css4j 1.3.1 (in post-EOL state) or cssparser (less complete but still maintained, and the maintainer is generally very helpful).
The Batik project is not very active currently so you could have to wait for a long time, and the fix that I explained is easy to apply. |
Related to #293 |
Steps to reproduce the problem-
<p style="margin: 0.0001pt;" />
.filterHTML
API to filter the above HTML contentExpected output-
With regexp configuration [2], should not remove the margin with any decimal number
[1] AntiSamy warning: The p tag had a style attribute, "margin", that could not be allowed for security reasons.
[2]
<regexp name="length" value="((-|\+)?0|(-|\+)?([0-9]+(\.[0-9]*)?)(em|ex|px|in|cm|mm|pt|pc))"/>
The text was updated successfully, but these errors were encountered: