-
Notifications
You must be signed in to change notification settings - Fork 6
Fix tag balancing #23
base: master
Are you sure you want to change the base?
Conversation
@maditya, please check. thanks :) |
65129b3
to
9925fd4
Compare
can we track updates to benchmark suite in a separate PR ? |
ok. rebased to separate it to PR #25 |
9925fd4
to
8243352
Compare
@maditya , as discussed over phone, i optimized the code a bit, and imposed a cap on stack size. |
if (openedTag) { | ||
this.output += '</' + openedTag + '>'; | ||
if (tagBalance.enabled && !optionalElements[tagName]) { | ||
// relaxed tag balancing, accept it as long as the tag exists in the stack |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be a heuristic approach. We should have specific examples, probably in a separate document, on what cases we are handling and what cases we are not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A document was prepared and shared internally on Aug 13. In case you missed it, please search "Untrusted input should be self-contained by tag balancing" in your mailbox.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated the README to explain tag balancing.
4f3f134
to
bcd6047
Compare
@@ -37,28 +48,29 @@ See the accompanying LICENSE file for terms. | |||
enableCanonicalization: config.enableCanonicalization, | |||
enableVoidingIEConditionalComments: config.enableVoidingIEConditionalComments | |||
}).on('postWalk', function (lastState, state, i, endsWithEOF) { | |||
processTransition.call(that, lastState, state, i); | |||
!tagBalance.stackOverflow && processTransition.call(that, lastState, state, i); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another approach on handling stackOverflow is to :
a. clear the stack
b. disregard the check "if tag exists in the stack".
(in other words, behave as if tag balancing was disabled)
That would prevent truncating valid/balanced input just because it has nesting > stack size(as currently seen in the test case at https://github.com/yahoo/html-purify/pull/23/files#diff-cfdb6a0bb85a9c9dcb6708d2225134cfR42). Also, our assertion that "we won't do tag balacing when nesting > stack size" still holds true. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
really want to let an untrusted input to disable tag balancing?
an attacker can first pad <b>
n+1 times, where n equals the stack size limit, optionally close all of them, and then start crafting "dangerous" tags. Examples:
<b>...<b><textarea>
<b>...<b><div style="display:none;visibility:hidden;font-size:0;color:#000">
(accepting/allowing either of those CSS will result in a consequence literally like<!--
. It's common to allow an untrusted email to specify font size/color)<div style="display:none;visibility:hidden;font-size:0;color:#000">
n+1 times
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let me come up with a simple use case, as for email threads/conversations:
<div>
To: inHTMLData(email_recipient1)<br>
From: inHTMLData(email_sender1)<br>
Subject: inHTMLData(email_subj1)<br>
purify(email_content1)
<br><button>Reply</button><button>Forward</button><button>Delete</button>
</div>
<div>
To: inHTMLData(email_recipient2)<br>
From: inHTMLData(email_sender2)<br>
Subject: inHTMLData(email_subj2)<br>
purify(email_content2)
<br><button>Reply</button><button>Forward</button><button>Delete</button>
</div>
<div>
To: inHTMLData(email_recipient3)<br>
From: inHTMLData(email_sender3)<br>
Subject: inHTMLData(email_subj3)<br>
purify(email_content3)
<br><button>Reply</button><button>Forward</button><button>Delete</button>
</div>
I hope this use case can explain why we do tag balancing. No one would accept:
(1) the purify(email_content1)
can hide the buttons like Reply, Forward, and Delete from the trusted Mail app
(2) even worst, purify(email_content1)
can collude with purify(email_content3)
to hide email 2, when email_content1
is </div><div style="display:none;..."><div>
and email_content3
is </div></div><div>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thoughts? this pretty much explained the need of self-containing the untrusted data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we should make sure the output is safe, or we just return empty output with an error.
Please see http://jsperf.com/array-push-int-and-string for performance with various array push technique. Since we are having a whitelist of tags, I think we would prefer using integer (enum based) assignment to the tag list stack (array). There may be other optimization techniques, please feel free to edit the jsperf test for such. |
@yukinying Right now, the tagName whitelist is like @yukinying @maditya Hence, should we now keep the current implementation as is before settling the use case regarding tag-specific attribute whitelist and its required the list structure? And perhaps, (if the list structure really requires a change,) let's do it in separate PRs. |
The suggestion there is to avoid using Array.prototype.push as it is not optimized for the use case. Instead, keeping a counter of the stack size and use it to mimic the push and pop operation will give you a huge performance boost. A further boost can be possible if we use enum based approach for identifying the tag name, as the space required for building the stack will be much smaller. Please take the suggestion on avoiding use of array push as a basic requirement for the tag balancing stack. Once that's done, we should be able to relax the requirement of stack size limitation (e.g. grow it to 100,000) to some value that we would consider the input as totally malicious / invalid, and then just return error for the case. |
I appreciate your suggestions, and indeed, thank you for a midnight comment. 👍 But there has no more I personally think legitimate HTML won't nest more than a maximum of 100 levels. 100,000 sounds way too relax perhaps. Will you consider/accept 1,000 or 10,000? Regarding "enum based approach for identifying the tag name", I also see its potential as you do. But could we not to do that in this PR before we finalize the enum/whitelist structure? |
I would rather use some measurable metrics to determine the level. Could you provide a simple benchmark with nested tags with different levels, e.g. 100, 1000, 10000? We will also need a doc (or MD) file that describes the actual tag balancing algorithm. The doc you sent me previously has too many open questions there. My preference is to have a document that clearly explains the algorithm, and we use the algorithm spec to validate if any future pull requests or bug fixes from any contributors would align with the design principles and security consideration of such. It'd actually be much nicer if the commit fa24fa2 has come along with a description of the changes. |
Agree that the matrix of allowable tags and attributes combination should come as a separate PR. |
I'm not sure if I misunderstood. I perceive that the cap is more related to what the norm is, for which no legit inputs will get truncated because of exceeding the cap. As long as the cap is not hit, 100 and 100,000 won't make a difference in terms of performance (stackPtr < 100 vs. stackPtr < 100000). In fact, I'm confused about the relationship between the cap and anything related to benchmark, including the worst-case performance you sort of mentioned a while ago. Assume we've a 1m file. Say, we encountered 300KB (e.g., Determining that big enough number may sound magical, but more accurately it's very use-case specific. The best effort is to set a sufficiently big number. 100, 1000, 10000 all look reasonable to me. The best approach is to have it configurable as already did, and let users judge based on their needs. May be Mail never want to truncate people's emails to avoid complaints, then the number could be MAX of what a |
MD (or even comments inside the code) is good, since it can be included in this open-sourced project. The description on tag balancing in already included in the README.md. I'd very much appreciate if you could directly edit/improve/polish it over there. Thanks. |
We need more than a paragraph, and we need the algorithm rather than an overview. Without such, this pull request would not be accepted. |
Also, we would not accept source code comment as a proper documentation unless they are gathered in one big comment block. The reason is that it is very easy to overlook any potential new pull requests that may wipe a single line comment or moved certain code logic and mess up all the remaining comments. The requirement may not be necessary for other libraries, yet we try to set a higher bar to make this security library be more easier manageable. |
- logics simplified - optional tags are ignored during balancing - ignore self-closing solidus, which is required only for foreign elements
d1ce433
to
12d5fd2
Compare
- imposed a max stack size, i.e., nested-levels - stack pointer is used instead of push()
- also improved the arrayLastIndexOf to use the native Array.prototype.lastIndexOf if exixts
12d5fd2
to
3701cb7
Compare
please find the changes rebased to the latest master. comments added too. @maditya @yukinying |
Changes
before the fix
default x 0.64 ops/sec ±5.90% (6 runs sampled)
disabled tag balancing x 0.60 ops/sec ±3.82% (6 runs sampled)
Fastest is [ 'default' ]
Speed/Time is 0.6423016849725829MB/s 1.5567777936666667s
after the fix
default x 0.72 ops/sec ±5.01% (6 runs sampled)
disabled tag balancing x 0.60 ops/sec ±3.91% (6 runs sampled)
Fastest is [ 'default' ]
Speed/Time is 0.7190234217142741MB/s 1.3906654078333334s
it actually runs faster than without it using the given 1m file. so tag balancing doesn't seem to incur any significant performance hit.