Skip to content
This repository has been archived by the owner on Jul 15, 2019. It is now read-only.

Fix tag balancing #23

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open

Fix tag balancing #23

wants to merge 4 commits into from

Conversation

adon-at-work
Copy link
Contributor

Changes

  • fix: optional tags are ignored during balancing, i.e., well-balanced input will not be broken by attempting to close optional tags.
  • tag balancing logics simplified (accept closing tag as long as it exists in the stack)
  • ignore self-closing solidus, which is required only for foreign elements
  • added a cap to limit the max levels of unclosed/nested tags

before the fix
default x 0.64 ops/sec ±5.90% (6 runs sampled)
disabled tag balancing x 0.60 ops/sec ±3.82% (6 runs sampled)
Fastest is [ 'default' ]
Speed/Time is 0.6423016849725829MB/s 1.5567777936666667s

after the fix
default x 0.72 ops/sec ±5.01% (6 runs sampled)
disabled tag balancing x 0.60 ops/sec ±3.91% (6 runs sampled)
Fastest is [ 'default' ]
Speed/Time is 0.7190234217142741MB/s 1.3906654078333334s

it actually runs faster than without it using the given 1m file. so tag balancing doesn't seem to incur any significant performance hit.

@adon-at-work
Copy link
Contributor Author

@maditya, please check. thanks :)

@maditya
Copy link
Contributor

maditya commented Aug 17, 2015

can we track updates to benchmark suite in a separate PR ?

@adon-at-work
Copy link
Contributor Author

ok. rebased to separate it to PR #25

@adon-at-work
Copy link
Contributor Author

@maditya , as discussed over phone, i optimized the code a bit, and imposed a cap on stack size.

if (openedTag) {
this.output += '</' + openedTag + '>';
if (tagBalance.enabled && !optionalElements[tagName]) {
// relaxed tag balancing, accept it as long as the tag exists in the stack
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems to be a heuristic approach. We should have specific examples, probably in a separate document, on what cases we are handling and what cases we are not.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A document was prepared and shared internally on Aug 13. In case you missed it, please search "Untrusted input should be self-contained by tag balancing" in your mailbox.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the README to explain tag balancing.

@maditya

@@ -37,28 +48,29 @@ See the accompanying LICENSE file for terms.
enableCanonicalization: config.enableCanonicalization,
enableVoidingIEConditionalComments: config.enableVoidingIEConditionalComments
}).on('postWalk', function (lastState, state, i, endsWithEOF) {
processTransition.call(that, lastState, state, i);
!tagBalance.stackOverflow && processTransition.call(that, lastState, state, i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another approach on handling stackOverflow is to :
a. clear the stack
b. disregard the check "if tag exists in the stack".
(in other words, behave as if tag balancing was disabled)
That would prevent truncating valid/balanced input just because it has nesting > stack size(as currently seen in the test case at https://github.com/yahoo/html-purify/pull/23/files#diff-cfdb6a0bb85a9c9dcb6708d2225134cfR42). Also, our assertion that "we won't do tag balacing when nesting > stack size" still holds true. Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really want to let an untrusted input to disable tag balancing?

an attacker can first pad <b> n+1 times, where n equals the stack size limit, optionally close all of them, and then start crafting "dangerous" tags. Examples:

  • <b>...<b><textarea>
  • <b>...<b><div style="display:none;visibility:hidden;font-size:0;color:#000"> (accepting/allowing either of those CSS will result in a consequence literally like <!--. It's common to allow an untrusted email to specify font size/color)
  • <div style="display:none;visibility:hidden;font-size:0;color:#000"> n+1 times

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me come up with a simple use case, as for email threads/conversations:

<div>
To: inHTMLData(email_recipient1)<br>
From: inHTMLData(email_sender1)<br>
Subject: inHTMLData(email_subj1)<br>
purify(email_content1)
<br><button>Reply</button><button>Forward</button><button>Delete</button>
</div>

<div>
To: inHTMLData(email_recipient2)<br>
From: inHTMLData(email_sender2)<br>
Subject: inHTMLData(email_subj2)<br>
purify(email_content2)
<br><button>Reply</button><button>Forward</button><button>Delete</button>
</div>

<div>
To: inHTMLData(email_recipient3)<br>
From: inHTMLData(email_sender3)<br>
Subject: inHTMLData(email_subj3)<br>
purify(email_content3)
<br><button>Reply</button><button>Forward</button><button>Delete</button>
</div>

I hope this use case can explain why we do tag balancing. No one would accept:
(1) the purify(email_content1) can hide the buttons like Reply, Forward, and Delete from the trusted Mail app
(2) even worst, purify(email_content1) can collude with purify(email_content3) to hide email 2, when email_content1 is </div><div style="display:none;..."><div> and email_content3 is </div></div><div>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thoughts? this pretty much explained the need of self-containing the untrusted data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make sure the output is safe, or we just return empty output with an error.

@yukinying
Copy link
Contributor

@maditya, @adon-at-work

Please see http://jsperf.com/array-push-int-and-string for performance with various array push technique. Since we are having a whitelist of tags, I think we would prefer using integer (enum based) assignment to the tag list stack (array).

There may be other optimization techniques, please feel free to edit the jsperf test for such.

@adon-at-work
Copy link
Contributor Author

@yukinying
I see you're suggesting to deal with integers instead of tagName (string), and thanks for the detailed performance tests. :)

Right now, the tagName whitelist is like ['a', 'div', ...], and I'm aware of making use of the array indices as the integers we want.

@yukinying @maditya
Looking farther, it boils down to whether we'll support tag-specific attribute whitelist, e.g., href is whitelisted for <a> but not <div>, as in the current Mail or YIV use case. This will for sure affect the list structure. As a reference, secure-handlebars attempted to support similar tag-specific attribute matching by: https://github.com/yahoo/secure-handlebars/blob/master/src/parser-utils.js#L68-L70. Here, we may need a very different structure.

Hence, should we now keep the current implementation as is before settling the use case regarding tag-specific attribute whitelist and its required the list structure? And perhaps, (if the list structure really requires a change,) let's do it in separate PRs.

@yukinying
Copy link
Contributor

The suggestion there is to avoid using Array.prototype.push as it is not optimized for the use case. Instead, keeping a counter of the stack size and use it to mimic the push and pop operation will give you a huge performance boost.

A further boost can be possible if we use enum based approach for identifying the tag name, as the space required for building the stack will be much smaller.

Please take the suggestion on avoiding use of array push as a basic requirement for the tag balancing stack. Once that's done, we should be able to relax the requirement of stack size limitation (e.g. grow it to 100,000) to some value that we would consider the input as totally malicious / invalid, and then just return error for the case.

@adon-at-work
Copy link
Contributor Author

I appreciate your suggestions, and indeed, thank you for a midnight comment. 👍

But there has no more .push since 24 days ago. Such an optimization is completed in fa24fa2#diff-9c568acf92dc1a69fdd058c41195c719R121. You may like to confirm by searching for .push after clicking "Files Changed".

I personally think legitimate HTML won't nest more than a maximum of 100 levels. 100,000 sounds way too relax perhaps. Will you consider/accept 1,000 or 10,000?

Regarding "enum based approach for identifying the tag name", I also see its potential as you do. But could we not to do that in this PR before we finalize the enum/whitelist structure?

@yukinying
Copy link
Contributor

I would rather use some measurable metrics to determine the level. Could you provide a simple benchmark with nested tags with different levels, e.g. 100, 1000, 10000?

We will also need a doc (or MD) file that describes the actual tag balancing algorithm. The doc you sent me previously has too many open questions there. My preference is to have a document that clearly explains the algorithm, and we use the algorithm spec to validate if any future pull requests or bug fixes from any contributors would align with the design principles and security consideration of such.

It'd actually be much nicer if the commit fa24fa2 has come along with a description of the changes.

@yukinying
Copy link
Contributor

Looking farther, it boils down to whether we'll support tag-specific attribute whitelist, e.g., href is whitelisted for but not

, as in the current Mail or YIV use case. This will for sure affect the list structure. As a reference, secure-handlebars attempted to support similar tag-specific attribute matching by: https://github.com/yahoo/secure-handlebars/blob/master/src/parser-utils.js#L68-L70. Here, we may need a very different structure.

Hence, should we now keep the current implementation as is before settling the use case regarding tag-specific attribute whitelist and its required the list structure? And perhaps, (if the list structure really requires a change,) let's do it in separate PRs.

Agree that the matrix of allowable tags and attributes combination should come as a separate PR.

@adon-at-work
Copy link
Contributor Author

I would rather use some measurable metrics to determine the level. Could you provide a simple benchmark with nested tags with different levels, e.g. 100, 1000, 10000?

I'm not sure if I misunderstood.

I perceive that the cap is more related to what the norm is, for which no legit inputs will get truncated because of exceeding the cap. As long as the cap is not hit, 100 and 100,000 won't make a difference in terms of performance (stackPtr < 100 vs. stackPtr < 100000).

In fact, I'm confused about the relationship between the cap and anything related to benchmark, including the worst-case performance you sort of mentioned a while ago. Assume we've a 1m file. Say, we encountered 300KB (e.g., <b> x 100) of tags that are so nested to hit the cap, then the remaining 9700KB of data will be ignored. One can obviously tell the performance is even better once hitting the cap, as no further inputs will be consumed.

Determining that big enough number may sound magical, but more accurately it's very use-case specific. The best effort is to set a sufficiently big number. 100, 1000, 10000 all look reasonable to me. The best approach is to have it configurable as already did, and let users judge based on their needs. May be Mail never want to truncate people's emails to avoid complaints, then the number could be MAX of what a Number can hold.

@adon-at-work
Copy link
Contributor Author

We will also need a doc (or MD) file that describes the actual tag balancing algorithm. The doc you sent me previously has too many open questions there. My preference is to have a document that clearly explains the algorithm, and we use the algorithm spec to validate if any future pull requests or bug fixes from any contributors would align with the design principles and security consideration of such.

MD (or even comments inside the code) is good, since it can be included in this open-sourced project. The description on tag balancing in already included in the README.md. I'd very much appreciate if you could directly edit/improve/polish it over there. Thanks.

@yukinying
Copy link
Contributor

The description on tag balancing in already included in the README.md. I'd very much appreciate if you could directly edit/improve/polish it over there. Thanks.

We need more than a paragraph, and we need the algorithm rather than an overview. Without such, this pull request would not be accepted.

@yukinying
Copy link
Contributor

MD (or even comments inside the code) is good

Also, we would not accept source code comment as a proper documentation unless they are gathered in one big comment block. The reason is that it is very easy to overlook any potential new pull requests that may wipe a single line comment or moved certain code logic and mess up all the remaining comments. The requirement may not be necessary for other libraries, yet we try to set a higher bar to make this security library be more easier manageable.

adon added 2 commits October 5, 2015 15:59
- logics simplified
- optional tags are ignored during balancing
- ignore self-closing solidus, which is required only for foreign
elements
@adon-at-work adon-at-work force-pushed the fix-tag-balancing branch 4 times, most recently from d1ce433 to 12d5fd2 Compare October 5, 2015 09:32
adon added 2 commits October 5, 2015 17:35
- imposed a max stack size, i.e., nested-levels
- stack pointer is used instead of push()
- also improved the arrayLastIndexOf to use the native
Array.prototype.lastIndexOf if exixts
@adon-at-work
Copy link
Contributor Author

please find the changes rebased to the latest master. comments added too. @maditya @yukinying

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants