Skip to content
This repository has been archived by the owner on Jul 15, 2019. It is now read-only.

Fix tag balancing #23

Open
wants to merge 4 commits into
base: master
Choose a base branch
from
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
30 changes: 30 additions & 0 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,36 @@ var input = '...';
var result = purifier.purify(input);
```

## Advanced Usage

The following outlines the configuration that is secure by default. You should perform due dilligence to confirm your use cases are safe before disabling or altering the configurations.

```js
// The default configuration
new Purifier({
whitelistTags: ['a', '...'],
whitelistAttributes: ['href', '...'],
enableCanonicalization: true,
tagBalance: {
enabled: true,
stackSize: 100
}
});
```

<!--
#### whitelistTags
#### whitelistAttributes
#### enableCanonicalization
-->

#### tagBalance
The untrusted data must be self-contained. Hence, it cannot close any tags prior to its inclusion, nor leave any of its own tags unclosed. An efficient and simple tag balancing algorithm is applied by default to enforce this goal only, and may not produce perfectly nested output. You may implement another tag balancing algorithm before invoking purify. But the default one should still be enabled, unless you're sure the self-contained requirement is met.

The ``stackSize`` (default: 100) is a limit imposed on the maximum number of unclosed tags (or the max levels of nested tags). When an untrusted data attempts to open tags that are so nested and has exceeded the allowed limit, the algorithm will cease any further processing but simply close all of those tags.

## Development

### How to build
Expand Down
142 changes: 99 additions & 43 deletions src/html-purify.js
Original file line number Diff line number Diff line change
Expand Up @@ -11,16 +11,28 @@ See the accompanying LICENSE file for terms.
derivedState = require('./derived-states.js'),
xssFilters = require('xss-filters'),
CssParser = require('css-js'),
hrefAttribtues = tagAttList.HrefAttributes,
voidElements = tagAttList.VoidElements;
hrefAttributes = tagAttList.HrefAttributes,
voidElements = tagAttList.VoidElements,
optionalElements = tagAttList.OptionalElements;

/*jshint -W030 */
function Purifier(config) {
var that = this;
var that = this, tagBalance;

config = config || {};
// defaulted to true
config.enableCanonicalization = config.enableCanonicalization !== false;
config.enableTagBalancing = config.enableTagBalancing !== false;
config.enableVoidingIEConditionalComments = config.enableVoidingIEConditionalComments !== false;

// defaulted to true
config.tagBalance || (config.tagBalance = {});
tagBalance = that.tagBalance = {};
tagBalance.stackOverflow = false;
if ((tagBalance.enabled = config.tagBalance.enabled !== false)) {
tagBalance.stackPtrMax = (parseInt(config.tagBalance.stackSize) || 100) - 1;
tagBalance.stackPtr = 0;
tagBalance.stack = new Array(tagBalance.stackPtrMax + 1);
}

// accept array of tags to be whitelisted, default list in tag-attr-list.js
that.tagsWhitelist = config.whitelistTags || tagAttList.WhiteListTags;
Expand All @@ -35,28 +47,36 @@ See the accompanying LICENSE file for terms.
enableCanonicalization: config.enableCanonicalization,
enableVoidingIEConditionalComments: false // comments are always stripped from the output anyway
}).on('postWalk', function (lastState, state, i, endsWithEOF) {
processTransition.call(that, lastState, state, i);
!tagBalance.stackOverflow && processTransition.call(that, lastState, state, i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

another approach on handling stackOverflow is to :
a. clear the stack
b. disregard the check "if tag exists in the stack".
(in other words, behave as if tag balancing was disabled)
That would prevent truncating valid/balanced input just because it has nesting > stack size(as currently seen in the test case at https://github.com/yahoo/html-purify/pull/23/files#diff-cfdb6a0bb85a9c9dcb6708d2225134cfR42). Also, our assertion that "we won't do tag balacing when nesting > stack size" still holds true. Thoughts?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

really want to let an untrusted input to disable tag balancing?

an attacker can first pad <b> n+1 times, where n equals the stack size limit, optionally close all of them, and then start crafting "dangerous" tags. Examples:

  • <b>...<b><textarea>
  • <b>...<b><div style="display:none;visibility:hidden;font-size:0;color:#000"> (accepting/allowing either of those CSS will result in a consequence literally like <!--. It's common to allow an untrusted email to specify font size/color)
  • <div style="display:none;visibility:hidden;font-size:0;color:#000"> n+1 times

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me come up with a simple use case, as for email threads/conversations:

<div>
To: inHTMLData(email_recipient1)<br>
From: inHTMLData(email_sender1)<br>
Subject: inHTMLData(email_subj1)<br>
purify(email_content1)
<br><button>Reply</button><button>Forward</button><button>Delete</button>
</div>

<div>
To: inHTMLData(email_recipient2)<br>
From: inHTMLData(email_sender2)<br>
Subject: inHTMLData(email_subj2)<br>
purify(email_content2)
<br><button>Reply</button><button>Forward</button><button>Delete</button>
</div>

<div>
To: inHTMLData(email_recipient3)<br>
From: inHTMLData(email_sender3)<br>
Subject: inHTMLData(email_subj3)<br>
purify(email_content3)
<br><button>Reply</button><button>Forward</button><button>Delete</button>
</div>

I hope this use case can explain why we do tag balancing. No one would accept:
(1) the purify(email_content1) can hide the buttons like Reply, Forward, and Delete from the trusted Mail app
(2) even worst, purify(email_content1) can collude with purify(email_content3) to hide email 2, when email_content1 is </div><div style="display:none;..."><div> and email_content3 is </div></div><div>

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thoughts? this pretty much explained the need of self-containing the untrusted data.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should make sure the output is safe, or we just return empty output with an error.

});

that.cssParser = new CssParser({"ver": "strict", "throwError": false});

}

// TODO: introduce polyfill for Array.indexOf
function contains(arr, element) {
for (var i = 0, len = arr.length; i < len; i++) {
if (arr[i] === element) {
return true;
// A simple polyfill for Array.lastIndexOf
function arrayLastIndexOf(arr, element, fromIndex) {
if (arguments.length < 3) {
fromIndex = arr.length - 1;
}

if (Array.prototype.lastIndexOf) {
return arr.lastIndexOf(element, fromIndex);
}
for (; fromIndex >= 0; fromIndex--) {
if (arr[fromIndex] === element) {
return fromIndex;
}
}
return false;
return -1;
}

function processTransition(prevState, nextState, i) {
/* jshint validthis: true */
/* jshint expr: true */
var parser = this.parser,
idx, tagName, attrValString, openedTag, key, value;

tagBalance = this.tagBalance,
idx = 0, tagName = '', attrValString = '', key = '', value = '', hasSelfClosing = 0;

switch (derivedState.Transitions[prevState][nextState]) {

Expand All @@ -68,40 +88,61 @@ See the accompanying LICENSE file for terms.
idx = parser.getCurrentTagIndex();
tagName = parser.getCurrentTag(idx);

if (contains(this.tagsWhitelist, tagName)) {
if (arrayLastIndexOf(this.tagsWhitelist, tagName) !== -1) {

if (idx) {
if (this.config.enableTagBalancing) {
// add closing tags for any opened ones before closing the current one
while((openedTag = this.openedTags.pop()) && openedTag !== tagName) {
this.output += '</' + openedTag + '>';
}
// openedTag is undefined if tagName is never found in all openedTags, no output needed
if (openedTag) {
this.output += '</' + openedTag + '>';
if (tagBalance.enabled && !optionalElements[tagName]) {

// Simple tag balancing: close the tag as long as it
// exists in the stack, as we only want to ensure the
// untrusted data must be self-contained. Hence, it can
// not close any tags prior to its inclusion, nor leave
// any of its own tags unclosed.
idx = arrayLastIndexOf(tagBalance.stack, tagName, tagBalance.stackPtr - 1);

if (idx >= 0) {
this.output += '</' + tagName + '>';
tagBalance.stack.splice(idx, 1);
tagBalance.stackPtr--;
}

// Pop-until-matched tag balancing: add closing tags for any opened ones before closing the matched one
// while((openedTag = this.openedTags.pop()) && openedTag !== tagName) {
// this.output += '</' + openedTag + '>';
// }
// // openedTag is undefined if tagName is never found in all openedTags, no output needed
// if (openedTag) {
// this.output += '</' + openedTag + '>';
// }
}
else {
this.output += '</' + tagName + '>';
}
}
else {
// - void elements only have a start tag; end tags must not be specified for void elements.
this.hasSelfClosing = this.hasSelfClosing || voidElements[tagName];
// void elements only have a start tag; end tags must not be specified for void elements.
hasSelfClosing = voidElements[tagName];

// push the tagName into the openedTags stack if not found:
// - a self-closing tag or a void element
this.config.enableTagBalancing && !this.hasSelfClosing && this.openedTags.push(tagName);
if (tagBalance.enabled && !hasSelfClosing && !optionalElements[tagName]) {
// cease further processing if it exceeds the maximum stack size allowed
if (tagBalance.stackPtr > tagBalance.stackPtrMax) {
tagBalance.stackOverflow = true;
return;
}

tagBalance.stack[tagBalance.stackPtr++] = tagName;
}

if (prevState === 35 ||
prevState === 36 ||
prevState === 40) {
this.attrVals[parser.getAttributeName()] = parser.getAttributeValue();
}

attrValString = '';
for (key in this.attrVals) {
if (contains(this.attributesWhitelist, key)) {
if (arrayLastIndexOf(this.attributesWhitelist, key) !== -1) {
value = this.attrVals[key];

if (key === "style") { // TODO: move style to a const
Expand All @@ -116,19 +157,19 @@ See the accompanying LICENSE file for terms.

attrValString += ' ' + key;
if (value !== null) {
attrValString += '="' + (hrefAttribtues[key] ? xssFilters.uriInDoubleQuotedAttr(decodeURI(value)) : value) + '"';
attrValString += '="' + (hrefAttributes[key] ? xssFilters.uriInDoubleQuotedAttr(decodeURI(value)) : value) + '"';
}
}
}

// handle self-closing tags
this.output += '<' + tagName + attrValString + (this.hasSelfClosing ? ' />' : '>');
this.output += '<' + tagName + attrValString + (hasSelfClosing ? ' />' : '>');
// this.output += '<' + tagName + attrValString + '>';

}
}
// reinitialize once tag has been written to output
this.attrVals = {};
this.hasSelfClosing = 0;
break;

case derivedState.TransitionName.ATTR_TO_AFTER_ATTR:
Expand All @@ -141,32 +182,47 @@ See the accompanying LICENSE file for terms.

//case derivedState.TransitionName.TAG_OPEN_TO_MARKUP_OPEN:
// this.output += "<" + parser.input[i];
// break;
// break;

case derivedState.TransitionName.TO_SELF_CLOSING_START:
// boolean attributes may not have a value
if (prevState === 35) {
this.attrVals[parser.getAttributeName()] = null;
}
this.hasSelfClosing = 1;

/* According to https://html.spec.whatwg.org/multipage/syntax.html#start-tags
* "Then, if the element is one of the void elements, or if the element is a foreign element, then there may be a single U+002F SOLIDUS character (/).
* This character has no effect on void elements, but on foreign elements it marks the start tag as self-closing."
*/
// that means only foreign elements can self-close (self-closing is optional for void elements)
// no foreign elements will be allowed, so the following logic can be commented
// openedTag = parser.getStartTagName();
// if (openedTag === 'svg' || openedTag === 'math') { // ...
// this.hasSelfClosing = true;
// }

break;
}
}

Purifier.prototype.purify = function (data) {
var that = this, openedTag;
var that = this, i,
tagBalance = that.tagBalance;

that.output = '';
that.openedTags = [];
that.attrVals = {};
that.hasSelfClosing = 0;
that.parser.reset();
that.parser.contextualize(data);

if (that.config.enableTagBalancing) {
// close any remaining openedTags
while((openedTag = this.openedTags.pop())) {
that.output += '</' + openedTag + '>';
that.output = '';

if (tagBalance.enabled) {
tagBalance.stack = new Array(tagBalance.stackPtrMax + 1);
tagBalance.stackPtr = 0;
}

that.parser.reset().contextualize(data);

if (tagBalance.enabled) {
// close remaining opened tags, if any
for (i = tagBalance.stackPtr - 1; i >= 0; i--) {
that.output += '</' + tagBalance.stack[i] + '>';
}
}

Expand Down
3 changes: 3 additions & 0 deletions src/tag-attr-list.js
Original file line number Diff line number Diff line change
Expand Up @@ -261,3 +261,6 @@ exports.HrefAttributes = {
// Void elements only have a start tag; end tags must not be specified for void elements.
// https://html.spec.whatwg.org/multipage/syntax.html#void-elements
exports.VoidElements = {"area":1, "base":1, "br":1, "col":1, "embed":1, "hr":1, "img":1, "input":1, "keygen":1, "link":1, "menuitem":1, "meta":1, "param":1, "source":1, "track":1, "wbr":1};

// https://html.spec.whatwg.org/multipage/syntax.html#optional-tags
exports.OptionalElements = {/*"plaintext":1, "html":1, "head":1, "body":1,*/ "li":1, "dt":1, "dd":1, "p":1, "rt":1, "rp":1, "optgroup":1, "option":1, "colgroup":1, "caption":1, "thead":1, "tbody":1, "tfoot":1, "tr":1, "td":1, "th":1};
22 changes: 16 additions & 6 deletions tests/test-vectors.js
Original file line number Diff line number Diff line change
Expand Up @@ -628,12 +628,12 @@ var generalVectors = [
{
id: 10,
input: "<table><tr bz/></table>normal tag made standalone with bogus trunc attr",
output: "<table><tr /></table>normal tag made standalone with bogus trunc attr"
output: "<table><tr></table>normal tag made standalone with bogus trunc attr"
},
{
id: 11,
input: "<table><tr color/></table>normal tag made standalone with trunc attr bad val",
output: "<table><tr /></table>normal tag made standalone with trunc attr bad val"
output: "<table><tr></table>normal tag made standalone with trunc attr bad val"
},
{
id: 12,
Expand All @@ -648,7 +648,7 @@ var generalVectors = [
{
id: 14,
input: "<table><tr size=\"4\"/></table>normal tag made standalone with value",
output: "<table><tr size=\"4\" /></table>normal tag made standalone with value"
output: "<table><tr size=\"4\"></table>normal tag made standalone with value"
},
//style attribute tests
{
Expand Down Expand Up @@ -779,7 +779,7 @@ var generalVectors = [
{
id: 40,
input: "<option selected />",
output: "<option selected />"
output: "<option selected>"
},
{
id: 41,
Expand All @@ -799,7 +799,7 @@ var generalVectors = [
{
id: 44,
input: "<option selected/>",
output: "<option selected />"
output: "<option selected>"
},
{
id: 45,
Expand Down Expand Up @@ -831,7 +831,17 @@ var generalVectors = [
id: 50,
input: "abc <!-- 123",
output: "abc "
}
},
{
id: 51,
input: "<a href=\"http://www.yahoo.com/\" />hello",
output: "<a href=\"http://www.yahoo.com/\">hello</a>"
},
// {
// id: 52,
// input: "<img src=\"x\" id=\'\" onerror=\"alert(1)\' />",
// output: "<img src=\"x\" id=\'\" onerror=\"alert(1)\' />"
// }
];

exports.html5secVectors = html5secVectors;
Expand Down
23 changes: 18 additions & 5 deletions tests/unit/html-purify.js
Original file line number Diff line number Diff line change
Expand Up @@ -31,18 +31,31 @@ Authors: Aditya Mahendrakar <[email protected]>
assert.equal(output, '<h1 id="foo" title="asd" checked>hello world 2</h1>');
});

it('should always balance unopened tags', function(){
it('should balance tags', function(){
var html = "</div>foo</h2>bar<a href=\"123\">hello<b>world</a><embed>123</embed><br /><br/><p>";

// with tag balancing enabled by default
var output = (new Purifier()).purify(html);
assert.equal(output, 'foobar<a href=\"123\">hello<b>world</b></a><embed />123<br /><br /><p></p>');
var output = (new Purifier({tagBalance:{enabled:true}})).purify(html);
assert.equal(output, 'foobar<a href="123">hello<b>world</a><embed />123<br /><br /><p></b>');
});

it('should balance remaining tags and drop inputs when there are too many unclosed tags', function(){
var html = "<b>1<b>2<b>3<b>4<b>5<b>6</b></b></b></b>";

// with tag balancing enabled by default
var output = (new Purifier({tagBalance:{enabled:true, stackSize:3}})).purify(html);
assert.equal(output, '<b>1<b>2<b>3</b></b></b>');
});

it('should not balance tags if disabled', function(){
var html = "</div>foo</h2>bar<a href=\"123\">hello<b>world</a><embed>123</embed><br /><br/><p>";

// with tag balancing disabled
var output = (new Purifier({enableTagBalancing:false})).purify(html);
assert.equal(output, "</div>foo</h2>bar<a href=\"123\">hello<b>world</a><embed />123</embed><br /><br /><p>");
var output = (new Purifier({tagBalance:{enabled:false}})).purify(html);
assert.equal(output, '</div>foo</h2>bar<a href="123">hello<b>world</a><embed />123</embed><br /><br /><p>');
});


it('should handle all vectors mentioned in https://html5sec.org', function(){
var output, i, vector;
for (var i = 0; i < html5secVectors.length; i++) {
Expand Down