HTML must be sanitized before parsing, to fixup any invalid characters #239

Open
wants to merge 51 commits into
base: develop
51 commits
de02bdd
Opts.parse = false always throws errors
Ivanca Sep 8, 2016
6ed2a3b
Merge branch 'release/1.1.3'
rchipka Mar 22, 2017
7c20daa
1.1.4
rchipka Mar 23, 2017
6a66be6
Update needle dependancy.
cubehouse Jun 18, 2017
0bf7480
Merge pull request #123 from Ivanca/patch-1
rchipka Nov 8, 2017
1a4a979
fix empty param passed to request adding a query string to url path
e-e-e Nov 15, 2017
b25d48e
remove console.log's and make compatible with legacy node versions
e-e-e Nov 15, 2017
ffc10e2
changes based on review comments
e-e-e Dec 2, 2017
fc9dd67
Merge pull request #188 from e-e-e/master
rchipka Dec 2, 2017
511d979
Fix handling of null query param object
Dec 2, 2017
b887c87
Preserve array order for .follow() within .set()
Dec 2, 2017
a785cd9
0.1.5
Dec 2, 2017
2330092
Remove donation offers
Dec 2, 2017
21850b2
Preserve sort order for find
Dec 2, 2017
2a8de94
Paginate command accepts function as first arg
Dec 2, 2017
c081887
Merge pull request #167 from cubehouse/master
rchipka Dec 28, 2017
cba7cf6
Update package-lock.json
rchipka Dec 28, 2017
b887213
v1.1.6
rchipka Dec 28, 2017
e5ad0db
Allow get/post commands to accept params as contextCallback
samogot May 12, 2018
525845e
Add tests
samogot May 12, 2018
549254b
Merge pull request #208 from samogot/get-params-callback
rchipka May 12, 2018
2ac8b8d
Update libxmljs-dom dependency
rchipka Jun 10, 2018
7487a40
Deprecate node v0.10
rchipka Jun 10, 2018
72a48ea
Adds option to customise error codes + useragent as a function
hasnat Jul 24, 2018
bc20de8
Set user_agent to string before sending to needle
hasnat Jul 24, 2018
3462c8d
Update check for is error response code + tests for user_agent + erro…
hasnat Jul 25, 2018
129c3ef
Update process_response option to take next + callback
hasnat Jul 27, 2018
59dc65a
Remove error_status_code option
hasnat Jul 27, 2018
0e16c1c
Adds default no process_response test
hasnat Jul 27, 2018
dfdada0
Prefer If-statement over turnary
rchipka Jul 27, 2018
0641503
Merge pull request #214 from hasnat/patch-1
rchipka Jul 27, 2018
5bf419b
Update package.json
rchipka Jul 27, 2018
3f12493
Update libxmljs-dom
rchipka Feb 21, 2019
922d1a8
v1.1.9
rchipka Feb 21, 2019
61dde5a
Fix builds
rchipka Feb 21, 2019
253c60c
Fix builds
rchipka Feb 21, 2019
f414cea
Update package-lock.json
rchipka Feb 21, 2019
93832f7
Try latest nodeunit
rchipka Feb 21, 2019
63af526
Update package-lock.json
rchipka Feb 21, 2019
759b3d9
Update package-lock.json
rchipka Feb 21, 2019
4aff964
Update package-lock.json
rchipka Feb 21, 2019
97384f7
Update package-lock.json
rchipka Feb 21, 2019
bd717ac
Update package-lock.json
rchipka Feb 21, 2019
58d4a1c
Export libxml
rchipka Mar 1, 2019
b8c946f
v1.1.10
rchipka Mar 1, 2019
6540b47
Update package-lock.json
rchipka Mar 1, 2019
ce1b42e
fix #236, pagination preserves method
May 13, 2019
1d432e6
Merge pull request #237 from tttp/master
rchipka May 13, 2019
37de803
sanitize the html before parsing it with libxml
itsthatguy May 31, 2019
9446935
fixes sanitization so that we can actually use this package again
itsthatguy May 31, 2019
70318f3
encode the URI before making a request
itsthatguy Jun 4, 2019
3 changes: 2 additions & 1 deletion .travis.yml
@@ -1,3 +1,4 @@
language: node_js
node_js:
- - "0.10"
+ # - "0.10"
+ - 8
17 changes: 17 additions & 0 deletions Changes.md
@@ -14,6 +14,23 @@
* Add warnings for parser errors?
* Switch to semantic versioning?

+ ## Next major release:
+
+ * Event/error handling
+ * Error.code = 404, 'timeout', etc.
+ * Error.module = 'http', 'dom', etc.
+ * return true = retry, false = stop, anything else = continue
+ * Event for discontinued context/data
+ * Module system using osmosis.require and modules prefixed with `osmosis-`
+ * Way to trigger DOM
+ * Throw unhandled errors?
+ * `.while()` to do things more than once as long as they call next()
+
+ ## 0.1.5
+
+ * Fixed bug where .get() without `params` caused empty query string ('?')
+ * Preserve sort order for `.follow()` results within `.set()`
+
## 0.1.4

#### `get`
5 changes: 0 additions & 5 deletions Readme.md
@@ -84,9 +84,4 @@ For documentation and examples check out [https://rchipka.github.io/node-osmosis
Please consider a donation if you depend on web scraping and Osmosis makes your job a bit easier.
Your contribution allows me to spend more time making this the best web scraper for Node.

- ### Donation offers:
-
- - $25 - A custom Osmosis scraper to extract the data you need efficiently and in as few lines of code as possible.
- - $25/month - Become a sponsor. Your company will be listed on this page. Priority support and bug fixes.
-
[![Donate](https://www.paypalobjects.com/en_US/i/btn/btn_donate_LG.gif)](https://www.paypal.com/cgi-bin/webscr?item_name=node-osmosis&cmd=_donations&business=NAXMWBMWKUWUU)
6 changes: 6 additions & 0 deletions index.js
@@ -179,6 +179,10 @@ Osmosis.prototype.request = function (url, opts, callback, tries) {
this.requests++;
this.queue.requests++;
this.queue.push();

+ if (typeof opts.user_agent === 'function') {
+ opts.user_agent = opts.user_agent();
+ }
+
request(url.method,
url,
@@ -451,4 +455,6 @@ libxml.Element.prototype.find = function (selector) {
* @param {data} data - The current data object.
*/

+ Osmosis.libxmljs = libxml;
+
module.exports = Osmosis;
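The first `index.js` hunk overwrites `opts.user_agent` with the function's return value. If the same `opts` object is reused for later requests (an assumption, not verified against the rest of the codebase), the function would run only once. A minimal sketch of resolving the value on a copy instead; `resolveUserAgent` is a hypothetical helper, not part of Osmosis:

```javascript
'use strict';

// Hypothetical helper (not part of Osmosis): resolve a `user_agent` option
// that may be either a string or a function returning a string.
function resolveUserAgent(opts) {
    var resolved = {}, key;

    // Shallow-copy so the caller's options object is left untouched and a
    // function-valued user_agent is evaluated again on the next request.
    for (key in opts) {
        if (Object.prototype.hasOwnProperty.call(opts, key)) {
            resolved[key] = opts[key];
        }
    }

    if (typeof resolved.user_agent === 'function') {
        resolved.user_agent = String(resolved.user_agent());
    }

    return resolved;
}

// Usage: rotate through a list of agents, one per request.
var agents = ['agent-a', 'agent-b'], calls = 0;
var opts = {
    user_agent: function () { return agents[calls++ % agents.length]; }
};

var first = resolveUserAgent(opts);
var second = resolveUserAgent(opts);

console.log(first.user_agent, second.user_agent);
```

Because the original `opts.user_agent` stays a function, each request gets a freshly computed value.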
14 changes: 9 additions & 5 deletions lib/Command.js
@@ -156,14 +156,14 @@ Command.prototype.start = function (context, data) {

data.ref();

- return callback.call(this, context, data, function (c, d) {
+ return callback.call(this, context, data, function (c, d, index) {
if (calledNext === true) {
// If `next` is called more than once,
// then we need to clone the data
- next.start(c, d.clone().ref());
+ next.start(c, d.clone().setSortIndex(index).ref());
} else {
calledNext = true;
- next.start(c, d);
+ next.start(c, d.setSortIndex(index));
}
}, function (err) {
data.unref();
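The hunk above passes the data object through on the first `next` call and clones it on every later call, tagging each with the caller's index. A stripped-down sketch of that pattern, assuming a stand-in `Data` class with only `clone` and `setSortIndex` (the real `lib/Data.js` has more state, including ref counting, which is omitted here):

```javascript
'use strict';

// Hypothetical stand-in for lib/Data.js: just enough to show clone + sort index.
function Data(object) {
    this.object = object;
    this.sortIndex = undefined;
}

Data.prototype.clone = function () {
    var copy = new Data(Object.assign({}, this.object));

    copy.sortIndex = this.sortIndex;
    return copy;
};

Data.prototype.setSortIndex = function (index) {
    if (index !== undefined) {
        this.sortIndex = index;
    }

    return this;
};

// Build a `next` callback: the first call passes data through, later calls clone it.
function makeNext(deliver) {
    var calledNext = false;

    return function next(context, data, index) {
        if (calledNext === true) {
            deliver(context, data.clone().setSortIndex(index));
        } else {
            calledNext = true;
            deliver(context, data.setSortIndex(index));
        }
    };
}

var received = [];
var next = makeNext(function (context, data) {
    received.push({ context: context, data: data });
});
var shared = new Data({ title: 'page' });

next('node-0', shared, 0);
next('node-1', shared, 1);
```

The second delivery holds its own copy, so setting its sort index does not disturb the first one.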
@@ -284,7 +284,7 @@ Command.prototype.setOpt = function (name, value) {
* @private
*/

- Command.prototype.request = function (method, context, href, params, callback) {
+ Command.prototype.request = function (method, context, href, params, callback, sortIndex) {
var self = this,
length = callback.length,
instance = self.instance,
@@ -340,7 +340,7 @@ Command.prototype.request = function (method, context, href, params, callback) {
url.method = method;
url.params = params;

- if (method === 'get') {
+ if (method === 'get' && params instanceof Object && params !== null) {
for (key in params) {
url.query[key] = params[key];
}
@@ -386,6 +386,10 @@ Command.prototype.request = function (method, context, href, params, callback) {
'')
);

+ if (document instanceof Object && document !== null) {
+ document._dataSortIndex = sortIndex;
+ }
+
if (length === 1) {
callback(document);
} else if (length === 2) {
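On the `params instanceof Object && params !== null` guard: `null instanceof Object` is already `false` in JavaScript, so the second check is redundant (harmless, but it could be dropped). A small sketch of the same guard in isolation; `mergeQuery` is a hypothetical name used only for illustration:

```javascript
'use strict';

// Hypothetical helper mirroring the guard in the hunk above.
function mergeQuery(query, params) {
    var key;

    // `null instanceof Object` is false, so this single check also rejects
    // null, undefined, and primitive strings or numbers.
    if (params instanceof Object) {
        for (key in params) {
            if (Object.prototype.hasOwnProperty.call(params, key)) {
                query[key] = params[key];
            }
        }
    }

    return query;
}

var untouched = mergeQuery({}, null);
var merged = mergeQuery({ a: '1' }, { b: 2 });

console.log(untouched, merged);
```

One caveat: objects created with `Object.create(null)` also fail `instanceof Object`, so they would be skipped by this guard as well.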
70 changes: 66 additions & 4 deletions lib/Data.js
@@ -5,7 +5,8 @@
* @constructor Data
* @param {object} [data] - Data object value
* @param {object} [parent] - Parent Data object
- * @param {object} [index] - Index in the parent object
+ * @param {number} [index] - Index in the parent object
+ * @param {number} [sortIndex] - Sort order of object if coerced into array
* @param {bool} [isArray] - Is the object an array?
* @property {number} refs - Number of references
* @property {number} clones - Number of clones
@@ -91,6 +92,7 @@ Data.prototype.setObject = function (object) {

Data.prototype.next = function () {
var clone = new Data(this.parent)
+ .setSortIndex(this.getSortIndex())
.setIndex(this.getIndex())
.isArray(this.isArray());

@@ -130,7 +132,7 @@ Data.prototype.unref = function () {
*/

Data.prototype.set = function (key, val) {
- var object, currentVal;
+ var object, currentVal, sortKey;

if (val === undefined) {
return this;
@@ -222,9 +224,65 @@ Data.prototype.setIndex = function (index) {
return this;
};

+ Data.prototype.setSortIndex = function (index) {
+ if (index !== undefined) {
+ this.sortIndex = index;
+ }
+
+ return this;
+ };
+
+ Data.prototype.getSortIndex = function () {
+ return this.sortIndex;
+ }
+
+ Data.prototype.sortKey = function (key, sortIndex) {
+ var object = this.getObject(),
+ currentVal = object[key],
+ sortArray;
+
+ if (!this.sortArray) {
+ this.sortArray = {};
+ }
+
+ sortArray = this.sortArray[key];
+
+ if (sortArray === undefined) {
+ if (currentVal instanceof Array && currentVal.length > 0) {
+ sortArray = new Array(currentVal.length);
+ } else {
+ sortArray = [sortIndex];
+ }
+
+ this.sortArray[key] = sortArray;
+ }
+
+ if (currentVal instanceof Array) {
+ var diff = currentVal.length - sortArray.length;
+
+ while (diff > 0) {
+ sortArray.push(sortIndex + (--diff));
+ }
+
+ object[key] = sortArray.map(function (v, i) {
+ return {
+ value: v,
+ index: i
+ };
+ }).sort(function (a, b) {
+ return a.value - b.value;
+ }).map(function (v, i) {
+ sortArray[i] = v.value;
+
+ return currentVal[v.index];
+ });
+ }
+ }
+
Data.prototype.merge = function (child) {
var object = child.object,
- index = child.getIndex();
+ index = child.getIndex(),
+ sortIndex = child.getSortIndex();

if (object === undefined) {
return;
@@ -233,10 +291,14 @@ Data.prototype.merge = function (child) {
if (this.isArray() === true) {
this.push(object);
} else if (index !== undefined) {
- this.set(child.getIndex(), object);
+ this.set(index, object);
} else if (object instanceof Object) {
this.extend(object);
}
+
+ if (sortIndex !== undefined) {
+ this.sortKey(index, sortIndex);
+ }
};

Data.prototype.toArray = function () {
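The core of `sortKey` is a map/sort/map pass: pair each sort index with its position, sort the pairs, then read the values back in the new order. A simplified sketch of just that step, assuming every value already has a sort index (the padding of missing indices in the real method is left out); `orderBySortIndex` is a hypothetical name:

```javascript
'use strict';

// Reorder `values` so they follow ascending `sortIndices`, where
// sortIndices[i] is the intended position of values[i].
function orderBySortIndex(values, sortIndices) {
    return sortIndices
        .map(function (sortIndex, position) {
            // Remember where each sort index came from.
            return { sortIndex: sortIndex, position: position };
        })
        .sort(function (a, b) {
            return a.sortIndex - b.sortIndex;
        })
        .map(function (entry) {
            // Look the value up by its original position.
            return values[entry.position];
        });
}

var arrived = ['third', 'first', 'second'];
var indices = [2, 0, 1];
var ordered = orderBySortIndex(arrived, indices);

console.log(ordered);
```

The sketch returns a new array and leaves its inputs alone, whereas the method in the diff writes the result back to `object[key]` and updates `sortArray` in place.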
101 changes: 62 additions & 39 deletions lib/Request.js
@@ -1,8 +1,9 @@
'use strict';

- var needle = require('needle'),
- URL = require('url'),
- libxml = require('libxmljs-dom');
+ var needle = require('needle'),
+ URL = require('url'),
+ libxml = require('libxmljs-dom'),
+ sanitizeHtml = require('sanitize-html');

/**
* Make an HTTP request.
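This PR routes the response body through `sanitize-html` before parsing. For comparison only, here is a narrower sketch that removes characters outside the XML 1.0 `Char` range and leaves the markup untouched. It is not what the PR does, and it assumes the parse failures come from such characters, which the PR does not state; `stripInvalidXmlChars` is a hypothetical helper:

```javascript
'use strict';

// Anything NOT in the XML 1.0 `Char` production:
// #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
var INVALID_XML_CHARS =
    /[^\u0009\u000A\u000D\u0020-\uD7FF\uE000-\uFFFD\u{10000}-\u{10FFFF}]/gu;

// Hypothetical helper: drop invalid characters, keep everything else as-is.
function stripInvalidXmlChars(html) {
    return String(html).replace(INVALID_XML_CHARS, '');
}

var cleaned = stripInvalidXmlChars('<p>a\u0000b\u0008c</p>');

console.log(cleaned);
```

With the `u` flag the pattern works on code points, so valid surrogate pairs (for example emoji) are kept while lone surrogates are removed.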
@@ -12,13 +13,15 @@ var needle = require('needle'),

function Request(method, url, params, opts, tries, callback) {
var location = url;

return needle.request(method,
- url.href,
+ encodeURI(url.href),
params,
opts,
function (err, res, data) {
- var document;
-
+ if (!(url.params instanceof Object) || url.params === null) {
+ url.params = url.query;
+ }
+
if (err !== null) {
callback(err.message);
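One thing to watch with `encodeURI(url.href)`: if the href already contains percent-escapes, `encodeURI` encodes the `%` again, so `%20` becomes `%2520`. A possible workaround is sketched below; `encodeOnce` is a hypothetical helper, not something in this PR:

```javascript
'use strict';

// Hypothetical helper: percent-encode a URL without double-encoding
// escape sequences that are already present.
function encodeOnce(href) {
    // encodeURI turns every '%' into '%25'; undo that only where the
    // original '%' was already followed by two hex digits.
    return encodeURI(href).replace(/%25([0-9A-Fa-f]{2})/g, '%$1');
}

var doubled = encodeURI('http://example.com/a%20b');
var preserved = encodeOnce('http://example.com/a%20b');
var encoded = encodeOnce('http://example.com/a b');

console.log(doubled, preserved, encoded);
```

The trade-off is that a literal `%` followed by two hex digits cannot be told apart from an existing escape, so it is left unencoded.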
@@ -28,7 +31,8 @@ function Request(method, url, params, opts, tries, callback) {
if (opts.ignore_http_errors !== true &&
res !== undefined &&
res.statusCode >= 400 &&
- res.statusCode <= 500) {
+ res.statusCode <= 500
+ ) {
// HTTP error
callback(res.statusCode + ' ' + res.statusMessage);
return;
@@ -39,49 +43,68 @@ function Request(method, url, params, opts, tries, callback) {
return;
}

- if (opts.process_response !== undefined) {
- document = opts.process_response(data);
- } else {
- document = data;
- }
+ function next(document) {
+ if (opts.parse === false) {
+ callback(null, res, document);
+ return;
+ }

- if (opts.parse !== false) {
document = libxml.parseHtml(document,
- { baseUrl: location.href, huge: true });
- }
+ { baseUrl: location.href, huge: true });

- if (document === null) {
- callback('Couldn\'t parse response');
- return;
- }
+ if (document === null) {
+ callback('Couldn\'t parse response');
+ return;
+ }

- if (document.errors[0] !== undefined &&
- document.errors[0].code === 4) {
- callback('Document is empty');
- return;
- }
+ if (document.errors[0] !== undefined &&
+ document.errors[0].code === 4) {
+ callback('Document is empty');
+ return;
+ }

- if (document.root() === null) {
- callback('Document has no root');
- return;
- }
+ if (document.root() === null) {
+ callback('Document has no root');
+ return;
+ }

+ location.headers = res.req._headers;
+ location.proxy = opts.proxy;
+ location.user_agent = opts.user_agent;

- location.headers = res.req._headers;
- location.proxy = opts.proxy;
- location.user_agent = opts.user_agent;
+ document.location = location;
+ document.request = location;

- document.location = location;
- document.request = location;
+ setResponseMeta(document, res, data.length);
+ setCookies(document, res.cookies);
+ setCookies(document, opts.cookies);

- setResponseMeta(document, res, data.length);
- setCookies(document, res.cookies);
- setCookies(document, opts.cookies);
+ if (opts.keep_data === true) {
+ document.response.data = data;
+ }

- if (opts.keep_data === true) {
- document.response.data = data;
+ callback(null, res, document);
}

+ if (
+ opts.process_response !== undefined &&
+ typeof opts.process_response === 'function'
+ ) {
+ if (opts.process_response.length > 2) {
+ opts.process_response(data, res, next, callback);
+ return;
+ }
+
+ next(opts.process_response(data, res));
+ } else {
+ const cleanData = sanitizeHtml(data, {
+ allowedTags: false,
+ allowedAttributes: false,
+ selfClosing: [],
+ });
+ next(cleanData);
+ }

- callback(null, res, document);
})
.on('redirect', function (href) {
extend(location, URL.parse(URL.resolve(location.href, href)));
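The `process_response` branch picks a calling convention from the hook's declared parameter count: with more than two parameters the hook receives `next` and `callback` and drives them itself; otherwise its return value is used as the document. A minimal sketch of that dispatch in isolation; `runProcessResponse` is a hypothetical name, and the no-hook case here passes the data through unchanged rather than sanitizing it as the PR does:

```javascript
'use strict';

// Hypothetical dispatcher mirroring the arity check in the hunk above.
function runProcessResponse(processResponse, data, res, next, callback) {
    if (typeof processResponse !== 'function') {
        // Simplification: the PR sanitizes here; this sketch passes through.
        next(data);
        return;
    }

    // Declared with 3+ parameters: the hook calls `next` / `callback` itself.
    if (processResponse.length > 2) {
        processResponse(data, res, next, callback);
        return;
    }

    // Declared with 0-2 parameters: the return value becomes the document.
    next(processResponse(data, res));
}

var seen = [];

function collect(document) {
    seen.push(document);
}

function fail(err) {
    throw new Error(err);
}

// Synchronous hook: returns the processed body.
runProcessResponse(function (data) {
    return data.trim();
}, '  <p>hi</p>  ', {}, collect, fail);

// Hook that takes control and calls `next` itself.
runProcessResponse(function (data, res, next, callback) {
    next(data.toUpperCase());
}, '<p>hi</p>', {}, collect, fail);
```

`Function.prototype.length` ignores rest parameters and parameters with default values, so a hook written as `function (data, res, ...rest)` would be treated as synchronous.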
2 changes: 1 addition & 1 deletion lib/commands/find.js
@@ -51,7 +51,7 @@ var Find = function (context, data, next, done) {
node = nodes[i];
node.last = (length - 1 === i);
node.index = i;
- next(node, data);
+ next(node, data, i);
}

done();
7 changes: 4 additions & 3 deletions lib/commands/follow.js
@@ -18,7 +18,7 @@ module.exports.follow = function (context, data, next, done) {
i = 0, queue = 0, length, node, url,
requestDone = function (err, document) {
if (err === null) {
- next(document, data);
+ next(document, data, document._dataSortIndex);
}

if (--queue === 0) {
@@ -52,10 +52,11 @@ module.exports.follow = function (context, data, next, done) {

self.log("url: " + url);
self.request('get',
- node,
- requestDone);
+ context,
+ requestDone,
+ i);
}
}
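The `follow` change matters because followed requests can finish in any order: the loop index travels with each request, and `done` fires only when the pending counter reaches zero. A reduced sketch of that bookkeeping, with `fakeFetch` as a made-up stand-in for `self.request` that answers in reverse order; here the index is carried by a closure rather than by `document._dataSortIndex` as in the PR:

```javascript
'use strict';

// Hypothetical reduction of the follow() bookkeeping: `fetch` stands in for
// self.request and may invoke its callback in any order.
function followAll(urls, fetch, next, done) {
    var queue = urls.length;

    if (queue === 0) {
        done();
        return;
    }

    urls.forEach(function (url, i) {
        fetch(url, function (err, document) {
            if (err === null) {
                next(document, i);
            }

            // Finish only after the last outstanding response.
            if (--queue === 0) {
                done();
            }
        });
    });
}

// Fake transport: queues responses so they are delivered in reverse order.
var pending = [];

function fakeFetch(url, callback) {
    pending.unshift(function () {
        callback(null, 'doc:' + url);
    });
}

var results = [], finished = 0;

followAll(['/a', '/b', '/c'], fakeFetch,
    function (document, sortIndex) { results[sortIndex] = document; },
    function () { finished++; });

pending.forEach(function (respond) { respond(); });

console.log(results, finished);
```

Even though `/c` answers first, each document lands in the slot given by its index, and `done` runs exactly once.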
