Merge pull request #299 from w3c/tripu/metadata
Add method for metadata extraction
deniak committed Mar 22, 2016
2 parents 6d2ec8d + ca80fbb commit 29f367f
Showing 22 changed files with 3,865 additions and 202 deletions.
46 changes: 21 additions & 25 deletions README.md
@@ -63,8 +63,9 @@ The interface you get when you `require("specberus")` is that from `lib/validato
`Specberus` instance that is properly configured for operation in the Node.js environment
(there is nominal support for running Specberus under other environments, but it isn't usable at this time).

The validator interface supports a `validate(options)` methods, which takes an object with the
following fields:
### `validate(options)`

This method takes an object with the following fields:

* `url`: URL of the content to check. One of `url`, `source`, `file`, or `document` must be
specified and if several are they will be used in this order.
@@ -75,6 +76,24 @@ following fields:
* `events`: An event sink which supports the same interface as Node.js's `EventEmitter`. Required. See
below for the events that get generated.

### `extractMetadata(options)`

This method returns a simple object with metadata inferred from the document.
The accepted `options` are the same as for `validate()`, except that `profile` is not required and is ignored (detecting the profile is one of the
goals of this method).

The returned `Object` may contain up to two properties: `profile` and `delivererIDs`.
If a piece of metadata cannot be deduced, its key will be absent or its value will be undefined.

An example:

```json
{
"profile": "WD",
"delivererIDs": [47318, 43696]
}
```
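
Because either key may be missing, callers may want to normalise the result before using it. A minimal sketch, assuming only the object shape described above; `normaliseMetadata` is a hypothetical helper and not part of Specberus:

```javascript
// Hypothetical helper (NOT part of Specberus): fills in defaults so that
// callers can rely on both keys being present.
function normaliseMetadata(meta) {
    meta = meta || {};
    return {
        // `profile` is a string such as "WD"; null when it could not be deduced.
        profile: (typeof meta.profile === 'string') ? meta.profile : null,
        // `delivererIDs` is an array of numbers; empty when none were found.
        delivererIDs: Array.isArray(meta.delivererIDs) ? meta.delivererIDs : []
    };
}

console.log(normaliseMetadata({ profile: 'WD', delivererIDs: [47318, 43696] }));
console.log(normaliseMetadata({}));
```

Defaulting to `null` and `[]` is only one possible convention; a caller that needs to tell "not deduced" apart from "deduced as empty" should inspect the raw object instead.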

### Emitting metadata about the document

Every time the validator finds/deduces a piece of metadata about the document, it emits a `metadata` event.
@@ -86,17 +105,12 @@ These properties are now returned when found:
* `docDate`: The date associated with the document.
* `title`: The (possible) title of the document.
* `process`: The process rules, **as they appear in the text of the document**, e.g. `'1 September 2015'`.
* `deliverers`: The deliverer(s) responsible for the document (WGs, TFs, etc); an `Array` of `Object`s, each one with these properties:
* `homepage`: URL of the group's home page.
* `name`: name of the group, exactly as it is found in the hyperlink on the document.
* `delivererIDs` ID(s) of the deliverer(s); an `Array` of `Number`s.
* `thisVersion`: URL of this version of the document.
* `previousVersion`: URL of the previous version of the document (the last one, if multiple are shown).
* `latestVersion`: URL of the latest version of the document.
* `editorIDs`: ID(s) of the editor(s) responsible for the document; an `Array` of `Number`s.
* `editorsDraft`: URL of the latest editor's draft.
* `shortname`: shortname extracted from latestVersion in the document; a `String`.
* `status`: ID (acronym) of the profile detected in the document; a `String`. See file `public/data/profiles.json`.

As an example, validating [`http://www.w3.org/TR/2014/REC-exi-profile-20140909/`](http://www.w3.org/TR/2014/REC-exi-profile-20140909/) (REC)
emits these pairs of metadata:
@@ -108,13 +122,8 @@ emits these pairs of metadata:
{ latestVersion: 'http://www.w3.org/TR/exi-profile/' }
{ previousVersion: 'http://www.w3.org/TR/2014/PR-exi-profile-20140506/' }
{ editorIDs: [] }
{ status: 'REC' }
{ shortname: 'exi-profile'}
{ process: '1 September 2015' }
{ deliverers: [
{ homepage: 'http://www.w3.org/XML/EXI/',
name: 'Efficient XML Interchange Working Group' }
] }
```

If you download that very spec, edit it to include the following metadata…
@@ -133,13 +142,8 @@ If you download that very spec, edit it to include the following metadata&hellip
{ latestVersion: 'http://www.w3.org/TR/exi-profile/' }
{ previousVersion: 'http://www.w3.org/TR/2014/PR-exi-profile-20140506/' }
{ editorIDs: [ '329883', '387297' ] }
{ status: 'REC' }
{ shortname: 'exi-profile'}
{ process: '1 September 2015' }
{ deliverers: [
{ homepage: 'http://www.w3.org/XML/EXI/',
name: 'Efficient XML Interchange Working Group' }
] }
```

Another example: when applied to [`http://www.w3.org/TR/wai-aria-1.1/`](http://www.w3.org/TR/wai-aria-1.1/) (WD),
@@ -152,16 +156,9 @@ the following metadata will be found:
{ latestVersion: 'http://www.w3.org/TR/wai-aria-1.1/' }
{ previousVersion: 'http://www.w3.org/TR/2014/WD-wai-aria-1.1-20140612/' }
{ editorIDs: [] }
{ status: 'WD' }
{ shortname: 'wai-aria-1.1' }
{ process: '1 September 2015' }
{ editorsDraft: 'http://w3c.github.io/aria/aria/aria.html' }
{ deliverers: [
{ homepage: 'http://www.w3.org/WAI/PF/',
name: 'Protocols & Formats Working Group' },
{ homepage: 'http://www.w3.org/html/wg/',
name: 'HTML Working Group' }
] }
```
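
A sink could gather the `metadata` events described in this section into a single object. The sketch below assumes the event is emitted with the key and the value as two separate arguments, which the text visible here does not confirm; `collectMetadata` is a hypothetical helper, and the emissions are simulated rather than produced by Specberus:

```javascript
// Node.js built-in module; Specberus itself is not needed for this sketch.
var EventEmitter = require('events').EventEmitter;

// Hypothetical helper: accumulates every `metadata` event into one object.
// ASSUMPTION: listeners receive (key, value); adjust if the payload differs.
function collectMetadata(sink) {
    var collected = {};
    sink.on('metadata', function (key, value) {
        collected[key] = value;
    });
    return collected;
}

// Simulated emissions, reusing values from the WD example above.
var sink = new EventEmitter();
var metadata = collectMetadata(sink);
sink.emit('metadata', 'shortname', 'wai-aria-1.1');
sink.emit('metadata', 'process', '1 September 2015');
sink.emit('metadata', 'editorIDs', []);

console.log(metadata);
```

If the same key is emitted more than once, this sketch keeps the last value.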

## Profiles
@@ -243,4 +240,3 @@ The Specberus object exposes the following API that's useful for validation:
* `getDocumentDate()`. Returns a Date object that matches the document's date as specified in the
headers' h2.
* `getDocumentDateElement()`. Returns the element that contains the document's date.

73 changes: 42 additions & 31 deletions app.js
@@ -1,24 +1,37 @@
/*jshint es5: true*/
/**
* Main runnable file of Specberus.
*/

// Pseudo-constants:
var DEFAULT_PORT = 80;
// Settings:
const DEFAULT_PORT = 80;

// The Express and Socket.io server interface
var express = require("express")
, bodyParser = require('body-parser')
// Native packages:
const http = require('http');

// External packages:
const bodyParser = require('body-parser')
, compression = require('compression')
, express = require('express')
, insafe = require('insafe')
, morgan = require('morgan')
, app = express()
, server = require("http").createServer(app)
, io = require("socket.io").listen(server)
, Specberus = new require("./lib/validator").Specberus
, l10n = require("./lib/l10n")
, util = require("util")
, events = require("events")
, insafe = require("insafe")
, version = require("./package.json").version
, socket = require('socket.io')
;

// Internal packages:
const package = require('./package.json')
, l10n = require('./lib/l10n')
, sink = require('./lib/sink')
, validator = require('./lib/validator')
;

const app = express()
, server = http.createServer(app)
, io = socket.listen(server)
, profiles = {}
, Sink = sink.Sink
, version = package.version
;

("FPWD FPLC FPCR WD LC CR PR PER REC RSCND " +
"CG-NOTE FPIG-NOTE IG-NOTE FPWG-NOTE WG-NOTE " +
"WD-Echidna " +
@@ -49,47 +62,45 @@ server.listen(process.argv[2] || process.env.PORT || DEFAULT_PORT);
// error, { name: "test name", code: "FOO" }
// done, { name: "test name" }
// finished
function Sink () {}
util.inherits(Sink, events.EventEmitter);

io.sockets.on("connection", function (socket) {
socket.emit("handshake", { version: version });
socket.on("validate", function (data) {
if (!data.url) return socket.emit("exception", { message: "URL not provided." });
if (!data.profile) return socket.emit("exception", { message: "Profile not provided." });
if (!profiles[data.profile]) return socket.emit("exception", { message: "Profile does not exist." });
var validator = new Specberus()
, sink = new Sink
var v = new validator.Specberus
, handler = new Sink
, profile = profiles[data.profile]
;
socket.emit("start", {
rules: (profile.rules || []).map(function (rule) { return rule.name; })
});
sink.on("ok", function (type) {
handler.on("ok", function (type) {
socket.emit("ok", { name: type });
});
sink.on("err", function (type, data) {
handler.on("err", function (type, data) {
data.name = type;
data.message = l10n.message(validator.config.lang, type, data.key, data.extra);
data.message = l10n.message(v.config.lang, type, data.key, data.extra);
socket.emit("err", data);
});
sink.on("warning", function (type, data) {
handler.on("warning", function (type, data) {
data.name = type;
data.message = l10n.message(validator.config.lang, type, data.key, data.extra);
data.message = l10n.message(v.config.lang, type, data.key, data.extra);
socket.emit("warning", data);
});
sink.on('info', function (type, data) {
handler.on('info', function (type, data) {
data.name = type;
data.message = l10n.message(validator.config.lang, type, data.key, data.extra);
data.message = l10n.message(v.config.lang, type, data.key, data.extra);
socket.emit('info', data);
});
sink.on("done", function (name) {
handler.on("done", function (name) {
socket.emit("done", { name: name });
});
sink.on("end-all", function () {
handler.on("end-all", function () {
socket.emit("finished");
});
sink.on("exception", function (data) {
handler.on("exception", function (data) {
socket.emit("exception", data);
});
insafe.check({
@@ -98,10 +109,10 @@ io.sockets.on("connection", function (socket) {
}).then(function(res){
if(res.status) {
try {
validator.validate({
v.validate({
url: res.url
, profile: profile
, events: sink
, events: handler
, validation: data.validation
, noRecTrack: data.noRecTrack
, informativeOnly: data.informativeOnly
1 change: 0 additions & 1 deletion lib/profiles/base.js
@@ -56,6 +56,5 @@ exports.rules = [
, require("../rules/validation/html")
, require("../rules/validation/css")
, require('../rules/validation/wcag')
, require("../rules/heuristic/group")
, require('../rules/heuristic/date-format')
];
10 changes: 10 additions & 0 deletions lib/profiles/metadata.js
@@ -0,0 +1,10 @@
/**
* Pseudo-profile for metadata extraction.
*/

exports.name = 'Metadata';

exports.rules = [
require('../rules/metadata/profile')
, require('../rules/metadata/deliverers')
];
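
How the validator combines the results of these pseudo-rules is not shown in this diff. One plausible approach, sketched here purely as an assumption, is to merge whatever object each rule passes to `done()`; `runMetadataRules` and the stub rules below are hypothetical and only mimic the `check(sr, done)` shape used by the rule files:

```javascript
// Hypothetical runner (NOT the Specberus implementation): calls every rule's
// check(sr, done) and merges the objects handed to done() into one result.
function runMetadataRules(rules, sr, callback) {
    var result = {}
    ,   pending = rules.length
    ;
    if (pending === 0) return callback(result);
    rules.forEach(function (rule) {
        rule.check(sr, function (partial) {
            // A rule may call done() with no argument when it finds nothing.
            Object.keys(partial || {}).forEach(function (key) {
                result[key] = partial[key];
            });
            pending -= 1;
            if (pending === 0) callback(result);
        });
    });
}

// Stub rules with made-up results, standing in for the real pseudo-rules.
var stubRules = [
    { name: 'metadata.profile', check: function (sr, done) { done({ profile: 'WD' }); } }
,   { name: 'metadata.deliverers', check: function (sr, done) { done({ delivererIDs: [47318] }); } }
,   { name: 'metadata.nothing', check: function (sr, done) { done(); } }
];

var merged;
runMetadataRules(stubRules, {}, function (result) { merged = result; });
console.log(merged);
```

Counting pending rules rather than assuming synchronous completion keeps the sketch valid for rules that call `done()` asynchronously.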
2 changes: 0 additions & 2 deletions lib/rules/headers/h2-status.js
@@ -27,7 +27,6 @@ exports.check = function (sr, done) {
var rx = new RegExp('^W3C ' + profiles.tracks[i].profiles[j].name + '( |,)', 'i');
if (rx.test(txt)) {
profileFound = true;
sr.metadata('status', profiles.tracks[i].profiles[j].id);
}
j ++;
}
@@ -39,4 +38,3 @@ exports.check = function (sr, done) {
done();

};

45 changes: 0 additions & 45 deletions lib/rules/heuristic/group.js

This file was deleted.

40 changes: 40 additions & 0 deletions lib/rules/metadata/deliverers.js
@@ -0,0 +1,40 @@
/**
* Pseudo-rule for metadata extraction: deliverers' IDs.
*/

// Settings:
const REGEX_DELIVERER_URL = /^((https?:)?\/\/)?(www\.)?w3\.org\/2004\/01\/pp-impl\/\d+\/status(#.*)?$/i
, REGEX_DELIVERER_TEXT = /^public\s+list\s+of\s+any\s+patent\s+disclosures(\s+\(.+\))?$/i
, REGEX_DELIVERER_ID = /pp-impl\/(\d+)\/status/i
;

exports.name = 'metadata.deliverers';

exports.check = function(sr, done) {

var ids = []
, found = {} // IDs already collected; shared by all links so that duplicates are skipped
;

if (sr && sr.getSotDSection() && sr.getSotDSection().find('a[href]')) {

sr.getSotDSection().find('a[href]').each(function() {

var item = sr.$(this)
, href = item.attr('href')
, text = sr.norm(item.text())
;

if (REGEX_DELIVERER_URL.test(href) && REGEX_DELIVERER_TEXT.test(text)) {
var id = REGEX_DELIVERER_ID.exec(href);
if (id && id.length > 1 && !found[id[1]]) {
found[id[1]] = true; // key by the captured ID, not by the whole match array
ids.push(parseInt(id[1], 10));
}
}
});

}

done({delivererIDs: ids});

};
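
The matching logic of this rule can be tried outside Specberus. The sketch below reuses the three regular expressions from the file, but replaces the anchors found in the SotD section with plain `{href, text}` objects; `extractDelivererIDs` and the sample links are hypothetical, for illustration only:

```javascript
// Regular expressions copied from lib/rules/metadata/deliverers.js.
const REGEX_DELIVERER_URL = /^((https?:)?\/\/)?(www\.)?w3\.org\/2004\/01\/pp-impl\/\d+\/status(#.*)?$/i
,   REGEX_DELIVERER_TEXT = /^public\s+list\s+of\s+any\s+patent\s+disclosures(\s+\(.+\))?$/i
,   REGEX_DELIVERER_ID = /pp-impl\/(\d+)\/status/i
;

// Hypothetical standalone version: `links` is an array of {href, text}
// objects (text already whitespace-normalised) instead of DOM anchors.
function extractDelivererIDs(links) {
    var found = {}
    ,   ids = []
    ;
    links.forEach(function (link) {
        if (REGEX_DELIVERER_URL.test(link.href) && REGEX_DELIVERER_TEXT.test(link.text)) {
            var match = REGEX_DELIVERER_ID.exec(link.href);
            if (match && !found[match[1]]) {
                found[match[1]] = true;           // remember the ID to skip duplicates
                ids.push(parseInt(match[1], 10)); // IDs are reported as numbers
            }
        }
    });
    return ids;
}

// Made-up sample links: two distinct groups, one duplicate, one unrelated link.
var sampleLinks = [
    { href: 'http://www.w3.org/2004/01/pp-impl/47318/status', text: 'public list of any patent disclosures' }
,   { href: 'https://www.w3.org/2004/01/pp-impl/43696/status#disclosures', text: 'public list of any patent disclosures (HTML Working Group)' }
,   { href: 'http://www.w3.org/2004/01/pp-impl/47318/status', text: 'public list of any patent disclosures' }
,   { href: 'http://www.w3.org/WAI/PF/', text: 'Protocols & Formats Working Group' }
];

console.log(extractDelivererIDs(sampleLinks));
```

Requiring both the URL pattern and the link text to match reduces the chance of picking up an unrelated link that merely points at a patent policy page.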
48 changes: 48 additions & 0 deletions lib/rules/metadata/profile.js
@@ -0,0 +1,48 @@
/**
* Pseudo-rule for metadata extraction: profile.
*/

// Settings:
const SELECTOR_SUBTITLE = 'body div.head h2';

// Internal packages:
const profiles = require('../../../public/data/profiles');

exports.name = 'metadata.profile';

exports.check = function(sr, done) {

var candidate
, track
, profile
, matchedLength = 0
, id
, i
, j
;

sr.$(SELECTOR_SUBTITLE).each(function() {
candidate = sr.norm(sr.$(this).text()).toLowerCase();
i = 0;
while (i < profiles.tracks.length) {
track = profiles.tracks[i].profiles;
j = 0;
while (j < track.length) {
profile = track[j];
if (-1 !== candidate.indexOf(profile.name.toLowerCase()) && matchedLength < profile.name.length) {
id = profile.id;
matchedLength = profile.name.length;
}
j++;
}
i++;
}
});
if (id) {
done({profile: id});
}
else {
done();
}

};
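
The rule above picks the profile whose name is the longest one contained in the subtitle, so that a more specific name wins over a shorter name it contains. A standalone sketch of that idea follows; the function `detectProfile` and the sample track data are hypothetical (the real names live in `public/data/profiles.json`, which is not shown here):

```javascript
// Made-up sample data mimicking the tracks[].profiles[] shape used by the rule.
var sampleTracks = [
    { profiles: [
        { id: 'FPWD', name: 'First Public Working Draft' }
    ,   { id: 'WD', name: 'Working Draft' }
    ,   { id: 'REC', name: 'Recommendation' }
    ] }
];

// Hypothetical standalone version: `subtitles` is an array of strings
// (the text of each h2 in the document head) instead of DOM elements.
function detectProfile(subtitles, tracks) {
    var id
    ,   matchedLength = 0
    ;
    subtitles.forEach(function (subtitle) {
        var candidate = subtitle.toLowerCase();
        tracks.forEach(function (track) {
            track.profiles.forEach(function (profile) {
                // Keep the longest profile name found in the subtitle.
                if (candidate.indexOf(profile.name.toLowerCase()) !== -1 &&
                    matchedLength < profile.name.length) {
                    id = profile.id;
                    matchedLength = profile.name.length;
                }
            });
        });
    });
    return id; // undefined when no profile name matches
}

console.log(detectProfile(['W3C First Public Working Draft 22 March 2016'], sampleTracks));
console.log(detectProfile(['W3C Working Draft 12 June 2014'], sampleTracks));
```

Without the length comparison, the first subtitle could be reported as `WD` or `FPWD` depending on the order of the profiles in the data.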