Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Switch to he for decoding HTML entities #3

Closed
wants to merge 1 commit into from
Closed

Switch to he for decoding HTML entities #3

wants to merge 1 commit into from

Conversation

mathiasbynens
Copy link

Fixes #2.

@ForbesLindesay
Copy link
Member

I'm reluctant to merge this because he seems to be such a mess relative to ent. ent is about 11 files (after my patches) whereas he is 22 files. With ent there's a clear index.js file that's the center of what's going on. That index.js file is a clean 57 lines of easy to understand code. The corresponding he.js file is 132 lines, of which about half seems to be boilerplate for handling different module systems. Pretty much the exact same file seems to be duplicated in /src/he.js. It's not at all clear where /src/data.js is used?

Don't get me wrong, you seem to have a great understanding of character encoding in HTML and JavaScript (far more extensive than mine at the moment) but the code looks like something that's been spat out of an automated build tool.

My suggestion would be to begin by building a pure node.js module. Compile that with browserify --standalone he to a /dist directory and point all the non-node.js compatible package managers there (i.e. component, bower, amd etc.) In addition, provide that as the download link. This will give you clean, maintainable code. Write your tests so that they also make use of require and test in other environments by browserifying your test script. This will produce a short, clean, simple test file.

If nobody has requested support for all those other really obscure package managers, don't support them. Just tell people where to download the standalone file and be done with it. As it stands, I'd rather fix the bugs in ent and have a small, maintainable package than use another monolithic library that I might one day have to maintain if you give up on it.

@mathiasbynens
Copy link
Author

I'm reluctant to merge this because he seems to be such a mess relative to ent.

Ouch.

If you honestly prefer compact, “readable” code that doesn’t deliver on its promise over slightly more convoluted but properly tested and fully working code, I don’t really know what to say.

By the way, he started out as a pull request for ent. As soon as I realized I would have to rewrite the whole thing, add a few thousand of tests, as well as write patches for all the other broken HTML entity libs out there, I decided to build my own library instead. Just seemed like a much more efficient way of solving the problem of “there is no fully working HTML entity library in JS”.

ent is about 11 files (after my patches) whereas he is 22 files.

Why does this matter? Also, note that after npm install ent he, node_modules/ent contains 8 folders/files, and he contains 3 files.

That [ent] index.js file is a clean 57 lines of easy to understand code.

What good is easy to understand code if it’s broken?

The corresponding he.js file is 132 lines, of which about half seems to be boilerplate for handling different module systems. Pretty much the exact same file seems to be duplicated in /src/he.js. It's not at all clear where /src/data.js is used?

src/data.js is used at the build step. Running grunt build downloads the WHATWG’s JSON file containing all the known named character references, and saves it in tests/entities.json. Then, src/he.js is used as a template — things like <%= astralSymbols %> are replaced with whatever src/data.json exports as the astralSymbols property. This way, the lookup tables for encoding/decoding entities as well as some regular expressions are dynamically generated as part of the build process.

The same thing goes for (part of) the tests, by the way: tests/tests.src.js is used as a template, in which the contents of the downloaded entities.json file is injected. See the template task in the Gruntfile for details.

This way, whenever the official entities.json file changes (which is highly unlikely, but hey), he can be updated by just running a single command. So much for “maintenance”.

Needless to say, doing all this manually would be soul-crushing and error-prone.

Don't get me wrong, you seem to have a great understanding of character encoding in HTML and JavaScript (far more extensive than mine at the moment) but the code looks like something that's been spat out of an automated build tool.

That’s because it is. Why is that a problem, though?

Write your tests so that they also make use of require and test in other environments by browserifying your test script.

My tests make use of require, both using RequireJS and the native require available in some environments.

This will produce a short, clean, simple test file.

he currently has 5953 tests in its test suite. How is this a bad thing? Why would I want to simplify the test file?

As it stands, I'd rather fix the bugs in ent and have a small, maintainable package […]

There’s no way (that I know of) to fix all the bugs in ent without ending up with the same code base as he.js. Sure, you could remove the few lines of export logic for environments you don’t care about, but will that really make a difference?

[…] than use another monolithic library that I might one day have to maintain if you give up on it.

he maintenance boils down to running grunt build and bumping the version number whenever a new version of entities.json comes about (which is very unlikely to happen anyway). I think we can handle it :)

@ForbesLindesay
Copy link
Member

I prefer working code, but I'm not convinced we can't get to working code without anything in the repository that's not directly related to that project.

Are you aware of any bugs in ent.decode that would remain un-fixed after my updates?

Having the project go through some sort of build tool is fine. The problem is that the results of that build file are mostly what you see when you open the repository on GitHub. Even the stuff in src looks like the result of a build tool, not the input to a build tool. If you are running your code through a build tool anyway, why not use an automated system like umd (which comes with a CLI utility) to support all the crazy module system checks.

You missed my point with the tests. According to the output of ent, substack has 7248 tests. Clearly neither of these statistics are true. The reality is you both have about 8 tests. 4 encoding tests and 4 decoding tests. The rest is just running through hundreds/thousands of examples within those. As a result of you in-lining your test data you have a test script which is 6066 lines long. You also have a totally un-necessary build step that could just as easily be taken care of by node.js's require.

In general, adding build steps to JavaScript vastly slows down development. It makes sense to have an automated step to download the entities and it makes sense to add a build step to deal with all the backwards, non-CommonJS module systems. These are both great because they only need to be run once, when you do a release. he on the other hand, uses build steps for everything, to the point where every single change to the code requires rebuilding to see if it has worked.

Here, for reference, is your entire module, without any requirement for a build step:

var regexEncode = require('./regexEncode.js');
var encodeMap = require('./data/encode-map.json');
var decodeMap = require('./data/decode-map.json');

var regexAstralSymbols = /[\uD800-\uDBFF][\uDC00-\uDFFF]/g;
var regexEscape = /[&<>"']/g;
var escapeMap = {
  '&': '&amp;',
  '<': '&lt;',
  '"': '&quot;',
  '\'': '&apos;',
  // See http://mathiasbynens.be/notes/ambiguous-ampersands: the following
  // is not strictly necessary unless part of a tag or an unquoted attribute
  // value. We’re only escaping it to match existing `htmlEscape`
  // implementations.
  '>': '&gt;'
};

// Inspired by `ucs2encode` in http://mths.be/punycode
var codePointToSymbol = function(codePoint) {
  var output = '';
  if (codePoint > 0xFFFF) {
    codePoint -= 0x10000;
    output += String.fromCharCode(codePoint >>> 10 & 0x3FF | 0xD800);
    codePoint = 0xDC00 | codePoint & 0x3FF;
  }
  output += String.fromCharCode(codePoint);
  return output;
};

exports.escape = function(string) {
  return string.replace(regexEscape, function($0) {
    return escapeMap[$0]; // no need to check `has()` here
  });
};

exports.encode = function(string) {
  return string
    .replace(regexEncode, function($0) {
      return encodeMap[$0]; // no need to check `has()` here
    })
    .replace(regexAstralSymbols, function($0) {
      // http://mathiasbynens.be/notes/javascript-encoding#surrogate-formulae
      var high = $0.charCodeAt(0);
      var low = $0.charCodeAt(1);
      var codePoint = (high - 0xD800) * 0x400 + low - 0xDC00 + 0x10000;
      return '&#x' + codePoint.toString(16).toUpperCase() + ';';
    })
    // Replace other non-ASCII chars with a hexadecimal escape
    .replace(/[^\0-\x7F]/g, function($0) {
      return '&#x' + $0.charCodeAt(0).toString(16).toUpperCase() + ';';
    });
};

exports.unescape = 
exports.decode = function(html) {
  return html
    .replace(/&#([0-9]+);?/g, function($0, codePoint) {
      return codePointToSymbol(codePoint);
    })
    .replace(/&#[xX]([0-9a-fA-F]+);?/g, function($0, hexDigits) {
      var codePoint = parseInt(hexDigits, 16);
      return codePointToSymbol(codePoint);
    })
    .replace(/&([0-9a-zA-Z]+;?)/g, function($0, reference) {
      if ({}.hasOwnProperty.call(decodeMap, reference)) {
        return decodeMap[reference];
      } else {
        // ambiguous ampersand; see http://mths.be/notes/ambiguous-ampersands
        return '&' + reference;
      }
    });
}

My guess is that by using a hasOwnProperty check on the encodeMap you could fix it so that regexEncode was sufficiently small to be in-lined.

With this, your repository would consist of:

  • data
    • encode-map.json
    • decode-map.json
  • dist
    • he.js (this is the built file, placed out of the way so nobody has to look at it)
  • test
    • fixtures
      • a few json files probably go in here
    • index.js (the 8 actual tests)
  • .gitignore
  • .travis.yml
  • Gruntile.js (handles downloading and creating updated encode-map.json and decode-map.json + kicking off tests + producing dist folder)
  • LICENSE
  • index.js
  • package.json

If it were me, I wouldn't include dist in that list, since it should really be hosted somewhere other than the GitHub repository, which is for code, not build artifacts. If you really must support component and bower, that adds another 2 files.

As for your final point:

he maintenance boils down to running grunt build and bumping the version number whenever a new version of entities.json comes about (which is very unlikely to happen anyway). I think we can handle it :)

This totally misses how software development works. When errors occur and when changes occur, any code I take on is code I have to maintain. For example, every time there's a character encoding bug, I have to consider the possibility that it's caused by he, which means I have to be happy I can understand and debug that code when that time comes. With software development, it's not a case of if bugs and changes to requirements happen, it's when and I want to remain as well prepared as I can.

@domenic
Copy link

domenic commented Jun 26, 2013

Wow, @ForbesLindesay's code is ridiculously nicer than what's in the he repository (after I even figured out which of the files to look at). Looking forward to seeing if @mathiasbynens accepts that simplification.

@domenic
Copy link

domenic commented Jun 26, 2013

Also, a minor critique of your directory structure @ForbesLindesay: the data/ directory should probably not be in the Git repository, but should be added to the package with a prepublish script that runs the grunt build step.

@mathiasbynens
Copy link
Author

Wow, @ForbesLindesay's code is ridiculously nicer than what's in the he repository (after I even figured out which of the files to look at).

I agree it looks much nicer this way. If he were a Node-only project, I’d definitely take this route.

But I want there to be a single build of he that work across various JavaScript platforms, including Node.js, Narwhal, RingoJS, Rhino, PhantomJS, and web browsers (old and new). I also want to have a single test file that is used to run the unit tests in all those environments. Sure, I could use require everywhere and use Browserify to create builds for sans-require environments, and the repository would be a bit nicer to look at. But then I’d be at the mercy of Browserify’s output: if there is a bug in Browserify, it could break he in certain environments. This brings us back to @ForbesLindesay’s point about how software development works:

For example, every time there's a character encoding bug, I have to consider the possibility that it's caused by he browserify, which means I have to be happy I can understand and debug that code when that time comes. With software development, it's not a case of if bugs and changes to requirements happen, it's when and I want to remain as well prepared as I can.

@ForbesLindesay
Copy link
Member

browserify he ent
stars 2,261 5 54
downloads in the last day 2,337 18 542
downloads in the last week 11,715 18 4,264
downloads in the last month 41,428 18 9,684

Browserify is well written, maintainable code that thousands of people are using. It's pretty safe to rely on. In any case, my point is that I'm deciding between ent and he while you're deciding between browserify and .

If you don't trust browserify, write you own module to produce standalone builds for use in non node.js environments. If you wanted to, you could create something that takes an approximation of the ES6 module format as its input and outputs something that can be digested by node/browsers/r.js etc. That would be pretty neat, but all of that stuff is unrelated to he so it doesn't belong in the repository.

If you think browserify has problems that can be improved, submit pull requests. @substack is pretty on the ball about accepting those pull requests when they're good and they fix real problems.

@mathiasbynens
Copy link
Author

Thanks to your suggestions, the directory structure for he is now a bit cleaner. Check it out. For example, https://github.com/mathiasbynens/he/tree/master/data contains all the data as separate JSON files.

I do intend to keep the export logic for various platforms in the main he.js build, though, for the reasons outlined above. If you’d prefer to stick with ent for this (or any other) reason, please close this pull request. Thank you!

@mathiasbynens
Copy link
Author

Any update on this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[upstream] ent.decode is buggy
3 participants