Changing class and id check to node name check to follow more semantic HTML #848

panda01 · 2024-03-12T13:24:43Z

I opened this PR to try to fix persistent issues I noticed when fetching data from a website using reader view. The main fix changes the unlikelyCandidates regex check to test node.nodeName instead of the class and id names combined in matchString.

I'm not really sure why we even check the class/id names, and this PR is kind of to ask that question while proposing a solution as well.

Some examples of websites that are broken and that this fixes (will add more):
https://www.windsorstore.com/blogs/occasions/11-gorgeous-anniversary-outfit-ideas

One problem I noticed is that the failing tests all seem to complain about missing classes or ids, but I'm having a hard time understanding some of the errors and spotting the differences. With time I'm sure I could figure it out, but maybe there is some automated way to get the diff between the actual and expected output?

I'm also willing to clean this up with either a squash and merge or a rebase.

I'm also open to any criticism! This thing works wonderfully 95% of the time and I don't want to break it!

…st the nodeName instead of the className and id

gijsk · 2024-03-12T13:48:28Z

I'm a bit confused by this PR. Most of the substrings in unlikelyCandidates will never match node names, only classes or IDs. There are no HTML (or XUL or SVG) tags called disqus or social, and there probably never will be. The intent of these checks is to strip nodes that probably shouldn't be candidates, or if that doesn't leave enough content, to down-score them compared to other nodes.

The net effect of your patch is likely roughly the same as eliminating these regular expressions and associated code entirely (or at least reducing the regular expressions to only article|body|main for okMaybeItsACandidate and footer|header|menu for unlikelyCandidates).

It would be useful to have more examples of the websites you're trying to fix, and a bit more detail as to what nodes are getting stripped / downscored that you think shouldn't be. Right now I'm not sure what to suggest instead.
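
To make the difference between the two checks concrete, here is a minimal sketch. The two patterns are abbreviated stand-ins built only from substrings mentioned in this thread (the real lists in Readability.js are longer), and the helper names and sample nodes are made up for illustration.

```javascript
// Abbreviated stand-ins for the two patterns discussed above; the real
// lists in Readability.js are longer. Only substrings named in this
// thread are included.
const unlikelyCandidates = /disqus|social|footer|header|menu/i;
const okMaybeItsACandidate = /article|body|main/i;

// Class/id check: test the combined class and id string.
function isUnlikelyByClassAndId(node) {
  const matchString = node.className + " " + node.id;
  return unlikelyCandidates.test(matchString) &&
         !okMaybeItsACandidate.test(matchString);
}

// Node name check, as proposed in this PR.
function isUnlikelyByNodeName(node) {
  const name = node.nodeName.toLowerCase();
  return unlikelyCandidates.test(name) && !okMaybeItsACandidate.test(name);
}

// Plain objects standing in for DOM nodes (made-up sample data).
const sampleNodes = [
  { nodeName: "DIV", className: "social-share", id: "" },
  { nodeName: "DIV", className: "disqus_thread", id: "comments" },
  { nodeName: "FOOTER", className: "", id: "" },
  { nodeName: "ARTICLE", className: "post", id: "" },
];

for (const node of sampleNodes) {
  console.log(
    node.nodeName,
    JSON.stringify(node.className),
    "class/id:", isUnlikelyByClassAndId(node),
    "nodeName:", isUnlikelyByNodeName(node)
  );
}
```

In this toy data the two `div` wrappers are flagged only by the class/id check, and the bare `footer` is flagged only by the node name check, which is why switching checks drops most of the existing filtering.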

panda01 · 2024-03-12T18:33:03Z

Thank you for getting back to me so quickly, I appreciate it! I apologize for my tardiness; I was trying to gather more examples. Here are just a couple of them, and I'm trying to come up with more.

This URL misses the first few paragraphs, seemingly because they have "section" in their class name and do not match any of the okMaybeItsACandidate classes:
https://www.troweprice.com/personal-investing/resources/insights/retirement-savings-by-age-what-to-do-with-your-portfolio.html

This one captures some of the hidden data but not the rest (the FAQs), because of the hidden class on the FAQ accordion and the presence of aria-hidden="true". I would prefer that it captured either neither or both, and that this did not hinge on the class containing something like "hidden":
https://www.ondemand.labcorp.com/lab-tests/complete-blood-count

On my local instance, with the char threshold lowered to 300 for this page, it still collects the list of items below it:
https://www.krefel.be/nl/c/airfryers

panda01 · 2024-03-12T18:38:59Z

I also want to say that I very much agree with you that the items in the list aren't really elements; I thought maybe they were possible tag names or something. I understand that if there are no top candidates it will do a second loop, but a lot of my issues stem from there being some topCandidates that do work out, so it never hits that second loop.

I'm not really sure what the solution is. This is just something I tried that seemed to work pretty well, even with the tests, and I'm open to working out some other solution. I would just expect that using semantic tags like <article /> and/or <section /> would mean something for capturing the content on the page, and in some scenarios it doesn't seem to.
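
The "never hits that second loop" situation can be sketched with a toy model. This is not Readability's actual code: `extract`, the `unlikely` flag, and the threshold values are all hypothetical, and the model only mirrors the behaviour described in this thread (strip unlikely nodes first, retry without stripping only if too little content is left).

```javascript
// Toy model (hypothetical names): try with stripping first, and fall back
// to a pass without stripping only when the result is too short.
function extract(blocks, charThreshold) {
  for (const stripUnlikely of [true, false]) {
    const kept = blocks.filter((b) => !(stripUnlikely && b.unlikely));
    const text = kept.map((b) => b.text).join("\n");
    if (text.length >= charThreshold) {
      // Enough content: later, more permissive passes are never tried.
      return { text, stripUnlikely };
    }
  }
  // No pass reached the threshold; return everything unstripped.
  const text = blocks.map((b) => b.text).join("\n");
  return { text, stripUnlikely: false };
}

// Made-up page: an intro wrapped in a container flagged as unlikely,
// followed by a longer body that is not flagged.
const blocks = [
  { text: "I".repeat(200), unlikely: true },
  { text: "B".repeat(600), unlikely: false },
];

const firstPassWins = extract(blocks, 500);
const needsSecondPass = extract(blocks, 700);

console.log(firstPassWins.stripUnlikely, firstPassWins.text.includes("I"));
console.log(needsSecondPass.stripUnlikely, needsSecondPass.text.includes("I"));
```

With the lower threshold the first pass already has enough text, so the flagged intro is lost even though the extraction "succeeds"; only the higher threshold forces the retry that brings it back.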

…things

…s in order to fix certain pages like https://www.consumerreports.org/appliances/air-purifiers/best-air-purifiers-of-the-year-a1197763201/

…o 25% for paragraphs to make pages like https://www.consumerreports.org/cars/car-reliability-owner-satisfaction/10-most-reliable-cars-a6569295379/ and https://www.consumerreports.org/cars/car-reliability-owner-satisfaction/10-most-satisfying-cars-owner-satisfaction-a2239167129/

gijsk

Hey @panda01, thanks for continuing to work on this! I'm aware this is still marked draft so not sure if you were planning to make further changes... but just a few quick notes inline. Is the added logging helping you get to the bottom of what was going on in the testcases that you were dealing with?

gijsk · 2024-04-08T13:36:25Z

Readability.js

@@ -143,7 +143,7 @@ Readability.prototype = {
    b64DataUrl: /^data:\s*([^\s;,]+)\s*;\s*base64\s*,/i,
    // Commas as used in Latin, Sindhi, Chinese and various other scripts.
    // see: https://en.wikipedia.org/wiki/Comma#Comma_variants
-    commas: /\u002C|\u060C|\uFE50|\uFE10|\uFE11|\u2E41|\u2E34|\u2E32|\uFF0C/g,
+    commas: /[\s\D][\u002C|\u060C|\uFE50|\uFE10|\uFE11|\u2E41|\u2E34|\u2E32|\uFF0C][\s\D]/g,


This can probably be dropped in favour of #853, I guess?
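
For reference, a quick check of how the two patterns in the diff above behave. The patterns are copied verbatim from the diff; the `countMatches` helper and the sample strings are made up for illustration.

```javascript
// Patterns copied verbatim from the diff above.
const oldCommas = /\u002C|\u060C|\uFE50|\uFE10|\uFE11|\u2E41|\u2E34|\u2E32|\uFF0C/g;
const newCommas = /[\s\D][\u002C|\u060C|\uFE50|\uFE10|\uFE11|\u2E41|\u2E34|\u2E32|\uFF0C][\s\D]/g;

// Hypothetical helper: number of matches of a global regex in a string.
function countMatches(text, re) {
  return (text.match(re) || []).length;
}

const sentence = "Costs $1,299, which is steep, honestly";

console.log(countMatches(sentence, oldCommas)); // 3
// Only "p, " matches: the comma after "299" follows a digit, so it is
// skipped along with the one inside the price.
console.log(countMatches(sentence, newCommas)); // 1

// Inside a character class "|" is a literal, so the new pattern also
// matches a pipe between two non-digits.
console.log(countMatches("a|b", oldCommas)); // 0
console.log(countMatches("a|b", newCommas)); // 1

// A comma at the very end of the text has no following character.
console.log(countMatches("end,", newCommas)); // 0
```

So besides skipping commas inside prices, the new pattern also skips any comma that directly follows a digit or sits at the start or end of the text, and it counts literal `|` characters.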

gijsk · 2024-04-08T13:36:54Z

Readability.js

+          if (this.REGEXPS.unlikelyCandidates.test(node.nodeName.toLowerCase()) &&
+              !this.REGEXPS.okMaybeItsACandidate.test(node.nodeName.toLowerCase()) &&


As noted earlier, we would not take this change as it breaks most of the existing filtering that Readability does.

gijsk · 2024-04-08T13:38:07Z

Readability.js

+        var linkDensity = this._getLinkDensity(node);
+        var contentLength = this._getInnerText(node).length;
+        var lessParagraphsThanImages = (img > 1 && p / img < 0.5 && !this._hasAncestorTag(node, "figure"));
+        var isNotListAndMoreListItemsThanParagraphs = (!isList && li > p);
+        var moreInputsThanPs = (input > Math.floor(p/3));
+        var headingDensityAndContentLengthOff = (!isList && headingDensity < 0.9 && contentLength < 25 && (img === 0 || img > 2) && !this._hasAncestorTag(node, "figure"));
+        var weightAndLinkDensityIsLow = (!isList && weight < 25 && linkDensity > 0.25);
+        var weightAndLinkDensityTooHigh = (weight >= 25 && linkDensity > 0.5);
+        var embedCountAndContentLengthOff = ((embedCount === 1 && contentLength < 75) || embedCount > 1);
        var haveToRemove =


Thanks for organizing this better!

gijsk · 2024-04-08T13:38:41Z

package.json

@@ -1,6 +1,6 @@
 {
-  "name": "@mozilla/readability",
-  "version": "0.5.0",
+  "name": "@panda01/readability",


I assume this change was unintentional?

panda01 · 2024-04-11T19:03:51Z

Firstly, @gijsk, thank you very much! I appreciate all of your feedback! I will close this PR, though, as it is based off of the main branch and I simply didn't realize it. If you want any of the changes, I can create another branch and cherry-pick some of the code changes.

To answer your question above: the extra console logs did indeed help me understand how this works, spot some of the issues, and find out exactly why certain pages weren't being captured properly, which has been a task of mine recently for sure!

Just let me know if there are any changes you want, I guess with comments or something, and we can figure out how to get them into that marvelous repo y'all are maintaining over there. It would be my honor!

panda01 added 5 commits March 11, 2024 09:39

[unlikelyCandidatesFix] testing Unlikely, and likely candidates again…

d2fc686

…st the nodeName instead of the className and id

[Logs] added more logs around some of the other events that happen

1cf0b50

[toLowerCase] lower case the tagName check

b2b5e6a

Adding back nodeName check instead of class and id check

23eef7e

[PR completeness] Fixed linting issues, and a test

73d938d

panda01 added 11 commits March 21, 2024 16:03

[price commas fix] Adding a fix to not count commas in the prices of …

11df464

…things

[naming] changing the package name as not to get conflicts

2b7125f

0.5.1

3329331

0.5.2

b67b74d

[readability positivelist] adding cda-round-up to positive class name…

88cc372

…s in order to fix certain pages like https://www.consumerreports.org/appliances/air-purifiers/best-air-purifiers-of-the-year-a1197763201/

0.5.3

09af4f9

[logs] adding logging around Cleanconditionally Remove Nodes

1fd2f03

0.5.4

c16d0bc

[snafu] adding back product to the list of negative classnames

a2e3121

0.5.5

64d9a1f

gijsk reviewed Apr 8, 2024

panda01 closed this Apr 11, 2024
