-
Notifications
You must be signed in to change notification settings - Fork 611
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changing class and id check to node name check to follow more semantic HTML #848
Conversation
…st the nodeName instead of the className and id
I'm a bit confused by this PR. Most of the substrings in The net effect of your patch is likely roughly the same as eliminating these regular expressions and associated code entirely (or at least reducing the regular expressions to only It would be useful to have more examples of the websites you're trying to fix, and a bit more detail as to what nodes are getting stripped / downscored that you think shouldn't be. Right now I'm not sure what to suggest instead. |
Thank you for getting back to me in such a timely fashion! I appreciate it, I apologize for my tardiness, I was attempting to try and gather more examples. And here are just a couple of them. I'm trying to come up with more. This Url misses like the first few paragraphs seemingly because they have a "section" in their class name, and also not one of the This one captures some of the hidden data, but not others (the FAQ's) because of the hidden class being on the FAQ accordian and the precense of the aria-hidden="true". I would actually prefer it captured neither, or both, but it also not be hinged on the class containing something like "hidden" On my local instance with the char threshold lowered to 300 for this page it still collects the list of items below it. |
I also want to say I very much agree with you, about the list not really being elements, I thought maybe they were a possible I'm not really sure what the solution is, this is just something I tried, that seemed to work pretty well, even with the tests. I'm open to working out some kind of solution. I would just imagine that if I give semantic tags like |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @panda01, thanks for continuing to work on this! I'm aware this is still marked draft so not sure if you were planning to make further changes... but just a few quick notes inline. Is the added logging helping you get to the bottom of what was going on in the testcases that you were dealing with?
@@ -143,7 +143,7 @@ Readability.prototype = { | |||
b64DataUrl: /^data:\s*([^\s;,]+)\s*;\s*base64\s*,/i, | |||
// Commas as used in Latin, Sindhi, Chinese and various other scripts. | |||
// see: https://en.wikipedia.org/wiki/Comma#Comma_variants | |||
commas: /\u002C|\u060C|\uFE50|\uFE10|\uFE11|\u2E41|\u2E34|\u2E32|\uFF0C/g, | |||
commas: /[\s\D][\u002C|\u060C|\uFE50|\uFE10|\uFE11|\u2E41|\u2E34|\u2E32|\uFF0C][\s\D]/g, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can probably be dropped in favour of #853, I guess?
if (this.REGEXPS.unlikelyCandidates.test(node.nodeName.toLowerCase()) && | ||
!this.REGEXPS.okMaybeItsACandidate.test(node.nodeName.toLowerCase()) && |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As noted earlier, we would not take this change as it breaks most of the existing filtering that Readability does.
var linkDensity = this._getLinkDensity(node); | ||
var contentLength = this._getInnerText(node).length; | ||
var lessParagraphsThanImages = (img > 1 && p / img < 0.5 && !this._hasAncestorTag(node, "figure")); | ||
var isNotListAndMoreListItemsThanParagraphs = (!isList && li > p); | ||
var moreInputsThanPs = (input > Math.floor(p/3)); | ||
var headingDensityAndContentLengthOff = (!isList && headingDensity < 0.9 && contentLength < 25 && (img === 0 || img > 2) && !this._hasAncestorTag(node, "figure")); | ||
var weightAndLinkDensityIsLow = (!isList && weight < 25 && linkDensity > 0.25); | ||
var weightAndLinkDensityTooHigh = (weight >= 25 && linkDensity > 0.5); | ||
var embedCountAndContentLengthOff = ((embedCount === 1 && contentLength < 75) || embedCount > 1); | ||
var haveToRemove = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for organizing this better!
@@ -1,6 +1,6 @@ | |||
{ | |||
"name": "@mozilla/readability", | |||
"version": "0.5.0", | |||
"name": "@panda01/readability", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume this change was unintentional?
Firstly, @gijsk Thank you very much! I appreciate all of your feedback! I will close this PR though, as this is based off of the main branch, and I simply didn't realize it, if anything if you want any of the changes I can create another branch and cherry pick some of the code changes. To answer your question above though; the extra console logs did indeed help me understand how this worked and also spot some of the issues and find out exactly why certain pages weren't being captured properly, which has been a task of mine recently for sure! Just let me know if there are any changes you want, I guess with comments or something, and we can figure out how to get it in that marvelous repo y'all are maintaining over there. It would be my honor! |
I added this PR to try and fix persistent issues I noticed when trying to fetch data from a website using reader view. the main fix is changing
unlikelyCandidates
regex check to instead check for thenode.nodeName
as opposed to checking the class and id names used viamatchString
.I'm not really sure why we even check the class/id names, and this PR is kind of to ask that question while proposing a solution as well.
Some Examples of websites that are broken, that this fixes. (will add more)
https://www.windsorstore.com/blogs/occasions/11-gorgeous-anniversary-outfit-ideas
Some problems I noticed is that with the tests they all seem to be complaining about classes or id's missing, but I'm having a hard time understanding some of the errors and spotting them. with time I'm sure I could figure it out, but maybe there is some automated way to get the diff between the actual and expected
I'm also willing to clean this up either with a squash and merge or a rebase, as well.
I'm also open to any criticism! This thing works wonderfully 95% of the time and I don't want to break it!