-
Notifications
You must be signed in to change notification settings - Fork 610
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Numerical Commas Fix #853
base: main
Are you sure you want to change the base?
Numerical Commas Fix #853
Conversation
It'd be helpful to get a little more detail here. Is there a specific part of this page (such as the description) you'd like to be included in the results? What's the character threshold value to see the expected results? |
@cmkm Sorry had a hectic past few days.
When Readability tries to grab the content it only grabs the product list and the descriptions below it. In my investigations I found that the comma detection was giving the product list extra points (like 55 was the score) because of the commas it was detecting in the prices like 109,00. I just am assuming here that detecting commas in prices, numbers, etc was not the original purpose of that code, and we really don't care about those anyway if we're looking for readable content. So i just changed the Regex to only look for commas surrounded by non digits. Which def helped with the extraction for the specific page above, and gave me that first paragraph and all of the content below. Which is def closer I would say to what a user might expect when using readability on that page |
Thank you! This seems reasonable, but before we merge it, would you mind please adding a test case by using the instructions here? If you have an example of a page with slightly more readerable content, that would be ideal. Thanks! :) |
@cmkm Awesome, I definitely would love to contribute! However I did try and run that script to no avail, and even went so far as adding some more console.log's on it to see where it's getting hung up. Below is a screenshot of my modifications and the output when I try and run the command. I'm not sure what I'm doing wrong here, so any guidance is greatly appreciated! |
Thanks for your patience here! I think what you're running into is the same thing that I was, which is that this website may be blocking certain user agents, so your requests are timing out. Is there another website test case that your patch fixes that you might be able to try instead? |
Busy couple of weeks there! I can def try and find some new source to check this. Just was thinking what happened with this PR |
So I tried to generate a testcase for this locally to test this. The problem is that it seems to do the opposite of what the comments suggest. Specifically, before this patch, the "expected" HTML includes the "Wil je graag een airfryer kopen?" text. After the patch, that paragraph is removed. |
@gijsk This is an image of the website on my ff v127.0 64bit apple intel. I don't see the above mentioned section "Wil je graag een airfryer kopen?" Also, with this particular website, I believe you also have to lower the character threshold in order for this to work. Either way, the product list shouldn't be counted so high in the algorithm. The reason why it is counted so high is because of the commas being counted in the prices, which is why this PR works. |
Hello, I would really like to contribute to this repo. And apparently this is still relevant, however this seems stuck. Is there something I'm missing here as to why this can't be merged? It's a very tiny PR, and I'm just having a hard time understanding what the hold up is, and what I can do differently. |
Well, we'd really like to make sure that the patch works and that we don't break it in the future. In order to do that it'd be helpful to have a testcase as part of the pull request that checks the output for a site where this PR makes a difference. As I said, I tried to generate one based on the site linked in the initial comment but the outcome of the changes here is that the produced text is worse than before the changes. It sounds like you're confident that this is a positive change so it would be helpful if you could include a testcase that demonstrates this. |
Added a fix to not count commas inside of things with prices like 5,99 etc to give a more accurate content score.
One example of this causing problems is on this page https://www.krefel.be/nl/c/airfryers. However I think you also have to lower the char threshold for it to work