[SITES] https://www.bbc.com #625
Comments
Hi there, make sure that you are not blocked by bbc - try:
Also check whether the article object has some content in the
Can you also try the same code with v0.9.3?
Thank you for responding. I am not blocked by the BBC. Most other fields are non-empty, such as title, authors, and even html, which seems to contain the text of the article. Only the 'text' field in the output, which used to be populated, is now empty. I noticed that the BBC website UI has changed, and I think that could be the reason for this issue. I am unable to import newspaper after installing v0.9.3. There seems to be a dependency issue, which I am trying to figure out, but I don't think that is the cause of this problem.
What's the error? I did in fact change how dependencies are installed.
You are right, there seems to be an issue with bbc. I will investigate it.
Yeah, there is a problem. It seems that bbc.com is now rendered dynamically: the page is constructed with JavaScript after it loads. Here you can see that there are no text elements to render without JavaScript: https://www.textise.net/showText.aspx?strURL=https%253A//www.bbc.com/news/business-67470876 Quick Fixes:
Anyway, I will think about alternative solutions.
The html component of the response seems to have the text content, although not in contiguous paragraph form. Maybe that is something to look at. I tried v0.9.3 on Google Colab. Regarding the error while importing newspaper, this is what I got when I did pip install:
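As a rough illustration of the observation above that the html field still carries the article text, here is a minimal sketch that pulls paragraph text out of raw HTML using only the Python standard library. The names `ParagraphCollector` and `text_from_html` are hypothetical and are not part of newspaper; this is a stopgap idea under the assumption that the body text sits in `<p>` elements, not how the library extracts text.

```python
from html.parser import HTMLParser


class ParagraphCollector(HTMLParser):
    """Collect the visible text of every <p> element (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []   # finished paragraphs, in document order
        self._in_p = False     # True while inside a <p> element
        self._chunks = []      # text pieces of the paragraph being read

    def _flush(self):
        # Join the pieces and collapse runs of whitespace into single spaces.
        text = " ".join("".join(self._chunks).split())
        if text:
            self.paragraphs.append(text)
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            if self._in_p:
                # A new <p> implicitly closes an unclosed one.
                self._flush()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p" and self._in_p:
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._chunks.append(data)


def text_from_html(html):
    """Return all <p> text joined by blank lines (assumption: body is in <p>)."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    return "\n\n".join(collector.paragraphs)
```

If the article object exposes the downloaded page as `article.html`, as described in this thread, something like `text_from_html(article.html)` could serve as a fallback when `text` comes back empty. It will also pick up captions and other non-article paragraphs, so the result needs checking.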
On the subject of bbc and dates: why does the bbc article now have the date prepended to the _text string? e.g. Published\n\n8 March\n\n (source: https://www.bbc.co.uk/news/uk-england-london-68511760)
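Until the cause of the date prefix is understood, one possible workaround is to strip it after parsing. The sketch below assumes the prefix always has the shape reported in this comment (the word "Published", a blank line, one short date line, a blank line); `strip_published_prefix` is a hypothetical helper and not part of newspaper.

```python
import re

# Leading "Published", a blank line, one short line (the date), a blank line.
# The 40-character cap is an arbitrary guard against eating a real paragraph.
_PUBLISHED_PREFIX = re.compile(r"\APublished\n\n[^\n]{1,40}\n\n")


def strip_published_prefix(text):
    """Remove a leading 'Published / date' block if present; otherwise return text unchanged."""
    return _PUBLISHED_PREFIX.sub("", text, count=1)
```

Text that does not start with this exact pattern is returned untouched, so the helper should be safe to apply to articles from other sites.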
Interesting about playwright and textise. The content for bbc seems to be in the main page render as well as attached via some nonce window object, e.g.
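For pages that embed their content as JSON assigned to a property of `window` inside a script tag, a generic way to read it without running JavaScript might look like the sketch below. The function `extract_window_json` and the variable name passed to it are hypothetical; which variable bbc actually uses is not stated here and would need to be checked in the page source.

```python
import json


def extract_window_json(html, var_name):
    """Return the JSON value assigned to window.<var_name> in html, or None.

    Hypothetical helper: var_name is whatever the page source actually uses.
    """
    marker = "window." + var_name
    start = html.find(marker)
    if start == -1:
        return None
    # Find the '=' of the assignment that follows the variable name.
    eq = html.find("=", start + len(marker))
    if eq == -1:
        return None
    # raw_decode does not skip leading whitespace, so do it here.
    pos = eq + 1
    while pos < len(html) and html[pos].isspace():
        pos += 1
    try:
        # raw_decode parses one JSON value and ignores whatever follows it.
        value, _ = json.JSONDecoder().raw_decode(html, pos)
    except json.JSONDecodeError:
        return None
    return value
```

Using `raw_decode` instead of a regular expression avoids having to guess where the JSON ends, which matters when the payload itself contains braces or semicolons.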
First please check that it is really an issue with the library, and not some special case of website:
[ ] There is no paywall
[ ] You do not have to be logged in to see the articles
[ ] You tried using a common browser user agent in your configuration / call
[ ] The website is not in the list of well known problematic sites
Your report as follows:
Website that does not parse correctly:
Some sample urls that I have tried
The exact code I used to test this articles/website
What parts of the article are missing / not parsed correctly
[ ] Text Content
Other information, remarks, messages, etc:
It was working until a few days ago. I am using the package with version 0.9.2