[SITES] https://www.bbc.com #625

Sriram629009746 · 2024-03-18T10:23:41Z

First please check that it is really an issue with the library, and not some special case of website:

[ ] There is no paywall
[ ] You do not have to be logged in to see the articles
[ ] You tried using a common browser user agent in your configuration / call
[ ] The website is not in the list of well known problematic sites

Your report as follows:

Website that does not parse correctly:

https://www.bbc.com

Some sample urls that I have tried

https://www.bbc.com/news/world-australia-67832905
https://www.bbc.com/news/business-67470876

The exact code I used to test these articles / this website

import newspaper

url = "https://www.bbc.com/news/business-67470876"
result = newspaper.article(url)
print(result.text)
# the text is empty

What parts of the article are missing / not parsed correctly:
[ ] Text Content

Other information, remarks, messages, etc:
It was working until a few days ago. I am using version 0.9.2 of the package.

AndyTheFactory · 2024-03-18T10:33:51Z

Hi there,

Make sure that you are not blocked by the BBC. Try:

import requests
response = requests.get("https://www.bbc.com/news/business-67470876")
print(response.status_code)
print(response.text)

Also check whether the article object has any content in its html property.

Can you also try the same code with v0.9.3?

Sriram629009746 · 2024-03-18T15:40:06Z

Thank you for responding.

I am not blocked by the BBC. Most other fields are non-empty, like title, authors and even html, which seems to contain the text of the article. Only the 'text' field in the output, which used to be populated, is now empty.

I noticed that the website UI of the BBC has changed. I think that could be the reason for this issue.

I am unable to import newspaper after installing v0.9.3. There seems to be some dependency issue, which I am trying to figure out. But I don't think this could be the problem.

AndyTheFactory · 2024-03-18T17:37:25Z

> I am unable to import newspaper after installing v0.9.3. There seems to be some dependency issue, which I am trying to figure out. But I don't think this could be the problem.

What's the error? I did in fact change how dependencies are installed.

> I am not blocked by the BBC. Most other fields are non-empty, like title, authors and even html, which seems to contain the text of the article. Only the 'text' field in the output, which used to be populated, is now empty.

You are right, there seems to be an issue with bbc.com. I will investigate it.

AndyTheFactory · 2024-03-18T18:14:34Z

Yeah, there is a problem. It seems that bbc.com is now rendered dynamically: the page is constructed with JavaScript after it loads. Here you can see that there are no text elements to render without JavaScript: https://www.textise.net/showText.aspx?strURL=https%253A//www.bbc.com/news/business-67470876

Quick Fixes:

1. Easiest: replace bbc.com with bbc.co.uk. https://www.bbc.co.uk/news/business-67470876 works just fine.
2. A little more cumbersome: use playwright to render the webpage before parsing it. Here is an example

Anyway, I will think about alternative solutions.
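The domain swap suggested as the easiest quick fix can be done in code before the URL is handed to the library. The following is only a sketch using the standard library; the helper name `to_bbc_co_uk` is hypothetical and not part of newspaper, and it assumes the same article path exists on both domains, as it does for the sample URL in this thread.

```python
from urllib.parse import urlsplit, urlunsplit


def to_bbc_co_uk(url):
    """Rewrite a bbc.com URL to the equivalent bbc.co.uk URL.

    Hypothetical helper: URLs on other hosts are returned unchanged.
    Any user:password part of the URL is dropped, which is fine for
    public article links.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    if host == "bbc.com" or host.endswith(".bbc.com"):
        # Keep any subdomain prefix such as "www." and swap the suffix.
        new_host = host[: -len("bbc.com")] + "bbc.co.uk"
        netloc = new_host if parts.port is None else f"{new_host}:{parts.port}"
        return urlunsplit(parts._replace(netloc=netloc))
    return url


print(to_bbc_co_uk("https://www.bbc.com/news/business-67470876"))
```

The rewritten URL can then be passed to `newspaper.article(...)` exactly as in the original report.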

Sriram629009746 · 2024-03-18T18:43:29Z

The html component of the response seems to have the text content although not in a contiguous paragraph form. Maybe that is something to look at.

I tried v0.9.3 on google colab. Regarding the error while importing newspaper, this is what I got when I did pip install:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
bigframes 0.24.0 requires pandas<2.1.4,>=1.5.0, but you have pandas 2.2.1 which is incompatible.
google-colab 1.0.0 requires pandas==1.5.3, but you have pandas 2.2.1 which is incompatible.
Successfully installed feedparser-6.0.11 newspaper4k-0.9.3 numpy-1.26.4 pandas-2.2.1 requests-file-2.0.0 sgmllib3k-1.0.0 tldextract-5.1.1 tzdata-2024.1

When I try to import after installation, I get this error:

2dareis2do · 2024-03-27T14:19:53Z

On the subject of the BBC and dates: why does a BBC article now prepend the date to the _text string?

e.g.

Published\n\n8 March\n\n

Source: https://www.bbc.co.uk/news/uk-england-london-68511760
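Until the cause is understood, the prefix can be stripped after parsing. This is only a workaround sketch: the helper name `strip_published_prefix` is hypothetical, and it assumes the prefix always has the shape shown above ("Published", a blank line, one date line, a blank line) at the very start of the text.

```python
import re

# Assumed shape of the unwanted prefix, anchored at the start of the text:
# "Published", blank line, a single date line, blank line.
_PUBLISHED_PREFIX = re.compile(r"\APublished\n\n[^\n]+\n\n")


def strip_published_prefix(text):
    """Remove a leading 'Published / <date>' block if present."""
    return _PUBLISHED_PREFIX.sub("", text, count=1)


sample = "Published\n\n8 March\n\nThe article body starts here."
print(strip_published_prefix(sample))
```

Text that does not start with this exact pattern is returned unchanged, so the helper is safe to apply to articles from other sites.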

2dareis2do · 2024-03-27T14:33:41Z

Interesting about playwright and textise.

Content for the BBC seems to be in the main page render, as well as attached to a window object inside a script tag with a nonce attribute.

e.g.

<script nonce>
window.__INITIAL_DATA__={}
</script>
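If the article data really is embedded this way, it could be pulled out of the downloaded html without rendering the page. The sketch below uses only the standard library. The function name `extract_initial_data` is hypothetical, and the handling of a JSON-encoded string on the right-hand side is an assumption; the structure of the decoded data is not covered in this thread, so the sketch stops at returning it.

```python
import json
import re

# Matches the assignment shown above, up to the start of the value.
_ASSIGNMENT = re.compile(r"window\.__INITIAL_DATA__\s*=\s*")


def extract_initial_data(html):
    """Return the decoded value assigned to window.__INITIAL_DATA__, or None."""
    match = _ASSIGNMENT.search(html)
    if match is None:
        return None
    try:
        # raw_decode parses one JSON value and ignores whatever follows it
        # (for example a trailing ";" or the closing </script> tag).
        value, _ = json.JSONDecoder().raw_decode(html, match.end())
    except json.JSONDecodeError:
        return None
    if isinstance(value, str):
        # Assumption: the value may be a JSON document wrapped in a string.
        try:
            value = json.loads(value)
        except json.JSONDecodeError:
            return None
    return value


page = '<script nonce>\nwindow.__INITIAL_DATA__={"example": [1, 2]}\n</script>'
print(extract_initial_data(page))
```

The html to feed in would be the same `html` property of the article object mentioned earlier in the thread.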

Sriram629009746 added the sites not working label Mar 18, 2024
