Getting Older News Articles #580

AndyTheFactory · 2023-10-24T20:01:24Z

Issue by PaulKMandal
Wed Sep 13 18:19:42 2023
Originally opened as codelucas/newspaper#973

Hello, seven years ago this was posted: codelucas/newspaper#245

I have a problem that requires me to scrape a large corpus of titles from 2013-2019 from various news sources. Ideally I would like to scrape 10 articles per date through this date range. The issue that I have is that newspaper only pulls the latest results. Does anyone have any insights on how to achieve this? Thanks!

AndyTheFactory · 2023-10-24T20:01:25Z

Comment by banagale
Wed Sep 13 20:20:37 2023

Paul, out of curiosity, can you share why you’re trying to use this package instead of scrapy?

AndyTheFactory · 2023-10-24T20:01:27Z

Comment by johnbumgarner
Mon Sep 18 13:01:33 2023

Paul, can you share an example of what you are trying to do?

AndyTheFactory · 2023-10-24T20:01:29Z

Comment by PaulKMandal
Thu Sep 21 19:46:04 2023

I apologize for the delay.

> Paul, out of curiosity, can you share why you’re trying to use this package instead of scrapy?

Because newspaper has very robust functionality for scraping news articles.

> Paul, can you share an example of what you are trying to do?

Ideally, I want to be able to specify a date and scrape the news articles from that date. I wrote an implementation that pulls articles from Archive.org. My implementation is available here and it works as intended, but Archive.org can be slow and often times out.
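For readers following along, the date-based approach described above might look roughly like the sketch below. It is not the implementation linked in this comment; it assumes the public Wayback Machine CDX endpoint and its `url`, `matchType`, `from`, `to`, `output`, `filter`, `collapse`, and `limit` parameters, and the helper names (`build_cdx_query`, `snapshot_urls`, `titles_for_day`) are hypothetical. Note that it selects pages by capture date, which is not necessarily the publication date, and that a domain-wide query returns every captured URL, not only article pages.

```python
import json
from datetime import date
from urllib.parse import urlencode
from urllib.request import urlopen

# Public Wayback Machine CDX search endpoint (assumed reachable).
CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"


def build_cdx_query(site, day, limit=10):
    """Build a CDX query URL for captures of `site` made on `day`."""
    stamp = day.strftime("%Y%m%d")
    params = {
        "url": site,
        "matchType": "domain",       # include subdomains and all paths
        "from": stamp,
        "to": stamp,
        "output": "json",
        "filter": "statuscode:200",  # skip redirects and errors
        "collapse": "urlkey",        # one capture per distinct URL
        "limit": limit,
    }
    return CDX_ENDPOINT + "?" + urlencode(params)


def snapshot_urls(rows):
    """Turn CDX JSON rows (first row is the header) into replay URLs."""
    if not rows:
        return []
    header, *records = rows
    ts = header.index("timestamp")
    orig = header.index("original")
    return [f"https://web.archive.org/web/{r[ts]}/{r[orig]}" for r in records]


def titles_for_day(site, day, limit=10, timeout=30):
    """Fetch up to `limit` archived pages for `day` and return their titles."""
    from newspaper import Article  # imported lazily; needs network access

    with urlopen(build_cdx_query(site, day, limit), timeout=timeout) as resp:
        rows = json.load(resp)

    titles = []
    for url in snapshot_urls(rows):
        article = Article(url)
        article.download()
        article.parse()
        titles.append(article.title)
    return titles
```

Calling `titles_for_day("example.com", date(2015, 3, 1))` would then go through the archive for that single day; looping over a date range is left to the caller, ideally with a pause between requests.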

AndyTheFactory · 2023-10-24T20:01:31Z

Comment by johnbumgarner
Sun Sep 24 21:07:53 2023

So you want to search the wayback archives. I wrote an example on this in my overview document for NewsPaper. If you provide me with some more details, I will add another example on searching a resource (website) for random articles from a given date or dates.

FYI the wayback archives are always slow for scraping. There might be a way to gain some performance, but that would require testing.
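One common way to cope with the slowness and timeouts mentioned in this thread is to retry failed requests with exponential backoff. The sketch below is an untested-against-the-archive illustration, not something from the overview document; `with_retries` and its parameters are hypothetical names, and it assumes the wrapped callable raises `OSError` (which covers `TimeoutError` and `urllib.error.URLError`) on network failure.

```python
import time


def with_retries(fetch, attempts=4, base_delay=2.0, sleep=time.sleep):
    """Call `fetch()` until it succeeds or `attempts` tries are used up.

    Waits base_delay, 2 * base_delay, 4 * base_delay, ... seconds between
    tries. The last failure is re-raised so the caller can log or skip it.
    """
    for attempt in range(attempts):
        try:
            return fetch()
        except OSError:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

A request could then be wrapped as `with_retries(lambda: urlopen(url, timeout=30).read())`. Whether this actually improves throughput against the wayback archives would still need testing.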

AndyTheFactory added the documentation Improvements or additions to documentation label Oct 25, 2023
