An example of how to scrape a newspaper's website using Python, requests and bs4.
This small project shows how easily one can scrape a newspaper website to retrieve articles and their content, writing the results to a CSV file with one article per line. The site used here is freely accessible; you would use selenium should log-in actions be required.
The approach lends itself to feeding downstream Natural Language Processing applications, especially if combined with the Mediacloud API endpoint for identifying article URLs of interest.
It is assumed that Git and Python are installed (you can change the Python version in the Pipfile). To set up, run:
git clone [email protected]:marquesafonso/scraping101_py.git
pip install pipenv
pipenv install
Here we grab 5 articles from https://24.sapo.pt/, place their URLs in a list and inspect the HTML using the F12 key in the browser. See the image below for an example:
This allows us to investigate the elements we wish to scrape (a sketch of how they can be extracted follows the list):
- Label: The category of the article.
- Title: The title of the article.
- Lead: The lead of the article.
- Author: The author of the article.
- Date: The date the article was published on.
- Body: The article text itself.
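As a rough sketch of how these elements might be pulled out of a page once it is parsed with bs4, the snippet below uses hypothetical tag names and CSS classes (e.g. `article-title`, `article-body`); they are placeholders, not the actual selectors used by https://24.sapo.pt/, and should be replaced with whatever your F12 inspection reveals.

```python
from bs4 import BeautifulSoup

def parse_article(html: str) -> dict:
    """Extract the fields of interest from one article page.

    The tag names and class names below are hypothetical placeholders;
    substitute the selectors you find when inspecting the page with F12.
    """
    soup = BeautifulSoup(html, "html.parser")

    def text_or_none(element):
        # Return stripped text if the element was found, otherwise None.
        return element.get_text(strip=True) if element else None

    return {
        "label": text_or_none(soup.find("span", class_="article-label")),
        "title": text_or_none(soup.find("h1", class_="article-title")),
        "lead": text_or_none(soup.find("p", class_="article-lead")),
        "author": text_or_none(soup.find("span", class_="article-author")),
        "date": text_or_none(soup.find("time")),
        "body": text_or_none(soup.find("div", class_="article-body")),
    }
```

Each dictionary returned this way maps directly onto one row of the output CSV.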
The requirements of the project are:
- requests: Allows us to make HTTP requests to the URLs we wish to scrape, returning the HTML as the response.
- bs4: Allows us to convert the responses into BeautifulSoup objects, which come with methods such as find() that let us efficiently parse the HTML and retrieve the text we are looking for.
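To make the division of labour between the two libraries concrete, here is a minimal sketch of how they fit together; the URL is just an illustrative placeholder, not a real article address.

```python
import requests
from bs4 import BeautifulSoup

# Placeholder URL: use any article URL collected from https://24.sapo.pt/
url = "https://24.sapo.pt/noticias/example-article"

# requests fetches the page; raise_for_status() surfaces HTTP errors early.
response = requests.get(url, timeout=10)
response.raise_for_status()

# bs4 turns the raw HTML into a soup object we can query with find().
soup = BeautifulSoup(response.text, "html.parser")
title_tag = soup.find("h1")
print(title_tag.get_text(strip=True) if title_tag else "No <h1> found")
```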
Two additional functions conveniently convert the date strings into a general date/time pattern (long time) - see https://docs.microsoft.com/en-us/dotnet/standard/base-types/standard-date-and-time-format-strings for more info. This makes the date string ready for more interesting uses and for loading into a database if needed.
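As an illustration of that normalisation step, the sketch below assumes a Portuguese date string of the form "3 de maio de 2022 às 14:05" (an assumption about how the site presents dates, not a guarantee) and converts it into a day/month/year plus long-time value.

```python
from datetime import datetime

# Mapping of Portuguese month names to month numbers (for illustration only).
PT_MONTHS = {
    "janeiro": 1, "fevereiro": 2, "março": 3, "abril": 4,
    "maio": 5, "junho": 6, "julho": 7, "agosto": 8,
    "setembro": 9, "outubro": 10, "novembro": 11, "dezembro": 12,
}

def normalise_date(raw: str) -> str:
    """Convert e.g. '3 de maio de 2022 às 14:05' (an assumed input format)
    into a general date/time (long time) style string."""
    parts = raw.lower().replace("às", "").split()
    # Expected shape after the replace: ['3', 'de', 'maio', 'de', '2022', '14:05']
    day, month, year = int(parts[0]), PT_MONTHS[parts[2]], int(parts[4])
    hour, minute = (int(x) for x in parts[5].split(":"))
    return datetime(year, month, day, hour, minute).strftime("%d/%m/%Y %H:%M:%S")

print(normalise_date("3 de maio de 2022 às 14:05"))  # 03/05/2022 14:05:00
```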
Feel free to play around with the code and adapt it to your needs. To test it, simply run:
pipenv run python scraper.py --outfile 'output/ex_scrape.csv'
And check out the output folder to see the results!