This repository shows an example integration of the following tools:
- Scrapy, scraping framework for Python
- Playwright, headless cross-platform browser
- Scrapy rotating proxies
- TOR network proxies
- Playwright stealth
- Beautiful Soup
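As a rough illustration of how these tools can be wired together, the sketch below shows a possible Scrapy settings module. It is not taken from this repository. It assumes the scrapy-playwright plugin and the scrapy-rotating-proxies package are the integration points, and the proxy host names and port in TOR_PROXIES are hypothetical placeholders for the Tor proxy containers.

```python
# settings.py: illustrative sketch only, not the repository's actual configuration.

# Assumption: the scrapy-playwright plugin is used to let Playwright download pages.
DOWNLOAD_HANDLERS = {
    "http": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
    "https": "scrapy_playwright.handler.ScrapyPlaywrightDownloadHandler",
}

# scrapy-playwright needs the asyncio-based Twisted reactor.
TWISTED_REACTOR = "twisted.internet.asyncioreactor.AsyncioSelectorReactor"

# Run a headless Chromium instance.
PLAYWRIGHT_BROWSER_TYPE = "chromium"
PLAYWRIGHT_LAUNCH_OPTIONS = {"headless": True}

# Hypothetical endpoints for the Tor proxy containers; adjust hosts and ports
# to match whatever your compose.yaml actually exposes.
TOR_PROXIES = [
    "http://tor-proxy-1:8888",
    "http://tor-proxy-2:8888",
]

# Assumption: the scrapy-rotating-proxies package rotates over this list.
ROTATING_PROXY_LIST = TOR_PROXIES

# Middleware priorities as suggested in the scrapy-rotating-proxies documentation.
DOWNLOADER_MIDDLEWARES = {
    "rotating_proxies.middlewares.RotatingProxyMiddleware": 610,
    "rotating_proxies.middlewares.BanDetectionMiddleware": 620,
}
```

With settings along these lines, a request is rendered by Playwright only when it sets meta={"playwright": True}. Playwright-rendered requests generally take their proxy from the browser launch or context options, so the rotating middleware would mainly affect requests that Scrapy downloads itself.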
The targets used in this demo are the following authorized playgrounds:
The following websites are also used to get more information about the proxies and the web browser:
Refer to the documentation here for more details.
Clone the repository with:
git clone https://github.com/dmg0345/scrapy_tor_playwright_demo
Ensure the GitHub file with the relevant environment variables exists as expected in the compose.yaml file, and that the correct paths are set in the manage.ps1 file for your environment. Afterwards, find the base Docker image for the development container on Docker Hub.
To develop using devcontainers and Visual Studio Code:
docker pull dmg00345/scrapy_tor_playwright_demo:latest
docker pull pickapp/tor-proxy:latest
./manage.ps1 run
To generate a release follow the steps below:
- Create a release branch from the develop branch, e.g. release/X.Y.Z.
- Update the version in the conf.py file and in the pyproject.toml file.
- Create a pull request from the release branch to master with the changes, titled Release X.Y.Z.
- Once merged into master, create the release and tag from GitHub, and check that the production workflow passes for deployment.
- Delete the release/X.Y.Z branch.