Web-scraper that bypasses Cloudflare

This scraper extracts data from 538 URLs on a research platform called CABI Digital Library. The website uses anti-bot mechanisms and services such as Cloudflare to obstruct web data extractors from saving the world.

This scraper was built with interactions such as mouse movements and scrolling in order to mimic a human. It can also change the user agent for every request in order to pose as a different user.
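As a rough illustration of the two ideas above, here is a minimal sketch using only the Python standard library. It is not taken from main.py: the `USER_AGENTS` pool and the helpers `pick_user_agent`, `mouse_path`, and `human_pause` are hypothetical names chosen for this example, and the actual scraper may drive the browser quite differently.

```python
import random

# Hypothetical pool of user-agent strings; the project's real list may differ.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def pick_user_agent(previous=None, rng=random):
    """Return a random user agent, avoiding the one used for the previous request."""
    candidates = [ua for ua in USER_AGENTS if ua != previous] or USER_AGENTS
    return rng.choice(candidates)


def mouse_path(start, end, steps=20, jitter=3.0, rng=random):
    """Return jittered (x, y) points from start to end, roughly like a hand-moved cursor."""
    (x0, y0), (x1, y1) = start, end
    points = []
    for i in range(1, steps + 1):
        t = i / steps
        # Ease-in-out so the cursor speeds up and then slows down near the target.
        eased = t * t * (3 - 2 * t)
        # No jitter on the final point so the cursor lands exactly on the target.
        dx = rng.uniform(-jitter, jitter) if i < steps else 0.0
        dy = rng.uniform(-jitter, jitter) if i < steps else 0.0
        points.append((x0 + (x1 - x0) * eased + dx, y0 + (y1 - y0) * eased + dy))
    return points


def human_pause(low=0.4, high=1.8, rng=random):
    """Return a random delay in seconds to sleep between actions."""
    return rng.uniform(low, high)
```

In a setup like this, each request would call `pick_user_agent` with the previously used value, and the points from `mouse_path` would be fed one by one to whatever browser automation tool performs the actual cursor movement, sleeping for `human_pause()` seconds between actions.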

Without this scraper

With this scraper

SETUP

Install Python
Open your terminal and navigate to the root directory of this project
Create your virtual environment

python -m venv venv

Activate your virtual environment (Windows only)

venv\Scripts\activate

Install necessary packages

pip install -r requirements.txt

Run the script

python main.py
