Polish-Cave-Data-Scraper is a Python tool that scrapes data on Polish caves from the Central Geological Database of Polish Caves (CBDG), managed by the Polish Society for Friends of Earth Sciences (PTPNoZ). For each cave it gathers standardized information, including geolocation, morphology, environmental data, historical descriptions, and graphic attachments such as plans, sections, and photographs. The resulting dataset is intended for researchers, conservationists, and speleologists interested in the geological and environmental aspects of Polish caves.
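To make the kinds of information listed above concrete, here is a minimal sketch of how one scraped cave entry could be represented in Python. This is an illustration only: the class name `CaveRecord` and every field name are assumptions, not the schema the scraper actually produces.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class CaveRecord:
    """Hypothetical container for one cave entry (all names are illustrative)."""

    name: str
    latitude: Optional[float] = None    # geolocation
    longitude: Optional[float] = None   # geolocation
    length_m: Optional[float] = None    # morphology
    environment: str = ""               # environmental data
    history: str = ""                   # historical description
    # Paths or URLs of plans, sections, and photographs.
    attachments: List[str] = field(default_factory=list)


record = CaveRecord(name="Example Cave", attachments=["plan.png"])
```

Using `default_factory` for `attachments` gives every record its own list, so appending an attachment to one record does not affect another.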
- Python 3.8 or higher
- Poetry (Python package manager)
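Both prerequisites can be checked from Python before installing. The snippet below is a minimal sketch and not part of the repository; the helper name `check_prerequisites` is hypothetical, and its two parameters exist only so the check can be exercised without changing the real environment.

```python
import shutil
import sys

MIN_PYTHON = (3, 8)  # minimum version listed in the prerequisites


def check_prerequisites(version=None, which=shutil.which):
    """Return a list of problems found; an empty list means everything is in place."""
    version = tuple(version or sys.version_info[:2])
    problems = []
    if version < MIN_PYTHON:
        problems.append(
            f"Python {MIN_PYTHON[0]}.{MIN_PYTHON[1]} or higher is required, "
            f"found {version[0]}.{version[1]}"
        )
    # shutil.which returns None when the executable is not on PATH.
    if which("poetry") is None:
        problems.append("Poetry was not found on PATH")
    return problems
```

Calling `check_prerequisites()` with no arguments inspects the running interpreter and the current `PATH`.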
- First, ensure you have Poetry installed on your system. If not, install it using:
curl -sSL https://install.python-poetry.org | python3 -
- Clone the repository:
git clone https://github.com/yourusername/polish-cave-data-scraper.git
cd polish-cave-data-scraper
- Install project dependencies using Poetry:
poetry install
To ensure a clean environment for the project:
- Remove any existing virtual environment (if present):
poetry env remove python
- Clear Poetry's cache (optional):
poetry cache clear . --all
- Create a new virtual environment and install dependencies:
poetry install
The scraper consists of two main scripts that should be run in sequence:
- First, run the data-fetching script:
poetry run python fetch.py
This script collects raw data from the CBDG database.
- Then, run the parsing script:
poetry run python parse.py
This script processes the collected data into a structured format.
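The two steps above can also be chained from a small Python wrapper that stops if the first step fails. This is a minimal sketch under stated assumptions, not a script shipped with the project: the names `PIPELINE` and `run_pipeline` are hypothetical, and it assumes both scripts signal failure through a non-zero exit code.

```python
import subprocess

# The two commands from the steps above, in the order they must run.
PIPELINE = [
    ["poetry", "run", "python", "fetch.py"],
    ["poetry", "run", "python", "parse.py"],
]


def run_pipeline(commands=PIPELINE, run=subprocess.call):
    """Run each command in order and stop at the first failure.

    Returns the list of commands that finished with exit code 0.
    `run` is injectable so the function can be tried without Poetry installed.
    """
    completed = []
    for command in commands:
        exit_code = run(command)  # subprocess.call returns the exit code
        if exit_code != 0:
            # Parsing needs the fetched data, so later steps are skipped.
            break
        completed.append(command)
    return completed
```

Calling `run_pipeline()` from the repository root runs both scripts; comparing the returned list with `PIPELINE` shows whether every step completed.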