Web Crawler is a Python-based command-line tool for crawling web pages and extracting information about links and page ranks. It supports crawling HTTP and HTTPS URLs up to a specified depth limit.
- Crawls web pages and processes links up to a specified depth.
- Calculates a page rank based on the ratio of same-domain links to total links on a page (see the sketch after this feature list).
- Supports HTTP and HTTPS URLs.
- Avoids re-downloading pages already visited during the current execution.
- Customizable logging and verbose output.
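The exact rank computation lives in the source, but the idea can be sketched as the number of same-domain links divided by the total number of links on the page. A minimal illustrative sketch, where the function and parameter names are assumptions rather than the project's actual API:

```python
from urllib.parse import urlparse

def page_rank(page_url, links):
    """Illustrative rank: the fraction of links that stay on the page's own domain."""
    if not links:
        return 0.0
    domain = urlparse(page_url).netloc
    same_domain = sum(1 for link in links if urlparse(link).netloc == domain)
    return same_domain / len(links)

# 2 of the 3 links point back to example.com, so the rank is ~0.67.
print(page_rank("https://example.com/a",
                ["https://example.com/b", "https://example.com/c", "https://other.org/x"]))
```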
- Clone the repository:
git clone https://github.com/liorzam/web-crawler.git
cd web-crawler
- Install the required dependencies using Poetry:
poetry install
- <root_url>: The URL of the starting web page for crawling.
- -d or --depth_limit: (Optional) The recursion depth limit for crawling (default: 3).
- -l or --page_link_limit: (Optional) The maximum number of page href links to collect (default: 10).
$ # crawler <root_url> [-d <depth_limit>] [-l <page_link_limit>]
$ crawler <root_url> --depth_limit 3 --page_link_limit 10
$ crawler <root_url> -d 3 -l 10
The crawler supports different levels of logging verbosity. By default, the log level is set to 'info', providing essential information about the crawling process. You can increase the verbosity to 'debug' level to get more detailed logs, which can be helpful for debugging.
To enable 'debug' level logging, set the LOG_LEVEL environment variable to 'debug' before running the crawler:
LOG_LEVEL=debug poetry run crawler https://example.com
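Under the hood this presumably maps onto Python's standard logging module; a minimal sketch of how a LOG_LEVEL environment variable could be wired up, assuming the project uses logging (everything except the variable name is illustrative):

```python
import logging
import os

def configure_logging():
    """Read LOG_LEVEL (default 'info') and configure the root logger accordingly."""
    level_name = os.environ.get("LOG_LEVEL", "info").upper()
    level = getattr(logging, level_name, logging.INFO)
    logging.basicConfig(level=level,
                        format="%(asctime)s %(levelname)s %(name)s: %(message)s")

configure_logging()
logging.getLogger("crawler").debug("Only shown when LOG_LEVEL=debug")
```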
The crawler can also be run with Docker Compose:
$ docker-compose run crawler -d 3 -l 10 <root_url>
$ python main.py -h
usage: main.py [-h] [-d DEPTH_LIMIT] [-l PAGE_LINK_LIMIT] root_url
Web Crawler
positional arguments:
  root_url              The root URL to start crawling from

optional arguments:
  -h, --help            show this help message and exit
  -d DEPTH_LIMIT, --depth_limit DEPTH_LIMIT
                        The recursion depth limit (default: 3)
  -l PAGE_LINK_LIMIT, --page_link_limit PAGE_LINK_LIMIT
                        Page href link limit (default: 10)
Replace <root_url> with the URL of the root page to start crawling from, and <depth_limit> with a positive integer representing the recursion depth limit.
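The help text above is ordinary argparse output; a minimal sketch of a parser that would produce options like these, with names and defaults taken from the help text (the function name and module layout are assumptions):

```python
import argparse

def build_parser():
    """Build a parser matching the documented CLI options."""
    parser = argparse.ArgumentParser(description="Web Crawler")
    parser.add_argument("root_url", help="The root URL to start crawling from")
    parser.add_argument("-d", "--depth_limit", type=int, default=3,
                        help="The recursion depth limit (default: 3)")
    parser.add_argument("-l", "--page_link_limit", type=int, default=10,
                        help="Page href link limit (default: 10)")
    return parser

args = build_parser().parse_args(["https://example.com", "-d", "2"])
print(args.root_url, args.depth_limit, args.page_link_limit)
```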
The crawler will output a TSV file containing URL, depth, and rank information.
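A minimal sketch of how such a TSV could be written with the standard csv module, assuming one row per crawled URL and url/depth/rank columns in that order (the column order, sample values, and file name are assumptions):

```python
import csv

# Hypothetical results: (url, depth, rank) tuples collected during a crawl.
results = [
    ("https://example.com", 0, 0.67),
    ("https://example.com/about", 1, 0.50),
]

with open("crawl_results.tsv", "w", newline="") as fh:
    writer = csv.writer(fh, delimiter="\t")
    writer.writerow(["url", "depth", "rank"])
    writer.writerows(results)
```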
- Prepare Your Package - Before publishing your package, make sure your project structure is well-organized, and you have defined the necessary metadata in your pyproject.toml file. Ensure that your package includes the required files and dependencies.
- Increment Version - If you're updating your package, consider incrementing the version number in your pyproject.toml file to indicate the new version. You can update the version field under [tool.poetry].
- Build and Publish the Package - Run the command below; the --build option tells Poetry to build the distribution files before publishing.
poetry publish --build
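If you prefer not to edit pyproject.toml by hand, Poetry can bump the version for you before publishing:
$ poetry version patch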
This project uses the following dependencies:
- requests: For fetching web pages.
- beautifulsoup4: For HTML parsing.
- argparse (Python standard library): For command-line argument parsing.
You can find the complete list of dependencies in the pyproject.toml file.
This project is licensed under the MIT License - see the LICENSE file for details.
Feel free to contribute to this project by opening issues or pull requests. Happy crawling!!