This repository hosts a powerful web crawler specifically designed for monitoring activities on the hidden web. It leverages Docker Compose to seamlessly orchestrate multiple services, including MongoDB for data storage, Redis for caching and task management, and multiple Tor containers to ensure robust anonymity and secure communication. This setup provides a scalable and efficient framework for collecting and analyzing hidden web data while prioritizing privacy and security. |
1. Docker-Based Deployment: Quick setup and deployment using Docker. 2. Advanced Search Functionality: Provides comprehensive search capabilities with various filters and options to refine search results. 3. Data Visualization: Generates visual representations of the data, making it easier to analyze search results. 4. Customizable Search Parsers: Allows for integrating custom parsers to refine data extraction from specific websites. 5. Integrated Machine Learning Models: Incorporates NLP and machine learning models to provide search relevance, content categorization, and detection of specific data patterns.
|
Ensure you have the following installed on your system:
- Python
- Docker
- Docker Compose
git clone https://github.com/msmannan00/Orion-Crawler.git
cd Orion-Crawler
docker-compose up --build
This command will build and start the following services:
API Service (api): The main webcrawler service that runs according to the predefined settings.
MongoDB (mongo): Database for storing crawled data.
Redis (redis_server): In-memory data store for caching and task queuing.
Tor Containers (tor-extend-*): Multiple Tor instances to route crawler traffic through different Tor exit nodes.
You can run the webcrawler in two ways:
- Copy app/libs/nltk_data folder to appdata in windows or home directory in linux.
- Navigate to the Orion-Crawler/app/ directory.
- Run the webcrawler directly using:
python main_direct.py
- The webcrawler can also be started using Docker, which utilizes the start_app.sh script:
docker-compose up --build
api/: Contains the webcrawler source code. data/db/: Directory where MongoDB stores data. dockerFiles/: Dockerfiles for building custom images.
Follow the installation steps to set up and run the webcrawler. After starting the services, the crawler will automatically begin monitoring specified dark web URLs through the Tor network, storing data in MongoDB. Redis is used for caching and managing tasks.
Each Tor container is configured to run as a separate instance, routing traffic through different Tor exit nodes. This increases anonymity and reduces the chances of IP bans.
You can scale the number of Tor instances by modifying the docker-compose.yml file and adding more tor-extend-* services as needed.