A search engine and web crawler written in Python.
- Automatic web crawler
- SQLite index
- Automatic keyword extraction
- Automatic description snippet generation
- Keyword lemmatisation
- Frontend search engine
- PageRank algorithm
- robots.txt caching and compliance
- Robots meta tag compliance
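As a rough illustration of the PageRank feature listed above, here is a minimal power-iteration sketch. It is not taken from Netquery's source: the function name `pagerank`, the `links` dictionary layout, the damping factor of 0.85 and the fixed iteration count are all assumptions made for this example.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Return a dict of page -> rank for a small in-memory link graph.

    `links` maps each page to the list of pages it links to.
    All names and defaults here are hypothetical, not Netquery's own.
    """
    # Collect every page that appears as a source or as a link target.
    pages = set(links)
    for targets in links.values():
        pages.update(targets)

    n = len(pages)
    if n == 0:
        return {}

    # Start from a uniform distribution.
    ranks = {page: 1.0 / n for page in pages}

    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread evenly.
        dangling = sum(ranks[p] for p in pages if not links.get(p))
        base = (1.0 - damping) / n + damping * dangling / n
        new_ranks = {page: base for page in pages}

        # Each page passes an equal share of its rank to every target.
        for page, targets in links.items():
            if not targets:
                continue
            share = damping * ranks[page] / len(targets)
            for target in targets:
                new_ranks[target] += share

        ranks = new_ranks

    return ranks
```

Calling `pagerank({"a": ["b", "c"], "b": ["c"], "c": ["a"]})` returns ranks that sum to 1, with `c` ranked above `b` because it receives links from both `a` and `b`. A real index stored in SQLite would need to iterate over the link table rather than an in-memory dictionary.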
Install the requirements:
pip install -r requirements.txt
Netquery consists of two components: the crawler and the search engine. For the search engine to work properly, you must first run the crawler for a few hours to generate the index. This may require some manual fine-tuning of the constants defined in crawler/crawler.py.
Additionally, before running the crawler, ensure that you have authorisation from your network administrator and enough free disk space: the PageRank index can take up a few gigabytes.
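One way to check free disk space before a long crawl is sketched below, using the standard library's `shutil.disk_usage`. The helper name `has_free_space` and the 5 GiB default are assumptions chosen for illustration; Netquery does not document a specific threshold.

```python
import shutil


def has_free_space(path=".", required_gib=5.0):
    """Return True if the disk holding `path` has at least `required_gib` GiB free.

    The 5 GiB default is an arbitrary example value, not a project requirement.
    """
    usage = shutil.disk_usage(path)
    return usage.free >= required_gib * 1024 ** 3


if __name__ == "__main__":
    if not has_free_space():
        print("Warning: less than 5 GiB free; the index may not fit.")
```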
The crawler can run at the same time as the search engine.
To run the search engine: python app.py
To run the crawler: python crawler/crawler.py
This project is licensed under the GNU Affero General Public License.