- Search for specific keywords within a collection of PDF files.
- View matched lines from the PDF files for each search result.
- Sort search results based on the relevance of matches.
- Display search results with a calculated relevance ratio.
- Web interface powered by Flask and a SQLite database.
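The feature list does not say how the relevance ratio is calculated. As a rough illustration only, the sketch below assumes the ratio is the share of a document's lines that contain the keyword. The `relevance` and `rank` helpers are hypothetical names, not functions from this repository.

```python
def relevance(lines, keyword):
    """Return (ratio, matched_lines) for one document.

    Assumption: ratio = matching lines / total lines, case-insensitive.
    """
    needle = keyword.lower()
    matched = [line for line in lines if needle in line.lower()]
    ratio = len(matched) / len(lines) if lines else 0.0
    return ratio, matched


def rank(documents, keyword):
    """Sort a {filename: [lines]} mapping by descending ratio.

    Documents without any matching line are left out of the result.
    """
    results = []
    for name, lines in documents.items():
        ratio, matched = relevance(lines, keyword)
        if matched:
            results.append((name, ratio, matched))
    return sorted(results, key=lambda item: item[1], reverse=True)
```

Each result carries the file name, the ratio, and the matched lines, which is the information the search page is described as displaying.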
- Python 3.x
- Flask
- PyPDF2
- Clone this repository:
  git clone https://github.com/FelixKohlhas/pdf_search.git
  cd pdf_search
- Install the required Python packages:
  pip install -r requirements.txt
- Create the database:
  python generate_db.py <path to pdfs>
- Run the web interface:
  python app.py -f <path to pdfs>
- Open your web browser and navigate to http://localhost:5001 to access the PDF search.
usage: generate_db.py [-h] [-d DATABASE] pdf_folder

Extract text from PDF files and store it in a SQLite database.

positional arguments:
  pdf_folder            Path to the folder containing PDF files

options:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        Path of the database
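To make the indexing step more concrete, here is a minimal sketch of how text could be extracted with PyPDF2 and stored in SQLite. It is not the code in `generate_db.py`: the `lines` table, its columns, the helper names, and the `pdfs.db` default are all assumptions made for illustration.

```python
import sqlite3
from pathlib import Path


def extract_pages(pdf_path):
    """Yield the text of each page of a PDF file."""
    # Imported lazily so the storage helpers work without PyPDF2 installed.
    # Assumes a PyPDF2 release that provides the PdfReader class.
    from PyPDF2 import PdfReader

    reader = PdfReader(str(pdf_path))
    for page in reader.pages:
        # Pages without extractable text (e.g. scans) yield an empty string.
        yield page.extract_text() or ""


def store_pages(conn, filename, pages):
    """Insert one row per non-empty line; returns the number of rows added."""
    # Hypothetical schema, chosen only for this example.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS lines "
        "(filename TEXT, page INTEGER, content TEXT)"
    )
    rows = [
        (filename, page_no, line.strip())
        for page_no, text in enumerate(pages, start=1)
        for line in text.splitlines()
        if line.strip()
    ]
    conn.executemany("INSERT INTO lines VALUES (?, ?, ?)", rows)
    conn.commit()
    return len(rows)


def build_database(pdf_folder, database="pdfs.db"):
    """Index every *.pdf file in pdf_folder; returns the total line count."""
    conn = sqlite3.connect(database)
    total = 0
    for pdf_path in sorted(Path(pdf_folder).glob("*.pdf")):
        total += store_pages(conn, pdf_path.name, extract_pages(pdf_path))
    conn.close()
    return total
```

Storing one row per line keeps the "matched lines" view cheap to produce, at the cost of a larger table than one row per page.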
usage: app.py [-h] [-d DATABASE] [-u URL_PREFIX] [-f FILES] [--port PORT]

Flask web interface to search PDF files by their content.

options:
  -h, --help            show this help message and exit
  -d DATABASE, --database DATABASE
                        Path of the database
  -u URL_PREFIX, --url-prefix URL_PREFIX
                        URL to prefix to relative paths
  -f FILES, --files FILES
                        Directory of PDF files (optional; allows access to
                        the files through webinterface)
  --port PORT           Port to run the Flask app (default: 5001)
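For readers who want to script against these options, the help text above could be produced by an `argparse` definition along the following lines. This is a reconstruction from the help output, not the parser in `app.py`; the `build_parser` name and every default except the port (5001, as documented) are assumptions.

```python
import argparse


def build_parser():
    """Build a parser mirroring the documented app.py options."""
    parser = argparse.ArgumentParser(
        prog="app.py",
        description="Flask web interface to search PDF files by their content.",
    )
    # Assumed default; the help text does not state one.
    parser.add_argument("-d", "--database", default="pdfs.db",
                        help="Path of the database")
    # Assumed default of an empty prefix.
    parser.add_argument("-u", "--url-prefix", default="",
                        help="URL to prefix to relative paths")
    parser.add_argument("-f", "--files", default=None,
                        help="Directory of PDF files (optional; allows access "
                             "to the files through webinterface)")
    # The default port is the one documented in the help text.
    parser.add_argument("--port", type=int, default=5001,
                        help="Port to run the Flask app (default: 5001)")
    return parser


# Example: serve the files in ./pdfs on a non-default port.
args = build_parser().parse_args(["-f", "./pdfs", "--port", "8080"])
```

Note that `argparse` exposes `--url-prefix` as `args.url_prefix`, since dashes in option names become underscores in attribute names.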
Contributions are welcome! If you find any issues or have suggestions for improvements, please open an issue or create a pull request.
This project is licensed under the MIT License. See the LICENSE file for details.