PaperMatch: arXiv Search with Embeddings and Milvus
Backend at embed_arxiv_simpler
This project allows users to search for arXiv papers either by ID or abstract. The search functionality is powered by a machine learning embedding model and Milvus, a vector database. Gradio is used to create a user-friendly web interface for interaction.
See the live demo at papermatch.mitanshu.tech
See the full explanation in the corresponding blog post: mitanshu.tech/posts/papermatch
- Search by Abstract: Convert the abstract into a vector and find similar papers based on cosine similarity.
- Search by ID: Retrieve information directly by arXiv ID.
- Top K Results: Display the top K results from Milvus based on similarity.
- Embedding Model: The embedding model used is mixedbread-ai/mxbai-embed-large-v1 which happens to have these nice properties.
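As a rough illustration of what "find similar papers based on cosine similarity" and "top K results" mean, here is a minimal pure-Python sketch. It is not the project's code: in PaperMatch the search is delegated to Milvus, and the function names, paper IDs, and 3-dimensional toy vectors below are hypothetical stand-ins for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query, corpus, k=3):
    """Return the k (paper_id, score) pairs most similar to the query."""
    scored = [(pid, cosine_similarity(query, vec)) for pid, vec in corpus.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical toy data; real embeddings have far more dimensions.
corpus = {
    "paper-a": [1.0, 0.0, 0.0],
    "paper-b": [0.9, 0.1, 0.0],
    "paper-c": [0.0, 1.0, 0.0],
}
query = [1.0, 0.0, 0.0]
results = top_k(query, corpus, k=2)
print(results)
```

A vector database such as Milvus performs the same ranking with an index, so it does not have to score every stored vector as this brute-force loop does.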
- Clone the repository:
  git clone https://github.com/mitanshu7/PaperMatch.git
  cd PaperMatch
- Create a virtual environment (optional but recommended):
  python -m venv venv
  source venv/bin/activate # On Windows use `venv\Scripts\activate`
- Install the required packages:
  pip install -r requirements.txt
- Set up app.py:
  - If using the API to create embeddings, keep LOCAL=False. Get your key from Mixedbread and paste it into the .env file. See .env.sample for the expected config.
  - Keep FLOAT=True if you want to use float32 embeddings; otherwise the app uses binary embeddings.
- Run the Gradio app:
  python app.py
- Interact with the web interface:
  - Open your web browser and go to http://localhost:7860 to access the Gradio interface.
  - Use the search bar to enter an arXiv ID or abstract and view the search results.
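The FLOAT setting in the setup steps chooses between float32 and binary embeddings. The sketch below shows one common way binary embeddings are derived, assuming sign-based quantization (one bit per dimension, set when the value is positive) and comparison by Hamming distance. Whether PaperMatch quantizes in exactly this way is an assumption, and the helper names and sample vectors are hypothetical.

```python
def binarize(embedding):
    """Quantize a float embedding to packed bits: 1 where the value is > 0."""
    bits = [1 if x > 0 else 0 for x in embedding]
    # Pad with zeros to a multiple of 8 so the bits pack cleanly into bytes.
    bits += [0] * (-len(bits) % 8)
    packed = bytearray()
    for i in range(0, len(bits), 8):
        byte = 0
        for bit in bits[i:i + 8]:
            byte = (byte << 1) | bit  # most significant bit first
        packed.append(byte)
    return bytes(packed)


def hamming_distance(a, b):
    """Number of differing bits between two equal-length byte strings."""
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


# Hypothetical 8-dimensional vectors standing in for real embeddings.
vec_a = [0.3, -0.1, 0.8, -0.5, 0.2, 0.0, -0.7, 0.4]
vec_b = [0.2, 0.1, 0.9, -0.4, -0.3, -0.1, -0.6, 0.5]

bin_a = binarize(vec_a)
bin_b = binarize(vec_b)
print(len(bin_a), hamming_distance(bin_a, bin_b))
```

Each float32 dimension takes 32 bits, while its binary counterpart takes 1 bit, so binary embeddings trade some ranking precision for a much smaller index.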
Here is a basic example of how to use the search feature:
- Search by Abstract:
  - Enter the abstract of the paper in the provided text box.
  - The system will convert it to a vector, query Milvus, and return the most relevant papers.
- Search by ID:
  - Input an arXiv ID directly.
  - Retrieve and display the corresponding paper details.
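Since a single search bar accepts either an arXiv ID or an abstract, the two modes above need some way to be told apart. One possible approach, sketched here under the assumption that input is classified by pattern matching, is to test the text against the two arXiv identifier formats. The function names are hypothetical, and the actual app may route queries differently.

```python
import re

# New-style IDs such as 1706.03762, optionally with a version (1706.03762v5).
NEW_STYLE = re.compile(r"\d{4}\.\d{4,5}(v\d+)?")
# Old-style IDs such as hep-th/9901001 or math.GT/0309136.
OLD_STYLE = re.compile(r"[a-z\-]+(\.[A-Z]{2})?/\d{7}(v\d+)?")


def looks_like_arxiv_id(text):
    """True if the whole (stripped) input matches an arXiv ID pattern."""
    text = text.strip()
    return bool(NEW_STYLE.fullmatch(text) or OLD_STYLE.fullmatch(text))


def route_query(text):
    """Return 'id' for an arXiv identifier, otherwise 'abstract'."""
    return "id" if looks_like_arxiv_id(text) else "abstract"


print(route_query("1706.03762"))
print(route_query("We propose a new architecture based on attention."))
```

An "id" result would lead to a direct lookup of the stored paper, while an "abstract" result would be embedded first and then sent to Milvus as a similarity query.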
- Create the folder if it doesn't already exist:
  mkdir -p ~/.config/systemd/user/
- Create a service file:
  nano ~/.config/systemd/user/papermatch.service
  with the following contents (assuming the miniforge package manager with an environment named papermatch):
[Unit]
Description=PaperMatch App
After=network.target
[Service]
# %h is the systemd specifier for the user's home directory; WorkingDirectory does not expand $USER.
WorkingDirectory=%h/PaperMatch/
ExecStart=/bin/bash -c "source %h/miniforge3/bin/activate papermatch && python app.py"
Restart=always
[Install]
WantedBy=default.target
- Reload systemd:
  systemctl --user daemon-reload
- Start the app:
  systemctl --user start papermatch.service
- Enable the app at startup:
  systemctl --user enable papermatch.service
Feel free to contribute to the project by submitting issues, pull requests, or suggestions.
This project is licensed under the MIT License. See the LICENSE file for details.
For any questions or feedback, please contact [email protected].