Many LangChain chatbot tutorials rely on paid tokens from third-party providers. Our aim is to build a simple document question-answering bot using only open-source tools.
The model architecture is based on a ConversationalRetrievalChain with memory.
We tested with the following data:
- csv: health record data available from Kaggle, where the source text column is 'Transcription'
- pdf: several academic papers on using machine learning for automatic code generation
- The codebase uses pipenv as its environment manager. Install pipenv:
  pip install pipenv
- Navigate to the project directory.
- Run the command:
  pipenv install --ignore-pipfile
  You can of course also install the required packages with your preferred environment manager. The package list is in the Pipfile.
- Make sure you have a directory containing the target documents that you want to ask questions about. At present, only the pdf and csv formats are supported.
- Download an open-source language model. In this code, we used GPT4All. If you use another model, you must change the llm and embeddings imports to the appropriate ones:
  from langchain.llms import GPT4All
  from langchain.embeddings import GPT4AllEmbeddings
  See the LangChain documentation for the available LLMs.
- Create a local_config.py file with a MODEL_PATH variable pointing to your model.
- Navigate to your src directory and run:
  python main.py --directory '<document_path>' --file_type 'csv'
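To make the run command above more concrete, here is a minimal sketch of how a script could accept those two flags and collect the matching documents. It is not the project's actual main.py: only the `--directory` and `--file_type` flags come from the command above, while `parse_args`, `find_documents` and `SUPPORTED_TYPES` are hypothetical names used for illustration.

```python
import argparse
from pathlib import Path

# Formats listed as supported in this README (hypothetical constant name).
SUPPORTED_TYPES = ("csv", "pdf")


def parse_args(argv=None):
    """Parse the two flags used in the run command."""
    parser = argparse.ArgumentParser(
        description="Ask questions about local documents."
    )
    parser.add_argument(
        "--directory", required=True,
        help="Directory holding the target documents.",
    )
    parser.add_argument(
        "--file_type", required=True, choices=SUPPORTED_TYPES,
        help="Document format to load.",
    )
    return parser.parse_args(argv)


def find_documents(directory, file_type):
    """Return the sorted file paths in `directory` with a matching extension."""
    root = Path(directory)
    if not root.is_dir():
        raise NotADirectoryError(f"Not a directory: {root}")
    wanted = f".{file_type}"
    # Compare extensions case-insensitively so 'data.CSV' is also picked up.
    return sorted(
        p for p in root.iterdir()
        if p.is_file() and p.suffix.lower() == wanted
    )
```

A script built this way would call `parse_args()` once at startup and pass `args.directory` and `args.file_type` to `find_documents`. Restricting `--file_type` with `choices` means an unsupported format is rejected before any model is loaded.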
Here are some ideas for improvement, listed in order of priority:
- Performance improvement. Currently, a relatively involved question takes around 10 minutes to answer. Yes, this is totally impractical, but it is a first attempt with open-source tools running on consumer-grade hardware.
- Document reference improvement. For CSVs, the output includes the source rows of the answer, but there appears to be a bug that duplicates rows. We also need to test how this works with multiple csv documents.
- More advanced pdf processing. It would be useful to handle PDFs with charts and tables.
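Until the cause of the row duplication is found, one possible workaround is to de-duplicate the source references before printing them. This is a minimal sketch, not a fix for the underlying bug. It assumes each source reference is a dict with a `source` file path and a `row` index; that shape and the function name `unique_source_rows` are assumptions made for illustration.

```python
def unique_source_rows(sources):
    """Drop repeated (source, row) references, keeping first-seen order."""
    seen = set()
    unique = []
    for meta in sources:
        # Key on both file and row so equal row numbers in
        # different csv files are not merged.
        key = (meta.get("source"), meta.get("row"))
        if key not in seen:
            seen.add(key)
            unique.append(meta)
    return unique
```

Keying on the file path as well as the row index keeps the helper usable when several csv documents are loaded at once.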