My unofficial implementation of Microsoft's SpreadsheetLLM paper, found here: https://arxiv.org/pdf/2407.09025.
All requirements are listed in requirements.txt. I have attached two Dockerfiles, one for the command line utility and one for the chatbot.
You will also need to download the VFUSE dataset from TableSense, found here: https://figshare.com/projects/Versioned_Spreadsheet_Corpora/20116
Environment variables: set OPENAI_API_KEY for GPT 3.5/4 and HUGGING_FACE_KEY for Llama-2/3, Phi-3, and Mistral.
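
As a quick sanity check before running anything, the two variables can be verified from Python. This is only a minimal sketch: the variable names come from this README, but the provider labels and the missing_key helper are hypothetical and not part of the repository.

```python
import os

# Variable names are taken from this README. The provider labels and this
# mapping are illustrative assumptions, not code from the repository.
REQUIRED_KEYS = {
    "openai": "OPENAI_API_KEY",         # GPT 3.5/4
    "huggingface": "HUGGING_FACE_KEY",  # Llama-2/3, Phi-3, Mistral
}


def missing_key(provider, env=None):
    """Return the name of the unset variable for `provider`, or None if it is set."""
    env = os.environ if env is None else env
    name = REQUIRED_KEYS[provider]
    # An empty string counts as unset.
    return None if env.get(name) else name


if __name__ == "__main__":
    for provider in REQUIRED_KEYS:
        name = missing_key(provider)
        if name:
            print(f"{name} is not set; {provider} models will be unavailable")
```

Passing a plain dict as env makes the check easy to exercise without touching the real environment.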
By default, running python main.py will report the number of tables found in 7b5a0a10-e241-4c0d-a896-11c7c9bf2040.xls. Use the command line arguments to compress all files in a given directory, change the model, etc.
To run the chatbot, run streamlit run chatbot.py.
- Only cell text was considered for the structural anchor-based extraction; formatting (border, color, etc.) was not
- NFS (number format string) identification currently relies on regular expressions and may not be robust
- Only .xls files will work at this time
- Running tests on the LLM
- Enabling compatibility with other Excel formats
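
To illustrate the regular-expression approach to NFS identification noted above, and why it can be fragile, here is a minimal sketch of classifying a cell's text into a coarse format category. The category names, the patterns, and the classify function are all hypothetical; they are not the rules used in this repository.

```python
import re

# Hypothetical categories and patterns, checked in order. These are
# illustrative assumptions, not the repository's actual rules.
PATTERNS = [
    ("percentage", re.compile(r"-?\d+(\.\d+)?%")),
    ("currency", re.compile(r"-?[$€£]\d[\d,]*(\.\d+)?")),
    ("date", re.compile(r"\d{4}-\d{2}-\d{2}|\d{1,2}/\d{1,2}/\d{2,4}")),
    ("time", re.compile(r"\d{1,2}:\d{2}(:\d{2})?")),
    ("scientific", re.compile(r"-?\d+(\.\d+)?[eE][+-]?\d+")),
    ("year", re.compile(r"(19|20)\d{2}")),  # must precede "integer"
    ("integer", re.compile(r"-?\d+")),
    ("float", re.compile(r"-?\d+\.\d+")),
]


def classify(text):
    """Return the first category whose pattern matches the whole cell text."""
    text = text.strip()
    if not text:
        return "empty"
    for name, pattern in PATTERNS:
        if pattern.fullmatch(text):
            return name
    return "other"


if __name__ == "__main__":
    for sample in ["12.5%", "2024", "1999-01-31", "$1,200.50", "hello"]:
        print(sample, "->", classify(sample))
```

The ordering matters: a value such as 2024 matches both the year and integer patterns, so the result depends on which is tried first. Ambiguities like this are one reason a purely regex-based approach may not be robust.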