LLMFineTuneDataFromCode is a Python package designed to streamline the process of generating fine-tuning datasets for large language models (LLMs). It leverages code repositories and research papers to create datasets and provides tools to train and run inference on fine-tuned LLMs.
This package is tailored for labs and organizations aiming to extract insights from their existing resources for LLM training while ensuring flexibility and extensibility.
## Features

- **Code-Based Dataset Generation**: The `generateCodeFineTune.py` script processes GitHub repositories to generate training datasets from code files. It includes dependency analysis, Q&A generation, and code completion tasks.
- **Research Paper Dataset Generation**: The `createFinalDataOutput.py` script processes research papers in PDF format to generate Q&A pairs. Questions can be customized by editing the `QUESTIONS` list.
- **Training**: Use the `train.ipynb` notebook to fine-tune an LLM with the generated dataset. Tested with Meta's Llama 3.1 8B model (requires access from Hugging Face).
- **Inference**: Run inference on the fine-tuned model using the `inference.ipynb` notebook to interactively query your model.
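Generated Q&A and code-completion examples are typically stored as instruction-style records, one JSON object per line (JSONL). As a rough illustration only — the exact field names emitted by `generateCodeFineTune.py` are an assumption, and `load_data` is a hypothetical function name:

```python
import json

# Hypothetical record shape: the actual field names produced by the
# scripts in this package may differ.
record = {
    "instruction": "What does the function `load_data` do?",
    "input": "def load_data(path):\n    ...",
    "output": "It reads the dataset file at `path` and returns its contents.",
}

# Fine-tuning datasets are commonly serialized as one JSON object per line.
line = json.dumps(record)
print(line)
```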
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/Mogtaba-Alim/LLMFineTuneDataFromCode.git
   cd LLMFineTuneDataFromCode
   ```

2. Install the dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Set up your environment variables. Create a `.env` file and include your API keys (e.g., OpenAI or Anthropic) as follows:

   ```
   OPENAI_KEY=your_openai_api_key
   CLAUDE_API_KEY=your_claude_api_key
   HF_KEY=your_huggingface_key
   ```

4. Ensure GPU resources are available and compatible. The scripts are tested with NVIDIA A100 GPUs (40 GB VRAM) and CUDA 12.6.
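How the scripts read the `.env` file is not shown here; libraries such as python-dotenv are the usual choice. As a minimal stdlib-only sketch of what loading `KEY=value` pairs into the environment looks like:

```python
import os

def load_env(path=".env"):
    """Minimal .env loader: reads KEY=value lines, ignoring blanks
    and '#' comments. Existing environment variables are not overwritten."""
    try:
        with open(path) as f:
            lines = f.read().splitlines()
    except FileNotFoundError:
        return
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        os.environ.setdefault(key.strip(), value.strip())
```

A real loader (python-dotenv's `load_dotenv`) also handles quoting and interpolation; this sketch only covers the simple format shown above.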
## Usage

Generate a dataset from code:

```bash
python generateCodeFineTune.py
```

- Specify GitHub repository links in the script.
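The script's internals are not shown here, but the first stage of any code-based pipeline is gathering source files from the cloned repositories. A minimal sketch, assuming repositories already cloned locally (the set of extensions treated as "code" is an assumption):

```python
from pathlib import Path

# Assumption: which file types the pipeline treats as source material.
CODE_EXTENSIONS = {".py", ".ipynb", ".md"}

def collect_source_files(repo_root):
    """Recursively collect source files from a cloned repository,
    skipping hidden directories such as .git."""
    root = Path(repo_root)
    files = []
    for path in sorted(root.rglob("*")):
        # Skip anything under a hidden directory (e.g. .git, .github).
        if any(part.startswith(".") for part in path.relative_to(root).parts):
            continue
        if path.is_file() and path.suffix in CODE_EXTENSIONS:
            files.append(path)
    return files
```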
Generate a dataset from research papers:

```bash
python createFinalDataOutput.py
```

- Place your research papers (PDFs) in the `./Papers` folder.
- Customize the questions in the `QUESTIONS` list as needed.
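The real `QUESTIONS` list lives in `createFinalDataOutput.py`; the sketch below only illustrates the idea of pairing each question with a paper's extracted text to form one prompt per question. The example questions and the prompt template are assumptions, not the script's actual wording:

```python
# Hypothetical stand-in for the QUESTIONS list in createFinalDataOutput.py.
QUESTIONS = [
    "What problem does this paper address?",
    "What is the main contribution?",
]

def build_prompts(paper_text, questions=QUESTIONS):
    """Pair each question with the paper text, yielding one prompt per question."""
    return [
        f"Based on the following paper, answer the question.\n\n"
        f"Paper:\n{paper_text}\n\nQuestion: {q}"
        for q in questions
    ]
```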
Fine-tune a model by opening and running the `train.ipynb` notebook (it is a Jupyter notebook, so it cannot be invoked with `python` directly):

```bash
jupyter notebook train.ipynb
```

- Update the notebook with your Hugging Face model and dataset paths.
Run inference by opening the `inference.ipynb` notebook:

```bash
jupyter notebook inference.ipynb
```

- Interactively query your fine-tuned model.
## Requirements

- Python 3.8+
- CUDA 12.6
- Hugging Face Transformers
- PyTorch
- Other dependencies listed in `requirements.txt`
## Notes

- **GPU Resources**: The package is optimized for high-memory GPUs such as the A100. Adjust parameters in `train.ipynb` for smaller GPUs.
- **Access Requirements**: You must have a Hugging Face account and request access to the model being fine-tuned (e.g., Meta's Llama 3.1 8B).
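Adjusting for a smaller GPU usually means trading per-device batch size for gradient accumulation so the optimizer still sees the same effective batch size. The parameter names below follow the Hugging Face `TrainingArguments` convention, but the actual values used in `train.ipynb` are assumptions:

```python
# Effective batch size = per-device batch size * gradient accumulation steps
# (* number of GPUs, assumed 1 here). Values are illustrative only.
a100_config = {"per_device_train_batch_size": 8, "gradient_accumulation_steps": 2}

# On a smaller GPU, shrink the per-device batch and accumulate more steps
# so training dynamics stay comparable.
small_gpu_config = {"per_device_train_batch_size": 2, "gradient_accumulation_steps": 8}

def effective_batch_size(cfg, num_gpus=1):
    return (cfg["per_device_train_batch_size"]
            * cfg["gradient_accumulation_steps"]
            * num_gpus)
```

Both configurations above yield the same effective batch size of 16; only the peak memory per step differs.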
## Contributing

We welcome contributions to improve the package! Please submit a pull request or create an issue to share your ideas or report bugs.
## License

This project is licensed under the MIT License. See the LICENSE file for details.
## Acknowledgments

This package was developed by Mogtaba Alim to facilitate efficient dataset generation and LLM fine-tuning. Special thanks to the open-source community for their invaluable tools and libraries.