Large Language Monkeys

This repository provides the accompanying code for Large Language Monkeys: Scaling Inference Compute with Repeated Sampling.

Specifically, the code needed to:

  1. Generate samples from various datasets and models.
  2. Evaluate the correctness of the samples.

Four datasets are supported:

  • GSM8K
  • MATH
  • CodeContests
  • MiniF2F-MATH

We use vLLM for inference, so any model it supports will work with our generation scripts.

Installation

We use two different conda environments for this project, because the version of lean-dojo we depend on requires Python 3.9.19.

Environment for MiniF2F-MATH

conda create -n llmonk-minif2f python=3.9.19
conda activate llmonk-minif2f
pip install -r requirements_minif2f.txt

To run evaluation on this dataset, we additionally need to install Lean 4. To do this, follow the official Lean 4 installation instructions for your system.

When prompted with

Current installation options:

  default toolchain: stable
  modify PATH variable: yes

1) Proceed with installation (default)
2) Customize installation
3) Cancel installation

Choose 2, and change the default toolchain to: 4.3.0-rc2.

Environment for everything except MiniF2F-MATH

conda create -n llmonk python=3.11.8
conda activate llmonk
pip install -r requirements.txt

Repository Structure

The repo is organized as follows:

large-language-monkeys/
├── llmonk/
│   ├── evaluate/
│   │   ├── gsm8k.py
│   │   ├── math.py
│   │   ├── code_contests.py
│   │   └── minif2f.py
│   ├── generate/
│   │   ├── gsm8k.py
│   │   ├── math.py
│   │   ├── code_contests.py
│   │   └── minif2f.py
│   └── tests/
│       ├── math_datasets.py
│       ├── code_contests.py
│       └── minif2f.py
├── README.md
└── requirements.txt
  • llmonk/evaluate/: contains the code to evaluate dataset samples
  • llmonk/generate/: contains the code to generate samples from a model
  • llmonk/tests/: contains code to check the correctness of our evaluation scripts

Within each folder, there is a file for each of the supported datasets (note that the scripts for MATH and GSM8K are combined under "math_datasets" for evaluation and testing).

Generation Scripts

These scripts are used to generate samples from a model for a dataset.

Usage

Each file has two mandatory arguments:

  1. model: the Hugging Face model used to generate the samples (the same string you would pass to .from_pretrained; see the sketch below)
  2. save_dir: the directory in which to save the samples

For the remaining optional arguments (e.g., temperature, number of samples, batch size, vLLM arguments), see the GenerateScriptConfig class in llmonk/utils.py.
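
As a point of reference for the model argument, here is a minimal sketch of the same identifier being used with Hugging Face's from_pretrained; the model id below is only an illustrative placeholder, not something this repository requires.

# Illustrative sketch: the model argument is an ordinary Hugging Face model id,
# i.e. the same string you would pass to .from_pretrained.
from transformers import AutoTokenizer

model_id = "gpt2"  # placeholder id; substitute any model that vLLM supports
tokenizer = AutoTokenizer.from_pretrained(model_id)
print(f"Loaded tokenizer for {model_id}: {type(tokenizer).__name__}")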

Output Format

The samples are saved as YAML files (one YAML file per problem). Every dataset's YAML file contains the following keys (a short loading sketch follows the dataset-specific keys below):

  • prompt: the prompt for the problem
  • question: the question text for the problem
  • samples: the list of generated samples for the problem

For GSM8K and MATH, there is an additional key:

  • gt_answer: the dataset's ground truth answer for the problem

For CodeContests, there is an additional key:

  • test_cases: a dictionary with the following keys:
    • input: a list of strings corresponding to test case inputs
    • output: a list of strings corresponding to test case outputs

For MiniF2F-MATH, there is an additional key:

  • theorem_name: the name of the theorem to be proven
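
As a minimal sketch of consuming one of these files (assuming PyYAML is available in the environment; the file path is a hypothetical placeholder), the common keys can be read like this:

# Minimal sketch: load one generated-samples YAML file and inspect the documented keys.
# The path below is a hypothetical placeholder; point it at a file inside your save_dir.
import yaml

with open("samples/gsm8k/problem_0.yaml") as f:  # hypothetical path
    record = yaml.safe_load(f)

print(record["question"])      # the question text for this problem
print(len(record["samples"]))  # number of samples generated for this problem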

Evaluation Scripts

These scripts evaluate the correctness of the samples generated by the generation scripts.

Usage

Each file has two mandatory arguments:

  1. samples_dir: the directory containing the samples to evaluate
  2. save_dir: the directory in which to save the evaluation results

For the remaining optional arguments (e.g., number of workers), see the EvaluateScriptConfig class in llmonk/utils.py.

Output Format

The evaluation results are saved as YAML files (one YAML file per problem), in the same format as the samples produced by the generation scripts, with one additional key:

  • is_correct: a list of booleans indicating whether each sample is correct; is_correct[i] is True if and only if samples[i] is correct (see the sketch below)
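
As a minimal sketch of using these results (again assuming PyYAML and a hypothetical path), the snippet below summarizes the is_correct list for a single problem:

# Minimal sketch: summarize the is_correct list from one evaluation YAML file.
# The path below is a hypothetical placeholder; point it at a file inside the evaluation save_dir.
import yaml

with open("results/gsm8k/problem_0.yaml") as f:  # hypothetical path
    record = yaml.safe_load(f)

is_correct = record["is_correct"]
print(f"samples evaluated:    {len(is_correct)}")
print(f"fraction correct:     {sum(is_correct) / len(is_correct):.3f}")
print(f"solved by any sample: {any(is_correct)}")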

Testing Scripts

The llmonk/tests/ directory contains unit tests that check the correctness of the evaluation scripts.

Example Commands

See commands.md for examples of how to run generation, evaluation, and testing.

Citation

If you use this code in your research, please cite our paper. You can use the following BibTeX entry:

@misc{brown2024largelanguagemonkeysscaling,
      title={Large Language Monkeys: Scaling Inference Compute with Repeated Sampling}, 
      author={Bradley Brown and Jordan Juravsky and Ryan Ehrlich and Ronald Clark and Quoc V. Le and Christopher Ré and Azalia Mirhoseini},
      year={2024},
      eprint={2407.21787},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2407.21787}, 
}