First, create a `keys.cfg` file at the root of the repository and add your API keys for the different providers as follows:

```
OPENAI_KEY = 'your_api_key'
ANTHROPIC_KEY = 'your_api_key'
GOOGLE_KEY = 'your_api_key'
```
If you're using litellm, which supports a variety of providers including vLLM, Hugging Face, and Together AI, make sure to include the relevant API key in the `keys.cfg` file (please refer to the litellm documentation for the key each provider expects). Then use `litellm/*` as the model name when running the command.
For example, to use Together AI's models, you'll need to add the following to your `keys.cfg`:

```
TOGETHERAI_API_KEY = 'your_api_key'
```
To generate code using a Together AI model (e.g., `Meta-Llama-3.1-70B-Instruct-Turbo`), go to the root of this repo and run:

```bash
python eval/scripts/gencode_json.py --model litellm/together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
```
To generate code using GPT-4o (with default settings), go to the root of this repo and run:

```bash
python eval/scripts/gencode_json.py --model gpt-4o
```
If you want to include scientist-annotated background in the prompts, use the `--with-background` flag:

```bash
python eval/scripts/gencode_json.py --model gpt-4o --with-background
```
Please note that we do not plan to release the ground truth code for each problem to the public. However, we have made a dev set available that includes the ground truth code in `eval/data/problems_dev.jsonl`.
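If you want a quick look at what a dev problem record contains, one option (assuming a Unix-like shell with Python on your path) is to pretty-print the first line of the file:

```bash
# Pretty-print the first JSON record of the dev set
head -n 1 eval/data/problems_dev.jsonl | python -m json.tool
```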
In this repository, we only support evaluating with previously generated code for each step.
When running the `gencode_json.py` script, you can use the following options (an example that combines several of them follows the list):

- `--model`: Specifies the model name to be used for generating code (e.g., `gpt-4o` or `litellm/together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo`).
- `--output-dir`: Directory where the generated code outputs will be saved. Default is `eval_results/generated_code`.
- `--input-path`: Path to the JSONL file describing the problems. Default is `eval/data/problems_all.jsonl`.
- `--prompt-dir`: Directory where prompt files are saved. Default is `eval_results/prompt`.
- `--with-background`: If enabled, includes the scientist-annotated background in the prompts.
- `--temperature`: Controls the randomness of the output. Default is 0.
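As a sketch of how these options combine (the output and prompt directory names below are only illustrative, not defaults), a run with background included might look like:

```bash
# Generate code with GPT-4o, writing code and prompts to custom directories
python eval/scripts/gencode_json.py \
  --model gpt-4o \
  --output-dir eval_results/generated_code_background \
  --prompt-dir eval_results/prompt_background \
  --temperature 0 \
  --with-background
```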
Download the numeric test results and save them as `eval/data/test_data.h5`.
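To confirm the file landed in the right place and is readable, a minimal sanity check (assuming the `h5py` package is installed) is:

```bash
# Open the HDF5 file read-only and list a few of its top-level keys
python -c "import h5py; f = h5py.File('eval/data/test_data.h5', 'r'); print(list(f.keys())[:5])"
```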
To evaluate the generated code using a specific model, go to the root of this repo and use the following command:

```bash
python eval/scripts/test_generated_code.py --model "model_name"
```

Replace `"model_name"` with the appropriate model name, and include `--with-background` if the code was generated with scientist-annotated background.
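For example, evaluating code that was generated by GPT-4o with background included would look like:

```bash
python eval/scripts/test_generated_code.py --model gpt-4o --with-background
```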