This repo contains the evaluation code for the paper "[SciCode: A Research Coding Benchmark Curated by Scientists](https://arxiv.org/abs/2407.13168)"

## 🔔 News

**[2024-11-04]: Leaderboard is on! Check [here](https://scicode-bench.github.io/leaderboard/). We have also added Claude Sonnet 3.5 (new) results.**

**[2024-10-01]: We have added OpenAI o1-mini and o1-preview results.**

**[2024-09-26]: SciCode is accepted at NeurIPS D&B Track 2024.**

**[2024-08-22]: The SciCode benchmark has been successfully integrated into [OpenCompass](https://github.com/open-compass/opencompass).**

**[2024-07-24]: We have added the scientist-annotated background and now support the w/ background evaluation setup.**

## Introduction

SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It has diverse coverage of **16** subdomains across **6** domains: Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains **338** subproblems decomposed from **80** challenging main problems, and it offers optional descriptions specifying useful scientific background information, as well as scientist-annotated gold-standard solutions and test cases for evaluation. OpenAI o1-preview, the best-performing model among those tested, can solve only **7.7%** of the problems in the most realistic setting. Broadly, SciCode reflects scientists' everyday workflow of identifying critical science concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only demonstrates contemporary LLMs' progress towards becoming helpful assistants for scientists, but also helps shed light on the future development and evaluation of scientific AI.

## 🏆 Leaderboard

| Models | Main Problem Resolve Rate (%) | <span style="color:grey">Subproblem (%)</span> |
|--------------------------|-------------------------------------|-------------------------------------|
| 🥇 OpenAI o1-preview | <div align="center">**7.7**</div> | <div align="center" style="color:grey">28.5</div> |
| 🥈 Claude3.5-Sonnet | <div align="center">**4.6**</div> | <div align="center" style="color:grey">26.0</div> |
| 🥉 Claude3.5-Sonnet (new) | <div align="center">**4.6**</div> | <div align="center" style="color:grey">25.3</div> |
| Deepseek-Coder-v2 | <div align="center">**3.1**</div> | <div align="center" style="color:grey">21.2</div> |
| GPT-4o | <div align="center">**1.5**</div> | <div align="center" style="color:grey">25.0</div> |
| GPT-4-Turbo | <div align="center">**1.5**</div> | <div align="center" style="color:grey">22.9</div> |
| OpenAI o1-mini | <div align="center">**1.5**</div> | <div align="center" style="color:grey">22.2</div> |
| Gemini 1.5 Pro | <div align="center">**1.5**</div> | <div align="center" style="color:grey">21.9</div> |
| Claude3-Opus | <div align="center">**1.5**</div> | <div align="center" style="color:grey">21.5</div> |
| Llama-3.1-405B-Chat | <div align="center">**1.5**</div> | <div align="center" style="color:grey">19.8</div> |
| Claude3-Sonnet | <div align="center">**1.5**</div> | <div align="center" style="color:grey">17.0</div> |
| Qwen2-72B-Instruct | <div align="center">**1.5**</div> | <div align="center" style="color:grey">17.0</div> |
| Llama-3.1-70B-Chat | <div align="center">**0.0**</div> | <div align="center" style="color:grey">17.0</div> |
| Mixtral-8x22B-Instruct | <div align="center">**0.0**</div> | <div align="center" style="color:grey">16.3</div> |
| Llama-3-70B-Chat | <div align="center">**0.0**</div> | <div align="center" style="color:grey">14.6</div> |

## Instructions to evaluate a new model

4. Run `eval/scripts/gencode_json.py` to generate new model outputs (see the [`eval/scripts` readme](eval/scripts/) for more information)
5. Run `eval/scripts/test_generated_code.py` to evaluate the generated code against the unit tests (see the example below)
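
For example, a complete pass with `gpt-4o` (an arbitrary model choice here; any supported model name works) might look like the following sketch:

```bash
# Generate model outputs, then score them against the unit tests
python eval/scripts/gencode_json.py --model gpt-4o
python eval/scripts/test_generated_code.py --model gpt-4o
```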

## More information and FAQ

More information, including a [FAQ section](https://scicode-bench.github.io/faq/), is provided on our [website](https://scicode-bench.github.io/).
If you have trouble reaching the website, please find the markdown source in its [GitHub repository](https://github.com/scicode-bench/scicode-bench.github.io/tree/main/docs).

## Contact
- Minyang Tian: [email protected]
- Eliu Huerta: [email protected]

## **Generating Code with LLMs**

### 1. Set Up Your API Keys

First, create a `keys.cfg` file at the root of the repository and add your API keys for the different providers as follows:

```
OPENAI_KEY = 'your_api_key'
ANTHROPIC_KEY = 'your_api_key'
GOOGLE_KEY = 'your_api_key'
```
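
The scripts read these keys from `keys.cfg`; if you want to double-check which providers you have configured, one optional way is:

```bash
# Optional: list the key names currently present in keys.cfg (values not printed)
grep -oE "^[A-Z_]+_KEY" keys.cfg
```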

If you're using **litellm**, which supports a variety of providers including **vllm**, **Hugging Face**, and **Together AI**, make sure to include the relevant API key in the `keys.cfg` file. Please refer to the docs [here](https://docs.litellm.ai/docs/providers). Then, use `litellm/*` as the model name when running the command.

For example, to use **Together AI**'s models, you'll need to add the following to your `keys.cfg`:

```
TOGETHERAI_API_KEY = 'your_api_key'
```

### 2. Generating Code

To generate code using the **Together AI** model (e.g., `Meta-Llama-3.1-70B-Instruct-Turbo`), go to the root of this repo and run:

```bash
python eval/scripts/gencode_json.py --model litellm/together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
```

To generate code using **GPT-4o** (with default settings), go to the root of this repo and run:

```bash
python eval/scripts/gencode_json.py --model gpt-4o
```

If you want to include **scientist-annotated background** in the prompts, use the `--with-background` flag:

```bash
python eval/scripts/gencode_json.py --model gpt-4o --with-background
```

Please note that we do not plan to release the ground truth code for each problem to the public. However, we have made a dev set available that includes the ground truth code in `eval/data/problems_dev.jsonl`.
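
For instance, to get a feel for the problem format, you can pretty-print the first entry of the dev set (an optional step; the exact fields are whatever the dataset defines):

```bash
# Each line of the dev set is one JSON object describing a problem
head -n 1 eval/data/problems_dev.jsonl | python -m json.tool
```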

In this repository, **we only support evaluating with previously generated code for each step.**

### Command-Line Arguments

When running the `gencode_json.py` script, you can use the following options (a combined example follows the list):

- `--model`: Specifies the model name to be used for generating code (e.g., `gpt-4o` or `litellm/together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo`).
- `--output-dir`: Directory where the generated code outputs will be saved. Default is `eval_results/generated_code`.
- `--input-path`: Path to the JSONL file containing the problem descriptions. Default is `eval/data/problems_all.jsonl`.
- `--prompt-dir`: Directory where the prompt files are saved. Default is `eval_results/prompt`.
- `--with-background`: If enabled, includes the scientist-annotated problem background in the prompts.
- `--temperature`: Controls the randomness of the generation. Default is 0.
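
As an illustration, the options can be combined in a single call; the values below are simply the documented defaults written out explicitly, plus the background flag:

```bash
python eval/scripts/gencode_json.py \
  --model gpt-4o \
  --input-path eval/data/problems_all.jsonl \
  --output-dir eval_results/generated_code \
  --prompt-dir eval_results/prompt \
  --temperature 0 \
  --with-background
```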

---

## **Evaluating the Generated Code**

### 1. Download Numeric Test Data

Download the [numeric test results](https://drive.google.com/drive/folders/1W5GZW6_bdiDAiipuFMqdUhvUaHIj6-pR?usp=drive_link) and save the file as `eval/data/test_data.h5`.
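
This download is a manual step; once it is done, a quick way to confirm the file landed where the evaluation script expects it is:

```bash
ls -lh eval/data/test_data.h5
```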

### 2. Run the Evaluation

To evaluate the generated code using a specific model, go to the root of this repo and use the following command:

```bash
python eval/scripts/test_generated_code.py --model "model_name"
```

Replace `"model_name"` with the appropriate model name, and include `--with-background` if the code was generated with the **scientist-annotated background**.