Commit
Merge branch 'main' into main
YanxinLu authored Nov 5, 2024
2 parents 64bb3c8 + 7268a4a commit 7945c08
Showing 4 changed files with 95 additions and 47 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -15,7 +15,7 @@ repos:
# - id: trailing-whitespace

- repo: https://github.com/astral-sh/ruff-pre-commit
rev: v0.5.5
rev: v0.6.4
hooks:
# Run the linter.
- id: ruff
50 changes: 34 additions & 16 deletions README.md
@@ -6,11 +6,19 @@
This repo contains the evaluation code for the paper "[SciCode: A Research Coding Benchmark Curated by Scientists](https://arxiv.org/abs/2407.13168)"

## 🔔News
- **[2024-08-22]: The SciCode benchmark has been successfully integrated into [OpenCompass](https://github.com/open-compass/opencompass).**
- **[2024-07-24]: We add the scientist-annotated background and support setup for w/ background evaluation.**

**[2024-11-04]: The leaderboard is live! Check it out [here](https://scicode-bench.github.io/leaderboard/). We have also added Claude 3.5 Sonnet (new) results.**

**[2024-10-01]: We have added OpenAI o1-mini and o1-preview results.**

**[2024-09-26]: SciCode was accepted at the NeurIPS 2024 Datasets and Benchmarks Track.**

**[2024-08-22]: The SciCode benchmark has been successfully integrated into [OpenCompass](https://github.com/open-compass/opencompass).**

**[2024-07-24]: We added the scientist-annotated background and now support the with-background evaluation setup.**

## Introduction
SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It has a diverse coverage of **16** subdomains from **6** domains: Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains **338** subproblems decomposed from **80** challenging main problems, and it offers optional descriptions specifying useful scientific background information and scientist-annotated gold-standard solutions and test cases for evaluation. Claude3.5-Sonnet, the best-performing model among those tested, can solve only **4.6%** of the problems in the most realistic setting. Broadly, SciCode demonstrates a realistic and scientists' everyday workflow of identifying critical science concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only helps demonstrate contemporary LLMs' progress towards helpful assistant for scientists but also helps shed light on future building and evaluation of scientific AI.
SciCode is a challenging benchmark designed to evaluate the capabilities of language models (LMs) in generating code for solving realistic scientific research problems. It offers diverse coverage of **16** subdomains from **6** domains: Physics, Math, Material Science, Biology, and Chemistry. Unlike previous benchmarks that consist of exam-like question-answer pairs, SciCode is converted from real research problems. SciCode problems naturally factorize into multiple subproblems, each involving knowledge recall, reasoning, and code synthesis. In total, SciCode contains **338** subproblems decomposed from **80** challenging main problems, and it offers optional descriptions specifying useful scientific background information, as well as scientist-annotated gold-standard solutions and test cases for evaluation. OpenAI o1-preview, the best-performing model among those tested, can solve only **7.7%** of the problems in the most realistic setting. Broadly, SciCode reflects the everyday workflow of scientists: identifying critical scientific concepts and facts and then transforming them into computation and simulation code. We believe SciCode not only demonstrates contemporary LLMs' progress towards becoming helpful assistants for scientists but also helps shed light on the future building and evaluation of scientific AI.



@@ -19,19 +27,24 @@ SciCode sources challenging and realistic research-level coding problems across

## 🏆 Leaderboard

| Model | Subproblem | Main Problem |
|---------------------------|------------|--------------|
| Claude3.5-Sonnet | **26** | **4.6** |
| GPT-4o | 25 | 1.5 |
| GPT-4-Turbo | 22.9 | 1.5 |
| Gemini 1.5 Pro | 21.9 | 1.5 |
| Claude3-Opus | 21.5 | 1.5 |
| Deepseek-Coder-v2 | 21.2 | 3.1 |
| Claude3-Sonnet | 17 | 1.5 |
| Qwen2-72B-Instruct | 17 | 1.5 |
| Llama-3.1-70B-Instruct | 16.3 | 1.5 |
| Mixtral-8x22B-Instruct | 16.3 | 0 |
| Llama-3-70B-Chat | 14.6 | 0 |
| Models                    | Main Problem Resolve Rate | Subproblem |
|---------------------------|:-------------------------:|:----------:|
| 🥇 OpenAI o1-preview      | **7.7**                   | 28.5       |
| 🥈 Claude3.5-Sonnet       | **4.6**                   | 26.0       |
| 🥉 Claude3.5-Sonnet (new) | **4.6**                   | 25.3       |
| Deepseek-Coder-v2         | **3.1**                   | 21.2       |
| GPT-4o                    | **1.5**                   | 25.0       |
| GPT-4-Turbo               | **1.5**                   | 22.9       |
| OpenAI o1-mini            | **1.5**                   | 22.2       |
| Gemini 1.5 Pro            | **1.5**                   | 21.9       |
| Claude3-Opus              | **1.5**                   | 21.5       |
| Llama-3.1-405B-Chat       | **1.5**                   | 19.8       |
| Claude3-Sonnet            | **1.5**                   | 17.0       |
| Qwen2-72B-Instruct        | **1.5**                   | 17.0       |
| Llama-3.1-70B-Chat        | **0.0**                   | 17.0       |
| Mixtral-8x22B-Instruct    | **0.0**                   | 16.3       |
| Llama-3-70B-Chat          | **0.0**                   | 14.6       |


## Instructions to evaluate a new model

4. Run `eval/scripts/gencode_json.py` to generate new model outputs (see the [`eval/scripts` readme](eval/scripts/) for more information)
5. Run `eval/scripts/test_generated_code.py` to evaluate the generated code against the unit tests

## More information and FAQ

More information, including a [FAQ section](https://scicode-bench.github.io/faq/), is provided on our [website](https://scicode-bench.github.io/).
If you have trouble reaching the website, please find the markdown source in its [github repository](https://github.com/scicode-bench/scicode-bench.github.io/tree/main/docs).

## Contact
- Minyang Tian: [email protected]
- Eliu Huerta: [email protected]
64 changes: 47 additions & 17 deletions eval/scripts/README.md
@@ -1,42 +1,72 @@
## **Generate LLM code**

You first need to set up your API keys. For this, create a `keys.cfg` file at the root of the repository
and add keys as follows:
## **Generating Code with LLMs**

### 1. Set Up Your API Keys

First, create a `keys.cfg` file at the root of the repository and add your API keys for the different providers as follows:

```
OPENAI_KEY = 'your_api_key'
ANTHROPIC_KEY = 'your_api_key'
GOOGLE_KEY = 'your_api_key' 
GOOGLE_KEY = 'your_api_key'
```

If you're using **litellm**, which supports a variety of providers including **vllm**, **Hugging Face**, and **Together AI**, make sure to include the relevant API key in the `keys.cfg` file. Please refer to the docs [here](https://docs.litellm.ai/docs/providers). Then, use `litellm/*` as the model name when running the command.

For example, to use **Together AI**'s models, you'll need to add the following to your `keys.cfg`:

```
TOGETHERAI_API_KEY = 'your_api_key'
```

For example, to create model results with `gpt-4o` and the default settings, go to the root of this repo and run
### 2. Generating Code

To generate code using the **Together AI** model (e.g., `Meta-Llama-3.1-70B-Instruct-Turbo`), go to the root of this repo and run:

```bash
python eval/scripts/gencode_json.py --model litellm/together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo
```

To generate code using **GPT-4o** (with default settings), go to the root of this repo and run:

```bash
python eval/scripts/gencode_json.py --model gpt-4o
```

For results with scientist-annotated background, run
If you want to include **scientist-annotated background** in the prompts, use the `--with-background` flag:

```bash
python eval/scripts/gencode_json.py --model gpt-4o --with-background
```

Please note that we do not plan to release the ground truth code for each problem to the public. However, we have made a dev set available that includes the ground truth code in `eval/data/problems_dev.jsonl`.
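
If you want a quick look at the dev set, here is a minimal sketch for loading it; it only assumes each line is a JSON object, and apart from `problem_id` (which the eval scripts use) the exact field names may differ:

```python
import json
from pathlib import Path

# Peek at the dev split, which ships with the ground-truth code.
dev_path = Path("eval", "data", "problems_dev.jsonl")
with dev_path.open() as f:
    problems = [json.loads(line) for line in f if line.strip()]

print(f"Loaded {len(problems)} dev problems")
# `problem_id` appears in the eval scripts; the remaining keys are printed for inspection.
first = problems[0]
print(first.get("problem_id"), sorted(first.keys()))
```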

In this repository, **we only support evaluating with previously generated code for each step.**

### Command-Line Arguments

- `--model` - Specifies the model name used for generating responses.
- `--output-dir` - Directory to store the generated code outputs (Default: `eval_results/generated_code`).
- `--input-path` - Path to the JSONL file containing the problem descriptions (Default: `eval/data/problems_all.jsonl`).
- `--prompt-dir` - Directory where prompt files are saved (Default: `eval_results/prompt`).
- `--with-background` - Include problem background if enabled.
- `--temperature` - Controls the randomness of the generation (Default: 0).

## **Evaluate generated code**
When running the `gencode_json.py` script, you can use the following options:

- `--model`: Specifies the model name to be used for generating code (e.g., `gpt-4o` or `litellm/together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo`).
- `--output-dir`: Directory where the generated code outputs will be saved. Default is `eval_results/generated_code`.
- `--input-path`: Directory containing the JSON files describing the problems. Default is `eval/data/problems_all.jsonl`.
- `--prompt-dir`: Directory where prompt files are saved. Default is `eval_results/prompt`.
- `--with-background`: If enabled, includes the problem background in the generated code.
- `--temperature`: Controls the randomness of the output. Default is 0.

---

Download the [numeric test results](https://drive.google.com/drive/folders/1W5GZW6_bdiDAiipuFMqdUhvUaHIj6-pR?usp=drive_link) and save them as `./eval/data/test_data.h5`
## **Evaluating the Generated Code**

To run the script, go to the root of this repo and use the following command:
### 1. Download Numeric Test Data

Download the [numeric test results](https://drive.google.com/drive/folders/1W5GZW6_bdiDAiipuFMqdUhvUaHIj6-pR?usp=drive_link) and save the file as `eval/data/test_data.h5`.
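
As a quick sanity check that the download is intact, you can list the file's top-level entries (a minimal sketch, assuming `h5py` is installed; the internal layout of the file is not documented here):

```python
import h5py

# List the top-level groups/datasets in the downloaded numeric test data.
with h5py.File("eval/data/test_data.h5", "r") as f:
    print(f"{len(f.keys())} top-level entries")
    print(list(f.keys())[:10])
```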

### 2. Run the Evaluation

To evaluate the generated code using a specific model, go to the root of this repo and use the following command:

```bash
python eval/scripts/test_generated_code.py --model "model_name"
```

Replace `"model_name"` with the appropriate model name, and include `--with-background` if the code is generated with **scientist-annotated background**.
26 changes: 13 additions & 13 deletions eval/scripts/gencode_json.py
@@ -8,10 +8,10 @@
)
from scicode.gen.models import extract_python_script, get_model_function


DEFAULT_PROMPT_TEMPLATE = Path("eval", "data", "background_comment_template.txt").read_text()
BACKGOUND_PROMPT_TEMPLATE = Path("eval", "data", "multistep_template.txt").read_text()


class Gencode:
def __init__(self, model: str, output_dir: Path,
prompt_dir: Path, with_background: bool, temperature: float):
@@ -57,6 +57,10 @@ def generate_response_with_steps(
save (bool, optional): Save prompt and model response. Defaults to True.
"""
prob_id = prob_data["problem_id"]
output_file_path = (
self.output_dir / Path(self.model).parts[-1] / self._get_background_dir()
/ f"{prob_id}.{num_steps}.py"
)
if num_steps == 1:
self.previous_llm_code = [None] * tot_steps
else:
@@ -69,8 +73,7 @@ def generate_response_with_steps(
prev_file_path = Path("eval", "data", f"{prob_id}.{prev_step+1}.txt")
else:
prev_file_path = (
self.output_dir
/ model
self.output_dir / Path(self.model).parts[-1] / self._get_background_dir()
/ f"{prob_id}.{prev_step + 1}.py"
)
if prev_file_path.is_file():
@@ -80,6 +83,9 @@ def generate_response_with_steps(
self.previous_llm_code[prev_step] = function_code
else:
raise Exception(f'Generating {prob_id} step {num_steps} ahead of step {prev_step + 1}.')

if output_file_path.exists():
return
prompt, previous_code = self.generate_prompt_with_steps(prob_data, num_steps, prompt_template)
if save:
self.save_prompt_with_steps(prob_data, prompt, num_steps)
@@ -89,16 +95,10 @@ def generate_response_with_steps(
model_kwargs["max_tokens"] = 4096
model_kwargs["temperature"] = self.temperature
# write the response to a file if it doesn't exist
output_file_path = (
self.output_dir
/ model
/ f"{prob_id}.{num_steps}.py"
)
if not output_file_path.exists():
model_fct = get_model_function(model, **model_kwargs)
response_from_llm = model_fct(prompt)
self.previous_llm_code[num_steps - 1] = extract_python_script(response_from_llm)
self.save_response_with_steps(prob_data, response_from_llm, previous_code, num_steps)
model_fct = get_model_function(model, **model_kwargs)
response_from_llm = model_fct(prompt)
self.previous_llm_code[num_steps - 1] = extract_python_script(response_from_llm)
self.save_response_with_steps(prob_data, response_from_llm, previous_code, num_steps)

@staticmethod
def process_problem_code(prob_data: dict, num_steps: int) -> str:
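
A side note on the `output_file_path` change above: `Path(self.model).parts[-1]` appears to be used so that model names containing slashes (as litellm routes do) collapse to a single directory component instead of a nested path. A minimal sketch of that behaviour:

```python
from pathlib import Path

# Slash-containing model names would otherwise create nested directories
# under the output directory; parts[-1] keeps only the final segment.
for model in ("gpt-4o", "litellm/together_ai/meta-llama/Meta-Llama-3.1-70B-Instruct-Turbo"):
    print(f"{model!r} -> {Path(model).parts[-1]!r}")
```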
