From 7d77559c4e8c2715f7919e6cdf0cdf70681c17d0 Mon Sep 17 00:00:00 2001
From: Minyang Tian <69544994+mtian8@users.noreply.github.com>
Date: Mon, 4 Nov 2024 10:18:16 -0600
Subject: [PATCH] Update README.md

---
 README.md | 35 ++++++++++++++++++++---------------
 1 file changed, 20 insertions(+), 15 deletions(-)

diff --git a/README.md b/README.md
index 8978716..14feccd 100644
--- a/README.md
+++ b/README.md
@@ -7,6 +7,8 @@ This repo contains the evaluation code for the paper "[SciCode: A Research Codin
 
 ## πŸ””News
 
+**[2024-11-04]: The leaderboard is live! Check it [here](https://scicode-bench.github.io/leaderboard/). We have also added Claude 3.5 Sonnet (new) results.**
+
 **[2024-10-01]: We have added OpenAI o1-mini and o1-preview results.**
 
 **[2024-09-26]: SciCode is accepted at NeurIPS D&B Track 2024.**
@@ -23,21 +25,24 @@ SciCode sources challenging and realistic research-level coding problems across
 
 ## πŸ† Leaderboard
 
-| Model | Subproblem | Main Problem |
-|---------------------------|------------|--------------|
-| **OpenAI o1-preview** | **28.5** | **7.7** |
-| Claude3.5-Sonnet | 26 | 4.6 |
-| GPT-4o | 25 | 1.5 |
-| GPT-4-Turbo | 22.9 | 1.5 |
-| OpenAI o1-mini | 22.2 | 1.5 |
-| Gemini 1.5 Pro | 21.9 | 1.5 |
-| Claude3-Opus | 21.5 | 1.5 |
-| Deepseek-Coder-v2 | 21.2 | 3.1 |
-| Claude3-Sonnet | 17 | 1.5 |
-| Qwen2-72B-Instruct | 17 | 1.5 |
-| Llama-3.1-70B-Instruct | 16.3 | 1.5 |
-| Mixtral-8x22B-Instruct | 16.3 | 0 |
-| Llama-3-70B-Chat | 14.6 | 0 |
+| Models                     | Main Problem Resolve Rate (%) | Subproblem Resolve Rate (%) |
+|----------------------------|-------------------------------|-----------------------------|
+| πŸ₯‡ OpenAI o1-preview       | **7.7**                       | 28.5                        |
+| πŸ₯ˆ Claude3.5-Sonnet        | **4.6**                       | 26.0                        |
+| πŸ₯‰ Claude3.5-Sonnet (new)  | **4.6**                       | 25.3                        |
+| Deepseek-Coder-v2          | **3.1**                       | 21.2                        |
+| GPT-4o                     | **1.5**                       | 25.0                        |
+| GPT-4-Turbo                | **1.5**                       | 22.9                        |
+| OpenAI o1-mini             | **1.5**                       | 22.2                        |
+| Gemini 1.5 Pro             | **1.5**                       | 21.9                        |
+| Claude3-Opus               | **1.5**                       | 21.5                        |
+| Llama-3.1-405B-Chat        | **1.5**                       | 19.8                        |
+| Claude3-Sonnet             | **1.5**                       | 17.0                        |
+| Qwen2-72B-Instruct         | **1.5**                       | 17.0                        |
+| Llama-3.1-70B-Chat         | **0.0**                       | 17.0                        |
+| Mixtral-8x22B-Instruct     | **0.0**                       | 16.3                        |
+| Llama-3-70B-Chat           | **0.0**                       | 14.6                        |
+
 ## Instructions to evaluate a new model