Can Language Models Replace Programmers? RepoCod Says ‘Not Yet’ - by Shanchao Liang and Yiran Hu and Nan Jiang and Lin Tan

lt-asset/REPOCOD

Can Language Models Replace Programmers? REPOCOD Says ‘Not Yet’

We create REPOCOD, a code generation benchmark with 980 problems collected from 11 popular real-world projects, more than 58% of which require file-level or repository-level context information. In addition, REPOCOD has the longest average canonical solution length (331.6 tokens) and the highest average cyclomatic complexity (9.00) compared to existing benchmarks. Each task in REPOCOD includes 313.5 developer-written test cases on average for better correctness evaluation. In our evaluations of ten LLMs, none of the models achieves more than 30 pass@1 on REPOCOD, underscoring the need for stronger LLMs that can help developers in real-world software development.

Updates:

10/30/2024: Our preprint is available on arXiv: https://arxiv.org/abs/2410.21647

10/30/2024: Our dataset is available on Hugging Face.

11/01/2024: The leaderboard is available.

Usage

Install Dependencies

Optionally, create and activate a conda environment:

conda create -n repocod python=3.10 -y
conda activate repocod

Install the packages required for inference and evaluation with REPOCOD:

pip install --upgrade pip
pip install -r requirements.txt

Inference

To run inference with REPOCOD, please refer to ./inference/Inference.md.

Evaluation

To evaluate on REPOCOD, please refer to ./evaluate/Evaluate.md.

Data Collection

Overview of REPOCOD's data collection process

We employ a three-stage data collection pipeline to efficiently gather target functions from popular repositories: Repository Selection, Target Function Selection, and Relevant Test Case Collection. For more details, feel free to read our paper!

LLMs' Performance

LLMs' performance on REPOCOD

This table shows the performance of 10 LLMs on REPOCOD under three retrieval settings. Commercial LLMs perform better under every retrieval setting. Specifically, GPT-4o achieves the best result, reaching up to 27.35 pass@1.

However, compared to their results on HumanEval (about 90 pass@1) and MBPP, SOTA LLMs are still far from being able to write real-world programs that require repository-level information.
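For readers unfamiliar with the metric, the sketch below shows how a pass@k score is commonly computed. It assumes the standard unbiased estimator popularized by the HumanEval benchmark; this README does not state which exact procedure REPOCOD uses, so see ./evaluate/Evaluate.md for the actual evaluation. The function names `pass_at_k` and `benchmark_pass_at_k` are hypothetical and are not part of this repository.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (illustrative, not REPOCOD's code).

    n: number of generated samples for the problem
    c: number of those samples that pass all test cases
    k: number of samples the metric is allowed to draw
    """
    if not (0 <= c <= n) or not (1 <= k <= n):
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # If fewer than k samples fail, any draw of k samples contains a pass.
    if n - c < k:
        return 1.0
    # 1 minus the probability that all k drawn samples are failures.
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems, reported as a percentage.

    results: one (n, c) pair per benchmark problem.
    """
    scores = [pass_at_k(n, c, k) for n, c in results]
    return 100.0 * sum(scores) / len(scores)
```

With a single sample per problem (n = 1, k = 1), the score reduces to the percentage of problems whose generated solution passes every test, so a score of 27.35 pass@1 would correspond to roughly 27% of problems solved under that assumption.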

Citation

@misc{liang2024languagemodelsreplaceprogrammers,
      title={Can Language Models Replace Programmers? REPOCOD Says 'Not Yet'}, 
      author={Shanchao Liang and Yiran Hu and Nan Jiang and Lin Tan},
      year={2024},
      eprint={2410.21647},
      archivePrefix={arXiv},
      primaryClass={cs.SE},
      url={https://arxiv.org/abs/2410.21647}, 
}
