The source code of the paper "Automatic Generation of Pull Request Descriptions".
Our collected 333K pull requests can be downloaded from here. Here is a PR example in the json file:
{
"id": "elastic/elasticsearch_37980",
"body": "'Eclipse build files were missing so .eclipse project files were not being generated.\\r\\nCloses #37973\\r\\n\\r\\n'",
"cms": [
"'Added missing eclipse-build.gradle files\\n\\nCloses #fix/37973'"
],
"commits": {
"'3e10ee798c932cc1cab1ea6ca679417408fc1416'": {
"cm": "'Added missing eclipse-build.gradle files\\n\\nCloses #fix/37973'",
"comments": []
}
}
}
- id: $user/$project_$prid
- body: PR description
- cms: the commit messages in this PR
- commits: the commits in this PR
- key is the SHA1 hash digest
- cm: commit message
- comments: source code comments added in this commit
- key is the SHA1 hash digest
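The field layout above can be read with a few lines of Python. This is a minimal sketch, not the loader used in this repository: it assumes that string values and commit keys are stored as Python-repr-style strings (note the inner single quotes in the example), so `ast.literal_eval` is used to recover the plain text. The helper names `unquote` and `parse_pr` are hypothetical.

```python
import ast
import json

# The example record from above, embedded verbatim for illustration.
RAW = r"""
{
  "id": "elastic/elasticsearch_37980",
  "body": "'Eclipse build files were missing so .eclipse project files were not being generated.\\r\\nCloses #37973\\r\\n\\r\\n'",
  "cms": ["'Added missing eclipse-build.gradle files\\n\\nCloses #fix/37973'"],
  "commits": {
    "'3e10ee798c932cc1cab1ea6ca679417408fc1416'": {
      "cm": "'Added missing eclipse-build.gradle files\\n\\nCloses #fix/37973'",
      "comments": []
    }
  }
}
"""


def unquote(value):
    """Undo the repr-style quoting (assumption); keep the value unchanged otherwise."""
    try:
        parsed = ast.literal_eval(value)
    except (ValueError, SyntaxError):
        return value
    return parsed if isinstance(parsed, str) else value


def parse_pr(record):
    """Turn one raw record into a plain dictionary (hypothetical helper)."""
    repo, pr_number = record["id"].rsplit("_", 1)  # id is $user/$project_$prid
    return {
        "repo": repo,
        "number": int(pr_number),
        "body": unquote(record["body"]),
        "commit_messages": [unquote(cm) for cm in record["cms"]],
        "commits": {
            unquote(sha): {"message": unquote(c["cm"]), "comments": c["comments"]}
            for sha, c in record["commits"].items()
        },
    }


pr = parse_pr(json.loads(RAW))
print(pr["repo"], pr["number"])
print(list(pr["commits"]))
```

The sketch parses a single record; how records are laid out inside the downloaded file is not covered here.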
Our dataset can be downloaded from here, which contains:
- the train, validation and test sets
- a json file for building vocabulary
To preprocess the raw data, we used the following regular expressions:
email_pattern = r'(^|\s)<[\w.-]+@(?=[a-z\d][^.]*\.)[a-z\d.-]*[^.]>'
url_pattern = r'https?://[-a-zA-Z0-9@:%._+~#?=/]+(?=($|[^-a-zA-Z0-9@:%._+~#?=/]))'
reference_pattern = r'#[\d]+'
signature_pattern = r'^(signed-off-by|co-authored-by|also-by):'
at_pattern = r'@\S+'
structure_pattern = r'^#+'
version_pattern = r'(^|\s|-)[\d]+(\.[\d]+){1,}'
sha_pattern = r'(^|\s)[\dA-Fa-f-]{7,}(?=(\s|$))'
digit_pattern = r'(^|\s|-)[\d]+(?=(\s|$))'
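As an illustration of how these patterns might be applied, here is a minimal sketch that substitutes each match with a placeholder token. The placeholder tokens, the order of application, and the `re.MULTILINE | re.IGNORECASE` flags are assumptions made for this example; the actual preprocessing code in this repository may differ.

```python
import re

# Assumed flags: the patterns use ^/$ and lowercase literals.
FLAGS = re.MULTILINE | re.IGNORECASE

# (pattern, replacement) pairs; the replacement tokens are hypothetical.
# "\1" keeps the leading whitespace or hyphen captured by the first group.
RULES = [
    (r'(^|\s)<[\w.-]+@(?=[a-z\d][^.]*\.)[a-z\d.-]*[^.]>', r'\1<email>'),
    (r'https?://[-a-zA-Z0-9@:%._+~#?=/]+(?=($|[^-a-zA-Z0-9@:%._+~#?=/]))', '<url>'),
    (r'#[\d]+', '<ref>'),
    (r'^(signed-off-by|co-authored-by|also-by):', '<signature>'),
    (r'@\S+', '<at>'),
    (r'^#+', ''),
    (r'(^|\s|-)[\d]+(\.[\d]+){1,}', r'\1<version>'),
    (r'(^|\s)[\dA-Fa-f-]{7,}(?=(\s|$))', r'\1<sha>'),
    (r'(^|\s|-)[\d]+(?=(\s|$))', r'\1<digit>'),
]


def normalize(text):
    """Apply every rule in order and return the normalized text."""
    for pattern, replacement in RULES:
        text = re.sub(pattern, replacement, text, flags=FLAGS)
    return text


print(normalize("Closes #37973, thanks @alice for https://example.com/docs in 6.5.1"))
print(normalize("Signed-off-by: Bob <bob@example.com>\nrevert 3e10ee7 after 2 tries"))
```

Order matters in this sketch: URLs are replaced before references and mentions so that a `#` or `@` inside a URL is not rewritten separately.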
$ git clone https://github.com/Tbabm/PRSummarizer.git
$ cd PRSummarizer
$ mkdir data
# download our preprocessed dataset and place the four files in `data`
$ mkdir models
- See here for instructions about installing ROUGE
- Please make sure you have correctly set the environment variable ROUGE to /absolute/path/to/ROUGE-RELEASE-1.5.5
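A quick way to verify the variable before running anything else is sketched below. The function `check_rouge_home` is a hypothetical helper, not part of this repository; it only assumes that the ROUGE release directory contains the `ROUGE-1.5.5.pl` script.

```python
import os
from pathlib import Path


def check_rouge_home(environ=os.environ):
    """Return a description of the problem, or None if ROUGE looks usable."""
    value = environ.get("ROUGE")
    if not value:
        return "ROUGE is not set"
    path = Path(value)
    if not path.is_absolute():
        return f"ROUGE should be an absolute path, got {value!r}"
    if not (path / "ROUGE-1.5.5.pl").is_file():
        return f"ROUGE-1.5.5.pl not found in {value!r}"
    return None


print(check_rouge_home() or "ROUGE looks OK")
```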
Through conda:
$ conda env create -f environment.yml
Or through pip:
$ pip install -r requirements.txt
- Install and test pyrouge if you haven't already.
$ git clone https://github.com/bheinzerling/pyrouge
$ cd pyrouge
$ pip install .
# set rouge path for pyrouge
$ pyrouge_set_rouge_path ${ROUGE}
# test the installation of pyrouge
$ python -m pyrouge.test
Train Attn+PG first:
python3 -m prsum.prsum train --param-path params_attn_pg.json
After training, suppose the models are stored in models/train_12345678/model/. Select the best Attn+PG model:
python3 -m prsum.prsum select_model \
--param_path params_attn_pg.json \
--model_pattern "models/train_12345678/model/model_{}_" \
--start_iter 1000 \
--end_iter 26000
Suppose the best model is model_12000_87654321. Train Attn+PG+RL based on the best model:
python3 -m prsum.prsum train \
--param_path params_attn_pg_rl.json \
--model_path "models/train_12345678/model/model_12000_87654321"
Select the best Attn+PG+RL model:
# start_iter = the best iteration of `Attn+PG` (here, 12000) + save_interval (here, 1000)
START_ITER=13000
python3 -m prsum.prsum select_model \
--param_path params_attn_pg_rl.json \
--model_pattern "models/train_12345678/model/rl_model_{}_" \
--start_iter $START_ITER \
--end_iter 41000
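The arithmetic behind START_ITER and the `{}` placeholder in `--model_pattern` can be sketched as follows. This is only an illustration: `candidate_prefixes` is a hypothetical helper, and treating `end_iter` as inclusive is an assumption, not a statement about how `select_model` is implemented.

```python
def candidate_prefixes(model_pattern, start_iter, end_iter, save_interval=1000):
    """Fill the `{}` placeholder for every saved iteration from start_iter to end_iter."""
    return [
        model_pattern.format(iteration)
        for iteration in range(start_iter, end_iter + 1, save_interval)
    ]


best_attn_pg_iter = 12000  # best Attn+PG iteration from the previous step
save_interval = 1000       # checkpoints are saved every 1000 iterations
start_iter = best_attn_pg_iter + save_interval

prefixes = candidate_prefixes(
    "models/train_12345678/model/rl_model_{}_", start_iter, 41000, save_interval
)
print(start_iter)
print(prefixes[0])
print(prefixes[-1])
```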
Suppose the best model is rl_model_34000_98765432. Test the best Attn+PG+RL model:
python3 -m prsum.prsum decode \
--param_path params_attn_pg_rl.json \
--model_path "models/train_12345678/model/rl_model_34000_98765432" \
--ngram_filter 1
Now, you will get the test results.
NOTE: Your test results may differ slightly from those reported in our paper, because the pointer generator uses the scatter_add function in PyTorch, which is nondeterministic on GPUs. See here for more details.
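The underlying reason is that floating-point addition is not associative, so accumulating the same values in a different order can give a slightly different sum. The sketch below is a pure-Python stand-in for a scatter-add, not PyTorch code; the function `scatter_add` and its `order` argument are hypothetical and only mimic the varying accumulation order of parallel GPU additions.

```python
def scatter_add(size, index, src, order):
    """Add src[i] into out[index[i]], visiting the elements in the given order."""
    out = [0.0] * size
    for i in order:
        out[index[i]] += src[i]
    return out


index = [0, 0, 0]        # all three values land in the same output slot
src = [0.1, 0.2, 0.3]

forward = scatter_add(1, index, src, order=[0, 1, 2])
backward = scatter_add(1, index, src, order=[2, 1, 0])

print(forward[0], backward[0])
print(forward[0] == backward[0])
```

The two results differ only in the last bits, which is why the reported scores may vary slightly between runs.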
Our pre-trained model and test results can be downloaded here. To test with our pre-trained model:
mkdir models
mv rl_model_34000 ./models
python3 -m prsum.prsum decode \
--param_path params_attn_pg_rl.json \
--model_path "models/rl_model_34000" \
--ngram_filter 1
If you use this code, please consider citing our paper:
@inproceedings{liu2019automatic,
title={Automatic generation of pull request descriptions},
author={Liu, Zhongxin and Xia, Xin and Treude, Christoph and Lo, David and Li, Shanping},
booktitle={Proceedings of the 34th IEEE/ACM International Conference on Automated Software Engineering},
pages={176--188},
year={2019},
}
Thanks!
- Our paper: "Automatic Generation of Pull Request Descriptions"
- https://github.com/atulkum/pointer_summarizer
- https://github.com/rohithreddy024/Text-Summarizer-Pytorch