Disclaimer

This is the repository for the experiments of the paper "Exploring NLP Techniques for Code Smell Detection: A Comparative Study." The study compares various NLP-based models for detecting code smells to a baseline model, highlighting their strengths and weaknesses.

Disclaimer

This research was conducted as part of a CIFRE doctoral program (Convention Industrielle de Formation par la Recherche) in collaboration with Adservio, Université Paris-Saclay, and ISEP.

Prerequisites

First, create a conda environment and install the dependencies:

conda create -n mlcqenv python=3.10
conda activate mlcqenv
conda install --file requirements.txt

To recreate the JSON file containing the code snippets at the paths specified in the MLCQ dataset:

Create a GitHub token so the script can communicate with the GitHub API (see the GitHub documentation for more information on creating a token).
Export the token as an environment variable: export GITHUB_TOKEN=<your_github_token>
Run the DataExtractor script: python DataExtractor.py
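
For readers wondering how a script can pick up the exported token, here is a minimal sketch. It is not the code in DataExtractor.py: the helper names `build_auth_headers` and `build_contents_request` are hypothetical, and the sketch only builds a request for GitHub's repository contents endpoint without sending it.

```python
import os
import urllib.request

API_ROOT = "https://api.github.com"


def build_auth_headers(env=None):
    """Return GitHub API headers built from the GITHUB_TOKEN variable."""
    env = os.environ if env is None else env
    token = env.get("GITHUB_TOKEN")
    if not token:
        raise RuntimeError("GITHUB_TOKEN is not set; export it before running.")
    return {
        "Authorization": f"Bearer {token}",
        # Ask for the raw file body instead of the JSON metadata wrapper.
        "Accept": "application/vnd.github.raw+json",
    }


def build_contents_request(owner, repo, path, ref, env=None):
    """Build (but do not send) a request for one file at a given commit."""
    url = f"{API_ROOT}/repos/{owner}/{repo}/contents/{path}?ref={ref}"
    return urllib.request.Request(url, headers=build_auth_headers(env))
```

Failing early when the variable is missing gives a clearer error than an unauthenticated request that later hits the API rate limit.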

Baseline

The baseline is J48, a decision-tree algorithm (Weka's implementation of C4.5) widely considered state-of-the-art for code smell detection (see 1 and 2).
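
To give some intuition for how a C4.5-style tree chooses a split on a code metric, the sketch below computes the gain ratio of a single threshold split. It is an illustration only, not the baseline code in this repository: the function names and the toy data are made up, and real C4.5 adds further steps (candidate threshold search, pruning) that are omitted here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def gain_ratio(values, labels, threshold):
    """Gain ratio of the binary split `value <= threshold`."""
    left = [y for x, y in zip(values, labels) if x <= threshold]
    right = [y for x, y in zip(values, labels) if x > threshold]
    if not left or not right:
        return 0.0  # a split that separates nothing carries no information
    total = len(labels)
    weights = [len(left) / total, len(right) / total]
    info_gain = entropy(labels) - (weights[0] * entropy(left)
                                   + weights[1] * entropy(right))
    # Split information penalizes very unbalanced partitions.
    split_info = -sum(w * math.log2(w) for w in weights)
    return info_gain / split_info


# Toy data (made up): lines of code per method and a smell label.
loc = [12, 15, 80, 95, 20, 110]
smelly = [0, 0, 1, 1, 0, 1]
print(gain_ratio(loc, smelly, threshold=50))
```

On this toy data a threshold of 50 separates the two classes perfectly, so it scores higher than a threshold such as 13 that isolates a single sample.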

The model's features are code metrics, so these must be computed first. We use Designite for this; install it following the instructions in its official repository.

Run python baseline/MetricsExtractor.py to write the code snippets to .java files.
Run python baseline/DesigniteRun.py to execute Designite on the Java files, producing a DesigniteOutput file.
Run python baseline/DatasetCreator.py to prepare the final dataset to feed to the model.
Finally, run train.py to train and test the model.

Tip: You can speed up the Designite processing by specifying the number of workers when using MetricsExtractor.py. This divides the dataset into batches, enabling parallel processing for faster execution.
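
The batching idea in the tip can be sketched as follows. This is a hedged illustration, not the implementation in MetricsExtractor.py: `make_batches`, `process_batch`, and `run_in_parallel` are hypothetical names, and `process_batch` is a stand-in for whatever per-batch work is actually done. Threads are used on the assumption that each batch mostly waits on an external tool.

```python
from concurrent.futures import ThreadPoolExecutor


def make_batches(items, num_workers):
    """Split items into at most num_workers contiguous, near-equal batches."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    size, extra = divmod(len(items), num_workers)
    batches, start = [], 0
    for i in range(num_workers):
        # The first `extra` batches take one additional item each.
        end = start + size + (1 if i < extra else 0)
        if end > start:
            batches.append(items[start:end])
        start = end
    return batches


def process_batch(batch):
    """Placeholder for the per-batch work (e.g. one tool invocation)."""
    return [name.upper() for name in batch]


def run_in_parallel(items, num_workers):
    """Process every batch concurrently and return results in input order."""
    batches = make_batches(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        results = list(pool.map(process_batch, batches))
    return [out for batch_out in results for out in batch_out]
```

Because `pool.map` yields results in the order the batches were submitted, the flattened output lines up with the original input order.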

Training

Several models are available, each built from different components. To train the final BiLSTM-with-attention model, run:

python bilstm_attn_train.py --batch_size 16 --epochs 20 --learning_rate 0.0001  --hidden_dim 512 --num_layers 2
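
As a rough guide to what these flags mean, here is a sketch of how such a command line could be parsed with argparse. The flag names come from the command above; the defaults, help texts, and the `build_parser` helper are assumptions for illustration and may differ from what bilstm_attn_train.py actually does.

```python
import argparse


def build_parser():
    """Build a parser for the training flags (defaults are assumed)."""
    parser = argparse.ArgumentParser(
        description="Sketch of the BiLSTM-with-attention training options.")
    parser.add_argument("--batch_size", type=int, default=16,
                        help="number of samples per gradient update")
    parser.add_argument("--epochs", type=int, default=20,
                        help="number of passes over the training set")
    parser.add_argument("--learning_rate", type=float, default=0.0001,
                        help="optimizer step size")
    parser.add_argument("--hidden_dim", type=int, default=512,
                        help="size of the LSTM hidden state")
    parser.add_argument("--num_layers", type=int, default=2,
                        help="number of stacked LSTM layers")
    return parser


# Parse an explicit argument list so the sketch runs without a real CLI call.
args = build_parser().parse_args(["--batch_size", "32", "--epochs", "5"])
print(args.batch_size, args.epochs, args.learning_rate)
```

Flags left out of the argument list fall back to their defaults, so only the values being changed need to be passed.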

To run the CodeBERT model instead, use:

python bert.py

All results are stored in their corresponding log files.

Acknowledgments

Authors:

Djamel Mesbah ([email protected] / [email protected])
Nour El Madhoun ([email protected])
Hani Chalouati ([email protected])
Khaldoun Al Agha ([email protected])

This work relies on:

The MLCQ dataset
The Designite tool
The CodeBERT pretrained model from Hugging Face
