This is the repository for the experiments of the paper "Exploring NLP Techniques for Code Smell Detection: A Comparative Study." The study compares various NLP-based models for detecting code smells against a baseline model, highlighting their strengths and weaknesses.

Disclaimer

This research was conducted as part of a CIFRE doctoral program (Convention Industrielle de Formation par la Recherche) in collaboration with Adservio, Université Paris-Saclay, and ISEP.

Prerequisites

First, create a conda environment and install the dependencies:

conda create -n mlcqenv python=3.10
conda activate mlcqenv
conda install --file requirements.txt

To recreate the JSON file containing the code snippets from the paths specified in the MLCQ dataset:

  1. First, set up your GitHub token to communicate with the API; see the GitHub documentation for more details on creating a token.
  2. Export the acquired token as an environment variable: export GITHUB_TOKEN=<your_github_token>
  3. Run the DataExtractor script: python DataExtractor.py
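
For reference, here is a minimal sketch of what fetching a single snippet through the GitHub API can look like. The field names (repo, path, ref, start_line, end_line) are assumptions about how the MLCQ entries are structured; DataExtractor.py may organize this differently.

```python
import os
import requests

GITHUB_TOKEN = os.environ["GITHUB_TOKEN"]
HEADERS = {
    "Authorization": f"token {GITHUB_TOKEN}",
    # Ask the contents API for the raw file instead of a JSON/base64 payload.
    "Accept": "application/vnd.github.v3.raw",
}

def fetch_snippet(repo, path, ref, start_line, end_line):
    """Download a file from GitHub and keep only the annotated line range.

    `repo` is "owner/name", `ref` a commit SHA or branch; the line range is
    assumed to be 1-based and inclusive, matching the MLCQ annotations.
    """
    url = f"https://api.github.com/repos/{repo}/contents/{path}"
    resp = requests.get(url, headers=HEADERS, params={"ref": ref})
    resp.raise_for_status()
    lines = resp.text.splitlines()
    return "\n".join(lines[start_line - 1 : end_line])
```

The script presumably iterates over the MLCQ rows, collects the snippets this way, and writes them to the JSON file used by the later steps.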

Baseline

The baseline here is J48, a decision tree-based algorithm widely considered state-of-the-art for code smell detection (see [1] and [2]).

We first need to compute code metrics, as they are the features this model uses. To do so we use Designite; install it by following the official repository.

  1. Run python baseline/MetricsExtractor.py to prepare the code snippets as .java files.
  2. Run python baseline/DesigniteRun.py to execute Designite on the Java files, producing a DesigniteOutput file.
  3. Run python baseline/DatasetCreator.py to prepare the final dataset to feed to the model.
  4. Finally, run train.py to train and test the model (a minimal sketch of this step follows the tip below).

Tip: You can speed up the Designite processing by specifying the number of workers when using MetricsExtractor.py. This divides the dataset into batches, enabling parallel processing for faster execution.
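
To illustrate what the final training step amounts to, here is a minimal sketch with scikit-learn's DecisionTreeClassifier standing in for J48 (J48 is Weka's C4.5 implementation, so this is only an approximation). The file name metrics_dataset.csv and the smell label column are assumptions; the actual train.py may differ.

```python
# Illustrative sketch: train and evaluate a decision tree on Designite metrics.
# Assumption: DatasetCreator.py produces a CSV with one row per snippet,
# metric columns as features, and a "smell" column as the label.
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import classification_report

df = pd.read_csv("metrics_dataset.csv")        # hypothetical file name
X = df.drop(columns=["smell"])                 # code metrics as features
y = df["smell"]                                # smell label per snippet

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

clf = DecisionTreeClassifier(random_state=42)  # stand-in for Weka's J48 (C4.5)
clf.fit(X_train, y_train)
print(classification_report(y_test, clf.predict(X_test)))
```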

Training

There are several models, each with different components. To train the final BiLSTM-with-attention model, run:

python bilstm_attn_train.py --batch_size 16 --epochs 20 --learning_rate 0.0001  --hidden_dim 512 --num_layers 2
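
The flags map onto the usual hyperparameters of such a model. As a rough sketch (not the actual implementation in bilstm_attn_train.py), a BiLSTM-with-attention classifier typically looks like the following; vocab_size and num_classes are placeholders.

```python
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    """Minimal BiLSTM + attention classifier (illustrative only)."""

    def __init__(self, vocab_size, embed_dim=128, hidden_dim=512,
                 num_layers=2, num_classes=4):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers,
                            batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)      # scores each time step
        self.classifier = nn.Linear(2 * hidden_dim, num_classes)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer-encoded code tokens
        h, _ = self.lstm(self.embedding(token_ids))   # (batch, seq_len, 2*hidden)
        weights = torch.softmax(self.attn(h), dim=1)  # attention over time steps
        context = (weights * h).sum(dim=1)            # weighted sum of states
        return self.classifier(context)
```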

To run the CodeBERT model instead, run:

python bert.py
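
bert.py presumably fine-tunes a pretrained CodeBERT checkpoint on the snippets. A minimal sketch of loading that checkpoint with Hugging Face Transformers follows; num_labels and the example snippet are assumptions, and the training loop itself is omitted.

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# "microsoft/codebert-base" is the public CodeBERT checkpoint; num_labels=4
# is an assumption about the number of smell classes in the experiment.
tokenizer = AutoTokenizer.from_pretrained("microsoft/codebert-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "microsoft/codebert-base", num_labels=4
)

snippet = "public void doEverything() { /* ... */ }"
inputs = tokenizer(snippet, truncation=True, max_length=512, return_tensors="pt")
logits = model(**inputs).logits  # one score per smell class
```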

All the results will be stored in their corresponding log files.

Acknowledgments

Authors:

This work relies on:
