Spell-Correction

This project involves calculating s@k for misspelled tokens in Birkbeck corpus and tokens in Wordnet with the aid of MED (Minimum Edit Distance)

Dataset Details

Birkbeck
Total number of correct words : 6136
Total misspelled words : 36134

Wordnet
Total number of words (all correct) in the corpus : 147306

Environment setup

python 3.10 recommended
Use the following commands to setup the environment

pip install virtualenv
virtualenv venvs/spell_correction

# If Windows
venvs/spell_correction/Scripts/activate.bat
# else
source venvs/spell_correction/bin/activate

pip install -r requirements.txt

# [OPTIONAL] change any desired setting in param.py for a modified run

# run from the root folder 'Spell-Correction'
python -u assignment1.py

Parallelization

The primary step includes calculating the Levenshtein Distance for each pair of incorrect word in Birkbeck and a correct word in Wordnet
Because each step of calculation of a single instance of Levenshtein distance depends on three other dependent steps, the parallelization is complicated or quite impossible.
Moreover, considering step 2 as parallelized, we are left with calculating distance calculation for 36134 * 147306 pairs (!) This is at the same time space and time intensive
So, instead of trying to parallelize step 2, we parallelize step 3 in this code
We build a matrix (iw_matrix) by stacking up chunks of rows from multiple processes
Each row i consists of a list of tuple (j, d) where j is the index of the correct wordnet word and d is the distance between i-th incorrect word from Birkbeck and j-th correct word from Wordnet
We reduce the size of the iw_matrix by sorting the tuples in i-th row in ascending order and keeping only the k lowest columns (k = 10)
The total number of rows processed in Step 5-7 are distributed to max 40 cores of cpu in a server
The result of the multiprocessing Step 8 are collected and concatenated to produce the final sorted matrix. This matrix contains the indices of the k nearest words from Wordnet corpus for each incorrect word in Birkbeck
The matrix, Birkbeck and Wordnet corpora can be saved in the 'cache/' folder for faster loading in the later runs
s_at_k for 1, 5 and 10 for each incorrect word in Birkbeck are calculated by looking up into the iw_matrix and then the results are averaged accordingly

Run time Stats

The stats of the total program runtime with and without multiprocessing are shown in the table :

#Birkbeck Incorrect Words	#Wordnet Correct Words	1 CPU Core	40 CPU Cores	% Speed Boost
200	100000	0.4641 minutes	0.0734 minutes	84.18%
500	100000	1.0256 minutes	0.0953 minutes	90.70%
5000	100000	10.3997 minutes	0.5503 minutes	94.71%
36134	147306	Process killed	3.6430 minutes	~100%

Results

For the entire dataset, the average s@k was calculated for k = 1, 5 and 10

Average s@k for k = 1 : 0.4250013837383074
Average s@k for k = 5 : 0.4926661869707201
Average s@k for k = 10 : 0.5010516411136325

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
.idea		.idea
data		data
utils		utils
.gitignore		.gitignore
README.md		README.md
__init__.py		__init__.py
assignment_1.py		assignment_1.py
param.py		param.py
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Spell-Correction

Dataset Details

Environment setup

Parallelization

Run time Stats

Results

About

Releases

Packages

Languages

jamil2388/Spell-Correction

Folders and files

Latest commit

History

Repository files navigation

Spell-Correction

Dataset Details

Environment setup

Parallelization

Run time Stats

Results

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages