This project involves calculating s@k for misspelled tokens in Birkbeck corpus and tokens in Wordnet with the aid of MED (Minimum Edit Distance)
Birkbeck
Total number of correct words : 6136
Total misspelled words : 36134
Wordnet
Total number of words (all correct) in the corpus : 147306
python 3.10 recommended
Use the following commands to setup the environment
pip install virtualenv
virtualenv venvs/spell_correction
# If Windows
venvs/spell_correction/Scripts/activate.bat
# else
source venvs/spell_correction/bin/activate
pip install -r requirements.txt
# [OPTIONAL] change any desired setting in param.py for a modified run
# run from the root folder 'Spell-Correction'
python -u assignment1.py
- The primary step includes calculating the Levenshtein Distance for each pair of incorrect word in Birkbeck and a correct word in Wordnet
- Because each step of calculation of a single instance of Levenshtein distance depends on three other dependent steps, the parallelization is complicated or quite impossible.
- Moreover, considering step 2 as parallelized, we are left with calculating distance calculation for 36134 * 147306 pairs (!) This is at the same time space and time intensive
- So, instead of trying to parallelize step 2, we parallelize step 3 in this code
- We build a matrix (iw_matrix) by stacking up chunks of rows from multiple processes
- Each row i consists of a list of tuple (j, d) where j is the index of the correct wordnet word and d is the distance between i-th incorrect word from Birkbeck and j-th correct word from Wordnet
- We reduce the size of the iw_matrix by sorting the tuples in i-th row in ascending order and keeping only the k lowest columns (k = 10)
- The total number of rows processed in Step 5-7 are distributed to max 40 cores of cpu in a server
- The result of the multiprocessing Step 8 are collected and concatenated to produce the final sorted matrix. This matrix contains the indices of the k nearest words from Wordnet corpus for each incorrect word in Birkbeck
- The matrix, Birkbeck and Wordnet corpora can be saved in the 'cache/' folder for faster loading in the later runs
- s_at_k for 1, 5 and 10 for each incorrect word in Birkbeck are calculated by looking up into the iw_matrix and then the results are averaged accordingly
The stats of the total program runtime with and without multiprocessing are shown in the table :
#Birkbeck Incorrect Words | #Wordnet Correct Words | 1 CPU Core | 40 CPU Cores | % Speed Boost |
---|---|---|---|---|
200 | 100000 | 0.4641 minutes | 0.0734 minutes | 84.18% |
500 | 100000 | 1.0256 minutes | 0.0953 minutes | 90.70% |
5000 | 100000 | 10.3997 minutes | 0.5503 minutes | 94.71% |
36134 | 147306 | Process killed | 3.6430 minutes | ~100% |
For the entire dataset, the average s@k was calculated for k = 1, 5 and 10
Average s@k for k = 1 : 0.4250013837383074
Average s@k for k = 5 : 0.4926661869707201
Average s@k for k = 10 : 0.5010516411136325