Skip to content

debanjali05/Distinguish_Similar_Languages

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

15 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Distinguish Similar Languages

Implementation of a method to distinguish between similar languages such as Croatian, Bosnian and Serbian. We implement an ensemble of SVM and Naive Bayes classifiers using a soft voting classifier along with a character level n-gram (2-6) Tfidf Vectorizer as a feature extractor. The model predicts the class label based on the argmax of the sums of the predicted probabilities estimated from both the classifiers.

Dataset

DSL Corpus Collection (here) is used for this work. The training set consists of 18k lines for each language present in the dataset and the test set consists of 1000 lines for each language.

A subset of the DSLCC v4.0 for the group of similar languages: (Croatian - 'hr', Bosnian - 'bs' and ' Serbian - 'sr') is used for training and testing our model. The subset is generated using this code. This script generates 2 files: train.txt and test.txt (already generated and present in Data folder). The new training set and test set consists of 54k lines (18k * 3) and 3k (1000 * 3) respectively. Some simple preprocessing steps are also involved such as, lowercasing the text and removing digits, punctuations and extra spaces.

Note: The original dataset is not present in this repository. To generate the files again, please download the DSLCC v4.0 dataset in the Data folder before running the script:

python Data/subset_script.py

Also, update the correct path to the generated dataset (subset of the original dataset) in utils.py before running the model.

Requirement

  • Python 3.7.10
  • SciKit Learn 0.23.2
  • Matplotlib 3.3.2
  • Numpy 1.19.2
  • Pandas 1.1.3
  • Seaborn 0.11.0
  • Sklearn 0.0

Running

python language_detection.py

Output

Model Accuracy F1 score
Ensemble 75.5 % 0.7562

The confusion matrix generated in this case:

Confusion Matrix

A sample output can be found in output_ensemble.txt.

Advantages and Disadvantages

Advantage: Provides better accuracy than any of the single models on the whole dataset.

Disadvantages: Lack interpretability and are computationally expensive compared to training a single model.

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages