Distinguish Similar Languages

Implementation of a method to distinguish between similar languages such as Croatian, Bosnian and Serbian. We implement an ensemble of SVM and Naive Bayes classifiers combined through a soft voting classifier, with a character-level n-gram (2-6) TF-IDF vectorizer as the feature extractor. The model predicts the class label as the argmax of the summed class probabilities estimated by the two classifiers.
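The two ideas above, character n-gram features and soft voting, can be illustrated with a minimal standard-library sketch. This is not the repository's code: the function names `char_ngrams` and `soft_vote` and the probability values are hypothetical, and the TF-IDF weighting that the vectorizer applies on top of the raw n-gram counts is omitted. In scikit-learn, `VotingClassifier` with `voting="soft"` applies the same decision rule (argmax of the averaged probabilities, which selects the same label as the argmax of the sums).

```python
from collections import Counter


def char_ngrams(text, n_min=2, n_max=6):
    """Count every character n-gram of length n_min..n_max in text."""
    counts = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return counts


def soft_vote(*probabilities):
    """Return the label with the largest summed probability across classifiers."""
    totals = Counter()
    for probs in probabilities:
        for label, p in probs.items():
            totals[label] += p
    return max(totals, key=totals.get)


# Hypothetical per-class probabilities for one sentence (not real model output).
svm_probs = {"hr": 0.50, "bs": 0.40, "sr": 0.10}
nb_probs = {"hr": 0.20, "bs": 0.45, "sr": 0.35}

print(soft_vote(svm_probs, nb_probs))   # sums: hr 0.70, bs 0.85, sr 0.45
print(sorted(char_ngrams("dan", 2, 3)))
```

Note that the SVM alone would have picked 'hr' in this made-up case; summing the probabilities lets the second classifier's confidence shift the decision to 'bs'.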

Dataset

The DSL Corpus Collection (DSLCC) is used for this work. The training set contains 18k lines for each language in the dataset, and the test set contains 1000 lines for each language.

A subset of DSLCC v4.0 covering one group of similar languages (Croatian - 'hr', Bosnian - 'bs' and Serbian - 'sr') is used for training and testing our model. The subset is generated by a script that writes 2 files: train.txt and test.txt (already generated and present in the Data folder). The new training set and test set consist of 54k lines (18k * 3) and 3k lines (1000 * 3) respectively. Some simple preprocessing steps are also applied: lowercasing the text and removing digits, punctuation and extra spaces.
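A rough sketch of the subsetting and preprocessing described above might look like the following. It is not the repository's script: the names `preprocess` and `subset` are hypothetical, the sample lines are invented for illustration, and the input format (one sentence per line, followed by a tab and the language label) is an assumption about the DSLCC files. Replacing digits and punctuation with a space, rather than deleting them outright, is also a choice made here.

```python
import re

TARGET_LANGS = {"hr", "bs", "sr"}


def preprocess(text):
    """Lowercase, then strip digits, punctuation and extra spaces."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)          # digits
    text = re.sub(r"[^\w\s]|_", " ", text)    # punctuation (keeps accented letters)
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace


def subset(lines, langs=TARGET_LANGS):
    """Yield (clean_text, label) pairs for the selected languages only."""
    for line in lines:
        # Assumed format: "<sentence>\t<label>"
        text, sep, label = line.rstrip("\n").rpartition("\t")
        if sep and label in langs:
            yield preprocess(text), label


# Invented sample lines, not taken from the corpus.
sample = [
    "Vlada je 2015. usvojila novi zakon!\thr\n",
    "Ovo je primjer rečenice.\tbs\n",
    "Este é um exemplo.\tpt-PT\n",
]
for text, label in subset(sample):
    print(label, text)
```

The third sample line is dropped because its label is outside the selected group.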

Note: The original dataset is not present in this repository. To generate the files again, please download the DSLCC v4.0 dataset into the Data folder before running the script:

python Data/subset_script.py

Also, set the correct path to the generated dataset (the subset of the original dataset) in utils.py before running the model.

Requirements

Python 3.7.10
SciKit Learn 0.23.2
Matplotlib 3.3.2
Numpy 1.19.2
Pandas 1.1.3
Seaborn 0.11.0
Sklearn 0.0

Running

python language_detection.py

Output

Model	Accuracy	F1 score
Ensemble	75.5 %	0.7562

The run also generates a confusion matrix (figure not reproduced here).

A sample output can be found in output_ensemble.txt.
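For readers who want to see how the reported numbers relate to each other, here is a small standard-library sketch that builds a confusion matrix and derives accuracy and F1 from it. The README does not state which F1 averaging is used, so macro-averaging is an assumption here; the function names and the toy labels are hypothetical and are not the model's predictions.

```python
from collections import Counter

LABELS = ["hr", "bs", "sr"]


def confusion_matrix(y_true, y_pred, labels=LABELS):
    """Rows are true labels, columns are predicted labels."""
    pairs = Counter(zip(y_true, y_pred))
    return [[pairs[(t, p)] for p in labels] for t in labels]


def accuracy(matrix):
    """Fraction of samples on the diagonal."""
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    return correct / total


def macro_f1(matrix):
    """Unweighted mean of the per-class F1 scores (assumed averaging)."""
    scores = []
    for i in range(len(matrix)):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp
        fp = sum(row[i] for row in matrix) - tp
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only.
y_true = ["hr", "hr", "bs", "bs", "sr", "sr"]
y_pred = ["hr", "bs", "bs", "hr", "sr", "sr"]

m = confusion_matrix(y_true, y_pred)
print(m)
print(round(accuracy(m), 4), round(macro_f1(m), 4))
```

In this toy case the only errors are between 'hr' and 'bs', which shows up as off-diagonal counts in the first two rows.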

Advantages and Disadvantages

Advantage: The ensemble provides better accuracy than either of the single models on the whole dataset.

Disadvantages: The ensemble lacks interpretability and is more computationally expensive than training a single model.

debanjali05/Distinguish_Similar_Languages
