Language Identification Hackathon

South African Language Identification Hack 2022: EDSA 2201 & 2207 classification hackathon

Hackathon Overview: South African Language Identification Hack 2022

South Africa is a multicultural society that is characterised by its rich linguistic diversity. Language is an indispensable tool that can be used to deepen democracy and also contribute to the social, cultural, intellectual, economic and political life of the South African society.

The country is multilingual with 11 official languages, each of which is guaranteed equal status. Most South Africans are multilingual and able to speak two or more of the official languages. (Source: South African Government)

With such a multilingual population, it is only natural that our systems and devices should also communicate in multiple languages.

In this challenge, you will take text written in any of South Africa's 11 official languages and identify which language it is in. This is an example of Language Identification in NLP: the task of determining the natural language that a piece of text is written in.

Problem Statement

Given a piece of text written in any of South Africa's 11 official languages, predict which language it is written in. This is a Language Identification task, approached here as a multi-class text classification problem using various classification models.
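The README does not say which classification models the notebooks use, so the following is only a minimal sketch of one common baseline for language identification: multinomial naive Bayes over character trigrams, written with the standard library alone. The class name `NgramNaiveBayes`, the helper `char_ngrams`, and the toy training sentences are all hypothetical and invented for illustration; they are not taken from the NCHLT corpus.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n=3):
    """Return the overlapping character n-grams of a lowercased, space-padded string."""
    padded = f" {text.lower()} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams with add-one smoothing."""

    def __init__(self, n=3):
        self.n = n
        self.ngram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.doc_counts = Counter()               # label -> number of training texts
        self.vocab = set()                        # every n-gram seen in training

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text, self.n)
            self.ngram_counts[label].update(grams)
            self.doc_counts[label] += 1
            self.vocab.update(grams)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocab)
        grams = char_ngrams(text, self.n)
        best_label, best_score = None, -math.inf
        for label, counts in self.ngram_counts.items():
            total = sum(counts.values())
            # Log prior plus smoothed log likelihood of each n-gram.
            score = math.log(self.doc_counts[label] / total_docs)
            for gram in grams:
                score += math.log((counts[gram] + 1) / (total + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Toy sentences invented for illustration only (not from the dataset).
train_texts = [
    "the cat sits on the mat",
    "thank you very much for the help",
    "die kat sit op die mat",
    "baie dankie vir die hulp",
    "ngiyabonga kakhulu",
    "sawubona unjani namhlanje",
]
train_labels = ["eng", "eng", "afr", "afr", "zul", "zul"]

model = NgramNaiveBayes(n=3).fit(train_texts, train_labels)
print(model.predict("sawubona"))
```

Character n-grams are a reasonable starting point here because closely related languages often share whole words but differ in spelling patterns; on the real training set a library implementation with tuned n-gram ranges would likely be preferable to this hand-rolled version.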

Data overview

The dataset used for this challenge is the NCHLT Text Corpora collected by the South African Department of Arts and Culture & Centre for Text Technology (CTexT, North-West University, South Africa). The training set was improved through additional cleaning done by Praekelt.

The data is in the form Language ID, Text. The text is in various states of cleanliness. Some NLP techniques will be necessary to clean up the data.

File descriptions

train_set.csv - the training set
test_set.csv - the test set
sample_submission.csv - a sample submission file in the correct format
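Since the rows take the form Language ID, Text and the text needs cleaning, a first pass over the training file might look like the sketch below. The column names `lang_id` and `text` are assumptions (the README does not give the CSV header), the functions `clean_text` and `read_rows` are hypothetical helpers, and the two-row sample is made up so that the block runs without the real files. Which cleaning steps actually help is something to check against validation accuracy.

```python
import csv
import io
import re

URL_RE = re.compile(r"https?://\S+")
NON_LETTER_RE = re.compile(r"[^\w\s]|\d|_")  # punctuation, digits, underscores
SPACE_RE = re.compile(r"\s+")


def clean_text(text):
    """Lowercase, drop URLs, punctuation and digits, and collapse whitespace."""
    text = text.lower()
    text = URL_RE.sub(" ", text)
    text = NON_LETTER_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def read_rows(handle):
    """Yield (language_id, cleaned_text) pairs from an open CSV file handle.

    Assumes a header row with columns named 'lang_id' and 'text'.
    """
    for row in csv.DictReader(handle):
        yield row["lang_id"], clean_text(row["text"])


# Hypothetical two-row sample standing in for train_set.csv.
sample = io.StringIO(
    "lang_id,text\n"
    'eng,"Visit https://example.org NOW!!  Call 012 345."\n'
    'afr,"Goeie   môre, almal."\n'
)
rows = list(read_rows(sample))
print(rows)
```

To run this against the real data, replace the `io.StringIO` sample with `open("train_set.csv", newline="", encoding="utf-8")`. Letters with diacritics are kept on purpose, because several of the languages use them and they can be a useful signal.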

Language IDs

afr - Afrikaans
eng - English
nbl - isiNdebele
nso - Sepedi
sot - Sesotho
ssw - siSwati
tsn - Setswana
tso - Xitsonga
ven - Tshivenda
xho - isiXhosa
zul - isiZulu
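For reporting predictions in readable form, the Language ID list above can be kept as a lookup table. This is only a convenience sketch; the names `LANGUAGE_NAMES` and `language_name` are hypothetical.

```python
# Language IDs and names exactly as listed above.
LANGUAGE_NAMES = {
    "afr": "Afrikaans",
    "eng": "English",
    "nbl": "isiNdebele",
    "nso": "Sepedi",
    "sot": "Sesotho",
    "ssw": "siSwati",
    "tsn": "Setswana",
    "tso": "Xitsonga",
    "ven": "Tshivenda",
    "xho": "isiXhosa",
    "zul": "isiZulu",
}


def language_name(lang_id):
    """Return the language name for a Language ID, raising on unknown IDs."""
    try:
        return LANGUAGE_NAMES[lang_id]
    except KeyError:
        raise ValueError(f"unknown language ID: {lang_id!r}") from None


print(language_name("ven"))
```

Raising on unknown IDs, rather than returning a default, makes a mislabelled or malformed row in a submission file fail loudly.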

