Skip to content

South African Language Identification Hack 2022 EDSA 2201 & 2207 classification hackathon

Notifications You must be signed in to change notification settings

ElelwaniTshikovhi/Language-Identification-Hackathon

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

65 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Language Identification Hackathon

South African Language Identification Hack 2022 EDSA 2201 & 2207 classification hackathon

Hackathon Overview: South African Language Identification Hack 2022

South Africa is a multicultural society that is characterised by its rich linguistic diversity. Language is an indispensable tool that can be used to deepen democracy and also contribute to the social, cultural, intellectual, economic and political life of the South African society.

The country is multilingual with 11 official languages, each of which is guaranteed equal status. Most South Africans are multilingual and able to speak at least two or more of the official languages. From South African Government

With such a multilingual population, it is only obvious that our systems and devices also communicate in multi-languages.

In this challenge, you will take text which is in any of South Africa's 11 Official languages and identify which language the text is in. This is an example of NLP's Language Identification, the task of determining the natural language that a piece of text is written in.

Problem Statement

you will take text which is in any of South Africa's 11 Official languages and identify which language the text is in. This is an example of NLP's Language Identification, the task of determining the natural language that a piece of text is written in using various classification models.

Data overview

The dataset used for this challenge is the NCHLT Text Corpora collected by the South African Department of Arts and Culture & Centre for Text Technology (CTexT, North-West University, South Africa). The training set was improved through additional cleaning done by Praekelt.

The data is in the form Language ID, Text. The text is in various states of cleanliness. Some NLP techniques will be necessary to clean up the data. File descriptions

  • train_set.csv - the training set
  • test_set.csv - the test set
  • sample_submission.csv - a sample submission file in the correct format

Language IDs

  • afr - Afrikaans
  • eng - English
  • nbl - isiNdebele
  • nso - Sepedi
  • sot - Sesotho
  • ssw - siSwati
  • tsn - Setswana
  • tso - Xitsonga
  • ven - Tshivenda
  • xho - isiXhosa
  • zul - isiZulu

About

South African Language Identification Hack 2022 EDSA 2201 & 2207 classification hackathon

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published