Arabic Dialect Classification with Synthetic Data and Semi-Supervised Learning

Overview

This project investigates how synthetic data generated by Large Language Models (LLMs) affects Arabic dialect classification, using semi-supervised learning (SSL) to compensate for the shortage of labeled data. We experiment with both real and synthetic data and evaluate their effect on classifying Arabic dialects from the MADAR dataset. Our results suggest that synthetic data, combined with SSL, can partially mitigate the lack of labeled data and improve dialect classification in low-resource settings.

Abstract

The scarcity of labeled data makes it hard to train high-performance machine learning models, especially for underrepresented language varieties such as Arabic dialects. To address this, we generated synthetic data using LLMs and applied an SSL approach to make use of unlabeled data. Three main experiments were conducted:

Real Data + SSL: Using real data with SSL.
Real + Synthetic Data + SSL: Combining real data with synthetic data for SSL.
Synthetic Data + SSL: Using only synthetic data for SSL.

Our results show that semi-supervised models outperform supervised models trained on limited real data, while synthetic data, when used alone, performs within 5.3% of supervised models trained on a small portion of real data. These findings indicate that both SSL and synthetic data hold promise in overcoming data scarcity issues in Arabic dialect classification.

Objectives

The primary objectives of this project are:

To address the data scarcity issue by generating synthetic data using LLMs.
To mitigate the label shortage problem by leveraging SSL to utilize both labeled and unlabeled data.
To evaluate the impact of combining synthetic and real data in improving model performance for Arabic dialect classification.

Methodology

Figure 1. A parallel workflow for Arabic dialect classification that involves sampling real datasets and generating synthetic data using prompts, followed by data analysis, evaluation, and testing with different learning methods.

Data Generation

Real Data: We used the MADAR dataset, which includes samples from various Arabic dialects.
Synthetic Data: Synthetic sentences were generated using GPT-4o and GPT-4o-mini models for five dialect categories. These synthetic sentences were then utilized alongside real data to evaluate their impact.

Experiments

We conducted three primary experiments to assess the effectiveness of synthetic data:

Real Data + SSL: Evaluated SSL performance using only real data.
Real + Synthetic Data + SSL: Combined real and LLM-generated synthetic data for SSL, with models tested on real data.
Synthetic Data + SSL: Tested SSL using only synthetic data to determine its effectiveness as a substitute for real data.
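
A semi-supervised setup of this kind can be pictured as a training pool in which only part of the labels is visible to the model. The sketch below is a minimal illustration, assuming a 20% labeled fraction (the figure quoted in the Results) and a per-dialect split; the function name `mask_labels`, the `UNLABELED` marker, and the toy labels are hypothetical and are not taken from the repository.

```python
import random
from collections import defaultdict

UNLABELED = None  # hypothetical marker meaning "label hidden from the model"


def mask_labels(labels, labeled_fraction=0.2, seed=0):
    """Keep roughly `labeled_fraction` of the labels in each class, hide the rest."""
    rng = random.Random(seed)

    # Group example indices by class so every dialect keeps some labels.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    masked = [UNLABELED] * len(labels)
    for label, indices in by_class.items():
        rng.shuffle(indices)
        keep = max(1, round(labeled_fraction * len(indices)))
        for idx in indices[:keep]:
            masked[idx] = label
    return masked


# Toy pool: ten examples for each of two placeholder dialect labels.
labels = ["dialect_a"] * 10 + ["dialect_b"] * 10
masked = mask_labels(labels)
visible = [m for m in masked if m is not UNLABELED]
print(f"{len(visible)} of {len(masked)} labels remain visible")
```

Splitting per class keeps every dialect represented in the labeled portion, which matters when the labeled fraction is small. Synthetic sentences could be appended to the same pool, with or without their labels, depending on the experiment.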

Machine Learning Models

The models implemented include:

Stochastic Gradient Descent (SGD) Classifier
Logistic Regression
Linear Support Vector Classifier (SVC)
Passive-Aggressive Classifier
Multinomial Naive Bayes

All models were tested under both supervised and self-training settings to evaluate their effectiveness with different combinations of real and synthetic data.
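
The self-training setting can be sketched as a loop that fits a classifier on the labeled sentences, pseudo-labels the unlabeled sentences it is confident about, and refits. This is an illustrative sketch only: a small character-bigram Naive Bayes written in plain Python stands in for the classifiers listed above, and the confidence threshold of 0.9, the round limit, and the toy strings are assumptions rather than the project's settings.

```python
import math
from collections import Counter, defaultdict


def bigrams(text):
    """Character bigrams of the text, padded with spaces at both ends."""
    padded = f" {text} "
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


class TinyNaiveBayes:
    """Multinomial Naive Bayes over character bigrams with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_docs = Counter(labels)
        self.counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.counts[label].update(bigrams(text))
        self.vocab = {g for c in self.counts.values() for g in c}
        return self

    def predict_proba(self, text):
        n_docs = sum(self.class_docs.values())
        log_scores = {}
        for label, n in self.class_docs.items():
            # One extra slot in the denominator covers unseen bigrams.
            total = sum(self.counts[label].values()) + len(self.vocab) + 1
            score = math.log(n / n_docs)
            for g in bigrams(text):
                score += math.log((self.counts[label][g] + 1) / total)
            log_scores[label] = score
        # Normalize in log space to avoid underflow.
        top = max(log_scores.values())
        exp = {label: math.exp(s - top) for label, s in log_scores.items()}
        z = sum(exp.values())
        return {label: v / z for label, v in exp.items()}

    def predict(self, text):
        probs = self.predict_proba(text)
        return max(probs, key=probs.get)


def self_train(labeled, unlabeled, threshold=0.9, max_rounds=5):
    """Return the final model and the number of pseudo-labeled sentences."""
    texts = [t for t, _ in labeled]
    labels = [y for _, y in labeled]
    pool = list(unlabeled)
    n_pseudo = 0
    model = TinyNaiveBayes().fit(texts, labels)
    for _ in range(max_rounds):
        remaining, added = [], 0
        for text in pool:
            probs = model.predict_proba(text)
            label = max(probs, key=probs.get)
            if probs[label] >= threshold:
                # Confident prediction: adopt it as a pseudo-label.
                texts.append(text)
                labels.append(label)
                added += 1
            else:
                remaining.append(text)
        if added == 0:
            break
        n_pseudo += added
        pool = remaining
        model = TinyNaiveBayes().fit(texts, labels)
    return model, n_pseudo


# Toy placeholder strings, not real dialect data.
labeled = [("aaab aab", "dialect_a"), ("zzzy zzy", "dialect_b")]
unlabeled = ["aab aaa", "zzy zzz"]
model, n_pseudo = self_train(labeled, unlabeled)
print(n_pseudo, model.predict("aaa"), model.predict("zzz"))
```

A higher threshold admits fewer but cleaner pseudo-labels; a lower one grows the training set faster at the risk of reinforcing early mistakes.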

Results

The findings of our experiments are summarized as follows:

Self-Training Performance: Semi-supervised models trained with a mix of real and synthetic data significantly outperformed supervised models trained on only 20% of the real data, showing the value of SSL in low-resource environments.
Synthetic vs. Real Data: Models trained solely on synthetic data performed within 5.3% of supervised models trained on 20% real data, indicating that synthetic data can be an effective substitute when real data is limited.
Dialect-Specific Observations: The Egyptian dialect achieved the highest accuracy, while the Gulf dialect presented the most challenges across all models.

Figure 2. Comparison of real and generated datasets across dialects by training method and metric. Each graph shows precision, recall, or F1-score for the 100% supervised, 20% supervised, and self-training methods, highlighting differences in the average performance of the linear classifiers between the real and generated datasets across dialects.
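
For reference, per-dialect precision, recall, and F1-score follow the standard definitions based on true positives, false positives, and false negatives. The sketch below computes them in plain Python; the function name `per_class_scores` and the toy labels are hypothetical, and the numbers it prints are not results from this project.

```python
def per_class_scores(y_true, y_pred):
    """Precision, recall, and F1 for each label seen in y_true or y_pred."""
    scores = {}
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[label] = {"precision": precision, "recall": recall, "f1": f1}
    return scores


# Made-up gold labels and predictions, for illustration only.
y_true = ["dialect_a", "dialect_a", "dialect_b", "dialect_b"]
y_pred = ["dialect_a", "dialect_a", "dialect_a", "dialect_b"]
scores = per_class_scores(y_true, y_pred)
for label, s in scores.items():
    print(label, {k: round(v, 3) for k, v in s.items()})
```

Reporting the three metrics per dialect, rather than a single accuracy figure, makes it visible when one dialect is frequently confused with another.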

Key Insights

Data Scarcity: Combining SSL with synthetic data helps mitigate data scarcity and improves model performance when labeled data is limited.
Model Performance: Real data still outperforms generated data in all metrics, but generated data shows promise, especially when used with SSL.
Dialect Complexity: Gulf dialect remains a challenging category, requiring more nuanced methods for accurate classification.

How to Run the Code

Clone the Repository:

git clone https://github.com/SondosBsharat/Arabic-Dialect-SemiSupervised-LLM-AI-Project.git

Set Up the Environment:
- Install dependencies using the requirements.txt file:
```
pip install -r requirements.txt
```
Data Preparation:
- Prepare the datasets using the provided notebook data_selection.ipynb.
- Generate synthetic data using the provided notebook data_generation.ipynb.
Train and Evaluate Models:
- Use train_model.py to train models and evaluate.py to assess their performance.

Future Work

Future research directions include:

Exploring LLM Variations: Experimenting with synthetic data generated by other LLMs to evaluate their impact on model performance.
Benchmark Development: Creating more comprehensive benchmarks for Arabic dialect classification to evaluate the scalability of SSL with synthetic data.
Increasing Synthetic Data Volume: Generating larger synthetic datasets to assess improvements in model accuracy and generalization.

Acknowledgments

We extend our gratitude to Mohamed Bin Zayed University of Artificial Intelligence for supporting this research.

Contact

For questions or collaborations, please contact:

Sondos Bsharat, Mariam Barakat, Mena Attia
GitHub Repository: Arabic-Dialect-SemiSupervised-LLM-AI-Project
