Data Directory

This directory contains datasets used in the repository, organized as follows:

IMDb Dataset

The imdb directory includes data that has been processed and split into training, validation, and test sets.

Source

  • Original Dataset: IMDb Dataset
    • This dataset was created by the Stanford AI Lab and contains movie reviews along with sentiment polarity labels.

Files

  • train.csv (35,000 samples)
  • val.csv (5,000 samples)
  • test.csv (10,000 samples)

Generation

These files were generated using the script located at utils/preprocess_imdb_dataset.py. The script processes the original IMDb dataset to create three distinct splits: train, val, and test.
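The script itself is not reproduced in this README, so the following is only a minimal sketch of how a split with the sizes listed above could be produced. It assumes the 50,000 reviews are shuffled once and then partitioned. The function name `split_indices`, the `SPLIT_SIZES` mapping, and the seed are illustrative and not taken from utils/preprocess_imdb_dataset.py.

```python
import random

# Split sizes copied from the file list in this README.
# The names and the seed below are illustrative assumptions.
SPLIT_SIZES = {"train": 35_000, "val": 5_000, "test": 10_000}


def split_indices(n_rows, sizes=SPLIT_SIZES, seed=0):
    """Shuffle row indices and carve them into disjoint, named splits."""
    if sum(sizes.values()) != n_rows:
        raise ValueError("split sizes must add up to the number of rows")

    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility

    splits, start = {}, 0
    for name, size in sizes.items():
        splits[name] = indices[start:start + size]
        start += size
    return splits


splits = split_indices(50_000)
print({name: len(rows) for name, rows in splits.items()})
# prints {'train': 35000, 'val': 5000, 'test': 10000}
```

Each list of indices could then be used to select rows from the original dataset before writing train.csv, val.csv, and test.csv. Whether the real script stratifies by label or uses a different seed is not stated here.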

Bias-DeBiased Dataset

This file, debiased_profanity_check_with_keywords.csv, contains data related to profanity and bias checks, with specific keywords highlighted for analysis.
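The column layout of this file is not documented here, so a quick inspection before use may help. The sketch below uses only the Python standard library. The helper name `peek_csv` is hypothetical, and the path in the usage comment assumes the code runs from inside this directory.

```python
import csv
from itertools import islice


def peek_csv(path, n_rows=3):
    """Return the header and the first n_rows records of a CSV file."""
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        rows = list(islice(reader, n_rows))  # read only a few records
        return reader.fieldnames, rows


# Example usage (path is an assumption about the working directory):
# columns, sample = peek_csv("debiased_profanity_check_with_keywords.csv")
# print(columns)
```

Reading only a few records keeps the check cheap, and `csv.DictReader` handles quoted fields that contain commas or line breaks.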

Source

  • Dataset: Bias-DeBiased
    • This dataset is part of efforts to understand and mitigate bias in media texts, hosted on Hugging Face.

Reference Paper

Citation

If you use the Bias-DeBiased dataset provided in this directory, please cite it as follows:

@misc{raza2023newsmediabias,
  author = {Shaina Raza},
  title  = {News Media Bias},
  year   = {2023},
  url    = {https://huggingface.co/newsmediabias},
}