This repository has been archived by the owner on Sep 19, 2020. It is now read-only.

Dataset Language Statistics

We provide statistics about the relative and absolute prevalence of different languages in the dataset mix used during training of GPT-3.

"Characters" and "words" mean different things in different languages, so any attempt to count them is imperfect. We nonetheless hope these statistics are useful to readers. To support a wide variety of downstream analyses, we provide per-language summary counts at three levels: Unicode characters, whitespace-delimited words, and documents.
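To make the three counting levels concrete, here is a minimal sketch of how such per-language totals could be computed. It is an illustration only, not the procedure used to produce the statistics in this repository. It assumes each document already carries a language label, and the names `language_stats` and `relative_share` are hypothetical. Note that Python's `len` on a string counts Unicode code points, which can differ from user-perceived characters.

```python
from collections import defaultdict


def language_stats(docs):
    """Aggregate counts per language.

    `docs` is an iterable of (language_code, text) pairs; the language
    labels are assumed to be given (no language detection is done here).
    """
    totals = defaultdict(lambda: {"documents": 0, "words": 0, "characters": 0})
    for lang, text in docs:
        entry = totals[lang]
        entry["documents"] += 1
        # Whitespace-delimited words: split on any run of whitespace.
        entry["words"] += len(text.split())
        # Unicode characters: one count per code point.
        entry["characters"] += len(text)
    return dict(totals)


def relative_share(totals, key):
    """Return each language's fraction of the grand total for `key`."""
    grand_total = sum(entry[key] for entry in totals.values())
    if grand_total == 0:
        return {}
    return {lang: entry[key] / grand_total for lang, entry in totals.items()}


if __name__ == "__main__":
    sample = [("en", "hello world"), ("fr", "bonjour le monde"), ("en", "hi")]
    stats = language_stats(sample)
    print(stats)
    print(relative_share(stats, "documents"))
```

Whitespace splitting undercounts words in languages written without spaces between words, which is one reason the character-level and document-level counts are reported alongside the word-level ones.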