Script for bundling Common Voice (https://voice.mozilla.org) clips by language.
The script performs the following steps:
- Query the database for all clip data
- Download all those clips from S3, separated into language directories
- Write the clips metadata to a `clips.tsv` file with anonymized `client_id` values
- Analyze the clips metadata and assemble aggregate stats for `stats.json`
- Calculate the total duration of each dataset
- Prompt you to run `corpora-creator`, which will take the `clips.tsv` file and analyze it to create test/dev/train sets for machine learning purposes
- Create `.tar.gz` bundles according to your settings, usually one per language
- Create a checksum for each tarball
- Upload the tarball to S3
- Write the checksum to the `stats.json` file and also upload that to S3
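
Two of the steps above involve hashing: anonymizing `client_id` values and checksumming each tarball. The following is a minimal sketch of both, not the bundler's actual code. The choice of SHA-256, the function names, and the row shape are all assumptions made for illustration.

```javascript
const crypto = require('crypto');
const fs = require('fs');

// Assumption: SHA-256 hex digests. The bundler's real algorithm may differ.
function hashClientId(clientId) {
  return crypto.createHash('sha256').update(clientId).digest('hex');
}

// Hypothetical row shape: only the `client_id` field comes from the README.
// When skipHashing is true the row is returned unchanged.
function anonymizeRow(row, skipHashing = false) {
  if (skipHashing) return row;
  return { ...row, client_id: hashClientId(row.client_id) };
}

// Checksum a (possibly very large) tarball by reading it in 1 MiB chunks,
// so the whole archive never has to fit in memory.
function sha256File(filePath) {
  const hash = crypto.createHash('sha256');
  const fd = fs.openSync(filePath, 'r');
  const buf = Buffer.alloc(1024 * 1024);
  try {
    let bytesRead;
    while ((bytesRead = fs.readSync(fd, buf, 0, buf.length, null)) > 0) {
      hash.update(buf.slice(0, bytesRead));
    }
  } finally {
    fs.closeSync(fd);
  }
  return hash.digest('hex');
}
```

Hashing is deterministic, so the same speaker still maps to the same anonymized ID across clips, which is what lets later steps count speakers per language.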
To install and run the script:
- Install node (>= 8.3.0)
- Install yarn
- Install CorporaCreator
- Install mp3-duration-sum
- `git clone git@github.com:Common-Voice/common-voice-bundler.git`
- Override the keys defined in `config.js` with a `config.json` in the same dir
- `yarn`
- `yarn start`
- You will be prompted to run `corpora-creator` separately. Follow the instructions.
In order to run this, you need to override the default keys defined in `config.js` with a `config.json` in the same directory. At an absolute minimum, you will need:
- `releaseName`: the name of the release. This can take the form of an AWS key, and any `/` in the name will be treated as a directory separator
- `queryFile`: the name of the file that specifies the SQL query for a given dataset; see the `/queries` directory for past files
- the `db` object
- the `clipBucket` object
- the `outBucket` object (which refers to the bucket that the bundled dataset will be hosted on)
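
As a rough illustration of the override mechanism, the sketch below merges a `config.json`-style object over defaults and reports which required keys are still missing. The shallow merge, the helper names, and the example values are assumptions; the bundler's own loading logic may work differently.

```javascript
// Required top-level keys, as listed in the README.
const REQUIRED_KEYS = ['releaseName', 'queryFile', 'db', 'clipBucket', 'outBucket'];

// Assumption: overrides replace defaults key by key (shallow merge).
function mergeConfig(defaults, overrides) {
  return { ...defaults, ...overrides };
}

// Return the required keys that are absent or empty after merging.
function missingKeys(config) {
  return REQUIRED_KEYS.filter(
    key => config[key] === undefined || config[key] === null || config[key] === ''
  );
}

// Hypothetical values, for illustration only.
const example = mergeConfig(
  { releaseName: '', skipBundling: false },
  { releaseName: 'example-release/delta', queryFile: 'example.sql' }
);
console.log(missingKeys(example)); // the three bucket/db objects are still unset
```

Running a check like this before starting a long download can catch an incomplete `config.json` early.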
The other options are:
- `cutoffTime`: clips will only be downloaded if they were created before this time
- `startCutoffTime`: clips will only be included if they were created after this time. To be used for delta releases and in conjunction with `cutoffTime`
- `skipBundling`: this will do everything except bundle and upload clips (used mostly for testing)
- `skipCorpora`: this will do everything but skip waiting for you to create the corpora (used if the process was interrupted and you already have the appropriate corpora)
- `skipHashing`: this will skip hashing the client ID (used mostly for testing)
- `skipDownload`: this will skip downloading the clips and just create the `clips.tsv` (used mostly for testing)
- `skipMinorityCheck`: this will skip checking which languages have fewer than 5 speakers
- `skipReportedSentences`: this will not include the list of reported sentences in each dataset (used for the singleword target segment bundle)
- `startFromCorpora`: this will begin the whole process at the prompt for the corpora (used if the process was interrupted and you already have all the files and clips metadata)
- `singleBundle`: this will create a single archive with all languages, instead of one tar per language
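
To make the interaction between `cutoffTime` and `startCutoffTime` concrete, here is a small sketch of the window they define for a delta release. It is illustrative only: the real filtering may well happen in the SQL query, and the function name, the timestamp format, and the exclusive boundaries are assumptions.

```javascript
// Returns true when a clip's creation time falls inside the release window.
// Assumption: both boundaries are exclusive, and either option may be omitted.
function inReleaseWindow(createdAt, { startCutoffTime, cutoffTime } = {}) {
  const created = new Date(createdAt).getTime();
  if (cutoffTime && created >= new Date(cutoffTime).getTime()) return false;
  if (startCutoffTime && created <= new Date(startCutoffTime).getTime()) return false;
  return true;
}

// Hypothetical delta release covering the first half of 2020.
const deltaWindow = {
  startCutoffTime: '2020-01-01T00:00:00Z',
  cutoffTime: '2020-07-01T00:00:00Z',
};
console.log(inReleaseWindow('2020-03-15T12:00:00Z', deltaWindow)); // inside the window
console.log(inReleaseWindow('2019-11-30T12:00:00Z', deltaWindow)); // before the window
```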
You should run this script from a `tmux` session in the EC2 shell you're provided with, so that the script keeps running if your connection dies. Sometimes the script itself will die, in which case it will attempt to recover gracefully in the following ways:
- It will skip downloading files that are already on disk
- It will skip tarring/uploading language bundles that have already been successfully uploaded
- It will attempt to write to and load from `stats.json` as much as possible, so that you have in-progress stats even if the whole process doesn't finish
In addition, you can use the options specified above to resume from key points in the process instead of running through the entire process from scratch.
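
The first recovery behaviour (skipping files already on disk) can be sketched as a simple existence check. This is an assumption about how such a check might look, not the bundler's implementation; the function name and the `language/filename` path layout are hypothetical.

```javascript
const fs = require('fs');
const path = require('path');

// Given clip paths relative to a base directory (e.g. "en/clip.mp3"),
// return only those that are not on disk yet.
// Caveat: a file that exists is treated as complete, so a download that was
// cut off part-way would not be retried by this check alone.
function pendingDownloads(clipPaths, baseDir) {
  return clipPaths.filter(clipPath => !fs.existsSync(path.join(baseDir, clipPath)));
}
```

After an interrupted run, only the paths returned by a check like this would need to be fetched again.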
Known issues:
- `stats.json` has durations of 0 for some/all languages: `mp3-duration-sum` runs in the background after all the clips have been downloaded, and there is no signal when it completes other than the stats file receiving updated durations. If you skip corpora creation or if most of your tar files have already been created, the script may terminate before `mp3-duration-sum` has completed and updated the stats file. The workaround is to artificially pause the script by setting `skipCorpora` to `false`, and simply not moving on to the next stage until you've verified that the durations have been updated.
- `CorporaCreator` terminates or runs out of memory: The Corpora Creator is itself somewhat fragile, as it hasn't been substantially updated since it was created, and may need tweaking to run. You can narrow down the cause by creating a smaller version of `clips.tsv` from the first 10,000 rows using `head`, and then running `CorporaCreator` on the smaller file to identify whether the problem is the file size or your install. If the problem is the file size, you may need to upgrade to a larger EC2 instance. Contact IT-SE.
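
If you would rather stay in Node than shell out to `head`, the sketch below copies the first N lines of a file without loading the whole file into memory. It is an illustrative stand-in for `head -n`, not part of the bundler; the function name and output file name are hypothetical.

```javascript
const fs = require('fs');

// Copy the first `maxLines` lines of `src` to `dest`, reading in 64 KiB chunks.
// The header row, if any, counts as one of the lines.
function writeHead(src, dest, maxLines) {
  const input = fs.openSync(src, 'r');
  const output = fs.openSync(dest, 'w');
  const buf = Buffer.alloc(64 * 1024);
  let remaining = maxLines;
  try {
    let bytesRead;
    while (remaining > 0 && (bytesRead = fs.readSync(input, buf, 0, buf.length, null)) > 0) {
      let end = 0;
      while (remaining > 0 && end < bytesRead) {
        const newline = buf.indexOf(0x0a, end);
        if (newline === -1 || newline >= bytesRead) {
          // No further line break in this chunk: keep all of it and read on.
          end = bytesRead;
          break;
        }
        end = newline + 1;
        remaining -= 1;
      }
      fs.writeSync(output, buf, 0, end);
    }
  } finally {
    fs.closeSync(input);
    fs.closeSync(output);
  }
}

// Example: writeHead('clips.tsv', 'clips-small.tsv', 10000);
```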