Repository containing current experiments related to the simi (semantic) task of the ZeroSpeech 2021 Challenge.
We can score a phone-level segmentation by running it through a simple SIMI solution and then rating it with the official `zerospeech2021-evaluate` tool. To do so, we need a phone-segmented dataset and the `semantic/dev/**` subsets. An example of how to prepare such segmentations is in the script `simi/task/prepare_segmentation.sh`.
To score segmentations use the file `simi/task/run.sh`. You will need to change these parameters:
- `VOCAB_SIZE` - the size of the sentencepiece vocabulary; it should be roughly 50000
- `TRAINSET_PATH` - path to the phonemized dataset
- `TESTSET_(LIBRISPEECH|SYNTHETIC)_PATH` - paths to the phonemized test sets: `semantic/dev/librispeech` and `semantic/dev/synthetic`
- `OUTPUT_PATH` - directory in which two folders (`librispeech` and `synthetic`) containing the semantic vectors will be created, in the format accepted by `zerospeech2021-evaluate`
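For concreteness, here is a minimal sketch of producing the `OUTPUT_PATH` layout described above. The per-utterance text-file format is an assumption based on the challenge baseline, and `save_semantic_vectors` is a hypothetical helper, not part of this repository:

```python
from pathlib import Path

import numpy as np

def save_semantic_vectors(vectors, output_path):
    """vectors: dict mapping (subset, utterance_id) -> 2-D array of semantic
    feature vectors, where subset is 'librispeech' or 'synthetic'.
    Writes one space-separated text file per utterance (assumed format)."""
    for (subset, utt_id), mat in vectors.items():
        out_dir = Path(output_path) / subset
        out_dir.mkdir(parents=True, exist_ok=True)
        np.savetxt(out_dir / f"{utt_id}.txt", mat)
```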
Example script: `scoring/run.sh`
We compute the PER (Phone Error Rate) of a clusterization by greedily mapping every sentence piece to its most frequent ground-truth phone. Then we do a greedy alignment and compute the mismatch error rate; a minimal sketch follows the parameter list below.
Parameters:
- `gt`: path to the ground-truth segmentation, e.g. `/pio/data/zerospeech2021/librispeech_alignments/<librispeech_subset>`
- `quantized`: path to the segmentation you want to rate
- `frame_shift`: offset in number of frames; in most cases it should be 0
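The sketch below illustrates the idea under the simplifying assumption of pre-aligned frame-level labels (the actual tool aligns variable-length sequences greedily); `piece_frames` and `phone_frames` are hypothetical inputs, not the script's real interface:

```python
from collections import Counter, defaultdict

def per_of_clusterization(piece_frames, phone_frames):
    """piece_frames / phone_frames: per-utterance sequences of frame-level
    sentence-piece ids and ground-truth phones (hypothetical input format)."""
    # 1) Count piece/phone co-occurrences over all frames.
    cooccurrence = defaultdict(Counter)
    for pieces, phones in zip(piece_frames, phone_frames):
        for piece, phone in zip(pieces, phones):
            cooccurrence[piece][phone] += 1
    # 2) Greedily map every sentence piece to its most frequent phone.
    piece2phone = {p: c.most_common(1)[0][0] for p, c in cooccurrence.items()}
    # 3) Mismatch error rate of the mapped frames.
    errors = total = 0
    for pieces, phones in zip(piece_frames, phone_frames):
        for piece, phone in zip(pieces, phones):
            errors += piece2phone[piece] != phone
            total += 1
    return errors / total
```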
Command: `python simi/statistics/pieces.py <segmentation_path>`
It prints the number of pieces and the max/min/mean piece counts.
Example script: `simi/clusterization/run.sh`
This is a script for running the `CPC_kmeans` checkpoint, but without the final argmax (we leave each frame as a distribution over the centroids); a sketch of this soft assignment appears below. This is required for the Viterbi segmentation.
Unlike the other scripts, it uses the `zerospeech2021_baseline` conda environment. For installation, please follow this chapter: https://github.com/bootphon/zerospeech2021_baseline#installation
To get a description of the parameters, run `python clusterization.py --help`.
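To illustrate the "distribution on centroids" idea, here is a sketch under the assumption that the distribution comes from a softmax over negative squared distances to the k-means centroids; the checkpoint's exact scoring may differ, and `temperature` is a hypothetical knob:

```python
import numpy as np

def soft_assignments(features, centroids, temperature=1.0):
    """features: (n_frames, dim) array; centroids: (n_centroids, dim) array
    from the k-means checkpoint. Returns an (n_frames, n_centroids) array
    whose rows sum to 1, i.e. a distribution over centroids per frame."""
    # Squared Euclidean distance from every frame to every centroid.
    d2 = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    # Softmax over negative distances instead of the usual argmax.
    logits = -d2 / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```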
Example script: `simi/segmentation/run.sh`
We run sentencepiece on the given trainset, then use the learnt language model to predict the best segmentation. There are two ways of doing so:
- use sentencepiece's default segmentation (the default); this works poorly because our language is not very "exact" - some pseudophones (the output of CPC+clustering) may be mismatched
- use Viterbi segmentation (flag `--viterbi`), which takes these errors into account and usually gives a lower PER; see the sketch below
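The following sketch shows the dynamic-programming core of such a Viterbi segmentation, assuming a plain unigram language model over pieces (the repository's version additionally accounts for pseudophone errors, e.g. via the centroid distributions above):

```python
import math

def viterbi_segment(units, piece_logprob, max_len=8):
    """Segment a pseudophone sequence `units` into pieces maximizing the
    unigram LM score. piece_logprob: dict mapping a piece (tuple of units)
    to its log probability. Assumes every single unit is itself in the
    vocabulary, so a full segmentation path always exists."""
    n = len(units)
    best = [-math.inf] * (n + 1)  # best[i]: best score of units[:i]
    back = [0] * (n + 1)          # back[i]: start of the last piece
    best[0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(0, i - max_len), i):
            score = piece_logprob.get(tuple(units[j:i]))
            if score is not None and best[j] + score > best[i]:
                best[i] = best[j] + score
                back[i] = j
    # Recover the segmentation by walking the backpointers.
    pieces, i = [], n
    while i > 0:
        pieces.append(tuple(units[back[i]:i]))
        i = back[i]
    return pieces[::-1]
```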
To get a description of the parameters, run `python segmentation.py --help`.
You can also train sentencepiece and evaluate segmentation separately, using the scripts `train_sentencepiece.py` and `eval_segmentation.py`. Run `python <SCRIPT_NAME> --help` for more info.
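For reference, a minimal example of what the separate training and encoding steps boil down to with the `sentencepiece` Python package (the file names and vocabulary size are illustrative, not the repository's actual configuration):

```python
import sentencepiece as spm

# Train a unigram model on the pseudophone corpus
# (one utterance per line in trainset.txt; illustrative path).
spm.SentencePieceTrainer.train(
    input="trainset.txt", model_prefix="pseudo",
    vocab_size=50000, model_type="unigram")

# Load the model and segment an utterance with the default strategy.
sp = spm.SentencePieceProcessor(model_file="pseudo.model")
print(sp.encode("abbacdda", out_type=str))
```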
Example script: `simi/grouping/run.sh`
The idea is based on the fact that segmentation with sentencepiece requires a large vocabulary to work well, but this results in multiple different sentence pieces mapping to the same phoneme. To reduce the number of sentence pieces, we run word2vec on them and then k-means, grouping and merging multiple sentence pieces together (see the sketch below). This never increases the PER, but it (greatly) reduces the vocabulary size.
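A sketch of this pipeline, assuming gensim's word2vec and scikit-learn's k-means; the group count and embedding sizes are illustrative, and the real parameters are exposed by `grouping.py`:

```python
from gensim.models import Word2Vec
from sklearn.cluster import KMeans

def group_pieces(segmented_corpus, n_groups=500):
    """segmented_corpus: list of utterances, each a list of sentence-piece
    strings. Returns a piece -> group id mapping. All sizes are illustrative."""
    # 1) Embed the sentence pieces with word2vec, treating pieces as words.
    w2v = Word2Vec(sentences=segmented_corpus, vector_size=100,
                   window=5, min_count=1, epochs=10)
    pieces = list(w2v.wv.index_to_key)
    vectors = w2v.wv[pieces]
    # 2) Cluster the embeddings; pieces in the same cluster get merged.
    kmeans = KMeans(n_clusters=n_groups, n_init=10).fit(vectors)
    return dict(zip(pieces, kmeans.labels_))
```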
To get a description of the parameters, run `python grouping.py --help`.