See #1546 for more details.
Number of model parameters: 72526519, i.e., 72.53 M
The best CER for Common Voice 16.1 (cv-corpus-16.1-2023-12-06/zh-HK) is below:
| decoding method      | Dev  | Test | Note               |
|----------------------|------|------|--------------------|
| greedy_search        | 1.17 | 1.22 | --epoch 24 --avg 5 |
| modified_beam_search | 0.98 | 1.11 | --epoch 24 --avg 5 |
| fast_beam_search     | 1.08 | 1.27 | --epoch 24 --avg 5 |
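The CER reported above is the character-level edit distance between the reference and the hypothesis, divided by the reference length. A minimal sketch of the metric (not the icefall scorer itself, and with hypothetical strings):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance via single-row dynamic programming."""
    m, n = len(ref), len(hyp)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,          # deletion
                        dp[j - 1] + 1,      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character error rate: edits per reference character."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

print(cer("你好世界", "你好世介"))  # 0.25: one substitution over four characters
```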
For cross-corpus validation on MDCC (without blank penalty), the best CER is below:
| decoding method      | Dev   | Test  | Note               |
|----------------------|-------|-------|--------------------|
| greedy_search        | 42.40 | 42.03 | --epoch 24 --avg 5 |
| modified_beam_search | 39.73 | 39.19 | --epoch 24 --avg 5 |
| fast_beam_search     | 42.14 | 41.98 | --epoch 24 --avg 5 |
For cross-corpus validation on MDCC (with the blank penalty set to 2.2), the best CER is below:
| decoding method      | Dev   | Test  | Note                                   |
|----------------------|-------|-------|----------------------------------------|
| greedy_search        | 39.19 | 39.09 | --epoch 24 --avg 5 --blank-penalty 2.2 |
| modified_beam_search | 37.73 | 37.65 | --epoch 24 --avg 5 --blank-penalty 2.2 |
| fast_beam_search     | 37.73 | 37.74 | --epoch 24 --avg 5 --blank-penalty 2.2 |
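The blank penalty subtracts a constant from the blank symbol's log-probability at each decoding step, discouraging the model from emitting blank (i.e., deleting characters) on out-of-domain audio such as MDCC. A minimal pure-Python sketch of the idea for a single greedy step (blank assumed at index 0, logit values hypothetical):

```python
import math

def greedy_step(logits, blank_penalty=0.0, blank_id=0):
    """One greedy decoding step: log-softmax, penalise blank, take argmax."""
    z = max(logits)
    log_sum = z + math.log(sum(math.exp(x - z) for x in logits))
    log_probs = [x - log_sum for x in logits]
    log_probs[blank_id] -= blank_penalty  # constant penalty in log space
    return max(range(len(log_probs)), key=log_probs.__getitem__)

logits = [2.0, 1.8, 0.5]                        # blank (index 0) slightly preferred
print(greedy_step(logits))                      # 0: blank wins
print(greedy_step(logits, blank_penalty=2.2))   # 1: blank is penalised away
```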
To reproduce the above result, use the following commands for training:
```bash
export CUDA_VISIBLE_DEVICES="0,1"
./zipformer/train_char.py \
  --world-size 2 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir zipformer/exp \
  --cv-manifest-dir data/zh-HK/fbank \
  --language zh-HK \
  --use-validated-set 1 \
  --context-size 1 \
  --max-duration 1000
```
and the following commands for decoding:
```bash
for method in greedy_search modified_beam_search fast_beam_search; do
  ./zipformer/decode_char.py \
    --epoch 24 \
    --avg 5 \
    --decoding-method $method \
    --exp-dir zipformer/exp \
    --cv-manifest-dir data/zh-HK/fbank \
    --context-size 1 \
    --language zh-HK
done
```
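The flags `--epoch 24 --avg 5` decode with an average of the last five checkpoints (epochs 20 through 24). icefall's actual averaging is more involved, but the core idea, a uniform average of parameter values, can be sketched as follows (state dicts and values are hypothetical; real checkpoints hold tensors rather than floats):

```python
def average_states(states):
    """Uniformly average parameter values across a list of state dicts."""
    keys = states[0].keys()
    return {k: sum(s[k] for s in states) / len(states) for k in keys}

# e.g. the five checkpoints epoch-20.pt .. epoch-24.pt selected by
# --epoch 24 --avg 5, each with a single fake parameter "w":
states = [{"w": float(e)} for e in range(20, 25)]
print(average_states(states))  # {'w': 22.0}
```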
Detailed experimental results and the pre-trained model are available at: https://huggingface.co/zrjin/icefall-asr-commonvoice-zh-HK-zipformer-2024-03-20
See #997 for more details.
Number of model parameters: 70369391, i.e., 70.37 M
Note that this result was obtained with a BPE model trained on GigaSpeech transcripts.
The best WER, as of 2023-04-17, for Common Voice English 13.0 (cv-corpus-13.0-2023-03-09/en) is below:
| decoding method      | Dev  | Test  |
|----------------------|------|-------|
| greedy_search        | 9.96 | 12.54 |
| modified_beam_search | 9.86 | 12.48 |
To reproduce the above result, use the following commands for training:
```bash
export CUDA_VISIBLE_DEVICES="0,1,2,3"
./pruned_transducer_stateless7/train.py \
  --world-size 4 \
  --num-epochs 30 \
  --start-epoch 1 \
  --use-fp16 1 \
  --exp-dir pruned_transducer_stateless7/exp \
  --max-duration 550
```
and the following commands for decoding:
```bash
# greedy search
./pruned_transducer_stateless7/decode.py \
  --epoch 30 \
  --avg 5 \
  --decoding-method greedy_search \
  --exp-dir pruned_transducer_stateless7/exp \
  --bpe-model data/en/lang_bpe_500/bpe.model \
  --max-duration 600

# modified beam search
./pruned_transducer_stateless7/decode.py \
  --epoch 30 \
  --avg 5 \
  --decoding-method modified_beam_search \
  --beam-size 4 \
  --exp-dir pruned_transducer_stateless7/exp \
  --bpe-model data/en/lang_bpe_500/bpe.model \
  --max-duration 600
```
The pretrained model is available at https://huggingface.co/yfyeung/icefall-asr-cv-corpus-13.0-2023-03-09-en-pruned-transducer-stateless7-2023-04-17
See #1018 for more details.
Number of model parameters: 70369391, i.e., 70.37 M
The best WER for Common Voice French 12.0 (cv-corpus-12.0-2022-12-07/fr) is below:
| decoding method      | Test |
|----------------------|------|
| greedy_search        | 9.95 |
| modified_beam_search | 9.57 |
| fast_beam_search     | 9.67 |
Note: this best result comes from a model trained on the full LibriSpeech and GigaSpeech datasets and then fine-tuned on the full Common Voice French data.
Detailed experimental results and the pretrained model are available at https://huggingface.co/shaojieli/icefall-asr-commonvoice-fr-pruned-transducer-stateless7-streaming-2023-04-02