Commonvoice import cv2 py error #617

JRMeyer · 2021-03-08T03:08:08Z

JRMeyer
Mar 8, 2021
Maintainer

>>> tuttlebr
[April 30, 2019, 3:25am]

Hello,

I am trying to run the import_cv2.py script. I have downloaded the
english commonvoice release and my environment is a docker container
from the repo.

The input

python3 bin/import_cv2.py --filter_alphabet data/alphabet.txt data/cv/en/ --normalize
(The alphabet file is also from the repo. Same result with or without
--normalize.)

The error

Saving new DeepSpeech-formatted CSV file to: data/cv/en/clips/train.csv
Traceback (most recent call last):
  File "bin/import_cv2.py", line 158, in <module>
    _preprocess_data(params.tsv_dir, audio_dir, label_filter)
  File "bin/import_cv2.py", line 43, in _preprocess_data
    _maybe_convert_set(input_tsv, audio_dir, label_filter)
  File "bin/import_cv2.py", line 56, in _maybe_convert_set
    for row in reader:
  File "/usr/lib/python3.6/csv.py", line 111, in __next__
    self.fieldnames
  File "/usr/lib/python3.6/csv.py", line 98, in fieldnames
    self._fieldnames = next(self.reader)
  File "/usr/lib/python3.6/encodings/ascii.py", line 26, in decode
    return codecs.ascii_decode(input, self.errors)[0]
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 8085: ordinal not in range(128)

I'm guessing this is some mp3 prefix or buffer, but I'm not really
familiar with it. I searched the GitHub repo as well as this forum and
found no prior issues. Perhaps I'm not running it correctly.
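
A note on the byte in the traceback: 0xe2 is unlikely to be mp3-related, since the error is raised while the csv module is reading the TSV file. 0xe2 is the first byte of many three-byte UTF-8 sequences, for example the curly apostrophe U+2019. Which character actually sits at position 8085 cannot be told from the traceback, so the sketch below only illustrates the failure mode with an assumed sample string.

```python
# Illustration only: the sample text is an assumption, not taken from the corpus.
sample = "it\u2019s"            # "it's" written with a curly apostrophe (U+2019)
raw = sample.encode("utf-8")    # the apostrophe becomes the bytes e2 80 99


def try_decode(data: bytes, encoding: str) -> str:
    """Return the decoded text, or the error message if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        return f"error: {err}"


print(try_decode(raw, "utf-8"))  # decodes cleanly
print(try_decode(raw, "ascii"))  # fails on byte 0xe2, the same kind of error as above
```

The same bytes decode fine as UTF-8 and fail as ASCII, which points at the encoding Python picked for the file rather than at the data itself.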

[This is an archived TTS discussion thread from discourse.mozilla.org/t/commonvoice-import-cv2-py-error]

JRMeyer · 2021-03-08T03:08:10Z

>>> lissyx
[April 30, 2019, 6:53am]

Strange, we had no issues. Can you give more context on your system, and
which release of Common Voice are you using? Also, can you try without
--filter_alphabet?

[Archived Post]

JRMeyer · 2021-03-08T03:08:13Z

>>> Tilman_Kamp
[April 30, 2019, 8:33am]

Just imported CV2-en with latest master.

DeepSpeech$ python3 bin/import_cv2.py --normalize --filter_alphabet data/alphabet.txt /work/CommonVoice/en
Loading TSV file: /work/CommonVoice/en/train.tsv
Saving new DeepSpeech-formatted CSV file to: /work/CommonVoice/en/clips/train.csv
Importing mp3 files...
Progress |####################################| 100% completed
Writing CSV file for DeepSpeech.py as: /work/CommonVoice/en/clips/train.csv
Progress |####################################| 100% completed
Imported 12123 samples.
Skipped 72 samples that failed on transcript validation.
Skipped 12 samples that were longer than 10 seconds.
Loading TSV file: /work/CommonVoice/en/test.tsv
Saving new DeepSpeech-formatted CSV file to: /work/CommonVoice/en/clips/test.csv
Importing mp3 files...
Progress |####################################| 100% completed
Writing CSV file for DeepSpeech.py as: /work/CommonVoice/en/clips/test.csv
Progress |####################################| 100% completed
Imported 6804 samples.
Skipped 306 samples that failed on transcript validation.
Skipped 212 samples that were longer than 10 seconds.
Loading TSV file: /work/CommonVoice/en/dev.tsv
Saving new DeepSpeech-formatted CSV file to: /work/CommonVoice/en/clips/dev.csv
Importing mp3 files...
Progress |####################################| 100% completed
Writing CSV file for DeepSpeech.py as: /work/CommonVoice/en/clips/dev.csv
Progress |####################################| 100% completed
Imported 6940 samples.
Skipped 3 samples that failed upon conversion.
Skipped 268 samples that failed on transcript validation.
Skipped 73 samples that were longer than 10 seconds.
Progress |####################################| 100% completed
Progress |####################################| 100% completed
Progress |####################################| 100% completed

Maybe your archive is different - could you run md5sum on it?

DeepSpeech$ md5sum /work/CommonVoice/en.tar.gz
a639b0e22b969d76abe1c40beb0d3439 /work/CommonVoice/en.tar.gz
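
If md5sum is not available (for example in a minimal container), the same check can be sketched in Python with the standard hashlib module. The helper name `md5_of_file` and the chunk size are illustrative choices, not anything from the DeepSpeech repo.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        # iter() with a sentinel keeps calling read() until it returns b"" at end of file.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

Calling `md5_of_file("/work/CommonVoice/en.tar.gz")` and comparing the result with the checksum above should tell you whether the archives match.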

[Archived Post]

JRMeyer · 2021-03-08T03:08:15Z

>>> tuttlebr
[April 30, 2019, 12:01pm]

I tried the process outside of Docker, and it works with --alphabet. The
md5sum checks out as well. I can continue to troubleshoot and will post if
I find the root cause. It must be my container. I'm going to prune my
system and rebuild the image/container. It looks like the Dockerfile was
also updated a couple of days ago, so I'll use the update and see what
happens.

[Archived Post]

JRMeyer · 2021-03-08T03:08:18Z

>>> tuttlebr
[May 4, 2019, 4:15am]

I've since resolved this by not running the pre-processing steps within
a container but thought I'd just follow up.

Ubuntu 18.04
Docker version 18.09.5, build e8ff056
nvidia-docker 2.0
Common Voice 2.0 (whatever's currently on the Mozilla website as of my
original post)
Dockerfile

brandondaedalus:/Documents/DeepSpeech$ docker build -f Dockerfile -t deepspeech .
brandondaedalus:/Documents/DeepSpeech$ nvidia-docker run -it -d \
    -p 8888:8888 -p 6006:6006 \
    -u $(id -u):$(id -g) \
    -e HOME=/home/$USER \
    -v /home/$USER:/home/$USER \
    -v /home/brandon/raid/share:/ncc1701 \
    deepspeech
brandondaedalus:/Documents/DeepSpeech$ docker exec -it 7f1423f7c6d1 bash

Process for getting audio outside Docker which worked:

brandondaedalus:/Documents/DeepSpeech/data$ wget https://voice-prod-bundler-ee1969a6ce8178826482b88e843c335139bd3fb4.s3.amazonaws.com/cv-corpus-1/en.tar.gz
brandondaedalus:/Documents/DeepSpeech/data$ tar -xzf en.tar.gz
brandondaedalus:/Documents/DeepSpeech$ python3 bin/import_cv2.py --filter_alphabet data/alphabet.txt data/cv/en/

After handling the process outside of Docker, I haven't had any
issues training.

[Archived Post]

JRMeyer · 2021-03-08T03:08:21Z

>>> geotou
[July 8, 2019, 10:52am]

I fixed this by setting the Docker image locale to UTF-8.

apt update
apt install locales

locale-gen en_US.UTF-8

export LANG=en_US.UTF-8
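
To confirm from inside the container that the locale is the cause, Python can report the encoding it falls back to for text files. The sketch below also reads a tab-separated file with an explicit encoding, which sidesteps the locale entirely. It assumes the importer opens the TSV without an encoding argument (the traceback suggests so, but I have not checked the script); `read_tsv_rows` is a hypothetical helper, not part of import_cv2.py.

```python
import csv
import locale


def default_text_encoding() -> str:
    """Encoding that open() falls back to when no encoding argument is given."""
    return locale.getpreferredencoding(False)


def read_tsv_rows(path: str) -> list:
    """Hypothetical helper: read a TSV as UTF-8 regardless of the container locale."""
    with open(path, encoding="utf-8", newline="") as handle:
        return list(csv.DictReader(handle, delimiter="\t"))


# On Python 3.6 with an ASCII-only locale this usually prints an ASCII alias
# such as ANSI_X3.4-1968; after exporting LANG=en_US.UTF-8 it should print UTF-8.
print(default_text_encoding())
```

Setting the locale as above fixes every script in the container at once, whereas passing an explicit encoding only fixes the one call site, so the two approaches are complementary.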

[Archived Post]

JRMeyer · 2021-03-08T03:08:23Z

>>> lissyx
[June 14, 2019, 5:47pm]

Thanks, do not hesitate to send a PR if needed!

[Archived Post]
