>>> elpimous_robot | January 26, 2018, 5:13pm
### Tutorial: How to build your homemade DeepSpeech model from scratch
Adapt links and params to your needs...
For my robotic project, I needed to create a small mono-speaker model, with nearly 1000 sentence orders (not just single words!).
I recorded the wav's with a ReSpeaker Microphone Array.
Wav's were recorded with the following params: mono / 16 bit / 16 kHz.
The Google VAD lib helped me limit the silence before/after each wav, but DeepSpeech seems to process the wav's unnecessary silence too. (And, as Kdavis told me, removing silence before processing reduces the time spent on model creation!)
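To give an idea, here is a minimal trimming sketch, assuming the webrtcvad Python package and the wav params above (16 kHz / 16 bit / mono); the frame size and aggressiveness values are just illustrative choices:
```
# Minimal sketch: trim leading/trailing silence with webrtcvad,
# assuming 16 kHz / 16 bit / mono wav's as recorded above.
import wave
import webrtcvad

def trim_silence(in_path, out_path, aggressiveness=2, frame_ms=30):
    with wave.open(in_path, 'rb') as w:
        rate = w.getframerate()
        pcm = w.readframes(w.getnframes())
    vad = webrtcvad.Vad(aggressiveness)            # 0 = permissive, 3 = aggressive
    frame_bytes = int(rate * frame_ms / 1000) * 2  # 16-bit mono samples
    frames = [pcm[i:i + frame_bytes]
              for i in range(0, len(pcm) - frame_bytes + 1, frame_bytes)]
    voiced = [i for i, f in enumerate(frames) if vad.is_speech(f, rate)]
    if voiced:                                     # keep first to last voiced frame
        pcm = b''.join(frames[voiced[0]:voiced[-1] + 1])
    with wave.open(out_path, 'wb') as w:
        w.setnchannels(1)
        w.setsampwidth(2)
        w.setframerate(rate)
        w.writeframes(pcm)
```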
MATERIAL PREPARATION :
Prepare one textfile containing all the wav transcripts, one sentence per line (the more material, the better !!), utf8 encoded.
We'll call this textfile the original textfile.
1 - Original textfile cleaning :
Lowercase everything and keep only characters present in alphabet.txt;
remove any punctuation, but you can keep the apostrophe, if present in alphabet, as a character. (A cleaning sketch follows below.)
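Here is a minimal cleaning sketch (filenames are examples; it assumes alphabet.txt marks comment lines with '#', as DeepSpeech's sample alphabet does):
```
# Minimal sketch: lowercase the original textfile and keep only
# characters listed in alphabet.txt (apostrophe included if present).
alphabet = set()
with open('alphabet.txt', encoding='utf-8') as f:
    for line in f:
        if not line.startswith('#'):        # skip comment lines
            alphabet.update(line.rstrip('\n'))

with open('original.txt', encoding='utf-8') as src, \
     open('cleaned.txt', 'w', encoding='utf-8') as dst:
    for line in src:
        line = line.lower()
        dst.write(''.join(c for c in line if c in alphabet or c == '\n'))
```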
2 - Create 3 directories : train, dev, test.
3 - Feed each dir with the corresponding wav's and a new transcript textfile, as a CSV file, containing those specific wav's transcripts...
Note about the textfiles : train.csv in the train dir, dev.csv in the dev dir and test.csv in the test dir.
Each CSV starts with the header line
```
wav_filename,wav_filesize,transcript
```
followed by one line per wav, e.g. :
```
/home/nvidia/DeepSpeech/data/alfred/test/record.1.wav,85484,qui es-tu et qui est-il
/home/nvidia/DeepSpeech/data/alfred/test/record.2.wav,97004,quel est ton nom ou comment tu t'appelles...
```
70 - 20 - 10 !
70% of all wav's go in the train dir, with the corresponding train.csv file,
20% in the dev dir, with the corresponding dev.csv file,
10% in the test dir, with the corresponding test.csv file.
IMPORTANT : A wav file can only appear in one directory.
It's needed for good model creation (otherwise, it could result in overfitting...). A split sketch follows below.
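Here is a minimal split sketch, assuming you already have a list of (absolute wav path, cleaned transcript) pairs; shuffling before the split keeps each wav in exactly one directory:
```
# Minimal sketch: shuffle (wav, transcript) pairs, split 70/20/10 and
# write one CSV per directory with the header DeepSpeech expects.
import csv
import os
import random

def write_split(pairs, out_dir):
    os.makedirs(out_dir, exist_ok=True)
    name = os.path.basename(out_dir) + '.csv'   # train/train.csv etc.
    with open(os.path.join(out_dir, name), 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['wav_filename', 'wav_filesize', 'transcript'])
        for wav, text in pairs:
            writer.writerow([wav, os.path.getsize(wav), text])

pairs = []  # fill with (absolute wav path, cleaned transcript) tuples
random.shuffle(pairs)
n = len(pairs)
write_split(pairs[:int(0.7 * n)], 'train')
write_split(pairs[int(0.7 * n):int(0.9 * n)], 'dev')
write_split(pairs[int(0.9 * n):], 'test')
```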
LANGUAGE MODEL CREATION :
Here, we use the original textfile, containing 100% of the wav transcripts, and we rename it vocabulary.txt.
We'll use the powerful KenLM tools for our LM build :
http://kheafield.com/code/kenlm/estimation/
DON'T FORGET TO COMPILE KENLM, OTHERWISE YOU WILL NOT FIND THE BINARIES.
1 - Creating the arpa file for the binary build :
```
/bin/bin/./lmplz --text vocabulary.txt --arpa words.arpa --o 3
```
I asked Kenneth Heafield about the -o param ('order of the model to estimate'). It seems that for a small corpus (my case), a value from 3 to 4 is the best way to success.
See the lmplz params on the web link above, if needed.
2 - Creating the binary file :
```
/bin/bin/./build_binary -T -s words.arpa lm.binary
```
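Optionally, you can sanity-check lm.binary with KenLM's query tool (built alongside lmplz; the path below follows the same convention as the commands above). It prints per-word log10 probabilities for sentences read from stdin:
```
echo "quel est ton nom" | /bin/bin/./query lm.binary
```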
TRIE CREATION :
We'll use the native_client 'generate_trie' binary to create our trie file.
Adapt your links !
```
/home/nvidia/tensorflow/bazel-bin/native_client/generate_trie \
  /home/nvidia/DeepSpeech/data/alphabet.txt \
  /home/nvidia/DeepSpeech/data/lm.binary \
  /home/nvidia/DeepSpeech/data/vocabulary.txt \
  /home/nvidia/DeepSpeech/data/trie
```
RUN MODEL CREATION :
1 - Verify your directories :
```
train/
    train.csv
    record.1.wav
    record.2.wav ...  (remember : all wav's are different)
dev/
    dev.csv
    record.1.wav
    record.2.wav ...
test/
    test.csv
    record.1.wav
    record.2.wav ...
```
2 - Write your run file :
run-alfred.sh:
```
#!/bin/sh
set -xe
if [ ! -f DeepSpeech.py ]; then
    echo "Please make sure you run this from DeepSpeech's top level directory."
    exit 1
fi;

python -u DeepSpeech.py \
  --train_files /home/nvidia/DeepSpeech/data/alfred/train/train.csv \
  --dev_files /home/nvidia/DeepSpeech/data/alfred/dev/dev.csv \
  --test_files /home/nvidia/DeepSpeech/data/alfred/test/test.csv \
  --train_batch_size 80 \
  --dev_batch_size 80 \
  --test_batch_size 40 \
  --n_hidden 375 \
  --epoch 33 \
  --validation_step 1 \
  --early_stop True \
  --earlystop_nsteps 6 \
  --estop_mean_thresh 0.1 \
  --estop_std_thresh 0.1 \
  --dropout_rate 0.22 \
  --learning_rate 0.00095 \
  --report_count 100 \
  --use_seq_length False \
  --export_dir /home/nvidia/DeepSpeech/data/alfred/results/model_export/ \
  --checkpoint_dir /home/nvidia/DeepSpeech/data/alfred/results/checkout/ \
  --decoder_library_path /home/nvidia/tensorflow/bazel-bin/native_client/libctc_decoder_with_kenlm.so \
  --alphabet_config_path /home/nvidia/DeepSpeech/data/alfred/alphabet.txt \
  --lm_binary_path /home/nvidia/DeepSpeech/data/alfred/lm.binary \
  --lm_trie_path /home/nvidia/DeepSpeech/data/alfred/trie \
  "$@"
```
Adapt links and params to fit your needs...
Now, run the file IN YOUR DEEPSPEECH directory :
```
./bin/run-alfred.sh
```
You can leave the computer and watch an entire 12-episode series... before the process ends.
IF everything worked correctly, you should now have a `/model_export/output_graph.pb` : your model.
My model : (I didn't really respect the percentages LOL)
Sure I'll do better with more material (wav's and transcripts).
Enjoy your model with inferences!
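For example, with the native client from that era, an inference call looked roughly like this (the argument order changed across DeepSpeech releases, so check `deepspeech --help` for your version; the wav path is just an example):
```
deepspeech /home/nvidia/DeepSpeech/data/alfred/results/model_export/output_graph.pb \
  /home/nvidia/DeepSpeech/data/alfred/test/record.1.wav \
  /home/nvidia/DeepSpeech/data/alfred/alphabet.txt \
  /home/nvidia/DeepSpeech/data/alfred/lm.binary \
  /home/nvidia/DeepSpeech/data/alfred/trie
```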
DATA AUGMENTATION :
Now, I have a very good model : awesome !
But surely I could do better :
'Alfred', my robot, encounters bad inferences when I talk to it (him?) in 2 cases.
In the first case, the distance produces echos and changes the voice wave amplitude...
In the second case, the motor wheels produce noise, and the ground texture too...
In both cases, noises, echos and amplitude signal variations cause bad inferences !!
What could I do ?
1/ Well, I'll use specificity :
Alfred, you need to listen to and process an echoed wav sentence, due to distance ? Ok !
I'll make you learn modified sentences (with echos inside the wav's), recorded in different locations in the room/house.
Noise ? Happy voice ? Sad one ? --> I'll make you learn a bit of each !!!
(noise should differ)
Sure, it needs a bit of time, recording new waves to fit all scenarios, but I must say that the inference difference is very impressive : I can talk to my robot while it's moving, and ask it to stop, for example.
2/ The other way :
A Mozilla tool to modify wav's with echos, pitch, added noise...
### mozilla/voice-corpus-tool
voice-corpus-tool - Tool for creation, manipulation and maintenance of
voice corpora
> python /media/nvidia/neo_backup/voice-corpus-tool-master/voice.py
>
> A tool to apply a series of commands to a collection of samples.
> Usage: voice.py (command <arg1> <arg2> ... [-opt1 [<value>]] [-opt2 [<value>]] ...)*
>
> Commands:
>
> help
> Display help message
>
> add <source>
> Adds samples to current buffer
> Arguments:
> source: string - Name of a named buffer or filename of a CSV file or WAV file (wildcards supported)
>
> Buffer operations:
>
> shuffle
> Randomize order of the sample buffer
>
> order
> Order samples in buffer by length
>
> reverse
> Reverse order of samples in buffer
>
> take <number>
> Take given number of samples from the beginning of the buffer as new buffer
> Arguments:
> number: int - Number of samples
>
> repeat <number>
> Repeat samples of current buffer <number> times as new buffer
> Arguments:
> number: int - How often samples of the buffer should get repeated
>
> skip <number>
> Skip given number of samples from the beginning of current buffer
> Arguments:
> number: int - Number of samples
>
> find <keyword>
> Drop all samples whose transcription does not contain a keyword
> Arguments:
> keyword: string - Keyword to look for in transcriptions
>
> clear
> Clears sample buffer
>
> Named buffers:
>
> set <name>
> Replaces named buffer with contents of buffer
> Arguments:
> name: string - Name of the named buffer
>
> stash <name>
> Moves buffer to named buffer (buffer will be empty afterwards)
> Arguments:
> name: string - Name of the named buffer
>
> push <name>
> Appends buffer to named buffer
> Arguments:
> name: string - Name of the named buffer
>
> drop <name>
> Drops named buffer
> Arguments:
> name: string - Name of the named buffer
>
> Output:
>
> print
> Prints list of samples in current buffer
>
> play
> Play samples of current buffer
>
> write <dir_name>
> Write samples of current buffer to disk
> Arguments:
> dir_name: string - Path to the new sample directory. The directory and a file with the same name plus extension '.csv' should not exist.
>
> Effects:
>
> reverb [-room_scale <room_scale>] [-hf_damping <hf_damping>] [-wet_gain <wet_gain>] [-stereo_depth <stereo_depth>] [-reverberance <reverberance>] [-wet_only] [-pre_delay <pre_delay>]
> Adds reverberation to buffer samples
> Options:
> -room_scale: float - Room scale factor (between 0.0 and 1.0)
> -hf_damping: float - HF damping factor (between 0.0 and 1.0)
> -wet_gain: float - Wet gain in dB
> -stereo_depth: float - Stereo depth factor (between 0.0 and 1.0)
> -reverberance: float - Reverberance factor (between 0.0 and 1.0)
> -wet_only: bool - Whether to strip source signal on output
> -pre_delay: int - Pre delay in ms
>
> echo <gain_in> <gain_out> <delay_decay>
> Adds an echo effect to buffer samples
> Arguments:
> gain_in: float - Gain in
> gain_out: float - Gain out
> delay_decay: string - Comma separated delay decay pairs - at least one (e.g. 10,0.1,20,0.2)
>
> speed <factor>
> Adds a speed effect to buffer samples
> Arguments:
> factor: float - Speed factor to apply
>
> pitch <cents>
> Adds a pitch effect to buffer samples
> Arguments:
> cents: int - Cents (100th of a semitone) of shift to apply
>
> tempo <factor>
> Adds a tempo effect to buffer samples
> Arguments:
> factor: float - Tempo factor to apply
>
> sox <effect> <args>
> Adds a SoX effect to buffer samples
> Arguments:
> effect: string - SoX effect name
> args: string - Comma separated list of SoX effect parameters (no white space allowed)
>
> augment <source> [-gain <gain>] [-times <times>]
> Augment samples of current buffer with noise
> Arguments:
> source: string - CSV file with samples to augment onto current sample buffer
> Options:
> -gain: float - How much gain (in dB) to apply to augmentation audio before overlaying onto buffer samples
> -times: int - How often to apply the augmentation source to the sample buffer
Oh, I can modify a whole csv file with a lot of params...
Calling a csv file, you process on-the-fly modifications on every wav in the csv !!!
I tested PITCH, SPEED and TEMPO, with values between -0.05 and 0.05, with very good results (test WER halved).
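For example, a hypothetical chain (the effect values and output path are just illustrative; note that `write` requires the target directory to not exist yet):
```
python voice.py add /home/nvidia/DeepSpeech/data/alfred/train/train.csv \
  pitch 30 tempo 1.03 \
  write /home/nvidia/DeepSpeech/data/alfred/train_augmented
```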
...to follow!
[This is an archived TTS discussion thread from discourse.mozilla.org/t/tutorial-how-i-trained-a-specific-french-model-to-control-my-robot]