- A PyTorch implementation of a d-vector based speaker recognition system.
- A ResNet-based feature extractor, global average pooling, and a softmax layer trained with cross-entropy loss.
- All the features (log Mel-filterbank features) for training and testing are uploaded.
- A Korean manual is included ("2019_LG_SpeakerRecognition_tutorial.pdf").
python 3.5+
pytorch 1.0.0
pandas 0.23.4
numpy 1.13.3
pickle 4.0
matplotlib 2.1.0
We used the dataset collected through the following task.
- No. 10063424, 'development of distant speech recognition and multi-task dialog processing technologies for in-door conversational robots'
Specification
- Korean read speech corpus (ETRI read speech)
- Clean speech at a distance of 1m and a direction of 0 degrees
- 16 kHz, 16 bits
We uploaded 40-dimensional log Mel-filterbank energy features extracted from the above dataset.
The python_speech_features library is used.
You can extract the features using the code below:
```python
import librosa
import numpy as np
from python_speech_features import fbank

def normalize_frames(m, Scale=True):
    # Mean (and optionally variance) normalization over the time axis.
    if Scale:
        return (m - np.mean(m, axis=0)) / (np.std(m, axis=0) + 2e-12)
    else:
        return m - np.mean(m, axis=0)

sample_rate = 16000  # the corpus is 16 kHz
audio, sr = librosa.load(filename, sr=sample_rate, mono=True)  # filename: path to a wav file
filter_banks, energies = fbank(audio, samplerate=sample_rate, nfilt=40, winlen=0.025)
filter_banks = 20 * np.log10(np.maximum(filter_banks, 1e-5))   # log Mel-filterbank energies
feature = normalize_frames(filter_banks, Scale=False)          # mean-normalized, shape (n_frames, 40)
```
- Train ('feat_logfbank_nfilt40 - train'): 24,000 utterances, 240 folders (240 speakers), size: 3 GB
- Test ('feat_logfbank_nfilt40 - test'): 20 utterances, 10 folders (10 speakers), size: 11 MB
The background model (a ResNet-based speaker classifier) is trained.
You can change the training settings in the 'train.py' file.
python train.py
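For orientation, the shape of such a model can be sketched as follows: a ResNet-style convolutional front-end over the log Mel-filterbank input, global average pooling, a linear embedding layer (whose output later serves as the d-vector), and a softmax classifier trained with cross-entropy. The block structure, channel count, and embedding size below are illustrative assumptions, not the exact configuration of this repository (only the 240 speaker classes match the training set).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasicBlock(nn.Module):
    """A standard residual block: two 3x3 convolutions with a skip connection."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = F.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return F.relu(out + x)

class SpeakerClassifier(nn.Module):
    """ResNet-style extractor -> global average pooling -> softmax speaker classifier."""
    def __init__(self, n_classes=240, channels=32, emb_dim=128):
        super().__init__()
        self.front = nn.Sequential(
            nn.Conv2d(1, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
            BasicBlock(channels),
            BasicBlock(channels),
        )
        self.embedding = nn.Linear(channels, emb_dim)    # last hidden layer (d-vector)
        self.classifier = nn.Linear(emb_dim, n_classes)  # softmax layer (via CrossEntropyLoss)

    def forward(self, x):
        # x: (batch, 1, n_frames, n_mels) log Mel-filterbank segments
        h = self.front(x)
        h = F.adaptive_avg_pool2d(h, 1).flatten(1)  # global average pooling
        d_vector = self.embedding(h)
        return self.classifier(d_vector)            # logits; train with nn.CrossEntropyLoss()
```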
Extract the speaker embeddings (d-vectors) using 10 enrollment speech files.
They are extracted from the last hidden layer of the background model.
All the embeddings are saved in the 'enroll_embeddings' folder.
python enroll.py
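Conceptually, a d-vector is the activation of the last hidden layer for one utterance. Reusing the hypothetical SpeakerClassifier sketched above (the actual enroll.py differs in its details), the extraction step looks roughly like this; the enrollment file path and the output file name are made-up examples:

```python
import pickle
import torch
import torch.nn.functional as F

# Load one enrollment utterance's features: a (n_frames, 40) log Mel-filterbank array.
with open(enroll_filename, 'rb') as f:      # enroll_filename: path to one '.p' feature file
    feature = pickle.load(f)['feat']

model = SpeakerClassifier(n_classes=240)    # the sketch from the training section above
# (load the trained weights of the background model here)
model.eval()

with torch.no_grad():
    x = torch.from_numpy(feature).float()[None, None]   # shape (1, 1, n_frames, 40)
    h = model.front(x)
    h = F.adaptive_avg_pool2d(h, 1).flatten(1)           # global average pooling
    d_vector = model.embedding(h)                        # activation of the last hidden layer

torch.save(d_vector, 'enroll_embeddings/speaker0001.pth')  # hypothetical file name
```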
For speaker verification, you can change the settings in the 'verification.py' file.
python verification.py
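A common way to score a verification trial with d-vectors is the cosine similarity between the enrolled speaker's embedding and the test embedding, accepted if it exceeds a threshold. The helper below is only a sketch of this idea (the threshold value is an arbitrary placeholder), not the code in verification.py:

```python
import torch
import torch.nn.functional as F

def verify(enrolled_dvec, test_dvec, threshold=0.5):
    """Accept the trial if the cosine similarity of the two d-vectors exceeds the threshold."""
    # Both d-vectors are tensors of shape (1, emb_dim).
    score = F.cosine_similarity(enrolled_dvec, test_dvec).item()
    return score, score > threshold
```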
For speaker identification, you can change the settings in the 'identification.py' file.
python identification.py
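Closed-set identification can then be sketched as picking the enrolled speaker whose d-vector is most similar to the test d-vector (again only an illustration of the idea, not the code in identification.py):

```python
import torch.nn.functional as F

def identify(enrolled_dvecs, test_dvec):
    """Return the enrolled speaker ID whose d-vector has the highest cosine similarity."""
    # enrolled_dvecs: dict mapping speaker ID -> (1, emb_dim) d-vector tensor
    scores = {spk: F.cosine_similarity(dvec, test_dvec).item()
              for spk, dvec in enrolled_dvecs.items()}
    return max(scores, key=scores.get)
```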
```python
train_DB, valid_DB = split_train_dev(c.TRAIN_FEAT_DIR, val_ratio)
```
- 'c.TRAIN_FEAT_DIR' in configure.py should be the path to your dataset.
- 'c.TRAIN_FEAT_DIR' should have the structure FEAT_DIR/speaker_folders/feature_files.p (see the example below).
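For example, the setting in configure.py could look like the following; the path and folder names are placeholders for your own data, not the values used in this repository:

```python
# configure.py (hypothetical excerpt)
TRAIN_FEAT_DIR = '/your/path/feat_logfbank_nfilt40/train'
# Expected layout:
#   <TRAIN_FEAT_DIR>/<speaker_id>/<utterance>.p
#   e.g. .../train/speaker0001/utt_0001.p
```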
```python
import os
from glob import glob

def find_feats(directory, pattern='**/*.p'):
    """Recursively finds all files matching the pattern."""
    return glob(os.path.join(directory, pattern), recursive=True)
```
- I assume that all the features are saved in '.p' (pickle) format.
- If you want to use a different extension, change line 31 in DB_wav_reader.py: pattern='**/*.p' should be adjusted to match your feature file format.
- If you haven't extracted features yet, please do so using the python_speech_features library; I did not upload a feature extraction script (only the snippet shown earlier).
- Of course, you can use other libraries.
```python
import pickle

def read_MFB(filename):
    with open(filename, 'rb') as f:
        feat_and_label = pickle.load(f)
    feature = feat_and_label['feat']  # size : (n_frames, dim=40)
    label = feat_and_label['label']
    # ... (the rest of the function is omitted here)
```
- It is assumed that the feature files are saved in pickle format; if yours are not, you need to change this code accordingly (a sketch of saving features in the expected format is given below).
- You have to adapt the function 'read_MFB' to your own situation.
- Lines 12 to 16 load the feature (assumed to be saved with pickle) and the label.
- The feature size should be (n_frames, dim), as written in the comment, and the label should be the speaker identity as a string.
- You can remove lines 20 to 24; they are there because it is assumed that the front and back of the utterance are silence.
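Conversely, if you create the feature files yourself, a minimal sketch of saving one utterance in this format uses the same 'feat' and 'label' keys; the output path and speaker ID below are made-up examples:

```python
import os
import pickle

# 'feature' is a (n_frames, 40) array, e.g. from the extraction snippet earlier;
# the speaker ID and output path are hypothetical.
out_path = 'feat_logfbank_nfilt40/train/speaker0001/utt_0001.p'
os.makedirs(os.path.dirname(out_path), exist_ok=True)
with open(out_path, 'wb') as f:
    pickle.dump({'feat': feature, 'label': 'speaker0001'}, f)
```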
- Be aware that all the settings are geared toward a small dataset, since the training set in this tutorial is very small.
- Depending on your dataset, you can make the model wider (increase the number of channels), make it deeper (e.g., switch to ResNet-34), or increase the number of input frames ('NUM_WIN_SIZE' in configure.py).
- More advanced loss functions or pooling methods (e.g., attentive pooling) can also be used; they are not implemented here (a sketch of attentive pooling is shown below).
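As an illustration of the last point, here is a minimal sketch of self-attentive pooling, which replaces the plain global average with a learned weighted average over frames; it is not part of this repository, and the module is only an example:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentivePooling(nn.Module):
    """Pools frame-level features with learned attention weights instead of a plain average."""
    def __init__(self, feat_dim):
        super().__init__()
        self.attention = nn.Linear(feat_dim, 1)

    def forward(self, x):
        # x: (batch, n_frames, feat_dim) frame-level features
        w = F.softmax(self.attention(x), dim=1)  # attention weight per frame
        return torch.sum(w * x, dim=1)           # (batch, feat_dim) utterance-level embedding
```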
Youngmoon Jung ([email protected]) at KAIST, South Korea