Automatic Speaker Recognition

Automatic Speaker Recognition (determining the identity of the person speaking in a recording) is a research topic in the area of Speech Processing, alongside similar topics such as Speech Recognition (finding the words spoken), accent, language, and emotion recognition, but also speech synthesis (generating speech from text), speech coding and compression, and perhaps the most basic of all: speech activity detection (finding the periods in an audio recording where there is any speech). In speaker recognition we're interested in the technique of recognizing speakers, not in particular persons per se, e.g., Osama bin Laden or President Biden. Therefore, the problem is often cast as a speaker detection task: given two segments of audio, determine whether they were spoken by the same speaker or by different speakers. Traditionally, one of these segments is called the enrollment segment, the other the test segment. A direct application of this task is speaker verification: is the person speaking who she claims to be? This application is often contrasted with speaker identification, where the task is: what is the actual identity of the person speaking? A system that performs well on the speaker detection task can quite easily be used effectively in the other applications, which is why this task is the topic of study in virtually all speaker recognition research. And so it is in this course.

Any speaker recognition system internally works with a score that expresses the similarity between the speakers of the two segments. We traditionally use a score that is larger when the speakers are more similar (as opposed to a distance, which gets smaller and is often bounded below by 0). For an actual decision such a score needs to be thresholded: scores higher than the threshold get the decision "same speaker", whereas comparisons with a score lower than the threshold receive the decision "different speaker". Setting such a threshold well is far from trivial, and in general depends on the specific application and the prior probabilities involved. The capability of setting thresholds well is called calibration, and is a research area in itself. In this course, as in most speaker recognition research, we will not assess calibration.

Evaluation metrics

Systems are evaluated by giving them many pairs of speech segments. Each pair is called a trial, and the task is to produce a score that is higher when the system finds it more likely that the speakers are the same. For the evaluation test set, the identities of the speakers in the test segments are not known to the system (or to the students of this course building the system). When the scores are submitted for evaluation, the identities are used to compute the performance metric, the Equal Error Rate (EER). We will give a short explanation of what this metric means and how it is computed.

There are two kinds of trials:

  • target trials, when both segments are spoken by the same speaker,
  • non-target trials, when the two segments are spoken by different speakers.

When a system makes a decision by thresholding a score, two different types of errors can be made. A target trial can be classified as "different speakers"; this is called a false negative or missed speaker. Alternatively, a non-target trial can be classified as "same speaker"; this is called a false positive or false alarm. Once the speaker identities are known, the submitted scores can be grouped into "target scores" and "non-target scores", and their distributions will differ: non-target scores tend to be lower than target scores, since separating them is exactly the task of the speaker recognition system.
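
As a minimal illustration, the two error rates at a fixed threshold can be computed directly from labeled scores. The score values and the threshold below are made-up examples, not from any real system:

```python
import numpy as np

# Hypothetical example scores; in practice these come from your system.
target_scores = np.array([2.3, 1.7, 0.4, 3.1])       # same-speaker trials
nontarget_scores = np.array([-1.2, 0.6, -0.3, 0.1])  # different-speaker trials
threshold = 0.5

# A miss: a target trial scoring below the threshold.
fnr = np.mean(target_scores < threshold)      # false negative rate
# A false alarm: a non-target trial scoring at or above the threshold.
fpr = np.mean(nontarget_scores >= threshold)  # false positive rate
print(f"FNR = {fnr:.2f}, FPR = {fpr:.2f}")
```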

When, given a set of submitted scores, the threshold is swept from low to high, the corresponding false negative rate (the number of false negatives divided by the number of target trials) and false positive rate (the number of false positives divided by the number of non-target trials) will vary from one extreme (FNR = 0, FPR = 1), when the threshold is below the lowest submitted score, to the other extreme (FNR = 1, FPR = 0), when it is above the highest. In between these extremes, the false negative rate is traded off against the false positive rate: this is where the action is. This trade-off can be appreciated in a parametrized plot of the false negative rate versus the false positive rate. This is essentially a Receiver Operating Characteristic (ROC) curve, but in speaker recognition we're used to warping the axes a little and calling this plot a Detection Error Trade-off (DET) plot.
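
The sweep itself can be done with scikit-learn's ROC routine, as sketched below; the DET warping of the axes (probit scaling) is left out here for brevity, and the labels and scores are small made-up examples:

```python
import numpy as np
from sklearn.metrics import roc_curve

# 1 = target trial, 0 = non-target trial; scores as submitted.
labels = np.array([1, 1, 1, 1, 0, 0, 0, 0])
scores = np.array([2.3, 1.7, 0.4, 3.1, -1.2, 0.6, -0.3, 0.1])

fpr, tpr, thresholds = roc_curve(labels, scores)
fnr = 1.0 - tpr  # miss rate at each threshold
# Plotting fnr against fpr gives the (unwarped) DET trade-off curve.
```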

The trick now is that, given the set of submitted scores, this whole process can be carried out by the evaluator, who knows the identities of the speakers in the evaluation trials. Rather than the full trade-off curve, it is nice to have a single metric that characterizes the entire curve. There are many candidates for this, but in speaker recognition it is customary to use the Equal Error Rate (EER): the point on the curve where the false negative rate is equal to the false positive rate. A lower EER means a better system. The range of the EER is from 0 (perfect separation of target and non-target scores) to 50% (random scores).

During the development of a system it is useful to have a set of trials, the development set, with speakers that are different from the speakers used in training, and for which you know for each trial whether it is a target or a non-target trial, so that you can test your system, inspect scores, and compute the EER yourself. Although computing an EER is not rocket science, there are some caveats here and there, so we will provide code that computes the EER efficiently. The EER we compute is actually the ROC Convex Hull (ROCCH) EER: the value you get by first computing the convex hull of the (steppy) ROC, and then intersecting it with the line FPR = FNR.
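
For intuition, here is a minimal crossing-based EER estimate. Note that this is the simple didactic variant, not the ROC convex hull version that the provided course code computes:

```python
import numpy as np

def simple_eer(target_scores, nontarget_scores):
    """Estimate the EER as the point where FNR and FPR cross.

    A didactic sketch; the ROCCH EER provided with the course code is
    the preferred, more robust computation.
    """
    # Use every submitted score as a candidate threshold.
    thresholds = np.sort(np.concatenate([target_scores, nontarget_scores]))
    fnr = np.array([np.mean(target_scores < t) for t in thresholds])
    fpr = np.array([np.mean(nontarget_scores >= t) for t in thresholds])
    # FNR rises and FPR falls as the threshold sweeps up; take the
    # threshold where the two error rates are closest.
    i = np.argmin(np.abs(fnr - fpr))
    return (fnr[i] + fpr[i]) / 2.0
```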

A very short summary of existing approaches

Modern approaches to speaker recognition are end-to-end: they start directly from the uncompressed audio waveform, processing it with a neural network that begins with a couple of CNN layers (creating features), followed by transformer and/or fully connected layers, and then pooling over the time dimension, producing an embedding. During training this embedding is further classified using a fully connected layer with all speaker IDs from the training set as targets; during inference the embeddings of the two sides of a test trial are compared using something like a cosine score.
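
A minimal PyTorch sketch of such an architecture could look as follows. All layer sizes, kernel widths, and names here are illustrative assumptions, not taken from any particular published system:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SpeakerEmbedder(nn.Module):
    """Sketch of an end-to-end embedding extractor (illustrative sizes)."""

    def __init__(self, n_speakers, emb_dim=192):
        super().__init__()
        # A couple of 1-D conv layers acting as learned feature
        # extraction on the raw waveform (batch, 1, samples).
        self.features = nn.Sequential(
            nn.Conv1d(1, 64, kernel_size=400, stride=160), nn.ReLU(),
            nn.Conv1d(64, 128, kernel_size=5, stride=2), nn.ReLU(),
        )
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=128, nhead=4, batch_first=True),
            num_layers=2)
        self.proj = nn.Linear(128, emb_dim)
        # Classification head over training-set speaker IDs; used only
        # during training, e.g. with a cross-entropy loss.
        self.classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, wav):                   # wav: (batch, 1, samples)
        x = self.features(wav)                # (batch, 128, frames)
        x = self.encoder(x.transpose(1, 2))   # (batch, frames, 128)
        return self.proj(x.mean(dim=1))       # mean-pool over time

def trial_score(model, wav1, wav2):
    # At inference: compare the two sides of a trial with a cosine score.
    return F.cosine_similarity(model(wav1), model(wav2), dim=-1)
```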

Most earlier approaches first perform some kind of feature extraction, often computing Mel Frequency Cepstral Coefficients (MFCCs), representing the audio waveform as a sequence of fixed-length vectors. These can then be further processed by, e.g., a neural net, as in the case of x-vectors, or modeled directly. Very early approaches used Gaussian Mixture Models (GMMs), and later the deviations from a Universal Background Model (UBM), a large GMM modeling all possible speech, were used to compute the comparison score. Later, Support Vector Machines modeled these deviations, and then Joint Factor Analysis (JFA) managed to factorize these deviations into components that stem from speaker variation and components that stem from session variation. Session variation, incidentally, has always been seen as the hardest problem in automatic speaker recognition. A clever continuation of this approach was to use JFA techniques, but without explicitly separating session and speaker, producing a single vector representing the entire speech utterance. These vectors were coined i-vectors, and they formed the basis of virtually all speaker recognition systems until their performance was surpassed by neural networks.
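
As an illustration of that classical first step, MFCCs can be extracted with a library such as librosa; the filename is a placeholder and the parameter choices below are just common defaults (25 ms window, 10 ms hop at 16 kHz):

```python
import librosa

# Load audio at 16 kHz and compute 20-dimensional MFCCs.
wav, sr = librosa.load("segment.wav", sr=16000)
mfcc = librosa.feature.mfcc(y=wav, sr=sr, n_mfcc=20,
                            n_fft=400, hop_length=160)
print(mfcc.shape)  # (20, n_frames): a sequence of fixed-length vectors
```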

Apart from working on the basic discriminability of speakers, a lot of performance can be gained by various forms of normalization (making score distributions from different subsets of speakers very similar) and calibration (making it possible to make the correct same/different decision using a precomputed threshold).
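
For example, a classic score normalization scheme such as z-norm standardizes each raw score using an impostor cohort. A minimal sketch, where the construction of the cohort is application dependent and left out:

```python
import numpy as np

def znorm(raw_score, cohort_scores):
    """Z-norm sketch: standardize a trial score using the scores of the
    enrollment segment against a cohort of impostor segments, so that
    score distributions from different speakers become comparable."""
    cohort_scores = np.asarray(cohort_scores)
    return (raw_score - cohort_scores.mean()) / cohort_scores.std()
```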

Some literature