No speech-to-text pipeline is complete without a speaker diarization module. Say you want to monitor everything that’s being said about and by presidential candidates before an election. You would download lots of interviews and news reports and pass them through a speech-to-text service. Then you could CTRL+F your way through topics of interest, which is much easier than actually listening to all the audio. However, without speaker separation, you would not know who said any of the things you find in the transcript. You would not know whether a certain candidate said it, or their opponent, or someone else present in the discussion. To find out, you would have to go back to the audio and listen to it manually, which is the exact thing you were trying to avoid in the first place.
Speaker diarization is the process of chunking an input audio file into a set of segments and attributing each segment to the corresponding speaker. When you read the transcript of a meeting, speaker diarization lets you know who said what, which solves the problem described above.
From a machine learning standpoint, there are many ways to perform speaker diarization. In this blog post we will look at two of the most common ways to implement the process, so that the next time you’re choosing a speaker diarization service, you’ll have a better idea of what’s happening under the hood.
First Method: Voice Activity Detection + Segment Encoding + Clustering
This is what you get if you use the popular pyannote (https://arxiv.org/abs/1911.01255) framework. The first step, as the title suggests, is voice activity detection. Given an audio file, a first neural network, trained for this purpose, detects the segments that contain speech so that they can be processed further.
To refine these segments, two additional neural networks are employed. One detects speaker changes, and the other detects overlapping speech. This is so that each final segment belongs, to the best degree possible, to a single individual. Once the segments are determined, a fourth neural network encodes them. This model is trained to encode audio samples from the same speaker into similar mathematical vectors, and audio samples from different speakers into vectors that lie further apart in the same vector space.
After the segments have been encoded into vectors, so that segments with the same vocal fingerprint lie close together and segments from different speakers lie further apart, a clustering process takes place.
Here, on the left, we illustrate what plotting the encodings for each segment can look like. On the right, we see the final output of the clustering algorithm. A clustering algorithm takes points in a mathematical space and determines which points belong to the same group based on a similarity function (such as distance). Points (encoded segments) that the clustering algorithm considers to belong together are attributed to the same speaker, which concludes the pipeline.
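To make the clustering step more concrete, here is a minimal sketch in plain Python. It is not what pyannote does internally; it assumes made-up two-dimensional embeddings, cosine distance as the similarity function, and a simple average-linkage agglomerative procedure with a hypothetical threshold of 0.3. Real embeddings have hundreds of dimensions and real systems use more careful clustering.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity: small when two vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def cluster_segments(embeddings, threshold=0.3):
    """Average-linkage agglomerative clustering.

    Returns one speaker label per segment. The threshold is an
    illustrative assumption, not a recommended value.
    """
    # Start with every segment in its own cluster.
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Average distance over all cross-cluster pairs.
        dists = [cosine_distance(embeddings[i], embeddings[j])
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > threshold:
            break  # remaining clusters are too different to merge
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    labels = [0] * len(embeddings)
    for speaker_id, members in enumerate(clusters):
        for i in members:
            labels[i] = speaker_id
    return labels

# Toy embeddings (invented for illustration): segments 0 and 2 sound alike,
# as do segments 1 and 3.
embeddings = [
    [0.90, 0.10],
    [0.10, 0.95],
    [0.85, 0.20],
    [0.05, 0.90],
]
labels = cluster_segments(embeddings)
print(labels)
```

Segments that end up with the same label are attributed to the same speaker. Note that the number of speakers is not given in advance here; it falls out of the distance threshold, which is one of the handcrafted choices this kind of pipeline depends on.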
Second Method: End-to-End Transformer Model
Another line of research in the machine learning community aims at performing speaker diarization with a single end-to-end neural network (e.g. https://arxiv.org/abs/1909.06247). Instead of relying on a multi-step pipeline, a transformer model, receiving the original audio (processed as a log-mel spectrogram) as input, is trained to output speaker masks directly. That is, for each 20ms (this time interval can be adjusted) of audio, and for each speaker, the model outputs 1 if the given speaker speaks during that time frame and 0 otherwise.
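As a rough illustration of what such an output looks like, the sketch below takes per-speaker 0/1 masks (one value per 20 ms frame) and converts them into time segments. The masks, the speaker names, and the helper `mask_to_segments` are all invented for this example; a real model would output probabilities that first need to be thresholded.

```python
FRAME_SECONDS = 0.02  # 20 ms per frame, as in the text; adjustable

def mask_to_segments(mask, frame_seconds=FRAME_SECONDS):
    """Turn a 0/1 per-frame activity mask into (start, end) times in seconds."""
    segments = []
    start = None  # index of the first frame of the currently open segment
    for i, active in enumerate(mask):
        if active and start is None:
            start = i
        elif not active and start is not None:
            segments.append((round(start * frame_seconds, 2),
                             round(i * frame_seconds, 2)))
            start = None
    if start is not None:
        # The mask ended while the speaker was still talking.
        segments.append((round(start * frame_seconds, 2),
                         round(len(mask) * frame_seconds, 2)))
    return segments

# Hypothetical model output for 8 frames (160 ms) and two speakers.
speaker_masks = {
    "speaker_0": [1, 1, 1, 0, 0, 0, 1, 1],
    "speaker_1": [0, 0, 1, 1, 1, 0, 0, 0],
}
diarization = {name: mask_to_segments(mask)
               for name, mask in speaker_masks.items()}
print(diarization)
```

Because each speaker has its own mask, two speakers can be active in the same frame (frame 2 in the toy data), so overlapping speech is represented without any extra machinery.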
The obvious advantage here is that, instead of running multiple networks (each unaware of the final goal) plus separate procedures such as clustering, which is algorithmic in nature and requires a lot of handcrafting, a single network can learn the entire process and optimize the end goal directly. Transformer models also happen to be quite well suited to this task.
A transformer encoder, such as the one used by OpenAI’s Whisper (https://cdn.openai.com/papers/whisper.pdf), relies on what’s called an attention mechanism, illustrated in the following figure.
The transformer model keeps an internal representation (a mathematical vector) for each time frame, which we will call a token. During attention, each token creates a query, a key and a value. The query is what the token is looking for, the key is what the token advertises about itself, and the value is what the token actually contributes toward building the vocal fingerprints, but the exact meaning and theory behind this is not very important here. What’s important is that this process produces an attention matrix, which measures the similarity of each token to every other token. Each token is then rewritten (transformed :) ) as a weighted sum of the tokens it is similar to. So, basically, similar tokens are grouped together while different tokens are not, and the grouping itself is learned. In the end, the transformer model becomes a learned procedure for grouping time frames with similar vocal prints together, which is exactly what we want.
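The attention step can be sketched in a few lines of plain Python. This is a deliberately simplified, single-head version of scaled dot-product attention: it assumes the queries, keys and values are the tokens themselves (a real transformer applies learned projection matrices first), and the three two-dimensional tokens are invented for illustration.

```python
import math

def softmax(scores):
    """Turn raw similarity scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption for this sketch: queries, keys and values are the tokens
    themselves, i.e. the learned projections are replaced by the identity.
    """
    d = len(tokens[0])
    queries, keys, values = tokens, tokens, tokens

    # Attention matrix: row i holds token i's similarity to every token.
    attention = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        attention.append(softmax(scores))

    # Each token is rewritten as a weighted sum of the values.
    outputs = [
        [sum(w * v[j] for w, v in zip(row, values)) for j in range(d)]
        for row in attention
    ]
    return attention, outputs

# Toy tokens: frames 0 and 2 resemble one voice, frame 1 another.
tokens = [
    [4.0, 0.0],
    [0.0, 4.0],
    [3.8, 0.2],
]
attention, outputs = self_attention(tokens)
for row in attention:
    print([round(w, 3) for w in row])
```

In the printed matrix, frame 0 puts almost all of its weight on itself and frame 2, and next to none on frame 1. That is the grouping behaviour described above, except that here the similarity comes from hand-picked vectors rather than from training.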
Both methods have their advantages and disadvantages. One difference worth looking into is the amount of information available at the speaker encoding phase. The first method can accumulate all data points before the final clustering. Prior to that step, however, each data point is encoded separately by a small network, which sees only the current segment of voice activity and not the entire context. In contrast, the second method receives a long stretch of audio as input, so it can use the immediate context to better understand a voice activity segment in relation to other utterances from the same or other speakers. However, when tracking speakers across a longer recording, it won’t be able to refer to all the previous voice activity segment representations as the first method does.
Whichever method you choose to employ, we hope this post has given you a starting point for understanding what speaker diarization is and what the process under the hood of common diarization services looks like.