You can use speech clustering to segment a speech waveform and separate it into a number of speakers. Speech Server produces a timed sequence of labels that correspond to speaker assignments.
Speech Server provides the following speech clustering tasks:
ClusterSpeech. This task carries out the basic clustering of wide-band speech into speaker segments. For example, if two speaker clusters are identified, the output labels are Cluster_0 and Cluster_1 respectively.

ClusterSpeechTel. This task is essentially the same as the ClusterSpeech task, but is optimized for telephony audio. In particular, the audio classification is configured to suit speech, noise, and music in telephone calls, and the final output can feature dial tones and DTMF-recognized characters.

ClusterSpeechToTextTel. This task performs clustering of two speakers in a phone call, and uses the resulting speaker clusters to improve speech-to-text performance slightly by using speaker-sided acoustic normalization. As before, any telephony artifacts such as dial tones or DTMF tones are included, interspersed with the recognized words.

For more information on the speaker clustering tasks, see the IDOL Speech Server Reference.
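For illustration, the following Python sketch shows one way you might submit a clustering task over the Speech Server HTTP (ACI) interface. The host, port, file path, and the exact action and parameter names (AddTask, Type, File) are assumptions for this example; check the IDOL Speech Server Reference for the precise action syntax.

    # Minimal sketch: submit a ClusterSpeech task over HTTP.
    # The host, port, action name, and parameter names below are assumptions
    # for illustration only; consult the IDOL Speech Server Reference.
    import urllib.parse
    import urllib.request

    SPEECH_SERVER = "http://localhost:13000"    # assumed host and port

    params = {
        "Type": "ClusterSpeech",                # or ClusterSpeechTel / ClusterSpeechToTextTel
        "File": "/data/audio/meeting.wav",      # hypothetical path to the audio file
    }
    url = SPEECH_SERVER + "/action=AddTask&" + urllib.parse.urlencode(params)

    with urllib.request.urlopen(url) as response:
        # The server returns XML describing the submitted task; parse it as needed.
        print(response.read().decode("utf-8"))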
The key process in clustering speech is called SplitSpeech. This process requires:
The SplitSpeech process uses an agglomerative algorithm to find the best two segment clusters to merge. This step is repeated until all potential merges fail a Bayesian Information Criterion (BIC) threshold check. The result is a smaller number of acoustically homogeneous speaker clusters.
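As a conceptual illustration of this approach (not the Speech Server implementation itself), the following Python sketch repeatedly merges the best-matching pair of clusters and stops when no merge passes a BIC check. The merge_thresh value plays a role analogous to the MergeThresh parameter described below.

    # Illustrative agglomerative clustering with a BIC-based stopping rule.
    # Conceptual sketch only: each cluster is a (frames x dimensions) array of
    # acoustic feature vectors, with enough frames to estimate a covariance.
    import numpy as np

    def bic_penalty(n, dim, lam=1.0):
        # Complexity penalty for modelling a cluster with one full-covariance Gaussian.
        return 0.5 * lam * (dim + 0.5 * dim * (dim + 1)) * np.log(n)

    def delta_bic(a, b):
        # Lower values indicate the two clusters are better modelled as one speaker.
        merged = np.vstack([a, b])
        def half_n_logdet(x):
            return 0.5 * len(x) * np.linalg.slogdet(np.cov(x, rowvar=False))[1]
        return (half_n_logdet(merged) - half_n_logdet(a) - half_n_logdet(b)
                - bic_penalty(len(merged), merged.shape[1]))

    def cluster_speakers(segments, merge_thresh=0.0):
        clusters = [np.asarray(s) for s in segments]
        while len(clusters) > 1:
            # Find the best (lowest delta-BIC) pair of clusters to merge.
            scored = [(delta_bic(clusters[i], clusters[j]), i, j)
                      for i in range(len(clusters))
                      for j in range(i + 1, len(clusters))]
            score, i, j = min(scored)
            if score >= merge_thresh:
                break                  # every remaining merge fails the BIC check
            clusters[i] = np.vstack([clusters[i], clusters[j]])
            del clusters[j]
        return clusters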
You can specify the minimum and maximum numbers of speakers to produce. For example, if you know that a telephone call consists of two speakers, you can use the MinNumSpeakers and MaxNumSpeakers parameters to set both the minimum and maximum number of speakers to 2, to guarantee that the process produces exactly this number of speakers.
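For example, extending the hypothetical AddTask request sketched earlier, forcing exactly two speakers might look like this (the file path is illustrative; confirm parameter usage in the IDOL Speech Server Reference):

    # Hypothetical task parameters that force exactly two speaker clusters.
    params = {
        "Type": "ClusterSpeechTel",
        "File": "/data/audio/call.wav",   # illustrative telephone recording
        "MinNumSpeakers": 2,
        "MaxNumSpeakers": 2,
    }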
Alternatively, you can algorithmically determine the number of speakers based on a sensitivity threshold. In this case, you can change the MergeThresh parameter from the default value of 0.0 so that the process produces more or fewer speakers.
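In the same hypothetical request style, you would omit the speaker-count limits and adjust the threshold instead (the value shown is arbitrary; how the threshold affects the number of speakers is described in the IDOL Speech Server Reference):

    # Hypothetical task parameters that let the threshold determine the speaker count.
    params = {
        "Type": "ClusterSpeech",
        "File": "/data/audio/meeting.wav",
        "MergeThresh": 0.2,   # changed from the 0.0 default to tune merging sensitivity
    }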
For more information on these parameters, see the IDOL Speech Server Reference.
The recursive process of splitting speech can be very resource-intensive. If you are processing audio files longer than 10 minutes, and those files have consistent speakers, HPE recommends that you crystallize speaker information after approximately 10 minutes. For example, if your audio file is 30 minutes long and you crystallize the speaker information after five minutes, the process clusters the speakers in the first five minutes of the file. Any subsequent speech segments are then assigned to one of the speaker clusters from those first five minutes, rather than to new speakers. This crystallization makes classification of subsequent speech segments faster.
You can also configure processing depending on whether accuracy or speed is more important. For example, full covariance matrices are more accurate but slower; if processing speed is more important to you than accuracy, you can use diagonal covariance matrices (which are faster but less accurate) instead.
For more information on using the DiagCov and FixTime parameters, see the IDOL Speech Server Reference.
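As a final illustration, these parameters could be added to the same hypothetical request to trade accuracy for speed and to crystallize speakers early. The values and units shown are assumptions, so check the IDOL Speech Server Reference before using them:

    # Hypothetical speed-oriented settings for a long recording.
    params = {
        "Type": "ClusterSpeech",
        "File": "/data/audio/long_meeting.wav",
        "DiagCov": True,   # assumed flag: use diagonal rather than full covariance matrices
        "FixTime": 600,    # assumed to be seconds: crystallize speakers after about 10 minutes
    }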