IDOL Speech Server provides a set of speaker identification tasks that cover both the training of a set of speaker templates, and the identification of speakers by using this set. These sections cover the speaker identification training process in more detail, and give some basic guidance on how best to optimize the system and run speaker identification tasks. For more detailed information on optimization and speaker identification, and on the default speaker identification tasks, refer to the Speech Server Administration Guide.
The process of training and using a speaker identification system involves four main steps:
1. Train the speaker templates. This step is performed by using the SpkIdTrain, SpkIdTrainWav, or SpkIdTrainStream task, depending on the form of the input data.
2. Generate score information for the trained templates from a further set of audio samples. This step is performed by using the SpkIdDevel, SpkIdDevelWav, or SpkIdDevelStream task, depending on the form of the input data.
3. Calculate thresholds. Speech Server uses the score information in the ATD files to estimate a score threshold for each speaker template, and stores the value in the template file. This value is the score below which a result is not considered to be a genuine hit, leading to “Unknown” results in an open-set identification system. This step is performed by using the SpkIdDevelFinal task.
4. Identify speakers. Use the SpkIdEvalWav or SpkIdEvalStream tasks to identify known speakers that might be present in your test audio data, based on the set of speaker templates.
Note: For a closed-set speaker identification system, it is not necessary to run Step 2 and Step 3, and score thresholds are not required.
Open-set speaker identification identifies sections of speech that do not match any of the known speakers as being from an unknown speaker. To achieve this, a score threshold is estimated for each speaker. A hit is considered genuine only if the score is over the threshold estimated for the specific speaker template. After you train the speaker templates, the thresholds are calculated based on typical true and false hit scores for each template generated from a further set of audio samples.
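The idea of deriving a per-template threshold from true and false hit scores can be illustrated with a small sketch. This is not the method that Speech Server uses internally, which is not described here; the midpoint rule and the function names `estimate_threshold` and `is_genuine_hit` are assumptions chosen only to make the concept concrete.

```python
from statistics import mean


def estimate_threshold(true_scores, false_scores):
    """Illustrative rule (an assumption, not Speech Server's algorithm):
    place the threshold midway between the mean true-hit score and the
    mean false-hit score observed for one speaker template."""
    if not true_scores or not false_scores:
        raise ValueError("need at least one true and one false hit score")
    return (mean(true_scores) + mean(false_scores)) / 2.0


def is_genuine_hit(score, threshold):
    """A hit counts as genuine only if its score is over the threshold."""
    return score > threshold


# Hypothetical development scores for a single template.
threshold = estimate_threshold([6.0, 7.0, 8.0], [1.0, 2.0, 3.0])
print(threshold, is_genuine_hit(6.983, threshold))
```

In practice the balance between missed speakers and false alarms depends on where the threshold sits between the two score distributions, which is why a separate set of audio samples is used to estimate it.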
The following example results include an unknown speaker:
1 A 10.630 9.460 Unknown_ FEMALE 0.000
1 A 20.090 6.150 Brown MALE 6.983
In this segment, from 10.630 to 20.090, none of the trained speakers scored above the relevant threshold, and thus a result of Unknown_ was returned.
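If you post-process results of this form, a small parser can help. The following sketch assumes that each result line has seven whitespace-separated fields: an identifier, a channel, a start time, a duration, a speaker label, a gender, and a score. That layout is inferred from the example above (10.630 + 9.460 = 20.090) and is not a documented format; the names `SpeakerSegment` and `parse_segment` are hypothetical.

```python
from dataclasses import dataclass

UNKNOWN_LABEL = "Unknown_"


@dataclass
class SpeakerSegment:
    source_id: str   # assumed meaning of the first field
    channel: str     # assumed meaning of the second field
    start: float     # seconds
    duration: float  # seconds
    speaker: str
    gender: str
    score: float

    @property
    def end(self):
        return self.start + self.duration

    @property
    def is_unknown(self):
        return self.speaker == UNKNOWN_LABEL


def parse_segment(line):
    """Parse one result line under the assumed seven-field layout."""
    parts = line.split()
    if len(parts) != 7:
        raise ValueError(f"expected 7 fields, got {len(parts)}: {line!r}")
    source_id, channel, start, duration, speaker, gender, score = parts
    return SpeakerSegment(source_id, channel, float(start), float(duration),
                          speaker, gender, float(score))


segment = parse_segment("1 A 10.630 9.460 Unknown_ FEMALE 0.000")
print(segment.speaker, round(segment.end, 3), segment.is_unknown)
```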
Closed-set speaker identification assumes that all the speakers in the audio data being processed belong to the closed speaker set. No score thresholds are applied in this case. When you run a closed-set identification task, set the ClosedSet parameter in the AudioTemplateScore module to True (you can do this by using the ClosedSet action parameter for the SpkIdEvalWav and SpkIdEvalStream speaker identification tasks).
The following example shows the same results as previously, but this time for closed-set identification rather than open-set:
1 A 10.630 9.460 Smith FEMALE 0.15200
1 A 20.090 6.150 Brown MALE 0.863
In this segment, from 10.630 to 20.090, the speaker Smith was the top-scoring template and was given as the result (even though the score is relatively low). Note also that the score for Brown is different from that shown in the open-set example. When you set ClosedSet to True, a different score normalization strategy is used that is more suited to closed-set operation, and thus the scores might differ.
Note: An Unknown_ result might still be given in a closed-set system, but only if the audio segment is determined to be non-speech, or is too short for accurate classification.
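The difference between the two modes can be summarized in a short decision sketch. It is a conceptual illustration only: the function `pick_speaker`, its parameters, and the `usable` flag (standing in for the non-speech and too-short checks) are hypothetical, and the sketch reuses one set of scores for both modes even though Speech Server applies a different score normalization when ClosedSet is True.

```python
UNKNOWN = "Unknown_"


def pick_speaker(scores, thresholds=None, closed_set=False, usable=True):
    """Return a speaker label for one audio segment.

    scores:     template name -> score for this segment
    thresholds: template name -> score threshold (open-set only)
    usable:     False if the segment is non-speech or too short
    """
    if not usable or not scores:
        return UNKNOWN
    if closed_set:
        # Closed set: the top-scoring template wins, however low its score.
        return max(scores, key=scores.get)
    # Open set: a template qualifies only if it beats its own threshold.
    # Taking the best qualifying template is an assumption of this sketch.
    genuine = {name: s for name, s in scores.items() if s > thresholds[name]}
    if not genuine:
        return UNKNOWN
    return max(genuine, key=genuine.get)


scores = {"Smith": 0.152, "Brown": 0.05}
thresholds = {"Smith": 0.5, "Brown": 0.5}  # hypothetical values
print(pick_speaker(scores, closed_set=True))
print(pick_speaker(scores, thresholds))
```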