Statler College of Engineering and Mineral Resources
Lane Department of Computer Science and Electrical Engineering
Although successfully implemented in certain situations, speaker recognition (SR) becomes less reliable as a result of speaker and channel variability between enrollment and evaluation samples, as well as limits on the available length of speech utterances. The issue of speaker variability becomes more pronounced as the number of speakers in the evaluation set increases, because there is a higher probability that two voices will sound alike. Differentiating the voices within twin pairs can benefit general SR because including twins simulates this effect: their shared voice similarities are comparable to the inter-speaker similarities one would expect in a large database of enrolled speakers. Furthermore, twin occurrence has steadily increased over the past thirty years. Despite these considerations, there have been few research efforts analyzing the impact of identical twins on SR, and those efforts have been lacking in terms of the corpus of individuals sampled and/or the technology employed.
In this research effort, the long short-term memory (LSTM) network, a recurrent neural network that specializes in processing time series data, is evaluated on a large corpus of identical twins' speech collected over two years with multiple speaking modes. The LSTM's recurrent capability enables the exploitation of higher-level speech features, which are hypothesized to vary more between identical twins. The LSTM is configured both as a single network and in a Siamese fashion to evaluate performance across varied utterance lengths and speech features. Matching results are analyzed and discussed in comparison to state-of-the-art i-vector methodologies. Results in terms of the equal error rate (EER) indicate that, as the length of the enrollment and test utterances is reduced, the LSTM outperforms the i-vector system by 17% for two seconds of speech data. Of the three speech features investigated in the Siamese configuration, mel-frequency spectral coefficients resulted in the highest rate of twin voice differentiation, with an EER of 8.57% for six seconds of data. Lastly, in comparison to other twins' voice studies, the introduction of more individuals degrades performance for male speakers from nearly 0% to 0.598% EER. However, the results from these experiments are far lower than those of female trials in other studies.
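Because the results above are reported as equal error rates, a minimal sketch of how an EER can be approximated from match scores may help in reading them. This is an illustration under stated assumptions, not the scoring code used in the thesis: the function name `equal_error_rate`, the convention that higher scores mean "more likely the same speaker", and the example scores are all hypothetical. In a twins setting, the impostor trials would be comparisons against the enrolled speaker's twin.

```python
def equal_error_rate(genuine, impostor):
    """Approximate the EER from similarity scores (higher = more likely same speaker).

    genuine:  scores from trials where both utterances come from the same speaker
    impostor: scores from trials where the utterances come from different speakers
    """
    # Candidate thresholds: every distinct score observed in either set.
    thresholds = sorted(set(genuine) | set(impostor))
    best_gap, eer = float("inf"), None
    for t in thresholds:
        # False rejection rate: genuine trials scored below the threshold.
        frr = sum(s < t for s in genuine) / len(genuine)
        # False acceptance rate: impostor trials scored at or above the threshold.
        far = sum(s >= t for s in impostor) / len(impostor)
        gap = abs(far - frr)
        if gap < best_gap:
            # The EER is taken where the two error rates are closest to equal.
            best_gap, eer = gap, (far + frr) / 2
    return eer


# Hypothetical scores, invented purely for illustration.
genuine_scores = [0.9, 0.8, 0.7, 0.4]
impostor_scores = [0.6, 0.3, 0.2, 0.1]
print(equal_error_rate(genuine_scores, impostor_scores))
```

With a finite set of trials the two error rates rarely cross exactly, so this sketch reports the average of the two rates at the threshold where they are closest. Interpolating along the error curve is a common alternative.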
Sabatier, Stallone Bruno-Ray, "A Long Short-term Memory Neural Network for Improved Twins' Voice Differentiation" (2018). Graduate Theses, Dissertations, and Problem Reports. 6551.