Semester

Fall

Date of Graduation

2020

Document Type

Thesis

Degree Type

College

Statler College of Engineering and Mineral Resources

Department

Lane Department of Computer Science and Electrical Engineering

Committee Chair

Thirimachos Bourlai

Committee Member

Natalia Schmid

Committee Member

Yuxin Liu

Abstract

Speech mode classification has not been explored as widely as other areas of sound classification, such as environmental sounds, music genre, and speaker identification. But what is speech mode? While mode is defined as the way or manner in which something occurs, is expressed, or is done, speech mode is defined as the style in which a person delivers speech.

There are some reports on speech mode classification (for example, whispering versus talking using a normal phonetic sound) that use conventional methods. However, to the best of our knowledge, deep learning (DL)-based methods have not been reported in the open literature for this classification scenario. In this work, we assess the performance of image-based classification algorithms on this challenging speech mode classification problem, including the use of pre-trained deep neural networks, namely AlexNet, ResNet18, and SqueezeNet. We compare the classification efficiency of a set of DL-based classifiers and assess the impact of different 2D image representations (spectrograms, mel-spectrograms, and their image-based fusion) on classification accuracy. These representations are generated from the original audio signals and used as input to the networks. Next, we compare the accuracy of the DL-based classifiers to that of a set of machine learning (ML) classifiers that take Mel-Frequency Cepstral Coefficient (MFCC) features as input. Then, after determining the most efficient sampling rate for our classification problem (i.e., 32 kHz), we study the performance of our proposed method, which combines CNNs with Long Short-Term Memory (LSTM) networks. For this purpose, we use the features extracted from the deep networks of the previous step. We conclude our study by evaluating the role of sampling rate on classification accuracy, generating two sets of 2D image representations, one at 32 kHz and the other at 16 kHz sampling. Experimental results show that, after cross-validation, the accuracy of the DL-based approaches is 15% higher than that of the ML ones, with SqueezeNet yielding an accuracy of more than 91% at 32 kHz, whether we use transfer learning, feature-level fusion, or score-level fusion (92.5%). Our proposed method using LSTMs further increased that accuracy by more than 3%, resulting in an average accuracy of 95.7%.
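
The abstract mentions score-level fusion without stating which fusion rule was used. As a rough illustration only, the sketch below assumes one common variant: averaging the per-class scores produced by several classifiers and choosing the class with the highest mean. The function name, the class labels, and the score values are hypothetical and are not taken from the thesis.

```python
from typing import Dict, List


def fuse_scores(score_sets: List[Dict[str, float]]) -> str:
    """Average per-class scores across classifiers and return the top class.

    Assumption: every classifier reports a score for the same set of classes.
    """
    if not score_sets:
        raise ValueError("need at least one set of scores")
    classes = list(score_sets[0].keys())
    fused = {
        label: sum(scores[label] for scores in score_sets) / len(score_sets)
        for label in classes
    }
    # Pick the class whose averaged score is largest.
    return max(fused, key=fused.get)


# Made-up scores from two hypothetical classifiers, one per image representation.
spectrogram_scores = {"whisper": 0.55, "normal": 0.45}
mel_spectrogram_scores = {"whisper": 0.30, "normal": 0.70}

print(fuse_scores([spectrogram_scores, mel_spectrogram_scores]))
```

Other rules, such as a weighted sum or taking the maximum score per class, fit the same interface; simple averaging is used here only because it is the easiest to read.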

Recommended Citation

Vakkantula, Pratyusha Chowdary, "Speech Mode Classification using the Fusion of CNNs and LSTM Networks" (2020). Graduate Theses, Dissertations, and Problem Reports. 7845.
https://researchrepository.wvu.edu/etd/7845

Included in

Data Science Commons, Electrical and Computer Engineering Commons

DOI

https://doi.org/10.33915/etd.7845

Graduate Theses, Dissertations, and Problem Reports

Speech Mode Classification using the Fusion of CNNs and LSTM Networks
