Speech Emotion Recognition Using Deep Learning LSTM for Tamil Language

Bennilo Fernandes and Kasiprasad Mannepalli

Pertanika Journal of Science & Technology, Volume 29, Issue 3, July 2021

DOI: https://doi.org/10.47836/pjst.29.3.33

Keywords: BiLSTM, DNN, Emotional Recognition, LSTM, RNN

Published on: 31 July 2021

Deep Neural Networks (DNNs) are neural networks with several hidden layers that give better results with classification algorithms in automated voice recognition tasks. Traditional feedforward neural networks capture only spatial correlations and do not model the speech signal adequately, so recurrent neural networks (RNNs) were introduced. Long Short-Term Memory (LSTM) networks are a special case of RNNs for speech processing that capture long-term dependencies; accordingly, deep hierarchical LSTM and BiLSTM networks are designed with dropout layers to reduce gradient and long-term learning errors in emotional speech analysis. Four combinations of the deep hierarchical learning architecture are designed with dropout layers to improve the networks: Deep Hierarchical LSTM and LSTM (DHLL), Deep Hierarchical LSTM and BiLSTM (DHLB), Deep Hierarchical BiLSTM and LSTM (DHBL), and Deep Hierarchical dual BiLSTM (DHBB). This paper compares the performance of all four models and shows that good classification efficiency is attained with a minimal Tamil-language dataset. The experimental results show that DHLB reaches the best precision, about 84%, in recognising emotions from the Tamil database, while DHBL gives 83% efficiency. The remaining designs, DHLL and DHBB, perform comparably but slightly lower, at 81% efficiency, while needing less data and minimal execution and training time.
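The paper itself does not publish code, but the DHLB design described above (an LSTM level feeding a BiLSTM level, each followed by a dropout layer, ending in an emotion classifier) can be sketched briefly. The following is a minimal, hypothetical Keras sketch: the layer widths, dropout rates, MFCC input shape, and five-class output are illustrative assumptions, not values taken from the paper.

```python
# Hypothetical sketch of a DHLB-style stack (LSTM -> dropout -> BiLSTM -> dropout).
# Layer widths, dropout rates, the MFCC input shape, and the number of emotion
# classes are assumptions for illustration, not values from the paper.
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_EMOTIONS = 5          # assumed number of emotion classes
FRAMES, N_MFCC = 300, 40  # assumed input: 300 frames of 40 MFCC features

model = models.Sequential([
    layers.Input(shape=(FRAMES, N_MFCC)),
    # First hierarchy level: unidirectional LSTM; keep the full sequence
    # so the next recurrent level can consume it.
    layers.LSTM(128, return_sequences=True),
    layers.Dropout(0.3),  # dropout layer, as in the four DH* designs
    # Second hierarchy level: bidirectional LSTM summarising the sequence.
    layers.Bidirectional(layers.LSTM(64)),
    layers.Dropout(0.3),
    layers.Dense(NUM_EMOTIONS, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```

The other three variants follow the same pattern: DHBL swaps the order (BiLSTM first, then LSTM), DHLL uses two unidirectional LSTM levels, and DHBB wraps both levels in `Bidirectional`.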

ISSN 0128-7680

e-ISSN 2231-8526

Article ID

JST-2460-2021
