A Multi-Branch BiLSTM with Multi-Head Self-Attention for Suspicious Sound Recognition
DOI: https://doi.org/10.62411/jcta.15777

Keywords: Acoustic event detection, BiLSTM, Deep learning, Mel-frequency cepstral coefficients, Multi-head self-attention, Smart city surveillance, Sustainable urban monitoring, Urban sound classification

Abstract
Suspicious urban sound recognition is a critical component of intelligent public safety and urban monitoring systems, enabling the automated identification of anomalous acoustic events such as gunshots, sirens, and other security-sensitive sounds. However, existing deep learning approaches often struggle to simultaneously capture long-range temporal dependencies and global contextual relationships, particularly under noisy and acoustically complex urban conditions. This limitation can reduce reliability in safety-critical scenarios where missed detections carry significant risk. To address these challenges, this study proposes a Multi-Branch Bidirectional Long Short-Term Memory (BiLSTM) framework with Multi-Head Self-Attention (MHSA) for enhanced sequential and contextual feature modeling. Mel-frequency cepstral coefficients (MFCCs) are extracted from a curated subset of the UrbanSound8K dataset, comprising five suspicious sound classes, and used as input to the proposed architecture. The multi-branch design enables complementary temporal representations, while the self-attention mechanism provides lightweight contextual weighting of BiLSTM outputs. Experimental results demonstrate that the proposed model achieves a test accuracy of 95.59%, outperforming conventional Dense and LSTM-based baseline models under identical experimental settings. An ablation study further confirms the contribution of multi-branch integration and attention-based enhancement to overall performance. Class-wise evaluation reveals consistently high recall across all sound categories, particularly for safety-critical classes such as gunshots and sirens. These findings indicate that the proposed framework provides robust and reliable performance, making it suitable for real-time smart city surveillance and public safety applications.
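
To make the described pipeline concrete, the sketch below shows one plausible Keras implementation of the abstract's two main steps: MFCC extraction with librosa and a multi-branch BiLSTM whose merged outputs are re-weighted by multi-head self-attention. All settings here (branch widths, number of attention heads, N_MFCC, frame count) are illustrative assumptions, not the paper's reported configuration.

```python
import librosa
from tensorflow.keras import layers, models

# Assumed settings; the abstract does not state the MFCC configuration.
N_MFCC = 40        # assumed number of MFCC coefficients
MAX_FRAMES = 174   # assumed fixed frame count per clip
NUM_CLASSES = 5    # the five suspicious-sound classes from UrbanSound8K

def extract_mfcc(path: str):
    """Return a (MAX_FRAMES, N_MFCC) MFCC matrix for one audio clip."""
    y, sr = librosa.load(path)                              # 22.05 kHz mono
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=N_MFCC)  # (N_MFCC, frames)
    mfcc = librosa.util.fix_length(mfcc, size=MAX_FRAMES, axis=1)
    return mfcc.T                                           # time steps first

# Model: two parallel BiLSTM branches followed by multi-head self-attention.
inputs = layers.Input(shape=(MAX_FRAMES, N_MFCC))

# Branches of different widths learn complementary temporal representations
# of the same MFCC sequence.
branch_a = layers.Bidirectional(layers.LSTM(64, return_sequences=True))(inputs)
branch_b = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(inputs)
merged = layers.Concatenate()([branch_a, branch_b])

# Multi-head self-attention provides global contextual weighting of the
# merged BiLSTM outputs; head count and key_dim are illustrative choices.
attn = layers.MultiHeadAttention(num_heads=4, key_dim=32)(merged, merged)
attn = layers.LayerNormalization()(layers.Add()([attn, merged]))

pooled = layers.GlobalAveragePooling1D()(attn)
outputs = layers.Dense(NUM_CLASSES, activation="softmax")(pooled)

model = models.Model(inputs, outputs)
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```

Concatenating branches of different widths before attention is one simple way to realize the complementary temporal representations the abstract mentions; the authors' exact branch design may differ.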
Copyright (c) 2026 Shehu Mohammed Yusuf, Hamza Saidu, Sani Saleh Saminu

This work is licensed under a Creative Commons Attribution 4.0 International License.