Design and Engineering Practice of a Visual-Voice Multimodal Collaborative Perception System for Community Security
DOI: https://doi.org/10.63593/IST.2788-7030.2025.09.008

Keywords: community security, multimodal collaborative perception, feature-level fusion, improved YOLOv12s, CRNN, attention mechanism, real-time detection, edge deployment, abnormal sound recognition, dynamic decision-making

Abstract
To address the inherent limitations of single-modal perception in community security scenarios (visual detection degrades under low-light conditions and occlusions, while voice recognition is prone to misjudgments caused by environmental noise), this study designs and implements a deep learning-based visual-voice multimodal collaborative perception system. Centered on the principle of "heterogeneous modal complementary enhancement", the system adopts a modular architecture built on feature-level fusion and dynamic collaborative decision-making: (1) the visual module employs an improved YOLOv12s algorithm, integrating adaptive Retinex contrast enhancement and dynamic Gaussian Mixture Model (GMM) background modeling to improve the robustness of object detection under complex lighting; (2) the voice module is built on a CRNN (CNN+BiLSTM) architecture, combining multi-channel beamforming and SpecAugment data augmentation to strengthen abnormal sound recognition in noisy environments; (3) the multimodal collaboration module introduces an attention-based feature alignment mechanism and scene-adaptive threshold decision-making to fuse cross-modal information efficiently.
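As a rough illustration of how such a feature-level fusion stage could be organized (the abstract does not publish code), the PyTorch sketch below aligns voice features to visual features with cross-modal attention and then applies a scene-adaptive decision threshold. All tensor shapes, module names, and the threshold schedule are hypothetical placeholders, not the authors' actual implementation.

```python
# Hypothetical sketch of attention-based visual-voice feature fusion with a
# scene-adaptive decision threshold. Dimensions, module names, and the
# threshold schedule are illustrative assumptions, not the paper's code.
import torch
import torch.nn as nn


class CrossModalAttentionFusion(nn.Module):
    def __init__(self, vis_dim=256, aud_dim=128, fused_dim=256, num_heads=4):
        super().__init__()
        # Project both modalities into a shared embedding space.
        self.vis_proj = nn.Linear(vis_dim, fused_dim)
        self.aud_proj = nn.Linear(aud_dim, fused_dim)
        # Visual tokens attend to voice frames (feature-level alignment).
        self.attn = nn.MultiheadAttention(fused_dim, num_heads, batch_first=True)
        self.classifier = nn.Sequential(
            nn.Linear(2 * fused_dim, fused_dim),
            nn.ReLU(),
            nn.Linear(fused_dim, 1),  # abnormal-event confidence logit
        )

    def forward(self, vis_feat, aud_feat):
        # vis_feat: (B, Tv, vis_dim) visual tokens; aud_feat: (B, Ta, aud_dim) audio frames.
        q = self.vis_proj(vis_feat)
        kv = self.aud_proj(aud_feat)
        aligned, _ = self.attn(q, kv, kv)  # voice evidence re-weighted per visual token
        pooled = torch.cat([q.mean(dim=1), aligned.mean(dim=1)], dim=-1)
        return torch.sigmoid(self.classifier(pooled)).squeeze(-1)  # (B,) fused confidence


def scene_adaptive_threshold(illumination_lux, noise_db,
                             base=0.50, low_light_relief=0.10, noisy_relief=0.05):
    """Lower the alarm threshold when one modality is expected to degrade."""
    thr = base
    if illumination_lux < 20:   # night scene: demand less visual confidence
        thr -= low_light_relief
    if noise_db >= 65:          # noisy scene: compensate for a degraded audio channel
        thr -= noisy_relief
    return max(thr, 0.30)


if __name__ == "__main__":
    fusion = CrossModalAttentionFusion()
    vis = torch.randn(2, 49, 256)   # e.g. a 7x7 detector feature map flattened to tokens
    aud = torch.randn(2, 100, 128)  # e.g. 100 spectrogram frames from the CRNN encoder
    conf = fusion(vis, aud)
    thr = scene_adaptive_threshold(illumination_lux=15, noise_db=70)
    print(conf, conf > thr)
```

In this toy setup, cross-modal attention lets voice evidence re-weight the visual representation token by token, while the scene-adaptive threshold mirrors the idea of relaxing the decision criterion in scenes where one modality is known to be unreliable.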
Validated on the self-constructed CommunityGuard V1.0 community security dataset (50 hours of multi-scenario synchronized audio-visual data covering day/night, sunny/rainy, and noisy/quiet sub-scenarios), the multimodal collaborative detection achieves F1-scores 5.8% and 13.6% higher than visual-only and voice-only detection, respectively. In particular, in night-noisy scenarios (illumination < 20 lux, noise ≥ 65 dB), the F1-score reaches 85.6%, a maximum improvement of 17.4% over single-modal detection. The end-to-end inference latency remains stable at 5 ms (±1 ms) on a Tesla T4 GPU with TensorRT 10 (Redmon, J., & Farhadi, A., 2018), meeting the real-time requirements of community security. Meanwhile, the system is lightweight enough to be deployed on edge devices.