Video description combining visual and audio features
14 April 2023
Yongbo Li, Zhanbin Che
Proceedings Volume 12613, International Conference on Computer Vision, Application, and Algorithm (CVAA 2022); 1261304 (2023) https://doi.org/10.1117/12.2673214
Event: International Conference on Computer Vision, Application, and Algorithm (CVAA 2022), 2022, Chongqing, China
Abstract
Video description has become a research hotspot in recent years because of its wide application value. Visual features alone cannot reliably guide the generation of an accurate video description, which leads to mismatches between the generated text and the video content. To address this problem, a video description generation algorithm that combines visual and audio features is proposed. First, a vision transformer model extracts the visual feature vector, and Mel-frequency cepstral coefficients (MFCCs) are used to extract the audio feature vector. The two feature vectors are concatenated and then average-pooled to obtain global feature information. Second, the pooled features are passed to a transformer encoder for encoding. Finally, the encoded results are passed to a transformer decoder, which generates the video description text. The transformer framework includes a multi-head self-attention mechanism, which can focus on the more important video features while capturing temporal information, making the generated description more accurate. The proposed method has been tested on the public datasets MSRVTT and MSVD and has achieved good results on four different evaluation metrics.
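The fusion step described in the abstract (join the visual and audio feature vectors, then apply average pooling) can be illustrated with a minimal sketch. This is not the authors' implementation. The abstract does not say along which axis the pooling runs or how the two streams are aligned, so the sketch assumes one visual vector and one audio vector per time step and pools over non-overlapping windows of time steps. The function names `splice_features` and `average_pool`, the `window` parameter, and all numeric values are hypothetical; in the paper the inputs would come from a vision transformer and from MFCCs rather than from hand-written lists.

```python
from typing import List

Vector = List[float]


def splice_features(visual: List[Vector], audio: List[Vector]) -> List[Vector]:
    """Concatenate per-step visual and audio vectors along the feature axis."""
    if len(visual) != len(audio):
        raise ValueError("visual and audio sequences must have the same length")
    return [v + a for v, a in zip(visual, audio)]


def average_pool(sequence: List[Vector], window: int) -> List[Vector]:
    """Average non-overlapping windows of time steps.

    Assumption: pooling runs over time; the last window may be shorter.
    """
    if window < 1:
        raise ValueError("window must be at least 1")
    pooled = []
    for start in range(0, len(sequence), window):
        chunk = sequence[start:start + window]
        dim = len(chunk[0])
        pooled.append([sum(vec[i] for vec in chunk) / len(chunk) for i in range(dim)])
    return pooled


# Toy inputs (made-up values): 4 time steps,
# 3-dimensional "visual" and 2-dimensional "audio" features.
visual = [[1.0, 0.0, 2.0], [3.0, 0.0, 4.0], [5.0, 1.0, 0.0], [7.0, 1.0, 2.0]]
audio = [[0.5, 1.5], [1.5, 2.5], [2.0, 0.0], [4.0, 2.0]]

fused = splice_features(visual, audio)   # 4 steps, 5 features each
pooled = average_pool(fused, window=2)   # 2 steps, 5 features each
print(pooled)
```

Under these assumptions, the pooled sequence is what would be handed to the transformer encoder. Concatenating along the feature axis keeps the two modalities aligned in time, so self-attention in the encoder can weigh each time step using both visual and audio evidence.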
© (2023) COPYRIGHT Society of Photo-Optical Instrumentation Engineers (SPIE). Downloading of the abstract is permitted for personal use only.
Yongbo Li and Zhanbin Che "Video description combining visual and audio features", Proc. SPIE 12613, International Conference on Computer Vision, Application, and Algorithm (CVAA 2022), 1261304 (14 April 2023); https://doi.org/10.1117/12.2673214
KEYWORDS
Video
Visualization
Feature extraction
Transformers
Video coding
Education and training
Visual process modeling