Video description combining visual and audio features

Yongbo Li; Zhanbin Che

doi:10.1117/12.2673214

14 April 2023

Proceedings Volume 12613, International Conference on Computer Vision, Application, and Algorithm (CVAA 2022); 1261304 (2023) https://doi.org/10.1117/12.2673214
Event: International Conference on Computer Vision, Application, and Algorithm (CVAA 2022), 2022, Chongqing, China

Abstract

Video description has become a research hotspot in recent years because of its wide range of applications. Visual features alone cannot reliably guide the generation of an accurate video description, which leads to mismatches between the generated text and the video content. To solve this problem, a video description text generation algorithm is proposed that combines visual and audio features to improve the accuracy of the generated text. First, a vision transformer model is used to extract the visual feature vector, and Mel-frequency cepstral coefficients (MFCCs) are used to extract the audio feature vector. The two feature vectors are concatenated and then average-pooled to obtain global feature information. Second, the pooled feature information is sent to a transformer encoder for encoding. Finally, the encoded results are sent to a transformer decoder, which generates the video description text. The transformer framework contains a multi-head self-attention mechanism, which can focus on the more important video features while capturing temporal feature information, making the generated description more accurate. The proposed method was tested on the public datasets MSRVTT and MSVD and achieved good results on four different evaluation metrics.
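The fusion step described in the abstract (concatenating visual and audio feature vectors, then average pooling) can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes one visual vector and one audio vector per time step, concatenation along the feature axis, and pooling over the whole time axis, none of which the abstract specifies. The function names `splice` and `average_pool` are hypothetical, and the toy numbers stand in for real ViT and MFCC features.

```python
from typing import List

Vector = List[float]


def splice(visual: List[Vector], audio: List[Vector]) -> List[Vector]:
    """Concatenate the visual and audio vector of each time step.

    Assumption: both sequences are already aligned to the same number
    of time steps. The abstract does not say how alignment is done.
    """
    if len(visual) != len(audio):
        raise ValueError("visual and audio sequences must have the same length")
    return [v + a for v, a in zip(visual, audio)]


def average_pool(sequence: List[Vector]) -> Vector:
    """Average a sequence of equal-length vectors over the time axis."""
    if not sequence:
        raise ValueError("cannot pool an empty sequence")
    steps = len(sequence)
    # zip(*sequence) iterates over feature columns across all time steps.
    return [sum(column) / steps for column in zip(*sequence)]


# Toy stand-ins: 2 time steps, 2-dim "visual" and 1-dim "audio" features.
visual = [[1.0, 2.0], [3.0, 4.0]]
audio = [[10.0], [20.0]]

fused = splice(visual, audio)          # shape: 2 steps x 3 features
global_feature = average_pool(fused)   # shape: 3 features
print(global_feature)
```

In a full pipeline the pooled features would then be passed to the transformer encoder and decoder. The sketch stops before that stage because the abstract gives no architectural details for it.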

Citation

Yongbo Li and Zhanbin Che "Video description combining visual and audio features", Proc. SPIE 12613, International Conference on Computer Vision, Application, and Algorithm (CVAA 2022), 1261304 (14 April 2023); https://doi.org/10.1117/12.2673214

PROCEEDINGS
11 PAGES

KEYWORDS

Video

Visualization

Feature extraction

Transformers

Video coding

Education and training

Visual process modeling
