Lip reading aims to recognize text from a talking face without audio information. Recent works have focused on effectively extracting spatial and temporal information. We introduce a two-stream network that exploits the complementarity of global and local spatial information. The global spatial information is generated directly by the global stream, while a patch selection module in the local stream uses an attention mechanism to select critical local information. The fused features of the two streams, together with the global features, are then fed into a temporal module to further explore temporal clues. To guide the selection of local information from the fused features, and to let the global and local streams learn from each other, we design a global information guide loss and a mutual learning loss, respectively. Finally, extensive experiments on the LRW and CAS-VSR-W1K datasets demonstrate the superiority of our two-stream network.
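Two of the components the abstract names, attention-based selection of local patches and a mutual learning loss between the two streams, can be sketched roughly as follows. This is a minimal illustrative sketch, not the authors' implementation: the function names, the use of a global query vector for scoring, and the top-k selection rule are all assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def select_patches(patch_feats, global_query, k=2):
    """Hypothetical patch selection module: score each local patch feature
    against a global query vector with attention, keep the top-k patches."""
    scores = softmax(patch_feats @ global_query)      # (num_patches,)
    top = np.argsort(scores)[::-1][:k]                # indices of best patches
    return patch_feats[top], scores

def mutual_learning_loss(p_global, p_local):
    """Symmetric KL divergence between the two streams' class distributions,
    so that each stream is encouraged to match the other's predictions."""
    kl = lambda p, q: float(np.sum(p * np.log(p / q)))
    return kl(p_global, p_local) + kl(p_local, p_global)

# Toy usage: 8 mouth-region patches with 4-dim features.
rng = np.random.default_rng(0)
patches = rng.normal(size=(8, 4))
query = rng.normal(size=4)
selected, attn = select_patches(patches, query, k=2)
print(selected.shape)  # (2, 4): the two most attended patches
```

The symmetric KL term is zero when the two streams already agree, so it only pushes gradients when their predicted distributions diverge.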