The visual question answering (VQA) task requires using the content of a question to locate the corresponding regions of an image. However, traditional attention-based VQA methods cannot accurately match the image regions that are relevant to the question, which leads to less satisfactory performance. In this paper, an Efficient Multi-step Reasoning Attention Network (EMRA), composed mainly of a multi-step reasoning attention module and G-LReLU non-linear layers, is proposed to address this problem. Specifically, the multi-step reasoning attention module combines the initial visual features, the question features, and the joint features to generate more effective attended features, which precisely represent the image regions related to the question. The attended visual features produced by the multi-step reasoning attention module and the question features are then fed into the G-LReLU non-linear layers, which perform a non-linear transformation for better fusion before answer prediction. In addition, considering the relationship between scaling and the number of reasoning steps, increasing the model width as the number of inference steps grows improves the accuracy of the model. Experimental results on the VQA v2.0 dataset demonstrate that our model significantly outperforms Bottom-Up and Top-Down Attention based methods and is competitive with state-of-the-art models.
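The abstract does not give the module's exact formulation, but the described flow (fuse visual and question features into joint features, attend, refine, repeat) admits a minimal sketch. The PyTorch code below is an illustrative assumption, not the paper's implementation: the fusion operator (element-wise product after projection), the query-update rule, and all dimensions are hypothetical choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiStepReasoningAttention(nn.Module):
    """Hedged sketch of a multi-step reasoning attention module."""

    def __init__(self, v_dim: int, q_dim: int, hid_dim: int, num_steps: int = 2):
        super().__init__()
        assert num_steps >= 1
        self.num_steps = num_steps
        self.v_proj = nn.Linear(v_dim, hid_dim)   # project visual features
        self.q_proj = nn.Linear(q_dim, hid_dim)   # project question features
        self.att = nn.Linear(hid_dim, 1)          # score each image region
        self.update = nn.Linear(v_dim, q_dim)     # fold attended info back into the query

    def forward(self, v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # v: (batch, num_regions, v_dim) initial visual features
        # q: (batch, q_dim) question features
        for _ in range(self.num_steps):
            # Joint features: fuse projected visual and question representations.
            joint = torch.tanh(self.v_proj(v) * self.q_proj(q).unsqueeze(1))
            weights = F.softmax(self.att(joint), dim=1)   # (batch, num_regions, 1)
            attended = (weights * v).sum(dim=1)           # (batch, v_dim)
            # Refine the question query with the attended features for the next step.
            q = q + self.update(attended)
        return attended
```

Each step re-scores the regions with a query refined by the previous step's attended features, which is one plausible reading of "multi-step reasoning" in the abstract.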
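The abstract likewise does not define G-LReLU. A common design in VQA fusion networks (e.g., the gated tanh layer of the Bottom-Up and Top-Down Attention model) gates a non-linear projection with a sigmoid; the sketch below assumes G-LReLU follows that pattern with LeakyReLU in place of tanh. The class name and negative slope are hypothetical.

```python
import torch
import torch.nn as nn

class GLReLU(nn.Module):
    """Assumed gated LeakyReLU non-linear layer; not the paper's confirmed definition."""

    def __init__(self, in_dim: int, out_dim: int, negative_slope: float = 0.1):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim)     # main non-linear path
        self.gate = nn.Linear(in_dim, out_dim)   # sigmoid gating path
        self.act = nn.LeakyReLU(negative_slope)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Element-wise gate the LeakyReLU-transformed projection.
        return self.act(self.fc(x)) * torch.sigmoid(self.gate(x))
```

In the fusion stage described by the abstract, the attended visual features and the question features would each pass through such a layer before being combined for answer prediction.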