Generating description for an image can be regard as visual understanding. It is across artificial intelligence, machine learning, natural language processing and many other areas. In this paper, we present a model that generates description for images based on RNN (recurrent neural network) with object attention and multi-feature of images. The deep recurrent neural networks have excellent performance in machine translation, so we use it to generate natural sentence description for images. The proposed method uses single CNN (convolution neural network) that is trained on ImageNet to extract image features. But we think it can not adequately contain the content in images, it may only focus on the object area of image. So we add scene information to image feature using CNN which is trained on Places205. Experiments show that model with multi-feature extracted by two CNNs perform better than which with a single feature. In addition, we make saliency weights on images to emphasize the salient objects in images. We evaluate our model on MSCOCO based on public metrics, and the results show that our model performs better than several state-of-the-art methods.
Access to the requested content is limited to institutions that have purchased or subscribe to SPIE eBooks.
You are receiving this notice because your organization may not have SPIE eBooks access.*
*Shibboleth/Open Athens users─please
sign in
to access your institution's subscriptions.
To obtain this item, you may purchase the complete book in print or electronic format on
SPIE.org.
INSTITUTIONAL Select your institution to access the SPIE Digital Library.
PERSONAL Sign in with your SPIE account to access your personal subscriptions or to use specific features such as save to my library, sign up for alerts, save searches, etc.