Learning from Image Captioning
Visual Question Answering (VQA) has important practical applications. For example, it could help blind and visually impaired people understand their surroundings: after being given a description of an image, they can ask follow-up questions about it. This helps them understand the image and its contents in greater depth.
Most papers report collecting their annotations through Amazon Mechanical Turk. Launched in November 2005, Amazon Mechanical Turk is named after a legendary automaton from the 18th century, the Turk, which could play chess. The wooden man, adorned with a turban, appeared to be powered by the machinery of a clock. The Turk was a sensation: a machine that seemed to think. Coinciding with the birth of the Industrial Revolution, the Turk heightened anxieties that machines would replace humans in the workplace. Of course, it turned out to be a fabulous hoax. The ghost in the machine was, as skeptics had suspected, all too human. A chess expert was hidden in the Turk, making all the right moves. Amazon Mechanical Turk follows the same idea and employs humans to answer basic questions, e.g. ‘How many objects are in the image?’. Different datasets apply different rules to these answers, such as majority voting among a fixed number of workers, to decide whether an answer counts as right or wrong.
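The majority-voting rule mentioned above can be sketched in a few lines. This is only an illustration: the function name `consolidate_answers`, the `min_agreement` threshold, and the lowercase normalization are assumptions, not the rule of any particular dataset.

```python
from collections import Counter


def consolidate_answers(answers, min_agreement=3):
    """Return the most common worker answer if at least `min_agreement`
    workers gave it, otherwise None.

    Hypothetical rule: real datasets each define their own threshold
    and answer normalization.
    """
    # Normalize so that "2" and "2 " count as the same answer.
    normalized = [a.strip().lower() for a in answers]
    if not normalized:
        return None
    answer, votes = Counter(normalized).most_common(1)[0]
    return answer if votes >= min_agreement else None


# Five workers answer 'How many objects are in the image?'
worker_answers = ["2", "2", "3", "2 ", "two"]
print(consolidate_answers(worker_answers))
```

Note that this simple normalization treats "2" and "two" as different answers, so a real pipeline would probably need a richer normalization step.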
Visual Question Answering is considered one of the most AI-relevant tasks. An AI-complete task has the following features:
- Multi-modal knowledge from several sub-domains
- Well-defined evaluation metric (currently, the number of questions that a VQA system answers correctly)
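As a rough sketch of the evaluation metric above, the fraction of correctly answered questions can be computed as follows. Exact string matching after lowercasing is an assumption made here for simplicity, and `vqa_accuracy` is a hypothetical helper name; actual benchmarks may score answers differently.

```python
def vqa_accuracy(predictions, ground_truth):
    """Fraction of questions whose predicted answer matches the ground truth.

    Assumes one ground-truth answer per question and exact matching
    after stripping whitespace and lowercasing.
    """
    if len(predictions) != len(ground_truth):
        raise ValueError("predictions and ground_truth must have equal length")
    if not ground_truth:
        return 0.0
    correct = sum(
        p.strip().lower() == g.strip().lower()
        for p, g in zip(predictions, ground_truth)
    )
    return correct / len(ground_truth)


predicted = ["yes", "2", "red"]
expected = ["yes", "3", "Red"]
print(vqa_accuracy(predicted, expected))  # 2 of 3 answers match
```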
Free form and open-ended VQA
Such questions require a broad set of AI capabilities, including:
- Fine-grained recognition
- Object detection
- Activity recognition
- Knowledge based reasoning
- Commonsense reasoning
Open-ended VQA differs from multiple-choice VQA because the latter only needs to pick an answer from a predefined list of candidates, which I think makes the task much easier.
This paper describes the Microsoft COCO dataset collection and the type of questions.
Computer Vision and Natural Language Processing are connected by the problem of generating a caption for a given image. The caption acts as a description of the image and must capture the objects in the image and their relations to one another. It must then be expressed as a semantically correct sentence in a natural language. This paper presents a neural network trained using stochastic gradient descent (SGD). The model combines sub-networks for vision and for language. If the model is pretrained on a larger corpus, the resulting captions can make use of the additional data. The model achieves a BLEU score of 59 on the Pascal dataset. An LSTM-based sentence generator is used.
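To make the BLEU figure more concrete, here is a minimal sketch of unigram BLEU (BLEU-1): clipped unigram precision multiplied by a brevity penalty. It assumes whitespace tokenization and a single sentence, and the function name `bleu1` is made up for this example. It is not the evaluation script used in the paper, and full BLEU also combines higher-order n-grams over a whole corpus.

```python
import math
from collections import Counter


def bleu1(candidate, references):
    """Unigram BLEU for one candidate caption against reference captions."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand or not refs:
        return 0.0

    # For each word, the maximum count seen in any single reference.
    max_ref_counts = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)

    # Clip candidate counts so repeated words are not over-rewarded.
    clipped = sum(
        min(count, max_ref_counts[word])
        for word, count in Counter(cand).items()
    )
    precision = clipped / len(cand)

    # Brevity penalty against the reference closest in length
    # (ties go to the shorter reference).
    ref_len = min((len(r) for r in refs), key=lambda n: (abs(n - len(cand)), n))
    penalty = 1.0 if len(cand) > ref_len else math.exp(1 - ref_len / len(cand))
    return penalty * precision


print(bleu1("a dog runs on the grass", ["a dog is running on the grass"]))
```

The function returns a value between 0 and 1; reported scores are usually this value multiplied by 100.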
Previous approaches map both the natural language caption and the image to a common embedding. In this paper, a bi-directional mapping between images and their sentence descriptions is learnt using RNNs, so that captions can be generated given an image. Conversely, the visual features can be reconstructed given an image description.
A joint feature space for images and their descriptions is obtained by projecting the image features and the sentence features into a common space. This bidirectional representation is capable of generating both novel descriptions from images and visual representations from descriptions. The novel representation dynamically captures the visual aspects of the scene: as each word is generated or read, the corresponding visual representation is updated to reflect the new information that the word describes. Long-term memory is established through these dynamically updated visual representations, and the network is able to pick salient features from them.
They use a combination of CNNs over images, BRNNs over sentences and a structured objective that aligns these two modalities through a multimodal embedding. A Multimodal RNN is then used to generate image captions. The model has been tested on the Flickr8K, Flickr30K and MS COCO datasets. This paper treats language as a rich label space, in contrast to other works, which usually focus on labeling images with a fixed set of visual categories. Such closed vocabularies are convenient modeling assumptions, but they are restrictive compared to the variety of natural language descriptions that a human can compose.
A dense description of each image is generated that is rich enough to reason about the content of the image and its representation in the domain of natural language. The descriptions mention objects whose locations within the image are unknown. These locations are inferred by aligning contiguous segments of the sentences to the image, with the sentences treated as weak labels. Once these alignments are learnt, they can be used to train a generative model for the descriptions.
The model uses visual detectors, language models and multimodal similarity models learnt from a dataset of image captions. Multiple instance learning is used to train visual detectors for words that commonly occur in captions, covering different parts of speech. The word detector outputs serve as conditional inputs to a maximum entropy language model, which learns from the image descriptions to capture the statistics of word usage. Global semantics are captured by re-ranking candidate captions using sentence-level features and a deep multimodal similarity model. The caption generator is trained on a dataset of images and their corresponding descriptions. Captions are used directly because they provide the following advantages:
- Captions contain information that is inherently salient.
- Training a language model on image captions captures the commonsense knowledge about the scene.
- The learnt joint multimodal representation of images and captions is used to measure the global similarity between images and text.
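The re-ranking idea described above can be illustrated with a small sketch. It assumes that the image and every candidate caption have already been mapped to vectors in the same joint space; the toy vectors, the use of cosine similarity, and the names `cosine_similarity` and `rerank_captions` are all assumptions for illustration, not the paper's actual model.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rerank_captions(image_vec, caption_vecs):
    """Sort candidate captions by similarity to the image, best first.

    `caption_vecs` maps each caption string to its (hypothetical)
    embedding in the joint image-text space.
    """
    return sorted(
        caption_vecs,
        key=lambda caption: cosine_similarity(image_vec, caption_vecs[caption]),
        reverse=True,
    )


# Toy 2-D embeddings standing in for learnt representations.
image_vec = [1.0, 0.0]
candidates = {
    "a plate of food": [0.0, 1.0],
    "a dog": [0.5, 0.5],
    "a dog on grass": [0.9, 0.1],
}
print(rerank_captions(image_vec, candidates))
```

In a real system the vectors would come from the learnt image and text models, and the similarity score would typically be one of several features used for re-ranking.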