DenseCap: Fully Convolutional Localization Networks for Dense Captioning
This paper brings forth the new idea of dense captioning: almost every object in the image gets a bounding box, and its contents are described by a caption. I like that their open-world detection task resembles image retrieval. Intuitively, though, I would expect that anyone searching for "white tennis shoes" wants the white tennis shoes to be one of the major objects of interest in the image, rather than as in the example images below. However, it is necessary to remember that the image dataset here contains non-iconic images (MS COCO, likewise, contains only non-iconic images). The question is: will the same model work well on iconic images?
I am not sure what the intuition behind bilinear interpolation is, but the paper mentions that a fixed sampling grid may not land on the region of interest, and bilinear interpolation fixes that by letting the network sample features at fractional coordinates.
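As I understand it, the point of bilinear interpolation is that a region's coordinates rarely align with the feature-map grid, so the value at a fractional location is blended from its four nearest cells, which keeps the cropping step differentiable. A minimal sketch (function and variable names are my own, not the paper's):

```python
import numpy as np

def bilinear_sample(feature_map, x, y):
    """Sample a 2-D feature map at fractional coordinates (x, y) by
    blending the four nearest grid cells, weighted by distance."""
    x0, y0 = int(np.floor(x)), int(np.floor(y))
    x1, y1 = x0 + 1, y0 + 1
    # Interpolation weights from the fractional offsets.
    wx1, wy1 = x - x0, y - y0
    wx0, wy0 = 1.0 - wx1, 1.0 - wy1
    return (feature_map[y0, x0] * wy0 * wx0 +
            feature_map[y0, x1] * wy0 * wx1 +
            feature_map[y1, x0] * wy1 * wx0 +
            feature_map[y1, x1] * wy1 * wx1)

fmap = np.arange(16, dtype=float).reshape(4, 4)
print(bilinear_sample(fmap, 1.5, 1.5))  # average of cells 5, 6, 9, 10 -> 7.5
```

Because each output is a smooth function of the box coordinates, gradients can flow back into the localization layer's box predictions.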
Dense captioning is a useful task that gives detailed information about all the objects and helps to infer high level information about the image.
- A lot of facts about the image are generated, which can be used for a lot of other tasks.
There could be a way to go from dense captions back to images using a tree structure or a graphical model of some sort.
The region proposals for localization might be generated by scoring anchor boxes and ranking the resulting regions by confidence.
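My understanding is that, Faster R-CNN-style, a fixed set of anchor boxes of varying scale and aspect ratio is tiled over every feature-map cell, and the network then scores and regresses them. A sketch of the tiling step (the stride, scales, and ratios here are illustrative, not the paper's exact settings):

```python
import itertools

def make_anchors(feat_h, feat_w, stride=16,
                 scales=(64, 128), ratios=(0.5, 1.0, 2.0)):
    """Generate (cx, cy, w, h) anchor boxes centered at every
    feature-map cell, with one box per (scale, ratio) pair."""
    anchors = []
    for i, j in itertools.product(range(feat_h), range(feat_w)):
        # Map the cell back to image coordinates via the stride.
        cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
        for s, r in itertools.product(scales, ratios):
            w, h = s * r ** 0.5, s / r ** 0.5  # aspect ratio w/h = r
            anchors.append((cx, cy, w, h))
    return anchors

boxes = make_anchors(2, 3)
print(len(boxes))  # 6 cells x 2 scales x 3 ratios = 36
```

The confidence ranking then happens over the scored anchors, so the anchors themselves are fixed rather than generated.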
Is the training NOT end to end?
- It seems that there are different learning mechanisms in different parts of the network
- This is the first time I have seen such a thing; maybe it is the norm. I don't have a lot of knowledge about this, but most papers emphasize end-to-end training, and this paper addresses it differently.
What advantage is provided by the recognition network when it changes the bounding box?
- Is this the actual use?
- It could also be a feature transformation mechanism
- Transform the region tensor into compact features
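If the recognition network is indeed a feature transformation mechanism, the core operation would be flattening each cropped region tensor and projecting it to a compact code with fully connected layers. A toy sketch of that idea (the dimensions are shrunk for illustration; the real network uses something like 512 x 7 x 7 region tensors and 4096-d codes):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative shapes: B regions, each a C x X x Y feature tensor,
# projected to a D-dimensional compact code.
B, C, X, Y, D = 5, 8, 3, 3, 16
regions = rng.standard_normal((B, C, X, Y))
W = rng.standard_normal((C * X * Y, D)) * 0.01  # random FC weights
b = np.zeros(D)

flat = regions.reshape(B, -1)          # (B, C*X*Y): flatten each region
codes = np.maximum(flat @ W + b, 0.0)  # ReLU(FC): (B, D) compact features
print(codes.shape)
```

The compact codes would then feed both the box refinement and the captioning RNN, which would explain why the same component can also adjust bounding boxes.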
CNNs are often treated as scale invariant, and VGG works best with 224 × 224 inputs, yet this paper keeps the longer side of the image at 720 pixels.
- What is the intuition behind this?
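One possible intuition: VGG's convolutional layers accept any input size (only the fully connected layers fix 224 × 224), and with four 2 × 2 max-pools before the localization layer the effective stride is 16, so a larger image yields a finer feature grid for localizing small regions. A quick sketch of the arithmetic, under that stride-16 assumption:

```python
def vgg_feature_size(h, w, num_pools=4):
    """VGG-16's conv layers preserve spatial size; each 2x2 max-pool
    halves it, giving an effective stride of 2**num_pools. (Assumes
    sizes divisible by the stride; real code handles rounding.)"""
    for _ in range(num_pools):
        h, w = h // 2, w // 2
    return h, w

print(vgg_feature_size(720, 544))  # 720-px longer side -> a 45 x 34 grid
print(vgg_feature_size(224, 224))  # classification input -> only 14 x 14
```

A 45-cell grid gives many more anchor positions than the 14-cell grid a 224-pixel input would, which seems important when many captioned regions are small.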
There are ground-truth boxes and edge bounding boxes. The model is trained on the ground-truth boxes, and the edge bounding boxes appear to be proposal boxes ranked against the ground-truth boxes.
It should be possible to generate a full sentence from all the small dense captions. This could be a very informative caption and would include information such as what is interesting about the image.
There is an L2 constraint on the bounding box. What is the reasoning behind using this?
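If the constraint in question is an L2 regression penalty on the box parameters, it would look like the sketch below. This is my reading, not the paper's definition; many detectors instead use a smooth-L1 loss on normalized box offsets to reduce sensitivity to outliers.

```python
import numpy as np

def box_l2_loss(pred, gt):
    """Mean squared (L2) penalty between predicted and ground-truth
    box parameters (x, y, w, h)."""
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.mean((pred - gt) ** 2))

# Per-coordinate errors (2, 1, 2, 4) -> mean of (4, 1, 4, 16) = 6.25
print(box_l2_loss([10, 10, 50, 80], [12, 9, 48, 84]))
```

An L2 term pulls the predicted box smoothly toward the ground truth, but it penalizes large coordinate errors quadratically, which is presumably why the choice deserves justification.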
What is the penalty / loss for box positions and box captions, if any?