Generation and Comprehension of Unambiguous Object Descriptions
Spatial positions need the bounding box as an input feature.
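To make the bounding-box point concrete, here is a minimal sketch of one way to turn a box into a spatial feature: normalized corner coordinates plus relative area. As far as I recall the paper uses something close to this 5-dimensional vector, but treat that as an assumption. The function name `bbox_feature` and the pixel convention (top-left and bottom-right corners) are mine, for illustration only.

```python
def bbox_feature(x_tl, y_tl, x_br, y_br, img_w, img_h):
    """Return [x_tl/W, y_tl/H, x_br/W, y_br/H, box_area/image_area].

    Hypothetical helper: corners are in pixels, (x_tl, y_tl) is the
    top-left corner and (x_br, y_br) the bottom-right corner.
    """
    if img_w <= 0 or img_h <= 0:
        raise ValueError("image width and height must be positive")
    # Clamp to zero so a degenerate box never yields a negative area.
    box_area = max(0.0, x_br - x_tl) * max(0.0, y_br - y_tl)
    return [
        x_tl / img_w,
        y_tl / img_h,
        x_br / img_w,
        y_br / img_h,
        box_area / (img_w * img_h),
    ]
```

Normalizing by the image size keeps the feature comparable across images of different resolutions, which matters if words like "left" or "behind" are to be learned from position alone.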
Evaluating the mid-level vectors is worth considering, since they could act as an encoding: the model may communicate better with itself in its own encoding.
- Co-training vs. multi-view system
- Multiple models trained on different views, then improved iteratively
Training is semi-supervised in this paper. Is this similar to bootstrapping?
Examples show interesting visual features that can encode ‘behind’. The assumption is that the training dataset must contain such words.
Hard ground truth is the most confusing caption.
Mutual information is used in the paper.
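My reading of the mutual-information objective: instead of only maximizing the likelihood of the description given the true region, the model is also penalized when the same description is likely for other regions in the image. A minimal sketch, assuming we already have the sentence log-likelihood under the true region and under each distractor region. The function name and its inputs are hypothetical and not the paper's code.

```python
import math

def mmi_softmax_loss(log_p_true, log_p_others):
    """Negative log of p(S|R_true) / sum over all regions R' of p(S|R').

    log_p_true   : log p(S | true region)
    log_p_others : iterable of log p(S | distractor region)
    The sum in the denominator includes the true region itself.
    """
    scores = [log_p_true] + list(log_p_others)
    # Log-sum-exp with the max subtracted for numerical stability.
    m = max(scores)
    log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
    return log_norm - log_p_true
```

With no distractors the loss is zero. It grows as the description becomes equally plausible for other regions, which is what pushes the generator towards unambiguous descriptions.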
N: hyperparameter taken from the UNC dataset
- Communication with automated systems
Guide attention models
- Generate soft and hard attention
- Harms captioning and VQA
Big LSTM should be able to understand different regions
- Residual networks
- Force the network to remember the image
- Image vector is fed again and again
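The "fed again and again" idea can be sketched as building the LSTM input for each time step from the current word embedding plus the same image vector. Whether the image vector is concatenated, added, or merged in some other way is an assumption here; I use concatenation purely for illustration, and `build_step_inputs` is a hypothetical helper.

```python
def build_step_inputs(word_embeddings, image_vector):
    """Append the same image vector to the word embedding at every time step.

    word_embeddings : list of per-word vectors (lists of floats)
    image_vector    : single vector describing the image or region
    Returns one combined input vector per time step.
    """
    return [list(w) + list(image_vector) for w in word_embeddings]
```

Because the image vector is present in every step's input, the recurrent state does not have to carry the image information across the whole sentence on its own.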
Are they training word embeddings on their own? Their dimensions are different from the standard word embeddings.
- Like an n gram
- Next word given sentence so far
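The "next word given sentence so far" view means the sentence likelihood factorizes into one conditional per word. Unlike a fixed-order n-gram, the condition here is the whole prefix rather than a fixed window. A minimal sketch, where `next_word_probs` is a hypothetical callable standing in for the trained model:

```python
import math

def sentence_log_prob(words, next_word_probs):
    """Sum of log p(w_t | w_1 .. w_{t-1}) over the sentence.

    next_word_probs(prefix) is a hypothetical callable that takes the
    tuple of words seen so far and returns a dict {word: probability}.
    """
    total = 0.0
    for t, word in enumerate(words):
        probs = next_word_probs(tuple(words[:t]))
        total += math.log(probs[word])
    return total
```

For example, with a toy model that always returns a uniform distribution over four words, a two-word sentence gets a log-probability of `2 * math.log(0.25)`.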
Fine-tuning on MS COCO should give a last layer with 80 dimensions, but the paper says it has 1000 dimensions. Why?
- Is the 1000-dimensional layer kept just so that fine-tuning makes sense?
- End-to-end fine-tuning
It is interesting that the weak labels did not corrupt the actual result labels.
One possible idea: can unambiguous object descriptions be used for an image retrieval task?
- What is the one thing about this image that defines it uniquely?
- Something that is different from all other images?
- However, what is unique within an image may not be unique over the entire dataset.
- This might be a problem when approaching the image retrieval task.