Dynamic Memory Networks for Visual and Textual Question Answering
This paper can be thought of as repeatedly rephrasing the question to move toward the answer, so the memory always represents the question in some way. This idea came up during a paper discussion, and I think it's a different way to think about QA tasks.
I am not sure how the priors are used or how they are derived.
How many passes are used for training?
The system can also use sentences that it had previously thrown away. What is the intuition behind this?
- The system allows sentences to interact with other sentences without changing gradient, but maybe a better representation could have been reached by actually changing the gradients. Or maybe this is just another way of gathering all the facts and having the previous ranking of sentences. This maintains the importance of sentences at all times.
A trick sometimes used is to feed the question into the network at every pass. This helps the question remain in memory, so the answer will always be related to the question in some way. This model will not backprop like a residual network.
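To make the "feed the question at every pass" trick concrete, here is a minimal sketch of a memory update of the form m_t = ReLU(W [m_{t-1}; c_t; q] + b), which is how I understand the paper's update. The toy weights, the 2-dimensional vectors, and the `contexts` values are made up for illustration, and starting the memory from the question is my assumption about the initialization.

```python
def relu(v):
    return [max(0.0, x) for x in v]

def matvec(W, v):
    # Plain matrix-vector product on nested lists.
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def update_memory(W, b, prev_memory, context, question):
    """One pass: m_t = ReLU(W [m_{t-1}; c_t; q] + b)."""
    concat = prev_memory + context + question  # the question is re-fed on every pass
    pre = [p + bi for p, bi in zip(matvec(W, concat), b)]
    return relu(pre)

# Toy sizes: hidden dimension 2, so W is 2 x 6 (hypothetical values).
W = [[1.0, 0.0, 0.0, 0.0, 1.0, 0.0],
     [0.0, 1.0, 0.0, -1.0, 0.0, 1.0]]
b = [0.0, 0.0]
question = [0.5, 0.25]
memory = list(question)              # assumption: memory starts as the question
contexts = [[1.0, 2.0], [0.0, 0.5]]  # hypothetical attention outputs, one per pass

for c in contexts:
    memory = update_memory(W, b, memory, c, question)
```

Because `question` is part of the input on every pass, the memory cannot drift away from it entirely, which matches the intuition in the note above.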
Preference of L1 over L2 norm: does L1 make the attention sparser? Is the element-wise comparison a subtraction or a cosine?
Are the multiple attention passes conditioned on history? If so, how many passes are there?
Another trick is to backprop so that the model remembers all the facts it needs to answer the question. So, if you backprop more, you get better results.
One difference in this paper's memory network is that it uses the same memory throughout, whereas other papers use different memory components.
There is a bidirectional GRU in DMN2, so it can find the answer even if the facts are swapped.
- The GRU uses sentences from the future as well as the past.
- Gives a few percent higher accuracy.
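A rough sketch of the bidirectional idea, assuming the forward and backward states are simply summed per sentence. This is a toy GRU in plain Python with random weights and no biases, not the paper's implementation; `make_gru`, `run`, and `fuse` are names I made up.

```python
import math
import random

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def make_gru(dim, rng):
    """Random weights for a toy GRU with equal input and hidden size (biases omitted)."""
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(dim)]
    return {name: mat() for name in ("Wz", "Uz", "Wr", "Ur", "Wh", "Uh")}

def gru_step(p, x, h):
    z = [sigmoid(v) for v in add(matvec(p["Wz"], x), matvec(p["Uz"], h))]  # update gate
    r = [sigmoid(v) for v in add(matvec(p["Wr"], x), matvec(p["Ur"], h))]  # reset gate
    rh = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(v) for v in add(matvec(p["Wh"], x), matvec(p["Uh"], rh))]
    return [zi * ci + (1.0 - zi) * hi for zi, ci, hi in zip(z, cand, h)]

def run(p, xs):
    """Run the GRU over a sequence and return the hidden state at every step."""
    h = [0.0] * len(xs[0])
    out = []
    for x in xs:
        h = gru_step(p, x, h)
        out.append(h)
    return out

def fuse(fwd, bwd, sentences):
    """Sum forward and backward states so each fact sees past and future sentences."""
    forward = run(fwd, sentences)
    backward = run(bwd, sentences[::-1])[::-1]
    return [add(f, b) for f, b in zip(forward, backward)]

rng = random.Random(0)
fwd, bwd = make_gru(2, rng), make_gru(2, rng)
sents = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # hypothetical sentence encodings
facts = fuse(fwd, bwd, sents)
```

With only the forward GRU, the first fact never sees later sentences. With the backward pass added, changing the last sentence also changes the first fact.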
Equation 8 has a minus, and I am not sure why. Is it calculating a difference or a similarity?
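If I read Equation 8 correctly, it builds an interaction vector between each fact f, the question q, and the previous memory m by concatenating element-wise products with element-wise absolute differences. Under that reading, the products act like a similarity signal and the terms with the minus act like a distance. A small sketch with made-up 2-dimensional vectors:

```python
def interaction_features(fact, question, memory):
    """Build z = [f*q; f*m; |f-q|; |f-m|] for one fact vector (element-wise)."""
    prod_q = [f * q for f, q in zip(fact, question)]        # similarity-like terms
    prod_m = [f * m for f, m in zip(fact, memory)]
    diff_q = [abs(f - q) for f, q in zip(fact, question)]   # distance-like terms (the minus)
    diff_m = [abs(f - m) for f, m in zip(fact, memory)]
    return prod_q + prod_m + diff_q + diff_m

# Hypothetical toy vectors.
fact = [1.0, -2.0]
question = [1.0, 0.5]
memory = [0.0, -2.0]
z = interaction_features(fact, question, memory)
# z == [1.0, -1.0, 0.0, 4.0, 0.0, 2.5, 1.0, 0.0]
```

The vector z is four times the size of a fact vector. As I understand it, it then goes through a small feed-forward layer and a softmax to produce the attention gate for that fact.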
The whole system is differentiable end to end.
Ordering of facts is not lost in the complete system.
The system interacts in its own language, so it assumes a common embedding space rather than separate visual and textual feature spaces.
Is there a chain structure?
Focus on words that are shared with the question.
What if there is a series of images followed by a question to answer?
What are the tied and untied models, and what is the improvement?
- Is the difference just a bigger versus a smaller model?
- Smaller is easier to learn
- Fewer convolutions to learn and see
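My reading of tied versus untied: a tied model reuses one set of memory-update weights on every pass, while an untied model learns a separate set per pass. Assuming an update of the form m_t = ReLU(W_t [m; c; q] + b_t), the size difference can be counted directly. The hidden size and number of passes below are hypothetical.

```python
def memory_update_params(hidden, passes, tied):
    """Count weights for m_t = ReLU(W_t [m; c; q] + b_t) across all passes."""
    per_pass = hidden * (3 * hidden) + hidden  # W is hidden x 3*hidden, plus a bias
    return per_pass if tied else per_pass * passes

# Hypothetical sizes, chosen only for illustration.
tied = memory_update_params(hidden=80, passes=3, tied=True)
untied = memory_update_params(hidden=80, passes=3, tied=False)
# tied == 19280, untied == 57840
```

So in this sketch the untied variant is larger by a factor equal to the number of passes, which is one way to frame the bigger versus smaller model question.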