Memory Networks

Being able to learn more information for longer duration of time is possible by having a large enough memory which is compartmentalized suitably to accurately remember facts from the past. RNN’s alone have difficulty in performing memorization due to exploding gradients and vanishing gradients. The memory network has memory indexed as array indexes with components including the input feature map (I), generalization (G), output feature map (O) and response (R). On being given an input, it is converted to a feature map which is updated in the memory by the generalization rule. The output features are computed and decoded to give the final response. At test time, the memory is updated (as is done at training time) but the parameters are not updated.

H, the slot choosing function is developed according to the size of the memory. EG: if a large amount of memory is needed then it is organized differently, usually by topics or entities. When the memory becomes full, the concept of forgetting is also implemented in H through overwriting. The RNN, R must be conditioned on the output of O, because without conditioning, RNN’s will perform poorly.

The MemNN take input text, which is stored in the memory slots, such that each new incoming word in the sentence is stored in the next empty slot without overwriting any previous slot. Conversely, if the input is at the word rather than at the sentence level, then the words are not segmented as statements. So, a segmentation function has to be learnt which remembers the last sequence of words and looks for breakpoints. Hashing techniques are used when stored memory is very large. Hashing can be done in either of the two ways, hashing words or clustering the word embeddings. However, there are some perceivable disadvantages of the two, namely, with hashing words, the memory can be considered if there is at least one shared word with the input, then for the next slot, at least one shared with the first slot and so on. The last slot should have at least one shared word with the answer. Using clustering of word embeddings, this can be solved by clustering into an appropriate number of buckets. Memory hashing is a fast technique.

To encode temporal abilities when a knowledge base is not present, a function on triples is learnt so that the argument which makes this triple maximum is kept before the time features are removed. To model new words, given the neighboring words, the RNN should predict what the word should be and assume that the unknown word is similar to that. So, there is a left and right context that are stored for a BOW model. These add two more features. However, the exact spellings of words may not be differentiated using these models. This is because to store a stem, like ‘fall’ for falling, fell, fallen, takes less dimensions, rather than to store each of the words, falling, fell, fallen.

Simulated question answering is able to differentiate between natural and artificial environments as the artificial ones correspond to the simulation of characters, objects and rooms. These simulated environments help to create stories which can be understood through vision. The end result is to make inferences like the location of an object within a frame, and the relationships between all the parts of the image, through multiple statements. For these statements, there is a limit on the number of time steps, as a RNN has limited memory. The memory is represented through multiple hops.

blog comments powered by Disqus


23 January 2016