Until now, all the work I have done in deep learning and machine learning has taken only one kind of input: text, images, audio, or video. In real life, however, humans take in several of these at once and combine them as they think. With that as my dream and long-term goal, I have started working on multimodal deep learning, which combines what is learned from several different kinds of input. This is all in service of a greater dream of achieving an Amazing AI. There are several phases in the development of an Amazing AI system; I have identified a few and have started working on them.
First is training the system to learn the art of conversation. This entails a dialog system that uses context to understand what the user is talking about. The system is self-sufficient in the sense that it can start a conversation with a user about an object, or simply ask questions about the weather. This also includes audio: recognizing the words in the audio and how they relate to the person and the surroundings.
Second is visual understanding of the scene that the system sees. This covers object detection and understanding the spatial and temporal relationships between objects and how they interact with each other. The task includes several toy subtasks, such as generating a description of the scene.
With further development, this system could aid people with disabilities; an example is SeeingAI, which Microsoft exhibited at its Build 2016 conference. A system similar to SeeingAI would be only one part of the Amazing AI dream.