Human cognition is complex and is the key to how we interact with the world. Linguistic mechanisms play a crucial role in these cognitive processes, an aspect that AI systems such as IBM Watson already exploit successfully. However, our environment is not made up of language alone. The MIT-IBM Watson AI Lab is therefore teaching its system how to see the world.
It is a unique success story: IBM's artificial intelligence, named after one of the company's first presidents, has become an established and integral part of current digitization projects. From interactive customer service to its role as a literal Dr. Watson in health management, the AI is now in use wherever intelligent learning systems are needed to process large volumes of data in dialogue with human users, in as user-friendly a way as possible. Years ago, the program succeeded in beating human opponents on the TV quiz show Jeopardy!. Until now, however, Watson has mainly learned through a text-based approach: the system operates on natural language, and that is how it explores the world in which the respective customer is involved. That makes it very easy to interact with Watson-based chatbots, because they speak our language. Yet this interface with the digital conversation partner has one major shortcoming: the computer program is largely blind to dynamic visual events.
Seeing and understanding
That is a fundamental difference between Watson and its human users. “When we grow up, we look around us and see people and objects moving around, and we hear the sounds these people and objects make. An AI system should learn in the same way. To do that, it has to be fed videos and dynamic information,” explains Aude Oliva in an interview with Software Development Times. Oliva is one of the co-founders of a new MIT research project that focuses on exactly this aspect of enhanced learning for artificial intelligence. Watson already recognizes images, but new algorithms are needed to capture events in motion and then draw conclusions from those observations, i.e. to learn. The MIT-IBM Watson AI Lab is now delivering the basis for that with its “Moments in Time Dataset,” a structured collection of three-second videos showing people, animals and objects in action. “This dataset makes it possible to develop new AI models that come close, in terms of complexity and abstract considerations, to those performed by humans every day,” explains Oliva.
The set that forms the cornerstone for an illustrative dynamic depiction of the world currently comprises 1 million of these short videos and is one of the biggest collections ever created for such a purpose. The selection process presented the researchers with a whole host of challenges. In addition to establishing distinctive action categories, they also had to find suitable sources and configure the collected films in such a way that an AI program could learn from them as impartially as possible. The present results are impressive, and yet this is only the first step toward an artificial intelligence that can learn visually. Building on this success, the next challenge will be to develop algorithms that create analogies, anticipate unforeseen actions, and interpret specific situations, i.e. that make it possible to identify and utilize the dynamics in a very broad range of videos.
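The impartiality concern mentioned above — configuring the collection so a model does not simply learn the most frequent categories — can be sketched as per-category balancing of the training set. A minimal illustration in Python follows; the clip file names and action labels are hypothetical examples, not drawn from the actual Moments in Time metadata:

```python
import random
from collections import defaultdict

def balanced_sample(clips, seed=0):
    """Return a subset with the same number of clips per action category.

    `clips` is a list of (clip_id, action_label) pairs; categories with
    more clips than the rarest category are randomly down-sampled.
    """
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for clip_id, label in clips:
        by_label[label].append(clip_id)
    # Size of the rarest category sets the per-category quota.
    quota = min(len(ids) for ids in by_label.values())
    sample = []
    for label, ids in by_label.items():
        for clip_id in rng.sample(ids, quota):
            sample.append((clip_id, label))
    return sample

# Hypothetical toy metadata: three action categories of uneven size.
clips = (
    [(f"run_{i}.mp4", "running") for i in range(5)]
    + [(f"jump_{i}.mp4", "jumping") for i in range(3)]
    + [(f"open_{i}.mp4", "opening") for i in range(8)]
)

subset = balanced_sample(clips)
print(len(subset))  # 3 categories x 3 clips each = 9
```

In a real pipeline this kind of balancing would be one step among many (deduplication, source diversity, label verification), but it captures the basic idea of letting a model see each action category equally often.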
This is a doubly visionary approach that could also revolutionize the day-to-day use of Watson and similar systems in customer dialogue. For as intuitively as Watson responds to natural language, the AI is still limited when it comes to factoring in all aspects of human interaction. With these new algorithms, however, it could soon be possible to build an AI system that interacts with what it sees in front of it. Alongside the channels commonly used at present, Skype cams could then be used for exchanges with customers: in addition to the entered information, the AI would be able to capture and interpret gestures, body posture and other visual cues in real time. And then we really would be seeing eye to eye with AI.