November 19, 2014 -
Two separate research teams, working independently of one another, have developed artificial intelligence software that can not only accurately recognize the content of photographs and video but also describe entire scenes, according to a report in the National Post.
Researchers at Google and at Stanford University have developed self-learning software that can identify entire scenes and then write a caption accurately describing the picture.
These capabilities allow the software to effectively catalog and sort through billions of images and hours of video online, which typically carry inaccurate or incomplete descriptions.
Search engines such as Google depend mostly on the written language attached to an image or video to determine what it contains.
Stanford Artificial Intelligence Laboratory director Fei-Fei Li led the research with graduate student Andrej Karpathy, eventually publishing their findings as a Stanford University technical report.
Meanwhile, the Google research group published its own research paper on Cornell's open-access site arXiv.org.
The new research could open opportunities for surveillance, enabling authorities to accurately identify humans and even predict certain types of behavior.
Currently, Google’s image-recognition software can train itself to single out cats among 10 million images taken at random from YouTube, while artificial intelligence programs in new cars can pinpoint pedestrians and bicyclists through windshield-mounted cameras and automatically stop the car to prevent a collision.
However, both of these programs focus only on the objects themselves and lack any understanding of what is actually happening in an image.
Both the Google and Stanford research teams hope to resolve this issue by refining neural networks — software programs that can train themselves to identify similarities and patterns in data.
Both research teams combined two different types of neural networks: one designed solely to recognize images, and another that models human language.
In both cases, the groups trained the software on small sets of digital images accompanied by captions.
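The two-network design described above, a vision network that turns an image into a feature vector, feeding a language network that emits a caption word by word, can be sketched in miniature. The toy code below is purely illustrative: the vocabulary, layer sizes, random "image features," and greedy decoder are all assumptions invented for this sketch, not details from either paper, and the networks are untrained.

```python
import numpy as np

# Illustrative sketch only: a fake "encoder" feature vector conditions a
# tiny recurrent "decoder" that emits caption words. All names and sizes
# are invented; the real systems use trained convolutional and recurrent
# networks learned from captioned images.
rng = np.random.default_rng(0)

VOCAB = ["<start>", "<end>", "a", "dog", "runs"]
FEAT, HID = 8, 6  # image-feature and hidden-state sizes (arbitrary)

def encode_image():
    # Stand-in for a convolutional network: in the real systems this
    # vector is computed from the image's pixels.
    return rng.standard_normal(FEAT)

# Decoder parameters: a minimal recurrent language model whose initial
# hidden state is derived from the image features.
W_init = rng.standard_normal((HID, FEAT)) * 0.1        # features -> h0
W_emb = rng.standard_normal((HID, len(VOCAB))) * 0.1   # word embeddings
W_hh = rng.standard_normal((HID, HID)) * 0.1           # recurrence
W_out = rng.standard_normal((len(VOCAB), HID)) * 0.1   # hidden -> logits

def caption(features, max_len=5):
    h = np.tanh(W_init @ features)      # image conditions the decoder
    word = VOCAB.index("<start>")
    out = []
    for _ in range(max_len):
        h = np.tanh(W_hh @ h + W_emb[:, word])
        word = int(np.argmax(W_out @ h))  # greedy decoding: pick best word
        if VOCAB[word] == "<end>":
            break
        out.append(VOCAB[word])
    return out

print(caption(encode_image()))
```

Training, which the sketch omits, would adjust the decoder's weight matrices (and the encoder) so that captions generated for the training images match their human-written captions.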
Once the software had deciphered the patterns in that set of images and captions, the researchers introduced new images, on which it was able to double the accuracy of its previous efforts.
Both the Google and Stanford research groups believe they can significantly increase accuracy as they refine their software and train it on larger sets of captioned images.