Mind's Eye: Visual Intelligence Takes to the Battlefield
Applying visual intelligence to video could reduce the demand on analysts to detect and report on suspicious activity. Berenice Baker reports on a new DARPA project to develop transformative technology that could enable systems to recognise and react intelligently to events captured by UAVs and static platforms.
The use of surveillance drones over areas of conflict has resulted in millions of hours of video footage that has to be studied by human analysts to work out what has been captured and to decide on appropriate action. Man-carried drones, ground systems and static platforms need to be launched or set up by trained military personnel, who are often deployed away from the main body of troops, exposing them to risk.
Using 'persistent stare', in which unmanned aerial vehicles (UAVs) or unmanned ground vehicles (UGVs) maintain visual contact with a target area for upwards of 24 hours, could help take scouts out of harm's way. However, analysts must still scan the streaming video feed to detect operationally significant activities, increasing the burden on command and control resources.
A research programme by the US Defense Advanced Research Projects Agency (DARPA) aims to develop a new kind of artificial intelligence (AI) known as 'visual intelligence' (VI). 'Mind's Eye' aims to enable machines to robustly recognise and reason about activity in full-motion video footage.
James Donlon, programme manager for Mind's Eye, is a retired US Army officer and former infantryman now working in DARPA's computer science study group, software producibility and application communities.
"Until now, progress in machine vision has been mostly limited to identifying and tracking the objects in imagery and video," says Donlon. "These are the 'nouns' of the scene or narrative. This programme focuses on the 'verbs' in video. This involves far more than just recognising them."
Donlon has selected 48 verbs as the initial scope of the programme, inspired by a study of the most commonly used verbs in the English language related to action in the physical world.
He explains, "Just as you and I do, these machines must be able to visualise and manipulate scenes using these concepts in an imaginative process — the 'mind's eye'. If a machine can do that, then just like us, it will be able to do much more than sound an alarm when it sees a familiar pattern. It will be able to anticipate what might happen next, imagine alternative futures, fill in gaps in its knowledge or perceptual experience, and notice when something is out of place."
With such a complex remit for the programme, Donlon has identified four performance tasks to encourage development of robust VI systems:
- recognition: VI systems will be expected to judge whether one or more verbs is present or absent in a given video
- description: VI systems will be expected to produce one or more sentences, suitable for human-machine communication, that describe a short video
- gap-filling: VI systems will be expected to resolve spatiotemporal gaps in video, predict what will happen next, or suggest what might have come before
- anomaly detection: VI systems will be expected to learn what is normal in longer-duration video, and detect anomalous events.
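As a rough illustration only, two of these tasks (recognition and anomaly detection) can be sketched in a few lines of Python. Nothing here reflects the methods the Mind's Eye teams will actually use: the per-verb scores, the 0.5 threshold, the function names, the example verbs and the frequency-based notion of 'normal' are all assumptions made for the example.

```python
from collections import Counter


def recognise(verb_scores, threshold=0.5):
    """Recognition: judge each verb present or absent in a video.

    verb_scores maps a verb to a hypothetical detector score in [0, 1].
    """
    return {verb: score >= threshold for verb, score in verb_scores.items()}


def learn_normal(observed_verbs):
    """Learn what is 'normal' as the relative frequency of each observed verb."""
    counts = Counter(observed_verbs)
    total = sum(counts.values())
    return {verb: n / total for verb, n in counts.items()}


def is_anomalous(verb, normal_freq, min_freq=0.05):
    """Flag a verb that was rare or never seen in the footage used for learning."""
    return normal_freq.get(verb, 0.0) < min_freq


# Hypothetical scores for one short clip.
scores = {"walk": 0.91, "carry": 0.62, "dig": 0.08}
present = recognise(scores)

# Hypothetical verbs observed over longer-duration footage of one location.
history = ["walk"] * 60 + ["carry"] * 38 + ["run"] * 2
normal = learn_normal(history)

print(present)
print(is_anomalous("dig", normal), is_anomalous("walk", normal))
```

In this toy setting 'dig' is flagged because it never appeared in the learning footage, while 'walk' is not. A real system would have to produce the scores from video in the first place, which is the hard part the programme addresses.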
DARPA has assembled teams from 12 academic and research institutions that will develop a software subsystem suitable for use on a camera, integrating existing state-of-the-art computer vision and AI. They will build on this to make novel contributions in visual event learning, new spatiotemporal representations, machine-generated envisionment, visual inspection and grounding of visual concepts.
They will be joined by teams from General Dynamics Robotic Systems, iRobot and Toyon Research Corporation, which are taking a collaborative approach to developing architectures incorporating newly developed visual intelligence software onto a camera suitable as a payload on a man-portable UGV.
The US Army Research Laboratory (ARL) will act as an evaluation and transition partner, providing all developer teams with purpose-built payload boxes and interface control documents to improve the potential for technology transition.
Emulating human analysis
Emulating the kind of spatial and object analysis that humans do automatically will be a particular challenge for the teams, requiring a multidisciplinary approach.
"Some teams will construct 3D models or simulations from the video input, and use that as a basis to work on," says Donlon. "Other teams are taking inspiration from the mechanisms in human vision, implementing machines that pay attention to only the most salient aspects of the scene and construct activity models that reflect the minimal features needed to distinguish among the activities of interest.
"Still others build on more classic ideas of machine vision, but seek to engage higher-level cognitive processes to compose higher-level concepts from constituent parts. Key to the success of this programme will be the teams' willingness to take inspiration from one another to build upon such novel technical ideas."
The research will be informed by previous work in the civilian surveillance world and academia. This will include machine vision (MV), which is traditionally good at detecting objects and their attributes, but not activities. Performance Evaluation of Tracking and Surveillance (PETS), a UK-based programme, has had success in identifying events like gatherings, thefts, or bag abandonment in surveillance video.
Industry and the military, including some Mind's Eye contributors, have worked on 'smart cameras', which use visual processing to extract application-specific information from the images and make decisions based on them.
Huge quantity of data
Visual images represent a huge quantity of data, which poses problems of its own.
"Human visual systems have well-adapted attentional mechanisms that protect our faculties from this data overload," says Donlon. "Artificial systems will need to either incorporate similar strategies to make economical use of the information in this data stream, or get incredibly fast at dealing with this data."
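One very crude way to make 'economical use' of a video stream is to skip frames in which little has changed. The Python sketch below is an illustrative assumption, not a description of human attention or of any Mind's Eye team's approach: the frame representation (nested lists of greyscale values), the function names and the threshold of 10 are all hypothetical.

```python
def frame_change(prev, curr):
    """Mean absolute pixel difference between two equally sized greyscale frames."""
    diffs = [
        abs(a - b)
        for row_prev, row_curr in zip(prev, curr)
        for a, b in zip(row_prev, row_curr)
    ]
    return sum(diffs) / len(diffs)


def select_salient_frames(frames, threshold=10.0):
    """Return indices of frames that differ enough from the last frame kept."""
    if not frames:
        return []
    kept = [0]  # always keep the first frame as a reference
    for i in range(1, len(frames)):
        if frame_change(frames[kept[-1]], frames[i]) >= threshold:
            kept.append(i)
    return kept


# Tiny 2x2 'frames': nothing happens, then one pixel changes and stays changed.
static = [[0, 0], [0, 0]]
moved = [[0, 0], [0, 200]]
frames = [static, static, static, moved, moved]

kept = select_salient_frames(frames)
print(kept)
```

Here only the first frame and the frame where the change occurs are passed on for further analysis, so three of the five frames are never processed. Comparing against the last kept frame, rather than the immediately preceding one, means slow drift still accumulates until it crosses the threshold.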
The nature of the programme means there are few limits on the data, and Donlon has specified that Mind's Eye must provide broad performance across the full range of VI tasks, even if that means slow performance at first. With this in mind, he has captured over 8,000 videos in different settings, with a total running time of 46 hours, which exemplify the range of actions the project requires machines to reason about.
"The techniques developed must, from the very beginning, adapt to surprising new videos – which use different settings, or even evidence novel variations of the actions to be recognised," says Donlon.
Journey to the battlefield
The five-year Mind's Eye programme was launched in September 2010 and was followed by a principal investigator meeting in January 2011.
Preliminary results demonstrate activity recognition across diverse video inputs and generation of human-understandable textual descriptions of activity in those scenes. The first round of evaluations will follow later in 2011, when the teams will evaluate performance on the full range of Mind's Eye tasks using previously unseen inputs.
New visual intelligence algorithms will be developed by late 2012, then incorporated into prototype systems over the course of three years. Specific test platforms have not yet been identified.
"Success in Mind's Eye will mean exceeding the current army requirement for systems that stream video to human eyeballs, and instead put intelligence on board the platform so that warfighters can receive alerts about activity of interest, and even task those platforms," says Donlon. "This could take many soldiers out of harm's way and greatly enhance our situational awareness in areas of operation."
Mind's Eye's activity recognition abilities could also be applied to the military's vast repositories of stored video for indexing and retrieval.
The researchers will watch for interim results indicating that the system can perform on even one of the set of 'verbs' well enough to provide critical cueing or filtering capabilities to overloaded analysts, and will alert the army to the possibilities.
It may be many years before mature visual intelligence is able to recognise and reason about activity captured in full-motion video footage in a military context. But when it can, it could transform the way video surveillance footage is used and acted upon, and could save the lives of soldiers in the process.