Opportunistic Real-Time Multimodal Sensor Content Exploitation
Navy SBIR 2014.2 - Topic N142-122
ONR - Ms. Lore-Anne Ponirakis - loreanne.ponirakis@navy.mil
Opens: May 23, 2014 - Closes: June 25, 2014

N142-122 TITLE: Opportunistic Real-Time Multimodal Sensor Content Exploitation

TECHNOLOGY AREAS: Information Systems, Sensors, Electronics

ACQUISITION PROGRAM: FNT-FY12-02 Autonomous Persistent Tactical Surveillance DCGS-N ACAT IAM

RESTRICTION ON PERFORMANCE BY FOREIGN NATIONALS: This topic is "ITAR Restricted". The information and materials provided pursuant to or resulting from this topic are restricted under the International Traffic in Arms Regulations (ITAR), 22 CFR Parts 120-130, which control the export of defense-related material and services, including the export of sensitive technical data. Foreign nationals may perform work under an award resulting from this topic only if they hold the "Permanent Resident Card", or are designated as "Protected Individuals" as defined by 8 U.S.C. 1324b(a)(3). If a proposal for this topic contains participation by a foreign national who is not in one of the above two categories, the proposal may be rejected.

OBJECTIVE: Develop tools for real-time frame-level video and voice (acoustics) content search, tagging, and tracking using common representation schemes, feature annotation, and autonomous biometric recognition methods to isolate entities of interest. Develop an automatic sequencing application to discover anomalous behaviors and infer intent in the context of the activity environment.

DESCRIPTION: The goal is to develop time-critical actionable intelligence and insight from an extensive array of multimedia data sources and organic and non-organic sensors. The problem is the lack of capability to perform frame-level contextual query and content tagging of files in multimedia sources, whether real-time sensor feeds or archived files. Specifically, we need the ability to annotate, tag, and fully search content in video and voice (acoustics) files at the frame level, matched to the desired attributes of entities of interest in the frame (e.g., biometric signatures, landmark settings, geolocations), so that objects of interest can be precisely defined, discovered, and tracked with respect to time and space, thus revealing their emerging behaviors, activity patterns, and intent. For example, an analyst enters the characteristics and attributes of an object of interest and events related to certain time windows (hours, days, etc.) and geographical regions (county, city, etc.), perhaps using descriptive metadata consisting of geographic, temporal, and other references overlaid onto the image in the software application; in turn, the application automatically discovers all instances of that object's detections over all space, time, and sensor modalities to support a rapidly evolving event.
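The frame-level tagging and spatio-temporal query capability described above can be illustrated with a minimal sketch: each tag binds a source frame to a timestamp, a geolocation, and a set of attribute key/value pairs, and a query filters on all three. All class, field, and function names below are illustrative assumptions, not requirements of the topic.

```python
from dataclasses import dataclass, field

@dataclass
class FrameTag:
    """Hypothetical frame-level content tag (names are illustrative)."""
    source: str            # sensor or file identifier
    frame_index: int       # frame number within the source
    timestamp: float       # seconds since epoch
    geo: tuple             # (latitude, longitude)
    attributes: dict = field(default_factory=dict)  # e.g. {"face_id": "E-17"}

def query_tags(tags, t_start, t_end, bbox, required_attrs):
    """Return tags inside a time window and a lat/lon bounding box
    whose attributes contain all required key/value pairs."""
    lat_min, lon_min, lat_max, lon_max = bbox
    hits = []
    for tag in tags:
        if not (t_start <= tag.timestamp <= t_end):
            continue
        lat, lon = tag.geo
        if not (lat_min <= lat <= lat_max and lon_min <= lon <= lon_max):
            continue
        if all(tag.attributes.get(k) == v for k, v in required_attrs.items()):
            hits.append(tag)
    return hits
```

A real system would replace the linear scan with spatial and temporal indexing, but the record structure conveys what "frame-level, matched to desired attributes" implies.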

We need a software application with an efficient multimedia (video, voice, acoustics) tagging scheme, content-based parsing and indexing, automatic tagged-content propagation, and query functionality that enables automatic extraction of relevant content of interest from all available multimedia files in a distributed and decentralized operational environment. The software application is expected to rapidly search all available multimedia sources (video, voice, and acoustics files) for a specific set of contents of interest and to share and store the results on the cloud. The tagging scheme needs to be contextually informative; for example, a tag could be text, a video frame, or a voice clip that captures relevant knowledge being queried on the entities of interest present in the multimedia files (archived and live feeds from opportunistic sensors). The application's knowledge manager enables the flexibility to organize and manage content; for example, the analyst will have control over the entire content of interest being searched (both temporally and spatially), the medium for storage, and with whom the content will be shared. Key features may include user-defined formats by category, sequence, link, and chronology to effectively capture evolving events of interest over time.
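One common scheme for the content-based indexing and query functionality described above is an inverted index that maps tag terms to the frame locations where they occur, so a multi-term query reduces to a set intersection. The sketch below is a hedged illustration of that idea; the class and method names are hypothetical and not drawn from the topic.

```python
from collections import defaultdict

class TagIndex:
    """Illustrative inverted index from tag terms to (source, frame) locations."""

    def __init__(self):
        self._index = defaultdict(set)

    def add(self, term, source, frame):
        # Record that `term` was observed in `source` at `frame`.
        self._index[term].add((source, frame))

    def search(self, terms):
        """Return the locations matching ALL query terms (set intersection)."""
        if not terms:
            return set()
        sets = [self._index.get(t, set()) for t in terms]
        return set.intersection(*sets)
```

Tagged-content propagation and distributed operation would layer on top of such an index (e.g., merging per-node indexes), but the core query path is this intersection of posting sets.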

End-user requirements include: application of the tools for joint problem solving; shared awareness in a multi-level collaborative environment (i.e., varying levels of security and access), as each participant may offer unique insight; and end-user connectivity through a variety of communication devices, both mobile and stationary, with varying processing capacity.

PHASE I: Determine technical feasibility by investigating, evaluating (through modeling and simulation), and identifying the most promising candidate approaches for autonomous real-time video and voice (acoustics) activity search, content tagging, sequencing, and discovery of information contained in distributed networked sensor data files at the frame level. Perform trade-off studies among those approaches. Propose candidate techniques and technology concepts for systematic search, discovery, and tagging of events of interest available from distributed sensors and data sources, by which the system tracks entities of interest, discovers behaviors, and infers intent. Recommend key technology design and development requirements and a plan for Phase II.

Deliverables: technology concepts and final report.

PHASE II: Develop a prototype software and hardware system incorporating the technology from Phase I. Develop a mission scenario that employs spatially distributed multimedia data sources, by which the system exploits relevant captured data such as geolocation, landmarks, background acoustics, and entities' facial and voice signatures in cluttered dynamic environments, for a basic proof-of-concept prototype evaluation. Verify and validate performance through implementation and demonstration of the basic proof-of-concept prototype. Develop schemes for retrieval and storage of the data on the cloud. Demonstrate the automatic selection and return of relevant video clips to a human analyst for further review, analysis, and annotation. Develop a detailed design and a plan for Phase III.

Deliverables: System architecture, system interface requirements for mobile and stationary platforms, detailed description of techniques, prototype software, technology demonstration, and final report and test results.

Note: Though Phase II work may become classified, the proposal for Phase II work will be UNCLASSIFIED. If the selected Phase II contractor does not have the required certification for classified work, ONR or the related DoN Program Office will work with the contractor to facilitate certification of the related personnel and facility.

PHASE III: Develop these capabilities to TRL 7 or 8 and integrate the new technology into the FNT-FY12-02 Autonomous Persistent Tactical Surveillance DCGS-N ACAT IAM. Once validated conceptually and technically, demonstrate dual use applications of this technology in civilian law enforcement, security services, and private security systems.

PRIVATE SECTOR COMMERCIAL POTENTIAL/DUAL-USE APPLICATIONS: This technology has broad applications for knowledge management, behavior modeling and inference, situational awareness, and security in both the government and private sectors. In essence, it enables rapid understanding of complex dynamic events and situations and facilitates quick response by connecting the dots in an environment that involves a high volume of multimodal multimedia data types.

In government, it has numerous applications across the military, intelligence communities, law enforcement, homeland security, and state and local governments for dealing with asymmetric threats, deploying first responders, crisis-management planning, and humanitarian aid response.

REFERENCES:
1. Siersdorfer S., San Pedro J., Sanderson M., "Automatic Video Tagging Using Content Redundancy," Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval, 2009, pp. 395-402.

2. Ulges A., Schulze C., Keysers D., Breuel T.M., "Content-Based Video Tagging for Online Video Portals."

3. Wang S., Liu Z., Zhu Y., He M., Chen X., Ji Q., "Implicit Video Emotion Tagging from Audiences' Facial Expression," Multimedia Tools and Applications, 2014.

4. Moxley E., Mei T., Manjunath B.S., "Video Annotation Through Search and Graph Reinforcement Mining," IEEE Transactions on Multimedia, Vol. 12, Issue 3, 2010.

KEYWORDS: Video; Voice; Content Tagging; Search; Frame-level; Sequencing

DoD Notice:  
Between April 23 and May 22, you may talk directly with the Topic Authors (TPOC) to ask technical questions about the topics. Their contact information is listed above. For reasons of competitive fairness, direct communication between proposers and topic authors is not allowed starting May 23, 2014, when DoD begins accepting proposals for this solicitation.
However, proposers may still submit written questions about solicitation topics through the DoD's SBIR/STTR Interactive Topic Information System (SITIS), in which the questioner and respondent remain anonymous and all questions and answers are posted electronically for general viewing until the solicitation closes. All proposers are advised to monitor SITIS (14.2 Q&A) during the solicitation period for questions and answers, and other significant information, relevant to the SBIR 14.2 topic under which they are proposing.

If you have general questions about the DoD SBIR program, please contact the DoD SBIR Help Desk at (866) 724-7457 or through the webmail link.