Results of Year 2
In the second year of the project, we made substantial progress. Most importantly, we finalized AXES-V1, the first complete integrated system, and built the AXES-PRO demo system for media professionals on top of it. We installed AXES-PRO at BBC and had it tested by professional image searchers from NISV. In parallel, we started working towards AXES-RESEARCH, where the focus shifts to linking and exploring the archives, and we made further advances in the underlying technology for audio-visual content analysis.
The AXES-PRO system
The first AXES demo system targets media professionals: users who search their archive for content on a daily basis. They know their archive well, and compare our system mostly against their current practice, a search system relying exclusively on metadata provided by archivists. Within AXES, media professionals are represented by NISV, as well as Deutsche Welle and BBC. The figure below shows a screenshot of the AXES-PRO system we developed for this user group:
Apart from standard functionality such as searching for text in metadata and adding filters based on specific metadata fields, the AXES-PRO system also allows users to search the video content directly. Most prominently, it offers various ‘on-the-fly’ search components, as well as novel query-by-example tools and tools for searching spoken text. All of this has been integrated in a slick user interface that runs smoothly on several hundred hours of video.
The visual search is not restricted to a set of predefined concepts (e.g. faces, or indoor vs. outdoor). Instead, the user can specify a query using free text. A new classifier for the query is then learnt on-the-fly in a matter of seconds, starting from a set of images retrieved from a web image search engine such as Google Images or Flickr.
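The on-the-fly idea can be sketched as follows. This is a minimal illustration only: random vectors stand in for real visual descriptors of downloaded web images and archive keyframes, and the mean-difference classifier is just one fast linear model that trains in milliseconds, which is the key requirement for interactive use. The actual AXES components use their own features and learners.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: in a real system these would be visual
# descriptors extracted from web-image search results (positives)
# and from a fixed pool of generic negative images.
D = 128                                   # descriptor dimensionality
pos = rng.normal(loc=1.0, size=(50, D))   # features of downloaded web images
neg = rng.normal(loc=0.0, size=(500, D))  # fixed negative pool

# A fast linear model: the difference of class means, normalised to
# unit length. Training is a single pass over the features, so a new
# query classifier is ready in well under a second.
w = pos.mean(axis=0) - neg.mean(axis=0)
w /= np.linalg.norm(w)

# Score pre-computed archive keyframe descriptors and rank them.
archive = rng.normal(size=(1000, D))
archive[:10] += 1.0                       # pretend 10 keyframes match the query
scores = archive @ w
ranking = np.argsort(-scores)
top10 = set(ranking[:10].tolist())
print(sorted(top10))
```

Because the archive descriptors are computed offline, only the cheap dot products happen at query time, which is what makes free-text queries over hundreds of hours of video feasible.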
Different technologies have been developed to achieve this for different types of queries. In particular, the AXES-PRO system is capable of learning on-the-fly models for categories (objects or scenes, e.g. “motorbike”, “soccer”, or “forest”), for specific places or logos (e.g. “Amsterdam central station” or “Shell”), and for faces (both specific persons, e.g. “George Bush”, and facial attributes, e.g. “moustache”). Likewise, we can search the archive for other videos similar to a given keyframe, where the similarity applies either to the image as a whole or to the person shown in the keyframe. Finally, there is the option to search spoken text, based on automatic speech recognition.
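The query-by-example search can be sketched in the same spirit. Assuming each archive keyframe has a precomputed, L2-normalised descriptor (the descriptors below are random placeholders), finding similar videos reduces to a ranked nearest-neighbour lookup under cosine similarity:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical descriptors, one per archive keyframe, L2-normalised
# so that a dot product equals cosine similarity.
feats = rng.normal(size=(200, 64))
feats /= np.linalg.norm(feats, axis=1, keepdims=True)

def similar_keyframes(query_idx, k=5):
    """Return the k archive keyframes most similar to a query keyframe."""
    sims = feats @ feats[query_idx]
    order = np.argsort(-sims)
    return [i for i in order if i != query_idx][:k]

# Plant a near-duplicate of keyframe 0 at index 42 to illustrate retrieval.
feats[42] = feats[0] + 0.05 * rng.normal(size=64)
feats[42] /= np.linalg.norm(feats[42])

print(similar_keyframes(0))
```

Whole-image similarity and person similarity then differ only in which descriptor is compared: a global image representation in the first case, a face-region representation in the second.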
The offline indexing pipeline used to process an archive has been integrated into the open-source Weblab architecture for smooth operation. For the online part, a lightweight scheme to interface with the database has been developed; this online component has also been installed at BBC.
In line with our user-centric approach, we asked real-world users to evaluate our system. Six image searchers tested AXES-PRO on a repository of 500 hours of NISV content. Based on their feedback, we conclude that professional users are highly interested in the type of audio-visual search that AXES advocates. They were impressed by the professional look and smooth operation of the system, although the lower accuracy of the ‘noisy’ content-based automatic tools, especially compared to the very accurate metadata provided by archivists that they are used to working with, remains a point of attention.
In parallel with the development of the AXES-PRO system, we started preparing the AXES-RESEARCH system, which targets journalists as well as academic researchers. Key aspects in which these users differ from media professionals, and which will be addressed in the AXES-RESEARCH system, include:
- search for longer segments
- longer search trails instead of simple queries
- following links to related content to explore the archive
- recall matters, not just precision
Based on feedback from user surveys, a first mockup of the AXES-RESEARCH system has been developed, and research on the required functionality is ongoing.
Audio-visual content analysis
Finally, a lot of research went into advancing the state of the art in audio-visual content analysis. We mention only a few highlights here:
Multimedia event detection: One type of content that is not yet searchable in AXES-PRO is ‘events’. Examples of events (or activities) are ‘making a sandwich’, ‘a birthday party’, or ‘a parade’. These are typically high-level concepts that are difficult to model. AXES participated in the TRECVID MED task, which focuses on this difficult problem, and ranked first in the ad-hoc event task and second in the known-event task.
Multimodal person clustering: We also worked towards improved person recognition, integrating speaker recognition and face recognition, where each modality helps to improve the other.
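One simple way such cross-modal help can work is late fusion: averaging face and speaker similarities before clustering, so that one modality can veto a spurious match in the other. The sketch below uses hypothetical similarity values for four person tracks belonging to two people; face similarity alone wrongly links the two people (a look-alike pair), while the fused similarity separates them. The actual AXES method is not shown here, only the general fusion idea.

```python
import numpy as np

# Hypothetical pairwise similarities for four person tracks
# (two people: tracks 0,1 vs tracks 2,3). Face similarity alone is
# ambiguous for the pair (1, 2): two similar-looking people.
face = np.array([
    [1.0, 0.9, 0.6, 0.1],
    [0.9, 1.0, 0.7, 0.1],
    [0.6, 0.7, 1.0, 0.9],
    [0.1, 0.1, 0.9, 1.0],
])
voice = np.array([
    [1.0, 0.8, 0.1, 0.1],
    [0.8, 1.0, 0.1, 0.1],
    [0.1, 0.1, 1.0, 0.8],
    [0.1, 0.1, 0.8, 1.0],
])

fused = 0.5 * (face + voice)  # simple late fusion of the two modalities

def cluster(sim, thresh=0.6):
    """Group tracks via connected components of sim > thresh (union-find)."""
    n = len(sim)
    parent = list(range(n))
    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i
    for i in range(n):
        for j in range(i + 1, n):
            if sim[i, j] > thresh:
                parent[find(i)] = find(j)
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())

print(cluster(face))   # face alone merges all four tracks into one person
print(cluster(fused))  # fused similarities recover the two true persons
```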
On-the-fly pose retrieval: A new on-the-fly component, to be added to future releases of the AXES systems, searches for videos showing poses similar to those in an example image, or to a pose specified interactively with a stick figure.