With the current state of technology, it is easier to analyze audio than video (see Audio Processing), but stakeholders undeniably show a strong interest in video-content analysis, preferably without significant time delay. To perform TV monitoring successfully, you need an analysis system that provides real-time analytics, worldwide and integrated coverage in multiple languages, alerts, information segmented into stories, and the ability to define your areas of interest concisely and precisely, based on textual representations of both audio and video content.
This work is inherently multidisciplinary. It requires competence in signal processing, acoustics, phonetics, phonology, linguistics, and computer science, as well as the ability to combine technologies to extract the most from the incoming media. The presented systems are based on Hidden Markov Models (HMMs), which generate best-of-breed results on continuous speech as typically found in broadcast monitoring applications.
Processing video data consists of four main parts: (1) keyframe detection, (2) shot-cut detection, (3) text detection, and (4) optical character recognition (OCR).
A keyframe is a frame in which a complete image is stored in the data stream. To reduce the amount of information that must be stored, the frames between keyframes store only the changes that occur from one frame to the next.
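The idea of storing complete images only at keyframes and changes everywhere else can be illustrated with a toy sketch. It assumes a frame is a flat list of pixel values and that a keyframe is forced at a fixed interval; the names `encode`, `decode`, and `keyframe_interval` are hypothetical, and real video codecs are far more elaborate.

```python
from typing import Dict, List, Tuple, Union

Frame = List[int]  # toy representation: a frame is a flat list of pixel values
Entry = Tuple[str, Union[Frame, Dict[int, int]]]


def encode(frames: List[Frame], keyframe_interval: int = 4) -> List[Entry]:
    """Store a complete image at each keyframe, otherwise only changed pixels."""
    stream: List[Entry] = []
    previous: Frame = []
    for index, frame in enumerate(frames):
        if index % keyframe_interval == 0:
            # Keyframe: the complete image goes into the stream.
            stream.append(("key", list(frame)))
        else:
            # In-between frame: keep only positions that differ from the previous frame.
            delta = {pos: new for pos, (old, new) in enumerate(zip(previous, frame)) if old != new}
            stream.append(("delta", delta))
        previous = frame
    return stream


def decode(stream: List[Entry]) -> List[Frame]:
    """Rebuild every frame by applying the deltas on top of the last keyframe."""
    frames: List[Frame] = []
    current: Frame = []
    for kind, payload in stream:
        if kind == "key":
            current = list(payload)
        else:
            current = list(current)
            for pos, value in payload.items():
                current[pos] = value
        frames.append(current)
    return frames


frames = [[0, 0, 0], [0, 1, 0], [0, 1, 2], [5, 5, 5], [5, 5, 6]]
stream = encode(frames, keyframe_interval=3)
```

In this sketch, keyframe detection amounts to finding the entries marked `"key"` in the stream; every other frame can only be reconstructed relative to the keyframe that precedes it.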
Few broadcasts consist of a single camera angle or shot. To find the abrupt changes in the video stream, in other words to find where camera shots start and where different shots are joined together, so-called shot-cut detection is used.
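One common way to find such abrupt changes is to compare the brightness histograms of consecutive frames and flag a cut when they differ strongly. The following is a minimal sketch of that general technique, not necessarily the method used by the system described here. It assumes non-empty frames given as flat sequences of 8-bit grayscale values; the function names, the bin count, and the threshold are illustrative assumptions.

```python
from typing import List, Sequence


def histogram(frame: Sequence[int], bins: int = 8, max_value: int = 256) -> List[float]:
    """Normalized brightness histogram of a frame (assumes a non-empty frame)."""
    counts = [0] * bins
    for pixel in frame:
        counts[pixel * bins // max_value] += 1
    return [count / len(frame) for count in counts]


def detect_cuts(frames: Sequence[Sequence[int]], threshold: float = 0.5, bins: int = 8) -> List[int]:
    """Return the indices of frames that start a new shot."""
    cuts: List[int] = []
    previous = None
    for index, frame in enumerate(frames):
        current = histogram(frame, bins)
        if previous is not None:
            # Half the L1 distance between two normalized histograms lies in [0, 1].
            distance = sum(abs(a - b) for a, b in zip(current, previous)) / 2
            if distance > threshold:
                cuts.append(index)
        previous = current
    return cuts


dark_a = [10] * 16      # two similar dark frames (same shot)
dark_b = [12] * 16
bright = [240] * 16     # a very different bright frame (new shot)
cuts = detect_cuts([dark_a, dark_b, bright, bright])
```

A fixed threshold like this reacts to hard cuts only; gradual transitions such as fades or dissolves would need a comparison over a longer window.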
Audio and video are captured from a variety of sources and put on an incoming processing bus. Attached to this bus are processing components.
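The bus arrangement can be pictured as a simple publish/subscribe structure: capture sources put media on the bus, and every attached component receives the items it is interested in. The sketch below is only an assumption about how such a bus could be organized; the class `ProcessingBus`, its methods, and the component names are hypothetical and not taken from the described system.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Component = Callable[[Any], None]


class ProcessingBus:
    """Toy in-process bus: components attach per media type and receive published items."""

    def __init__(self) -> None:
        self._components: DefaultDict[str, List[Component]] = defaultdict(list)

    def attach(self, media_type: str, component: Component) -> None:
        # Register a processing component for one kind of media.
        self._components[media_type].append(component)

    def publish(self, media_type: str, payload: Any) -> None:
        # Hand the captured item to every component attached for this media type.
        for component in self._components.get(media_type, []):
            component(payload)


log = []
bus = ProcessingBus()
bus.attach("audio", lambda chunk: log.append(("speech_recognizer", chunk)))
bus.attach("video", lambda frame: log.append(("shot_cut_detector", frame)))
bus.attach("video", lambda frame: log.append(("text_insert_reader", frame)))

bus.publish("audio", "audio-chunk-1")
bus.publish("video", "video-frame-1")
```

Decoupling capture from processing in this way means new components can be attached without changing the sources that feed the bus.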
Text analysis and text-based applications are very common companions to speech and audio recognition. Some of the most widely used text-based applications are information extraction, information retrieval, topic tracking, summarization, categorization, concept linkage, information visualization, and question answering.
Before the targeted media can be used, it must be processed by a format converter. Specifically, the information is encoded and compressed so that it can be streamed, and it is split into segments to provide user-defined downloads at different quality levels.
Some broadcasts provide subtitles, which should not be left out. A subtitle reader extracts those subtitles, or closed captions, and converts them into a time-stamped textual format. Further metadata, such as an electronic program guide, is also used to retrieve even more information.
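As an illustration of what a time-stamped textual format might look like, the sketch below converts subtitles into `(start, end, text)` tuples with times in seconds. It assumes the subtitles arrive in the SubRip (SRT) layout, which is an assumption made for this example only; broadcast closed captions often use other formats, and the function names are hypothetical.

```python
import re
from typing import List, Tuple

TIMESTAMP = re.compile(r"(\d{2}):(\d{2}):(\d{2}),(\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp such as 00:01:00,250 into seconds."""
    match = TIMESTAMP.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"not an SRT timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(group) for group in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def read_subtitles(srt_text: str) -> List[Tuple[float, float, str]]:
    """Return (start, end, text) tuples, one per subtitle block."""
    entries: List[Tuple[float, float, str]] = []
    # Blocks are separated by blank lines: counter, time range, then one or more text lines.
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = block.splitlines()
        if len(lines) < 3:
            continue  # skip incomplete blocks
        start, end = lines[1].split("-->")
        text = " ".join(line.strip() for line in lines[2:])
        entries.append((to_seconds(start), to_seconds(end), text))
    return entries


sample = (
    "1\n00:00:01,000 --> 00:00:03,500\nGood evening.\n\n"
    "2\n00:01:00,250 --> 00:01:02,000\nTop story\ntonight."
)
entries = read_subtitles(sample)
```

Once subtitles are in this form, they can be aligned by time with the output of the other components, such as recognized speech or text read from the screen.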
In some cases, channels or programs insert text underneath the speakers. This could be the case in an interview, for example, where more information about the speaker is displayed on the screen. The text insert reader identifies those text blocks and uses optical character recognition (OCR) to transform the images into time-stamped text.