We are delighted to announce a new capability in Microsoft Video Indexer: Brand Detection from speech and from visual text! If you are not yet familiar with Video Indexer, you may want to take a look at a few examples on our portal.
Having brands in the video index gives you insight into the names of products and organizations that appear in a video or audio asset without having to watch it. In particular, it enables you to search over large amounts of video and audio. Customers find brand detection useful in a wide variety of business scenarios, such as content archiving and discovery, contextual advertising, social media analysis, retail competitive analysis, and many more.
Out-of-the-box brand detection
Let us take a look at an example. In this Microsoft Build 2017 Day 2 presentation, the brand “Microsoft Windows” appears multiple times: sometimes in the transcript, sometimes as visual text, and never verbatim. Video Indexer detects with high precision that a term is indeed a brand based on its context, covering over 90,000 brands out of the box, with the list constantly updating. At 02:25, Video Indexer detects the brand from speech, and then again at 02:40 from visual text.
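Once a video is indexed, the detected brands can be read out of the index programmatically. The snippet below is a minimal sketch that parses a trimmed, hypothetical sample of an index response; the real schema returned by the Video Indexer API may differ, and the field names under `summarizedInsights.brands` are an assumption for illustration.

```python
import json

# A trimmed, HYPOTHETICAL sample of a Video Indexer index response.
# Real responses contain many more fields; names here are assumptions.
sample_index = """
{
  "summarizedInsights": {
    "brands": [
      {
        "name": "Microsoft Windows",
        "appearances": [
          {"startSeconds": 145, "endSeconds": 150},
          {"startSeconds": 160, "endSeconds": 163}
        ]
      }
    ]
  }
}
"""

def list_brands(index_json: str):
    """Return (brand name, total seconds of appearance) pairs."""
    index = json.loads(index_json)
    results = []
    for brand in index["summarizedInsights"]["brands"]:
        total = sum(a["endSeconds"] - a["startSeconds"]
                    for a in brand["appearances"])
        results.append((brand["name"], total))
    return results

print(list_brands(sample_index))  # → [('Microsoft Windows', 8)]
```

In a real application you would fetch the index JSON from the Video Indexer API for your account and video ID, then run the same kind of extraction to power search or reporting over your archive.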
Extracting insights from video using AI technologies presents an additional set of challenges, and opportunities for optimization, compared to images. There is a misconception that AI for video is simply a matter of extracting frames from a video and running computer vision algorithms on each frame. You can certainly do that, but it would not get you the insights you are truly after. In this blog post, I will use a few examples to explain the shortcomings of processing individual video frames in isolation. I will not go into the details of the additional algorithms required to overcome these shortcomings; Video Indexer implements several such video-specific algorithms.
Person presence in the video
Look at the first 25 seconds of this video.
Notice that Doug is present for the entire 25 seconds.
If I were to draw a timeline for when Doug is present in the video, it should be something like this.
Note that Doug is not always facing the camera. Seven seconds into the video he is looking at Emily; the same thing happens at 23 seconds.
If you were to run face detection at
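The limitation described above can be illustrated with a small sketch. Suppose we have a boolean per-second face-detection signal (hypothetical data, mimicking Doug looking away at seconds 7 and 23). Naive per-frame detection fragments his presence into pieces, while bridging short gaps recovers the single continuous interval a human viewer would report. The `max_gap` threshold is an assumption for illustration, not Video Indexer's actual algorithm.

```python
def presence_intervals(detections, max_gap=3):
    """Merge per-second face detections (one bool per second) into
    presence intervals, bridging gaps of up to max_gap seconds where
    the face was missed (e.g. the person looked away from the camera)."""
    intervals = []
    start = None   # start second of the current interval, if any
    gap = 0        # consecutive seconds with no detection
    for t, seen in enumerate(detections):
        if seen:
            if start is None:
                start = t
            gap = 0
        elif start is not None:
            gap += 1
            if gap > max_gap:          # gap too long: close the interval
                intervals.append((start, t - gap))
                start, gap = None, 0
    if start is not None:              # close a trailing open interval
        intervals.append((start, len(detections) - 1 - gap))
    return intervals

# 25 seconds of video; the face is missed at t=7 and t=23.
detections = [t not in (7, 23) for t in range(25)]
print(presence_intervals(detections))             # → [(0, 24)]
print(presence_intervals(detections, max_gap=0))  # → [(0, 6), (8, 22), (24, 24)]
```

With gap bridging, the timeline matches what a viewer sees: Doug is present for the entire 25 seconds. Without it, per-frame detection splits his presence into three fragments around the moments he faces Emily.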
Self-service customization for speech recognition
Automatic speech recognition (ASR) is an important audio analysis feature in Video Indexer. Speech recognition is artificial intelligence at its best, mimicking the human cognitive ability to extract words from audio. In this blog post, we will learn how to customize ASR in Video Indexer to better fit specialized needs.
Before we get into technical details, let's take inspiration from a situation we have all experienced. Try to recall your first days on a job. You probably remember feeling flooded with new words, product names, cryptic acronyms, and unfamiliar ways of using them. After some time, however, you came to understand all these new words: you adapted yourself to the vocabulary.
ASR systems are great, but when it comes to recognizing a specialized vocabulary, ASR systems are just like humans. They need to adapt. Video Indexer now supports a customization layer for speech recognition, which allows you to teach the ASR engine new words, acronyms, and how they are used in your business context.
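Teaching the engine typically means supplying adaptation text: plain-text sentences that show your domain vocabulary in context. The phrases below are hypothetical examples of such a corpus; the idea is that full in-context sentences help the model more than isolated terms would.

```python
# HYPOTHETICAL domain phrases for an adaptation corpus; substitute the
# product names and acronyms that actually occur in your organization.
domain_phrases = [
    "Kubernetes ingress controller",
    "our CELA team reviewed the contract",
    "deploy the AKS cluster with ARM templates",
]

def build_adaptation_text(phrases):
    """Assemble adaptation text: one sentence per line, blanks dropped.
    The ASR engine learns new words from their surrounding context."""
    return "\n".join(p.strip() for p in phrases if p.strip()) + "\n"

adaptation_text = build_adaptation_text(domain_phrases)
print(adaptation_text)
```

The resulting text file would then be uploaded to your Video Indexer account's customization layer so that subsequent indexing jobs recognize these terms.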
How does Automatic Speech Recognition work? Why is customization needed?