Category Archives: Media Services & CDN



Logic Apps, Flow connectors will make Automating Video Indexer simpler than ever

Video Indexer recently released a new and improved Video Indexer V2 API. This RESTful API supports both server-to-server and client-to-server communication and enables Video Indexer users to integrate video and audio insights easily into their application logic, unlocking new experiences and monetization opportunities.
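As a rough illustration of what a server-to-server call might look like, the sketch below only builds the request URL for uploading a video by URL; it does not send anything. The base URL, path layout, and query parameter names (`accessToken`, `name`, `videoUrl`) are assumptions based on the public V2 API and should be checked against the API reference. The function name `build_upload_url` and the sample values are hypothetical.

```python
from urllib.parse import quote, urlencode

# Assumed base URL of the Video Indexer V2 API; verify against the API reference.
API_ROOT = "https://api.videoindexer.ai"


def build_upload_url(location, account_id, access_token, name, video_url):
    """Build the URL for a POST request that asks the service to index a video.

    The path layout and parameter names are assumptions, not confirmed by this post.
    """
    query = urlencode(
        {"accessToken": access_token, "name": name, "videoUrl": video_url}
    )
    return f"{API_ROOT}/{quote(location)}/Accounts/{quote(account_id)}/Videos?{query}"


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(
        build_upload_url(
            "trial", "acc-123", "tok", "My Video", "https://example.com/a.mp4"
        )
    )
```

An HTTP client would then issue a POST to the resulting URL; the connectors described next wrap this kind of call so that you do not have to write it yourself.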

To make the integration even easier, we also added new Logic Apps and Flow connectors that are compatible with the new API. Using the new connectors, you can now set up custom workflows to effectively index and extract insights from a large number of video and audio files, without writing a single line of code! Furthermore, using the connectors for your integration gives you better visibility into the health of your flow and an easy way to debug it.

To help you get started quickly with the new connectors, we’ve added Microsoft Flow templates that use the new connectors to automate extraction of insights from videos. In this blog, we will walk you through those example templates.

Upload and index your video automatically

This scenario consists of two flows that work together. The first flow is triggered when a new file is added to a designated folder in a OneDrive account. It uploads the new




Get video insights in (even) more languages!

For those of you who might not have tried it yet, Video Indexer is a cloud application and platform built upon media AI technologies to make it easier to extract insights from video and audio files. As a starting point for extracting the textual part of the insights, the solution creates a transcript based on the speech appearing in the file; this process is referred to as Speech-to-text. Today, Video Indexer’s Speech-to-text supports ten different languages. Supported languages include English, Spanish, French, German, Italian, Chinese (Simplified), Portuguese (Brazilian), Japanese, Arabic, and Russian.

However, if the content you need is not in one of the above languages, fear not! Video Indexer partners with other transcription service providers to extend its speech-to-text capabilities to many more languages. One of those partnerships is with Zoom Media, which extended speech-to-text support to Dutch, Danish, Norwegian, and Swedish.

A great example of using Video Indexer and Zoom Media is the Dutch public broadcaster AVROTROS, which uses Video Indexer to analyze videos and allow editors to search through them. Finus Tromp, Head of Interactive Media at AVROTROS, shared, “We use Microsoft Video Indexer on a daily basis to supply our videos with relevant metadata. The gathered




Build 2018: What’s new in Azure video processing and video AI

Developers and media companies trust and rely on Azure Media Services to encode, protect, analyze, and deliver video at scale. This week, at the Build 2018 conference in Seattle, we are proud to announce a major new API version for Azure Media Services, along with new developer-focused features and updates to Video Indexer.

Media processing at scale: Public preview of the new Azure Media Services API (v3)

Starting at Build 2018, developers can begin working with the public preview of the new Azure Media Services API (v3). The new API provides a simplified development model, enables a better integration experience with key Azure services like Event Grid and Functions, includes two new media analysis capabilities, and provides a new set of SDKs for .NET, .NET Core, Java, Go, Python, and Node.js!

We have created preliminary documentation to help developers get started quickly and learn more about the new Azure Media Services preview release.

Get Started with v3 Public Preview: REST API, SDKs, Swagger Files.
Code Samples used at the Build 2018 session.
Learn more about:
How the new Transform template makes it easier to submit encoding and analysis Jobs.
How to use the




A year’s worth of cloud, AI and partner innovation. Welcome to NAB 2018!

As I reflect on cloud computing and the media industry since last year’s NAB, I see two emerging trends. First, content creators and broadcasters such as Rakuten, RTL, and Al Jazeera are increasingly using the global reach, hybrid model, and elastic scale of Azure to create, manage, and distribute their content. Second, AI-powered tools for extracting insights from content are becoming an integral part of content creation, management, and distribution workflows, with customers such as Endemol Shine Group and Zone TV.

Therefore, at this year’s NAB, we are focused on helping you modernize your media workflows, so you can get the best of cloud computing and AI. We made a number of investments to enable better content production workflows in Azure, including the recent acquisition of Avere Systems. You can learn more about how Azure can help you improve your media workflows and business here.

Read on to learn more about the key advancements we’ve made – in media services, distribution and our partner ecosystem – since last year’s IBC.

Azure Media Services

Democratizing AI for Media Industry: Since its launch at NAB 2016, Azure Media Analytics has come a long way. At Build 2017, we launched Video Indexer,




Using AI to automatically redact faces in videos

In the last few years, many law enforcement agencies have adopted body-worn cameras. In this blog post, I will provide some background on what is driving this growth and talk about how AI can help law enforcement agencies process videos captured by body-worn cameras.

Background on body-worn cameras

A body-worn camera is a wearable audio, video, or photographic recording system. Law enforcement agencies are not the only consumers of body-worn cameras. Other consumers include journalists, medical professionals, athletes, and so on. The forecast unit shipments of body-worn cameras can be seen on this webpage published by Statista.

The National Institute of Justice (NIJ), the research, development, and evaluation agency of the US Department of Justice, conducted research on body-worn cameras for law enforcement and a market survey on body-worn cameras for criminal justice. The survey, updated in 2016, aggregates and summarizes information on a number of makes and models of body-worn cameras available today, including the approximate cost of each unit. The full market survey on body-worn camera technologies can be found on NIJ’s website.

Freedom of Information Act (FOIA)

FOIA is defined as a law that gives citizens the right




Brand Detection in Microsoft Video Indexer

We are delighted to announce a new capability in Microsoft Video Indexer: Brand Detection from speech and from visual text! If you are not yet familiar with Video Indexer, you may want to take a look at a few examples on our portal.

Having brands in the video index gives you insights into the names of products and organizations that appear in a video or audio asset, without having to watch it. In particular, it enables you to search over large amounts of video and audio. Customers find Brand Detection useful in a wide variety of business scenarios such as content archiving and discovery, contextual advertising, social media analysis, retail competitive analysis, and many more.

Out of the box brand detection

Let us take a look at an example. In this Microsoft Build 2017 Day 2 presentation, the brand “Microsoft Windows” appears multiple times: sometimes in the transcript, sometimes as visual text, and never verbatim. Video Indexer detects with high precision that a term is indeed a brand based on its context, covering over 90k brands out of the box, a list that is constantly updated. At 02:25, Video Indexer detects the brand from speech and then again at 02:40 from visual text, which is




How is AI for video different from AI for images

Extracting insights from video using AI technologies presents an additional set of challenges and opportunities for optimization compared to images. There is a misconception that AI for video is simply extracting frames from a video and running computer vision algorithms on each video frame. While you can certainly do that, it would not help you get the insights that you are truly after. In this blog post, I will use a few examples to explain the shortcomings of processing only individual video frames. I will not go over the details of the additional algorithms required to overcome these shortcomings. Video Indexer implements several such video-specific algorithms.

Person presence in the video

Look at the first 25 seconds of this video.

Notice that Doug is present for the entire 25 seconds.

If I were to draw a timeline for when Doug is present in the video, it should be something like this.


Note that Doug is not always facing the camera. Seven seconds into the video, he is looking at Emily. The same thing happens at 23 seconds.
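To make the timeline idea concrete, here is a minimal sketch of what a frame-by-frame approach might produce. It assumes a hypothetical face detector that reports Doug once per second but misses him whenever he turns away; the missed seconds (7, 8, and 23) are made up for illustration. The gap-bridging step is a deliberately simple stand-in and is not a description of how Video Indexer works.

```python
def presence_intervals(detected_seconds, max_gap=0):
    """Merge per-second detections into (start, end) presence intervals.

    Two detections are joined into one interval when at most `max_gap`
    seconds without a detection lie between them.
    """
    intervals = []
    for t in sorted(detected_seconds):
        if intervals and t - intervals[-1][1] - 1 <= max_gap:
            intervals[-1][1] = t  # extend the current interval
        else:
            intervals.append([t, t])  # start a new interval
    return [tuple(interval) for interval in intervals]


# Hypothetical detections: Doug is on screen for seconds 0..24, but the
# detector misses him while he looks away (seconds 7, 8, and 23).
detections = [t for t in range(25) if t not in {7, 8, 23}]

per_frame_timeline = presence_intervals(detections, max_gap=0)
bridged_timeline = presence_intervals(detections, max_gap=2)

if __name__ == "__main__":
    print(per_frame_timeline)  # fragmented into three intervals
    print(bridged_timeline)    # one continuous interval
```

With no bridging, the timeline breaks into three pieces even though Doug never leaves the scene. Bridging short gaps recovers a single interval in this toy case, but a fixed threshold cannot tell a turned head from a person who actually left the shot.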

If you were to run face detection at




Bring your own vocabulary to Microsoft Video Indexer

Self-service customization for speech recognition

Video Indexer (VI) now supports industry- and business-specific customization for automatic speech recognition (ASR) through integration with the Microsoft Custom Speech Service!

ASR is an important audio analysis feature in Video Indexer. Speech recognition is artificial intelligence at its best, mimicking the human cognitive ability to extract words from audio. In this blog post, we will learn how to customize ASR in VI, to better fit specialized needs.

Before we get into the technical details, let’s take inspiration from a situation we have all experienced. Try to recall your first days on a job. You can probably remember feeling flooded with new words, product names, cryptic acronyms, and ways to use them. After some time, however, you came to understand all these new words. You adapted to the vocabulary.

ASR systems are great, but when it comes to recognizing a specialized vocabulary, ASR systems are just like humans. They need to adapt. Video Indexer now supports a customization layer for speech recognition, which allows you to teach the ASR engine new words, acronyms, and how they are used in your business context.

How does Automatic Speech Recognition work? Why is customization needed?

Roughly speaking,