Voice is the new interface driving ambient computing. This statement has never been more true than it is today. Speech recognition is transforming our daily lives from digital assistants, dictation of emails and documents, to transcriptions of lectures and meetings. These scenarios are possible today thanks to years of research in speech recognition and technological jumps enabled by neural networks. Microsoft is at the forefront of Speech Recognition with its research results, reaching human parity on the Switchboard research benchmark.
Our goal is to empower developers with our AI advances, so they can build new and transformative experiences for their customers. We offer a spectrum of APIs to address the various scenarios and situations developers encounter. Cognitive Services Speech API gives developers access to state of the art speech models. Premium scenarios, using domain specific vocabulary or complex acoustic conditions, offer Custom Speech Service that enables developers to automatically tune speech recognition models to their specific needs. Our services have been previewed on a wide range of scenarios with customers.
Speech recognition systems are composed of several components. The most important components are the acoustic and language models. If your application contains vocabulary items that occur rarely in everyday
Voice is becoming more and more prevalent as a mode of interaction with all kinds of devices and services. The ability to provide not only voice input but also voice output or Text-to-speech (TTS), is also becoming a critical technology that supports AI. Whether you need to interact on a device, over the phone, in a vehicle, through a building PA system, or even with a translated input, TTS is a crucial part of your end-to-end solution. It is also a necessity for all applications that enable accessibility.
We are excited to announce that the Speech API, a Microsoft Cognitive Service, now offers six new TTS languages to all developers, bringing the total number of available languages to 34:
Bulgarian (language code: bg-BG) Croatian (hr-HR) Malaysia (ms-MY) Slovenian (sl-SI) Tamil (ta-IN) Vietnamese (vi-VN)
Powered by the latest AI technology, these 34 languages are available across 48 locales and 78 voice fonts. Through a single API, developers can access the latest-generation of speech recognition and TTS models.
This Text-to-Speech API can be integrated by developers for a broad set of use cases. It can be used on its own for accessibility, hand-free communication or media consumption, or any other machine-to-human interactions.
Extracting insights from video, or using AI technologies, presents an additional set of challenges and opportunities for optimization as compared to images. There is a misconception that AI for video is simply extracting frames from a video and running computer vision algorithms on each video frame. While you can certainly do that but that would not help you get the insights that you are truly after. In this blog post, I will use a few examples to explain the shortcomings of taking an approach of just processing individual video frames. I will not be going over the details of the additional algorithms that are required to overcome these shortcomings. Video Indexer implements several such video specific algorithms.
Person presence in the video
Look at the first 25 seconds of this video.
Notice that Doug is present for the entire 25 seconds.
If I were to draw a timeline for when Doug is present in the video, it should be something like this.
Note the fact that Doug is not always facing the camera. Seven seconds in the video he is looking at Emily. Same thing happens at 23 seconds.
If you were to run face detection at