In Video Indexer, we have the capability to recognize display text in videos. This blog explains some of the techniques we used to extract the best quality data. To start, take a look at the sequence of frames below.
Did you manage to recognize the text in the images? Most likely you did, without even noticing. However, running even the best Optical Character Recognition (OCR) service on these images will yield broken words such as “icrosof”, “Mi”, “osoft”, and “Micros”, simply because the text is partially hidden in each image.
There is a misconception that AI for video is simply extracting frames from a video and running computer vision algorithms on each one. Video processing is much more than processing individual frames with an image-processing algorithm. For example, at 30 frames per second, a minute-long video yields 1,800 frames, producing a lot of data but, as we see above, not many meaningful words. A separate blog post covers how AI for video is different from AI for images.
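One way to recover the full word from fragments like those above is to merge the per-frame OCR outputs by their overlaps. The sketch below is purely illustrative (it is not the actual Video Indexer pipeline), using a greedy pairwise merge:

```python
def overlap(a: str, b: str) -> int:
    """Length of the longest suffix of a that is also a prefix of b."""
    for k in range(min(len(a), len(b)), 0, -1):
        if a.endswith(b[:k]):
            return k
    return 0

def merge_fragments(fragments):
    """Greedily merge partially hidden OCR fragments into one string.

    Repeatedly joins the pair with the largest overlap (or containment)
    until a single candidate word remains.
    """
    pieces = list(fragments)
    while len(pieces) > 1:
        best = None  # (score, i, j)
        for i, a in enumerate(pieces):
            for j, b in enumerate(pieces):
                if i == j:
                    continue
                score = len(b) if b in a else overlap(a, b)
                if best is None or score > best[0]:
                    best = (score, i, j)
        score, i, j = best
        a, b = pieces[i], pieces[j]
        merged = a if b in a else a + b[score:]
        pieces = [p for n, p in enumerate(pieces) if n not in (i, j)] + [merged]
    return pieces[0]

# The broken per-frame readings from above merge back into the full word:
# merge_fragments(["icrosof", "Mi", "osoft", "Micros"]) -> "Microsoft"
```

A production system would also weigh OCR confidence scores and frame timestamps, but the core idea of aggregating evidence across frames is the same.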
While humans have cognitive abilities that allow them to complete hidden parts of a word from context, an OCR algorithm sees only what is visible in each individual frame.
This blog post is co-authored by Ashish Jhanwar, Data Scientist, Microsoft
Content Moderator is part of Microsoft Cognitive Services, allowing businesses to use machine-assisted moderation of text, images, and videos that augments human review.
The text moderation capability now includes a new machine-learning based text classification feature, which uses a trained model to identify potentially abusive, derogatory, or discriminatory language, including slang, abbreviations, and offensive or intentionally misspelled words, for review.
In contrast to the existing text moderation service that flags profanity terms, the text classification feature helps detect potentially undesired content that may be deemed inappropriate depending on context. In addition to conveying the likelihood of each category, it may recommend a human review of the content.
The text classification feature is in preview and supports the English language.
How to use
Content Moderator consists of a set of REST APIs. The text moderation API accepts an additional request parameter, classify=True. If you specify the parameter as true, and the auto-detected language of your input text is English, the API outputs the additional classification insights shown in the following sections.
If you specify the language as English for non-English text,
Looking to transform your business by improving your on-premises environments, accelerating your move to the cloud, and gaining transformative insights from your data? Here’s your opportunity to learn from the experts and ask the questions that help your organization move forward.
Join us for one or all of these training sessions to take a deep dive into a variety of topics, including products like Azure Cosmos DB, along with Microsoft innovations in artificial intelligence, advanced analytics, and big data.
Azure Cosmos DB
Engineering experts are leading a seven-part training series on Azure Cosmos DB, complete with interactive Q&As. In addition to a high-level technical deep dive, this series covers a wide array of topics, including:
By the end of this series, you’ll be able to build serverless applications and conduct real-time analytics using Azure Cosmos DB, Azure Functions, and Spark. Register to attend the whole Azure Cosmos DB series, or register for the sessions that interest you.
Artificial Intelligence (AI)
Learn to create the next generation of applications spanning an intelligent cloud as well as an intelligent edge powered by AI. Microsoft offers a comprehensive set of flexible AI services for any
Integrating geography and location information with AI brings a powerful new dimension to understanding the world around us. This has a wide range of applications across commercial, governmental, academic, and not-for-profit segments. Geospatial AI provides robust tools for gathering, managing, analyzing, and predicting from geographic and location-based data, and powerful visualization that can enable unique insights into the significance of such data.
Starting today, Microsoft and Esri are offering the GeoAI Data Science Virtual Machine (DSVM) as part of our Data Science Virtual Machine/Deep Learning Virtual Machine family of products on Azure. This collaboration between the two companies brings together AI, cloud technology and infrastructure, geospatial analytics, and visualization to help create more powerful and intelligent applications.
At the heart of the GeoAI Virtual Machine is ArcGIS Pro, Esri’s next-gen 64-bit desktop geographic information system (GIS) that provides professional 2D and 3D mapping in an intuitive user interface. ArcGIS Pro is a big step forward in advancing visualization, analytics, image processing, data management and integration.
ArcGIS Pro is installed in a Data Science Virtual Machine (DSVM) image from Microsoft. The DSVM is a popular experimentation and modeling
This blog post was co-authored by Riham Mansour, Principal Program Manager, Fuse Labs.
Conversational systems are rapidly becoming a key component of solutions such as virtual assistants, customer care, and the Internet of Things. When we talk about conversational systems, we refer to a computer’s ability to understand the human voice and take action based on understanding what the user meant. What’s more, these systems won’t be relying on voice and text alone. They’ll be using sight, sound, and feeling to process and understand these interactions, further blurring the lines between the digital sphere and the reality in which we are living. Chatbots are one common example of conversational systems.
Chatbots are a trendy example of conversational systems: they can maintain a conversation with a user in natural language, understand the user’s intent, and send responses based on the organization’s business rules and data. These chatbots use artificial intelligence to process language, enabling them to understand human speech. They can decipher verbal or written questions and provide responses with appropriate information or direction. Many customers first experienced chatbots through dialogue boxes on company websites. Chatbots can also interact verbally with consumers, as Cortana, Siri, and Amazon’s Alexa do. Chatbots are
When we announced our partnership with Cray, it was very exciting news. I received my undergraduate degree in meteorology, so my mind immediately went to how this could be a benefit to weather forecasting.
Weather modeling is an interesting use case. It requires a large number of cores with a low-latency interconnect, and it is very time sensitive. After all, what good is a one hour weather forecast if it takes 90 minutes to run? And weather is a very local phenomenon. In order to resolve smaller scale features without shrinking the domain or lengthening runtime, modelers must add more cores. A global weather model with a 0.5 degree grid spacing can require as many as 50,000 cores.
At that scale, and with the performance required to be operationally useful, a Cray supercomputer is an excellent fit. But the model by itself doesn’t mean much: the model data needs to be processed to generate products. This is where Azure services come in.
Website images are one obvious product of weather models. Image generation programs require small scale and can be done in parallel, so they’re great for using the elasticity of Azure virtual machines. The same can
Voice is the new interface driving ambient computing. This statement has never been more true than it is today. Speech recognition is transforming our daily lives from digital assistants, dictation of emails and documents, to transcriptions of lectures and meetings. These scenarios are possible today thanks to years of research in speech recognition and technological jumps enabled by neural networks. Microsoft is at the forefront of Speech Recognition with its research results, reaching human parity on the Switchboard research benchmark.
Our goal is to empower developers with our AI advances so they can build new and transformative experiences for their customers. We offer a spectrum of APIs to address the various scenarios and situations developers encounter. The Cognitive Services Speech API gives developers access to state-of-the-art speech models. For premium scenarios involving domain-specific vocabulary or complex acoustic conditions, the Custom Speech Service enables developers to automatically tune speech recognition models to their specific needs. Our services have been previewed by customers across a wide range of scenarios.
Speech recognition systems are composed of several components. The most important components are the acoustic and language models. If your application contains vocabulary items that occur rarely in everyday
Voice is becoming more and more prevalent as a mode of interaction with all kinds of devices and services. The ability to provide not only voice input but also voice output or Text-to-speech (TTS), is also becoming a critical technology that supports AI. Whether you need to interact on a device, over the phone, in a vehicle, through a building PA system, or even with a translated input, TTS is a crucial part of your end-to-end solution. It is also a necessity for all applications that enable accessibility.
We are excited to announce that the Speech API, a Microsoft Cognitive Service, now offers six new TTS languages to all developers, bringing the total number of available languages to 34:
Bulgarian (language code: bg-BG), Croatian (hr-HR), Malay (ms-MY), Slovenian (sl-SI), Tamil (ta-IN), Vietnamese (vi-VN)
Powered by the latest AI technology, these 34 languages are available across 48 locales and 78 voice fonts. Through a single API, developers can access the latest-generation of speech recognition and TTS models.
This Text-to-Speech API can be integrated by developers for a broad set of use cases. It can be used on its own for accessibility, hands-free communication, media consumption, or any other machine-to-human interaction.
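To illustrate, a TTS request body is an SSML document naming a locale and a voice. The helper below is a minimal sketch: the voice_font default is a placeholder, since the exact service voice names for each of the 34 languages must be looked up in the Speech API documentation.

```python
def build_ssml(text: str,
               lang: str = "vi-VN",
               voice_font: str = "<service voice name for vi-VN>") -> str:
    """Build the SSML body for a Text-to-Speech request.

    lang is one of the supported locales (e.g. the new vi-VN locale);
    voice_font must be replaced with an actual voice name from the
    Speech API documentation before sending the request.
    """
    return (
        f"<speak version='1.0' xml:lang='{lang}'>"
        f"<voice xml:lang='{lang}' name='{voice_font}'>{text}</voice>"
        "</speak>"
    )

# build_ssml("Xin chào") produces the SSML payload to POST to the
# service's synthesize endpoint, along with your authorization token
# and an X-Microsoft-OutputFormat header for the desired audio format.
```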
Extracting insights from video using AI technologies presents an additional set of challenges and opportunities for optimization compared to images. There is a misconception that AI for video is simply extracting frames from a video and running computer vision algorithms on each video frame. While you can certainly do that, it will not give you the insights you are truly after. In this blog post, I will use a few examples to explain the shortcomings of just processing individual video frames. I will not go over the details of the additional algorithms that are required to overcome these shortcomings; Video Indexer implements several such video-specific algorithms.
Person presence in the video
Look at the first 25 seconds of this video.
Notice that Doug is present for the entire 25 seconds.
If I were to draw a timeline for when Doug is present in the video, it should be something like this.
Note that Doug is not always facing the camera. Seven seconds into the video he is looking at Emily; the same thing happens at 23 seconds.
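A common remedy, shown here as a simplified sketch rather than Video Indexer's actual algorithm, is to aggregate per-frame detections into presence intervals, bridging short gaps such as the moments when Doug looks away from the camera. The max_gap threshold below is an assumption for illustration:

```python
def presence_intervals(detected_seconds, max_gap=3.0):
    """Collapse per-frame face-detection timestamps into presence intervals.

    Detections closer together than max_gap seconds are treated as one
    continuous appearance, so brief moments where the face is not visible
    (e.g. the person turns toward someone else) do not split the timeline.
    """
    intervals = []
    for t in sorted(detected_seconds):
        if intervals and t - intervals[-1][1] <= max_gap:
            intervals[-1][1] = t          # extend the current interval
        else:
            intervals.append([t, t])      # start a new interval
    return [tuple(iv) for iv in intervals]

# Doug's face is detected every second except at 7s and 23s, when he is
# looking at Emily; bridging those gaps yields a single 0-25s interval:
# presence_intervals([s for s in range(26) if s not in (7, 23)]) -> [(0, 25)]
```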
If you were to run face detection at