Category Archives : Big Data



New connectors added to Azure Data Factory empowering richer insights

Data is essential to your business. The ability to unblock business insights more efficiently can be a key competitive advantage to the enterprise. As data grows in volume, variety, and velocity, organizations need to bring together a continuously increasing set of diverse datasets across silos in order to perform advanced analytics and uncover business opportunities. The first challenge to building such big data analytics solutions is how to connect and extract data from a broad variety of data stores. Azure Data Factory (ADF) is a fully-managed data integration service for analytic workloads in Azure, that empowers you to copy data from 80 plus data sources with a simple drag-and-drop experience. Also, with its flexible control flow, rich monitoring, and CI/CD capabilities you can operationalize and manage the ETL/ELT flows to meet your SLAs.

Today, we are excited to announce the release of a set of new ADF connectors which enable more scenarios and possibilities for your analytic workloads. For example, you can now:

Ingest data from Google Cloud Storage into Azure Data Lake Gen2, and process using Azure Databricks jointly with data coming from other sources. Bring data from any S3-compatible data storage that you may consume from third party




Transitioning big data workloads to the cloud: Best practices from Unravel Data

Migrating on-premises Apache Hadoop® and Spark workloads to the cloud remains a key priority for many organizations. In my last post, I shared “Tips and tricks for migrating on-premises Hadoop infrastructure to Azure HDInsight.” In this series, one of HDInsight’s partners, Unravel Data, will share their learnings, best practices, and guidance based on their insights from helping migrate many on-premises Hadoop and Spark deployments to the cloud.

Unravel Data is an AI-driven Application Performance Management (APM) solution for managing and optimizing big data workloads. Unravel Data provides a unified, full-stack view of apps, resources, data, and users, enabling users to baseline and manage app performance and reliability, control costs and SLAs proactively, and apply automation to minimize support overhead. Ops and Dev teams use Unravel Data’s unified capability for on-premises workloads and to plan, migrate, and operate workloads on Azure. Unravel Data is available on the HDInsight Application Platform.

Today’s post, which kicks off the five-part series, comes from Shivnath Babu, CTO and Co-Founder at Unravel Data. This blog series will discuss key considerations in planning for migrations. Upcoming posts will outline the best practices for the migration, operation, and optimization phases of the cloud adoption lifecycle for big data.

Unravel Data’s




Development, source control, and CI/CD for Azure Stream Analytics jobs

Do you know how to develop and source control your Microsoft Azure Stream Analytics (ASA) jobs? Do you know how to setup automated processes to build, test, and deploy these jobs to multiple environments? Stream Analytics Visual Studio tools together with Azure Pipelines provides an integrated environment that helps you accomplish all these scenarios. This article will show you how and point you to the right places in order to get started using these tools.

In the past it was difficult to use Azure Data Lake Store Gen1 as the output sink for ASA jobs, and to set up the related automated CI/CD process. This was because the OAuth model did not allow automated authentication for this kind of storage. The tools being released in January 2019 support Managed Identities for Azure Data Lake Storage Gen1 output sink and now enable this important scenario.

This article covers the end-to-end development and CI/CD process using Stream Analytics Visual Studio tools, Stream Analytics CI.CD NuGet package, and Azure Pipelines. Currently Visual Studio 2019, 2017, and 2015 are all supported. If you haven’t tried the tools, follow the installation instructions to get started!

Job development

Let’s get started by creating a job. Stream




Analyze data in Azure Data Explorer using KQL magic for Jupyter Notebook

Exploring data is like solving a puzzle. You create queries and receive instant satisfaction when you discover insights, just like adding pieces to complete a puzzle. Imagine you have to repeat the same analysis multiple times, use libraries from an open-source community, share your steps and output with others, and save your work as an artifact. Notebooks helps you create one place to write your queries, add documentation, and save your work as output in a reusable format.

Jupyter Notebook allows you to create and share documents that contain live code, equations, visualizations, and explanatory text. Its includes data cleaning and transformation, numerical simulation, statistical modeling, and machine learning.

We are excited to announce KQL magic commands which extends the functionality of the Python kernel in Jupyter Notebook. KQL magic allows you to write KQL queries natively and query data from Microsoft Azure Data Explorer. You can easily interchange between Python and KQL, and visualize data using rich library integrated with KQL render commands. KQL magic supports Azure Data Explorer, Application Insights, and Log Analytics as data sources to run queries against.

Use a single magic “%kql” to run a single line query, or use cell magic “%%kql” to




HDInsight Tools for Visual Studio Code now generally available

We are pleased to announce the general availability for Azure HDInsight Tools for Visual Studio Code (VSCode). HDInsight Tools for VSCode give developers a cross-platform lightweight code editor for developing HDInsight PySpark and Hive batch jobs and interactive query. 

For PySpark developers who value the productivity Python enables, HDInsight Tools for VSCode offer a quick Python editor with simple getting started experiences, and allow you to submit PySpark statements to HDInsight clusters with interactive responses. This interactivity brings the best properties of Python and Spark to developers and empowers you to gain faster insights.

For Hive developers, HDInsight tools for VSCode offer great data warehouse query experiences for big data and helpful features in querying log files and gaining insights. 

Key customer benefits    Integration with Azure worldwide environments for Azure sign-in and HDInsight cluster management  HDInsight Hive and Spark job submission with integration with Spark UI and Yarn UI Interactive responses with the flexibility to execute one or multiple selected Hive and Python scripts Preview and export your interactive query results to CSV, JSON, and Excel format Built-in Hive language services such as IntelliSense auto-suggest, autocomplete, and error marker, among others Supports HDInsight ESP Cluster and Ambari connection Simplified cluster




HDInsight now supported in Azure CLI as a public preview
HDInsight now supported in Azure CLI as a public preview

We recently introduced support for HDInsight in Microsoft Azure CLI as a public preview. With the addition of the new HDInsight command group, you can now utilize all of the features and benefits that come with the familiar cross-platform Azure CLI to manage your HDInsight clusters.

Key Features Cluster CRUD: Create, delete, list, resize, and show properties for your HDInsight clusters. Script actions: Execute script actions, list and delete persistent script actions, promote ad-hoc script executions to persistent script actions, and show the execution history of script actions on HDInsight clusters. Operations Management Suite (OMS): Enable, disable, and show the status of OMS/Log Analytics integration on HDInsight clusters. Applications: Create, delete, list, and show properties for applications on your HDInsight clusters. Core usage: View available core counts by region before deploying large clusters. Azure CLI benefits Cross platform: Use Azure CLI on Windows, macOS, Linux, or the Azure Cloud Shell in a browser to manage your HDInsight clusters with the same commands and syntax across platforms. Tab completion and interactive mode: Autocomplete command and parameter names as well as subscription-specific details like resource group names, cluster names, and storage account names. Don’t remember your 88-character storage account key off the top




Create alerts to proactively monitor your data factory pipelines

Data integration is complex and helps organizations combine data and business processes in hybrid data environments. The increase in volume, variety, and velocity of data has led to delays in monitoring and reacting to issues. Organizations want to reduce the risk of data integration activity failures and the impact it cause to other downstream processes. Manual approaches to monitoring data integration projects are inefficient and time consuming. As a result, organizations want to have automated processes to monitor and manage data integration projects to remove inefficiencies and catch issues before they affect the entire system. Organizations can now improve operational productivity by creating alerts on data integration events (success/failure) and proactively monitor with Azure Data Factory.

To get started, simply navigate to the Monitor tab in your data factory, select Alerts & Metrics, and then select New Alert Rule.

Select the target data factory metric for which you want to be alerted.

Then, configure the alert logic. You can specify various filters such as activity name, pipeline name, activity type, and failure type for the raised alerts. You can also specify the alert logic conditions and the evaluation criteria.

Finally, configure how you want to be




Virtual Network Service Endpoints for serverless messaging and big data

This blog was co-authored by Sumeet Mittal, Senior Program Manager, Azure Networking.

Earlier this year in July, we announced the public preview for Virtual Network Service Endpoints and Firewall rules for both Azure Event Hubs and Azure Service Bus. Today, we’re excited to announce that we are making these capabilities generally available to our customers.

This feature adds to the security and control Azure customers have over their cloud environments. Now, traffic from your virtual network to your Azure Service Bus Premium namespaces and Standard and Dedicated Azure Event Hubs namespaces can be kept secure from public Internet access and completely private on the Azure backbone network.

Virtual Network Service Endpoints do this by extending your virtual network private address space and the identity of your virtual network to your virtual networks. Customers dealing with PII (Financial Services, Insurance, etc.) or looking to further secure access to their cloud visible resources will benefit the most from this feature. For more details on the finer workings of Virtual Network service endpoints, refer to the documentation.

Firewall rules further allow a specific IP address or a specified range of IP addresses to access the resources.

Virtual Network Service Endpoints and Firewall rules




Microsoft open sources Trill to deliver insights on a trillion events a day

In today’s high-speed environment, being able to process massive amounts of data each millisecond is becoming a common business requirement. We are excited to be announcing that an internal Microsoft project known as Trill for processing “a trillion events per day” is now being open sourced to address this growing trend.

Here are just a few of the reasons why developers love Trill:

As a single-node engine library, any .NET application, service, or platform can easily use Trill and start processing queries. A temporal query language allows users to express complex queries over real-time and/or offline data sets. Trill’s high performance across its intended usage scenarios means users get results with incredible speed and low latency. For example, filters operate at memory bandwidth speeds up to several billions of events per second, while grouped aggregates operate at 10 to 100 million events per second. A rich history

Trill started as a research project at Microsoft Research in 2012, and since then, has been extensively described in research papers such as VLDB and the IEEE Data Engineering Bulletin. The roots of Trill’s language lie in Microsoft’s former service StreamInsight, a powerful platform allowing developers to develop and deploy complex event processing




Azure Functions now supported as a step in Azure Data Factory pipelines

Azure Functions is a serverless compute service that enables you to run code on-demand without having to explicitly provision or manage infrastructure. Using Azure Functions, you can run a script or piece of code in response to a variety of events. Azure Data Factory (ADF) is a managed data integration service in Azure that allows you to iteratively build, orchestrate, and monitor your Extract Transform Load (ETL) workflows. Azure Functions is now integrated with ADF, allowing you to run an Azure function as a step in your data factory pipelines.

Simply drag an “Azure Function activity” to the General section of your activity toolbox to get started.

You need to set up an Azure Function linked service in ADF to create a connection to your Azure Function app.

Provide the Azure Function name, method, headers, and body in the Azure Function activity inside your data factory pipeline.

You can also parameterize your function name using rich expression support in ADF. Get more information and detailed steps on using Azure Functions in Azure Data Factory pipelines.

Our goal is to continue adding features and improve the usability of Data Factory tools. Get started building pipelines easily and quickly