Category Archives : Big Data



Create alerts to proactively monitor your data factory pipelines

Data integration is complex, and it helps organizations combine data and business processes in hybrid data environments. The increase in volume, variety, and velocity of data has led to delays in monitoring and reacting to issues. Organizations want to reduce the risk of data integration activity failures and the impact they have on downstream processes. Manual approaches to monitoring data integration projects are inefficient and time consuming. As a result, organizations want automated processes to monitor and manage data integration projects, remove inefficiencies, and catch issues before they affect the entire system. Organizations can now improve operational productivity by creating alerts on data integration events (success/failure) and proactively monitoring them with Azure Data Factory.

To get started, simply navigate to the Monitor tab in your data factory, select Alerts & Metrics, and then select New Alert Rule.

Select the target data factory metric for which you want to be alerted.

Then, configure the alert logic. You can specify various filters such as activity name, pipeline name, activity type, and failure type for the raised alerts. You can also specify the alert logic conditions and the evaluation criteria.
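To make the alert logic concrete, here is a minimal sketch in Python of how a rule with filters, a threshold condition, and an evaluation window could be evaluated. It is only an illustration of the idea, not how Azure Monitor or Data Factory implements alerts. The class names, the dimension names (`PipelineName`, `FailureType`), and the sample values are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class MetricPoint:
    """One hypothetical metric sample: failed activity runs in a given minute."""
    minute: int
    value: int
    dimensions: Dict[str, str]


@dataclass
class AlertRule:
    """Hypothetical alert rule: dimension filters plus a threshold over a time window."""
    filters: Dict[str, str]
    threshold: int
    window_minutes: int

    def matches(self, point: MetricPoint) -> bool:
        # A point matches when every filtered dimension has the requested value.
        return all(point.dimensions.get(k) == v for k, v in self.filters.items())

    def should_fire(self, points: Iterable[MetricPoint], now_minute: int) -> bool:
        # Sum the matching samples inside the evaluation window and compare to the threshold.
        start = now_minute - self.window_minutes
        total = sum(
            p.value
            for p in points
            if start < p.minute <= now_minute and self.matches(p)
        )
        return total > self.threshold


# Illustrative data only.
points = [
    MetricPoint(1, 1, {"PipelineName": "CopySales", "FailureType": "UserError"}),
    MetricPoint(4, 2, {"PipelineName": "CopySales", "FailureType": "SystemError"}),
    MetricPoint(4, 5, {"PipelineName": "LoadOrders", "FailureType": "UserError"}),
]
rule = AlertRule(filters={"PipelineName": "CopySales"}, threshold=2, window_minutes=5)
fired = rule.should_fire(points, now_minute=5)
```

Narrowing the filters (for example, adding a failure type) reduces the samples that count toward the threshold, which is why the same data can fire one rule and not another.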

Finally, configure how you want to be




Virtual Network Service Endpoints for serverless messaging and big data

This blog was co-authored by Sumeet Mittal, Senior Program Manager, Azure Networking.

Earlier this year in July, we announced the public preview for Virtual Network Service Endpoints and Firewall rules for both Azure Event Hubs and Azure Service Bus. Today, we’re excited to announce that we are making these capabilities generally available to our customers.

This feature adds to the security and control Azure customers have over their cloud environments. Now, traffic from your virtual network to your Azure Service Bus Premium namespaces and Standard and Dedicated Azure Event Hubs namespaces can be kept secure from public Internet access and completely private on the Azure backbone network.

Virtual Network Service Endpoints do this by extending your virtual network private address space and the identity of your virtual network to the Azure services. Customers dealing with PII (Financial Services, Insurance, etc.) or looking to further secure access to their cloud visible resources will benefit the most from this feature. For more details on the finer workings of Virtual Network service endpoints, refer to the documentation.

Firewall rules further allow a specific IP address or a specified range of IP addresses to access the resources.
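As a rough illustration of what an IP firewall rule does, the Python sketch below checks a client address against a list of allowed entries, where each entry is either a single IP address or a CIDR range. It uses only the standard `ipaddress` module. The function name and the sample addresses (taken from the reserved documentation ranges) are assumptions for illustration, not part of any Azure API.

```python
import ipaddress
from typing import Iterable


def is_allowed(client_ip: str, rules: Iterable[str]) -> bool:
    """Return True when client_ip falls inside any allowed address or range."""
    address = ipaddress.ip_address(client_ip)
    # A single address such as "203.0.113.7" is parsed as a one-address network.
    return any(
        address in ipaddress.ip_network(rule, strict=False)
        for rule in rules
    )


# Hypothetical rule set: one specific address and one address range.
rules = ["203.0.113.7", "198.51.100.0/24"]
```

With these rules, `is_allowed("198.51.100.42", rules)` is accepted because the address falls inside the range, while an address outside both entries is rejected.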

Virtual Network Service Endpoints and Firewall rules




Microsoft open sources Trill to deliver insights on a trillion events a day

In today’s high-speed environment, being able to process massive amounts of data each millisecond is becoming a common business requirement. We are excited to announce that an internal Microsoft project known as Trill, built for processing “a trillion events per day,” is now being open sourced to address this growing trend.

Here are just a few of the reasons why developers love Trill:

As a single-node engine library, any .NET application, service, or platform can easily use Trill and start processing queries.

A temporal query language allows users to express complex queries over real-time and/or offline data sets.

Trill’s high performance across its intended usage scenarios means users get results with incredible speed and low latency. For example, filters operate at memory bandwidth speeds up to several billions of events per second, while grouped aggregates operate at 10 to 100 million events per second.

A rich history

Trill started as a research project at Microsoft Research in 2012, and since then, has been extensively described in research papers published in venues such as VLDB and the IEEE Data Engineering Bulletin. The roots of Trill’s language lie in Microsoft’s former service StreamInsight, a powerful platform allowing developers to develop and deploy complex event processing




Azure Functions now supported as a step in Azure Data Factory pipelines

Azure Functions is a serverless compute service that enables you to run code on-demand without having to explicitly provision or manage infrastructure. Using Azure Functions, you can run a script or piece of code in response to a variety of events. Azure Data Factory (ADF) is a managed data integration service in Azure that allows you to iteratively build, orchestrate, and monitor your Extract Transform Load (ETL) workflows. Azure Functions is now integrated with ADF, allowing you to run an Azure function as a step in your data factory pipelines.

Simply drag an “Azure Function activity” from the General section of your activity toolbox onto your pipeline to get started.

You need to set up an Azure Function linked service in ADF to create a connection to your Azure Function app.

Provide the Azure Function name, method, headers, and body in the Azure Function activity inside your data factory pipeline.

You can also parameterize your function name using rich expression support in ADF. Get more information and detailed steps on using Azure Functions in Azure Data Factory pipelines.
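For readers who prefer to see the pieces together, here is a small Python sketch that assembles what an Azure Function activity definition might look like as JSON: the linked service reference, the function name, the method, the headers, and the body, with the function name supplied through a pipeline parameter expression. The field names reflect my understanding of the activity's JSON shape and should be checked against the documentation. The activity name, linked service name, parameter name, and payload are hypothetical.

```python
import json
from typing import Any, Dict, Optional


def azure_function_activity(
    name: str,
    linked_service: str,
    function_name: Any,
    method: str,
    headers: Optional[Dict[str, str]] = None,
    body: Optional[Any] = None,
) -> Dict[str, Any]:
    """Build a dictionary shaped like an Azure Function activity definition (assumed shape)."""
    type_properties: Dict[str, Any] = {"functionName": function_name, "method": method}
    if headers:
        type_properties["headers"] = headers
    if body is not None:
        type_properties["body"] = body
    return {
        "name": name,
        "type": "AzureFunctionActivity",
        "linkedServiceName": {
            "referenceName": linked_service,
            "type": "LinkedServiceReference",
        },
        "typeProperties": type_properties,
    }


# Hypothetical names throughout; the function name comes from a pipeline parameter.
activity = azure_function_activity(
    name="ScoreBatch",
    linked_service="MyAzureFunctionLinkedService",
    function_name={"value": "@pipeline().parameters.functionName", "type": "Expression"},
    method="POST",
    headers={"Content-Type": "application/json"},
    body={"batchId": "example-batch"},
)
activity_json = json.dumps(activity, indent=2)
```

Keeping the function name in a pipeline parameter lets one pipeline definition call different functions per run without editing the activity itself.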

Our goal is to continue adding features and improve the usability of Data Factory tools. Get started building pipelines easily and quickly




An Azure Function orchestrates a real-time, serverless, big data pipeline

Although it’s not a typical use case for Azure Functions, a single Azure function is all it took to fully implement an end-to-end, real-time, mission-critical data pipeline for a fraud detection scenario. And it was done with a serverless architecture. Two blogs recently described this use case, “Considering Azure Functions for a serverless data streaming scenario,” and “A fast, serverless, big data pipeline powered by a single Azure Function.”

Pipeline requirements

A large bank wanted to build a solution to detect fraudulent transactions. The solution was built on an architectural pattern common for big data analytic pipelines, with massive volumes of real-time data ingested into a cloud service where a series of data transformation activities provided input for a machine learning model to deliver predictions. Latency and response times are critical in a fraud detection solution, so the pipeline had to be very fast and scalable. End-to-end evaluation of each transaction had to complete and provide a fraud assessment in less than two seconds.
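One way to reason about the two-second requirement is to treat it as a budget that the whole chain of stages must fit inside. The Python sketch below times a sequence of stages for a single transaction and reports whether the total stayed under the budget. The stages (`parse`, `features`, `score`) are hypothetical placeholders, not the bank's actual transformations or model.

```python
import json
import time
from typing import Any, Callable, Sequence, Tuple


def evaluate(
    transaction: Any,
    stages: Sequence[Callable[[Any], Any]],
    budget_seconds: float = 2.0,
    clock: Callable[[], float] = time.perf_counter,
) -> Tuple[Any, float, bool]:
    """Run every stage in order and report (result, elapsed seconds, within budget)."""
    start = clock()
    value = transaction
    for stage in stages:
        value = stage(value)
    elapsed = clock() - start
    return value, elapsed, elapsed < budget_seconds


def parse(raw: str) -> dict:
    # Placeholder for ingesting a JSON event.
    return json.loads(raw)


def features(tx: dict) -> dict:
    # Placeholder for the data transformation activities.
    return {"amount": tx["amount"], "is_large": tx["amount"] > 1000}


def score(feature_row: dict) -> float:
    # Placeholder rule standing in for a machine learning model.
    return 0.9 if feature_row["is_large"] else 0.1


result, elapsed, within_budget = evaluate('{"amount": 2500}', [parse, features, score])
```

Because the budget covers the end-to-end path, a slow stage anywhere in the chain consumes time that every later stage needs, which is why each step in such a pipeline has to be fast.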

Requirements for the pipeline included the following:

Ability to scale and efficiently process bursts of event activity totaling 8+ million transactions daily.

Daily parsing and processing of 4 million complex JSON files.

Events and transactions




Azure HDInsight integration with Data Lake Storage Gen2 preview – ACL and security update

Today we are sharing an update to the Azure HDInsight integration with Azure Data Lake Storage Gen2. This integration will enable HDInsight customers to drive analytics from the data stored in Azure Data Lake Storage Gen2 using popular open source frameworks such as Apache Spark, Hive, MapReduce, Kafka, Storm, and HBase in a secure manner.

Azure Data Lake Storage Gen2

Azure Data Lake Storage Gen2 is the only data lake designed specifically for enterprises to run large scale analytics workloads in the cloud. It unifies the core capabilities from the first generation of Azure Data Lake with a Hadoop compatible file system endpoint now directly integrated into Azure Blob Storage. This enhancement combines the scale and cost benefits of object storage with the reliability and performance typically associated only with on-premises file systems. This new file system includes a full hierarchical namespace that makes files and folders first class citizens, translating to faster, more reliable analytics job execution.
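To see why a hierarchical namespace can help analytics jobs, consider renaming a folder, something big data jobs commonly do when they move output from a temporary location to its final one. The Python sketch below is a toy model, not a description of how Azure storage is implemented: in a flat key space every object under the prefix has to be touched, while in a tree a single directory entry changes. All names and paths are made up for illustration.

```python
from typing import Dict, List


def rename_flat(store: Dict[str, bytes], old_prefix: str, new_prefix: str) -> int:
    """Rename a 'folder' in a flat key space; returns how many objects were touched."""
    moved = [key for key in store if key.startswith(old_prefix)]
    for key in moved:
        # Each object is re-keyed individually, so the cost grows with folder size.
        store[new_prefix + key[len(old_prefix):]] = store.pop(key)
    return len(moved)


def rename_hierarchical(root: dict, parent_path: List[str], old_name: str, new_name: str) -> int:
    """Rename a folder in a tree; one directory entry changes regardless of folder size."""
    parent = root
    for part in parent_path:
        parent = parent[part]
    parent[new_name] = parent.pop(old_name)
    return 1


flat = {
    "jobs/tmp/part-0": b"a",
    "jobs/tmp/part-1": b"b",
    "jobs/tmp/part-2": b"c",
}
tree = {"jobs": {"tmp": {"part-0": b"a", "part-1": b"b", "part-2": b"c"}}}

flat_touched = rename_flat(flat, "jobs/tmp/", "jobs/final/")
tree_touched = rename_hierarchical(tree, ["jobs"], "tmp", "final")
```

In the flat model the rename is also not a single step, so a failure partway through can leave files split between the old and new locations. That is one way to read the reliability benefit mentioned above.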

Azure Data Lake Storage Gen2 also includes limitless storage ensuring capacity to meet the needs of even the largest, most complex workloads. In addition, Azure Data Lake Storage Gen2 delivers on native integration with Azure Active Directory and support POSIX




Get up to speed with Azure HDInsight: The comprehensive guide

Azure HDInsight is an easy, cost-effective, enterprise-grade service for open source analytics. With HDInsight, you get managed clusters for various Apache big data technologies, such as Spark, MapReduce, Kafka, Hive, HBase, Storm, and ML Services, backed by a 99.9% SLA. In addition, you can take advantage of HDInsight’s rich ISV application ecosystem to tailor the solution for your specific scenario.

HDInsight covers a wide variety of big data technologies, and we have received many requests for a detailed guide. Whether you want to just get started with HDInsight, or become a Big Data expert, this post has you covered with all the latest resources.

Latest content

The HDInsight team has been working hard releasing new features, including the launch of HDInsight 4.0. We make major product announcements on the Azure HDInsight and Big Data blogs. Here is a selection of the most recent updates:

Launch of HDInsight 4.0 at Microsoft Ignite 2018 (Session Video)

Azure HDInsight brings next generation Apache Hadoop 3.0 and enterprise security to the cloud

Deep dive into Azure HDInsight 4.0

HDInsight Enterprise Security Package now generally available

Exciting new capabilities on Azure HDInsight

6-part best practice guide for on premises Hadoop to cloud migration

Azure Toolkit




Azure Data Lake Storage Gen2 preview – More features, more performance, better availability

Since we announced the limited public preview of Azure Data Lake Storage (ADLS) Gen2 in June, the response has been resounding. Customers participating in the ADLS Gen2 preview have directly benefitted from the scale, performance, security, manageability, and cost-effectiveness inherent in the ADLS Gen2 offering. Today, we are very pleased to announce significant updates to the preview that will deliver an even better experience for customers.

Today’s announcements include additional features that preview customers have been asking for:

Enterprise-class security features integrated into Azure Databricks and Azure HDInsight (available shortly)

Azure Storage Explorer support to view and manage data in ADLS Gen2 accounts, including data exploration and access control management

Support for connecting external tables in SQL Data Warehouse, including when Storage Firewalls are active on the account

Power BI and SQL Data Warehouse supporting the Common Data Model for entities stored in ADLS Gen2

Storage Firewall and Virtual Network rules integration for all analytics services

Encryption of data at rest using either Microsoft or customer supplied keys as well as encryption in transit via TLS 1.2

Ability to mount an ADLS Gen2 filesystem into the Databricks File System (DBFS)

Additionally, as of today, the ADLS Gen2 public preview is




Azure Stream Analytics on IoT Edge now generally available

Today, we are announcing the general availability of Azure Stream Analytics (ASA) on IoT Edge, empowering developers to deploy near-real-time analytical intelligence closer to IoT devices, unlocking the full value of device-generated data. With this release, Azure Stream Analytics enables developers to build truly hybrid architectures for stream processing, where device-specific or site-specific analytics can run on containers on IoT Edge and complement large scale cross-device analytics running in the cloud.

Why run stream analytics on the Edge?

Azure Stream Analytics on IoT Edge complements our cloud offering by unlocking the power and ease-of-use of Azure Stream Analytics (ASA) for new scenarios, such as:

Low-latency command and control: For example, manufacturing safety systems need to be able to respond to operational data with ultra-low latency. With ASA on IoT Edge, you can analyze sensor data in near real time and issue commands to stop a machine or trigger alerts when you detect anomalies.

Limited connectivity to the cloud: Mission critical systems, such as remote mining equipment, connected vessels, or offshore drilling, need to analyze and react to data even when cloud connectivity is intermittent. With ASA on IoT Edge, your streaming logic runs independently of the network connectivity and you




Considering Azure Functions for a serverless data streaming scenario

In the blog post “A fast, serverless, big data pipeline powered by a single Azure Function” we discussed a fraud detection solution delivered to a banking customer. This solution required complete processing of a streaming pipeline for telemetry data in real-time using a serverless architecture. This blog post describes the evaluation process and the decision to use Microsoft Azure Functions.


A large bank wanted to build a solution to detect fraudulent transactions submitted through its mobile banking channel. The solution is built on a common big data pipeline pattern where high volumes of real-time data are ingested into a cloud service and a series of data transformations and extraction activities occur. This results in the creation of a feature matrix and the use of advanced analytics. For the bank, the pipeline had to be very fast and scalable, allowing end-to-end evaluation of each transaction to finish in less than two seconds.

Pipeline requirements include the following:

Scalable and responsive to extreme bursts of ingested event activity. Up to 4 million events and 8 million plus transactions daily.

Events were ingested as complex JSON files, each containing from two to five individual bank transactions. Each JSON file had to be