Category Archives: Big Data

12

Jul

Lightning fast query performance with Azure SQL Data Warehouse

Azure SQL Data Warehouse is a fast, flexible and secure analytics platform for enterprises of all sizes. Today we announced significant query performance improvements for Azure SQL Data Warehouse (SQL DW) customers enabled through enhancements in the distributed query execution layer.

Analytics workload performance is determined by two major factors: I/O bandwidth to storage and repartitioning speed, also known as shuffle speed. In a previous blog post, we described how SQL DW caches relevant data to take advantage of NVMe-based local storage. In this blog post, we go under the hood of SQL DW to see how the shuffle speed has improved.

Data movement is an operation in which parts of the distributed tables are moved to different nodes during query execution. This operation is required when the data is not available on the target node, most commonly when the tables do not share the distribution key. The most common data movement operation is shuffle. During shuffle, for each input row, SQL DW computes a hash value using the join columns and then sends that row to the node that owns that hash value. Either one or both sides of the join can participate in the shuffle. The diagram below
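As a rough illustration of the shuffle step described above, the sketch below routes each row to a node chosen by hashing its join column. It is a simplified model, not SQL DW's implementation: the `shuffle` function, the use of `zlib.crc32` as the hash, the modulo ownership rule, and the sample tables are all assumptions made for the example.

```python
import zlib
from collections import defaultdict


def shuffle(rows, join_key, num_nodes):
    """Route each row to the node that owns the hash of its join-column value."""
    nodes = defaultdict(list)
    for row in rows:
        # crc32 is a stable stand-in hash (assumption); the hash function
        # SQL DW actually uses is not described in the post.
        h = zlib.crc32(str(row[join_key]).encode("utf-8"))
        nodes[h % num_nodes].append(row)
    return dict(nodes)


# Hypothetical tables that do not share a distribution key.
orders = [
    {"customer_id": 1, "amount": 10},
    {"customer_id": 2, "amount": 5},
    {"customer_id": 1, "amount": 7},
]
customers = [
    {"customer_id": 1, "name": "A"},
    {"customer_id": 2, "name": "B"},
]

# Shuffling both sides on the join column co-locates matching rows.
orders_by_node = shuffle(orders, "customer_id", num_nodes=4)
customers_by_node = shuffle(customers, "customer_id", num_nodes=4)
```

Because both sides are hashed on the same column with the same function, rows with equal `customer_id` values end up on the same node, so each node can join its own slice locally.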

11

Jul

Kafka 1.0 on HDInsight lights up real time analytics scenarios

Data engineers love Kafka on HDInsight as a high-throughput, low-latency ingestion platform in their real time data pipeline. They already leverage Kafka features such as message compression, configurable retention policy, and log compaction. With the release of Apache Kafka 1.0 on HDInsight, customers now get key features that make it easy to implement the most demanding scenarios. Here is a quick introduction:

Idempotent producers so that you don’t have to deduplicate

Consider a cellular billing system, in which the producer writes the amount of data consumed by users to a Kafka topic called data-consumption-events. If the broker or the connection fails, the producer will not get an acknowledgment of a message write and will retry that message. This will lead to duplicate writes to the system, causing users to be overbilled.

In critical scenarios like the one above, data engineers had to write and maintain custom deduplication logic, such as hashing and saving message IDs. However, with idempotent producers turned on, Kafka handles that logic for you. Records include a unique producer ID and the sequence number of the message. Kafka brokers will only accept a message from a producer if the sequence number is exactly one more than the committed sequence number
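The acceptance rule described above can be sketched with a toy in-memory broker. This is a simplified model and not Kafka's code: the `Broker` class, its `append` method, the boolean return value, and the assumption that sequence numbers start at 0 are all hypothetical choices for the illustration.

```python
class Broker:
    """Toy broker that accepts a record only if its sequence number is next in line."""

    def __init__(self):
        self.log = []
        self.last_seq = {}  # producer id -> last committed sequence number

    def append(self, producer_id, seq, payload):
        # Assumption: a producer's first record carries sequence number 0.
        expected = self.last_seq.get(producer_id, -1) + 1
        if seq != expected:
            # A retried duplicate or an out-of-order record is not written again.
            return False
        self.log.append(payload)
        self.last_seq[producer_id] = seq
        return True


broker = Broker()
first = broker.append("producer-1", 0, "user-42 consumed 5 MB")
# The acknowledgment was lost, so the producer retries the same record.
retry = broker.append("producer-1", 0, "user-42 consumed 5 MB")
second = broker.append("producer-1", 1, "user-42 consumed 3 MB")
```

In this sketch the retried record is rejected, so the log holds one entry per billing event and the user is not charged twice.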

10

Jul

Azure HDInsight now supports Apache Spark 2.3

Apache Spark 2.3.0 is now available for production use on the managed big data service Azure HDInsight. Ranging from bug fixes (more than 1400 tickets were fixed in this release) to new experimental features, Apache Spark 2.3.0 brings advancements and polish to all areas of its unified data platform.

Data engineers relying on Python UDFs get 10 to 100 times more speed, thanks to revamped object serialization between the Spark runtime and Python. Data scientists will be delighted by better integration of deep learning frameworks like TensorFlow with Spark machine learning pipelines. Business analysts will welcome the fast vectorized reader for the ORC file format, which finally makes interactive analytics in Spark practical over this popular columnar data format. Developers building real-time applications may be interested in experimenting with the new Continuous Processing mode in Spark Structured Streaming, which brings event processing latency down to the millisecond level.

Vectorized object serialization in Python UDFs

It is worth mentioning that PySpark is already fast and takes advantage of the vectorized data processing in the core Spark engine as long as you are using DataFrame APIs. This is good news as it represents the majority of the use cases if you follow best practices for

09

Jul

Power BI Embedded dashboards with Azure Stream Analytics

Azure Stream Analytics is a fully managed “serverless” PaaS service in Azure built for running real-time analytics on fast moving streams of data. Today, a significant portion of Stream Analytics customers use Power BI for real-time dynamic dashboarding. Support for Power BI Embedded has been a repeated ask from many of our customers, and today we are excited to share that it is now generally available.

What is Power BI Embedded?

Power BI Embedded simplifies how ISVs and developers can quickly add stunning visuals, reports, and dashboards to their apps. By enabling easy-to-navigate data exploration in their apps, ISVs help their customers make quick, informed decisions in context. This also enables faster time to market and competitive differentiation for all parties.

Additionally, Power BI Embedded enables users to work within the familiar development environments, Visual Studio or Azure.

Using Azure Stream Analytics with Power BI Embedded

Using Power BI with Azure Stream Analytics allows users of Power BI Embedded dashboards to easily visualize insights from streaming data within the context of the apps they use every day. With Power BI Embedded, users can also embed real-time dashboards right in their organization’s web apps.

No changes are required for your existing

03

Jul

IP filtering for Event Hubs and Service Bus

For scenarios in which Azure Event Hubs or Azure Service Bus should only be accessible from certain well-known sites, the IP Filter feature enables you to configure rules for accepting or rejecting traffic originating from specific IP addresses, for instance the addresses behind a corporate NAT gateway. The Azure team is happy to announce the public preview of IP Filtering for the Service Bus Premium and the Event Hubs Standard and Dedicated pricing plans.

This feature allows users to control which IPs are accessing their resources. Some characteristics of this feature:

Rules allow you to specify accept and reject actions on IP masks. The rules work with IPv4 addresses. Rules are applied to the namespace level. You can have multiple rules and they are applied in order. The first rule that matches the IP address determines the accept or reject action. Requests from IPs that are rejected receive an unauthorized response.
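The first-match evaluation described above can be sketched with Python's standard `ipaddress` module. This is only an illustration of the ordering semantics, not the service's implementation: the `evaluate` function, the rule tuples, the sample masks, and the default action used when no rule matches are all assumptions.

```python
import ipaddress


def evaluate(rules, client_ip, default="Accept"):
    """Return the action of the first rule whose IP mask contains client_ip."""
    addr = ipaddress.ip_address(client_ip)
    for mask, action in rules:
        if addr in ipaddress.ip_network(mask):
            return action
    # Assumption: the post does not say what happens when no rule matches.
    return default


# Hypothetical rule list, applied in order.
rules = [
    ("10.1.2.0/24", "Reject"),  # block one subnet
    ("10.0.0.0/8", "Accept"),   # allow the rest of the corporate range
    ("0.0.0.0/0", "Reject"),    # reject everything else
]

blocked_subnet = evaluate(rules, "10.1.2.7")
corporate = evaluate(rules, "10.9.9.9")
outside = evaluate(rules, "8.8.8.8")
```

Order matters here: `10.1.2.7` also falls inside `10.0.0.0/8`, but the narrower reject rule is listed first and therefore decides the outcome.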

Today these features are available in the Azure portal as shown in the screenshot. You can find them at the Event Hubs or Service Bus namespace level or via an ARM template.

The ARM template below shows how you can use this feature. The template takes the following parameters:

ipFilterRuleName

02

Jul

Azure Event Hubs and Service Bus VNET Service Endpoints in public preview

This blog was co-authored by Anitha Adusumilli, Principal Program Manager, Azure Networking and Sumeet Mittal, Program Manager, Azure Networking.

Azure Event Hubs, a highly reliable and easily scalable data streaming platform as a service (PaaS) offering, has been prolific this year, with new features such as Availability Zones and a big investment in open source by enabling support for Apache Kafka. Azure Service Bus, a feature-rich cloud messaging PaaS offering that also recently added support for Availability Zones, has been busy too. Today, both services are announcing a public preview of Virtual Network Service Endpoints.

This new feature adds to the security and control Azure customers have over their workload environments today. Now, traffic from your VNET to your Premium Service Bus namespaces and Standard or Dedicated Azure Event Hubs namespaces can be kept secure from public Internet access and completely private on the Azure backbone network.

Azure Event Hubs and Service Bus are joining the growing list of Azure services that have enabled Virtual Network Service Endpoints.

Important info

Offered with Dedicated and Standard Event Hubs pricing plans as well as Premium Service Bus.
The feature is offered at no cost, aside from the usual Event Hubs and

02

Jul

Monitor Azure Data Factory pipelines using Operations Management Suite

Data Integration solutions can be complex with many moving parts involving complex data factories with multiple pipelines. Monitoring provides data to ensure that your data factory pipelines stay up and running in a healthy state. It also helps you to stave off potential problems or troubleshoot past ones. In addition, you can use monitoring data to gain deep insights about your application. This knowledge can help you to improve application performance or maintainability, or automate actions that would otherwise require manual intervention.

Azure Data Factory (ADF) integration with Azure Monitor allows you to route your data factory metrics to Operations Management Suite (OMS). Now, you can monitor the health of your data factory pipelines using the ‘Azure Data Factory Analytics’ OMS service pack available in the Azure Marketplace.

The Azure Data Factory OMS pack gives you a summary of the overall health of your data factory, with options to drill into details and to troubleshoot unexpected behavior patterns. With rich, out-of-the-box views you can get insights into key processing, including:

At-a-glance summary of data factory pipeline, activity and trigger runs
Ability to drill into data factory activity runs by type
Summary of data factory top pipeline,

28

Jun

The emerging big data architectural pattern

Why lambda?

Lambda architecture is a popular pattern in building Big Data pipelines. It is designed to handle massive quantities of data by taking advantage of both a batch layer (also called cold layer) and a stream-processing layer (also called hot or speed layer).
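A minimal sketch of the two layers might look like the following. It is an illustration under stated assumptions rather than a reference design: the event shape, the `batch_view` function, the `SpeedView` class, and the `query` function that adds the two results together are all hypothetical, and merging the views at query time is an assumption this excerpt does not spell out.

```python
from collections import Counter


def batch_view(events):
    """Batch (cold) layer: recompute per-user counts over the full history."""
    return Counter(e["user"] for e in events)


class SpeedView:
    """Stream (hot or speed) layer: update counts as each new event arrives."""

    def __init__(self):
        self.counts = Counter()

    def on_event(self, event):
        self.counts[event["user"]] += 1


def query(batch, speed, user):
    # Assumption: a query combines the batch result with events the batch has not yet seen.
    return batch[user] + speed.counts[user]


# Historical events already processed by the last batch run.
history = [{"user": "a"}, {"user": "b"}, {"user": "a"}]
batch = batch_view(history)

# Events that arrived after the batch run.
speed = SpeedView()
speed.on_event({"user": "a"})
speed.on_event({"user": "c"})

total_a = query(batch, speed, "a")
total_c = query(batch, speed, "c")
```

The batch layer can afford to reprocess everything for accuracy, while the speed layer keeps results fresh between batch runs.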

The following are some of the reasons that have led to the popularity and success of the lambda architecture, particularly in big data processing pipelines.

Speed and business challenges

The ability to process data at high speed in a streaming context is necessary for operational needs, such as transaction processing and real-time reporting. Some examples are fault and fraud detection, connected and smart cars, factories, hospitals, and cities, sentiment analysis, inventory control, network and security monitoring, and many more.

Batch processing, which typically involves massive amounts of data and the related correlation and aggregation, is important for business reporting. It helps the business understand how it is performing, what the trends are, and what corrective or additive measures can be taken to improve the business or the customer experience.

Product challenges

One of the triggers that led to the very existence of the lambda architecture was the need to make the most of the available technology and tool set. Existing batch processing systems, such as data warehouses, data lakes, Spark/Hadoop, and more, could