Category Archives: Big Data

21 Mar

Data integration with ADLS Gen2 and Azure Data Explorer using Data Factory

Microsoft announced the general availability of Azure Data Lake Storage (ADLS) Gen2 and Azure Data Explorer in early February, making Azure one of the best clouds for analytics, with unmatched price-performance and security. Azure Data Factory (ADF) is a fully managed data integration service that empowers you to copy data from over 80 data sources with a simple drag-and-drop experience, and to operationalize and manage ETL/ELT flows with flexible control flow, rich monitoring, and continuous integration and continuous delivery (CI/CD) capabilities. In this blog post, we’re excited to update you on the latest integration in Azure Data Factory with ADLS Gen2 and Azure Data Explorer. You can now meet the advanced needs of your analytics workloads by leveraging these services.

Ingest and transform data with ADLS Gen2

Azure Data Lake Storage is a no-compromises data lake platform that combines the rich feature set of advanced data lake solutions with the economics, global scale, and enterprise-grade security of Azure Blob Storage. Our recent post provides a comprehensive insider view of this powerful service.

Azure Data Factory has supported ADLS Gen2 as a preview connector since the ADLS Gen2 limited public preview. Now the connector has also reached general availability along…
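To give a flavor of the connector in practice, here is a minimal sketch of an ADF linked service definition for ADLS Gen2 using service principal authentication; the AzureBlobFS connector type comes from the ADF documentation, while the account, tenant, and credential values are placeholders:

```json
{
    "name": "ADLSGen2LinkedService",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<account name>.dfs.core.windows.net",
            "servicePrincipalId": "<service principal id>",
            "servicePrincipalKey": {
                "type": "SecureString",
                "value": "<service principal key>"
            },
            "tenant": "<tenant id>"
        }
    }
}
```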


14 Mar

Now available for preview: Workload importance for Azure SQL Data Warehouse

Azure SQL Data Warehouse is a fast, flexible, and secure analytics platform for enterprises of all sizes. Today we are announcing the preview availability of workload importance on the Gen2 platform to help customers manage resources more efficiently. Workload importance gives data engineers the ability to classify requests by importance. Requests with higher importance are guaranteed quicker access to resources, which helps meet SLAs.

“More with less” is often the motto when it comes to operating data warehousing solutions. The ability to easily scale up compute resources gives data engineers tremendous flexibility. However, when there is budget pressure and scaling down is required, problems can arise. Workload importance allows high-business-value work to meet SLAs in a shared environment with fewer resources.

An example of workload importance is shown below. The CEO’s request was submitted last and classified with high importance. Because the CEO’s request has high importance, it is granted access to resources before the analyst requests, allowing it to complete sooner.

Get started now classifying requests with importance

Classifying requests is done with the new CREATE WORKLOAD CLASSIFIER syntax. Below is an example that maps the login for the ExecutiveReports role to ABOVE_NORMAL importance and…
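As a sketch of what that statement looks like, following the documented syntax (the classifier and workload group names here are illustrative):

```sql
-- Requests from members of the ExecutiveReports role are classified
-- with ABOVE_NORMAL importance (names are illustrative).
CREATE WORKLOAD CLASSIFIER ExecReportsClassifier
WITH (
    WORKLOAD_GROUP = 'mediumrc',
    MEMBERNAME     = 'ExecutiveReports',
    IMPORTANCE     = ABOVE_NORMAL
);
```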


13 Mar

Monitoring on HDInsight Part 1: An Overview

Azure HDInsight offers several ways to monitor your Hadoop, Spark, or Kafka clusters. Monitoring on HDInsight can be broken down into three main categories:

- Cluster health and availability
- Resource utilization and performance
- Job status and logs

Azure HDInsight offers two main monitoring tools: Apache Ambari, which is included with all HDInsight clusters, and optional integration with Azure Monitor logs, which can be enabled on all HDInsight clusters. While these tools contain some of the same information, each has advantages in certain scenarios. Read on for an overview of the best way to monitor various aspects of your HDInsight clusters using these tools.

Cluster health and availability

Azure HDInsight is a high-availability service with redundant gateway nodes, head nodes, and ZooKeeper nodes that keep your HDInsight clusters running smoothly. While this ensures that a single failure will not affect the functionality of a cluster, you may still want to monitor cluster health so you are alerted when an issue does arise. Monitoring cluster health means checking whether all nodes in your cluster, and the components that run on them, are available and functioning correctly. Ambari is the recommended way to monitor the health for any given HDInsight…
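As one hedged example, cluster and host health can also be polled programmatically through Ambari’s REST API, the same API that backs the Ambari UI; the cluster name, credentials, and printed fields below are placeholders:

```python
import requests

CLUSTER = "mycluster"  # placeholder HDInsight cluster name
URL = f"https://{CLUSTER}.azurehdinsight.net/api/v1/clusters/{CLUSTER}/hosts"

# Ask Ambari for the health status of every host in the cluster.
resp = requests.get(
    URL,
    params={"fields": "Hosts/host_status"},
    auth=("admin", "<cluster login password>"),  # placeholder credentials
)
resp.raise_for_status()

for item in resp.json()["items"]:
    host = item["Hosts"]
    print(host["host_name"], host["host_status"])
```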


05 Mar

Rerun activities inside your Azure Data Factory pipelines

Data integration is complex, with many moving parts. It helps organizations combine data and complex business processes in hybrid data environments. Failures are very common in data integration workflows: data may not arrive on time, there may be functional code issues in your pipelines, infrastructure may fail, and so on. A common requirement is the ability to rerun failed activities inside your data integration workflows. Sometimes you also want to rerun activities to reprocess the data because of an error upstream in data processing. Azure Data Factory now allows you to rerun activities inside your pipelines. You can rerun the entire pipeline, or choose to rerun downstream from a particular activity inside your data factory pipelines.

Simply navigate to the ‘Monitor’ section in the data factory user experience, select your pipeline run, click ‘View activity runs’ under the ‘Action’ column, select the activity, and click ‘Rerun from activity <activityname>’.

You can also view the rerun history for all your pipeline runs inside the data factory. Simply click on the toggle to ‘View All Rerun History’.

You can also view rerun history for a particular pipeline run by clicking ‘View Rerun History’ under the ‘Actions’ column. This allows…


04 Mar

Service Fabric Processor in public preview

Microsoft clients for Azure Event Hubs have always had two levels of abstraction. There is the low-level client, which includes event sender and receiver classes that allow maximum control by the application, but also force the application to understand the configuration of the Event Hub and maintain an event receiver connected to each partition. Built on top of that low-level client is a higher-level library, Event Processor Host, which hides most of those details for the receiving side. Event Processor Host automatically distributes ownership of Event Hub partitions across multiple host instances and delivers events to a processing method provided by the application.
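To make that “processing method” concrete, here is a rough sketch of the Event Processor Host pattern using the Microsoft.Azure.EventHubs.Processor types; the class name and handling logic are illustrative:

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;
using Microsoft.Azure.EventHubs;
using Microsoft.Azure.EventHubs.Processor;

// Event Processor Host calls this class with batches of events from
// whichever partitions this host instance currently owns.
class SampleEventProcessor : IEventProcessor
{
    public Task OpenAsync(PartitionContext context) => Task.CompletedTask;

    public Task CloseAsync(PartitionContext context, CloseReason reason) => Task.CompletedTask;

    public Task ProcessErrorAsync(PartitionContext context, Exception error) => Task.CompletedTask;

    public async Task ProcessEventsAsync(PartitionContext context, IEnumerable<EventData> events)
    {
        foreach (var e in events)
        {
            // Application-specific handling goes here.
        }
        await context.CheckpointAsync();  // record progress for this partition
    }
}
```

Registering such a class with an EventProcessorHost instance is what triggers the automatic partition distribution described above.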

Service Fabric is another Microsoft-provided library: a generalized framework for dividing an application into shards and distributing those shards across multiple compute nodes. Many customers use Service Fabric for their applications, and some of those applications need to receive events from an Event Hub. It is possible to use Event Processor Host within a Service Fabric application, but doing so is inelegant and redundant. The combination means that there are two separate layers attempting to distribute load across nodes, and neither one is aware of the other. It also introduces a dependency on…


20 Feb

Use GraphQL with Hasura and Azure Database for PostgreSQL

Azure Database for PostgreSQL provides a fully managed, enterprise-ready community PostgreSQL database as a service. The PostgreSQL community edition helps you easily migrate existing apps to the cloud or develop cloud-native applications, using languages and frameworks of your choice. The service offers industry-leading innovations such as built-in high availability, backed by a 99.99 percent SLA, without the need to set up replicas, enabling customers to save over two times the cost. It also allows customers to scale compute up or down in seconds, helping you easily adjust to changes in workload demands.

Additionally, built-in intelligent features such as Query Performance Insight and performance recommendations help customers further lower their total cost of ownership (TCO) by providing customized recommendations and insights to optimize the performance of their Postgres databases. These benefits, coupled with unparalleled security and compliance, Microsoft Azure’s industry-leading global reach, and Azure IP Advantage, empower customers to focus on their business and applications rather than the database.

As part of the broader Postgres community, our aim is to contribute to and partner with others in the community to bring new features to Azure Database for PostgreSQL users. You can now take advantage of the Hasura GraphQL…
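For a flavor of what this enables, Hasura auto-generates a GraphQL schema over your Postgres tables, so a query against a hypothetical authors table (all names here are illustrative) could look like:

```graphql
# Fetch authors together with their articles; the authors and
# articles tables and their columns are illustrative.
query AuthorsWithArticles {
  authors {
    id
    name
    articles {
      title
      published_on
    }
  }
}
```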


13 Feb

Anomaly detection using built-in machine learning models in Azure Stream Analytics

Built-in machine learning (ML) models for anomaly detection in Azure Stream Analytics significantly reduce the complexity and costs associated with building and training machine learning models. This feature is now available for public preview worldwide.

What is Azure Stream Analytics?

Azure Stream Analytics is a fully managed, serverless PaaS offering on Azure that enables customers to analyze and process fast-moving streams of data and deliver real-time insights for mission-critical scenarios. Developers can use a simple SQL language (extensible to include custom code) to author and deploy powerful analytics processing logic that can scale up and scale out to deliver insights with millisecond latencies.
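As a minimal sketch of that SQL dialect, the following job query averages a hypothetical device telemetry stream over 30-second tumbling windows; the input, output, and column names are placeholders:

```sql
-- Average temperature per device over 30-second tumbling windows
-- (input, output, and column names are illustrative).
SELECT
    deviceId,
    AVG(temperature) AS avgTemperature,
    System.Timestamp() AS windowEnd
INTO output
FROM input TIMESTAMP BY eventTime
GROUP BY deviceId, TumblingWindow(second, 30)
```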

Traditional way to incorporate anomaly detection capabilities in stream processing

Many customers use Azure Stream Analytics to continuously monitor massive amounts of fast-moving streams of data in order to detect issues that do not conform to expected patterns and to prevent catastrophic losses. This, in essence, is anomaly detection.

For anomaly detection, customers traditionally relied either on the sub-optimal method of hard-coding control limits in their queries, or on custom machine learning models. Developing custom learning models requires not only time, but also high levels of data science expertise along with nuanced data pipeline engineering skills. Such…
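The built-in models remove that burden. As a hedged sketch of how they surface in the query language, the documented AnomalyDetection_SpikeAndDip function can be called directly over a sliding window (stream and column names are illustrative):

```sql
-- Score each reading for spikes and dips, training on up to 120
-- events within a 120-second sliding window at 95% confidence
-- (stream and column names are illustrative).
SELECT
    eventTime,
    reading,
    AnomalyDetection_SpikeAndDip(reading, 95, 120, 'spikesanddips')
        OVER (LIMIT DURATION(second, 120)) AS scores
FROM input TIMESTAMP BY eventTime
```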


12 Feb

Maximize throughput with repartitioning in Azure Stream Analytics

Customers love Azure Stream Analytics for the ease with which they can analyze streams of data in motion, with the ability to set up a running pipeline within five minutes. Optimizing throughput has always been a challenge when trying to achieve high performance in a scenario that can’t be fully parallelized. This occurs when you don’t control the partition key of the input stream, or your source “sprays” input across multiple partitions that later need to be merged. You can now use a new extension of Azure Stream Analytics SQL to specify the number of partitions of a stream when reshuffling the data. This new capability unlocks performance and helps maximize throughput in such scenarios.

The new extension of Azure Stream Analytics SQL includes the keyword INTO, which allows you to specify the number of partitions for a stream when reshuffling with a PARTITION BY statement. This new keyword, and the functionality it provides, is a key feature for achieving high-performance throughput in the above scenarios, as well as for better controlling data streams after a shuffle. To learn more about what’s new in Azure Stream Analytics, please see “Eight new features in Azure Stream Analytics.”
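A hedged sketch of the shape this takes, following the documented syntax (stream and key names are illustrative): the input stream is reshuffled into ten partitions keyed on DeviceID before the aggregation runs.

```sql
-- Reshuffle the input stream into 10 partitions on DeviceID, then
-- aggregate each partition independently (names are illustrative).
WITH RepartitionedInput AS
(
    SELECT *
    FROM input
    PARTITION BY DeviceID
    INTO 10
)
SELECT DeviceID, COUNT(*) AS eventCount
INTO output
FROM RepartitionedInput
GROUP BY DeviceID, TumblingWindow(minute, 1)
```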

What is repartitioning?


11 Feb

Controlling costs in Azure Data Explorer using down-sampling and aggregation

Azure Data Explorer (ADX) is an outstanding service for continuous ingestion and storage of high-velocity telemetry data from cloud services and IoT devices. Leveraging its first-rate performance for querying billions of records, the telemetry data can be further analyzed for various insights such as monitoring service health, production processes, and usage trends. Depending on data velocity and retention policy, data size can rapidly scale to petabytes and increase the costs associated with data storage. A common solution for storing large datasets for a long period of time is to store the data at differing resolutions: the most recent data is stored at maximum resolution, meaning all events are stored in raw format, while historic data is stored at reduced resolution, filtered and/or aggregated. This solution is often used in time series databases to control hot storage costs.

In this blog, I’ll use the GitHub events public dataset as the playground. To learn how to stream GitHub events into your own ADX cluster, read the blog “Exploring GitHub events with Azure Data Explorer.” I’ll describe how ADX users can take advantage of stored functions, the “.set-or-append” command, and the Microsoft Flow…
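To give the idea, here is a hedged sketch of such a down-sampling step in Kusto, using the “.set-or-append” command mentioned above; the table and column names are illustrative. The command appends hourly roll-ups to a down-sampled table, creating the table on the first run.

```kusto
// Append hourly event counts for one day of history into a
// down-sampled table; .set-or-append creates the target table
// if it doesn't exist. Table and column names are illustrative.
.set-or-append GitHubEventsHourly <|
    GitHubEvents
    | where CreatedAt between (ago(2d) .. ago(1d))
    | summarize EventCount = count() by EventType, bin(CreatedAt, 1h)
```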


11 Feb

Get started quickly using templates in Azure Data Factory

Cloud data integration helps organizations integrate data of various forms and unify complex processes in a hybrid data environment. Different organizations often have similar data integration needs and repeat the same business processes. Data engineers and data developers in these organizations want to get started quickly with building data integration solutions and avoid building the same workflows repeatedly. Today, we are announcing support for templates in Azure Data Factory (ADF) to help you get started quickly with building data factory pipelines, improve developer productivity, and reduce development time for repeat processes. The template feature provides a ‘Template gallery’ containing use-case-based templates, data movement templates, SSIS templates, and transformation templates that you can use to get hands-on with building your data factory pipelines.

Simply click ‘Create pipeline from template’ on the Overview page, or click ‘+’ -> ‘Pipeline from template’ on the Author page in your data factory UX, to get started.

Select any template from the gallery and provide the necessary inputs to use it. You can also read a detailed description of the template or visualize the end-to-end data factory pipeline.

You can also create new connections…
