Category Archives: Big Data

26 Mar

Blob storage interface on Data Box is now generally available

The blob storage interface on the Data Box has been in preview since September 2018, and we are happy to announce that it’s now generally available. This is in addition to the server message block (SMB) and network file system (NFS) interfaces already generally available on the Data Box.

The blob storage interface allows you to copy data into the Data Box via REST. In essence, this interface makes the Data Box appear like an Azure storage account. Applications that write to Azure blob storage can be configured to work with the Azure Data Box in exactly the same way. 
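For example, because the device presents itself like a storage account, a blob SDK client can simply be pointed at the Data Box endpoint instead of the cloud endpoint. Below is a minimal sketch using the Python azure-storage-blob package; the endpoint URL, container name, file name, and account key are placeholders for the values shown in the Data Box local web UI.

```python
from azure.storage.blob import BlobServiceClient

# Placeholders: the Data Box local web UI shows the device's blob endpoint,
# storage account name, and access key to use here.
service = BlobServiceClient(
    account_url="https://<storage-account>.blob.<databox-device-endpoint>",
    credential="<account-key>",
)

container = service.get_container_client("migration-staging")

# Upload a local file to the Data Box exactly as you would to Azure Blob storage.
with open("part-00000.parquet", "rb") as data:
    container.upload_blob(
        name="hdfs-export/part-00000.parquet", data=data, overwrite=True
    )
```

When the device is shipped back and ingested, the data lands in the target Azure storage account, so the same code works against the cloud endpoint afterwards.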

This enables very interesting scenarios, especially for big data workloads. Migrating large HDFS stores to Azure as part of an Apache Hadoop® migration is a popular ask. Using the blob storage interface of the Data Box, you can now easily point common copy tools like DistCp directly at the Data Box and access it as though it were another HDFS file system! Since most Hadoop installations come pre-loaded with the Azure Storage driver, you most likely will not have to make changes to your existing infrastructure to use this capability. Another key benefit of migrating via the blob storage…


25 Mar

Clean up files by built-in delete activity in Azure Data Factory

Azure Data Factory (ADF) is a fully-managed data integration service in Azure that allows you to iteratively build, orchestrate, and monitor your Extract Transform Load (ETL) workflows. Over the course of a data integration process, you will need to periodically clean up files from on-premises or cloud storage when the files become out of date. For example, you may have a staging area or landing zone, which is an intermediate storage area used for data processing during your ETL process. The data staging area sits between the data source stores and the data destination store. Given that data in the staging area is transient by nature, you need to periodically clean it up after the ETL process has been completed.

We are excited to share the ADF built-in delete activity, which can be part of your ETL workflow to delete undesired files without writing code. You can use ADF to delete folders or files from Azure Blob Storage, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, File System, FTP Server, SFTP Server, and Amazon S3.

You can find the ADF delete activity under the “Move & Transform” section in the ADF UI to get…
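For readers who author pipelines as JSON rather than through the UI, here is a hedged sketch of roughly what a delete activity definition looks like; the activity and dataset names are hypothetical, and the exact property set should be checked against the Delete activity documentation.

```python
import json

# Rough shape of a Delete activity inside a pipeline definition (illustrative only).
# "StagingFolderDataset" is a hypothetical dataset pointing at the staging folder.
delete_activity = {
    "name": "CleanUpStagingArea",
    "type": "Delete",
    "typeProperties": {
        "dataset": {"referenceName": "StagingFolderDataset", "type": "DatasetReference"},
        "recursive": True,        # delete the folder contents recursively
        "enableLogging": False,   # set True (plus log settings) to record what was deleted
    },
}

print(json.dumps(delete_activity, indent=2))
```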


25 Mar

Incrementally copy new files by LastModifiedDate with Azure Data Factory

Azure Data Factory (ADF) is the fully-managed data integration service for analytics workloads in Azure. Using ADF, users can load the lake from more than 80 data sources, on-premises and in the cloud, use a rich set of transform activities to prep, cleanse, and process the data with Azure analytics engines, and land the curated data into a data warehouse to deliver analytics and insights.

When you start to build an end-to-end data integration flow, the first challenge is extracting data from different data stores, and incremental (or delta) loading after an initial full load is the widely used pattern at this stage. Now, ADF provides a new capability for you to incrementally copy only new or changed files, by LastModifiedDate, from a file-based store. With this new feature, you do not need to partition the data into time-based folders or file names. New or changed files are automatically selected by their LastModifiedDate metadata and copied to the destination store.

The feature is available when loading data from Azure Blob Storage, Azure Data Lake Storage Gen1, Azure Data Lake Storage Gen2, Amazon S3, File System, SFTP, and HDFS.
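As a hedged illustration of how the filter surfaces in a copy activity’s source, the sketch below shows blob read settings with modifiedDatetimeStart/modifiedDatetimeEnd; the exact property placement varies by connector and dataset type, so treat the shape as an assumption and confirm it against the connector documentation.

```python
import json

# Illustrative only: a copy-activity source that picks up files whose
# LastModifiedDate falls inside a window. Property placement may differ
# per connector/dataset version.
copy_source = {
    "type": "BinarySource",
    "storeSettings": {
        "type": "AzureBlobStorageReadSettings",
        "recursive": True,
        "modifiedDatetimeStart": "2019-03-24T00:00:00Z",  # files modified at/after this time
        "modifiedDatetimeEnd": "2019-03-25T00:00:00Z",    # ...and before this time
    },
}

print(json.dumps(copy_source, indent=2))
```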

The resources for this feature are…


21 Mar

Data integration with ADLS Gen2 and Azure Data Explorer using Data Factory

Microsoft announced the general availability of Azure Data Lake Storage (ADLS) Gen2 and Azure Data Explorer in early February, which arms Azure with unmatched price-performance and security as one of the best clouds for analytics. Azure Data Factory (ADF) is a fully-managed data integration service that empowers you to copy data from over 80 data sources with a simple drag-and-drop experience, and to operationalize and manage ETL/ELT flows with flexible control flow, rich monitoring, and continuous integration and continuous delivery (CI/CD) capabilities. In this blog post, we’re excited to update you on the latest integration in Azure Data Factory with ADLS Gen2 and Azure Data Explorer. You can now meet the advanced needs of your analytics workloads by leveraging these services.

Ingest and transform data with ADLS Gen2

Azure Data Lake Storage is a no-compromises data lake platform that combines the rich feature set of advanced data lake solutions with the economics, global scale, and enterprise-grade security of Azure Blob Storage. Our recent post provides you with a comprehensive insider view on this powerful service.

Azure Data Factory has supported ADLS Gen2 as a preview connector since the ADLS Gen2 limited public preview. Now the connector has also reached general availability along…
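To make the connector concrete, here is a hedged sketch of an ADLS Gen2 linked service definition (ADF uses the type name AzureBlobFS for ADLS Gen2); the service name, account URL, and key are placeholders, and account-key authentication is only one of the supported options.

```python
import json

# Illustrative ADLS Gen2 linked service for Azure Data Factory.
# Placeholders: storage account URL and key; service principal or managed
# identity authentication can be used instead of an account key.
adls_gen2_linked_service = {
    "name": "ADLSGen2LinkedService",
    "properties": {
        "type": "AzureBlobFS",
        "typeProperties": {
            "url": "https://<storage-account>.dfs.core.windows.net",
            "accountKey": {"type": "SecureString", "value": "<account-key>"},
        },
    },
}

print(json.dumps(adls_gen2_linked_service, indent=2))
```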


14 Mar

Now available for preview: Workload importance for Azure SQL Data Warehouse

Azure SQL Data Warehouse is a fast, flexible, and secure analytics platform for enterprises of all sizes. Today we are announcing the preview availability of workload importance on the Gen2 platform to help customers manage resources more efficiently. Workload importance gives data engineers the ability to classify requests by importance. Requests with higher importance are guaranteed quicker access to resources, which helps meet SLAs.

“More with less” is often the motto when it comes to operating data warehousing solutions. The ability to easily scale up compute resources gives data engineers tremendous flexibility. However, when there is budget pressure and scaling down is required, problems can arise. Workload importance allows high-business-value work to meet SLAs in a shared environment with fewer resources.

An example of workload importance is shown below. The CEO’s request was submitted last and classified with high importance. Because the CEO’s request has high importance, it is granted access to resources before the Analyst requests, allowing it to complete sooner.

Get started now classifying requests with importance

Classifying requests is done with the new CREATE WORKLOAD CLASSIFIER syntax. Below is an example that maps the login for the ExecutiveReports role to ABOVE_NORMAL importance and…
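As a hedged sketch of what such a classifier could look like when issued from Python with pyodbc, the statement below maps the ExecutiveReports login to ABOVE_NORMAL importance; the classifier name, workload group, and connection details are illustrative placeholders, so check the CREATE WORKLOAD CLASSIFIER documentation for the exact options available to you.

```python
import pyodbc

# Placeholders: server, database, and credentials for your SQL Data Warehouse.
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};"
    "SERVER=<server>.database.windows.net;DATABASE=<warehouse>;"
    "UID=<user>;PWD=<password>"
)

# Map the ExecutiveReports login to ABOVE_NORMAL importance.
# The workload group shown ('xlargerc') is illustrative.
conn.execute(
    """
    CREATE WORKLOAD CLASSIFIER ExecReportsClassifier
    WITH (WORKLOAD_GROUP = 'xlargerc',
          MEMBERNAME     = 'ExecutiveReports',
          IMPORTANCE     = ABOVE_NORMAL);
    """
)
conn.commit()
```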


13 Mar

Monitoring on HDInsight Part 1: An Overview

Azure HDInsight offers several ways to monitor your Hadoop, Spark, or Kafka clusters. Monitoring on HDInsight can be broken down into three main categories:

- Cluster health and availability
- Resource utilization and performance
- Job status and logs

Two main monitoring tools are offered on Azure HDInsight: Apache Ambari, which is included with all HDInsight clusters, and optional integration with Azure Monitor logs, which can be enabled on all HDInsight clusters. While these tools contain some of the same information, each has advantages in certain scenarios. Read on for an overview of the best way to monitor various aspects of your HDInsight clusters using these tools.

Cluster health and availability

Azure HDInsight is a high-availability service that has redundant gateway nodes, head nodes, and ZooKeeper nodes to keep your HDInsight clusters running smoothly. While this ensures that a single failure will not affect the functionality of a cluster, you may still want to monitor cluster health so you are alerted when an issue does arise. Monitoring cluster health refers to monitoring whether all nodes in your cluster, and the components that run on them, are available and functioning correctly. Ambari is the recommended way to monitor the health of any given HDInsight cluster…
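Ambari exposes a REST API in addition to its UI, so basic host health can also be pulled programmatically. The sketch below, with a placeholder cluster name and login, queries the hosts endpoint as one possible way to check node state; treat the exact fields requested as an assumption.

```python
import requests

# Placeholders: HDInsight cluster name and the Ambari (cluster login) credentials.
CLUSTER = "<clustername>"

resp = requests.get(
    f"https://{CLUSTER}.azurehdinsight.net/api/v1/clusters/{CLUSTER}/hosts",
    params={"fields": "Hosts/host_state"},
    auth=("admin", "<cluster-login-password>"),
)
resp.raise_for_status()

# Print each node and its reported state (e.g. HEALTHY / UNHEALTHY / HEARTBEAT_LOST).
for host in resp.json()["items"]:
    print(host["Hosts"]["host_name"], host["Hosts"]["host_state"])
```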


05 Mar

Rerun activities inside your Azure Data Factory pipelines

Data integration is complex, with many moving parts. It helps organizations combine data and complex business processes in hybrid data environments. Failures are very common in data integration workflows: data not arriving on time, functional code issues in your pipelines, infrastructure issues, and so on. A common requirement is the ability to rerun failed activities inside your data integration workflows. In addition, sometimes you want to rerun activities to re-process the data due to an error upstream in data processing. Azure Data Factory now allows you to rerun activities inside your pipelines. You can rerun the entire pipeline or choose to rerun downstream from a particular activity inside your data factory pipelines.

Simply navigate to the ‘Monitor’ section in the data factory user experience, select your pipeline run, click ‘View activity runs’ under the ‘Action’ column, select the activity, and click ‘Rerun from activity <activityname>’.

You can also view the rerun history for all your pipeline runs inside the data factory. Simply click on the toggle to ‘View All Rerun History’.

You can also view rerun history for a particular pipeline run by clicking ‘View Rerun History’ under the ‘Actions’ column. This allows…
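For automation scenarios, the same rerun behavior is reachable outside the UI through the Data Factory createRun REST call. The sketch below is a hedged reading of that API: the query parameters referencePipelineRunId, isRecovery, and startActivityName are how I understand rerun-from-activity to be expressed, and all IDs, names, and the bearer token are placeholders, so verify the parameters against the Data Factory REST reference.

```python
import requests

# Placeholders throughout: subscription, resource group, factory, pipeline,
# the failed run's ID, the activity to restart from, and an Azure AD token.
url = (
    "https://management.azure.com/subscriptions/<subscription-id>"
    "/resourceGroups/<resource-group>/providers/Microsoft.DataFactory"
    "/factories/<factory-name>/pipelines/<pipeline-name>/createRun"
)

resp = requests.post(
    url,
    params={
        "api-version": "2018-06-01",
        "referencePipelineRunId": "<failed-run-id>",   # the run being rerun
        "isRecovery": "true",                          # reuse state from that run
        "startActivityName": "<activityname>",         # rerun downstream from here
    },
    headers={"Authorization": "Bearer <aad-token>"},
    json={},
)
resp.raise_for_status()
print(resp.json()["runId"])
```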


04 Mar

Service Fabric Processor in public preview

Microsoft clients for Azure Event Hubs have always had two levels of abstraction. There is the low-level client, which includes event sender and receiver classes that allow for maximum control by the application but also force the application to understand the configuration of the Event Hub and maintain an event receiver connected to each partition. Built on top of that low-level client is a higher-level library, Event Processor Host, which hides most of those details for the receiving side. Event Processor Host automatically distributes ownership of Event Hub partitions across multiple host instances and delivers events to a processing method provided by the application.

Service Fabric is another Microsoft-provided offering: a generalized framework for dividing an application into shards and distributing those shards across multiple compute nodes. Many customers are using Service Fabric for their applications, and some of those applications need to receive events from an Event Hub. It is possible to use Event Processor Host within a Service Fabric application, but doing so is inelegant and redundant. The combination means that there are two separate layers attempting to distribute load across nodes, and neither one is aware of the other. It also introduces a dependency on…
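For readers more familiar with the Python SDK than the .NET libraries discussed here, the same higher-level pattern exists there as well: a consumer client that fans events from all partitions into a single callback, much like Event Processor Host’s processing method. A minimal sketch with azure-eventhub follows (connection string and hub name are placeholders, and this is an analogy, not the Service Fabric Processor library itself).

```python
from azure.eventhub import EventHubConsumerClient

def on_event(partition_context, event):
    # Called for events from every partition this client ends up owning.
    # A production processor would also checkpoint progress (e.g. with a
    # blob-backed checkpoint store) so another instance can take over.
    print(f"partition {partition_context.partition_id}: {event.body_as_str()}")

client = EventHubConsumerClient.from_connection_string(
    conn_str="<event-hubs-connection-string>",
    consumer_group="$Default",
    eventhub_name="<hub-name>",
)

with client:
    # Blocks and dispatches events to on_event; start from the beginning of the stream.
    client.receive(on_event=on_event, starting_position="-1")
```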


20 Feb

Use GraphQL with Hasura and Azure Database for PostgreSQL

Azure Database for PostgreSQL provides a fully managed, enterprise-ready community PostgreSQL database as a service. The PostgreSQL community edition helps you easily migrate existing apps to the cloud or develop cloud-native applications, using languages and frameworks of your choice. The service offers industry-leading innovations such as built-in high availability, backed by a 99.99 percent SLA, without the need to set up replicas, enabling customers to save over two times the cost. It also allows customers to scale compute up or down in seconds, helping you easily adjust to changes in workload demands.

Additionally, built-in intelligent features such as Query Performance Insight and performance recommendations help customers further lower their total cost of ownership (TCO) by providing customized recommendations and insights to optimize the performance of their Postgres databases. These benefits, coupled with unparalleled security and compliance, Microsoft Azure’s industry-leading global reach, and Azure IP Advantage, empower customers to focus on their business and applications rather than the database.

As part of the broader Postgres community, our aim is to contribute to and partner with others in the community to bring new features to Azure Database for PostgreSQL users. You can now take advantage of the Hasura GraphQL Engine…


13 Feb

Anomaly detection using built-in machine learning models in Azure Stream Analytics

Built-in machine learning (ML) models for anomaly detection in Azure Stream Analytics significantly reduce the complexity and costs associated with building and training machine learning models. This feature is now available for public preview worldwide.

What is Azure Stream Analytics?

Azure Stream Analytics is a fully managed, serverless PaaS offering on Azure that enables customers to analyze and process fast-moving streams of data and deliver real-time insights for mission-critical scenarios. Developers can use a simple SQL language (extensible to include custom code) to author and deploy powerful analytics processing logic that can scale up and scale out to deliver insights with millisecond latencies.

Traditional way to incorporate anomaly detection capabilities in stream processing

Many customers use Azure Stream Analytics to continuously monitor massive amounts of fast-moving streams of data in order to detect issues that do not conform to expected patterns and prevent catastrophic losses. This in essence is anomaly detection.

For anomaly detection, customers have traditionally relied either on sub-optimal methods of hard-coding control limits in their queries, or on custom machine learning models. Development of custom machine learning models not only requires time, but also high levels of data science expertise along with nuanced data pipeline engineering skills. Such…
