Category Archives : Big Data



Microsoft Azure Data welcomes attendees to ACM SIGMOD/PODS 2018

Hello SIGMOD attendees!

Welcome to Houston, and to what is shaping up to be a great conference.  I wanted to take this opportunity to share with you some of the exciting work in data that’s going on in the Azure Data team at Microsoft, and to invite you to take a closer look.

Microsoft has long been a leader in database management with SQL Server, recognized as the top DBMS by Gartner for the past three years in a row.  The emergence of the cloud and edge as the new frontiers for computing, and thus data management, is an exciting direction—data is now dispersed within and beyond the enterprise, on-prem, on-cloud, and on edge devices, and we must enable intelligent analysis, transactions, and responsible governance for all data everywhere, from the moment it is created to the moment it is deleted, through the entire life-cycle of ingestion, updates, exploration, data prep, analysis, serving, and archival. 

These trends require us to fundamentally rethink data management. Transactional replication can span continents. Data is not just relational. Interactive, real-time, and streaming applications with enterprise-level SLAs are becoming common. Machine learning is a foundational analytic task and must be supported while ensuring that




Process more files than ever and use Parquet with Azure Data Lake Analytics

Azure Data Lake Analytics (ADLA) is a serverless PaaS service in Azure to prepare and transform large amounts of data stored in Azure Data Lake Store or Azure Blob Storage at unparalleled scale.

ADLA now offers new capabilities for processing files of any format, including Parquet, at tremendous scale.

Previously: Handling tens of thousands of files is painful!

Many of our customers tell us that handling a large number of files is challenging, if not downright painful, in all the big data systems that they have tried. Figure 1 shows the distribution of files in common data lake systems. Most files are less than one GB, although a few may be huge.

Figure 1: The pain of many small files

ADLA was developed from a system that was originally designed to operate on very large files whose internal structure helps with scale-out, but it only operated on a couple of hundred to about 3,000 files. It also over-allocated resources when processing small files by giving one extract vertex to a file (a vertex is a compute container that will execute a specific part of the script on a partition of the data and




Azure Data Lake Tools for VSCode supports Azure blob storage integration

We are pleased to announce the integration of VSCode explorer with Azure blob storage. If you are a data scientist who wants to explore the data in your Azure blob storage, or a developer who wants to access and manage your Azure blob storage files, please try the Data Lake Explorer blob storage integration. The Data Lake Explorer lets you easily navigate to your blob storage and access and manage your blob containers, folders, and files.

Summary of new features

Blob container – Refresh, Delete Blob Container and Upload Blob 

Folder in blob – Refresh and Upload Blob 

File in blob – Preview/Edit, Download, Delete, Create EXTRACT Script (only available for CSV, TSV and TXT files), as well as Copy Relative Path, and Copy Full Path

How to install or update

Install Visual Studio Code and download Mono 4.2.x (for Linux and Mac). Then get the latest Azure Data Lake Tools by going to the VSCode Extension repository or the VSCode Marketplace and searching for Azure Data Lake Tools.

For more information about Azure Data Lake Tools for VSCode, please use




Gain application insights for Big Data solutions using Unravel data on Azure HDInsight

Unravel on HDInsight enables developers and IT Admins to manage performance, auto scaling & cost optimization better than ever.

We are pleased to announce Unravel on the Azure HDInsight Application Platform. Azure HDInsight is a fully managed open-source big data analytics service for enterprises. You can use popular open-source frameworks (Hadoop, Spark, LLAP, Kafka, HBase, etc.) to cover a broad range of scenarios such as ETL, data warehousing, machine learning, IoT, and more. Unravel provides comprehensive application performance management (APM) for these scenarios. The application helps customers analyze, optimize, and troubleshoot application performance issues and meet SLAs with little friction. Some customers report up to 200 percent more jobs at 50 percent lower cost using Unravel’s tuning capability on HDInsight.

To learn more, please join Pranav Rastogi, Program Manager at Microsoft Azure Big Data, and Shivnath Babu, CTO at Unravel, in a webinar on June 13 on how to build fast and reliable big data apps on Azure while keeping cloud expenses within your budget.

How complex is guaranteeing an SLA on a Big Data solution?

The inherent complexity of big data systems, disparate set of tools for monitoring, and lack of expertise in optimizing




An update on the integration of Avere Systems into the Azure family

It has been three months since we closed on the acquisition of Avere Systems. Since that time, we’ve been hard at work integrating the Avere and Microsoft families, growing our presence in Pittsburgh and meeting with customers and partners at The National Association of Broadcasters Show.

It’s been exciting to hear how Avere has helped businesses address a broad range of compute and data challenges, helping produce blockbuster movies and life-saving drug therapies faster than ever before with hybrid and public cloud options. I’ve also appreciated having the opportunity to address our customers’ questions and concerns, and thought it might be helpful to share the most common ones with the broader Azure/Avere community:

When will Avere be available on Microsoft Azure?

We are on track to release Microsoft Avere vFXT to the Azure Marketplace later this year. With this technology, Azure customers will be able to run compute-intensive applications completely on Azure or to take advantage of our scale on an as-needed basis.

Will Microsoft continue to support the Avere FXT physical appliance?

Yes, we will continue to invest in, upgrade, and support the Microsoft Avere FXT physical appliance, which customers tell us is particularly important for their on-premises and hybrid environments.




Enhance productivity using Azure Data Factory Visual Tools

With Azure Data Factory (ADF) visual tools, we listened to your feedback and enabled a rich, interactive visual authoring and monitoring experience. It allows you to iteratively create, configure, test, deploy, and monitor data integration pipelines without any friction. The main goal of the ADF visual tools is to make you productive with ADF by getting pipelines up and running quickly without requiring you to write a single line of code.

We continue to add new features to increase productivity and efficiency for both new and advanced users with intuitive experiences. You can get started by clicking the Author and Monitor tile in your provisioned v2 data factory blade.

Check out some of the exciting new features enabled with data factory visual tools since public preview (January 2018):

Latest data factory updates

Follow exciting new updates to the data factory service.

View Data Factory deployment region, and resource group. Then, switch to another data factory that you have access to.

Visual authoring

More data connectors

Ingest data at scale from more than 70 on-premises/cloud data sources in a serverless fashion.

New activities in toolbox

Notebook Activity: Ingest data at scale using more than 70 on-premises/cloud




Azure Event Hubs for Kafka Ecosystems in public preview

Organizations need data-driven strategies to increase competitive advantage. Customers want to stream data or analyze it in real time to get valuable insights faster. To meet these big data needs, you need a massively scalable, distributed, event-driven messaging platform with multiple producers and consumers. Apache Kafka and Azure Event Hubs provide such distributed platforms.

How is Event Hubs different from Kafka?

Kafka and Event Hubs are both designed to handle large-scale stream ingestion driven by real-time events. Conceptually, both are distributed, partitioned, and replicated commit log services. Both use a partitioned consumer model that offers huge scalability for concurrent consumers. Both use a client-side cursor concept and scale to very high workloads.
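
The shared concepts above (a partitioned commit log, one consumer per partition, and a cursor kept by the client) can be sketched with a toy in-memory model. This is only an illustration under stated assumptions: the `PartitionedLog` and `Consumer` classes are hypothetical, and routing keys with CRC32 is a stand-in choice, not how either Kafka or Event Hubs actually assigns partitions.

```python
import zlib


class PartitionedLog:
    """Toy in-memory stand-in for a partitioned, append-only commit log."""

    def __init__(self, partition_count):
        self.partitions = [[] for _ in range(partition_count)]

    def append(self, key, event):
        # Hypothetical routing rule: CRC32 of the key modulo the partition count,
        # so events with the same key always land in the same partition.
        index = zlib.crc32(key.encode("utf-8")) % len(self.partitions)
        self.partitions[index].append(event)
        return index

    def read(self, partition, offset, max_events=10):
        # The log keeps no record of who has read what; callers pass an offset.
        return self.partitions[partition][offset:offset + max_events]


class Consumer:
    """Reads a single partition and keeps its own cursor (offset) client side."""

    def __init__(self, log, partition):
        self.log = log
        self.partition = partition
        self.offset = 0

    def poll(self, max_events=10):
        events = self.log.read(self.partition, self.offset, max_events)
        self.offset += len(events)  # the cursor advances only on the client
        return events


log = PartitionedLog(partition_count=4)
for i in range(20):
    device = f"device-{i % 5}"
    log.append(device, {"device": device, "seq": i})

# One consumer per partition, each reading independently of the others.
consumers = [Consumer(log, p) for p in range(4)]
received = [event for c in consumers for event in c.poll(max_events=100)]
```

Because the cursor lives with the consumer, a second consumer can replay a partition from offset 0 without affecting anyone else, and adding partitions adds room for more concurrent readers.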

Apache Kafka is software that you install and run. Azure Event Hubs is a fully managed service in the cloud. While Kafka is popular for its wide ecosystem and its on-premises and cloud presence, Event Hubs offers you the freedom of not having to manage servers or networks or worry about configuring brokers.

Talk to Event Hubs like you would to Kafka, and unleash the power of PaaS!
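
As a rough illustration of what "no brokers to configure" can mean for a client, the sketch below builds the connection settings a Kafka client would typically need to reach an Event Hubs namespace. It assumes the commonly documented conventions for the Event Hubs Kafka endpoint (port 9093, SASL over TLS, and the literal user name `$ConnectionString`); verify these against the current service documentation. The helper name, the namespace `my-namespace`, and the connection string are placeholders, not values from this post.

```python
def event_hubs_kafka_config(namespace, connection_string):
    """Build librdkafka-style client settings for an Event Hubs Kafka endpoint.

    Assumption: the endpoint listens on port 9093 and authenticates with
    SASL PLAIN, using the namespace connection string as the password.
    """
    return {
        "bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        "sasl.username": "$ConnectionString",  # literal string, not a variable
        "sasl.password": connection_string,
    }


# Placeholder values for illustration only.
config = event_hubs_kafka_config(
    "my-namespace",
    "Endpoint=sb://my-namespace.servicebus.windows.net/;"
    "SharedAccessKeyName=<key-name>;SharedAccessKey=<key>",
)

for setting, value in config.items():
    # Avoid echoing the secret when inspecting the configuration.
    shown = "<redacted>" if setting == "sasl.password" else value
    print(f"{setting} = {shown}")
```

A dictionary like this could be passed to a Kafka client library that accepts librdkafka-style property names; other clients use different key names for the same settings.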

Today we are happy to marry both these powerful distributed streaming platforms to offer you Event Hubs for




Python, Node.js, Go client libraries for Azure Event Hubs in public preview

Azure Event Hubs is expanding its ecosystem to support more languages. Azure Event Hubs is a highly scalable data-streaming platform processing millions of events per second. Event Hubs uses Advanced Message Queuing Protocol (AMQP 1.0) to enable interoperability and compatibility across platforms. Now, with the addition of new clients, you can easily get started with Event Hubs.

We are happy to have the new client libraries for Go, Python, and Node.js in public preview. Build your application logging or clickstream analytics pipelines, live dashboards, or any telemetry processing in the language of your choice.

Ingest and consume events/logs from your Python applications or stream with your Node.js applications or simply integrate with your Go applications. You now have a wide palette to choose from based on your needs.

The following updates will provide more insights into the public preview of the new client libraries.

Event Hubs for Go: this new package offers easy-to-use Send and Receive functions that communicate with the Event Hubs service using the AMQP 1.0 protocol as implemented by the What more? It also offers the Event Processor Host to manage load balancing and lease management for the consumers. The readme helps




Secure credential management for ETL workloads using Azure Key Vault and Data Factory

Secure credential management is essential to protect data in the cloud. With Azure Key Vault, you can encrypt keys and small secrets like passwords that use keys. Azure Data Factory is now integrated with Azure Key Vault. You can store credentials for the data stores and compute resources referenced in your Azure Data Factory ETL (Extract, Transform, Load) workloads in an Azure Key Vault. Simply create an Azure Key Vault linked service and refer to the secret stored in the Key Vault in your data factory pipelines.
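
The two pieces described above, a Key Vault linked service and a data store definition that refers to a secret by name, can be sketched as JSON documents built in Python. This is a hedged sketch of the linked service JSON shape as it is commonly documented for Data Factory; the helper names, the vault name `contoso-vault`, the service names, and the secret name are all hypothetical, and the exact property names should be checked against the current Data Factory documentation.

```python
import json


def key_vault_linked_service(name, vault_name):
    """Linked service that points Data Factory at a Key Vault (assumed shape)."""
    return {
        "name": name,
        "properties": {
            "type": "AzureKeyVault",
            "typeProperties": {"baseUrl": f"https://{vault_name}.vault.azure.net"},
        },
    }


def sql_linked_service_with_secret(name, key_vault_service_name, secret_name):
    """Data store definition that names a secret instead of embedding it."""
    return {
        "name": name,
        "properties": {
            "type": "AzureSqlDatabase",
            "typeProperties": {
                "connectionString": {
                    "type": "AzureKeyVaultSecret",
                    "store": {
                        "referenceName": key_vault_service_name,
                        "type": "LinkedServiceReference",
                    },
                    "secretName": secret_name,
                }
            },
        },
    }


# Hypothetical names for illustration only.
vault = key_vault_linked_service("ContosoKeyVault", "contoso-vault")
store = sql_linked_service_with_secret(
    "SalesDatabase", vault["name"], "sales-db-connection-string"
)

document = json.dumps([vault, store], indent=2)
print(document)
```

The point of the design is visible in the output: the pipeline definition carries only the vault reference and the secret name, so the credential value itself never appears in the data factory JSON.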

Azure Data Factory will now automatically pull the credentials for your data stores and compute resources from Azure Key Vault during pipeline execution. With Key Vault, you don’t need to provision, configure, patch, and maintain key management software; you can provision new vaults and keys in minutes. Centrally manage keys, secrets, and policies, and refer to the keys in your data factory pipelines. You keep control over your keys by granting permission for your own applications and the Data Factory service to use them as needed. Data Factory never has direct access to keys. Developers manage keys used for Dev/Test and seamlessly migrate to production keys that are managed by security operations.

With Azure Key




Azure Toolkit for Eclipse integrates with HDInsight Ambari and supports Spark 2.2

To provide more authentication options, Azure Toolkit for Eclipse now supports integration with HDInsight clusters through Ambari for job submission, cluster resource browsing, and storage file navigation. You can easily link or unlink any cluster by using an Ambari-managed username and password, which is independent of your Azure sign-in credentials. The Ambari connection applies to normal Spark and Hive clusters hosted within HDInsight on Azure. These additions give you more flexibility in how you connect to your HDInsight clusters, in addition to your Azure subscriptions, while also simplifying the experience of submitting Spark jobs.

With this release, you can benefit from the new functionality and consume the new libraries and APIs from Spark 2.2 in Azure Toolkit for Eclipse. You can create, author, and submit a Spark 2.2 project to a Spark 2.2 cluster. Thanks to the backward compatibility of Spark 2.2, you can also submit your existing Spark 2.0 and Spark 2.1 projects to a Spark 2.2 cluster.

How to link a cluster

Click Link a cluster from Azure Explorer.

Enter the Cluster Name, Storage Account, and Storage Key, select a container from Storage Container, and finally enter the Username and Password. Click the OK button to link the cluster.

Please note that you