Category Archives : Big Data



Implementation patterns for big data and data warehouse on Azure

To help our customers with their adoption of Azure services for big data and data warehousing workloads we have identified some common adoption patterns which are reference architectures for success. So, what patterns do we have for our modern data warehouse play?

Modern data warehouse

This pattern is the convergence of relational and non-relational (structured and unstructured) data, orchestrated by Azure Data Factory and brought together in Azure Blob Storage, which acts as the primary data source for Azure services. The value of having the relational data warehouse layer is to support the business rules, security model, and governance which are often layered here. The de-normalization of the data in the relational model is purposeful as it aligns data models and schemas to support various internal business organizations and applications. Azure Databricks can also cleanse data prior to loading into Azure SQL Data Warehouse. It enables an optional analytical path in addition to the Azure Analysis Services layer for business intelligence applications such as Power BI or other business applications.

Advanced analytics on big data

Here we introduce advanced analytical capabilities through our Azure Databricks platforms with Azure Machine Learning. We still have all the greatness of Azure Data Factory,




Azure Event Hubs integration with Apache Spark now generally available

The Event Hubs team is happy to announce the general availability of our integration with Apache Spark. Now, Event Hubs users can use Spark to easily build end-to-end streaming applications. The Event Hubs connector for Spark supports Spark Core, Spark Streaming, and Structured Streaming for Spark 2.1, Spark 2.2, and Spark 2.3.

For users new to Spark, Spark Streaming and Structured Streaming are scalable, fault-tolerant stream processing engines. These processing engines allow users to process huge amounts of data using complex algorithms expressed with high-level functions like map, reduce, join, and window. This data can then be pushed to file systems, databases, or even back to Event Hubs.
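To make the map, reduce, and window vocabulary concrete, here is a minimal sketch in plain Scala collections, with no Spark or Event Hubs dependency. It only illustrates the idea of a tumbling window; it is not how Spark implements windowing, and the `Reading` type, the `WindowSketch` object, and the sample values are hypothetical names invented for this example.

```scala
// Hypothetical event type for illustration; not part of any Spark or Event Hubs API.
final case class Reading(deviceId: String, timestampMs: Long, value: Double)

object WindowSketch {
  // Assign each reading to a tumbling window of the given size, then
  // sum the values per (window start, device) pair.
  def sumPerWindow(readings: Seq[Reading], windowMs: Long): Map[(Long, String), Double] =
    readings
      // map: key each reading by the start of its window and its device
      .map(r => ((r.timestampMs / windowMs * windowMs, r.deviceId), r.value))
      .groupBy(_._1)
      // reduce: add up the values that share a key
      .map { case (key, pairs) => key -> pairs.map(_._2).reduce(_ + _) }

  def main(args: Array[String]): Unit = {
    val readings = Seq(
      Reading("a", 1000L, 1.0),
      Reading("a", 4000L, 2.0),
      Reading("b", 2000L, 5.0),
      Reading("a", 11000L, 4.0)
    )
    // Ten-second windows: device "a" lands in two different windows.
    sumPerWindow(readings, windowMs = 10000L).toSeq.sortBy(_._1).foreach(println)
  }
}
```

In a real streaming job the same grouping would be expressed with the engine's own window functions, which also handle late and out-of-order events; this sketch ignores both.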

Setting up a stream is easy. Check it out:

import org.apache.spark.eventhubs._
import org.apache.spark.sql.SparkSession

val eventHubsConf = EventHubsConf("{EVENT HUB CONNECTION STRING FROM AZURE PORTAL}")
  .setStartingPosition(EventPosition.fromEndOfStream)

// Create a stream that reads data from the specified Event Hub.
val spark = SparkSession.builder.appName("SimpleStream").getOrCreate()
val eventHubStream = spark.readStream
  .format("eventhubs")
  .options(eventHubsConf.toMap)
  .load()

It’s as easy as that! Once your events are streaming into Spark, you can process them as you wish. Spark provides a variety of processing options, such as graph analysis and machine learning. Our documentation has more details on linking our connector with your




Azure Databricks, industry-leading analytics platform powered by Apache Spark™

This blog post was co-authored by Ali Ghodsi, CEO, Databricks.

The confluence of cloud, data, and AI is driving unprecedented change. The ability to utilize data and turn it into breakthrough insights is foundational to innovation today. Our goal is to empower organizations to unleash the power of data and reimagine possibilities that will improve our world.

To enable this journey, we are excited to announce the general availability of Azure Databricks, a fast, easy, and collaborative Apache® Spark™-based analytics platform optimized for Azure.

Fast, easy, and collaborative

Over the past five years, Apache Spark has emerged as the open source standard for advanced analytics, machine learning, and AI on Big Data. With a massive community of over 1,000 contributors and rapid adoption by enterprises, we see Spark’s popularity continue to rise.

Azure Databricks is designed in collaboration with Databricks whose founders started the Spark research project at UC Berkeley, which later became Apache Spark. Our goal with Azure Databricks is to help customers accelerate innovation and simplify the process of building Big Data & AI solutions by combining the best of Databricks and Azure.

To meet this goal, we developed Azure Databricks with three design principles.

First, enhance




Azure cloud data and AI services training roundup

Looking to transform your business by improving your on-premises environments, accelerating your move to the cloud, and gaining transformative insights from your data? Here’s your opportunity to learn from the experts and ask the questions that help your organization move forward.

Join us for one or all of these training sessions to take a deep dive into a variety of topics, including products like Azure Cosmos DB, along with Microsoft innovations in artificial intelligence, advanced analytics, and big data.

Azure Cosmos DB

Engineering experts are leading a seven-part training series on Azure Cosmos DB, complete with interactive Q&As. In addition to a high-level technical deep dive, this series covers a wide array of topics, including:

Graph API

Table API

Building MongoDB apps

By the end of this series, you’ll be able to build serverless applications and conduct real-time analytics using Azure Cosmos DB, Azure Functions, and Spark. Register to attend the whole Azure Cosmos DB series, or register for the sessions that interest you.

Artificial Intelligence (AI)

Learn to create the next generation of applications spanning an intelligent cloud as well as an intelligent edge powered by AI. Microsoft offers a comprehensive set of flexible AI services for any




Azure Data Lake tools for VS Code now supports job view and job monitoring

We are happy to announce that job monitoring and job view have been added to the Azure Data Lake Tools for Visual Studio Code. Now, you can perform real-time monitoring for the jobs you submit. You can also view the job summary and job details for historical jobs, as well as download any of the input or output data and resource files associated with the job.

Key customer benefits

Monitor job progress in real time within VSCode for both local and ADL jobs.

Display job summary and data details for historical jobs.

Enable job resubmission for a previously run job.

Download job inputs, outputs, and resource data files.

View the job U-SQL script for a submitted job.

Summary of key new features

Job View Page: Display job summary and job progress within VSCode. 

Data Page: Display job input, output, and resource files. Support file download.

Show Historical Jobs: Use command ADL: Show Jobs for both local and ADL historical jobs.

Set Default Context: Use command ADL: Set Default Context to set default context for current working folder.

How to install or update

Install Visual




HDInsight Tools for VSCode integrates with Ambari and HDInsight Enterprise Secure Package

To provide more authentication options, HDInsight Tools for VSCode can now connect to an HDInsight cluster through Ambari for job submissions. You can easily link (HDInsight: Link a cluster) or unlink (HDInsight: Unlink a cluster) a normal cluster by using an Ambari-managed username and password, which is independent of your Azure sign-in process. The Ambari connection applies to Spark and Hive clusters in all the Azure environments which host HDInsight services.

To support HDInsight Enterprise Secure Package (in preview), you can also connect to the secured cluster through a domain username. This connection is applicable for both traditional blob storage (WASB) and Azure Data Lake Storage (ADLS) as underlying storage. Once you connect to the secured HDInsight cluster, you can use the signed-in domain credentials for all your job submissions.

This addition gives you more flexibility to connect to your HDInsight clusters in addition to your Azure subscriptions, and greatly simplifies the experience of submitting your Hive and Spark jobs.

How to link a cluster

Open the command palette by pressing CTRL+SHIFT+P, and then enter HDInsight: Link a cluster.

Enter HDInsight cluster URL -> input Username -> input Password -> select cluster type ->




StorSimple Data Manager now generally available

We are excited to announce the general availability of the StorSimple Data Manager. This feature allows you to transform data from StorSimple format into the native format in Azure blobs or Azure Files. Once your data is transformed, you can use services like Azure Media Services, Azure Machine Learning, HDInsight, Azure Search, and more.

StorSimple devices use the cloud as a tier of storage and send data to the cloud in a highly efficient and secure manner. Data is stored in the cloud tier in a deduplicated, compressed, and encrypted format. A side effect of this is that the data is not readily consumable by cloud services that you might want to use. Azure offers a rich bouquet of services and our goal is to let you use the service of your choice on your data to unleash its potential.

Using this service, you can transform data stored in your 8000 series StorSimple devices into Azure blobs or Azure Files. All the file data that you store on-premises on your StorSimple device will show up as individual blobs or files in Azure. You can use the Azure portal, .NET applications, or Azure Automation to trigger these transformations. You can




Microsoft partners with National Science Foundation to empower data science breakthroughs

Over the past decade, Microsoft has partnered with the National Science Foundation (NSF) on three separate programs, first in 2010, and more recently through a commitment of $6M in cloud credits across two NSF supported data science programs – with the Big Data Regional Innovation Hubs and as part of the NSF BigData solicitation.

The engagement with NSF has helped Microsoft reach diverse research groups such as the Big Data Hubs [1] that bring together communities of data scientists to spark and nurture collaborations between domain experts, researchers, communities, state partners, nonprofits, and industry.

As of today, Microsoft has provided 17 cloud credit awards to Principal Investigators (PIs) who benefit from NSF-supported programs. These collaborations are already seeing some interesting breakthroughs across the human body, microbial diseases, and even everyday communication:

Franco Pestilli, Assistant Professor in Psychology, Neuroscience and Cognitive Science, Indiana University, is an Azure awardee and PI through the Midwest Big Data Hub [2]. His group has built a platform called Brainlife using the Azure award, with the goal of fostering collaboration with sixty-six different global scientific communities such as developmental and learning sciences, network science, computer science, engineering, psychology, statistics, traumatic brain injury, and vision science. Chirag




Microsoft Azure Data Lake Storage in Storage Explorer – public preview

Providing a rich GUI for Azure Data Lake Storage resource management has been a top customer ask for a long time, so we are thrilled to announce the public preview of support for Azure Data Lake Storage (ADLS) in Azure Storage Explorer (ASE). With the release of ADLS resources in ASE, you can freely navigate ADLS resources, upload and download folders and files, copy and paste files across folders or ADLS accounts, and easily perform CRUD operations on your folders and files. Azure Storage Explorer not only offers a traditional desktop explorer GUI for dragging, uploading, downloading, copying, and moving your ADLS folders and files, but also provides a unified developer experience for displaying file properties, viewing folder statistics, and adding quick access. With this extension you are now able to browse ADLS resources alongside existing experiences for Azure Blobs, tables, files, queues, and Cosmos DB in ASE.

Key customer benefits

Offers a one-stop shop to manage Azure Storage Resources including ADLS

Enables direct connect through Azure AD Authentication

Provides traditional explorer experiences for file movement, file/folder upload and download with great scalability

Delivers better accessibility for file navigation and data management capability with reliable




New Azure Data Factory self-paced hands-on lab for UI

A few weeks back, we announced the public preview release of the new browser-based V2 UI experience for Azure Data Factory. We’ve since partnered with Pragmatic Works, who have been long-time experts in the Microsoft data integration and ETL space, to create a new set of hands-on labs that you can now use to learn how to build those DI patterns using ADF V2.

In that repo, you will find data files and scripts in the Deployment folder. There are also lab manual folders for each lab module as well as an overview presentation to walk you through the labs. Below you will find more details on each module.

The repo also includes a series of PowerShell and database scripts as well as Azure ARM templates that will generate resource groups that the labs need in order for you to successfully build out an end-to-end scenario, including some sample data that you can use for Power BI reports in the final Lab Module 9.

Here is how the individual labs are divided:

Lab 1 – Setting up ADF and Resources: Start here to get all of the ARM resource groups and database backup files loaded properly.

Lab 2 – Lift