Category Archives : Big Data



Azure Data Lake tools for VS Code now supports job view and job monitoring

We are happy to announce that job monitoring and job view have been added into the Azure Data Lake Tools for Visual Studio Code. Now, you can perform real-time monitoring for the jobs you submit. You can also view job summary and job details for historical jobs as well as  download any of the input or output data and resources files associated with the job.

Key Customer Benefits Monitor job progress in real-time within VSCode for both local and ADL jobs. Display job summary and data details for historical jobs. Resubmit previously run Enable jobs resubmission for an old job. Download job inputs, outputs and resource data files. View the job U-SQL script for a submitted job. Summary of key new features

Job View Page: Display job summary and job progress within VSCode. 

Data Page: Display job input, output and resources files. Support file download.

Show Historical Jobs: Use command ADL: Show Jobs for both local and ADL historical jobs.

Set Default Context: Use command ADL: Set Default Context to set default context for current working folder.

How to install or update

Install Visual



HDInsight Tools for VSCode integrates with Ambari and HDInsight Enterprise Secure Package

To provide more authentication options, HDInsight Tools for VSCode now can be connected to HDInsight cluster through Ambari for job submissions. You can easily link (HDInsight: Link a cluster) or unlink (HDInsight: Unlink a cluster) a normal cluster by using Ambari managed username and password, which is independent of your Azure signing process. The Ambari connection applies to Spark and Hive clusters  in all the Azure environments which host HDInsight services.

To support HDInsight Enterprise Secure Package (in preview), you can also connect to the secured cluster through domain username (e.g. This connection is applicable for both traditional blob storage (WASB) or Azure Data Lake Storage (ADLS) as underlying storage. Once you connect to the secured  HDInsight cluster, you can use the signed in domain credentials for all you job submissions. 

This addition grants you more flexibilities to connect to your HDInsight clusters in addition to your Azure subscriptions and greatly simplify your experiences in submitting your Hive and Spark jobs.

How to link a cluster Open the command palette by selecting CTRL+SHIFT+P, and then enter HDInsight: Link a cluster.

Enter HDInsight cluster URL -> input Username -> input Password -> select cluster type – –>



StorSimple Data Manager now generally available

We are excited to announce the general availability of the StorSimple Data Manager. This feature allows you to transform data from StorSimple format into the native format in Azure blobs or Azure Files. Once your data is transformed, you can use services like Azure Media Services, Azure Machine Learning, HDInsight, Azure Search, and more.

StorSimple devices use the cloud as a tier of storage and sends data to the cloud in a highly efficient and secure manner. Data is stored in the cloud tier in this deduped, compressed, and encrypted format. A side effect of this is that this data is not readily consumable by cloud services that you might want to use. Azure offers a rich bouquet of services and our goal is to let you use the service of your choice on your data to unleash its potential.

Using this service, you can transform data stored in your 8000 series StorSimple devices into Azure blobs or Azure Files. All the file data that you store on-premises on your StorSimple device will show up as individual blobs or files in Azure. You can use the Azure portal, .NET applications, or Azure Automation to trigger these transformations. You can



Microsoft partners with National Science Foundation to empower data science breakthroughs

Over the past decade, Microsoft has partnered with the National Science Foundation (NSF) on three separate programs, first in 2010, and more recently through a commitment of $6M in cloud credits across two NSF supported data science programs – with the Big Data Regional Innovation Hubs and as part of the NSF BigData solicitation.

The engagement with NSF has helped Microsoft reach diverse research groups such as the Big Data Hubs1 that brings together communities of data scientists to spark and nurture collaborations between domain experts, researchers, communities, state partners, nonprofits, and industry.

As of today, Microsoft has provided 17 cloud credit awards to Principal Investigators (PIs) who benefit from NSF supported programs. These collaborations are already seeing some interesting breakthroughs across the human body, microbial diseases, and even everyday communication –

Franco Pestilli, Assistant Professor in Psychology, Neuroscience and Cognitive Science, Indiana University is an Azure awardee and PI through the Midwest Big Data Hub2 – his group has built a platform called Brainlife using the Azure award, with the goal of fostering collaboration with sixty-six different global scientific communities such as developmental and learning sciences, network science, computer science, engineering, psychology, statistics, traumatic brain injury, vision science. Chirag



Microsoft Azure Data Lake Storage in Storage Explorer – public preview

Providing a rich GUI for Azure Data Lake Storage resources management has been a top customer ask for a long time, we are thrilled to announce the public preview for supporting Azure Data Lake Storage (ADLS) in the Azure Storage Explorer (ASE). With the release of ADLS resources in ASE, you can freely navigate ADLS resources, you can upload and download folders and files, you can copy and paste files across folders or ADLS accounts and you can easily perform CRUD operations for your folders and files. Azure Storage Explorer not only offers a traditional desktop explorer GUI for dragging, uploading, downloading, copying and moving your ADLS folders and files, but also provides a unified developer experiences of displaying file properties, viewing folder statistics and adding quick access. With this extension you are now able to browse ADLS resources along-side existing experiences for Azure Blobs, tables, files, queues and Cosmos DB in ASE.

Key customer benefits Offers a one-stop shop to manage Azure Storage Resources including ADLS Enables direct connect through Azure AD Authentication Provides traditional explorer experiences for file movement, file/folder upload and download with great scalability Delivers better accessibility for file navigation and data management capability with reliable



New Azure Data Factory self-paced hands-on lab for UI

A few weeks back, we announced the public preview release of the new browser-based V2 UI experience for Azure Data Factory. We’ve since partnered with Pragmatic Works, who have been long-time experts in the Microsoft data integration and ETL space, to create a new set of hands on labs that you can now use to learn how to build those DI patterns using ADF V2.

In that repo, you will find data files and scripts in the Deployment folder. There are also lab manual folders for each lab module as well an overview presentation to walk you through the labs. Below you will find more details on each module.

The repo also includes a series of PowerShell and database scripts as well as Azure ARM templates that will generate resource groups that the labs need in order for you to successfully build out an end-to-end scenario, including some sample data that you can use for Power BI reports in the final Lab Module 9.

Here is how the individual labs are divided:

Lab 1 – Setting up ADF and Resources, Start here to get all of the ARM resource groups and database backup files loaded properly. Lab 2 – Lift



Accelerated Spark on GPU-enabled clusters in Azure

The ability to run Spark on a GPU enabled cluster demonstrates a unique convergence of big data and high-performance computing (HPC) technologies. In the past several years, we’ve seen the GPU market explode as companies all over the world integrate AI and other HPC workflows into their businesses. Tensorflow, a framework designed to utilize GPUs for numerical computation and neural networks has skyrocketed into popularity, a testament to the rise of AI and consequently the demand for GPUs. Simultaneously, the need for big data and powerful data processing engines has never been greater as hundreds of companies start to collect data in the petabyte range.

By providing infrastructure for high performance hardware such as GPUs with big data engines such as Spark, data scientists and data engineers can enable many scenarios that would otherwise be difficult to achieve.

Along with the recent release of our latest GPU SKUs, I’m excited to share that we now support running Spark on a GPU-enabled cluster using the Azure Distributed Data Engineering Toolkit (AZTK). In a single command, AZTK allows you to provision on demand GPU-enabled Spark clusters on top of Azure Batch’s infrastructure, helping you take your high performance implementations that are usually



ADF v2: Visual Tools enabled in public preview
ADF v2: Visual Tools enabled in public preview

ADF v2 public preview was announced at Microsoft Ignite on Sep 25, 2017. With ADF v2, we added flexibility to ADF app model and enabled control flow constructs that now facilitates looping, branching, conditional constructs, on-demand executions and flexible scheduling in various programmatic interfaces like Python, .Net, Powershell, REST APIs, ARM templates. One of the consistent pieces of customer feedback we received, is to enable a rich interactive visual authoring and monitoring experience allowing users to create, configure, test, deploy and monitor data integration pipelines without any friction. We listened to your feedback and are happy to announce the release of visual tools for ADF v2. The main goal of the ADF visual tools is to allow you to be productive with ADF by getting pipelines up & running quickly without requiring to write a single line of code. You can use a simple and intuitive code free interface to drag and drop activities on a pipeline canvas, perform test runs, debug iteratively, deploy & monitor your pipeline runs. With this release, we are also providing guided tours on how to use the enabled visual authoring & monitoring features and also an ability to give us valuable feedback.



Using Qubole Data Service on Azure to analyze retail customer feedback

It has been a busy season for many retailers. During this time, retailers are using Azure to analyze various types of data to help accelerate purchasing decisions. The Azure cloud not only gives retailers the compute capacity to handle peak times, but also the data analytic tools to better understand their customers.

Many retailers have a treasure trove of information in the thousands, or millions, of product reviews provided by their customers. Often, it takes time for particular reviews to show their value because customers “vote” for helpful or not helpful reviews over time. Using machine learning, retailers can automate identifying useful reviews in near real-time and leverage that insight quickly to build additional business value.

But how might a retailer without deep big data and machine learning expertise even begin to conduct this type of advanced analytics on such a large quantity of unstructured data? We will be holding a workshop in January to show you how easy that can be through the use of Azure and Qubole’s big data service.

Using these technologies, anyone can quickly spin up a data platform and train a machine learning model utilizing Natural Language Processing (NLP) to identify the most useful reviews.



Azure HDInsight Performance Benchmarking: Interactive Query, Spark, and Presto

Fast SQL query processing at scale is often a key consideration for our customers. In this blog post we compare HDInsight Interactive Query, Spark, and Presto using the industry standard TPCDS benchmarks. These benchmarks are run using out of the box default HDInsight configurations, with no special optimizations. For customers wanting to run these benchmarks, please follow the easy to use steps outlined on GitHub.

Summary of the results HDInsight Interactive Query is faster than Spark. HDInsight Spark is faster than Presto. Text caching in Interactive Query, without converting data to ORC or Parquet, is equivalent to warm Spark performance. Interactive query is most suitable to run on large scale data as this was the only engine which could run all TPCDS 99 queries without any modifications at 100TB scale. Interactive Query preforms well with high concurrency. About TPCDS

The TPC Benchmark DS (TPC-DS) is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. According to TPCDS, the benchmark provides a representative evaluation of performance as a general purpose decision support system. A benchmark result measures query response time in single user mode, query throughput in multi-user mode and