Category Archives : Big Data



StorSimple Data Manager now generally available

We are excited to announce the general availability of the StorSimple Data Manager. This feature allows you to transform data from StorSimple format into the native format in Azure blobs or Azure Files. Once your data is transformed, you can use services like Azure Media Services, Azure Machine Learning, HDInsight, Azure Search, and more.

StorSimple devices use the cloud as a tier of storage and send data to the cloud in a highly efficient and secure manner. Data is stored in the cloud tier in a deduped, compressed, and encrypted format. A side effect is that this data is not readily consumable by the cloud services you might want to use. Azure offers a rich bouquet of services, and our goal is to let you use the service of your choice on your data to unleash its potential.

Using this service, you can transform data stored in your 8000 series StorSimple devices into Azure blobs or Azure Files. All the file data that you store on-premises on your StorSimple device will show up as individual blobs or files in Azure. You can use the Azure portal, .NET applications, or Azure Automation to trigger these transformations. You can



Microsoft partners with National Science Foundation to empower data science breakthroughs

Over the past decade, Microsoft has partnered with the National Science Foundation (NSF) on three separate programs, first in 2010 and more recently through a commitment of $6M in cloud credits across two NSF-supported data science programs: the Big Data Regional Innovation Hubs and the NSF BigData solicitation.

The engagement with NSF has helped Microsoft reach diverse research groups such as the Big Data Hubs, which bring together communities of data scientists to spark and nurture collaborations between domain experts, researchers, communities, state partners, nonprofits, and industry.

As of today, Microsoft has provided 17 cloud credit awards to Principal Investigators (PIs) who benefit from NSF-supported programs. These collaborations are already producing interesting breakthroughs in research on the human body, microbial diseases, and even everyday communication:

Franco Pestilli, Assistant Professor in Psychology, Neuroscience and Cognitive Science at Indiana University, is an Azure awardee and PI through the Midwest Big Data Hub. His group has used the Azure award to build a platform called Brainlife, with the goal of fostering collaboration with sixty-six different global scientific communities such as developmental and learning sciences, network science, computer science, engineering, psychology, statistics, traumatic brain injury, and vision science. Chirag



Microsoft Azure Data Lake Storage in Storage Explorer – public preview

Providing a rich GUI for managing Azure Data Lake Storage resources has been a top customer ask for a long time, and we are thrilled to announce the public preview of Azure Data Lake Storage (ADLS) support in Azure Storage Explorer (ASE). With ADLS resources in ASE, you can freely navigate ADLS resources, upload and download folders and files, copy and paste files across folders or ADLS accounts, and easily perform CRUD operations on your folders and files. Azure Storage Explorer not only offers a traditional desktop explorer GUI for dragging, uploading, downloading, copying, and moving your ADLS folders and files, but also provides a unified developer experience for displaying file properties, viewing folder statistics, and adding quick access. With this extension, you can now browse ADLS resources alongside the existing experiences for Azure Blobs, tables, files, queues, and Cosmos DB in ASE.

Key customer benefits

Offers a one-stop shop to manage Azure Storage Resources including ADLS
Enables direct connect through Azure AD Authentication
Provides traditional explorer experiences for file movement, file/folder upload and download with great scalability
Delivers better accessibility for file navigation and data management capability with reliable



New Azure Data Factory self-paced hands-on lab for UI

A few weeks back, we announced the public preview release of the new browser-based V2 UI experience for Azure Data Factory. We’ve since partnered with Pragmatic Works, long-time experts in the Microsoft data integration and ETL space, to create a new set of hands-on labs that you can now use to learn how to build data integration (DI) patterns using ADF V2.

In the lab repo, you will find data files and scripts in the Deployment folder. There are also lab manual folders for each lab module, as well as an overview presentation to walk you through the labs. Below you will find more details on each module.

The repo also includes a series of PowerShell and database scripts, as well as Azure ARM templates that generate the resource groups the labs need for you to build out an end-to-end scenario. It also includes sample data that you can use for Power BI reports in the final module, Lab 9.

Here is how the individual labs are divided:

Lab 1 – Setting up ADF and Resources: Start here to get all of the ARM resource groups and database backup files loaded properly.
Lab 2 – Lift



Accelerated Spark on GPU-enabled clusters in Azure

The ability to run Spark on a GPU-enabled cluster demonstrates a unique convergence of big data and high-performance computing (HPC) technologies. In the past several years, we’ve seen the GPU market explode as companies all over the world integrate AI and other HPC workflows into their businesses. TensorFlow, a framework designed to utilize GPUs for numerical computation and neural networks, has skyrocketed in popularity, a testament to the rise of AI and the resulting demand for GPUs. At the same time, the need for big data and powerful data processing engines has never been greater as hundreds of companies start to collect data in the petabyte range.

Combining high-performance hardware such as GPUs with big data engines such as Spark lets data scientists and data engineers enable many scenarios that would otherwise be difficult to achieve.

Along with the recent release of our latest GPU SKUs, I’m excited to share that we now support running Spark on a GPU-enabled cluster using the Azure Distributed Data Engineering Toolkit (AZTK). In a single command, AZTK allows you to provision on-demand, GPU-enabled Spark clusters on top of Azure Batch’s infrastructure, helping you take your high-performance implementations that are usually



ADF v2: Visual Tools enabled in public preview

ADF v2 public preview was announced at Microsoft Ignite on Sep 25, 2017. With ADF v2, we added flexibility to the ADF app model and enabled control flow constructs that now facilitate looping, branching, conditional constructs, on-demand executions, and flexible scheduling in various programmatic interfaces such as Python, .NET, PowerShell, REST APIs, and ARM templates. One consistent piece of customer feedback we received was a request for a rich, interactive visual authoring and monitoring experience that lets users create, configure, test, deploy, and monitor data integration pipelines without any friction. We listened to your feedback and are happy to announce the release of visual tools for ADF v2. The main goal of the ADF visual tools is to make you productive with ADF by getting pipelines up and running quickly without writing a single line of code. You can use a simple and intuitive code-free interface to drag and drop activities on a pipeline canvas, perform test runs, debug iteratively, and deploy and monitor your pipeline runs. With this release, we are also providing guided tours on how to use the visual authoring and monitoring features, as well as a way to give us valuable feedback.



Using Qubole Data Service on Azure to analyze retail customer feedback

It has been a busy season for many retailers. During this time, retailers are using Azure to analyze various types of data to help accelerate purchasing decisions. The Azure cloud gives retailers not only the compute capacity to handle peak times, but also the data analytics tools to better understand their customers.

Many retailers have a treasure trove of information in the thousands, or millions, of product reviews provided by their customers. Often, it takes time for particular reviews to show their value because customers “vote” for helpful or not helpful reviews over time. Using machine learning, retailers can automate identifying useful reviews in near real-time and leverage that insight quickly to build additional business value.

But how might a retailer without deep big data and machine learning expertise even begin to conduct this type of advanced analytics on such a large quantity of unstructured data? We will be holding a workshop in January to show you how easy that can be through the use of Azure and Qubole’s big data service.

Using these technologies, anyone can quickly spin up a data platform and train a machine learning model utilizing Natural Language Processing (NLP) to identify the most useful reviews.
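As a rough illustration of the idea, the sketch below trains a tiny text classifier on reviews labeled by whether customers voted them helpful. It is not the workshop's method, which is not described here: the naive Bayes model, the `NaiveBayesReviewClassifier` class, and the sample reviews are all assumptions chosen to keep the example self-contained in plain Python.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayesReviewClassifier:
    """Hypothetical multinomial naive Bayes model with add-one smoothing."""

    def __init__(self):
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}
        self.vocab = set()

    def fit(self, labeled_reviews):
        # labeled_reviews: iterable of (review_text, was_voted_helpful) pairs.
        for text, helpful in labeled_reviews:
            tokens = tokenize(text)
            self.word_counts[helpful].update(tokens)
            self.doc_counts[helpful] += 1
            self.vocab.update(tokens)
        return self

    def log_score(self, text, label):
        # Log prior plus the sum of smoothed log likelihoods of known words.
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        denom = sum(self.word_counts[label].values()) + len(self.vocab)
        for token in tokenize(text):
            if token in self.vocab:  # words never seen in training are skipped
                score += math.log((self.word_counts[label][token] + 1) / denom)
        return score

    def predict(self, text):
        """Return True when the review looks more like the helpful class."""
        return self.log_score(text, True) > self.log_score(text, False)


# Made-up training data standing in for reviews with accumulated votes.
TRAINING_REVIEWS = [
    ("Battery lasts two days and the screen stays readable in sunlight", True),
    ("Runs small so order one size up and the stitching held after ten washes", True),
    ("Setup took five minutes and the manual covers wall mounting", True),
    ("Great", False),
    ("Love it", False),
    ("Bad do not buy", False),
]

model = NaiveBayesReviewClassifier().fit(TRAINING_REVIEWS)

if __name__ == "__main__":
    for review in ["The battery lasts all week and the screen is bright",
                   "Love it, great"]:
        print(review, "->", model.predict(review))
```

Once trained on historical votes, a model like this can score a new review as soon as it is posted instead of waiting for votes to accumulate. A real deployment would need far more data and a proper evaluation.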



Azure HDInsight Performance Benchmarking: Interactive Query, Spark, and Presto

Fast SQL query processing at scale is often a key consideration for our customers. In this blog post we compare HDInsight Interactive Query, Spark, and Presto using the industry-standard TPCDS benchmarks. These benchmarks are run using out-of-the-box default HDInsight configurations, with no special optimizations. If you want to run these benchmarks yourself, follow the easy-to-use steps outlined on GitHub.

Summary of the results

HDInsight Interactive Query is faster than Spark.
HDInsight Spark is faster than Presto.
Text caching in Interactive Query, without converting data to ORC or Parquet, is equivalent to warm Spark performance.
Interactive Query is the most suitable engine for large-scale data, as it was the only engine that could run all 99 TPCDS queries without any modifications at 100TB scale.
Interactive Query performs well with high concurrency.

About TPCDS

The TPC Benchmark DS (TPC-DS) is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. According to TPCDS, the benchmark provides a representative evaluation of performance as a general purpose decision support system. A benchmark result measures query response time in single user mode, query throughput in multi-user mode and



New connectors available in Azure Data Factory V2

We keep enriching the breadth of connectivity in Azure Data Factory to enable customers to ingest data from various data sources into Azure when building modern data warehouse solutions or data-driven SaaS applications. Today, we are excited to announce that Azure Data Factory now supports copying data from the following data stores using Copy Activity in V2. You can always find the full list of supported connectors under supported data stores, and click into each connector topic there to learn more.

Amazon Marketplace Web Service (Beta)
Azure Database for PostgreSQL
Concur (Beta)
Couchbase (Beta)
Drill (Beta)
Google BigQuery (Beta)
Greenplum (Beta)
HBase
Hive
HubSpot (Beta)
Impala (Beta)
Jira (Beta)
Magento (Beta)
MariaDB
Marketo (Beta)
Oracle Eloqua (Beta)
Paypal (Beta)
Phoenix
Presto (Beta)
QuickBooks (Beta)
SAP Cloud for Customer (C4C)
ServiceNow (Beta)
Shopify (Beta)
Spark
Square (Beta)
Xero (Beta)
Zoho (Beta)

If you author with PowerShell or the .NET/Python SDK, make sure you upgrade to the December version to use these new features. For hybrid copy scenarios, note that these connectors are supported in Self-hosted Integration Runtime version 3.2 and later.

You are invited to give them a try and provide us feedback. We hope you find them helpful in your scenario.



XBox – Analytics on petabytes of gaming data with Azure HDInsight

This blog post was co-authored by Karan Gulati, Senior Software Engineer, XBOX and Daniel Hagen, Senior Software Engineer, XBOX.

Microsoft Studios produces some of the world’s most popular game titles, including the Halo, Minecraft, and Forza Motorsport series. The Xbox product services team manages thousands of datasets and hundreds of active pipelines consuming hundreds of gigabytes of data each hour for first-party studios. Game developers need to know the health of their game by measuring acquisition, retention, player progression, and general usage over time. This presents a textbook big data problem in which data needs to be cleaned, formatted, aggregated, and reported on, better known as ETL (Extract, Transform, Load).
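To make the clean, format, aggregate, and report steps concrete, here is a minimal single-machine sketch in plain Python. It is not the Xbox team's pipeline: the CSV layout, the field names (`player_id`, `title`, `event_time`), and the sample rows are all hypothetical, and a production job at this scale would run on a distributed engine such as those offered by HDInsight.

```python
import csv
import io
from collections import defaultdict
from datetime import datetime

# Hypothetical raw telemetry; includes messy rows that cleaning should drop or fix.
RAW_EVENTS = """player_id,title,event_time
p1,Halo,2018-01-01T10:00:00
p1,halo ,2018-01-01T18:30:00
p2,Halo,2018-01-01T11:15:00
,Halo,2018-01-01T12:00:00
p3,Forza,not-a-timestamp
p2,Forza,2018-01-02T09:00:00
"""


def extract(raw_text):
    """Extract: parse the raw CSV text into a list of dict rows."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows):
    """Transform: drop malformed rows and normalize titles and dates."""
    cleaned = []
    for row in rows:
        player = (row.get("player_id") or "").strip()
        title = (row.get("title") or "").strip().title()
        try:
            stamp = datetime.strptime(row["event_time"].strip(),
                                      "%Y-%m-%dT%H:%M:%S")
        except (KeyError, ValueError, AttributeError):
            continue  # missing or unparseable timestamp
        if player and title:
            cleaned.append({"player_id": player, "title": title,
                            "day": stamp.date().isoformat()})
    return cleaned


def aggregate(cleaned):
    """Aggregate: count distinct active players per (title, day)."""
    players = defaultdict(set)
    for row in cleaned:
        players[(row["title"], row["day"])].add(row["player_id"])
    return {key: len(ids) for key, ids in sorted(players.items())}


def report(daily_active):
    """Report: render the aggregates as simple text lines."""
    return [f"{day} {title}: {count} active players"
            for (title, day), count in daily_active.items()]


daily_active = aggregate(transform(extract(RAW_EVENTS)))

if __name__ == "__main__":
    print("\n".join(report(daily_active)))
```

Counting distinct players per title and day gives a simple usage metric. Retention and progression measures can be built the same way by changing the grouping key.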

HDInsight – Fully managed, full spectrum open source analytics service for enterprises

Azure HDInsight is a fully-managed cloud service for customers to do analytics at a massive scale using the most popular open-source frameworks such as Hadoop, MapReduce, Hive, LLAP, Presto, Spark, Kafka, and R. HDInsight enables a broad range of customer scenarios such as batch & ETL, data warehousing, machine learning, IoT and streaming over massive volumes of data at a high scale using Open Source Frameworks.

Key HDInsight benefits

Cloud native: The only service in the