Category Archives : Big Data



Using Qubole Data Service on Azure to analyze retail customer feedback

It has been a busy season for many retailers. During this time, retailers are using Azure to analyze various types of data to help accelerate purchasing decisions. The Azure cloud not only gives retailers the compute capacity to handle peak times, but also the data analytic tools to better understand their customers.

Many retailers have a treasure trove of information in the thousands, or millions, of product reviews provided by their customers. Often, it takes time for particular reviews to show their value because customers “vote” for helpful or not helpful reviews over time. Using machine learning, retailers can automate identifying useful reviews in near real-time and leverage that insight quickly to build additional business value.

But how might a retailer without deep big data and machine learning expertise even begin to conduct this type of advanced analytics on such a large quantity of unstructured data? We will be holding a workshop in January to show you how easy that can be through the use of Azure and Qubole’s big data service.

Using these technologies, anyone can quickly spin up a data platform and train a machine learning model utilizing Natural Language Processing (NLP) to identify the most useful reviews.
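To give a rough feel for what such a model does, here is a minimal sketch in plain Python, not the workshop's actual tooling. It trains a small naive Bayes text classifier on a handful of invented reviews. The training data, the `train` and `predict_helpful` names, and the whitespace tokenizer are all assumptions made for illustration; a real pipeline would train on the customers' helpfulness votes at far larger scale.

```python
import math
from collections import Counter

# Invented toy data: (review text, was it voted helpful?). Illustrative only.
TRAINING = [
    ("battery lasts ten hours", True),
    ("screen scratches easily", True),
    ("fits small order larger size", True),
    ("love it", False),
    ("worst ever", False),
    ("great great great", False),
    ("do not buy this thing", False),
]

def tokenize(text):
    """Very naive tokenizer: lowercase and split on whitespace."""
    return text.lower().split()

def train(examples):
    """Count words per class and documents per class."""
    word_counts = {True: Counter(), False: Counter()}
    doc_counts = Counter()
    for text, helpful in examples:
        doc_counts[helpful] += 1
        word_counts[helpful].update(tokenize(text))
    vocab = set(word_counts[True]) | set(word_counts[False])
    return word_counts, doc_counts, vocab

def predict_helpful(model, text):
    """Multinomial naive Bayes with add-one (Laplace) smoothing."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    scores = {}
    for label in (True, False):
        total_words = sum(word_counts[label].values())
        # Start from the log prior, then add one log likelihood per token.
        score = math.log(doc_counts[label] / total_docs)
        for word in tokenize(text):
            smoothed = (word_counts[label][word] + 1) / (total_words + len(vocab))
            score += math.log(smoothed)
        scores[label] = score
    return scores[True] > scores[False]

model = train(TRAINING)
```

Calling `predict_helpful(model, "battery lasts long")` scores a new review against both classes. Reviews that mention concrete product details tend to land on the helpful side in this toy setup.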



Azure HDInsight Performance Benchmarking: Interactive Query, Spark, and Presto

Fast SQL query processing at scale is often a key consideration for our customers. In this blog post we compare HDInsight Interactive Query, Spark, and Presto using the industry standard TPCDS benchmarks. These benchmarks are run using out of the box default HDInsight configurations, with no special optimizations. For customers wanting to run these benchmarks, please follow the easy to use steps outlined on GitHub.

Summary of the results

HDInsight Interactive Query is faster than Spark.
HDInsight Spark is faster than Presto.
Text caching in Interactive Query, without converting data to ORC or Parquet, is equivalent to warm Spark performance.
Interactive Query is the most suitable engine for large-scale data, as it was the only engine that could run all 99 TPCDS queries without any modifications at 100TB scale.
Interactive Query performs well with high concurrency.

About TPCDS

The TPC Benchmark DS (TPC-DS) is a decision support benchmark that models several generally applicable aspects of a decision support system, including queries and data maintenance. According to TPCDS, the benchmark provides a representative evaluation of performance as a general purpose decision support system. A benchmark result measures query response time in single user mode, query throughput in multi-user mode and



New connectors available in Azure Data Factory V2

We keep enriching the breadth of connectivity in Azure Data Factory to enable customers to ingest data from various data sources into Azure when building modern data warehouse solutions or data-driven SaaS applications. Today, we are excited to announce that Azure Data Factory now supports copying data from the following data stores using Copy Activity in V2. You can always find the full list of supported connectors under supported data stores, and click into each connector topic there to learn more details.

Amazon Marketplace Web Service (Beta)
Azure Database for PostgreSQL
Concur (Beta)
Couchbase (Beta)
Drill (Beta)
Google BigQuery (Beta)
Greenplum (Beta)
HBase
Hive
HubSpot (Beta)
Impala (Beta)
Jira (Beta)
Magento (Beta)
MariaDB
Marketo (Beta)
Oracle Eloqua (Beta)
Paypal (Beta)
Phoenix
Presto (Beta)
QuickBooks (Beta)
SAP Cloud for Customer (C4C)
ServiceNow (Beta)
Shopify (Beta)
Spark
Square (Beta)
Xero (Beta)
Zoho (Beta)

If you are using PowerShell or the .NET/Python SDK to author pipelines, make sure you upgrade to the December version to use these new features. For hybrid copy scenarios, note that these connectors are supported starting with Self-hosted Integration Runtime version 3.2.

You are invited to give them a try and provide us feedback. We hope you find them helpful in your scenario.



XBox – Analytics on petabytes of gaming data with Azure HDInsight

This blog post was co-authored by Karan Gulati, Senior Software Engineer, XBOX and Daniel Hagen, Senior Software Engineer, XBOX.

Microsoft Studios produces some of the world’s most popular game titles, including the Halo, Minecraft, and Forza Motorsport series. The Xbox product services team manages thousands of datasets and hundreds of active pipelines consuming hundreds of gigabytes of data each hour for first party studios. Game developers need to know the health of their game by measuring acquisition, retention, player progression, and general usage over time. This presents a textbook big data problem where data needs to be cleaned, formatted, aggregated, and reported on, better known as ETL (Extract, Transform, Load).
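The clean, format, aggregate, and report steps can be sketched on a tiny in-memory sample. This is only an illustration of the ETL pattern, not the Xbox team's pipeline: the event fields, the `clean`, `daily_active_players`, and `run_etl` names, and the sample records are all hypothetical, and a production job would run on a distributed engine rather than a Python list.

```python
from collections import defaultdict
from datetime import date

# Hypothetical raw telemetry events; field names are illustrative only.
RAW_EVENTS = [
    {"player": " Alice ", "title": "halo", "day": "2017-12-01"},
    {"player": "bob", "title": "Halo", "day": "2017-12-01"},
    {"player": "alice", "title": "HALO", "day": "2017-12-02"},
    {"player": "", "title": "halo", "day": "2017-12-02"},        # no player id
    {"player": "carol", "title": "forza", "day": "not-a-date"},  # bad date
]

def clean(event):
    """Clean and format one raw event; return None if it is unusable."""
    player = event.get("player", "").strip().lower()
    title = event.get("title", "").strip().lower()
    if not player or not title:
        return None
    try:
        day = date.fromisoformat(event.get("day", ""))
    except ValueError:
        return None
    return {"player": player, "title": title, "day": day}

def daily_active_players(events):
    """Aggregate: count distinct players per (title, day)."""
    players = defaultdict(set)
    for event in events:
        players[(event["title"], event["day"])].add(event["player"])
    return {key: len(names) for key, names in players.items()}

def run_etl(raw_events):
    """Extract from the raw list, transform each record, load into a report dict."""
    cleaned = [c for c in map(clean, raw_events) if c is not None]
    return daily_active_players(cleaned)

report = run_etl(RAW_EVENTS)
```

Here `report` maps each `(title, day)` pair to its number of distinct active players, which is the kind of usage metric a game health dashboard would chart over time.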

HDInsight – Fully managed, full spectrum open source analytics service for enterprises

Azure HDInsight is a fully managed cloud service for customers to do analytics at a massive scale using the most popular open-source frameworks such as Hadoop, MapReduce, Hive, LLAP, Presto, Spark, Kafka, and R. HDInsight enables a broad range of customer scenarios, such as batch and ETL, data warehousing, machine learning, IoT, and streaming, over massive volumes of data.

Key HDInsight benefits

Cloud native: The only service in the



Announcing Apache Kafka for Azure HDInsight general availability

Apache Kafka on Azure HDInsight was added last year as a preview service to help enterprises create real-time big data pipelines. Since then, large companies such as Toyota, Adobe, Bing Ads, and GE have been using this service in production to process over a million events per second to power scenarios for connected cars, fraud detection, clickstream analysis, and log analytics. HDInsight has worked very closely with these customers to understand the challenges of running a robust, real-time production pipeline at enterprise scale. Using what we learned, we have implemented key features in the managed Kafka service on HDInsight, which is now generally available.

A fully managed Kafka service for the enterprise use case

Running big data streaming pipelines is hard. Doing so with open source technologies for the enterprise is even harder. Apache Kafka, a key open source technology, has emerged as the de facto technology for ingesting large volumes of streaming events in a scalable, low-latency, and low-cost fashion. Enterprises want to leverage this technology; however, there are many challenges with installing, managing, and maintaining a streaming pipeline. Open source bits lack support, and in-house talent needs to be well versed in these technologies to ensure the highest levels of



Azure HDInsight Integration with Azure Log Analytics is now generally available

I am excited to announce the general availability of HDInsight Integration with Azure Log Analytics.

Azure HDInsight is a fully managed cloud service for customers to do analytics at scale using the most popular open-source engines such as Hadoop, Hive/LLAP, Presto, Spark, Kafka, Storm, and HBase.

Thousands of our customers run their big data analytical applications on HDInsight at global scale. The ability to monitor this infrastructure, detect failures quickly and take quick remedial action is key to ensuring a better customer experience.

Log Analytics is part of Microsoft Azure’s overall monitoring solution. Log Analytics helps you monitor cloud and on-premises environments to maintain availability and performance.

Our integration with Log Analytics will make it easier for our customers to operate their big data production workloads in a simpler and more effective manner.

Monitor & debug full spectrum of big data open source engines at global scale

Typical big data pipelines utilize multiple open source engines, such as Kafka for ingestion, Spark Streaming or Storm for stream processing, Hive and Spark for ETL, and Interactive Query (LLAP) for blazing fast querying of big data.

Additionally, these pipelines may be running in different datacenters across the globe.

With new HDInsight monitoring