Category Archives : Data Science

30

May

Getting started with Apache Spark on Azure Databricks

Data is growing at an astounding rate, with an estimated 2.5 quintillion bytes being created every day. Data analysts predict that by 2020, the world’s collected data will quadruple. In the sea of all this data, we are continually exploring new ways of analyzing and interpreting data in a way that’s productive, meaningful and insightful.

Designed in collaboration with the original creators of Apache® Spark™, Azure Databricks combines the best of Databricks and Microsoft Azure to help customers accelerate innovation with streamlined workflows, an interactive workspace and one-click setup. Azure Databricks is an analytics engine built for large-scale data processing that enables collaboration between data scientists, data engineers and business analysts.

Azure Databricks can be used to run workloads faster and write applications in the language of your choice, whether that’s Scala, SQL, R or Python. When in sync with Azure Databricks, businesses can innovate within the safe, protected cloud environment of Microsoft Azure and benefit from the native integration with other Azure services such as Power BI, Azure SQL Data Warehouse, and Azure Cosmos DB.

When you’re getting started with Apache Spark on Azure Databricks, you’ll have questions that are unique to your business’s implementation and use case.

10

May

Spark + AI Summit: Data scientists and engineers put their trust in Microsoft Azure Databricks

Microsoft will have a major presence at Spark + AI Summit 2018 in San Francisco, the premier event for the Apache Spark community. Rohan Kumar, Corporate Vice President of Azure Data, will deliver a keynote on how Azure Databricks combines the best of the Apache® Spark™ analytics platform and Microsoft Azure Data Services to help customers unleash the power of data and reimagine possibilities that will improve our world.

Azure Databricks, a fast, easy, and collaborative Apache Spark-based analytics platform optimized for Azure, was made generally available in March 2018. To learn more about the announcement, read Rohan Kumar’s blog about how Azure Databricks can help customers accelerate innovation and simplify the process of building Big Data & AI solutions. At Spark + AI Summit, we have a number of sessions showcasing the great work our customers and partners are doing and how Azure Databricks is helping them achieve productivity at scale.

Sign up for training on Spark!

On Monday, June 4, 2018, there are a number of full-day training courses on Apache Spark, ranging from beginner to advanced, that will enhance your skill set and even prepare you for certification on Spark.

Apache Spark essentials

This 1-day course is for

30

Apr

Region expansion for the next generation of SQL Data Warehouse

Azure SQL Data Warehouse (SQL DW) is a fast, flexible, and secure cloud data warehouse tuned for running complex queries quickly across petabytes of data. Continuing to deliver on this promise, we have announced the general availability of the next generation of SQL DW, which delivers on average five times the performance, five times the compute scalability, and four times the concurrency. The release of the Azure SQL DW Compute Optimized Gen2 tier comes with an expansion to 14 additional regions, bringing the global footprint of SQL DW Gen2 to 20 regions, more than any other major cloud provider. The following regions are available:

Australia East

Australia Southeast

Canada Central

Central India

Central US

East Asia

East US

East US 2

Japan East

Japan West

Korea South

North Central US

North Europe

South Central US

South India

Southeast Asia

UK South

West Europe

West US

West US 2

With more global regions than any other

10

Apr

Three critical analytics use cases with Microsoft Azure Databricks

Data science and machine learning can be applied to solve many common business scenarios, yet there are many barriers preventing organizations from adopting them. Collaboration between data scientists, data engineers, and business analysts and curating data, structured and unstructured, from disparate sources are two examples of such barriers – and we haven’t even gotten to the complexity involved when trying to do these things with large volumes of data.  

Recommendation engines, clickstream analytics, and intrusion detection are common scenarios that many organizations are solving across multiple industries. They require machine learning and streaming analytics, and they involve massive amounts of data processing that can be difficult to scale without the right tools. Companies like Lennox International, E.ON, and renewables.AI are just a few examples of organizations that have deployed Apache Spark™ to solve these challenges using Microsoft Azure Databricks.

Your company can enable data science with high-performance analytics too. Designed in collaboration with the original creators of Apache Spark, Azure Databricks is a fast, easy, and collaborative Apache Spark™ based analytics platform optimized for Azure. Azure Databricks is integrated with Azure through one-click setup and provides streamlined workflows and an interactive workspace that enables collaboration between data scientists, data engineers, and business

22

Mar

Azure Event Hubs integration with Apache Spark now generally available

The Event Hubs team is happy to announce the general availability of our integration with Apache Spark. Now, Event Hubs users can use Spark to easily build end-to-end streaming applications. The Event Hubs connector for Spark supports Spark Core, Spark Streaming, and Structured Streaming for Spark 2.1, Spark 2.2, and Spark 2.3.

For users new to Spark, Spark Streaming and Structured Streaming are scalable, fault-tolerant stream processing engines. These processing engines allow users to process huge amounts of data using complex algorithms expressed with high-level functions like map, reduce, join, and window. This data can then be pushed to file systems, databases, or even back to Event Hubs.
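To make the idea of a window concrete, here is a minimal conceptual sketch written against plain Scala collections rather than the Spark API. It assigns each event to a fixed-size (tumbling) window and counts the events per window. The names `Event`, `TumblingWindows`, and `countPerWindow` are hypothetical and chosen only for illustration, and the sketch assumes non-negative timestamps in whole seconds. In Spark itself, the engine performs the equivalent grouping incrementally over an unbounded stream.

```scala
// Conceptual sketch only: plain Scala collections, no Spark dependency.
// Event, TumblingWindows, and countPerWindow are hypothetical names.
final case class Event(deviceId: String, timestampSec: Long)

object TumblingWindows {
  // Assign each event to a fixed-size (tumbling) window and count the
  // events in each window. Keys are window start times in seconds.
  // Assumes non-negative timestamps and windowSec > 0.
  def countPerWindow(events: Seq[Event], windowSec: Long): Map[Long, Int] =
    events
      .map(e => (e.timestampSec / windowSec) * windowSec) // map: event -> window start
      .groupBy(identity)                                   // group by window start
      .map { case (start, hits) => start -> hits.size }    // aggregate: count per window
}
```

For example, with 10-second windows, events at seconds 1, 9, 10, and 25 fall into the windows starting at 0, 0, 10, and 20, so `countPerWindow` returns a map equal to `Map(0 -> 2, 10 -> 1, 20 -> 1)`.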

Setting up a stream is easy. Check it out:

    import org.apache.spark.eventhubs._
    import org.apache.spark.sql.SparkSession

    val eventHubsConf = EventHubsConf("{EVENT HUB CONNECTION STRING FROM AZURE PORTAL}")
      .setStartingPosition(EventPosition.fromEndOfStream)

    // Create a stream that reads data from the specified Event Hub.
    val spark = SparkSession.builder.appName("SimpleStream").getOrCreate()
    val eventHubStream = spark.readStream
      .format("eventhubs")
      .options(eventHubsConf.toMap)
      .load()

It’s as easy as that! Once your events are streaming into Spark, you can process them as you wish. Spark provides a variety of processing options, such as graph analysis and machine learning. Our documentation has more details on linking our connector with your

22

Mar

Unlock your data’s potential with Azure SQL Data Warehouse and Azure Databricks

Getting the most out of your data is critical for any business in a competitive environment. Businesses need the ability to get the right data into the right hands at the right time. Azure Databricks and Azure SQL Data Warehouse can help you do just that through a Modern Data Warehouse.

Azure SQL Data Warehouse is an elastic, globally available, cloud data warehouse that leverages Massively Parallel Processing (MPP) to quickly run complex queries across petabytes of data. Azure SQL Data Warehouse provides a familiar interface for your analysts who know SQL and want to drive action in your business.

Azure Databricks, powered by Apache Spark, combines the best of Databricks and Azure to help customers accelerate innovation with one-click setup, streamlined workflows, and an interactive workspace that enables collaboration between data scientists, data engineers, and business analysts.

With the general availability of the Azure Databricks Service comes built-in support for Azure SQL Data Warehouse. This enables any data scientist or data engineer to have a seamless experience connecting their Azure Databricks cluster and their Azure SQL Data Warehouse when building advanced ETL (extract, transform, and load) for Modern Data Warehouse Architectures or accessing relational data for Machine

13

Feb

Microsoft partners with National Science Foundation to empower data science breakthroughs

Over the past decade, Microsoft has partnered with the National Science Foundation (NSF) on three separate programs, first in 2010, and more recently through a commitment of $6M in cloud credits across two NSF-supported data science programs: the Big Data Regional Innovation Hubs and the NSF BigData solicitation.

The engagement with NSF has helped Microsoft reach diverse research groups such as the Big Data Hubs, which bring together communities of data scientists to spark and nurture collaborations between domain experts, researchers, communities, state partners, nonprofits, and industry.

As of today, Microsoft has provided 17 cloud credit awards to Principal Investigators (PIs) who benefit from NSF-supported programs. These collaborations are already seeing some interesting breakthroughs across the human body, microbial diseases, and even everyday communication –

Franco Pestilli, Assistant Professor in Psychology, Neuroscience and Cognitive Science at Indiana University, is an Azure awardee and PI through the Midwest Big Data Hub. His group has built a platform called Brainlife using the Azure award, with the goal of fostering collaboration among sixty-six different global scientific communities such as developmental and learning sciences, network science, computer science, engineering, psychology, statistics, traumatic brain injury, and vision science. Chirag

31

Jan

Three new reasons to love the TSI explorer

Today we’re pleased to announce three new Time Series Insights (TSI) explorer capabilities that we think our users are going to love. 

First, we are delighted to share that the TSI explorer, the visualization service of TSI, is now generally available and backed by our SLA.  Second, we’ve made the TSI explorer more accessible and easier to use for those with visual and fine-motor disabilities. And finally, we’ve made it easy to export aggregate event data to other analytics tools like Microsoft Excel. 

Now that the TSI explorer is generally available, users will notice that the explorer is backed by TSI’s service level agreement (SLA), and we’ve removed the preview moniker from the backsplash when the explorer is loading. We have many customers using TSI in production environments and we’re thrilled to offer them the same SLA that backs the rest of the product. The ActionPoint IoT-PREDICT solution is a great example of one of those customers using the TSI explorer to enable their customers to explore and analyze time series data quickly. Check out their solution below.

There are no limits to what people can achieve when technology reflects the diversity of everyone who uses it. Transparency, accountability, and

25

Jan

Accelerated Spark on GPU-enabled clusters in Azure

The ability to run Spark on a GPU-enabled cluster demonstrates a unique convergence of big data and high-performance computing (HPC) technologies. In the past several years, we’ve seen the GPU market explode as companies all over the world integrate AI and other HPC workflows into their businesses. TensorFlow, a framework designed to utilize GPUs for numerical computation and neural networks, has skyrocketed in popularity, a testament to the rise of AI and, consequently, the demand for GPUs. Simultaneously, the need for big data and powerful data processing engines has never been greater as hundreds of companies start to collect data in the petabyte range.

With infrastructure that pairs high-performance hardware such as GPUs with big data engines such as Spark, data scientists and data engineers can enable many scenarios that would otherwise be difficult to achieve.

Along with the recent release of our latest GPU SKUs, I’m excited to share that we now support running Spark on a GPU-enabled cluster using the Azure Distributed Data Engineering Toolkit (AZTK). In a single command, AZTK allows you to provision on-demand GPU-enabled Spark clusters on top of Azure Batch’s infrastructure, helping you take your high-performance implementations that are usually

04

Jan

Using Qubole Data Service on Azure to analyze retail customer feedback

It has been a busy season for many retailers. During this time, retailers are using Azure to analyze various types of data to help accelerate purchasing decisions. The Azure cloud not only gives retailers the compute capacity to handle peak times, but also the data analytics tools to better understand their customers.

Many retailers have a treasure trove of information in the thousands, or millions, of product reviews provided by their customers. Often, it takes time for particular reviews to show their value because customers “vote” for helpful or not helpful reviews over time. Using machine learning, retailers can automate identifying useful reviews in near real-time and leverage that insight quickly to build additional business value.

But how might a retailer without deep big data and machine learning expertise even begin to conduct this type of advanced analytics on such a large quantity of unstructured data? We will be holding a workshop in January to show you how easy that can be through the use of Azure and Qubole’s big data service.

Using these technologies, anyone can quickly spin up a data platform and train a machine learning model utilizing Natural Language Processing (NLP) to identify the most useful reviews.