Category Archives : Big Data

09

Aug

How Microsoft drives exabyte analytics on the world’s largest YARN cluster

At Microsoft, as at many companies that use data for competitive advantage, opportunities for insight abound, and our analytics needs were scaling fast, almost out of control. We invested in Yet Another Resource Negotiator (YARN) to meet the demands of an exabyte-scale analytics platform and ended up creating the world’s largest YARN cluster.

In big data, how big is really big?

YARN is known to scale to thousands of nodes, but what happens when you need tens of thousands of nodes? The Cloud and Information Services Lab (CISL) at Microsoft is a highly specialized team of experts who work on applied research and science initiatives focusing on data processing and distributed systems. This blog explains how CISL and the Microsoft Big Data team met the challenge of complex scale and resource management, and ended up implementing the world’s largest YARN cluster to drive Microsoft’s exabyte-sized analytics.

Exabyte-size analytics

For more than a decade, Microsoft has depended on an internal version of the publicly available Azure Data Lake for its own super-sized analytics. The volume of data and complexity of calculation have caused it to scale to several large clusters. To the best of our knowledge, Microsoft is currently running the largest YARN

07

Aug

Azure HDInsight Interactive Query: Ten tools to analyze big data faster

Customers use HDInsight Interactive Query (also called Hive LLAP, or Low Latency Analytical Processing) to run fast queries on data stored in Azure Storage and Azure Data Lake Storage. Interactive Query makes it easy for developers and data scientists to work with big data using the BI tools they love the most, and it supports several tools for accessing that data easily. In this blog we list the most popular tools used by our customers:

Microsoft Power BI

Microsoft Power BI Desktop has a native connector for running direct queries against an HDInsight Interactive Query cluster. You can explore and visualize the data interactively. To learn more, see Visualize Interactive Query Hive data with Power BI in Azure HDInsight and Visualize big data with Power BI in Azure HDInsight.

Apache Zeppelin

Apache Zeppelin’s interpreter concept allows any language or data-processing backend to be plugged into Zeppelin. You can access Interactive Query from Apache Zeppelin using the JDBC interpreter. To learn more, see Use Zeppelin to run Hive queries in Azure HDInsight.
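As a rough illustration of what a JDBC interpreter needs in order to reach Interactive Query, the sketch below assembles a HiveServer2 connection string in Python. The URL shape (HTTP transport over port 443 with an `/hive2` path) is an assumption based on the commonly documented HDInsight format, and `build_hive_jdbc_url` and `mycluster` are hypothetical names; check the HDInsight documentation for the values your cluster expects.

```python
def build_hive_jdbc_url(cluster_name: str, database: str = "default") -> str:
    """Build a HiveServer2 JDBC URL for an HDInsight cluster.

    Assumption: the cluster is reached over HTTPS on port 443 using
    HTTP transport mode with the /hive2 path. Verify against your cluster.
    """
    if not cluster_name:
        raise ValueError("cluster_name must be non-empty")
    host = f"{cluster_name}.azurehdinsight.net"
    # Session parameters are appended after the database, separated by ';'.
    params = "transportMode=http;ssl=true;httpPath=/hive2"
    return f"jdbc:hive2://{host}:443/{database};{params}"


if __name__ == "__main__":
    # "mycluster" is a placeholder cluster name.
    print(build_hive_jdbc_url("mycluster"))
```

The resulting string is the kind of value that would go into the JDBC interpreter's connection URL setting, alongside the cluster login credentials.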

Visual Studio Code

With HDInsight Tools for VS Code, you can submit interactive queries as well as look at job information in HDInsight Interactive Query

06

Aug

Accelerate healthcare initiatives with Azure UK NHS blueprints

Today, the healthcare industry is confronting many complex and daunting challenges that include demands to:

Increase patient engagement.
Take advantage of big data, analytics, artificial intelligence (AI), and machine learning (ML).
Integrate consumer health apps, wearables, and the Internet of Medical Things (IoMT).
Combat cybersecurity threats, breaches, and ransomware.

In the midst of this, however, healthcare organizations must continue to:

Deliver the best patient care.
Improve patient outcomes.
Reduce healthcare costs (now 7 percent of GDP in the UK and almost 18 percent of GDP in the United States).
Enhance patient and clinician experiences.

And all with limited budget and resources!

Cloud computing can help healthcare organizations focus on patient care and reducing costs, and it enables IT to be more flexible, agile, scalable, and secure as the healthcare industry changes and grows.

A key challenge to adopting cloud computing is that healthcare needs solutions, not IT projects. Healthcare organizations of every size often have limited IT and cybersecurity resources burdened with maintaining existing IT infrastructure.

So how can they create new solutions?

Rx: Blueprints

To rapidly acquire new capabilities and implement new solutions, healthcare IT and developers can now take advantage of industry-specific Azure Blueprints. These are packages that

26

Jul

How to enhance HDInsight security with service endpoints

HDInsight enterprise customers work with some of the most sensitive data in the world, and they want to be able to lock down access to this data at the networking layer as well. However, while service endpoints have been available for Azure data sources, HDInsight customers couldn’t use this additional layer of security for their big data pipelines because of the lack of interoperability between HDInsight and other data stores. As we recently announced, HDInsight now supports service endpoints for Azure Blob Storage, Azure SQL Database, and Azure Cosmos DB.

With this enhanced level of security at the networking layer, customers can now lock down their big data storage accounts to their specified Virtual Networks (VNETs) and still use HDInsight clusters seamlessly to access and process that data.

In the rest of this post we will explore how to enable service endpoints and point out important HDInsight configurations for Azure Blob Storage, Azure SQL Database, and Azure Cosmos DB.

Azure Blob Storage:

When using Azure Blob Storage with HDInsight, you can configure selected VNETs in the storage account’s firewall settings. This ensures that only traffic from the subnets you select can access the storage account.

It is important to

26

Jul

Avoid Big Data pitfalls with Azure HDInsight and these partner solutions

According to a Gartner 2017 prediction, “60 percent of big data projects will fail to go beyond piloting and experimentation, these projects will be abandoned”.

Whether you have already worked on an analytics project or are just starting one, it is a challenge on any cloud. You need to juggle the intricacies of cloud provider services, open source frameworks, and the apps in the ecosystem. Apache Hadoop and Spark are vibrant open source ecosystems that have enabled enterprises to digitally transform their businesses using data. According to Matt Turck, a VC at FirstMark, it has been an exciting but complex year in the data world. “The data tech ecosystem has continued to fire on all cylinders. If nothing else, data is probably even more front and center in 2018, in both business and personal conversations”.

However, with great power comes great responsibility. A successful project takes a lot more than just using open source or a managed platform. You have to deal with:

The complexity of combining all the open source frameworks.
Architecting a data lake to get insights for data engineers, data scientists and BI users.
Meeting enterprise regulations such as security, access control, data sovereignty &

23

Jul

IoT: the catalyst for better risk management in insurance

Thought leader Matteo Carbone has titled his book All the Insurance Players Will Be Insurtech. He means that insurance companies that embrace digital transformation and technologies will lead the industry. Those technologies include the Internet of Things (IoT), Artificial Intelligence (AI), Machine Learning (ML), and Big Data. Carbone believes that the use of new technologies gives insurers “superpowers” to assess risk more accurately, manage risk continually, and mitigate risk in real-time.

The process of getting superpowers is the process of converting IoT data into actionable insights, and using those insights to reduce risk through prevention and mitigation of claim events. As the powers grow, so do the benefits to insurance customers and providers. Insurers can also increase the pace of customer interaction. The growth in the number of interactions produces more data points, and the same data is used to prevent or mitigate risk, while driving the sale of additional services outside the traditional insurance value chain. Remote monitoring and emergency alert services also provide peace of mind to the customer. These are only the start of additional services. Insurance companies are now selling a matrix of other services layered on top of the base policy. The income of

23

Jul

Build secure Oozie workflows in Azure HDInsight with Enterprise Security Package

Customers love to use Hadoop and often rely on Oozie, a workflow and coordination scheduler for Hadoop, to accelerate and ease their big data implementation. Oozie is integrated with the Hadoop stack, and it supports several types of Hadoop jobs. However, for users of Azure HDInsight with domain-joined clusters, Oozie was not a supported option. To get around this limitation, customers had to run Oozie on a regular cluster, which was costly and added administrative overhead. Today we are happy to announce that customers can now use Oozie in domain-joined Hadoop clusters too.

In domain-joined clusters, authentication happens through Kerberos and fine-grained authorization is through Ranger policies. Oozie supports impersonation of users and a basic authorization model for workflow jobs.

Moreover, HiveServer2 actions submitted as part of an Oozie workflow are logged and are auditable through Ranger too. Fine-grained authorization through Ranger is enforced on Oozie jobs only when Ranger policies are present; otherwise, coarse-grained authorization based on HDFS (only available on ADLS Gen1) is enforced.

Learn more about how to create an Oozie workflow and submit jobs in a domain-joined cluster, and how to use Oozie with Hadoop to define and run a
