Category Archives: Big Data

27 Aug

Sharing a self-hosted Integration Runtime infrastructure with multiple Data Factories

The Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory to provide data integration capabilities across different network environments. If you need to perform data integration and orchestration securely in a private network environment, which does not have a direct line-of-sight from the public cloud environment, you can install a self-hosted IR on premises behind your corporate firewall, or inside a virtual private network.

Until now, you were required by design to create at least one such compute infrastructure in every Data Factory that needed hybrid or on-premises data integration. This means that if you had ten data factories, used by different project teams to access on-premises data stores and orchestrate inside a VNet, you would have to create ten self-hosted IR infrastructures, adding cost and management overhead for IT teams.

With the new capability of self-hosted IR sharing, you can share the same self-hosted IR infrastructure across data factories. This lets you reuse the same highly available and scalable self-hosted IR infrastructure from different data factories within the same Azure Active Directory tenant. We are introducing the new concept of a linked self-hosted IR, which references another self-hosted IR infrastructure. This does not introduce …
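To make the linkage concrete, here is a minimal, hypothetical sketch using the azure-mgmt-datafactory Python SDK; the subscription, resource group, factory, and IR names are all placeholders, and exact model names may vary by SDK version:

```python
# Hypothetical sketch: creating a linked self-hosted IR with the
# azure-mgmt-datafactory Python SDK. All resource names are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    IntegrationRuntimeResource,
    SelfHostedIntegrationRuntime,
    LinkedIntegrationRuntimeRbacAuthorization,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

# Resource ID of the existing (shared) self-hosted IR in the "primary" factory.
shared_ir_id = (
    "/subscriptions/<subscription-id>/resourceGroups/rg-shared"
    "/providers/Microsoft.DataFactory/factories/adf-shared"
    "/integrationruntimes/SharedSelfHostedIR"
)

# In the consuming factory, create a linked IR that references the shared one
# instead of installing a second self-hosted IR node.
linked_ir = IntegrationRuntimeResource(
    properties=SelfHostedIntegrationRuntime(
        description="Linked IR referencing the shared self-hosted IR",
        linked_info=LinkedIntegrationRuntimeRbacAuthorization(resource_id=shared_ir_id),
    )
)
client.integration_runtimes.create_or_update(
    "rg-team", "adf-team-project", "LinkedSelfHostedIR", linked_ir
)
```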


16 Aug

Azure #HDInsight Interactive Query: simplifying big data analytics architecture

Fast interactive BI, data security, and end-user adoption are three critical challenges for successful big data analytics implementations. Without the right architecture and tools, many big data and analytics projects fail to catch on with everyday BI users and enterprise security architects. In this blog we discuss architectural approaches that help you architect a big data solution for fast interactive queries, a simplified security model, and improved adoption by BI users.

Traditional approach to fast interactive BI

Deep analytical queries processed on Hadoop systems have traditionally been slow. MapReduce jobs and Hive queries are used for heavy processing of large datasets; however, they are not suitable for the fast response times required by interactive BI.

Faced with user dissatisfaction due to the lack of query interactivity, data architects turned to techniques such as building OLAP cubes on top of Hadoop. An OLAP cube is a mechanism for storing all the different dimensions, measures, and hierarchies up front. The cube is usually processed at a pre-specified interval, so the results are available in advance; when the BI tool queries the cube, it only needs to locate the result, which limits the query response time and makes the experience fast and interactive.
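As a toy illustration of why cube pre-processing makes queries fast, the Python sketch below pre-aggregates a measure for every dimension combination, so an interactive query becomes a lookup instead of a scan; the data and dimension names are invented:

```python
# Illustrative-only sketch of the OLAP cube idea: aggregates are computed once
# up front, so a BI query is a dictionary lookup rather than a scan.
from collections import defaultdict
from itertools import combinations

rows = [
    {"region": "EMEA", "product": "A", "year": 2018, "sales": 120.0},
    {"region": "EMEA", "product": "B", "year": 2018, "sales": 75.0},
    {"region": "APAC", "product": "A", "year": 2018, "sales": 200.0},
]
DIMENSIONS = ("region", "product", "year")

def build_cube(rows):
    """Pre-aggregate SUM(sales) for every combination of dimensions."""
    cube = defaultdict(float)
    for row in rows:
        for r in range(len(DIMENSIONS) + 1):
            for dims in combinations(DIMENSIONS, r):
                key = tuple(sorted((d, row[d]) for d in dims))
                cube[key] += row["sales"]
    return cube

cube = build_cube(rows)  # processed once, at the "pre-specified interval"

# Interactive query = O(1) lookup; no scan of the raw data at query time.
print(cube[(("region", "EMEA"),)])               # 195.0
print(cube[(("product", "A"), ("year", 2018))])  # 320.0
```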


16 Aug

Azure #HDInsight Apache Phoenix now supports Zeppelin

The HDInsight team is excited to announce Apache Zeppelin support for Apache Phoenix.

Phoenix in Azure HDInsight

Apache Phoenix is an open-source, massively parallel relational database layer built on HBase. Phoenix allows you to use SQL-like queries over HBase. Underneath, Phoenix uses JDBC drivers to let users create, delete, and alter SQL tables, indexes, views, and sequences, and upsert rows individually and in bulk. Phoenix compiles queries into native HBase (NoSQL) calls rather than MapReduce jobs, enabling the creation of low-latency applications on top of HBase.
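As a rough illustration of the SQL-over-HBase model, here is a minimal sketch using phoenixdb, the Python driver that talks to the Phoenix Query Server; the endpoint and table are placeholders:

```python
# Minimal sketch using the phoenixdb driver against a Phoenix Query Server.
# The host name and table are made up for illustration.
import phoenixdb

conn = phoenixdb.connect("http://phoenix-queryserver:8765/", autocommit=True)
cursor = conn.cursor()

# DDL and upserts are plain SQL, even though the storage engine is HBase.
cursor.execute(
    "CREATE TABLE IF NOT EXISTS events ("
    "  id BIGINT NOT NULL PRIMARY KEY,"
    "  kind VARCHAR,"
    "  payload VARCHAR)"
)
cursor.execute("UPSERT INTO events VALUES (1, 'click', 'home page')")
cursor.execute("UPSERT INTO events VALUES (2, 'view', 'pricing page')")

# SELECTs compile to native HBase scans, not MapReduce jobs.
cursor.execute("SELECT id, kind FROM events WHERE kind = ?", ["click"])
for row in cursor.fetchall():
    print(row)

conn.close()
```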

Apache Phoenix enables OLTP and operational analytics in Hadoop for low-latency applications, combining the best of both worlds. In Azure HDInsight, Apache Phoenix is delivered as a first-class open-source framework.

Why use Apache Phoenix in Azure HDInsight?

HDInsight is the best place for you to run Apache Phoenix and other open-source big data applications. HDInsight makes Apache Phoenix even better in the following ways:

An out-of-the-box, highly tuned Apache Phoenix cluster in minutes

Several large customers run their mission-critical HBase/Phoenix workloads in Azure, and over time the service has become more and more intelligent about the right configurations for running HBase workloads as efficiently as possible.


09 Aug

Azure Data Factory Visual tools now supports GitHub integration

GitHub is a development platform that allows you to host and review code, manage projects, and build software alongside millions of other developers, from open source to business. Azure Data Factory (ADF) is a managed data integration service in Azure that allows you to iteratively build, orchestrate, and monitor your Extract Transform Load (ETL) workflows. You can now integrate your Azure Data Factory with GitHub. The ADF visual authoring integration with GitHub lets you collaborate with other developers and do source control and versioning of your data factory assets (pipelines, datasets, linked services, triggers, and more). Simply click ‘Set up Code Repository’ and select ‘GitHub’ from the Repository Type dropdown to get started.

ADF-GitHub integration allows you to use either public GitHub or GitHub Enterprise, depending on your requirements. You can use OAuth authentication to log in to your GitHub account. ADF automatically pulls the repositories in your GitHub account for you to select from. You can then choose the branch that developers on your team will use for collaboration. You can also easily import all your current data factory resources into your GitHub repository.
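For teams that prefer automation over the UI, the following hypothetical sketch shows the equivalent repository setup through the azure-mgmt-datafactory Python SDK; all names, IDs, and the region below are placeholders:

```python
# Hypothetical sketch: wiring a data factory to a GitHub repo with the
# azure-mgmt-datafactory SDK, analogous to 'Set up Code Repository' in the
# ADF UI. All names and IDs are placeholders.
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient
from azure.mgmt.datafactory.models import (
    FactoryGitHubConfiguration,
    FactoryRepoUpdate,
)

client = DataFactoryManagementClient(DefaultAzureCredential(), "<subscription-id>")

repo_config = FactoryGitHubConfiguration(
    account_name="contoso",           # GitHub account or organization
    repository_name="adf-pipelines",
    collaboration_branch="main",      # branch the team collaborates on
    root_folder="/",                  # folder where ADF resources are stored
)

factory_id = (
    "/subscriptions/<subscription-id>/resourceGroups/rg-data"
    "/providers/Microsoft.DataFactory/factories/adf-team"
)

# configure_factory_repo is invoked against the factory's Azure region.
client.factories.configure_factory_repo(
    "eastus",
    FactoryRepoUpdate(factory_resource_id=factory_id, repo_configuration=repo_config),
)
```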

Once you enable ADF-GitHub integration, you can now save your data factory resources …


09 Aug

How Microsoft drives exabyte analytics on the world’s largest YARN cluster

At Microsoft, like many companies using data for competitive advantage, opportunities for insight abound and our analytics needs were scaling fast – almost out of control. We invested in Apache Hadoop YARN (Yet Another Resource Negotiator) to meet the demands of an exabyte-scale analytics platform and ended up creating the world’s largest YARN cluster.

In big data, how big is really big?

YARN is known to scale to thousands of nodes, but what happens when you need tens of thousands of nodes? The Cloud and Information Services Lab (CISL) at Microsoft is a highly specialized team of experts working on applied research and science initiatives focused on data processing and distributed systems. This blog explains how CISL and the Microsoft Big Data team met the challenge of complex scale and resource management – and ended up implementing the world’s largest YARN cluster to drive exabyte-sized analytics.

Exabyte-size analytics

For more than a decade, Microsoft has depended on an internal version of the publicly available Azure Data Lake for its own super-sized analytics. The volume of data and the complexity of computation have driven it to scale across several large clusters. To the best of our knowledge, Microsoft is currently running the largest YARN cluster …


07 Aug

Azure HDInsight Interactive Query: Ten tools to analyze big data faster

Customers use HDInsight Interactive Query (also called Hive LLAP, or Low Latency Analytical Processing) to query data stored in Azure Storage and Azure Data Lake Storage at interactive speed. Interactive Query makes it easy for developers and data scientists to work with big data using the BI tools they love most. HDInsight Interactive Query supports several tools for accessing big data easily. In this blog we list the most popular tools used by our customers:

Microsoft Power BI

Microsoft Power BI Desktop has a native connector to run DirectQuery against an HDInsight Interactive Query cluster, so you can explore and visualize the data interactively. To learn more, see Visualize Interactive Query Hive data with Power BI in Azure HDInsight and Visualize big data with Power BI in Azure HDInsight.

Apache Zeppelin

Apache Zeppelin’s interpreter concept allows any language or data-processing backend to be plugged into Zeppelin. You can access Interactive Query from Apache Zeppelin using its JDBC interpreter, as sketched below. To learn more, see Use Zeppelin to run Hive queries in Azure HDInsight.
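For a sense of what the JDBC interpreter does under the hood, here is a minimal sketch using the PyHive package against HiveServer2; the host, credentials, and query are placeholders, and it assumes direct (for example, VNet) access to the Thrift endpoint, since real HDInsight clusters are typically reached through the HTTPS gateway rather than a direct connection:

```python
# Minimal sketch of querying Hive LLAP with PyHive, mirroring what Zeppelin's
# JDBC interpreter does. Host and credentials are placeholders; this assumes
# direct access to HiveServer2's Thrift endpoint, whereas real HDInsight
# clusters usually sit behind an HTTPS gateway.
from pyhive import hive

conn = hive.connect(
    host="myhdicluster.internal.example",  # placeholder HiveServer2 host
    port=10000,
    username="admin",
    database="default",
)
cursor = conn.cursor()

# hivesampletable ships with HDInsight clusters; 'market' is one of its columns.
cursor.execute(
    "SELECT market, COUNT(*) AS hits FROM hivesampletable GROUP BY market LIMIT 10"
)
for market, hits in cursor.fetchall():
    print(market, hits)

conn.close()
```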

Visual Studio Code

With HDInsight Tools for VS Code, you can submit interactive queries as well as view job information for HDInsight Interactive Query clusters …


06 Aug

Accelerate healthcare initiatives with Azure UK NHS blueprints

Today, the healthcare industry is confronting many complex and daunting challenges that include demands to:

- Increase patient engagement.
- Take advantage of big data, analytics, artificial intelligence (AI), and machine learning (ML).
- Integrate consumer health apps, wearables, and the Internet of Medical Things (IoMT).
- Combat cybersecurity threats, breaches, and ransomware.

In the midst of this, however, healthcare organizations must continue to:

- Deliver the best patient care.
- Improve patient outcomes.
- Reduce healthcare costs (now 7 percent of GDP in the UK and almost 18 percent of GDP in the United States).
- Enhance patient and clinician experiences.

And all with limited budget and resources!

Cloud computing can help healthcare organizations focus on patient care and reduce costs, and it enables IT to be more flexible, agile, scalable, and secure as the healthcare industry changes and grows.

A key challenge to adopting cloud computing is that healthcare needs solutions, not IT projects. Healthcare organizations of every size often have limited IT and cybersecurity resources burdened with maintaining existing IT infrastructure.

So how can they create new solutions?

Rx: Blueprints

To rapidly acquire new capabilities and implement new solutions, healthcare IT teams and developers can now take advantage of industry-specific Azure Blueprints. These are packages that …
