Category Archives: Big Data

18 Oct

Parameterize connections to your data stores in Azure Data Factory

Azure Data Factory (ADF) enables hybrid data movement from more than 70 data stores in a serverless fashion. Often users want to connect to multiple data stores of the same type. For example, you might want to connect to 10 different databases on your Azure SQL server, where the only difference between those 10 databases is the database name. You can now parameterize the linked service in your Azure Data Factory. In this case, you can parameterize the database name in your ADF linked service instead of creating 10 separate linked services for the 10 Azure SQL databases. This reduces overhead and improves manageability for your data factories. You can then dynamically pass the database names at runtime. Simply create a new linked service and click Add Dynamic Content underneath the property that you want to parameterize.

You can also parameterize other properties of your linked service, such as the server name and username. We recommend not parameterizing passwords or secrets; instead, store all connection strings in Azure Key Vault and parameterize the “Secret Name”. The user experience also guides you if you type incorrect syntax when parameterizing the…
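
For illustration, here is a minimal sketch of what a parameterized Azure SQL Database linked service definition could look like, written as a Python dictionary that mirrors the JSON you would author in ADF. The parameter name DBName and the server name are placeholders of our own, not values from the original post.

import json

# A parameterized linked service definition: DBName is declared under
# "parameters" and referenced in the connection string through the
# @{linkedService().DBName} expression, so the database name is supplied at runtime.
linked_service = {
    "name": "AzureSqlDatabaseLS",
    "properties": {
        "type": "AzureSqlDatabase",
        "parameters": {"DBName": {"type": "String"}},
        "typeProperties": {
            "connectionString": (
                "Server=tcp:myserver.database.windows.net,1433;"
                "Database=@{linkedService().DBName};"
            )
        },
    },
}

print(json.dumps(linked_service, indent=2))

A dataset or pipeline that uses this linked service then passes a concrete value for DBName at runtime, so one definition covers all 10 databases.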

16 Oct

Apache Spark jobs gain up to 9x speed up with HDInsight IO Cache

Today, we are pleased to announce the preview of HDInsight IO Cache, a new transparent data caching feature of Azure HDInsight that provides customers with up to a 9x performance improvement for Apache Spark jobs. We know from our customers that, when it comes to analytics, the cost efficiency of managed cloud-based Apache Hadoop and Spark services is one of their major attractions. HDInsight IO Cache strengthens this key value proposition even further by improving performance without a corresponding increase in costs.

Architecture

Azure HDInsight is a cloud platform service for open source analytics that aims to bring the best open source projects to Azure and integrate them natively. There are many open source caching projects in the ecosystem: Alluxio, Ignite, and RubiX, to name a few prominent ones.

HDInsight IO Cache is based on RubiX. RubiX is one of the more recent projects and has a distinct architecture. Unlike other caching projects, it doesn’t reserve operating memory for caching purposes. Instead, it leverages recent advances in SSD technology to their fullest potential to make explicit memory management unnecessary. Modern SSDs routinely provide more than 1 GB per second of bandwidth. Coupled with automatic operating system in-memory…
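
Because the cache is transparent, a Spark job does not change at all; once the feature is enabled, repeated reads of the same remote data are served from local SSD. The sketch below, with a hypothetical storage path, shows the kind of repeated-read workload that benefits; actual speed-ups depend on the workload.

import time
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("io-cache-demo").getOrCreate()

# Hypothetical Blob storage path; IO Cache accelerates remote storage reads.
path = "wasbs://data@myaccount.blob.core.windows.net/events/"

def timed_count(label):
    start = time.time()
    rows = spark.read.parquet(path).count()
    print(f"{label}: {rows} rows in {time.time() - start:.1f}s")

timed_count("first read (fetched from remote storage)")
timed_count("second read (served from the local SSD cache when IO Cache is enabled)")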

08 Oct

A fast, serverless, big data pipeline powered by a single Azure Function

A single Azure function is all it took to fully implement an end-to-end, real-time, mission-critical data pipeline, and it was done with a serverless architecture. Serverless architectures simplify the building, deployment, and management of cloud-scale applications. Instead of worrying about data infrastructure such as server procurement, configuration, and management, a data engineer can focus on the tasks required to deliver an end-to-end, well-functioning data pipeline.

This blog describes an Azure function and how it efficiently coordinated a data ingestion pipeline that processed over eight million transactions per day.

Scenario

A large bank wanted to build a solution to detect fraudulent transactions submitted through mobile phone banking applications. The solution requires a big data pipeline approach: high volumes of real-time data are ingested into a cloud service, where a series of data transformation and extraction activities occur. This results in the creation of a feature data set and the use of advanced analytics. For the bank, the pipeline had to be very fast and scalable; end-to-end evaluation of each transaction had to complete in less than two seconds.

Telemetry from the bank’s multiple application gateways streams in as embedded events in complex JSON files. The ingestion technology…
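
As a rough illustration of the extraction step described above, the sketch below flattens embedded transaction events out of a gateway JSON payload. The field names ("events", "transaction", "gateway") are hypothetical, since the actual payload schema is not described in the post.

import json

def extract_transactions(raw_payload: str):
    """Yield one flat record per embedded transaction event."""
    doc = json.loads(raw_payload)
    for event in doc.get("events", []):
        txn = event.get("transaction")
        if txn is None:
            continue
        yield {
            "gateway_id": doc.get("gateway"),
            "event_time": event.get("timestamp"),
            **txn,  # merge the transaction fields into the flat record
        }

In a serverless setup, logic like this would run inside the Azure function that coordinates the pipeline, emitting flat records to the downstream scoring and storage services.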

03 Oct

Bring Your Own Keys for Apache Kafka on HDInsight

One of the biggest security and compliance requirements for enterprise customers is to encrypt their data at rest using their own encryption key. This is even more critical in a post-GDPR world. Today, we’re announcing the public preview of Bring Your Own Key (BYOK) for data at rest in Apache Kafka on Azure HDInsight.

Azure HDInsight clusters already provide several levels of security. At the perimeter level, traffic can be controlled via Virtual Networks and Network Security Groups. Kerberos authentication and Apache Ranger provide the ability to finely control access to Kafka topics. Further, all managed disks are protected via Azure Storage Service Encryption (SSE). However, for some customers it is vital that they own and manage the keys used to encrypt the data at rest. Some customers achieve this by encrypting all Kafka messages in their producer applications and decrypting them in their consumer applications. This process is cumbersome and involves custom logic. Moreover, it doesn’t allow for the use of community-supported connectors.

With HDInsight Kafka’s support for Bring Your Own Key (BYOK), encryption at rest is a one-step process handled during cluster creation. Customers should use a user-assigned managed identity with Azure Key Vault (AKV) to…
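
As a rough sketch only, the snippet below shows what the BYOK-related portion of a cluster-creation request might contain: a user-assigned managed identity plus a pointer to the key in Azure Key Vault. The property names and resource IDs here are illustrative assumptions, not taken from the post; check the HDInsight documentation for the exact schema.

# Illustrative only: property names may not match the current HDInsight API exactly.
byok_cluster_settings = {
    "identity": {
        "type": "UserAssigned",
        "userAssignedIdentities": {
            "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
            "Microsoft.ManagedIdentity/userAssignedIdentities/kafka-byok-msi": {}
        },
    },
    "properties": {
        "diskEncryptionProperties": {
            "vaultUri": "https://mykeyvault.vault.azure.net",  # AKV that holds the key
            "keyName": "kafka-disk-key",
            "keyVersion": "<key-version>",
            "msiResourceId": "/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
                             "Microsoft.ManagedIdentity/userAssignedIdentities/kafka-byok-msi",
        }
    },
}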

02 Oct

Cloud Scale Analytics meets Office 365 data – empowered by Azure Data Factory

Office 365 holds a wealth of information about how people work and how they interact and collaborate with each other, and this valuable asset enables intelligent applications to derive insights and optimize organizational productivity. Today, application developers use the Microsoft Graph API to access Office 365 in a transactional way. This approach, however, is not efficient if you need to analyze a large amount of Office 365 artifacts across a long time horizon. Further, Office 365 data is isolated from other business data and systems, leading to data silos and untapped opportunity for additional insights.
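
For context, the transactional pattern looks roughly like the sketch below: one Microsoft Graph REST call per user and resource, which works for application scenarios but does not scale to analyzing years of artifacts across an organization. Token acquisition is omitted, and the query parameters are only an example.

import requests

GRAPH = "https://graph.microsoft.com/v1.0"

def fetch_recent_messages(user_id: str, access_token: str):
    # One HTTP round trip per user: workable transactionally, but slow and
    # throttled when scanned across thousands of users and long time ranges.
    resp = requests.get(
        f"{GRAPH}/users/{user_id}/messages",
        headers={"Authorization": f"Bearer {access_token}"},
        params={"$top": 10, "$select": "subject,receivedDateTime"},
    )
    resp.raise_for_status()
    return resp.json()["value"]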

Azure offers a rich set of hyperscale analytics services with enterprise-grade security, available in data centers worldwide. By bringing Office 365 data into Azure, developers can harness the full power of Azure to build highly scalable and secure applications that combine Office 365 data with other business data.

This week at Ignite we announced the public preview of Microsoft Graph data connect, which enables secure, governed, and scalable access to Office 365 data in Azure. With this offering, for the very first time, all your data: organizational, customer, transactional, external…

02 Oct

Enabling real-time data warehousing with Azure SQL Data Warehouse

Gaining insights rapidly from data is critical to competitiveness in today’s business world. Azure SQL Data Warehouse (SQL DW), Microsoft’s fully managed analytics platform, leverages Massively Parallel Processing (MPP) to run complex interactive SQL queries at every level of scale.

Users today expect data within minutes, a departure from traditional analytics systems, which used to operate with a data latency of a day or more. With the requirement for faster data, users need ways of moving data from source systems into their analytical stores in a simple, quick, and transparent fashion. To deliver on modern analytics strategies, users must be acting on current information. This means enabling the continuous movement of enterprise data, from on-premises to cloud and everything in between.

The SQL Data Warehouse team is happy to announce that Striim now fully supports SQL Data Warehouse as a target for Striim for Azure. Striim enables continuous, non-intrusive, performant ingestion of all your enterprise data from a variety of sources in real time. This means that users can build intelligent pipelines for change data capture from sources such as Oracle Exadata straight into SQL Data Warehouse. Striim can also be used to move fast…

01 Oct

HDInsight Enterprise Security Package now generally available

Enterprise Security Package GA for HDInsight 3.6

The HDInsight team is excited to announce the general availability of the Enterprise Security Package (ESP) for Apache Spark, Apache Hadoop, and Interactive Query clusters in HDInsight 3.6. When enterprise customers share clusters among multiple employees, Hadoop admins must ensure those employees have the right set of access rights and permissions to perform big data operations. Providing multi-user access with granular authorization, using the same identities employees already use in the enterprise, is a complex and lengthy process. Enabling ESP with the new experience provides authentication and authorization for these clusters in a more streamlined and secure manner.

For authentication, open source Apache Hadoop relies on Kerberos. Customers can enable Azure AD Domain Services (AAD-DS) as the main domain controller and use it for domain joining the clusters. The same identities available in AAD-DS will then be able to log in to the cluster.

For authorization, customers can set Apache Ranger policies to get fine-grained authorization in their clusters. Apache Hive and YARN Ranger plugins are available for setting these policies.
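
Ranger policies are normally created in the Ranger admin UI; as a rough sketch, the snippet below creates a Hive policy through Ranger's public REST API. The Ranger host, service name, database, and group names are hypothetical, and the exact payload shape should be checked against the Ranger documentation for your cluster version.

import requests

RANGER = "https://mycluster.azurehdinsight.net/ranger"  # hypothetical endpoint

policy = {
    "service": "mycluster_hive",          # the Hive repository registered in Ranger
    "name": "analysts-read-sales",
    "resources": {
        "database": {"values": ["sales"]},
        "table": {"values": ["*"]},
        "column": {"values": ["*"]},
    },
    "policyItems": [
        {
            "groups": ["analysts"],
            "accesses": [{"type": "select", "isAllowed": True}],
        }
    ],
}

resp = requests.post(
    f"{RANGER}/service/public/v2/api/policy",
    json=policy,
    auth=("admin", "<password>"),
)
resp.raise_for_status()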

To learn more about ESP and how to enable it, see our documentation.

Public preview of ESP for Apache Kafka and HBase

We are also expanding…

27 Sep

Spark Debugging and Diagnosis Toolset for Azure HDInsight

Debugging and diagnosing large, distributed big data jobs is a hard and time-consuming process. Debugging big data queries and pipelines has become more critical for enterprises and includes debugging across many executors, fixing complex data flow issues, diagnosing data patterns, and debugging problems with cluster resources. The lack of enterprise-ready Spark job management capabilities constrains the ability of enterprise developers to collaboratively troubleshoot, diagnose, and optimize the performance of workflows.

Microsoft is now bringing its decade-long experience of running and debugging millions of big data jobs to the open source world of Apache Spark. Today, we are delighted to announce the public preview of the Spark Diagnosis Toolset for HDInsight, for clusters running Spark 2.3 and up. We are adding a set of diagnosis features to the default Spark history server user experience, in addition to our previously released Job Graph and Data tabs. The new diagnosis features assist you in identifying low parallelization, detecting data skew and running skew analysis, gaining insights into stage data distribution, and viewing executor allocation and usage.

Data and time skew detection and analysis

Development productivity is key to making enterprise technology teams successful. The Azure HDInsight developer toolset brings industry-leading development practices to…
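
As a rough illustration of what a data skew analysis looks for, the sketch below counts records per key and compares the heaviest keys with the median; the storage path and the column name customer_id are placeholders, not details from the post. The new history server features surface this kind of information without hand-written checks.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("skew-check").getOrCreate()

# Hypothetical input path and key column.
df = spark.read.parquet("wasbs://data@myaccount.blob.core.windows.net/transactions/")
key_counts = df.groupBy("customer_id").count()

stats = key_counts.agg(
    F.expr("percentile_approx(`count`, 0.5)").alias("median_rows_per_key"),
    F.max("count").alias("largest_key_rows"),
).first()

print(f"median rows per key: {stats['median_rows_per_key']}, "
      f"largest key: {stats['largest_key_rows']}")

# The heaviest keys are the skew candidates that slow down a handful of tasks.
key_counts.orderBy(F.desc("count")).show(10)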

25 Sep

Azure HDInsight and Starburst bring Presto to Microsoft Azure customers

This blog post was co-authored by Matthew Fuller, Co-Founder & VP at Starburst.

Microsoft and Starburst are excited to announce that Starburst Presto has been added to the Azure HDInsight Application Platform. With the Azure HDInsight Application Platform, Microsoft has enabled a broad set of big data and advanced analytics solutions so customers can deploy them with a single click.

Presto is a fast and scalable distributed SQL query engine. Architected for the separation of storage and compute, Presto can easily query data in Azure Blob Storage, Azure Data Lake Storage, SQL and NoSQL databases, and other data sources.
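
As a small sketch of what this looks like from client code, the example below uses the presto-python-client package (an assumption on our part, not something the post prescribes) to run a federated query; the host, catalogs, and table names are hypothetical.

import prestodb  # presto-python-client; assumed to be installed

# Presto's connectors let one SQL statement join data in Azure Blob/ADLS
# (exposed through the Hive catalog) with other sources such as a relational database.
conn = prestodb.dbapi.connect(
    host="mycluster-presto.azurehdinsight.net",  # hypothetical cluster endpoint
    port=8080,
    user="analyst",
    catalog="hive",
    schema="default",
)
cur = conn.cursor()
cur.execute("""
    SELECT c.region, count(*) AS orders
    FROM hive.sales.orders o
    JOIN sqlserver.dbo.customers c ON o.customer_id = c.id
    GROUP BY c.region
""")
for row in cur.fetchall():
    print(row)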

Adding Presto gives HDInsight users two things:

A fast, scalable, interactive SQL interface to data in Azure Blob and Azure Data Lake Storage.

An easy way to create queries that integrate data in Azure Blob and Azure Data Lake Storage with other sources by leveraging Presto’s vast portfolio of data connectors.

The new Presto option complements other existing open source components on HDInsight such as HBase, Storm, Spark, R, Kafka, and Interactive Query. This further enables customers to use the open source tools most suited for their workloads.

The Starburst Presto distribution delivers fast performance (enabled via cost-based…

25 Sep

Deep dive into Azure HDInsight 4.0

We are thrilled to announce that HDInsight 4.0 is now available in public preview. HDInsight 4.0 brings the latest Apache Hadoop 3.0 innovations, representing over five years of work by the open source community and our partner Hortonworks across key Apache frameworks, to solve ever-growing big data and advanced analytics challenges. With this release, we are bringing new enhancements to all big data open source frameworks on HDInsight.

This blog highlights new capabilities we are enabling for Apache Hive 3.0, Hive-Spark integration, Apache HBase, and Apache Phoenix.

Apache Hive 3.0 improvements for fast queries and transactions

To date, driving comprehensive BI on historical and real-time data at scale remains a complex and challenging task. Many organizations have stitched together multiple open source and proprietary tools in order to build a workable BI solution. These solutions often require tedious data movement, complex pipeline management, or continuous manual data tiering to keep data hot. They are often complex to build, difficult to operate, and hard to scale.

Our customers are increasingly looking for simpler yet powerful, enterprise-grade solutions. We at Microsoft are obsessed with the idea of enabling real-time analytics directly on top of data lakes, reducing the need for data movement for…
