Category Archives: Big Data

15 Nov

Azure Toolkit for IntelliJ – Spark Interactive Console

We are pleased to announce the release of the Spark Interactive Console in Azure Toolkit for IntelliJ. This new component facilitates Spark job authoring and enables you to run code interactively in a shell-like environment within IntelliJ.

The Spark console includes Spark Local Console and Spark Livy Interactive Session. When you run the Spark console, instances of SparkSession and SparkContext are automatically instantiated, just as in the Spark shell: use ‘spark’ to access the SparkSession and ‘sc’ to access the SparkContext. The Spark local console lets you run your code interactively and validate your code logic locally. You can also inspect your program’s variables and perform other scripting operations locally before submitting to the cluster. The Spark Livy interactive session establishes an interactive communication channel with your cluster, so you can check file schemas, preview data, and run ad-hoc queries while you are writing your Spark job. You can also easily point the Livy interactive session at different Spark clusters.
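As a quick illustration, the kind of interactive checks this enables might look like the following (the table name is hypothetical, and the snippet uses PySpark syntax for brevity; in the IntelliJ console, whose built-in language service targets Scala, the equivalent calls are made against the same ‘spark’ and ‘sc’ objects):

```python
# Inside the console, `spark` (SparkSession) and `sc` (SparkContext) already exist,
# just as in the Spark shell -- no session setup is required.
print(sc.version)                 # confirm which Spark version the session is using

# Hypothetical table: peek at the schema and preview a few rows before
# building out the rest of the job.
df = spark.read.table("sales_events")
df.printSchema()
df.show(5)

# Run an ad-hoc aggregation through the Livy interactive session.
spark.sql("SELECT country, COUNT(*) AS n FROM sales_events GROUP BY country").show()
```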

The Spark console has a built-in language service for Scala programming. You can use language service features such as IntelliSense and autocomplete to look up the properties of a Spark object (e.g., the Spark context or Spark session), query Hive

08 Nov

Tips and tricks for migrating on-premises Hadoop infrastructure to Azure HDInsight

Today, we are excited to share our tips and tricks series on how to migrate on-premises Hadoop infrastructure to Microsoft Azure HDInsight.

Every day, thousands of customers run their mission-critical big data workloads on Azure HDInsight. Many of our customers migrate workloads to HDInsight from on-premises due to its enterprise-grade capabilities and support for open source workloads such as Hive, Spark, Hadoop, Kafka, HBase, Phoenix, Storm, and R.

This six-part guide takes you through the migration process and not only shows you how to move your Hadoop workloads to Azure HDInsight, but also shares best practices for optimizing your architecture, infrastructure, storage, and more. The guide was written in collaboration with the Azure Customer Advisory team, drawing on a wealth of experience from helping many customers with Hadoop migrations.

Motivation and benefits covers the benefits of migrating on-premises Hadoop ecosystem components to HDInsight and how to plan for the migration. Architecture best practices provides best practices for the architecture of HDInsight systems and addresses different types of workloads. Infrastructure best practices goes into detailed recommendations for managing the infrastructure of HDInsight clusters. Storage best practices gives recommendations for data storage in HDInsight systems. Data migration best practices

06 Nov

Secure incoming traffic to HDInsight clusters in a virtual network with private endpoint

We are excited to announce the general availability of private endpoint in HDInsight clusters deployed in a virtual network. This feature enables enterprises to better isolate access to their HDInsight clusters from the public internet and enhance their security at the networking layer.

Previously, when customers deployed an HDInsight cluster in a virtual network, there was only one public endpoint available, in the form of https://<CLUSTERNAME>.azurehdinsight.net. This endpoint resolves to a public IP for accessing the cluster. Customers who wanted to restrict incoming traffic had to use network security group (NSG) rules. Specifically, they had to white-list the IPs of both the HDInsight management traffic and the end users who wanted to access the cluster. These end users might already have been located inside the virtual network, but they still had to be white-listed to reach the public endpoint. Identifying and white-listing these end users’ dynamic IPs was hard, as they would often change.

With the introduction of private endpoint, customers can now use NSG rules to distinguish access from the public internet from access by end users within the virtual network’s trusted boundary. The virtual network can be extended to the on-premises
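As an illustration only (resource and rule names are hypothetical, and the exact rule set depends on your topology; HDInsight management traffic, for example, may still need its own allow rules), NSG rules of roughly this shape admit in-VNet clients while blocking the public internet, here sketched with a recent version of the Azure SDK for Python:

```python
# A sketch, not a prescription: allow clients inside the virtual network to reach the
# cluster gateway over 443 and deny inbound traffic arriving from the public internet.
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

client = NetworkManagementClient(DefaultAzureCredential(), "<subscription-id>")

rules = [
    {   # end users within the virtual network's trusted boundary
        "name": "allow-vnet-https", "priority": 100, "direction": "Inbound",
        "access": "Allow", "protocol": "Tcp",
        "source_address_prefix": "VirtualNetwork", "source_port_range": "*",
        "destination_address_prefix": "*", "destination_port_range": "443",
    },
    {   # everything else arriving from the public internet
        "name": "deny-internet-inbound", "priority": 4000, "direction": "Inbound",
        "access": "Deny", "protocol": "*",
        "source_address_prefix": "Internet", "source_port_range": "*",
        "destination_address_prefix": "*", "destination_port_range": "*",
    },
]

for rule in rules:
    client.security_rules.begin_create_or_update(
        "my-resource-group", "hdinsight-subnet-nsg", rule["name"], rule
    ).result()
```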

29 Oct

Local testing with live data means faster development with Azure Stream Analytics

We are excited to announce that live data local testing is now available for public preview in Azure Stream Analytics Visual Studio tools. Have you ever wished you could test your Azure Stream Analytics query logic against live data without running it in the cloud? Are you excited by the possibility of skipping query deployment and other round-trip delays? This enhancement lets you test your queries locally while using live data streams from cloud sources such as Azure Event Hubs, IoT Hub, or Blob storage. You can even use all the Azure Stream Analytics time policies in the local Visual Studio environment. Being able to start and stop queries in seconds, together with local debugging, significantly improves development productivity by saving precious time in the inner loop of query logic testing.

Live data local testing experience

The new local testing runtime can read live streaming data from the cloud or from a local static file. It works the same way as the Azure Stream Analytics cloud runtime and therefore supports the same time policies needed for many testing scenarios. The query runs in a simulated environment suitable for a single-server development environment and

18 Oct

Parameterize connections to your data stores in Azure Data Factory

Azure Data Factory (ADF) enables you to do hybrid data movement from more than 70 data stores in a serverless fashion. Users often want to connect to multiple data stores of the same type. For example, you might want to connect to 10 different databases on your Azure SQL server, where the only difference between those 10 databases is the database name. You can now parameterize the linked service in your Azure Data Factory. In this case, you can parameterize the database name in your ADF linked service instead of creating 10 separate linked services for the 10 Azure SQL databases. This reduces overhead and improves manageability for your data factories. You can then pass the database names dynamically at runtime: simply create a new linked service and click Add Dynamic Content underneath the property that you want to parameterize, as in the sketch below.
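A rough sketch of what such a parameterized linked service definition can look like (all names here are hypothetical; the ADF JSON is shown as a Python dict for readability):

```python
# Hypothetical parameterized Azure SQL Database linked service, written as the JSON
# payload you would author in ADF (represented here as a Python dict).
linked_service = {
    "name": "AzureSqlDatabaseLS",
    "properties": {
        "type": "AzureSqlDatabase",
        "parameters": {
            "DBName": {"type": "String"}               # supplied by the caller at runtime
        },
        "typeProperties": {
            "connectionString": (
                "Server=tcp:myserver.database.windows.net,1433;"
                "Database=@{linkedService().DBName};"  # parameter reference
            )
        },
    },
}

# A dataset (or activity) that uses this linked service then passes the database name in:
linked_service_reference = {
    "referenceName": "AzureSqlDatabaseLS",
    "type": "LinkedServiceReference",
    "parameters": {"DBName": "salesdb01"},
}
```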

You can also parameterize other properties of your linked service, like server name, username, and more. We recommend not parameterizing passwords or secrets; store all connection strings in Azure Key Vault instead, and parameterize the “Secret Name”. The user experience also guides you in case you type incorrect syntax to parameterize the

16 Oct

Apache Spark jobs gain up to 9x speed up with HDInsight IO Cache

Today, we are pleased to announce the preview of HDInsight IO Cache, a new transparent data caching feature of Azure HDInsight that provides customers with up to a 9x performance improvement for Apache Spark jobs. We know from our customers that the cost efficiency of managed, cloud-based Apache Hadoop and Spark services is one of their major attractions when it comes to analytics. HDInsight IO Cache improves this key value proposition even further by improving performance without a corresponding increase in costs.

Architecture

Azure HDInsight is a cloud platform service for open source analytics that aims to bring the best open source projects to Azure and integrate them natively. There are many open source caching projects in the ecosystem: Alluxio, Ignite, and RubiX, to name a few prominent ones.

HDInsight IO Cache is based on RubiX. RubiX is one of the more recent projects and has a distinct architecture. Unlike other caching projects, it doesn’t reserve operating memory for caching purposes. Instead, it leverages recent advances in SSD technology to their fullest potential to make explicit memory management unnecessary. Modern SSDs routinely provide more than 1GB per second of bandwidth. Coupled with automatic operating system in-memory

08 Oct

A fast, serverless, big data pipeline powered by a single Azure Function

A single Azure function is all it took to fully implement an end-to-end, real-time, mission-critical data pipeline, and it was done with a serverless architecture. Serverless architectures simplify the building, deployment, and management of cloud-scale applications. Instead of worrying about data infrastructure like server procurement, configuration, and management, a data engineer can focus on the tasks it takes to ensure an end-to-end, highly functioning data pipeline.

This blog describes an Azure function and how it efficiently coordinated a data ingestion pipeline that processed over eight million transactions per day.

Scenario

A large bank wanted to build a solution to detect fraudulent transactions submitted through mobile phone banking applications. The solution requires a big data pipeline approach: high volumes of real-time data are ingested into a cloud service, where a series of data transformation and extraction activities occur. These produce a feature data set that is then used by advanced analytics. For the bank, the pipeline had to be very fast and scalable; end-to-end evaluation of each transaction had to complete in less than two seconds.

Telemetry from the bank’s multiple application gateways streams in as embedded events in complex JSON files. The ingestion technology
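A minimal sketch of the kind of function involved (assuming, purely for illustration, Event Hubs-style ingestion and the Azure Functions Python programming model; all names and the downstream calls are hypothetical, and the binding configuration in function.json is omitted):

```python
# Hypothetical Event Hub-triggered Azure Function that unpacks the embedded JSON
# events and forwards them for feature extraction and scoring.
import json
import logging
from typing import List

import azure.functions as func


def main(events: List[func.EventHubEvent]) -> None:
    for event in events:
        envelope = json.loads(event.get_body().decode("utf-8"))

        # Each gateway file embeds multiple transaction events.
        for txn in envelope.get("transactions", []):
            features = {
                "account_id": txn["accountId"],
                "amount": txn["amount"],
                "channel": envelope.get("gateway", "unknown"),
            }
            # Downstream scoring / storage would be invoked here; with a sub-two-second
            # end-to-end budget, those calls must be fast and non-blocking.
            logging.info("scoring features: %s", features)
```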

03 Oct

Bring Your Own Keys for Apache Kafka on HDInsight

One of the biggest security and compliance requirements for enterprise customers is to encrypt their data at rest using their own encryption key. This is even more critical in a post-GDPR world. Today, we’re announcing the public preview of Bring Your Own Key (BYOK) for data at rest in Apache Kafka on Azure HDInsight.

Azure HDInsight clusters already provide several levels of security. At the perimeter level, traffic can be controlled via Virtual Networks and Network Security Groups. Kerberos authentication and Apache Ranger provide the ability to finely control access to Kafka topics. Further, all managed disks are protected via Azure Storage Service Encryption (SSE). However, for some customers it is vital that they own and manage the keys used to encrypt the data at rest. Some customers achieve this by encrypting all Kafka messages in their producer applications and decrypting them in their consumer applications, as sketched below. This process is cumbersome and involves custom logic. Moreover, it doesn’t allow for the use of community-supported connectors.
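To make the contrast concrete, the workaround described above looks roughly like this on the producer side (a hypothetical sketch using kafka-python and a symmetric key; BYOK removes the need for any of it):

```python
# Hypothetical sketch of the client-side encryption workaround: every producer must
# fetch a key and encrypt each message itself, and every consumer must decrypt.
from cryptography.fernet import Fernet
from kafka import KafkaProducer  # kafka-python

key = Fernet.generate_key()          # in practice, pulled from a key store you manage
cipher = Fernet(key)

producer = KafkaProducer(bootstrap_servers="mybroker:9092")
payload = b'{"transactionId": 42, "amount": 19.99}'
producer.send("transactions", value=cipher.encrypt(payload))
producer.flush()
```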

With HDInsight Kafka’s support for Bring Your Own Key (BYOK), encryption at rest is a one-step process handled during cluster creation. Customers should use a user-assigned managed identity with Azure Key Vault (AKV) to

02 Oct

Cloud Scale Analytics meets Office 365 data – empowered by Azure Data Factory

Office 365 holds a wealth of information about how people work and how they interact and collaborate with each other, and this valuable asset enables intelligent applications to derive insights and optimize organizational productivity. Today, application developers use the Microsoft Graph API to access Office 365 in a transactional way. This approach, however, is not efficient if you need to analyze a large volume of Office artifacts across a long time horizon. Further, Office 365 data is isolated from other business data and systems, leading to data silos and untapped opportunities for additional insights.
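For context, the transactional pattern is a series of per-user REST calls of roughly this shape (a hypothetical sketch; token acquisition is omitted and the user ID is made up), which works well for an application but does not scale to bulk, long-horizon analytics:

```python
# Hypothetical sketch of the per-user, transactional Microsoft Graph pattern.
import requests

token = "<access token obtained via Azure AD>"
user_id = "adele@contoso.com"

resp = requests.get(
    f"https://graph.microsoft.com/v1.0/users/{user_id}/messages",
    headers={"Authorization": f"Bearer {token}"},
    params={"$top": 10},               # one small page of one user's mailbox per call
)
resp.raise_for_status()
for message in resp.json().get("value", []):
    print(message["subject"])
```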

 

Azure offers a rich set of hyperscale analytics services with enterprise-grade security, available in data centers worldwide. By bringing Office 365 data into Azure, developers can harness the full power of Azure to build highly scalable and secure applications that combine Office 365 data with other business data.

 

This week at Ignite we announced the public preview of Microsoft Graph data connect, which enables secure, governed, and scalable access to Office 365 data in Azure. With this offering, for the very first time, all your data – organizational, customer, transactional, external

02 Oct

Enabling real-time data warehousing with Azure SQL Data Warehouse

Gaining insights rapidly from data is critical to competitiveness in today’s business world. Azure SQL Data Warehouse (SQL DW), Microsoft’s fully managed analytics platform, leverages massively parallel processing (MPP) to run complex interactive SQL queries at every level of scale.

Users today expect data within minutes, a departure from traditional analytics systems that operated with a data latency of a day or more. With the requirement for faster data, users need ways to move data from source systems into their analytical stores in a simple, quick, and transparent fashion. Delivering on modern analytics strategies requires that users act on current information, which means enabling the continuous movement of enterprise data, from on-premises to the cloud and everything in between.

We are happy to announce that Striim now fully supports SQL Data Warehouse as a target for Striim for Azure. Striim enables continuous, non-intrusive, performant ingestion of all your enterprise data from a variety of sources in real time. This means that users can build intelligent pipelines for change data capture from sources such as Oracle Exadata straight into SQL Data Warehouse. Striim can also be used to move fast
