Category Archives: Big Data

14

May

Enhance productivity using Azure Data Factory Visual Tools

With Azure Data Factory (ADF) visual tools, we listened to your feedback and enabled a rich, interactive visual authoring and monitoring experience. It allows you to iteratively create, configure, test, deploy and monitor data integration pipelines without any friction. The main goal of the ADF visual tools is to make you productive with ADF by getting pipelines up and running quickly, without requiring you to write a single line of code.

We continue to add new features to increase productivity and efficiency for both new and advanced users with intuitive experiences. You can get started by clicking the Author and Monitor tile in your provisioned v2 data factory blade.

Check out some of the exciting new features enabled with data factory visual tools since public preview (January 2018):

Latest data factory updates

Follow exciting new updates to the data factory service.

View the data factory deployment region and resource group. Then switch to another data factory that you have access to.

Visual authoring

More data connectors

Ingest data at scale from more than 70 on-premises/cloud data sources in a serverless fashion.

New activities in toolbox

Notebook Activity: Ingest data at scale using more than 70 on-premises/cloud

09

May

Azure Event Hubs for Kafka Ecosystems in public preview

Organizations need data-driven strategies to increase competitive advantage. Customers want to stream data and analyze it in real time to get valuable insights faster. To meet these big data needs, you need a massively scalable, distributed, event-driven messaging platform with multiple producers and consumers. Apache Kafka and Azure Event Hubs provide such distributed platforms.

How is Event Hubs different from Kafka?

Kafka and Event Hubs are both designed to handle large-scale stream ingestion driven by real-time events. Conceptually, both are a distributed, partitioned, and replicated commit log service. Both use a partitioned consumer model that offers huge scalability for concurrent consumers. Both use a client-side cursor concept and scale to very high workloads.

Apache Kafka is software that you install and run yourself. Azure Event Hubs is a fully managed service in the cloud. While Kafka is popular for its wide ecosystem and its on-premises and cloud presence, Event Hubs offers you the freedom of not having to manage servers or networks, or worry about configuring brokers.

Talk to Event Hubs like you would with Kafka, and unleash the power of PaaS!
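As a rough illustration, an existing Kafka producer only needs its configuration pointed at the namespace's Kafka endpoint; the sketch below uses the confluent-kafka Python client, and the namespace, connection string, and topic names are placeholders.

```python
# Minimal sketch: send to an Event Hubs namespace through its Kafka endpoint.
# Placeholders: namespace, connection string, and topic (event hub) name.
from confluent_kafka import Producer

producer = Producer({
    # Kafka endpoint of the Event Hubs namespace (port 9093).
    "bootstrap.servers": "mynamespace.servicebus.windows.net:9093",
    # Event Hubs authenticates Kafka clients via SASL PLAIN over TLS,
    # using the namespace connection string as the password.
    "security.protocol": "SASL_SSL",
    "sasl.mechanisms": "PLAIN",
    "sasl.username": "$ConnectionString",
    "sasl.password": "Endpoint=sb://mynamespace.servicebus.windows.net/;...",
})

def on_delivery(err, msg):
    # Log per-record delivery success or failure.
    print("delivery failed: %s" % err if err else "delivered to %s" % msg.topic())

# The event hub name acts as the Kafka topic.
producer.produce("my-event-hub", value=b"hello from a Kafka client", callback=on_delivery)
producer.flush()
```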

Today we are happy to marry both these powerful distributed streaming platforms to offer you Event Hubs for

01

May

Python, Node.js, Go client libraries for Azure Event Hubs in public preview

Azure Event Hubs is expanding its ecosystem to support more languages. Azure Event Hubs is a highly scalable data-streaming platform processing millions of events per second. Event Hubs uses Advanced Message Queuing Protocol (AMQP 1.0) to enable interoperability and compatibility across platforms. Now, with the addition of new clients, you can easily get started with Event Hubs.

We are happy to have the new client libraries for Go, Python, and Node.js in public preview. Build your application logging, clickstream analytics pipelines, live dashboarding, or any telemetry processing in the language of your choice with our rich ecosystem.

Ingest and consume events and logs from your Python applications, stream with your Node.js applications, or simply integrate with your Go applications. You now have a wide palette to choose from based on your needs.
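For the Python path, a minimal send sketch looks roughly like the following; the connection string and event hub name are placeholders, and this follows the current azure-eventhub package, so the preview-era client's API surface may differ slightly.

```python
# Minimal sketch: batch and send a few events from a Python application.
# Placeholders: connection string and event hub name.
from azure.eventhub import EventHubProducerClient, EventData

producer = EventHubProducerClient.from_connection_string(
    conn_str="Endpoint=sb://mynamespace.servicebus.windows.net/;...",
    eventhub_name="telemetry",
)

with producer:
    batch = producer.create_batch()           # batch stays under the size limit
    batch.add(EventData("click: /home"))      # e.g. clickstream or log events
    batch.add(EventData("click: /products"))
    producer.send_batch(batch)                # one call sends the whole batch
```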

The following updates will provide more insights into the public preview of the new client libraries.

Event Hubs for Go: this new package offers you easy-to-use Send and Receive functions that communicate with the Event Hubs service using the AMQP 1.0 protocol as implemented by github.com/vcabbage/amqp. What's more, it also offers the Event Processor Host to manage load balancing and lease management for consumers. The readme helps

30

Apr

Secure credential management for ETL workloads using Azure Key Vault and Data Factory

Secure credential management is essential to protect data in the cloud. With Azure Key Vault, you can encrypt keys and small secrets like passwords. Azure Data Factory is now integrated with Azure Key Vault: you can store the credentials for the data stores and computes referenced in your Azure Data Factory ETL (Extract, Transform, Load) workloads in an Azure Key Vault. Simply create an Azure Key Vault linked service and reference the secret stored in the vault from your data factory pipelines.
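As a rough sketch of what this looks like, the two definitions involved are shown below as Python dictionaries mirroring the JSON that the ADF authoring UI generates; the vault, linked service, and secret names are placeholders, and the exact property layout can vary slightly by connector.

```python
# 1) An Azure Key Vault linked service that points the factory at the vault.
key_vault_linked_service = {
    "name": "MyKeyVault",
    "properties": {
        "type": "AzureKeyVault",
        "typeProperties": {"baseUrl": "https://mykeyvault.vault.azure.net"},
    },
}

# 2) A data store linked service whose connection string is resolved from a
#    Key Vault secret at runtime instead of being stored in the factory.
storage_linked_service = {
    "name": "MyBlobStorage",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "MyKeyVault",
                    "type": "LinkedServiceReference",
                },
                "secretName": "storage-connection-string",
            }
        },
    },
}
```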

Azure Data Factory will now automatically pull your credentials for your data stores and computes from Azure Key Vault during pipeline execution. With Key Vault, you don't need to provision, configure, patch, and maintain key management software; just provision new vaults and keys in minutes. Centrally manage keys, secrets, and policies, and refer to the keys in your data pipelines in Data Factory. You keep control over your keys by simply granting permission for your own applications and the Data Factory service to use them as needed. Data Factory never has direct access to keys. Developers manage keys used for Dev/Test and seamlessly migrate to production keys that are managed by security operations.

With Azure Key

24

Apr

Azure Toolkit for Eclipse integrates with HDInsight Ambari and supports Spark 2.2

To provide more authentication options, Azure Toolkit for Eclipse now supports integration with HDInsight clusters through Ambari for job submission, browsing cluster resources, and navigating storage files. You can easily link or unlink any cluster by using an Ambari-managed username and password, which is independent of your Azure sign-in credentials. The Ambari connection applies to normal Spark and Hive clusters hosted within HDInsight on Azure. These additions give you more flexibility in how you connect to your HDInsight clusters, in addition to your Azure subscriptions, while also simplifying your experience of submitting Spark jobs.

With this release, you can benefit from the new functionality and consume the new libraries & APIs of Spark 2.2 in Azure Toolkit for Eclipse. You can create, author and submit a Spark 2.2 project to a Spark 2.2 cluster. Thanks to the backward compatibility of Spark 2.2, you can also submit your existing Spark 2.0 and Spark 2.1 projects to a Spark 2.2 cluster.

How to link a cluster

Click Link a cluster from Azure Explorer.

Enter Cluster Name, Storage Account, and Storage Key, then select a container from Storage Container, and finally enter Username and Password. Click the OK button to link the cluster.

Please note that you

24

Apr

Azure Toolkit for IntelliJ integrates with HDInsight Ambari and supports Spark 2.2

To provide more authentication options, Azure Toolkit for IntelliJ now supports integration with HDInsight clusters through Ambari for job submission, browsing cluster resources, and navigating storage files. You can easily link or unlink any cluster by using an Ambari-managed username and password, which is independent of your Azure sign-in credentials. The Ambari connection applies to normal Spark and Hive clusters hosted within HDInsight on Azure. These additions give you more flexibility in how you connect to your HDInsight clusters, in addition to your Azure subscriptions, while also simplifying your experience of submitting Spark jobs.

With this release, you can benefit from the new functionality and consume the new libraries & APIs of Spark 2.2 in Azure Toolkit for IntelliJ. You can create, author and submit a Spark 2.2 project to a Spark 2.2 cluster. Thanks to the backward compatibility of Spark 2.2, you can also submit your existing Spark 2.0 and Spark 2.1 projects to a Spark 2.2 cluster.

How to link a cluster

Click Link a cluster from Azure Explorer.

Enter Cluster Name, Storage Account, and Storage Key, then select a container from Storage Container, and finally enter Username and Password.

Please note that you can use either the Ambari username and password or

16

Apr

Iterative development and debugging using Data Factory

Data integration is becoming more and more complex as customer requirements and expectations are continuously changing. Users increasingly need to develop and debug their Extract-Transform-Load (ETL) and Extract-Load-Transform (ELT) workflows iteratively. Now, Azure Data Factory (ADF) visual tools allow you to do iterative development and debugging.

You can create your pipelines and do test runs using the Debug capability in the pipeline canvas without writing a single line of code. You can view the results of your test runs in the Output window of your pipeline canvas. Once your test run succeeds, you can add more activities to your pipeline and continue debugging in an iterative manner. You can also cancel a test run while it is in progress. You are not required to publish your changes to the data factory service before clicking Debug. This is helpful in scenarios where you want to make sure that your new additions or changes work as expected before you update your data factory workflows in dev, test or prod environments.

Data Factory visual tools also allow you to debug your pipeline up to a particular activity on the pipeline canvas. Simply put a breakpoint on the activity until

12

Apr

Rubikloud leverages Azure SQL Data Warehouse to disrupt retail market with accessible AI

In the modern retail environment, consumers are well-informed and expect intuitive, engaging, and informative experiences when they shop. To keep up, retailers need solutions that can help them delight their customers with personalized experiences, empower their workforce to provide differentiated customer experiences, optimize their supply chain with intelligent operations and transform their products and services.

With global scale and intelligence built into key services, Azure is the perfect platform for building powerful apps that delight retail customers; the possibilities are endless. With a single photo, retailers can create new access points for the customer on a device of their choice. Take a look at this example of what's possible using Microsoft's big data and advanced analytics products.

AI can be complex; this is where Rubikloud comes in. Rubikloud is focused on accessible AI products for retailers and on delivering on the promise of "intelligent decision automation". They offer a set of SaaS products, Promotion Manager and Customer Lifecycle Manager, that help retailers automate and optimize mass promotional planning and loyalty marketing. These products help retailers reduce the complexities of promotion planning and store allocations, and better predict their customers' intentions and behavior throughout the retail life cycle.

As Rubikloud

10

Apr

Announcing larger, higher scale storage accounts

One of the fastest areas of growth in cloud computing is data storage. With a variety of workloads such as IoT telemetry, logging, media, genomics and archival driving cloud data growth, the need for scalable capacity, bandwidth, and transactions for storing and analyzing data for business insights is more important than ever.

Up to 10x increase to Blob storage account scalability

We are excited to announce improvements in the capacity and scalability of standard Azure storage accounts, which greatly improves your experience building cloud-scale applications using Azure Storage. Effective immediately, via a request made to Azure Support, Azure Blob storage accounts or General Purpose v2 storage accounts can support the following larger limits. The defaults remain the same as before.

Resource | Default | New Limit
Max capacity for Blob storage accounts | 500 TB | 5 PB (10x increase)
Max TPS/IOPS for Blob storage accounts | 20K | 50K (2.5x increase)
Max ingress for Blob storage accounts | 5-20 Gbps (varies by region/redundancy type) | 50 Gbps (up to 10x increase)
Max egress for Blob storage accounts | 10-30 Gbps (varies

09

Apr

Continuous integration and deployment using Data Factory

Azure Data Factory (ADF) visual tools public preview was announced on January 16, 2018. With visual tools, you can iteratively build, debug, deploy, operationalize and monitor your big data pipelines. Now, you can follow industry-leading best practices to do continuous integration and deployment of your ETL/ELT (extract, transform/load, load/transform) workflows to multiple environments (Dev, Test, Prod, etc.). Essentially, you can incorporate the practice of testing your codebase changes and automatically pushing the tested changes to a Test or Prod environment.

The ADF visual interface now allows you to export any data factory as an ARM (Azure Resource Manager) template. You can click 'Export ARM template' to export the template corresponding to a factory.

This will generate 2 files:

Template file: template JSON containing all the data factory metadata (pipelines, datasets, etc.) corresponding to your data factory.

Configuration file: contains environment parameters that will be different for each environment (Dev, Test, Prod, etc.), such as the Storage connection, Azure Databricks cluster connection, and so on.
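As a rough sketch of how the exported template and a per-environment configuration file could be deployed, the hypothetical Python example below uses the azure-identity and azure-mgmt-resource packages; the file names, resource group, and subscription ID are placeholders, and the same deployment can equally be scripted with PowerShell or the Azure CLI.

```python
# Minimal sketch: deploy the exported ARM template to one environment's
# resource group, using that environment's parameter (configuration) file.
import json

from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient
from azure.mgmt.resource.resources.models import Deployment, DeploymentProperties

# Hypothetical file names for the exported template and the Test config file.
with open("arm_template.json") as f:
    template = json.load(f)
with open("arm_template_parameters_test.json") as f:
    parameters = json.load(f)["parameters"]    # inner "parameters" object

client = ResourceManagementClient(DefaultAzureCredential(), "<subscription-id>")

client.deployments.begin_create_or_update(
    "rg-datafactory-test",                     # resource group of the Test factory
    "adf-release-001",                         # deployment name
    Deployment(properties=DeploymentProperties(
        mode="Incremental",
        template=template,
        parameters=parameters,
    )),
).result()                                     # wait for the deployment to finish
```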

You will create a separate data factory per environment. You will then use the same template file for each environment and have one configuration file per environment. Clicking the ‘Import ARM Template’ button will take you to