Category Archives: Big Data

21 May

Drive higher utilization of Azure HDInsight clusters with Autoscale

We are excited to share the preview release of the Autoscale feature for Azure HDInsight. This feature enables enterprises to become more productive and cost-efficient by automatically scaling clusters up or down based on the load or a customized schedule. 

Let’s consider the scenario of a U.S.-based health provider who is using Azure HDInsight to build a unified big data platform at the corporate level to process various data for trend prediction and usage pattern analysis. To achieve their business goals, they operate multiple HDInsight clusters in production for real-time data ingestion, batch processing, and interactive analysis.

Some clusters are customized to exact requirements, such as ISV/line-of-business applications and access control policies, and are subject to rigorous SLA requirements. Sizing such clusters is a hard problem in itself, and operating them 24/7 at peak capacity is expensive. Once the clusters are created, IT admins must either manually monitor the dynamic capacity requirements and scale the clusters up and down, or develop custom tools to do so. These challenges prevent IT admins from being as productive as possible when building and operating cost-efficient big data analytics workloads.
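The schedule-based mode can be pictured as a simple mapping from time of day to a target worker-node count. The sketch below is purely illustrative: the schedule values, node counts, and function name are hypothetical and are not part of the HDInsight API.

```python
from datetime import datetime

# Hypothetical schedule: hour ranges mapped to worker-node counts.
# Business hours get peak capacity; evenings run a smaller batch window.
SCHEDULE = [
    (range(8, 18), 40),   # 08:00-17:59 -> peak capacity
    (range(18, 22), 15),  # evening batch window
]
DEFAULT_NODES = 4         # off-hours and weekend floor

def target_node_count(now: datetime) -> int:
    """Return the desired worker-node count for the given time."""
    if now.weekday() >= 5:        # Saturday/Sunday
        return DEFAULT_NODES
    for hours, nodes in SCHEDULE:
        if now.hour in hours:
            return nodes
    return DEFAULT_NODES

print(target_node_count(datetime(2019, 5, 21, 10, 0)))  # weekday, business hours -> 40
```

The load-based mode works analogously, except the target is derived from metrics such as pending YARN containers rather than the clock.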

With the new cluster Autoscaling feature, IT admins can have the…


16 May

Microsoft 365 boosts usage analytics with Azure Cosmos DB – Part 2

https://azure.microsoft.com/blog/microsoft-365-boosts-usage-analytics-with-azure-cosmos-db-part-2/


16 May

Microsoft 365 boosts usage analytics with Azure Cosmos DB

https://azure.microsoft.com/blog/microsoft-365-boosts-usage-analytics-with-azure-cosmos-db/


01 May

Azure Stack IaaS – part seven

https://azure.microsoft.com/blog/azure-stack-iaas-part-seven-2/


01 May

Migrating big data workloads to Azure HDInsight

https://azure.microsoft.com/blog/migrating-big-data-workloads-to-azure-hdinsight/


24 Apr

5 tips to get more out of Azure Stream Analytics Visual Studio Tools

Azure Stream Analytics is an on-demand real-time analytics service to power intelligent action. Azure Stream Analytics tools for Visual Studio make it easier for you to develop, manage, and test Stream Analytics jobs. This year we shipped two major updates, in January and March, that introduced several useful features. In this blog we’ll walk through some of these capabilities to help you improve productivity.

Test partial scripts locally

In the latest March update we enhanced the local testing capability. Besides running the whole script, you can now select part of the script and run it locally against a local file or a live input stream. Click Run Locally or press F5/Ctrl+F5 to trigger execution. Note that the selected portion of the larger script file must be a logically complete query to execute successfully.

Share inputs, outputs, and functions across multiple scripts

It is very common for multiple Stream Analytics queries to use the same inputs, outputs, or functions. Since these configurations and code are managed as files in Stream Analytics projects, you can define them once and then use them across multiple projects. Right-click on the project name or a folder node (Inputs, Outputs, Functions, etc.) and then choose Add Existing…


18 Apr

Manage Azure HDInsight clusters using .NET, Python, or Java

We are pleased to announce the general availability of the new Azure HDInsight management SDKs for .NET, Python, and Java.

Highlights of this release

More languages: In addition to .NET, you can now easily manage your HDInsight clusters using Python or Java.

Manage HDInsight clusters: The SDK provides several useful operations to manage your HDInsight clusters, including the ability to create clusters, delete clusters, scale clusters, list existing clusters, get cluster details, update cluster tags, execute script actions, and more.

Monitor HDInsight clusters: Manage your HDInsight cluster’s integration with Azure Monitor logs. HDInsight clusters can emit metrics into queryable tables in a Log Analytics workspace so you can monitor all of your clusters in one place. Use the SDK to enable, disable, or view the status of Azure Monitor logs integration on a cluster.

Script actions: Use the SDK to execute, delete, list, and view details for script actions on your HDInsight clusters. Script actions allow you to run scripts as Ambari operations to configure and customize your cluster.

Getting started

You can learn how to get started with the HDInsight management SDK in the language of your choice here:

.NET Getting Started Guide

Python Getting Started Guide

Java Getting Started Guide


02 Apr

Updates to geospatial functions in Azure Stream Analytics – cloud and IoT Edge

Azure Stream Analytics is a fully managed PaaS offering that helps you run real-time analytics and complex event processing logic on telemetry from devices and applications. Numerous built-in functions available in Stream Analytics help users build real-time applications using a simple SQL language with ease. By using these capabilities, customers can quickly build powerful applications for scenarios such as fleet monitoring, connected cars, mobile asset tracking, geofence monitoring, and ridesharing.

Today, we are excited to announce several enhancements to geospatial features. These features will help customers manage much larger sets of mobile assets and vehicle fleets more easily, accurately, and contextually than previously possible. These capabilities are available both in the cloud and on Azure IoT Edge.

Here is a quick run-down of the new capabilities:

Geospatial indexing

Previously, tracking ‘n’ assets in streaming data across ‘m’ geofence reference data points translated into a cross join of every reference data entry with every streaming event, resulting in an O(n*m) operation. This presented scale issues in scenarios where customers need to manage thousands of assets across hundreds of sites.
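To see why indexing helps, here is a minimal, hypothetical sketch (not the actual Stream Analytics implementation) of a grid-based spatial index: geofence points are bucketed by grid cell once, up front, so each streaming event is compared only against fences in its own neighborhood instead of all m of them.

```python
from collections import defaultdict

CELL = 1.0  # grid cell size in degrees (illustrative)

def cell(lon, lat):
    """Map a coordinate to its grid cell."""
    return (int(lon // CELL), int(lat // CELL))

def build_index(fences):
    """Bucket geofence points by grid cell: O(m) once, up front."""
    index = defaultdict(list)
    for name, lon, lat in fences:
        index[cell(lon, lat)].append((name, lon, lat))
    return index

def nearby_fences(index, lon, lat):
    """Per event, inspect only the event's cell and its 8 neighbors."""
    cx, cy = cell(lon, lat)
    hits = []
    for dx in (-1, 0, 1):
        for dy in (-1, 0, 1):
            hits.extend(index.get((cx + dx, cy + dy), []))
    return hits

fences = [("depot", -122.3, 47.6), ("airport", -122.3, 47.4), ("warehouse", -80.2, 25.8)]
index = build_index(fences)
# An asset near Seattle is checked against the two nearby fences only,
# not the Florida one -- avoiding the O(n*m) cross join.
print([name for name, _, _ in nearby_fences(index, -122.31, 47.61)])  # ['depot', 'airport']
```

The fence names and cell size here are made up for illustration; the point is that per-event work drops from O(m) to roughly the handful of fences sharing a cell.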

To address this limitation, Stream Analytics now supports indexing geospatial data…


02 Apr

Monitoring on Azure HDInsight Part 2: Cluster health and availability

This is the second blog post in a four-part series on Monitoring on Azure HDInsight. “Monitoring on Azure HDInsight Part 1: An Overview” discusses the three main monitoring categories: cluster health and availability, resource utilization and performance, and job status and logs. This blog covers the first of those topics, cluster health and availability, in more depth.

As a high-availability service, Azure HDInsight ensures that you can spend your time focused on your workloads, not worrying about the availability of your cluster. To accomplish this, HDInsight clusters are equipped with two head nodes, two gateway nodes, and three ZooKeeper nodes, making sure there is no single point of failure for your cluster. Nevertheless, Azure HDInsight offers multiple ways to comprehensively monitor the status of your clusters’ nodes and the components that run on them. HDInsight clusters include both Apache Ambari, which provides at-a-glance health information and predefined alerts, and Azure Monitor logs integration, which allows querying of metrics and logs as well as configurable alerts.

Apache Ambari

Apache Ambari, included on all HDInsight clusters, simplifies cluster management and monitoring via an easy-to-use web UI and REST API. Today, Ambari is the best way to monitor…


01 Apr

Schema validation with Event Hubs

Event Hubs is a fully managed, real-time data ingestion service on Azure. It integrates seamlessly with other Azure services, and it allows Apache Kafka clients and applications to talk to Event Hubs without any code changes.

Apache Avro is a binary serialization format. It relies on schemas, defined in JSON, that specify which fields are present and their types. Since it’s a binary format, you can produce and consume Avro messages to and from Event Hubs.

Event Hubs’ focus is on the data pipeline. It doesn’t validate the schema of the Avro events.

If producers and consumers cannot be expected to stay in sync on event schemas, there needs to be a “source of truth” for schema tracking, for both producers and consumers.

Confluent has a product for this: Schema Registry, which is part of Confluent’s open source offering.

Schema Registry can store schemas, list schemas, list all the versions of a given schema, retrieve a certain version of a schema, get the latest version of a schema, and perform schema validation. It has a UI, and you can also manage schemas via its REST APIs.
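To make the validation idea concrete, here is a minimal producer-side sketch in Python. It uses a hand-rolled check against a “registered” schema rather than the real Schema Registry client or an Avro library; the record name, field names, and the validate function are illustrative only.

```python
import json

# A "registered" schema in Avro's JSON schema format, acting as the
# source of truth shared by producers and consumers.
SCHEMA = json.loads("""
{
  "type": "record",
  "name": "PageView",
  "fields": [
    {"name": "user_id", "type": "string"},
    {"name": "ts", "type": "long"}
  ]
}
""")

# Map a couple of Avro primitive types to Python types for the check.
AVRO_TO_PY = {"string": str, "long": int}

def validate(event: dict, schema: dict) -> bool:
    """Check that an event has exactly the schema's fields, with matching types."""
    fields = {f["name"]: AVRO_TO_PY[f["type"]] for f in schema["fields"]}
    if set(event) != set(fields):
        return False
    return all(isinstance(event[name], typ) for name, typ in fields.items())

good = {"user_id": "u-42", "ts": 1556668800000}
bad = {"user_id": "u-42", "ts": "not-a-long"}
print(validate(good, SCHEMA), validate(bad, SCHEMA))  # True False
```

A producer would run such a check (or, in practice, a Schema Registry-aware serializer) before sending the event to Event Hubs, so malformed events never reach the pipeline.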

What are my options?
