Category Archives: Big Data



MileIQ and Azure Event Hubs: Billions of miles streamed

This post was co-authored by Shubha Vijayasarathy, Program Manager, Azure Messaging (Event Hubs)

With billions of miles logged, MileIQ provides stress-free logging and accurate mileage reports for millions of drivers. Logging and reporting miles driven is a necessity for everyone from independent contractors to organizations with employees who need to drive for work. MileIQ automates mileage logging to create accurate records of miles driven, minimizing the time and effort that manual calculations require. Real-time mileage tracking produces over a million location signal events per hour, requiring fast and resilient event processing that scales.

MileIQ leverages Apache Kafka to ingest massive streams of data:

Event processing: Events that demand time-consuming processing are put into Kafka, and multiple processors consume and process these asynchronously.
Communication among micro-services: Events are published by the event-owning micro-service on Kafka topics. The other micro-services, which are interested in these events, subscribe to these topics to consume the events.
Data Analytics: As all the important events are published on Kafka, the data analytics team subscribes to the topics it is interested in and pulls all the data it requires for data processing.

Growth Challenges

As with any successful venture, growth introduces operational challenges as infrastructure struggles to support the




Silo busting 2.0—Multi-protocol access for Azure Data Lake Storage

Cloud data lakes solve a foundational problem for big data analytics—providing secure, scalable storage for data that traditionally lives in separate data silos. Data lakes were designed from the start to break down data barriers and jump start big data analytics efforts. However, a final “silo busting” frontier remained, enabling multiple data access methods for all data—structured, semi-structured, and unstructured—that lives in the data lake.

Providing multiple data access points to shared data sets allows tools and data applications to interact with the data in their most natural way. Additionally, this allows your data lake to benefit from the tools and frameworks built for a wide variety of ecosystems. For example, you may ingest your data via an object storage API, process the data using the Hadoop Distributed File System (HDFS) API, and then ingest the transformed data using an object storage API into a data warehouse.
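To make the idea of two access points onto one object concrete, here is a minimal sketch that maps a Blob storage URL to the URI an HDFS-compatible client would use for the same data. It assumes the public-cloud endpoint suffixes `blob.core.windows.net` and `dfs.core.windows.net` and the `abfss://` scheme of the ABFS driver, none of which are spelled out in this post. The function name and the sample account, container, and path are hypothetical.

```python
from urllib.parse import urlparse

# Assumed public-cloud endpoint suffixes; sovereign clouds use different ones.
BLOB_SUFFIX = ".blob.core.windows.net"
DFS_SUFFIX = ".dfs.core.windows.net"


def blob_url_to_abfs_uri(blob_url: str) -> str:
    """Return the abfss:// URI that addresses the same object as a Blob URL."""
    parsed = urlparse(blob_url)
    host = parsed.netloc
    if parsed.scheme != "https" or not host.endswith(BLOB_SUFFIX):
        raise ValueError(f"not a Blob storage URL: {blob_url}")
    account = host[: -len(BLOB_SUFFIX)]
    # The first path segment is the container; the remainder is the object path.
    container, _, path = parsed.path.lstrip("/").partition("/")
    if not container:
        raise ValueError("URL does not name a container")
    return f"abfss://{container}@{account}{DFS_SUFFIX}/{path}"


# Hypothetical account, container, and file names.
uri = blob_url_to_abfs_uri(
    "https://myaccount.blob.core.windows.net/raw/2019/07/trips.csv"
)
```

An ingestion tool could write through the first address while an analytics job reads through the second, without copying the data in between.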

Single storage solution for every scenario

We are very excited to announce the preview of multi-protocol access for Azure Data Lake Storage! Azure Data Lake Storage is a unique cloud storage solution for analytics that offers multi-protocol access to the same data. Multi-protocol access to the same data, via Azure Blob storage API




New capabilities in Stream Analytics reduce development time for big data apps

Azure Stream Analytics is a fully managed PaaS offering that enables real-time analytics and complex event processing on fast moving data streams. Thanks to zero-code integration with over 15 Azure services, developers and data engineers can easily build complex pipelines for hot-path analytics within a few minutes. Today, at Inspire, we are announcing various new innovations in Stream Analytics that help further reduce time to value for solutions that are powered by real-time insights. These are as follows:

Bringing the power of real-time insights to Azure Event Hubs customers

Today, we are announcing one-click integration with Event Hubs. Available as a public preview feature, this allows an Event Hubs customer to visualize incoming data and start to write a Stream Analytics query with one click from the Event Hubs portal. Once the query is ready, they will be able to operationalize it in a few clicks and start deriving real-time insights. This will significantly reduce the time and cost to develop real-time analytics solutions.

One-click integration between Event Hubs and Azure Stream Analytics

Augmenting streaming data with SQL reference data support

Reference data is a static or slow changing dataset used to augment real-time data streams to deliver more




Event-driven analytics with Azure Data Lake Storage Gen2

Most modern-day businesses employ analytics pipelines for real-time and batch processing. A common characteristic of these pipelines is that data arrives at irregular intervals from diverse sources. This adds complexity in terms of having to orchestrate the pipeline such that data gets processed in a timely fashion.

The answer to these challenges lies in coming up with a decoupled event-driven pipeline using serverless components that responds to changes in data as they occur.

An integral part of any analytics pipeline is the data lake. Azure Data Lake Storage Gen2 provides secure, cost-effective, and scalable storage for the structured, semi-structured, and unstructured data arriving from diverse sources. Azure Data Lake Storage Gen2’s performance, global availability, and partner ecosystem make it the platform of choice for analytics customers and partners around the world. Next comes the event processing aspect. With Azure Event Grid, a fully managed event routing service, Azure Functions, a serverless compute engine, and Azure Logic Apps, a serverless workflow orchestration engine, it is easy to perform event-based processing and workflows that respond to events in real time.
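As a rough illustration of event-based processing, the sketch below routes storage events to handlers by event type. It assumes events arrive as a JSON array in the Event Grid event schema, with an `eventType` field and the blob address under `data.url`. The handler names, the returned strings, and the sample payload are hypothetical, and a real deployment would run this logic inside a function or workflow rather than a plain script.

```python
import json
from typing import Callable, Dict, List


def on_blob_created(event: dict) -> str:
    # Hypothetical handler: a real one would start processing the new file.
    return f"process {event['data']['url']}"


def on_blob_deleted(event: dict) -> str:
    # Hypothetical handler: a real one would clean up derived data.
    return f"purge {event['data']['url']}"


# Route each event type to the handler that reacts to it.
HANDLERS: Dict[str, Callable[[dict], str]] = {
    "Microsoft.Storage.BlobCreated": on_blob_created,
    "Microsoft.Storage.BlobDeleted": on_blob_deleted,
}


def dispatch(payload: str) -> List[str]:
    """Parse a JSON array of events and run the matching handler for each."""
    results = []
    for event in json.loads(payload):
        handler = HANDLERS.get(event.get("eventType"))
        if handler is not None:  # Event types without a handler are skipped.
            results.append(handler(event))
    return results


# Hypothetical payload with one handled and one unhandled event type.
sample = json.dumps([
    {
        "eventType": "Microsoft.Storage.BlobCreated",
        "data": {"url": "https://myaccount.blob.core.windows.net/raw/a.csv"},
    },
    {"eventType": "Contoso.Unknown", "data": {}},
])
actions = dispatch(sample)
```

Because each handler only reacts to its own event type, new processing steps can be added without touching the components that produce the data.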

Today, we’re very excited to announce that Azure Data Lake Storage Gen2 integration with Azure Event Grid is in preview! This means



This is the third blog post in a four-part series on Monitoring on Azure HDInsight. Part 1 is an overview that discusses the three main monitoring categories: cluster health and availability, resource utilization and performance, and job status and logs.




Compute and stream IoT insights with data-driven applications

There is a lot more data in the world than can possibly be captured with even the most robust, cutting-edge technology. Edge computing and the Internet of Things (IoT) are just two examples of technologies increasing the volume of useful data. There is so much data being created that the current telecom infrastructure will struggle to transport it and even the cloud may become strained to store it. Despite the advent of 5G in telecom, and the rapid growth of cloud storage, data growth will continue to outpace the capacities of both infrastructures. One solution is to build stateful, data-driven applications with technology from SWIM.AI.

The Azure platform offers a wealth of services for partners to enhance, extend, and build industry solutions. Here we describe how one Microsoft partner uses Azure to solve a unique problem.

Shared awareness and communications

The increase in volume has other consequences, especially when IoT devices must be aware of each other and communicate shared information. Peer-to-peer (P2P) communications between IoT assets can overwhelm a network and impair performance. Smart grids are an example of how sensors or electric meters are networked across a distribution grid to improve the overall reliability and cost of delivering




Build more accurate forecasts with new capabilities in automated machine learning

We are excited to announce new capabilities that are part of time-series forecasting in Azure Machine Learning service. We launched the preview of forecasting in December 2018, and we have been excited by the strong customer interest. We listened to our customers and appreciate all the feedback. Your responses helped us reach this milestone. Thank you.

Building forecasts is an integral part of any business, whether it’s revenue, inventory, sales, or customer demand. Building machine learning models is time-consuming and complex with many factors to consider, such as iterating through algorithms, tuning your hyperparameters, and engineering features. These choices multiply with time series data, which adds considerations such as trends, seasonality, holidays, and how to split training data effectively.

Forecasting within automated machine learning (ML) now includes new capabilities that improve the accuracy and performance of our recommended models:

New forecast function
Rolling-origin cross validation
Configurable Lags
Rolling window aggregate features
Holiday detection and featurization

Expanded forecast function

We are introducing a new way to retrieve prediction values for the forecast task type. When dealing with time series data, several distinct scenarios arise at prediction time that require more careful consideration. For example, are you able to re-train the model for each forecast?
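Two of the capabilities listed above, rolling-origin cross validation and configurable lags, are general time-series techniques that can be sketched in a few lines. The code below is a plain-Python illustration under our own assumptions, not the implementation used by automated ML. The function names, the expanding training window, and the sample numbers are all hypothetical.

```python
from typing import Iterator, List, Optional, Tuple


def rolling_origin_splits(
    n_obs: int, n_splits: int, horizon: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train, validation) index ranges, oldest forecast origin first.

    Each fold trains on every observation before the origin and validates
    on the next `horizon` observations, so validation data is never older
    than the training data.
    """
    first_origin = n_obs - n_splits * horizon
    if first_origin < 1:
        raise ValueError("not enough observations for the requested folds")
    for k in range(n_splits):
        origin = first_origin + k * horizon
        yield range(0, origin), range(origin, origin + horizon)


def lag_features(
    series: List[float], lags: List[int]
) -> List[List[Optional[float]]]:
    """Row t holds series[t - lag] for each lag, or None without history."""
    return [
        [series[t - lag] if t - lag >= 0 else None for lag in lags]
        for t in range(len(series))
    ]


# Ten observations, three folds, forecast horizon of two steps.
folds = list(rolling_origin_splits(n_obs=10, n_splits=3, horizon=2))
# Lag-1 and lag-2 features for a short hypothetical series.
lagged = lag_features([1.0, 2.0, 3.0, 4.0], lags=[1, 2])
```

Moving the origin forward fold by fold mimics how a deployed model is used: it only ever sees the past when predicting the future, which ordinary shuffled cross validation does not guarantee.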




Announcing self-serve experience for Azure Event Hubs Clusters

For businesses today, data is indispensable. Innovative ideas in manufacturing, health care, transportation, and financial industries are often the result of capturing and correlating data from multiple sources. Now more than ever, the ability to reliably ingest and respond to large volumes of data in real time is the key to gaining competitive advantage for consumer and commercial businesses alike. To meet these big data challenges, Azure Event Hubs offers a fully managed and massively scalable distributed streaming platform designed for a plethora of use cases from telemetry processing to fraud detection.

Event Hubs has been immensely popular with Azure’s largest customers and now even more so with the recent release of Event Hubs for Apache Kafka. With this powerful new capability, customers can stream events from Kafka applications seamlessly into Event Hubs without having to run Zookeeper or manage Kafka clusters, all while benefitting from a fully managed platform-as-a-service (PaaS) with features like auto-inflate and geo-disaster recovery. Because Event Hubs is the front door to Azure’s data pipeline, customers can also automatically Capture streaming events into Azure Storage or Azure Data Lake, or natively perform real-time analysis on data streams using Azure Stream Analytics.
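For a Kafka application, pointing at Event Hubs is mostly a matter of client configuration. The sketch below builds such a configuration as a plain dictionary, assuming librdkafka-style property names (as used by clients such as confluent-kafka), the Kafka endpoint on port 9093, and SASL PLAIN authentication with the literal user name `$ConnectionString`. The function name, namespace, and connection string shown are placeholders, not values from this post.

```python
from typing import Dict


def event_hubs_kafka_config(namespace: str, connection_string: str) -> Dict[str, str]:
    """Build Kafka client settings that target an Event Hubs namespace.

    Property names follow the librdkafka convention; Java clients express
    the same settings through a JAAS configuration instead.
    """
    return {
        # The Kafka endpoint of the namespace listens on port 9093.
        "bootstrap.servers": f"{namespace}.servicebus.windows.net:9093",
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "PLAIN",
        # The user name is this literal string; the secret is the
        # namespace connection string.
        "sasl.username": "$ConnectionString",
        "sasl.password": connection_string,
    }


# Placeholder values; never hard-code a real connection string.
config = event_hubs_kafka_config("mynamespace", "Endpoint=sb://placeholder")
```

The resulting dictionary could be passed to a Kafka producer or consumer constructor, while the rest of the application code stays unchanged.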

For customers with the most demanding streaming




Visual data ops for Apache Kafka on Azure HDInsight, powered by Lenses

This blog was written in collaboration with Andrew Stevenson, CTO at Lenses.

Apache Kafka is one of the most popular open source streaming platforms today. However, deploying and running Kafka remains a challenge for most. Azure HDInsight addresses this challenge by providing:

Ease-of-use: Quickly deploy Kafka clusters in the cloud and integrate simply with other Azure services.
Higher scale and lower total-cost-of-operations (TCO): With managed disks, compute and storage are separated, enabling you to have 100s of TBs on a cluster.
Enhanced security: Bring your own key (BYOK) encryption, custom virtual networks, and topic level security with Apache Ranger.

But that’s not all – you can now successfully manage your streaming data operations, from visibility to monitoring, with Lenses, an overlay platform now generally available as part of the Azure HDInsight application ecosystem, right from within the Azure portal!

With Lenses, customers can now:

Easily look inside Kafka topics
Inspect and modify streaming data using SQL
Visualize application landscapes

Look inside Kafka topics

A typical production Kafka cluster has thousands of topics. Imagine you want to get a high level view on all of these topics. You may want to understand the configuration of the various topics, such as




Drive higher utilization of Azure HDInsight clusters with Autoscale

We are excited to share the preview release of the Autoscale feature for Azure HDInsight. This feature enables enterprises to become more productive and cost-efficient by automatically scaling clusters up or down based on the load or a customized schedule. 

Let’s consider the scenario of a U.S.-based health provider that uses Azure HDInsight to build a unified big data platform at the corporate level, processing various data for trend prediction and usage pattern analysis. To achieve their business goals, they operate multiple HDInsight clusters in production for real-time data ingestion and for batch and interactive analysis.

Some clusters are customized to exact requirements, such as ISV/line of business applications and access control policies, which are subject to rigorous SLA requirements. Sizing such clusters is a hard problem by itself, and operating them 24/7 at peak capacity is expensive. So once the clusters are created, IT admins need to either manually monitor the dynamic capacity requirements and scale the clusters up and down, or develop custom tools to do the same. These challenges prevent IT admins from being as productive as possible when building and operating cost-efficient big data analytics workloads.

With the new cluster Autoscaling feature, IT admins can have the