Category Archives : Big Data

13 Sep

HDInsight Tools for VSCode: Integrations with Azure Account and HDInsight Explorer

Making it easy for developers to get started on coding has always been our top priority. We are happy to announce that HDInsight Tools for VS Code now integrates with the VS Code Azure Account extension. This new feature makes the Azure HDInsight sign-in experience much easier. For first-time users, the tools copy the required sign-in code to the clipboard and automatically open the Azure sign-in portal, where you can paste the code and complete the authentication process. For returning users, the tools sign you in automatically. You can quickly start authoring PySpark or Hive jobs, performing data queries, or navigating your Azure resources.

We are also excited to introduce a graphical tree view for the HDInsight Explorer within VS Code. With HDInsight Explorer, data scientists and data developers can navigate HDInsight Hive and Spark clusters across subscriptions and tenants, and browse Azure Data Lake Storage and Blob Storage connected to these HDInsight clusters. Moreover, you can inspect your Hive metadata database and table schema.

Key customer benefits

- Support Azure auto sign-in and an improved sign-in experience via integration with the Azure Account extension.
- Enable multi-tenant support, so you can manage your Azure subscription resources across tenants.
- Gain insights into available HDInsight Spark,

12 Sep

Real-time data analytics and Azure Data Lake Storage Gen2

It’s been a little more than two months since we launched Azure Data Lake Storage Gen2, and we’re thrilled and overwhelmed by the response we’ve received from customers and partners alike. We built Azure Data Lake Storage to deliver a no-compromises data lake, and the high level of customer engagement in Gen2’s public preview confirms our approach. We have heard from customers large and small, across a broad range of markets and industries, that Gen2’s ability to provide object-storage scale and cost effectiveness with a world-class data lake experience is exceeding their expectations, and we couldn’t be happier to hear it!

Partner enablement in Gen2

In fact, we are actively partnering with leading ISVs across the big data spectrum, including platform providers, data movement and ETL, governance and data lifecycle management (DLM), analysis, presentation, and beyond, to ensure seamless integration between Gen2 and their solutions.

Over the next few months you will hear more about the exciting work these partners are doing with ADLS Gen2. We’ll do blog posts, events, and webinars that highlight these industry-leading solutions.

In fact, I am happy to announce our first joint Gen2 engineering-ISV webinar with Attunity on September 18th, Real-time

11 Sep

Retail brands: gain a competitive advantage with modern data management

Retailers today are responsible for a significant amount of data. As a customer, I expect the data I provide to a retailer to be handled properly, in accordance with the Payment Card Industry Data Security Standard (PCI DSS), the General Data Protection Regulation (GDPR), and other compliance guidelines.

I also expect that the data I give to a retailer is being leveraged in a way that improves my shopping experience. For example, I want better recommendations given my purchase history, my “likes,” and my browsing. I expect the retailer to know my email and payment information for a single click checkout. These are small and trivial frictions that can be eliminated, as long as the data is handled properly with the right permissions. And the capture and analysis of big data (general information aggregated from all customers) is an opportunity for the retailer to fine-tune the overall customer experience.

However, much of the collected data goes unused. This occurs because the infrastructure within an organization cannot make the data accessible or searchable, so it can’t be used to improve decision-making across the retail value chain. The unused data comes from many sources: mobile devices, digital and physical store shopping, and IoT.

06 Sep

Exciting new capabilities on Azure HDInsight

Friends of Azure HDInsight, it’s been a busy summer. I wanted to summarize several noteworthy enhancements we’ve recently brought to HDInsight. We have even more exciting releases coming up at Ignite so please stay tuned!

Product updates

Apache Phoenix and Zeppelin integration

You can now query data in Apache Phoenix from Zeppelin.

Apache Phoenix is an open-source, massively parallel relational database layer built on HBase. Phoenix lets you use SQL-like queries over HBase. Under the hood, Phoenix uses Java Database Connectivity (JDBC) drivers to let users create, delete, and alter SQL tables, indexes, views, and sequences, and upsert rows individually and in bulk. Phoenix compiles queries natively for NoSQL rather than using MapReduce, enabling the creation of low-latency applications on top of HBase.
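To illustrate the upsert behavior mentioned above: Phoenix maps each SQL row onto an HBase row key, so writing the same key twice updates the existing row instead of creating a duplicate. The following is a plain-Python conceptual sketch of that insert-or-update semantics, not the Phoenix API itself:

```python
# Conceptual sketch of UPSERT semantics: writing to an existing row key
# updates its column values; writing to a new key inserts a row.
# Plain Python for illustration only, not the Phoenix JDBC API.

table = {}  # row_key -> column values, standing in for an HBase table

def upsert(row_key, **columns):
    """Insert the row if absent, otherwise update the given columns."""
    table.setdefault(row_key, {}).update(columns)

upsert("user1", name="Ada", city="London")
upsert("user1", city="Cambridge")   # same key: updates in place, no duplicate
upsert("user2", name="Grace")

print(table)
# {'user1': {'name': 'Ada', 'city': 'Cambridge'}, 'user2': {'name': 'Grace'}}
```

The same idea is what makes bulk upserts cheap: each row write is an idempotent put keyed by the primary key.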

Apache Phoenix enables online transaction processing (OLTP) and operational analytics in Hadoop for low-latency applications by combining the best of both worlds. In Azure HDInsight, Apache Phoenix is delivered as a first-class open-source framework.

Read More: Azure #HDInsight Apache Phoenix now supports Zeppelin 

Oozie support in HDInsight enterprise security package

Oozie is a workflow scheduler system for managing Apache Hadoop jobs. You can now use Oozie in domain-joined Hadoop clusters to

06 Sep

Powerful Debugging Tools for Spark for Azure HDInsight

Microsoft runs one of the largest big data clusters in the world, known internally as “Cosmos”. It runs millions of jobs across hundreds of thousands of servers over multiple exabytes of data. Enabling developers to run and manage jobs at this scale was a huge challenge: jobs with hundreds of thousands of vertices are common, and even quickly figuring out why a job ran slowly, or narrowing down bottlenecks, was difficult. We built powerful tools that graphically show the entire job graph, including vertex execution times, playback, and more, which helped developers greatly. While this was built for our internal language in Cosmos (called Scope), we are working very hard to bring this power to all Spark developers.

Today, we are delighted to announce the public preview of the Apache Spark Debugging Toolset for HDInsight, for Spark 2.3 clusters and later. The default Spark History Server experience is now enhanced in HDInsight with rich information on your Spark jobs and powerful interactive visualizations of job graphs and data flows. The new features greatly assist HDInsight Spark developers in job data management, data sampling, job monitoring, and job diagnosis.

Spark History Server Enhancements

04 Sep

Delivering innovation in retail with the flexible and productive Microsoft AI platform

I often follow several publications related to trends and emerging innovations in retail and consumer goods. Artificial intelligence (AI) continues to be touted as a key ingredient in transforming this industry. I agree with this sentiment: cloud computing and data availability, combined, create a compelling case for modernization.

We are seeing real applications of AI that deliver positive business improvements, solving problems ranging from service to production. These examples are tangible and exemplify the merits of AI and its applicability in retail and consumer goods. Take Macy’s virtual agent, which can solve customer issues via the web and transfer customers seamlessly to a live agent if necessary. More than one-quarter of customer queries are answered by the virtual agent, improving the speed of service for customers and providing valuable data that is connected to back-end systems through the Microsoft Dynamics 365 AI solution for customer service.

Deschutes Brewery is another great example. It’s the seventh-largest craft brewery in the United States. By partnering with OSIsoft to collect and manage production data through the PI System and the Microsoft Cortana Intelligence Suite, they estimate a 20 percent increase in production capacity on existing equipment by implementing

28 Aug

Extracting actionable insights from IoT data to drive more efficient manufacturing

Thanks to the explosion of IoT we now have millions of devices, machines, products, and assets connected and streaming terabytes of data. But connecting to devices, ingesting and storing their sensor data is just the first step. The whole point of collecting this data is to extract actionable insights — insights that will trigger some sort of action that will result in business value such as:

- Optimized factory operations: reduce cycle time, increase throughput, increase machine utilization, reduce costs, reduce unplanned downtime.
- Improved product quality: reduce manufacturing defects, identify design features that are causing manufacturing problems.
- Better understanding of customer demand: validate usage assumptions, understand product usage patterns.
- New sources of revenue: support attached services, Product-as-a-Service models.
- Improved customer experience: respond more quickly to issues, help them optimize their usage of your product.

Extracting insights from IoT data is essentially a big data analytics challenge. It’s about analyzing lots of data, coming in fast, from different sources and in different formats. But it’s not your garden-variety analytics problem because: (1) data comes from “things” (as opposed to from humans or other software systems), (2) IoT data is almost always real-time, streamed, time-series data, coming in at different frequencies, and (3)
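To make point (2) concrete, a common first step with multi-frequency, time-series sensor data is to bucket readings into shared time windows before analysis. Here is a small, hedged sketch in plain Python with invented sensor data; a production pipeline would use a stream-processing engine rather than in-memory lists:

```python
# Illustrative sketch: bucket sensor readings that arrive at different
# frequencies into common 10-second windows, then average each window.
# Hypothetical data and function names for illustration.
from collections import defaultdict

def window_avg(readings, window_s=10):
    """readings: list of (timestamp_s, value) -> {window_start: mean value}."""
    buckets = defaultdict(list)
    for ts, value in readings:
        buckets[(ts // window_s) * window_s].append(value)
    return {start: sum(vals) / len(vals) for start, vals in sorted(buckets.items())}

fast_sensor = [(0, 20.0), (2, 21.0), (4, 22.0), (11, 25.0)]   # ~2 s cadence
slow_sensor = [(0, 100.0), (15, 110.0)]                       # ~15 s cadence

print(window_avg(fast_sensor))  # {0: 21.0, 10: 25.0}
print(window_avg(slow_sensor))  # {0: 100.0, 10: 110.0}
```

Once both streams share the same window boundaries, they can be joined and analyzed together despite their different arrival rates.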

27 Aug

Sharing a self-hosted Integration Runtime infrastructure with multiple Data Factories

The Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory to provide data integration capabilities across different network environments. If you need to perform data integration and orchestration securely in a private network environment, which does not have a direct line-of-sight from the public cloud environment, you can install a self-hosted IR on premises behind your corporate firewall, or inside a virtual private network.

Until now, you were required to create at least one such compute infrastructure in every data factory, by design, for hybrid and on-premises data integration. This implies that if you have ten data factories, used by different project teams to access on-premises data stores and orchestrate inside a VNet, you would have to create ten self-hosted IR infrastructures, adding cost and management overhead for IT teams.

With the new self-hosted IR sharing capability, you can share the same self-hosted IR infrastructure across data factories. This lets you reuse the same highly available and scalable self-hosted IR infrastructure from different data factories within the same Azure Active Directory tenant. We are introducing a new concept, the linked self-hosted IR, which references another self-hosted IR infrastructure. This does not introduce
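Conceptually, a linked self-hosted IR is just a lightweight reference to an IR hosted elsewhere, rather than a second copy of the infrastructure. A rough sketch in plain Python, with hypothetical names and no relation to the actual Data Factory API:

```python
# Conceptual sketch: several data factories referencing one shared
# self-hosted integration runtime instead of each hosting its own.
# All names are hypothetical, for illustration only.

class SelfHostedIR:
    """The one physical compute infrastructure, installed on-premises."""
    def __init__(self, name):
        self.name = name

class LinkedIR:
    """A reference to an IR hosted by another data factory."""
    def __init__(self, target):
        self.target = target

shared_ir = SelfHostedIR("corp-onprem-ir")  # created once, in one factory
factories = {f"factory-{i}": LinkedIR(shared_ir) for i in range(1, 4)}

# Every factory resolves to the same underlying infrastructure.
print(all(link.target is shared_ir for link in factories.values()))  # True
```

The point of the sketch is the sharing pattern: one machine-backed IR, many factories holding references to it, rather than ten installations.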


16 Aug

Azure #HDInsight Interactive Query: simplifying big data analytics architecture

Fast interactive BI, data security, and end-user adoption are three critical challenges for successful big data analytics implementations. Without the right architecture and tools, many big data and analytics projects fail to catch on with everyday BI users and enterprise security architects. In this blog we discuss architectural approaches that will help you architect a big data solution for fast interactive queries, a simplified security model, and improved adoption among BI users.

Traditional approach to fast interactive BI

Deep analytical queries processed on Hadoop systems have traditionally been slow. MapReduce jobs or Hive queries are used for heavy processing of large datasets; however, they are not suitable for the fast response times required by interactive BI.

Faced with user dissatisfaction due to the lack of query interactivity, data architects turned to techniques such as building OLAP cubes on top of Hadoop. An OLAP cube is a mechanism for storing all the different dimensions, measures, and hierarchies up front. Processing the cube usually takes place at a pre-specified interval. After processing, results are available in advance, so when the BI tool queries the cube it just needs to locate the result, thereby limiting the query response time and making it a fast and interactive
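The pre-aggregation idea behind the cube can be sketched in a few lines. This is illustrative Python with hypothetical sales data, not an OLAP engine: totals are computed once, up front, so a BI query becomes a lookup instead of a scan.

```python
# Minimal sketch of OLAP-style pre-aggregation: compute totals for every
# observed combination of dimension values ahead of time, so query time
# is a dictionary lookup. Hypothetical sales rows for illustration.

rows = [
    {"region": "East", "year": 2017, "sales": 120},
    {"region": "East", "year": 2018, "sales": 150},
    {"region": "West", "year": 2018, "sales": 90},
]

def build_cube(rows, dims=("region", "year")):
    """Precompute a sales total for each combination of dimension values."""
    cube = {}
    for row in rows:
        key = tuple(row[d] for d in dims)
        cube[key] = cube.get(key, 0) + row["sales"]
    return cube

cube = build_cube(rows)       # "processing the cube" at a scheduled interval
print(cube[("East", 2018)])   # query time is just a lookup -> prints 150
```

The trade-off matches the blog's description: fast, predictable reads at the cost of processing the cube in advance and refreshing it on a schedule.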