MICROSOFT IGNITE, ORLANDO, Florida, September 24, 2018 – Earlier today, Microsoft Corporation announced its continuing support and commitment to enterprises seeking to use Hadoop for open source big data analytics in the cloud. Leading off the series of major upgrades to the Azure HDInsight service is the preview release of Hadoop 3.0, the transformational update to the Hadoop stack that enterprises have been waiting for since earlier this year. In addition, enterprises with strict security and compliance requirements will be able to secure their Azure HDInsight clusters using the Enterprise Security Package. And there is something in this release for everybody! Spark developers will particularly like the series of innovations from Microsoft that will now allow them to quickly identify and resolve performance bottlenecks in their code.
“We have been honored to be part of the open source analytics community,” said Ryan Waite, Director of Big Data Product Management. “We’re making open source analytics central to our product strategy, from our investments in HDInsight, to our participation in projects like YARN, to our shift to using open source analytics in our internal data lake. The rate of innovation in this space is only increasing with Hadoop 3.0. We are excited
Azure Databricks provides a fast, easy, and collaborative Apache Spark-based analytics platform to accelerate and simplify the process of building big data and AI solutions that drive the business forward, all backed by industry-leading SLAs.
Since announcing general availability in March, we have been continuously listening to customers and adding functionality to the Azure Databricks service. Today, I am excited to announce several new updates to Azure Databricks.
General availability

Azure Databricks is now available in Japan, Canada, India, and Australia Central
We are excited to announce the general availability of Azure Databricks in additional regions: Japan, Canada, India, Australia Central, and Australia Central 2. These additional locations bring worldwide availability of the product to 24 regions, backed by a 99.95 percent SLA.
We want to ensure that we build our cloud infrastructure to serve the needs of customers by driving innovation and making it accessible globally. Stay updated with the region availability for Azure Databricks.
Organizations also benefit from Azure Databricks’ native integration with other services like Azure Blob Storage, Azure Data Factory, Azure SQL Data Warehouse, and Azure Cosmos DB. This enables new analytics solutions that support modern data warehousing, advanced analytics, and real-time analytics scenarios.
Making it easy for developers to get started on coding has always been our top priority. We are happy to announce that HDInsight Tools for VS Code now integrates with the VS Code Azure Account extension. This new feature makes your Azure HDInsight sign-in experience much easier. For first-time users, the tools put the required sign-in code into the clipboard and automatically open the Azure sign-in portal, where you can paste the code and complete the authentication process. For returning users, the tools sign you in automatically. You can quickly start authoring PySpark or Hive jobs, performing data queries, or navigating your Azure resources.
We are also excited to introduce a graphical tree view for the HDInsight Explorer within VS Code. With HDInsight Explorer, data scientists and data developers can navigate HDInsight Hive and Spark clusters across subscriptions and tenants, and browse Azure Data Lake Storage and Blob Storage connected to these HDInsight clusters. Moreover, you can inspect your Hive metadata database and table schema.
Key Customer Benefits

Support Azure auto sign-in and an improved sign-in experience via integration with the Azure Account extension.
Enable multi-tenant support so you can manage your Azure subscription resources across tenants.
Gain insights into available HDInsight Spark,
It’s been a little more than two months since we launched Azure Data Lake Storage Gen2, and we’re thrilled and overwhelmed by the response we’ve received from customers and partners alike. We built Azure Data Lake Storage to deliver a no-compromises data lake, and the high level of customer engagement in Gen2’s public preview confirms our approach. We have heard from customers both large and small, across a broad range of markets and industries, that Gen2’s ability to provide object storage scale and cost effectiveness with a world-class data lake experience is exceeding their expectations, and we couldn’t be happier to hear it!
Partner enablement in Gen2
In fact, we are actively partnering with leading ISVs across the big data spectrum of platform providers, data movement and ETL, governance and data lifecycle management (DLM), analysis, presentation, and beyond, to ensure seamless integration between Gen2 and their solutions.
Over the next few months you will hear more about the exciting work these partners are doing with ADLS Gen2. We’ll do blog posts, events, and webinars that highlight these industry-leading solutions.
In fact, I am happy to announce our first joint Gen2 engineering-ISV webinar with Attunity on September 18th, Real-time
Retailers today are responsible for a significant amount of data. As a customer, I expect that the data I provide to a retailer is handled properly following the Payment Card Industry Data Security Standard (PCI DSS), General Data Protection Regulation (GDPR) and other compliance guidelines.
I also expect that the data I give to a retailer is being leveraged in a way that improves my shopping experience. For example, I want better recommendations given my purchase history, my “likes,” and my browsing. I expect the retailer to know my email and payment information for a single click checkout. These are small and trivial frictions that can be eliminated, as long as the data is handled properly with the right permissions. And the capture and analysis of big data (general information aggregated from all customers) is an opportunity for the retailer to fine-tune the overall customer experience.
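As a toy illustration of the kind of recommendation described above, the sketch below suggests items based on how often they were co-purchased with an item in a customer's history. All data, item names, and the scoring rule are hypothetical; a production recommender would use far richer signals than raw co-occurrence counts.

```python
from collections import Counter
from itertools import combinations

# Toy purchase histories (all data hypothetical): one set of items per customer.
baskets = [
    {"coffee", "filters", "mug"},
    {"coffee", "filters"},
    {"coffee", "mug", "tea"},
    {"tea", "mug"},
]

# Count how often each ordered pair of items appears in the same basket.
co_occurrence = Counter()
for basket in baskets:
    for a, b in combinations(sorted(basket), 2):
        co_occurrence[(a, b)] += 1
        co_occurrence[(b, a)] += 1

def recommend(item, k=2):
    """Return the k items most often co-purchased with `item`."""
    scores = {b: n for (a, b), n in co_occurrence.items() if a == item}
    return [b for b, _ in sorted(scores.items(), key=lambda kv: -kv[1])[:k]]

print(recommend("coffee"))  # prints ['filters', 'mug'] for this toy data
```

The same counting idea scales out naturally: with millions of baskets, the pair counts become a distributed aggregation job, which is exactly the kind of workload the big data services discussed here are built for.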
However, much of the data collected goes unused. This occurs because the infrastructure within an organization is unable to make the data accessible or searchable, so it can’t be used to improve decision-making across the retail value chain. The unused data comes from many sources: mobile devices, digital and physical store shopping, and IoT.
Friends of Azure HDInsight, it’s been a busy summer. I wanted to summarize several noteworthy enhancements we’ve recently brought to HDInsight. We have even more exciting releases coming up at Ignite so please stay tuned!
Product updates

Apache Phoenix and Zeppelin integration
You can now query data in Apache Phoenix from Zeppelin.
Apache Phoenix is an open source, massively parallel relational database layer built on HBase. Phoenix allows you to run SQL-like queries over HBase. Underneath, Phoenix uses Java Database Connectivity (JDBC) drivers to enable users to create, delete, and alter SQL tables, indexes, views, and sequences, and to upsert rows individually and in bulk. Phoenix compiles queries into native HBase calls rather than MapReduce jobs, enabling the creation of low-latency applications on top of HBase.
Apache Phoenix enables online transaction processing (OLTP) and operational analytics in Hadoop for low-latency applications, combining the best of both worlds: the low-latency operations of HBase and the familiarity of SQL. In Azure HDInsight, Apache Phoenix is delivered as a first-class open source framework.
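As a sketch of what Phoenix's SQL-like surface looks like, the statements below define a table and upsert a row. The table and column names are hypothetical; in practice these strings would be executed through Phoenix's JDBC driver (for example, with a connection URL of the form jdbc:phoenix:&lt;zookeeper-quorum&gt;) against a live HBase cluster, which this sketch does not attempt.

```python
# Hypothetical Phoenix statements. Note UPSERT in place of separate
# INSERT/UPDATE verbs, which reflects HBase's put semantics.
create_table = """
CREATE TABLE IF NOT EXISTS web_stat (
    host    VARCHAR NOT NULL,
    hit_ts  TIMESTAMP NOT NULL,
    hits    BIGINT,
    CONSTRAINT pk PRIMARY KEY (host, hit_ts)
)
"""

upsert_row = "UPSERT INTO web_stat (host, hit_ts, hits) VALUES (?, ?, ?)"

query = "SELECT host, SUM(hits) FROM web_stat GROUP BY host"

# On a real cluster these would be submitted over a JDBC connection;
# here we only show the statement shapes.
for stmt in (create_table, upsert_row, query):
    print(stmt.strip())
```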
Oozie support in HDInsight Enterprise Security Package
Oozie is a workflow scheduler system for managing Apache Hadoop jobs. You can now use Oozie in domain-joined Hadoop clusters to
Microsoft runs one of the largest big data clusters in the world, known internally as “Cosmos”. It runs millions of jobs across hundreds of thousands of servers over multiple exabytes of data. Enabling developers to run and manage jobs at this scale was a huge challenge: jobs with hundreds of thousands of vertices are common, and even quickly figuring out why a job ran slowly, or narrowing down bottlenecks, was difficult. We built powerful tools that graphically show the entire job graph, including vertex execution times, playback, and more, which helped developers greatly. While these tools were built for our internal language in Cosmos (called Scope), we are working very hard to bring this power to all Spark developers.
Today, we are delighted to announce the Public Preview of the Apache Spark Debugging Toolset for HDInsight, for Spark 2.3 clusters and later. The default Spark history server experience is now enhanced in HDInsight with rich information on your Spark jobs and powerful interactive visualizations of job graphs and data flows. The new features greatly assist HDInsight Spark developers in job data management, data sampling, job monitoring, and job diagnosis.
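The enhanced experience builds on top of the Spark history server, whose open-source REST monitoring API exposes per-application and per-job data. The sketch below summarizes a (hard-coded, hypothetical) response of the shape returned by GET /api/v1/applications/&lt;app-id&gt;/jobs; against a real cluster you would fetch this over HTTP from the history server instead.

```python
import json

# Hypothetical jobs payload in the shape of the Spark history server
# REST API; in practice this would be fetched from the cluster.
sample_response = json.dumps([
    {"jobId": 0, "name": "count at App.scala:10", "status": "SUCCEEDED",
     "numTasks": 200, "numFailedTasks": 0},
    {"jobId": 1, "name": "collect at App.scala:14", "status": "SUCCEEDED",
     "numTasks": 400, "numFailedTasks": 7},
])

jobs = json.loads(sample_response)

# Flag jobs that saw task failures -- a first step when diagnosing a slow run.
suspect = [j["jobId"] for j in jobs if j["numFailedTasks"] > 0]
print(suspect)  # [1]
```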
Spark History Server Enhancements
I often follow several publications related to trends and emerging innovations in retail and consumer goods. Artificial Intelligence (AI) continues to be touted as a key ingredient in transforming this industry. I agree with this sentiment: cloud computing and data availability are the critical components that, combined, create a case for modernization.
We are seeing real applications of AI resulting in positive business improvements, aimed at solving a range of service and production problems. These examples are tangible and exemplify the merits of AI and its applicability in retail and consumer goods. Take Macy’s virtual agent, which can solve customer issues via the web and transfer customers seamlessly to a live agent if necessary. More than one-quarter of customer queries are answered by the virtual agent, improving the speed of service for customers and providing valuable data that is connected to back-end systems through the Microsoft Dynamics 365 AI solution for customer service.
Deschutes Brewery is another great example. It’s the seventh-largest craft brewery in the United States. By pairing the OSIsoft PI System, which collects and manages production data, with the Microsoft Cortana Intelligence Suite, they have estimated a 20 percent increase in production capacity from existing equipment by implementing
Thanks to the explosion of IoT we now have millions of devices, machines, products, and assets connected and streaming terabytes of data. But connecting to devices, ingesting and storing their sensor data is just the first step. The whole point of collecting this data is to extract actionable insights — insights that will trigger some sort of action that will result in business value such as:
Optimized factory operations: reduce cycle time, increase throughput, increase machine utilization, reduce costs, reduce unplanned downtime.
Improved product quality: reduce manufacturing defects, identify design features that are causing manufacturing problems.
Better understanding of customer demand: validate usage assumptions, understand product usage patterns.
New sources of revenue: support attached services, Product-as-a-Service models.
Improved customer experience: respond more quickly to issues, help customers optimize their usage of your product.
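As a toy illustration of turning streamed sensor readings into an actionable signal (for example, catching a machine drifting toward unplanned downtime), the sketch below flags readings that deviate sharply from a rolling baseline. All readings, the window size, and the threshold are hypothetical; real pipelines would run this logic over live streams at much larger scale.

```python
from statistics import mean, stdev

# Hypothetical temperature readings streamed from one machine.
readings = [70.1, 70.3, 69.9, 70.2, 70.0, 70.4, 82.5, 70.1, 70.2]

WINDOW = 5       # how many past readings form the rolling baseline
THRESHOLD = 3.0  # flag readings more than 3 standard deviations out

anomalies = []
for i in range(WINDOW, len(readings)):
    baseline = readings[i - WINDOW:i]
    mu, sigma = mean(baseline), stdev(baseline)
    if sigma > 0 and abs(readings[i] - mu) > THRESHOLD * sigma:
        anomalies.append(i)

print(anomalies)  # index 6 (the 82.5 reading) is flagged
```

A flagged index would then feed whatever action delivers the business value above, such as opening a maintenance ticket or alerting an operator.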
Extracting insights from IoT data is essentially a big data analytics challenge. It’s about analyzing lots of data, coming in fast, from different sources and in different formats. But it’s not your garden-variety analytics problem because: (1) data comes from “things” (as opposed to from humans or other software systems), (2) IoT data is almost always real-time, streamed, time-series data, coming in at different frequencies, and (3)
The Integration Runtime (IR) is the compute infrastructure used by Azure Data Factory to provide data integration capabilities across different network environments. If you need to perform data integration and orchestration securely in a private network environment, which does not have a direct line-of-sight from the public cloud environment, you can install a self-hosted IR on premises behind your corporate firewall, or inside a virtual private network.
Until now, you were required to create at least one such compute infrastructure in every data factory for hybrid and on-premises data integration. This means that if you have ten data factories, used by different project teams to access on-premises data stores and orchestrate inside a VNet, you would have to create ten self-hosted IR infrastructures, adding cost and management overhead for IT teams.
With the new capability of self-hosted IR sharing, you can share the same self-hosted IR infrastructure across data factories. This lets you reuse the same highly available and scalable self-hosted IR infrastructure from different data factories within the same Azure Active Directory tenant. We are introducing a new concept, the Linked self-hosted IR, which references another self-hosted IR infrastructure. This does not introduce