Category Archives: Supportability

17 Aug

Advancing the outage experience—automation, communication, and transparency

“Service incidents like outages are an unfortunate inevitability of the technology industry. Of course, we are constantly improving the reliability of the Microsoft Azure cloud platform. We meet and exceed our Service Level Agreements (SLAs) for the vast majority of customers and continue to invest in evolving tools and training that make it easy for you to design and operate mission-critical systems with confidence.

In spite of these efforts, we acknowledge the unfortunate reality that—given the scale of our operations and the pace of change—we will never be able to avoid outages entirely. During these times we endeavor to be as open and transparent as possible to ensure that all impacted customers and partners understand what’s happening. As part of our Advancing Reliability blog series, I asked Sami Kubba, Principal Program Manager overseeing our outage communications process, to outline the investments we’re making to continue improving this experience.”—Mark Russinovich, CTO, Azure

 

In the cloud industry, we have a commitment to bring our customers the latest technology at scale, keeping customers and our platform secure, and ensuring that our customer experience is always optimal. For this to happen Azure is subject to a significant amount of change—and in

29 Jun

Advancing Azure service quality with artificial intelligence: AIOps

“In the era of big data, insights collected from cloud services running at the scale of Azure quickly exceed the attention span of humans. It’s critical to identify the right steps to maintain the highest possible quality of service based on the large volume of data collected. In applying this to Azure, we envision infusing AI into our cloud platform and DevOps process, becoming AIOps, to enable the Azure platform to become more self-adaptive, resilient, and efficient. AIOps will also support our engineers to take the right actions more effectively and in a timely manner to continue improving service quality and delighting our customers and partners. This post continues our Advancing Reliability series highlighting initiatives underway to keep improving the reliability of the Azure platform. The post that follows was written by Jian Zhang, our Program Manager overseeing these efforts, as she shares our vision for AIOps, and highlights areas of this AI infusion that are already a reality as part of our end-to-end cloud service management.”—Mark Russinovich, CTO, Azure

This post includes contributions from Principal Data Scientist Manager Yingnong Dang and Partner Group Software Engineering Manager Murali Chintalapati.

 

As Mark mentioned when he launched this Advancing Reliability blog

03 Jan

Advancing no-impact and low-impact maintenance technologies

“This post continues our reliability series kicked off by my July blog post highlighting several initiatives underway to keep improving platform availability, as part of our commitment to provide a trusted set of cloud services. Today I wanted to double-click on the investments we’ve made in no-impact and low-impact update technologies including hot patching, memory-preserving maintenance, and live migration. We’ve deployed dozens of security and reliability patches to host infrastructure in the past year, many of which were implemented with no customer impact or downtime. The post that follows was written by John Slack from our core operating systems team, who is the Program Manager for several of the update technologies discussed below.” – Mark Russinovich, CTO, Azure

This post was co-authored by Apurva Thanky, Cristina del Amo Casado, and Shantanu Srivastava from the engineering teams responsible for these technologies.

 

We regularly update Azure host infrastructure to improve the reliability, performance, and security of the platform. While the purposes of these ‘maintenance’ updates vary, they typically involve updating software components in the hosting environment or decommissioning hardware. If we go back five years, the only way to apply some of these updates was by fully rebooting the entire host.

14 Aug

Improving Azure Virtual Machines resiliency with Project Tardigrade

“Our goal is to empower organizations to run their workloads reliably on Azure. With this as our guiding principle, we are continuously investing in evolving the Azure platform to become fault resilient, not only to boost business productivity but also to provide a seamless customer experience. Last month I published a blog post highlighting several initiatives underway to keep improving in this space, as part of our commitment to provide a trusted set of cloud services. Today I wanted to expand on the mention of Project Tardigrade – a platform resiliency initiative that improves high availability of our services even during the rare cases of spontaneous platform failures. The post that follows was written by Pujitha Desiraju and Anupama Vedapuri from our compute platform fundamentals team, who are leading these efforts.” – Mark Russinovich, CTO, Azure

This post was co-authored by Jim Cavalaris, Principal Software Engineer, Azure Compute. 

 

Codenamed Project Tardigrade, this effort draws its inspiration from the eight-legged microscopic creature, the tardigrade, also known as the water bear. Virtually impossible to kill, tardigrades can be exposed to extreme conditions, but somehow still manage to wiggle their way to survival. This is exactly what we envision our servers to emulate

07 Aug

High Availability Add-On updates for Red Hat Enterprise Linux on Azure

High availability is crucial to mission-critical production environments. The Red Hat Enterprise Linux High Availability Add-On provides reliability and availability to critical production services that use it. Today, we’re sharing performance improvements and image updates around the High Availability Add-On for Red Hat Enterprise Linux (RHEL) on Azure.

Pacemaker

Pacemaker is a robust and powerful open-source resource manager used in highly available compute clusters. It is a key part of the High Availability Add-On for RHEL.

Pacemaker has been updated with performance improvements in the Azure Fencing Agent to significantly decrease Azure failover time, which greatly reduces customer downtime. This update is available to all RHEL 7.4+ users using either the Pay-As-You-Go images or Bring-Your-Own-Subscription images from the Azure Marketplace.

New pay-as-you-go RHEL images with the High Availability Add-On

We now have RHEL Pay-As-You-Go (PAYG) images with the High Availability Add-On available in the Azure Marketplace. These RHEL images have additional access to the High Availability Add-On repositories. Pricing details for these images are available in the pricing calculator.

The following RHEL HA PAYG images are now available in the Marketplace for all Azure regions, including US Government Cloud:

RHEL 7.4 with HA
RHEL 7.5 with HA
RHEL 7.6 with

29 May

Isolate app integrations for stability, scalability, and speed with an integration service environment

Innovation at scale is a common challenge facing large organizations. A key contributor to the challenge is the complexity in coordinating the sheer number of apps and environments.

Integration tools, such as Azure Logic Apps, give you the flexibility to scale and innovate as fast as you want, on-premises or in the cloud. This is a key capability you need to have in place when migrating to the cloud, or even if you’re cloud native. Often, integration has been treated as something to do after the fact. In the modern enterprise, however, application integration has to be done in conjunction with application development and innovation.

An integration service environment is the ideal solution for organizations that are concerned about noisy neighbor issues or data isolation, or that need more flexibility and configurability than the core Logic Apps service offers.

Building upon the existing set of capabilities, we are releasing a number of new, exciting changes that make integration service environments even better, such as:

Faster deployment times by halving the previous provisioning time
Higher throughput limits for an individual Logic App and connectors
An individual Logic App can now run for up to a year (365 days)

Integration

15 May

Microsoft Azure portal May 2019 update

https://azure.microsoft.com/blog/microsoft-azure-portal-may-2019-update/

31 Jul

Azure management groups now in general availability

I am very excited to announce today the general availability of Azure management groups to all our customers. Management groups allow you to organize your subscriptions and apply governance controls, such as Azure Policy and Role-Based Access Controls (RBAC), to the management groups. All subscriptions within a management group automatically inherit the controls applied to the management group. No matter if you have an Enterprise Agreement, Certified Solution Partner, Pay-As-You-Go, or any other type of subscription, this service gives all Azure customers enterprise-grade management at a large scale for no additional cost.
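The inheritance described above can be pictured with a small model. What follows is a minimal sketch, not the Azure API: the class names, the `parent` link (which models the nesting of groups the post describes next), and the assignment strings are all hypothetical and exist only for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ManagementGroup:
    """Hypothetical stand-in for a management group."""
    name: str
    parent: Optional["ManagementGroup"] = None
    # Controls (policies, RBAC roles) assigned directly to this group.
    assignments: List[str] = field(default_factory=list)


@dataclass
class Subscription:
    """Hypothetical stand-in for a subscription placed in one group."""
    name: str
    group: ManagementGroup


def effective_assignments(subscription: Subscription) -> List[str]:
    """Collect every control the subscription inherits.

    Walks from the subscription's group up through each parent group,
    listing controls from the outermost ancestor first.
    """
    inherited: List[str] = []
    node = subscription.group
    while node is not None:
        inherited = node.assignments + inherited
        node = node.parent
    return inherited


# Illustrative data only; these names are not taken from a real tenant.
root = ManagementGroup("Root", assignments=["policy:allowed-locations"])
infra = ManagementGroup("Infrastructure Team", parent=root,
                        assignments=["rbac:reader"])
dev_sub = Subscription("Dev", infra)
print(effective_assignments(dev_sub))
```

In this model a single assignment on a group reaches every subscription beneath it, which is the behavior the post attributes to management groups.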

With the GA launch of this service, we introduce new functionality to Azure that allows you to group subscriptions together so that you can apply a policy or RBAC role to multiple subscriptions, and their resources, with one assignment. Management groups not only allow you to group subscriptions but also allow you to group other management groups to form a hierarchy. The following diagram shows an example of creating a hierarchy for governance using management groups.

By creating a hierarchy like this you can apply a policy, for example, VM locations limited to US West Region on the group “Infrastructure Team management group” to enable internal compliance and

03 Jul

IP filtering for Event Hubs and Service Bus

For scenarios in which Azure Event Hubs or Azure Service Bus should only be accessible from certain well-known sites, the IP Filter feature enables you to configure rules for accepting or rejecting traffic originating from specific IP addresses, for instance the addresses behind a corporate NAT gateway. The Azure team is happy to announce the public preview of IP Filtering for Service Bus Premium and Event Hubs Standard and Dedicated price plans.

This feature allows users to control which IPs are accessing their resources. Some characteristics of this feature:

Rules allow you to specify accept and reject actions on IP masks.
The rules work with IPv4 addresses.
Rules are applied to the namespace level.
You can have multiple rules and they are applied in order.
The first rule that matches the IP address determines the accept or reject action.
Requests from IPs that are rejected receive an unauthorized response.
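The ordered, first-match behavior described above can be sketched in a few lines. This is an illustrative model only, not the service's implementation: the `IpFilterRule` class, its field names, and the `evaluate` function are hypothetical, and the default action for an address that matches no rule is an assumption, since the post does not state it.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address, IPv4Network
from typing import List


@dataclass(frozen=True)
class IpFilterRule:
    """Hypothetical representation of one IP filter rule."""
    name: str
    ip_mask: str   # IPv4 address or CIDR block, e.g. "10.0.0.0/24"
    action: str    # "Accept" or "Reject"


def evaluate(rules: List[IpFilterRule], client_ip: str,
             default_action: str = "Accept") -> str:
    """Return the action of the first rule whose mask contains client_ip.

    Rules are checked in list order. default_action is an assumption
    used when no rule matches.
    """
    address = IPv4Address(client_ip)
    for rule in rules:
        if address in IPv4Network(rule.ip_mask, strict=False):
            return rule.action
    return default_action


# Illustrative rule set: block one host, allow its subnet, reject the rest.
example_rules = [
    IpFilterRule("block-one-host", "10.0.0.5/32", "Reject"),
    IpFilterRule("allow-office", "10.0.0.0/24", "Accept"),
    IpFilterRule("reject-everything-else", "0.0.0.0/0", "Reject"),
]
print(evaluate(example_rules, "10.0.0.7"))
```

Because the first match wins, a narrow rule must appear before a broader rule that covers the same addresses; swapping the first two rules here would accept 10.0.0.5.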

Today these features are available in the Azure portal as shown in the screenshot. You can find them at the Event Hubs or Service Bus namespace level or via an ARM template.

The ARM template below shows how you can use this feature. This template takes the following parameters:

ipFilterRuleName
