SAP on Azure: Designing for availability and recoverability
This is the third in a four-part blog series on Designing a great SAP on Azure Architecture.
Robust SAP on Azure architectures are built on the pillars of security, performance and scalability, availability and recoverability, and efficiency and operations.
We covered designing for performance and scalability previously and within this blog we will focus on availability and recoverability.
Designing for availability
Designing for availability ensures that your mission-critical SAP applications, such as SAP ERP or S/4HANA, have high-availability (HA) provisions applied. These HA provisions make the application resilient to both hardware and software failures and secure the SAP application uptime needed to meet your service-level agreements (SLAs).
Within the links below, you will find a comprehensive overview on Azure virtual machine maintenance versus downtime where unplanned hardware maintenance events, unexpected downtime and planned maintenance events are covered in detail.
From an availability perspective the options you have for deploying SAP on Azure are as follows:
- 99.9 percent SLA for single-instance VMs with Azure premium storage. In this case, the SAP database (DB), SAP Central Services ((A)SCS) and application servers are either running on separate VMs or consolidated on one or more VMs. A 99.9 percent SLA is also offered on our single-node, bare-metal HANA Large Instances.
- 99.95 percent SLA for VMs within the same Azure availability set. The availability set enforces that the VMs within the set are deployed in separate fault and update domains, which safeguards the VMs against unplanned hardware maintenance events, unexpected downtime and planned maintenance events. To ensure HA of the SAP application, availability sets are used in conjunction with Azure Load Balancers, guest operating system clustering technologies such as Windows Server Failover Clustering or Linux Pacemaker to facilitate short failover times, and synchronous database replication technologies (SQL Server Always On, HANA System Replication, etc.) to guarantee no loss of data. Additionally, configuring the SAP Enqueue Replication Server can protect against loss of the SAP lock table during a failover of the (A)SCS.
- 99.99 percent SLA for VMs within Azure availability zones. An availability zone in an Azure region is a combination of a fault domain and an update domain. The Azure platform recognizes this distribution across update domains to ensure that VMs in different zones are not updated at the same time during Azure planned maintenance events. Additionally, availability zones are physically separate zones within an Azure region, where each zone has its own power source, network and cooling, and is logically separated from the other zones within the region. This construct hedges against unexpected downtime due to a hardware or infrastructure failure within a given zone. By architecting the SAP deployment to use replication across zones, that is, DBMS replication (HANA System Replication, SQL Server Always On) and the SAP Enqueue Replication Server, and by distributing the SAP application servers across zones for redundancy, you can protect the SAP system from the loss of a complete datacenter. If one zone is compromised, the SAP system will be available in another zone. For an overview of Azure availability zones and our latest Mv2 VM offering you can check out this video.
- HANA Large Instances are offered at an SLA of 99.99 percent when they are configured as an HA pair. This applies to both single-datacenter and availability zone deployments.
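To put the SLA percentages above into perspective, they can be translated into a downtime budget. The following is a minimal sketch, not an official calculation: it assumes the percentage is applied to a 365-day year purely for illustration (check the SLA terms for the actual measurement period), and the function name `max_downtime_minutes` is hypothetical.

```python
# Annualized downtime budget implied by an availability SLA percentage.
# Assumption: the percentage is applied to a 365-day year for illustration only.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes


def max_downtime_minutes(sla_percent: float, period_minutes: int = MINUTES_PER_YEAR) -> float:
    """Return the maximum downtime (in minutes) the SLA allows over the period."""
    if not 0 <= sla_percent <= 100:
        raise ValueError("sla_percent must be between 0 and 100")
    return period_minutes * (100 - sla_percent) / 100


if __name__ == "__main__":
    # The three SLA levels discussed in the list above.
    for label, sla in [
        ("Single-instance VM", 99.9),
        ("Availability set", 99.95),
        ("Availability zones", 99.99),
    ]:
        minutes = max_downtime_minutes(sla)
        print(f"{label}: {sla}% -> about {minutes:.1f} minutes per year")
```

Under this assumption, the three levels correspond to roughly 8.8 hours, 4.4 hours and 53 minutes of permitted downtime per year, which can help when matching a deployment option to your own SLA.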
In the case of availability sets and availability zones, guest OS clustering is necessary for HA. We would like to use this opportunity to clarify the Linux Pacemaker fencing options on Azure that help avoid a split-brain of your SAP application. These are:
- Azure Fencing Agent
- Storage Based Death (SBD)
The Azure Fencing Agent is available on both Red Hat Enterprise Linux (RHEL) and SUSE Linux Enterprise Server (SLES). SBD is supported on SLES, but not on RHEL. For the shortest cluster failover times for SAP on Azure with Pacemaker, we recommend:
- Azure Fencing Agent for SAP clusters built on RHEL
- SBD for SAP clusters built on SLES
In the case of productive SAP applications, we strongly recommend availability sets or availability zones. Availability zones are an alternative to availability sets that provide HA with the addition of resiliency to datacenter failures within an Azure region. However, be mindful that there is no guarantee of a certain distance between the buildings hosting different availability zones, and the distance between the physical buildings can differ from one Azure region to another. Therefore, for deterministic application performance and the lowest network round-trip time (RTT), availability sets could be the better option.
Single-instance VMs can be a good fit for non-production SAP systems (project, sandbox and test) that don’t have availability SLAs on the same level as production. This option also helps to minimize run costs.
Designing for recoverability
Designing for recoverability means being able to recover from data loss (such as a logical error on the SAP database), from large-scale disasters, or from the loss of a complete Azure region. When designing for recoverability, it is necessary to understand the Recovery Point Objective (RPO) and Recovery Time Objective (RTO) of your SAP application. Azure regional pairs are recommended for disaster recovery; they offer isolation and availability to hedge against the risks of natural or human disasters impacting a single region.
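As a simple illustration of how RPO and RTO can be used as acceptance criteria, the sketch below compares the outcome of a recovery test against the agreed targets. All class names, the function name and the example values are hypothetical and chosen only for illustration; they do not come from any Azure or SAP tooling.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTargets:
    rpo: timedelta  # maximum tolerable data loss, expressed as a time window
    rto: timedelta  # maximum tolerable time to restore the service


@dataclass(frozen=True)
class DrillResult:
    data_loss_window: timedelta  # data lost in the test, expressed as a time window
    recovery_time: timedelta     # time taken to bring the system back into service


def meets_targets(targets: RecoveryTargets, result: DrillResult) -> bool:
    """Return True only if both the RPO and the RTO were met in the test."""
    return (
        result.data_loss_window <= targets.rpo
        and result.recovery_time <= targets.rto
    )


if __name__ == "__main__":
    # Hypothetical targets and test outcome, for illustration only.
    targets = RecoveryTargets(rpo=timedelta(minutes=15), rto=timedelta(hours=4))
    drill = DrillResult(
        data_loss_window=timedelta(minutes=5),
        recovery_time=timedelta(hours=3),
    )
    print("Targets met:", meets_targets(targets, drill))
```

Keeping the two objectives as separate fields makes it explicit that a recovery design has to satisfy both: a fast restore that loses too much data fails the RPO, and a lossless restore that takes too long fails the RTO.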
On the DBMS layer, asynchronous replication can be used to replicate your production data from your primary region to your disaster recovery (DR) region. On the SAP application layer, Azure-to-Azure Site Recovery can be used as part of an efficient, cost-conscious DR solution. You could also choose to architect a dual-purpose scenario on your DR side such as running a combined QA/DR system for a better return on your investments as shown below.
In addition to HA and DR provisions, an enterprise data protection solution for backup and recovery of your SAP data is essential.
Our first-party Azure Backup offering is certified for SAP HANA. The solution is currently in public preview (as of September 2019) and supports SAP HANA scale-up (data and log backup); further scenarios, such as data snapshot and SAP HANA scale-out, are to be supported in the future.
Additionally, the Azure platform supports a broad range of ISVs that offer enterprise data protection and management for your SAP applications. One such ISV is Commvault, with whom Microsoft has recently partnered to produce this whitepaper. A key advantage of Commvault is the IntelliSnap (data snapshot) capability, which offers instantaneous, application-consistent data snapshots of your SAP database. This is hugely beneficial for large databases that have low RTO requirements. Commvault facilitates highly performant multi-streaming (backint) data backup directly to Azure Blob storage for SAP HANA scale-up, SAP HANA scale-out and anyDB workloads. Your enterprise data protection strategy can include a combination of data snapshots and data backups, for example running daily snapshots and a data backup (backint) on the weekend. Below is a data snapshot executed via IntelliSnap against an SAP HANA database on an M128s (2 TB) VM; the snapshot duration is 20 seconds.
Within this blog we have summarized the options for designing SAP on Azure for availability and recoverability. When architecting and deploying your production SAP applications on Azure, it is essential to include availability sets or availability zones to support your mission-critical SAP SLAs. Furthermore, you should apply DR provisions and enterprise data protection to secure your SAP application against the loss of a complete Azure region or data corruption.
Be sure to execute HA and DR testing throughout the lifecycle of your SAP to Azure project, and re-test these capabilities during maintenance windows once your SAP applications are in productive operation, for example with annual DR drill tests.
Availability and Recoverability should be reviewed on an ongoing basis to incorporate the latest technologies and guidance on best practices from Microsoft.
In blog #4 in our series we will cover designing for efficiency and operations.