Is your enterprise disaster ready? What does it take to ensure business continuity in the event disaster were to strike your organization? We recently had the opportunity to find out.
In this series, we’ll explore some of the many nuances that together add up to true preparedness for a major disruption, so you will be able to reply, “Yes, we are absolutely ready to deal with a catastrophe at any time.”
If you’ve run mission-critical applications either on-premises or in the cloud, you are intimately familiar with disaster-recovery scenarios. If you’ve ever had to architect a system capable of rapidly failing over to deliver services from a different location when disaster strikes, you understand the challenge of developing and delivering a complete solution. Moreover, if you’ve ever conducted a disaster recovery drill for your own company or a client, you realize that success lies in the details.
To simplify the problem of delivering a complete business continuity solution, Microsoft Azure offers Site Recovery. Azure Site Recovery (ASR) automates the replication of systems in near real-time to manage disaster recovery for Azure virtual machines as well as on-premises VMs and physical servers. It helps to orchestrate failover and recovery processes to keep vital applications running despite outages.
During recent disaster recovery drills conducted for a FedRAMP application running in the Azure Government sovereign cloud, I experienced firsthand the challenges of delivering a successful disaster recovery solution. Despite careful planning and detailed peer reviews, we discovered many issues during the testing phase that could have prevented a successful failover. Here is one of our key learnings from the process:
- Replication Health – Ensure all systems are fully protected by the latest ASR agent and the Replication Health is green.
It might seem obvious, but don’t assume that Replication Health remains “Healthy” and systems are still fully Protected just because Site Recovery protection was configured successfully at some point. To ensure Recovery Point Objectives (RPO) are met, it’s essential to regularly check the health of the Replicated items (the VMs being protected) in the Recovery Services vault. To check replication status, in the Azure portal navigate to Recovery Services vaults > VaultName – Replicated items and look for any items marked Critical in red, as seen below:
Make this a regular operational checkpoint. Check at least weekly, if not more frequently, and remediate any replication issues you discover. The job you save might be your own! One additional point to note: a system with Replication Health in the Critical state is still protected. However, your RPO will keep slipping for as long as ASR fails to replicate changes to the cache storage account.
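If you would rather not eyeball the list every week, the same check can be scripted once the replicated items have been exported from the vault. What follows is only a minimal Python sketch, not an ASR API call: the `name` and `replication_health` fields and the sample machine names are assumptions made for illustration, and how you obtain the list (portal export, CLI, or SDK) is left open.

```python
from typing import Iterable


def unhealthy_items(items: Iterable[dict]) -> list[str]:
    """Return the names of replicated items whose health is not 'Healthy'.

    Each item is a dict with hypothetical keys 'name' and
    'replication_health'; a missing health value is treated as unhealthy
    so that it gets a human look.
    """
    flagged = []
    for item in items:
        health = str(item.get("replication_health", "")).strip().lower()
        if health != "healthy":
            flagged.append(item["name"])
    return flagged


# Hypothetical sample data standing in for an export of Replicated items.
sample = [
    {"name": "web-01", "replication_health": "Healthy"},
    {"name": "sql-01", "replication_health": "Critical"},
    {"name": "app-02", "replication_health": "Warning"},
]

for name in unhealthy_items(sample):
    print(f"Needs attention: {name}")
```

Anything other than Healthy is flagged on purpose, so a Warning gets attention before it turns Critical.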
While you’re at it, expect to periodically encounter this alert:
ASR uses a replication agent, and Microsoft releases regular updates to support newer operating system versions and to deliver bug fixes. Microsoft recommends keeping the agent no more than four versions behind the current release, because the upgrade process becomes more involved and disruptive once an agent falls further behind.
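To make the "four versions behind" guideline concrete, here is a small Python sketch that counts how far an installed agent lags the newest release. The release labels (`v1` through `v6`) are placeholders invented for this example and are not real ASR agent versions; the ordered release list is assumed to be something you maintain yourself.

```python
MAX_VERSIONS_BEHIND = 4  # threshold taken from the recommendation above


def versions_behind(installed: str, releases: list[str]) -> int:
    """Count how many releases are newer than the installed one.

    `releases` must be ordered oldest to newest. Raises ValueError if the
    installed version is not in the list.
    """
    return len(releases) - 1 - releases.index(installed)


def exceeds_recommended_lag(installed: str, releases: list[str]) -> bool:
    """True when the agent is more than MAX_VERSIONS_BEHIND releases behind."""
    return versions_behind(installed, releases) > MAX_VERSIONS_BEHIND


# Placeholder release history, oldest first (hypothetical labels).
releases = ["v1", "v2", "v3", "v4", "v5", "v6"]

for agent in ("v1", "v2", "v6"):
    lag = versions_behind(agent, releases)
    print(agent, lag, exceeds_recommended_lag(agent, releases))
```

With this placeholder history, an agent on `v2` is exactly four releases behind and still within the guideline, while `v1` is five behind and over it.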
To trigger updates, click the message to open up the Agent update blade and select all the systems to be updated. As you can see here, you’ll need to click on each one separately as there is no select all checkbox on this blade. Another quirk in the Agent update UI is you can only select a maximum of 10 systems to be updated at a time. You’ll need to return to the Replicated items screen as many times as necessary if you’ve got more than 10 agents in the Recovery Services vault to update. Be sure to update them all.
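Because of the 10-system limit, it helps to plan the update passes before you start clicking. The Python sketch below simply splits a list of machine names into groups of at most 10; the machine names are hypothetical and the code does not call ASR, it only helps you keep track of which systems belong to which pass.

```python
def update_batches(names: list[str], size: int = 10) -> list[list[str]]:
    """Split machine names into consecutive groups of at most `size`."""
    if size < 1:
        raise ValueError("size must be at least 1")
    return [names[i:i + size] for i in range(0, len(names), size)]


# Hypothetical inventory of 23 protected machines.
machines = [f"vm-{n:02d}" for n in range(1, 24)]

for number, batch in enumerate(update_batches(machines), start=1):
    print(f"Pass {number}: {len(batch)} systems, {batch[0]} to {batch[-1]}")
```

For 23 machines this yields three passes of 10, 10, and 3 systems, which is the number of times you would return to the Replicated items screen.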
My friends at Microsoft support tell me that a lot of enhancements are on the roadmap that will address some of these challenges, like the UI quirks above. In fact, ASR as a whole will receive a host of updates and improvements in the coming months, so keep an eye out for them.
In the next installment, we’ll explore networking considerations that are essential to the success of failover operations.