Microsoft 365 disaster recovery planning
What disaster recovery means in a Microsoft 365 SaaS context — Microsoft's responsibilities, yours, and what to plan for.
Disaster recovery (DR) in a Microsoft 365 SaaS context is different from traditional DR for on-premises systems. Microsoft handles infrastructure resilience; you're responsible for everything else — identity, data protection, configuration, and the operational ability to respond to disasters within your scope of control. Getting clear on the boundaries is essential.
What Microsoft handles
Microsoft's commitments under the shared responsibility model cover:
- Datacentre resilience — multiple datacentres per region with automatic failover.
- Service availability — financially-backed SLAs (typically 99.9% per service).
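- Data durability — multiple replicas of customer data held in geographically separate datacentres. Replication protects against hardware loss, not against deletion or corruption, which is replicated too.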
- Infrastructure security — physical security, network security at the platform level.
- Service health monitoring and incident response for platform-level issues.
- Software updates and patching for the Microsoft 365 service itself.
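Customers don't need to plan for recovering from a Microsoft datacentre failure; that is Microsoft's responsibility. What remains yours is how your organisation keeps working while a service is degraded, plus everything in the next section.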
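To make the SLA figure above concrete, here is a minimal sketch that converts an availability percentage into permitted downtime. The 30-day month is an assumption for illustration; the actual SLA defines its own measurement period and what counts as downtime, and its remedy is a service credit rather than compensation for business impact.

```python
def allowed_downtime_minutes(sla_percent: float, days: int = 30) -> float:
    """Minutes of downtime that an availability percentage permits over `days` days."""
    total_minutes = days * 24 * 60
    return total_minutes * (1 - sla_percent / 100)


# Compare the 99.9% figure with a stricter, purely illustrative 99.99% target.
for sla in (99.9, 99.99):
    minutes = allowed_downtime_minutes(sla)
    print(f"{sla}% over 30 days allows about {minutes:.1f} minutes of downtime")
```

At 99.9%, that is roughly 43 minutes per 30-day month within the SLA, which is a useful number to set against your own tolerance for an outage.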
What you handle
The customer side of disaster recovery in Microsoft 365 covers:
Identity recovery
- Break-glass admin accounts — emergency access if all other admins are locked out.
- MFA recovery — if your phone is destroyed, can you still get into your account?
- Domain takeover risk — protect your domain registrar accounts.
- Hybrid identity — what happens if Entra Connect fails? Your on-prem AD becomes a single point of failure.
Data recovery
- User-initiated data loss — accidental deletion of sites, mailboxes, files.
- Account compromise — attacker deletes or encrypts content.
- Bad configuration change — a deployment that destroys content.
- Ransomware — encryption of cloud content (yes, it happens).
For these, you need:
- Microsoft 365 Backup for fast in-platform recovery.
- Purview retention policies for compliance preservation.
- Third-party backup for air-gapped insurance.
- Documented restore procedures with practiced exercises.
Configuration recovery
- Documented configuration state — what the tenant looked like before disaster.
- Configuration backup — export Conditional Access policies, transport rules, sensitivity labels, retention policies.
- Infrastructure-as-code patterns for repeatable rebuilds — Terraform, Bicep, custom scripts for Microsoft 365 settings.
If you lost the tenant tomorrow, could you rebuild the configuration?
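One way to keep a documented configuration state useful is to compare a saved baseline against the current state and report drift. The sketch below assumes each snapshot has already been exported to a simple mapping of policy name to settings; the policy names, the settings shape, and the `diff_snapshots` helper are hypothetical and do not reflect any Microsoft export format.

```python
import json
from typing import Any


def diff_snapshots(
    baseline: dict[str, Any], current: dict[str, Any]
) -> dict[str, list[str]]:
    """Compare two {policy_name: settings} snapshots and report drift."""
    base_keys, cur_keys = set(baseline), set(current)
    return {
        "removed": sorted(base_keys - cur_keys),
        "added": sorted(cur_keys - base_keys),
        "changed": sorted(
            name for name in base_keys & cur_keys if baseline[name] != current[name]
        ),
    }


# Hypothetical, simplified snapshots of Conditional Access policies.
baseline = {
    "Require MFA for admins": {"state": "enabled", "grant": ["mfa"]},
    "Block legacy auth": {"state": "enabled", "grant": ["block"]},
}
current = {
    "Require MFA for admins": {"state": "reportOnly", "grant": ["mfa"]},
    "Allow contractors": {"state": "enabled", "grant": ["compliantDevice"]},
}

drift = diff_snapshots(baseline, current)
print(json.dumps(drift, indent=2))
```

Run on a schedule against fresh exports, a report like this shows what has changed since the last known-good state, and the baseline file doubles as the rebuild reference if it is stored outside the tenant.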
Tenant recovery
The hardest case: what if the tenant itself is compromised or destroyed?
- Microsoft 365 doesn't support easy tenant-to-tenant restore.
- Third-party cross-tenant backup is the realistic insurance.
- Tenant rebuild from scratch is a multi-week project at best.
For most organisations, tenant-level disaster is low-probability but high-impact. Discuss with leadership; document the planned response even if it's "we accept the residual risk."
Service incident response
When Microsoft has a service incident affecting your tenant:
- Service Health dashboard is the authoritative source.
- Status communicated via Service Health, X (formerly Twitter, @MSFT365Status), and email.
- Workarounds are sometimes available — use the mobile app if the web client is down, or switch to phone if Teams is having issues.
- Internal communications to your users — let them know it's Microsoft-side, not their problem.
For organisations where Microsoft 365 downtime has serious cost:
- Documented internal status communications — pre-templated for "Microsoft is having an issue" scenarios.
- Alternate communication channels — what if email is down?
- Critical-path planning — what's the most critical Microsoft 365 service for your business? Do you have a backup workflow?
Business continuity vs disaster recovery
DR is one part of broader business continuity planning:
- DR — recovery from technical disasters (data loss, service outage).
- BC — continuity of business operations during any disruption.
For Microsoft 365 customers, the BC layer might include:
- Alternate work patterns if Teams is down — phone fallback, in-person meetings.
- Alternate file storage for critical operations if SharePoint is down (rare but possible during long incidents).
- Manual processes as fallback for digital workflows.
Testing
A DR plan that has never been tested should be assumed not to work:
- Quarterly restore tests — actually restore a site, a mailbox, and a OneDrive to validate procedures.
- Annual tabletop — walk through scenarios with the team.
- Major-incident retrospectives — what worked, what didn't, what to change.
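A testing cadence is easier to keep when something flags overdue exercises. This is a minimal sketch, assuming you keep a simple log of when each exercise was last run; the exercise names, the dates, and the 92-day reading of "quarterly" are illustrative assumptions.

```python
from datetime import date, timedelta

# Assumption: "quarterly" is treated as at most 92 days between tests.
QUARTER = timedelta(days=92)


def overdue_tests(
    last_tested: dict[str, date], today: date, max_age: timedelta = QUARTER
) -> list[str]:
    """Return the names of exercises whose last run is older than `max_age`."""
    return sorted(
        name for name, when in last_tested.items() if today - when > max_age
    )


# Hypothetical test log; in practice this might live in a tracked file.
log = {
    "SharePoint site restore": date(2024, 1, 15),
    "Mailbox restore": date(2024, 5, 2),
    "OneDrive restore": date(2023, 11, 30),
}

stale = overdue_tests(log, today=date(2024, 6, 1))
print(stale)
```

The same check works for break-glass account sign-in tests or the annual tabletop by passing a different `max_age`.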
Common gaps
- Break-glass accounts that haven't been tested in years.
- No documented configuration — when something breaks, no one remembers what the right state was.
- No backup of admin scripts — the PowerShell that drives operations lives in one person's OneDrive, which may itself be inaccessible during an incident.
- No off-platform copy of critical data.
- Single-admin dependency — if one person leaves or is unavailable, no one knows the tenant.
For organisations serious about Microsoft 365 reliability, DR planning is one of those investments that's invisible when working and devastating when missing. Plan; document; test; iterate.