Principal Software Engineer Microsoft
Principal Software Engineer Azure EngOps AzRel Microsoft
BlueSky: @chris-ayers.com | LinkedIn: chris-l-ayers | Blog: https://chris-ayers.com/ | GitHub: Codebytes | Mastodon: @Chrisayers@hachyderm.io | Twitter: @Chris_L_Ayers
A system is considered "reliable" if it can consistently serve users under normal or abnormal conditions.
https://uptime.is/five-nines
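To make an availability target concrete, it helps to convert it into a downtime budget. The sketch below is illustrative only: the function name and the 365-day year are assumptions, and sites such as the one linked above may use slightly different period lengths.

```python
# Assumption: a 365-day year; other calculators may use 365.25 days.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60


def allowed_downtime_seconds(availability_percent, period_seconds=SECONDS_PER_YEAR):
    """Return the downtime budget, in seconds, for an availability target."""
    return period_seconds * (1 - availability_percent / 100)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_seconds(target) / 60
        print(f"{target}% -> {minutes:.2f} minutes of downtime per year")
```

Under these assumptions, five nines leaves a little over five minutes of downtime per year, which is why the target has to be chosen deliberately rather than by default.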
Establish resilience expectations before selecting technology:
Design for how you respond to failure, not just for prevention
Set SLOs and RTO/RPO per flow, not per system
Focus on failure domains, redundancy strategy, and dependency design before implementation.
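One way to reason about dependency design and redundancy is to estimate composite availability. The following is a rough sketch that assumes components fail independently, which real systems often violate; the function names and sample figures are hypothetical.

```python
from math import prod


def serial_availability(availabilities):
    """Every dependency in the chain must be up, so availabilities multiply."""
    return prod(availabilities)


def redundant_availability(availabilities):
    """At least one replica must be up (assumes independent failures)."""
    return 1 - prod(1 - a for a in availabilities)


if __name__ == "__main__":
    # Hypothetical flow: gateway -> app -> database, all required.
    single_region = serial_availability([0.9995, 0.999, 0.9999])
    # The same stack deployed to two regions, either of which can serve the flow.
    two_regions = redundant_availability([single_region, single_region])
    print(f"single region: {single_region:.5f}")
    print(f"two regions:   {two_regions:.7f}")
```

The serial result is always lower than the weakest dependency, which is one argument for setting targets per flow: each flow touches a different chain of dependencies.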
Failure Examples
Active-Active: Multiple instances process requests simultaneously.
Active-Passive: Primary instance processes traffic; secondary is on standby.
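The difference between the two models can be sketched as two small routers. This is a simplified illustration, not how any Azure service implements routing; the class names, the `healthy` flag, and the round-robin policy are all assumptions.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Instance:
    name: str
    healthy: bool = True


class ActivePassiveRouter:
    """Send all traffic to the primary; use the secondary only on failure."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def route(self):
        if self.primary.healthy:
            return self.primary
        if self.secondary.healthy:
            return self.secondary
        raise RuntimeError("no healthy instance")


class ActiveActiveRouter:
    """Spread traffic round-robin across every healthy instance."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._counter = count()

    def route(self):
        healthy = [i for i in self.instances if i.healthy]
        if not healthy:
            raise RuntimeError("no healthy instance")
        return healthy[next(self._counter) % len(healthy)]
```

In the active-passive case the standby receives no traffic until failover, so its recovery path is only proven if it is exercised regularly.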
Automated, policy-driven environments built on Azure Landing Zones, Azure Verified Modules (AVM), and the Azure Proactive Resiliency Library (APRL) reduce variance and misconfiguration risk.
Azure guarantees the platform SLA; you own the workload SLA
Region pairing ≠ automatic failover; you must design for it
Embed failure-aware logic: timeouts, retries, backoff, bulkheads, circuit breakers, idempotency, hedging.
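As one example of the patterns listed above, a circuit breaker can be sketched in a few lines. This is a minimal, single-threaded illustration under assumed names and thresholds; production code would normally use an established resilience library instead.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without trying it."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so the behavior can be tested
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; failing fast")
            # Timeout elapsed: half-open, so let one trial call through.
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # A failed trial call, or too many failures, (re)opens the circuit.
            if self._opened_at is not None or self._failures >= self.failure_threshold:
                self._opened_at = self._clock()
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

Failing fast while the circuit is open protects the struggling dependency from extra load and keeps callers from tying up threads on requests that are likely to time out.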
Honor the Retry-After header when a service returns one.
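A retry loop that combines exponential backoff with jitter and a server-supplied Retry-After hint might look like the sketch below. `TransientError`, `backoff_delay`, and `call_with_retries` are hypothetical names, and the sketch assumes Retry-After has already been parsed into seconds (the header may also carry an HTTP date, which is not handled here). It should only wrap idempotent operations.

```python
import random
import time


class TransientError(Exception):
    """A retryable failure, optionally carrying a Retry-After hint in seconds."""

    def __init__(self, message, retry_after=None):
        super().__init__(message)
        self.retry_after = retry_after


def backoff_delay(attempt, base=0.5, cap=30.0, retry_after=None, rng=random.random):
    """Prefer the server's hint; otherwise use capped exponential backoff with full jitter."""
    if retry_after is not None:
        return float(retry_after)
    return rng() * min(cap, base * 2 ** attempt)


def call_with_retries(operation, max_attempts=4, sleep=time.sleep):
    """Run operation, retrying on TransientError up to max_attempts total tries."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError as err:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the failure to the caller
            sleep(backoff_delay(attempt, retry_after=err.retry_after))
```

Jitter spreads retries from many clients over time so they do not arrive in synchronized waves, and respecting the server's hint avoids retrying before the service says it is ready.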
Emphasize fast detection, validated recovery paths, and continuous improvement through data & drills.
Reduce misconfigurations with standardized, production-tested templates
Principal Software Engineer Azure CXP AzRel Microsoft