Senior Site Reliability Engineer, Azure CXP AzRel, Microsoft
BlueSky: @chris-ayers.com
LinkedIn: chris-l-ayers
Blog: https://chris-ayers.com/
GitHub: Codebytes
Mastodon: @Chrisayers@hachyderm.io
Twitter: @Chris_L_Ayers
A system is considered "reliable" if it can consistently serve users under normal or abnormal conditions.
https://uptime.is/five-nines
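To make an availability target such as "five nines" concrete, it helps to turn the percentage into a downtime budget. The sketch below is a minimal illustration; the function name `allowed_downtime_seconds` and the 365-day year are assumptions made for this example, so the result can differ by a fraction of a second from calculators that use a different year length.

```python
def allowed_downtime_seconds(availability_percent: float, period_days: float = 365.0) -> float:
    """Return the downtime budget, in seconds, for an availability target.

    Assumes a fixed-length period (365 days by default); the name and
    defaults are illustrative, not taken from any Azure tool.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    period_seconds = period_days * 24 * 60 * 60
    # The budget is the fraction of the period the system may be unavailable.
    return period_seconds * (1.0 - availability_percent / 100.0)


if __name__ == "__main__":
    for target in (99.0, 99.9, 99.99, 99.999):
        minutes = allowed_downtime_seconds(target) / 60
        print(f"{target}% -> {minutes:.2f} minutes per year")
```

Under these assumptions, 99.999% leaves roughly 315 seconds (a little over five minutes) of downtime per year, while 99.9% leaves about 8.76 hours.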
We've explored design principles and tradeoffs. Next, let's discuss a crucial activity: proactively identifying and mitigating possible failure modes.
Failure Examples
Active-Active: Multiple instances process requests simultaneously.
Active-Passive: Primary instance processes traffic; secondary is on standby.
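The difference between the two patterns can be sketched as a routing decision. This is a simplified illustration, not how any Azure load balancer is implemented; `Instance`, `active_passive_target`, and `active_active_targets` are hypothetical names, and health is modeled as a plain boolean rather than a real health probe.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Iterator, List


@dataclass
class Instance:
    """Hypothetical stand-in for a deployed instance or region."""
    name: str
    healthy: bool = True


def active_passive_target(primary: Instance, secondary: Instance) -> Instance:
    """Send traffic to the primary; use the standby only if the primary is unhealthy."""
    if primary.healthy:
        return primary
    if secondary.healthy:
        return secondary
    raise RuntimeError("no healthy instance available")


def active_active_targets(instances: List[Instance]) -> Iterator[Instance]:
    """Rotate requests across every instance that is healthy at call time."""
    healthy = [instance for instance in instances if instance.healthy]
    if not healthy:
        raise RuntimeError("no healthy instance available")
    return cycle(healthy)


if __name__ == "__main__":
    east, west = Instance("east"), Instance("west")
    print(active_passive_target(east, west).name)  # primary serves traffic
    east.healthy = False
    print(active_passive_target(east, west).name)  # standby takes over
```

In the active-passive case the standby sits idle until failover, which is simpler to reason about but leaves capacity unused. The active-active case spreads load across all healthy instances, at the cost of needing every instance to handle live traffic consistently.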
Customer Responsibilities:
After validating performance, we can further strengthen resilience by deliberately introducing controlled failures.
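As a rough illustration of a controlled failure, the sketch below wraps a dependency call so that it fails at a configurable rate, then checks that a simple retry policy copes. All names here (`InjectedFault`, `with_fault_injection`, `call_with_retry`) are hypothetical and exist only for this example; a real experiment would typically use a dedicated fault-injection service rather than in-process code.

```python
import random
from typing import Callable, Optional, TypeVar

T = TypeVar("T")


class InjectedFault(Exception):
    """Raised in place of a real dependency failure during an experiment."""


def with_fault_injection(
    func: Callable[..., T],
    failure_rate: float,
    rng: Optional[random.Random] = None,
) -> Callable[..., T]:
    """Wrap func so that a fraction of calls (failure_rate) raise InjectedFault."""
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if rng is None:
        rng = random.Random()

    def wrapper(*args, **kwargs):
        # random() returns a value in [0.0, 1.0), so 0.0 never fails and 1.0 always fails.
        if rng.random() < failure_rate:
            raise InjectedFault("simulated dependency failure")
        return func(*args, **kwargs)

    return wrapper


def call_with_retry(func: Callable[[], T], attempts: int = 3) -> T:
    """Call func up to `attempts` times, retrying only on injected faults."""
    last_error: Optional[Exception] = None
    for _ in range(attempts):
        try:
            return func()
        except InjectedFault as exc:
            last_error = exc
    raise RuntimeError("all attempts failed") from last_error


if __name__ == "__main__":
    flaky = with_fault_injection(lambda: "ok", failure_rate=0.3, rng=random.Random(42))
    print(call_with_retry(flaky, attempts=5))
```

Passing a seeded `random.Random` keeps an experiment repeatable, and keeping the failure rate as an explicit parameter makes it easy to start small and limit the blast radius.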
Combine load testing and chaos engineering into continuous practices.
Validate system resilience comprehensively and frequently.
Utilize established Azure reference architectures for reliability.
Leverage available documentation and best practices for consistency and effectiveness.
Explore Azure Architecture Center
Learn about designing mission-critical applications on Azure for high availability, reliability, and performance:
Mission-Critical Workloads
Guidance for web apps on Azure, offering prescriptive architecture, code, and configuration aligned with the Well-Architected Framework:
Enterprise Web App Patterns
Azure Verified Modules
APRL: Azure Platform Resiliency Library
Azure Review Checklists