Ability of a system to recover from [[infrastructure]] or Service [[disruption]]s, dynamically acquire computing resources to meet demand, and mitigate disruptions such as misconfigurations or transient network issues
Design principles
Test [[recovery procedure]]s - Use automation to simulate different failures or to recreate scenarios that led to failures before
Automatically recover from failure - Anticipate and remediate failures before they occur
Scale horizontally to increase aggregate system availability - Distribute requests across multiple, smaller resources to ensure that they don't share a common point of failure
Stop guessing capacity - Maintain the optimal level to satisfy demand without over under provisioning - Use [[Auto Scaling]]
Manage change in automation - Use automation to make changes to infrastructure