People are moving toward the cloud because it promises a more reliable system compared to an in-house setup, and cost is one of the major factors in achieving the right amount of reliability. The following principles help build reliable systems:
- Test recovery procedures
- Automatically recover from failure
- Scale horizontally to increase aggregate system availability
- Stop guessing capacity
- Manage change using automation
Each of these points is described in more detail below:
- Test recovery procedures: Practice recovery procedures for your services and data so that you can handle any incident and restore your system in a short period of time, making it more reliable.
- Automatically recover from failure: Monitor your key performance indicators (KPIs) so that the system can automatically take action during a failure and recover quickly.
- Scale horizontally to increase aggregate system availability: Horizontal scaling is faster than vertical scaling, and its cost grows linearly, whereas the cost of vertical scaling grows exponentially.
- Stop guessing capacity: Base scaling decisions on data and facts about your system rather than guessing your customer traffic. Guessing is a short-term fix, and it usually happens when teams don't collect metrics about their services. In the absence of data, guessed load estimates lead to failures and reduced reliability; for example, a system sized by guesswork fares far worse under a DDoS attack than one scaled from real metrics.
- Manage change using automation: Avoid human intervention when implementing changes, and automate the process as much as possible.
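The "stop guessing capacity" principle can be made concrete with a small sketch: derive the number of instances from observed traffic metrics instead of guessing. All names, thresholds, and numbers here are hypothetical illustrations, not values from the original text.

```python
# Minimal sketch of metric-driven capacity planning (all names and
# thresholds are hypothetical, not from the original text).

def desired_instances(request_rates, capacity_per_instance=500, headroom=0.3):
    """Derive an instance count from observed request rates instead of guessing.

    request_rates: recent requests-per-second samples from monitoring.
    capacity_per_instance: measured throughput one instance can sustain.
    headroom: extra buffer for traffic spikes.
    """
    peak = max(request_rates)           # size for the observed peak, not a guess
    target = peak * (1 + headroom)      # add headroom for bursts
    count = -(-target // capacity_per_instance)  # ceiling division
    return max(1, int(count))

# Example: observed peaks around 1800 req/s with 30% headroom
print(desired_instances([1200, 1500, 1800]))  # -> 5
```

The same idea applies to any resource (CPU, memory, queue depth): measure a sustainable per-unit capacity, then let the metrics, not intuition, drive the scaling decision.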
Some time back, while working on reliability for my company, I developed a project reliability maturity KPI matrix. It was very useful for communicating the maturity of reliability implementation in a project using the various KPIs categorized under it.
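The matrix itself is not shown in the text; as a rough illustration, such a matrix could be modeled as weighted KPI categories rolled up into a single maturity score. The categories, weights, and scores below are entirely hypothetical.

```python
# Hypothetical sketch of a reliability maturity KPI matrix; the KPI
# categories, weights, and scores are illustrative, not from the
# original project.
MATURITY_MATRIX = {
    "monitoring":        {"weight": 0.25, "score": 3},  # KPI/alert coverage
    "auto_recovery":     {"weight": 0.25, "score": 2},  # self-healing in place
    "recovery_testing":  {"weight": 0.20, "score": 1},  # drills practiced
    "capacity_planning": {"weight": 0.15, "score": 2},  # data-driven scaling
    "change_automation": {"weight": 0.15, "score": 3},  # automated deploys
}

def maturity_score(matrix, max_score=5):
    """Weighted average of per-category scores, normalized to 0-100."""
    total = sum(c["weight"] * c["score"] for c in matrix.values())
    return round(total / max_score * 100, 1)

print(maturity_score(MATURITY_MATRIX))  # -> 44.0
```

A roll-up like this makes it easy to compare projects and to show which category (here, recovery testing) is dragging the overall reliability maturity down.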