The simple answer is everything, or as much as you can. You can never predict what scenario may be forced upon you and your cluster, and having the correct monitoring and alerting in place can mean the difference between handling a situation gracefully and suffering a full-scale outage. The things that should be monitored, in decreasing order of importance, are as follows.