The control tier of an OpenStack cloud has the most moving parts to monitor. A few services need at least a basic connection check; they include, but are not limited to, MySQL, RabbitMQ, and MongoDB. Monitoring can certainly be extended beyond simple connection checks to track connection counts, queue sizes, and other statistics of these services. For now, we'll just add a connection check to make sure that these services are running:
define service {
    check_command        check_mysql!nagios!nagios_password
    host_name            control
    service_description  MySQL Health check
    use                  generic-service
}

define service {
    check_command        check_nrpe!check_rabbitmq_aliveness
    host_name            control
    service_description  RabbitMQ service check
    use                  generic-service
}

define service {
    check_command        check_nrpe!check_mongod_connect
    host_name            control
    service_description  MongoDB service check
    use                  generic-service
}
You can get the check scripts for MongoDB and RabbitMQ from https://github.com/mzupan/nagios-plugin-mongodb and https://github.com/jamesc/nagios-plugins-rabbitmq, respectively.
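To make the idea of a basic connection check concrete, here is a minimal Python sketch of one. It is not one of the plugins referenced above; it only tests whether a TCP port accepts connections and maps the result to the standard Nagios plugin exit codes (0 for OK, 2 for CRITICAL). The function name `check_tcp` and the `DEFAULT_PORTS` table are illustrative, and the ports listed are the upstream defaults, which your deployment may have changed.

```python
import socket

# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2

# Upstream default ports (assumption: your deployment has not changed them).
DEFAULT_PORTS = {"mysql": 3306, "rabbitmq": 5672, "mongodb": 27017}


def check_tcp(host, port, timeout=5.0):
    """Try a plain TCP connection and return (exit_code, message)."""
    try:
        # create_connection raises OSError on refusal, timeout, or DNS failure.
        with socket.create_connection((host, port), timeout=timeout):
            return OK, f"OK - {host}:{port} accepts connections"
    except OSError as exc:
        return CRITICAL, f"CRITICAL - {host}:{port} unreachable ({exc})"
```

A wrapper script would print the message and exit with the returned code, for example after calling `check_tcp("control", DEFAULT_PORTS["mysql"])`. A check like this only proves that something is listening on the port; the dedicated plugins go further and actually talk to the service.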
Next, we get into checking the OpenStack services themselves. We are going to add API checks to make sure that each service is running and that it is not in an error state. Packstack includes scripts that cover most of the API services; a few of the checks below are additions beyond what Packstack provides. Let's add the Nagios service stanzas for the API calls:
define service {
    check_command         keystone-user-list
    host_name             control
    normal_check_interval 5
    service_description   number of keystone users
    use                   generic-service
}

define service {
    check_command         neutron-net-list
    host_name             network
    service_description   Neutron Server service check
    use                   generic-service
}

define service {
    check_command         nova-list
    host_name             control
    normal_check_interval 5
    service_description   number of nova instances
    use                   generic-service
}

define service {
    check_command         glance-index
    host_name             control
    normal_check_interval 5
    service_description   number of glance images
    use                   generic-service
}

define service {
    check_command         cinder-list
    host_name             control
    normal_check_interval 5
    service_description   number of cinder volumes
    use                   generic-service
}

define service {
    check_command         heat-stack-list
    host_name             control
    normal_check_interval 5
    service_description   number of heat stacks for admin
    use                   generic-service
}

define service {
    check_command         ceilometer-resource-list
    host_name             control
    normal_check_interval 5
    service_description   number of ceilometer resources
    use                   generic-service
}

define service {
    check_command         swift-list
    host_name             control
    normal_check_interval 5
    service_description   number of swift containers for admin
    use                   generic-service
}
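The service descriptions above ("number of keystone users" and so on) suggest that each check runs a list command and reports how many resources came back. The following Python sketch illustrates that idea only; it is not the script Packstack installs. It assumes the command-line client prints its usual ASCII table, where every row starts with `|` and the first such row is the column header. The names `count_table_rows` and `api_check` are hypothetical.

```python
# Standard Nagios plugin exit codes.
OK, CRITICAL = 0, 2


def count_table_rows(cli_output):
    """Count data rows in an ASCII table such as the one a list command prints."""
    rows = [line for line in cli_output.splitlines()
            if line.lstrip().startswith("|")]
    # The first "|" row is the column header, not a resource.
    return max(len(rows) - 1, 0)


def api_check(cli_output, exit_status, label):
    """Map a list command's result to a Nagios (exit_code, message) pair."""
    if exit_status != 0:
        # A failing client usually means the API is down or in an error state.
        return CRITICAL, f"CRITICAL - {label} list command failed"
    return OK, f"OK - {count_table_rows(cli_output)} {label}"
```

The design point is that a successful, parseable listing exercises authentication, the API endpoint, and the database behind it in one call, which is why a simple count makes a reasonable health signal.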
With these basic checks in place, a set of successful checks in Nagios will show that services are up and running and the API services are healthy enough to list the resources that are being managed. There is a collection of services on the control node that are not API services. It is usually enough to do a service status check on them to make sure they are running. Let's add a service status check for the rest of the services that are not API endpoint services. You will want to add configuration stanzas that look like this for each service:
define service {
    check_command        check_nrpe!check_service_name
    host_name            10.100.0.4
    service_description  Service Name service check
    use                  generic-service
}
Do that for each of the following services, replacing service_name and Service Name with the actual service names:
openstack-ceilometer-alarm-evaluator
openstack-ceilometer-alarm-notifier
openstack-ceilometer-central
openstack-ceilometer-collector
openstack-ceilometer-notification
openstack-cinder-backup
openstack-cinder-scheduler
openstack-cinder-volume
openstack-glance-registry
openstack-heat-api-cfn
openstack-heat-engine
openstack-nova-cert
openstack-nova-conductor
openstack-nova-consoleauth
openstack-nova-novncproxy
openstack-nova-scheduler
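Writing sixteen nearly identical stanzas by hand is tedious, so you may prefer to generate them. The Python sketch below renders one stanza per service from a template. It assumes a naming convention of `check_<service name>` for the NRPE command and reuses the service name in the description; both are illustrative choices, and `render_stanzas` is a hypothetical helper rather than part of Nagios.

```python
SERVICES = [
    "openstack-ceilometer-alarm-evaluator",
    "openstack-ceilometer-alarm-notifier",
    "openstack-ceilometer-central",
    "openstack-ceilometer-collector",
    "openstack-ceilometer-notification",
    "openstack-cinder-backup",
    "openstack-cinder-scheduler",
    "openstack-cinder-volume",
    "openstack-glance-registry",
    "openstack-heat-api-cfn",
    "openstack-heat-engine",
    "openstack-nova-cert",
    "openstack-nova-conductor",
    "openstack-nova-consoleauth",
    "openstack-nova-novncproxy",
    "openstack-nova-scheduler",
]

# Doubled braces are literal braces in str.format templates.
STANZA = """define service {{
    check_command        check_nrpe!check_{name}
    host_name            {host}
    service_description  {name} service check
    use                  generic-service
}}
"""


def render_stanzas(services, host):
    """Return Nagios service stanzas, one per service, separated by blank lines."""
    return "\n".join(STANZA.format(name=name, host=host) for name in services)
```

Calling `print(render_stanzas(SERVICES, "10.100.0.4"))` and redirecting the output to a file in your Nagios configuration directory gives you the full set of definitions in one step.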
Remember that each of these services points to a corresponding NRPE command, so the hosts that these services run on will have to have the corresponding NRPE command defined on them.
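On the monitored host, each NRPE command has to run something that reports whether the service is up. As a rough sketch of what such a check could look like, the Python below asks systemd for the unit state and maps it to the standard Nagios exit codes (0 OK, 2 CRITICAL, 3 UNKNOWN). It assumes the host uses systemd and that the unit names match the service names listed above; `check_service` and `systemctl_is_active` are hypothetical names, not an existing plugin.

```python
import subprocess

# Standard Nagios plugin exit codes.
OK, CRITICAL, UNKNOWN = 0, 2, 3


def systemctl_is_active(unit):
    """Return the state string printed by `systemctl is-active UNIT`."""
    result = subprocess.run(
        ["systemctl", "is-active", unit],
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,
        text=True,
    )
    return result.stdout.strip()


def check_service(unit, get_state=systemctl_is_active):
    """Return (exit_code, message) describing whether a unit is active."""
    try:
        state = get_state(unit)
    except OSError as exc:
        # Raised, for example, when systemctl is not installed on the host.
        return UNKNOWN, f"UNKNOWN - could not query {unit} ({exc})"
    if state == "active":
        return OK, f"OK - {unit} is active"
    return CRITICAL, f"CRITICAL - {unit} is {state or 'not running'}"
```

A small wrapper that prints the message and exits with the returned code could then be registered in the NRPE configuration with a line of the form `command[check_openstack-nova-scheduler]=/path/to/check_service.py openstack-nova-scheduler`, where the script path is a placeholder for wherever you install it.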