Problems with starting/stopping RabbitMQ nodes

Consider that you have configured a running cluster with three nodes and one of them suddenly fails. When you try to bring that node back up using the following command:

rabbitmq-server.bat

You get the dreadful BOOT FAILED message, along with an error description of timeout_waiting_for_tables and an Erlang stack trace, as follows:

##########
              Starting broker...

BOOT FAILED
===========

Error description:
   {boot_step,database,
       {error,
           {timeout_waiting_for_tables,
               [rabbit_user,rabbit_user_permission,rabbit_vhost,
                rabbit_durable_route,rabbit_durable_exchange,
                rabbit_runtime_parameters,rabbit_durable_queue]}}}

Log files (may contain more information):
   D:/software/RabbitMQ/rabbitmq_server-3.4.4/log/[email protected]
   D:/software/RabbitMQ/rabbitmq_server-3.4.4/log/[email protected]

Stack trace:
   [{rabbit_table,wait,1,[]},
    {rabbit_table,check_schema_integrity,0,[]},
    {rabbit_mnesia,ensure_schema_integrity,0,[]},
    {rabbit_mnesia,init_db,3,[]},
    {rabbit_mnesia,init_db_and_upgrade,3,[]},
    {rabbit_mnesia,init,0,[]},
    {rabbit,'-run_step/3-lc$^1/1-1-',2,[]},
    {rabbit,run_step,3,[]}]

The error message tells you that something went wrong while loading data from the Mnesia database; however, it doesn't give you enough information about the exact cause of the problem. One thing you can do is simply remove the node's database files from the rabbit@DOMAIN-mnesia and rabbit@DOMAIN-plugins-expand folders, which store the Mnesia tables and the expanded plugins used by the RabbitMQ node (a sketch of this cleanup is shown after the following list). If you have a recent backup of your Mnesia database, you can try to use it to restore your data. If using a backup is not an option, you need to perform some more troubleshooting in order to find and fix the problem. The first obvious thing to do is to inspect the RabbitMQ logs, as suggested earlier; however, doing so may not always give you more information than the error displayed in the console. Moreover, there is a chance that your Mnesia database is not corrupt at all. You can try the following options:

  • If you are running a single (non-clustered) RabbitMQ node, you may try to specify the full RabbitMQ node name, including the hostname (if you have changed the hostname of the machine on which you start your nodes, you may get timeout_waiting_for_tables when Mnesia tries to start), as follows:
    set RABBITMQ_NODENAME=rabbit@<DOMAIN>
  • If you are running the node in a clustered environment and the other nodes have not started, the RabbitMQ node waits, by default, for up to 30 seconds for the other nodes to start before throwing a timeout_waiting_for_tables error message. In that case, you can try to start the other nodes in the cluster within 30 seconds of starting the current node and see whether this resolves the problem.
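
If you decide to clear the node's database files as mentioned before this list, the following is a minimal sketch of that cleanup on Windows. It assumes the default %APPDATA%\RabbitMQ\db base directory and the rabbit@DOMAIN node name from this example; adjust the paths to match your installation. Moving the folders aside, rather than deleting them, keeps a copy you can restore later:

rem Make sure the node is stopped (it may already be down after the boot failure)
rabbitmqctl.bat stop

rem Set the folders aside; they hold the Mnesia tables and the expanded plugins
move "%APPDATA%\RabbitMQ\db\rabbit@DOMAIN-mnesia" "%APPDATA%\RabbitMQ\db\rabbit@DOMAIN-mnesia.bak"
move "%APPDATA%\RabbitMQ\db\rabbit@DOMAIN-plugins-expand" "%APPDATA%\RabbitMQ\db\rabbit@DOMAIN-plugins-expand.bak"

rem Start the node again; it recreates a fresh, empty database
rabbitmq-server.bat

Keep in mind that the restarted node comes up with an empty database, so any exchanges, queues, and messages stored only on this node are lost unless you restore them from a backup.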

Another common issue that may prevent the startup of clustered nodes is network partitioning. Consider a two- or three-node cluster in which the communication links between the nodes fail. Each node becomes isolated from the others, assumes that the other nodes have failed, and hence becomes a master node. If you fix the communication links between the nodes and try to restart them, RabbitMQ will detect that there is more than one master node, and the startup of nodes may fail with an inconsistent_database, running_partitioned_network error message on the subsequent master nodes that try to start up and join the cluster. You can detect this condition by running the following command:

rabbitmqctl.bat cluster_status

If you see a non-empty list in the partitions attribute of the output, then a network partition was detected by RabbitMQ. In normal circumstances, this list is empty:

Cluster status of node rabbit@DOMAIN...
[{nodes,[{disc,[instance1@Domain,instance2@Domain,rabbit@DOMAIN]}]},
 {running_nodes,[instance2@Domain,instance1@Domain,rabbit@DOMAIN]},
 {cluster_name,<<"rabbit@Domain">>},
 {partitions,[]}]
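
For illustration only, a cluster that has detected a partition might report something like the following in the partitions attribute (the entries here are hypothetical, reusing the node names from this example; each entry lists a node together with the nodes it cannot reach):

 {partitions,[{rabbit@DOMAIN,[instance1@Domain]},
              {instance1@Domain,[rabbit@DOMAIN]}]}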

While each node can act as a standalone master, this means that it may define new exchanges, queues, and bindings without the knowledge of the other nodes. If you want to restore the cluster, you need to select one node as the master and rejoin the others to the cluster through that node. Before rejoining a node to the cluster, you may also want to reset its state. Assuming that rabbit@DOMAIN is your preferred master node, you can issue the following commands to rejoin the instance1 node to the cluster:

rabbitmqctl -n instance1 stop_app
rabbitmqctl -n instance1 reset
rabbitmqctl -n instance1 join_cluster rabbit@DOMAIN
rabbitmqctl -n instance1 start_app

For more information on network partitioning, you can refer to the Network Partitions entry in the RabbitMQ server documentation.
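
The documentation also describes the cluster_partition_handling setting, which tells RabbitMQ how to react when a partition is detected. The following is a minimal rabbitmq.config sketch; the strategy you pick (autoheal here) depends on your requirements, with ignore being the default and pause_minority the other automatic option:

%% rabbitmq.config excerpt: choose ignore (default), pause_minority, or autoheal
[
  {rabbit, [
    {cluster_partition_handling, autoheal}
  ]}
].

With autoheal, RabbitMQ picks a winning partition after connectivity is restored and automatically restarts the nodes in the losing partitions, so you may not need the manual rejoin procedure shown above.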

Another reason your node may fail to start is a resource that is already in use by another RabbitMQ instance running on the same machine. If this is a network port that is already taken by the first instance, then the second instance will fail to start. If the first instance is running, for example, the management plugin on its default port and you try to start the second instance with the management plugin enabled, you will get an error message similar to the following:

##########
              Starting broker...

BOOT FAILED
===========

Error description:
   {could_not_start,rabbitmq_management,
   {could_not_start_listener,[{port,15672}],eaddrinuse}}

Log files (may contain more information):
   D:/software/RabbitMQ/rabbitmq_server-3.4.4/log/instance1.log
   D:/software/RabbitMQ/rabbitmq_server-3.4.4/log/instance1-sasl.log

{"init terminating in do_boot",{rabbit,failure_during_boot,{could_not_start,rabb
itmq_management,{could_not_start_listener,[{port,15672}],eaddrinuse}}}}

Crash dump was written to: erl_crash.dump
init terminating in do_boot ()

This is easily solved by disabling the management plugin for that instance. Assuming that the second instance is instance1, you can execute the following before starting the node:

rabbitmq-plugins.bat -n instance1 disable rabbitmq_management
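
Alternatively, if you want the management plugin available on both instances, you can keep it enabled and move the second instance to different ports instead of disabling it. The following is a minimal sketch, assuming the second instance is instance1 and that it reads its own configuration file through the RABBITMQ_CONFIG_FILE variable (the port numbers and the config file path are only placeholders):

rem Give the second instance its own node name, AMQP port, and config file
rem (RabbitMQ appends the .config extension to RABBITMQ_CONFIG_FILE itself)
set RABBITMQ_NODENAME=instance1
set RABBITMQ_NODE_PORT=5673
set RABBITMQ_CONFIG_FILE=D:\software\RabbitMQ\instance1
rabbitmq-server.bat

The referenced instance1.config then moves the management listener off the default 15672 port:

%% instance1.config: run this instance's management plugin on port 15673
[
  {rabbitmq_management, [
    {listener, [{port, 15673}]}
  ]}
].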

As discussed in the earlier chapters, the management plugin is aware of clustering.
