Chapter 8. Troubleshooting

Running and maintaining a system successfully requires a good understanding of its components, along with the utilities that can be used to troubleshoot problems occurring in any of them. In this chapter, we will look at some techniques for troubleshooting problems with your RabbitMQ instances, along with several issues that commonly occur in practice.

The topics to be covered in the chapter are as follows:

  • General troubleshooting approach
  • Problems with starting/stopping the RabbitMQ nodes
  • Problems with message delivery

General troubleshooting approach

As RabbitMQ instances run on top of the Erlang virtual machine, we can leverage the troubleshooting utilities provided by Erlang to troubleshoot problems occurring in the message broker. The variety of errors occurring may range from problems relating to starting/stopping the broker instance to performance issues—we already covered performance tuning and monitoring in the previous chapter; therefore, you can already apply that knowledge to troubleshooting. We will use a top-down approach to troubleshoot issues, as follows:

  1. Check the status of a particular node.
  2. Inspect RabbitMQ logs.
  3. Check the RabbitMQ community mailing list or ask in the IRC chat.
  4. Use Erlang utilities to troubleshoot a particular node.

Checking the status of a particular node

You can check the status of a particular node using the rabbitmqctl utility as follows:

rabbitmqctl.bat -n instance1 status

In the preceding example, we are checking the status of the instance1 RabbitMQ node. You will observe an output of the status command similar to the following (we are omitting resource-related statistics, such as memory usage and number of processes, as we already covered them in the previous chapter):

[{pid,10312},
 {running_applications,
     [{rabbitmq_shovel,"Data Shovel for RabbitMQ","3.4.4"},
      {rabbitmq_management_agent,"RabbitMQ Management Agent","3.4.4"},
      {rabbit,"RabbitMQ","3.4.4"},
      {os_mon,"CPO  CXC 138 46","2.3"},
      {gen_smtp,"An erlang SMTP server/client framework",
          "0.9.0-rmq3.4.x-61e19ec5-gita62c02e"},
      {ssl,"Erlang/OTP SSL application","5.3.8"},
      {public_key,"Public key infrastructure","0.22.1"},
      {crypto,"CRYPTO","3.4.2"},
      {mnesia,"MNESIA  CXC 138 12","4.12.4"},
      {amqp_client,"RabbitMQ AMQP Client","3.4.4"},
      {xmerl,"XML parser","1.3.7"},
      {asn1,"The Erlang ASN1 compiler version 3.0.3","3.0.3"},
      {sasl,"SASL  CXC 138 11","2.4.1"},
      {stdlib,"ERTS  CXC 138 10","2.3"},
      {kernel,"ERTS  CXC 138 10","3.1"}]},
 {os,{win32,nt}},
 {erlang_version,
     "Erlang/OTP 17 [erts-6.3] [64-bit] [smp:8:8] [async-threads:30]\n"}

In the preceding piece of output, you can observe a lot of useful information, such as the following:

  • RabbitMQ message broker version
  • Erlang distribution
  • Operating system
  • RabbitMQ Erlang applications along with their versions

This is a good starting point to troubleshoot.

Inspecting the RabbitMQ logs

The RabbitMQ logs are located by default in the logs directory of the RabbitMQ installation directory on Windows, or in the /var/log/rabbitmq directory on Unix-like operating systems. This location can be changed by setting the RABBITMQ_LOG_BASE environment variable. You can inspect the logs for more detailed errors that relate either to the particular instance or to communication with other nodes in the cluster. The RabbitMQ logs can be rotated using the rabbitmqctl utility with the rotate_logs command. Along with the RabbitMQ log file for the node, there is an additional log file (with a -sasl suffix in its name), which is generated by the Erlang SASL (System Architecture Support Libraries) application; it provides different forms of logging reports, including crash reports.

The following message specifies that free disk monitoring (required for comparison against the free disk threshold, set by the disk_free_limit configuration parameter) is not supported on the platform that runs the RabbitMQ node:

=INFO REPORT==== 2-Sep-2015::20:41:47 ===
Disabling disk free space monitoring on unsupported platform:
{{'EXIT',{eacces,[{erlang,open_port,
                          [{spawn,"C:\Windows\system32\cmd.exe /c dir /-C /W "d:/software/RabbitMQ/rabbitmq_server-3.4.4/db/rabbit@DOMAIN-mnesia""},
                           [stream,in,eof,hide]],
                          []},
                  {os,cmd,1,[{file,"os.erl"},{line,204}]},
                  {rabbit_disk_monitor,get_disk_free,2,[]},
                  {rabbit_disk_monitor,init,1,[]},
                  {gen_server,init_it,6,[{file,"gen_server.erl"},{line,306}]},
                  {proc_lib,init_p_do_apply,3,
                            [{file,"proc_lib.erl"},{line,237}]}]}},

In this particular example, the message is descriptive enough and can save you the effort of looking further in the Erlang stack trace. In the SASL log file, the same error looks similar to the following:

=CRASH REPORT==== 2-Sep-2015::20:41:45 ===
  crasher:
    initial call: rabbit_disk_monitor:init/1
    pid: <0.28939.1>
    registered_name: []
    exception exit: unsupported_platform
      in function  gen_server:init_it/6 (gen_server.erl, line 322)
    ancestors: [rabbit_disk_monitor_sup,rabbit_sup,<0.143.0>]
    messages: []
    links: [<0.262.0>]
    dictionary: []
    trap_exit: false
    status: running
    heap_size: 1598
    stack_size: 27
    reductions: 646
  neighbours:

If you are trying to consume a message from a non-existent queue (for example, test-queue), you may see a message such as the following in the logs:

=ERROR REPORT==== 20-Jul-2015::12:31:20 ===
Channel error on connection <0.514.0> (127.0.0.1:63451 -> 127.0.0.1:5672, vhost: '/', user: 'guest'), channel 2:
{amqp_error,not_found,"no queue 'test-queue' in vhost '/'",'basic.consume'}

In case you lose a connection with a cluster node, you will get a message that can be easily interpreted, as follows:

=ERROR REPORT==== 2-Sep-2015::23:12:27 ===
** Node instance1@Domain not responding **
** Removing (timedout) connection **

In case you are running a RabbitMQ cluster and you already have the web management console started on the default port, you can hit the following problem (as displayed in the RabbitMQ log file):

=ERROR REPORT==== 20-Jul-2015::12:25:41 ===
** Generic server rabbit_web_dispatch_registry terminating 
** Last message in was {add,rabbit_mgmt,
                            [{port,15672}],
                            #Fun<rabbit_web_dispatch.1.31447083>,
                            #Fun<rabbit_mgmt_app.0.15521781>,
                            {[],"RabbitMQ Management"}}
** When Server state == undefined
** Reason for termination == 
** {{could_not_start_listener,[{port,15672}],eaddrinuse},
    [{rabbit_web_dispatch_sup,check_error,2,[]},
     {rabbit_web_dispatch_registry,handle_call,3,[]},
     {gen_server,try_handle_call,4,[{file,"gen_server.erl"},{line,607}]},
     {gen_server,handle_msg,5,[{file,"gen_server.erl"},{line,639}]},
     {proc_lib,init_p_do_apply,3,[{file,"proc_lib.erl"},{line,237}]}]}

This indicates that port 15672 could not be opened (if another cluster node is running the management console, you do not need to enable it for other cluster nodes anyway, unless you want to run the management plugin on a different port for the purpose of high availability). However, if port 15672 is not in use, this may indicate a mismatch between the Erlang distribution and the RabbitMQ server that prevents the management plugin from opening port 15672. This leads us to use alternative mechanisms for further troubleshooting of the problem.

The RabbitMQ mailing list and IRC channel

At this point, you may have already examined the output of the status command and inspected the logs; however, you might still have no idea what caused the error that we saw in the previous section:

** Generic server rabbit_web_dispatch_registry terminating 

Now, you may look for a similar issue on the rabbitmq-users or rabbitmq-discuss mailing lists. If you don't find a similar issue with a suitable solution, you can send a message to the mailing list describing your problem in detail and attaching the RabbitMQ logs, along with the Erlang crash dump. The Erlang crash dump file is generated when the Erlang VM terminates abnormally, and it is written to the directory from which your RabbitMQ server was started (for example, the sbin directory of the RabbitMQ installation on Windows).

Erlang troubleshooting

The erl_crash.dump file is created in the startup directory of the RabbitMQ server when something goes wrong with the message broker. It is not the only means by which you can troubleshoot the message broker using information provided by the Erlang runtime; you can also connect directly to the Erlang process of the RabbitMQ instance and query it for the purpose of troubleshooting.

An Erlang Primer

Digging into the root cause of a problem requires a good understanding of the Erlang programming language. In this section, we will cover the basics of Erlang and make use of this knowledge in the last chapter of the book, when we discuss how to create a plugin for RabbitMQ and how RabbitMQ is implemented.

To begin, you need to add the <erlang_home>\bin directory to your PATH and execute the following command from the command line:

erl

The command will fire up the Erlang REPL (Read-Eval-Print-Loop) shell, where you can type Erlang commands. To start the shell under the name of a particular node on the local workstation, you can provide the name of the instance with the -sname option (sname stands for 'short names' and it is the default instance-naming format that RabbitMQ uses), as shown in the following:

erl -sname rabbit@DOMAIN

In order to use the preceding command, you need to stop the rabbit@DOMAIN node first.

You can start by evaluating the following expression using the Erlang interpreter (don't forget the dot at the end of each expression):

(4 + 6) * 2.

The shell can evaluate more than arithmetic expressions on literals. Let's transform the preceding example using two variables, as follows:

X = 4.
Y = 6.
(X + Y) * 2.

If you try to rebind the X variable to 10, as follows:

X = 10.

You will get an error as shown in the following:

** exception error: no match of right hand side value 10

To reassign the variable, you need to first unbind it using the f() function:

f(X).

Note that you can unbind all variables by simply calling the following function:

f().

The preceding expression is not of much use; therefore, let's make a function out of it from the Erlang shell:

F = fun(X,Y) -> (X + Y) * 2 end.

The fun keyword can be used to define an anonymous function. In the previous case, this function is bound to the F variable. Now, you can evaluate the former expression using the following function:

F(4,6).

Functions in Erlang are typically defined in modules. A module in Erlang is defined as a file with an .erl extension, which is compiled to an Erlang object file with a .beam extension that contains the actual byte code executed by the Erlang virtual machine. You can define the preceding function in a module called sample, saved in a sample.erl file (note that the name of the file should match the module declaration):

-module(sample).
-export([double/2]).
double(X,Y) -> (X+Y) * 2.

The -module declaration specifies the name of the module. It is followed by one or more -export declarations that explicitly specify which functions of the module are exported and can be used by other modules. For each exported function, you specify its name along with its arity (the number of parameters that the function accepts). Functions with the same name but different numbers of parameters are treated as separate functions by Erlang. The module contains a double function; this form of declaration is valid only in a module and cannot be entered in the shell, where you should use the fun keyword instead, as we saw earlier.
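
To illustrate how arity distinguishes functions, here is a minimal sketch. The arity_demo module and its double/1 function are hypothetical additions for illustration only; double/2 mirrors the function from the sample module.

```erlang
-module(arity_demo).
%% Each exported function is listed as Name/Arity.
-export([double/1, double/2]).

%% double/1 (hypothetical): doubles a single number.
double(X) -> X * 2.

%% double/2: a separate function that doubles the sum of two numbers,
%% as in the sample module.
double(X, Y) -> (X + Y) * 2.
```

Calling arity_demo:double(5) and arity_demo:double(4, 6) invokes two different functions, even though they share a name.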

To compile the module, you must first navigate to the directory of your module using the cd() function and then call the c() function to compile the module to a beam file. Assuming the sample.erl file is created in the D:\sources directory, you can execute the following from the Erlang REPL in order to compile the module:

cd('D:/sources').
c(sample).

If compilation is successful, you will see a message as follows:

{ok,sample}

This is actually a tuple that is returned from the c() function, which indicates a successful status (ok) and the name of the compiled module. A tuple, in Erlang, is a container with a fixed number of elements that can be of different types. In order to invoke the double function from the sample module, you can write the following:

sample:double(6,4).

To check information about a module, such as its available functions, use the m() shell function or the module_info() function (which returns the result as a list) that is available in each Erlang module:

m(sample).
sample:module_info().

These can also be pretty useful utilities to inspect the existing modules in a system such as RabbitMQ.
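
As a small sketch of how module_info can be used programmatically, the following hypothetical inspect_demo module (not part of the chapter's sample code) lists the exports of a module and checks whether a given function is among them. It relies on the single-argument form module_info(exports), which returns a list of {Name, Arity} pairs.

```erlang
-module(inspect_demo).
-export([exported/1, has_function/3]).

%% Return the {Name, Arity} pairs exported by Module.
%% Calling Module:module_info/1 loads the module if it is not loaded yet.
exported(Module) ->
    Module:module_info(exports).

%% Check whether Module exports the function Name/Arity.
has_function(Module, Name, Arity) ->
    lists:member({Name, Arity}, exported(Module)).
```

For example, inspect_demo:has_function(sample, double, 2) should return true once the sample module has been compiled.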

Variable definitions do not specify the type of the variable; it is determined at runtime (as seen in the double function). We have the following types of data:

  • integers: There is no limit to the size of an integer in Erlang, for example, 257.
  • floats: For example, 45.6.
  • atoms: They are used to create constants; you can think of them as values of an enumeration. Atoms start with a lowercase letter, for example, ok, error.
  • booleans: true or false.
  • references: They are used to create unique identifiers for objects.
  • bit strings: They are used to represent sequences of bits as segments of a particular value that optionally have a length and a type, for example, <<0:1, 1:1, 0:1>>. In this particular example, the bit string represents the bit sequence "010". Bit strings are very useful to parse binary streams of data, for example, parsing a protocol message based on a protocol mask. As you can see, this mechanism can be directly used to parse an AMQP message.
  • binaries: They are simply bit strings whose length in bits is divisible by eight, for example, <<111, 172, 15>>.
  • pids: They are used to represent process identifiers.
  • ports: They are used to represent Erlang ports; essentially, a port connects an Erlang process to an external OS process and provides communication with the external world.
  • funs: They are used to create function objects (closures).
  • tuples: They are containers for a fixed number of items, possibly of different types.
  • lists: They are containers for a variable number of items, possibly of different types.
  • maps: They are containers for key-value pairs of items.
  • records: They are containers for a mixed type of data, similar to C structs and compiled to tuples.
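
To make the bit string discussion more concrete, here is a minimal sketch of parsing a binary frame with the bit syntax. The frame layout used here (a one-octet type, a two-octet channel, a four-octet payload size, the payload, and a closing octet 16#CE) is an assumption taken from the general frame format of the AMQP 0-9-1 specification, not from this chapter. The frame_demo module and parse_header function are hypothetical names, and this is not how RabbitMQ itself parses frames.

```erlang
-module(frame_demo).
-export([parse_header/1]).

%% Assumed layout: Type (8 bits), Channel (16 bits), Size (32 bits),
%% Payload (Size bytes), frame-end octet 16#CE, then any remaining bytes.
parse_header(<<Type:8, Channel:16, Size:32,
               Payload:Size/binary, 16#CE:8, Rest/binary>>) ->
    {ok, {Type, Channel, Payload}, Rest};
%% Anything else is either incomplete or not a frame in the assumed layout.
parse_header(_Other) ->
    {error, incomplete_or_malformed}.
```

Note how Size, bound earlier in the same pattern, is used as the length of the Payload segment; this is what makes the bit syntax convenient for length-prefixed protocols.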

Erlang uses the concept of pattern matching in order to bind one or more variables to particular values. It is used to bind variables (whose names start with an uppercase letter) using more complex expressions than direct assignment. Consider the following examples (call f() first if X or Y is still bound from the earlier examples):

{X,b} = {a,b}.
[10,[Y],15] = [10,[[1,2,3]],15].
{X,X} = {a,b}.
[A,2] = [10].

The first expression binds X to a, the second expression binds Y to the [1,2,3] list, and the third and fourth expressions result in exceptions as pattern matching fails in these cases. We will briefly cover error handling later in the chapter.

Another useful concept is list comprehensions, where you can iterate over a list and return a modified list using a filter function and a generator for the elements of the new list. Consider the following example:

[X+1 || X <- [4,5,6], X rem 2 == 0].

The result is the [5,7] list: all even elements are selected and incremented by one in the new list. We can rewrite the preceding example using a recursive function, as Erlang enforces the functional programming style along with idioms derived from languages such as Prolog; the language does not provide a looping construct. The filter_list_sample function implements the same behavior as the list comprehension using an if expression:

filter_list_sample(L) -> filter_list_sample_helper(L, []).
filter_list_sample_helper([], Res) -> Res;
filter_list_sample_helper([X|L], Res) -> 
    if 
        X rem 2 == 0 -> 
            filter_list_sample_helper(L, [X+1| Res]);
        true -> 
            filter_list_sample_helper(L, Res)
    end.

If you add this to the sample module that we created earlier, export the filter_list_sample function from the module, and recompile it, you can invoke the preceding function with the following:

sample:filter_list_sample([4,5,6]).

The result ([7,5]) is returned in reverse order because each matching element is prepended to the accumulated result during the recursion; implement a function that reverses the resulting list as an exercise. Note that if you have multiple clauses of the same function (in this case, filter_list_sample_helper), you should separate them with a semicolon and end only the last clause with a dot. Multiple expressions in the same function are separated by a comma. You can also use the case expression instead of the if expression in the preceding example, as shown in the following:

filter_list_sample_helper([X|L], Res) -> 
    case X rem 2 of
        0 -> filter_list_sample_helper(L, [X+1| Res]);
        _ -> filter_list_sample_helper(L, Res)
    end.

The underscore (_) matches any value (in this case, for non-negative integers, this could be only 1).
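
One possible way to preserve the original order, sketched here under the assumption that using the standard lists:reverse/1 function is acceptable, is to reverse the accumulator once the input list is exhausted. The filter_demo module and filter_list function are hypothetical names; guards on the function clauses replace the if and case expressions.

```erlang
-module(filter_demo).
-export([filter_list/1]).

%% Public entry point: start with an empty accumulator.
filter_list(L) -> filter_list(L, []).

%% Input exhausted: the accumulator is in reverse order, so reverse it once.
filter_list([], Acc) ->
    lists:reverse(Acc);
%% Even element: increment it and prepend it to the accumulator.
filter_list([X | Rest], Acc) when X rem 2 =:= 0 ->
    filter_list(Rest, [X + 1 | Acc]);
%% Odd element: skip it.
filter_list([_ | Rest], Acc) ->
    filter_list(Rest, Acc).
```

With this variant, filter_demo:filter_list([4,5,6]) returns [5,7], matching the result of the list comprehension.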

There are many scenarios where Erlang may throw an error, and we can differentiate between the three types of runtime errors, as follows:

  1. regular errors: Raised by an erlang:error() call. This is the equivalent of a throw statement in programming languages such as C++ or Java; a stack trace is included as part of the error.
  2. throw errors: Raised by the throw() function. This is typically used to exit from a deeply nested function call; it does not include a stack trace but rather carries a value that is handled by a caller higher up in the call stack.
  3. exit errors: Raised by an erlang:exit() call. This is used to signal that a process is exiting (a value of normal passed to the function indicates that the process exits normally, other exit reasons indicate an error).

All the types of errors can be caught using a try … catch block. The following example demonstrates the use of the different types of exceptions in Erlang:

exception_sample(Val) -> 
    case Val of
        1 -> throw("Invalid value: 1");
        2 -> error("Invalid value: 2");
        3 -> exit("Invalid value: 3");
        _ -> "Success"
    end.

exception_handler(Val) ->
    try 
        exception_sample(Val)
    catch
        error:Error -> {error, Error};
        throw:Error -> {throw, Error};
        exit:Error -> {exit, Error}
    end.

Export the exception_handler() function as part of the sample module and execute it with different arguments to see how it behaves:

sample:exception_handler(1).
sample:exception_handler(2).
sample:exception_handler(3).
sample:exception_handler(4).

You should receive the following output:

{throw,"Invalid value: 1"}
{error,"Invalid value: 2"}
{exit,"Invalid value: 3"}
"Success"

When an Erlang process exits as a result of an error that is not handled by the process, the resulting report has a format similar to the one you see when a RabbitMQ node crashes, as RabbitMQ nodes run as Erlang processes.
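
One way to observe the exit reason of a process from another process is to monitor it. The following is a minimal sketch using the built-in spawn_monitor function; the exit_demo module, the run function, and the one-second timeout are hypothetical choices for illustration.

```erlang
-module(exit_demo).
-export([run/1]).

%% Spawn a process that exits immediately with Reason and wait for the
%% 'DOWN' message that the monitor delivers to the calling process.
run(Reason) ->
    {Pid, Ref} = spawn_monitor(fun() -> exit(Reason) end),
    receive
        {'DOWN', Ref, process, Pid, ExitReason} ->
            {exited, ExitReason}
    after 1000 ->
        %% Assumed timeout so that the caller never blocks forever.
        timeout
    end.
```

Calling exit_demo:run(normal) reports a normal exit, while any other reason, such as exit_demo:run(boom), indicates an abnormal one.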

So far, we have discussed the basic constructs of the language. However, Erlang excels when it comes to distributed programming. Processes in Erlang are lightweight; they are created by the Erlang VM without interacting with the underlying operating system (and without creating any OS-level threads or processes). Communication between processes is possible via message passing. The Erlang VM takes responsibility for scheduling process execution on one or more CPUs of the system on which it runs. This reduces the cost of context switching, as you don't need to go to the kernel scheduler to switch between the currently executing processes. This, and the ability to dynamically allocate process stacks (thus avoiding the need to reserve a lot of RAM), makes it possible to create thousands of Erlang processes at once. If any two processes need to communicate on the same machine, they can do so directly using the ! operator and the receive expression in order to exchange messages, as demonstrated in the following example:

sample_sender(Pid, Message) -> 
    Pid ! Message.

sample_receiver() -> 
    receive
        Message -> io:format(Message, [])
    end.

start() -> 
    Preceiver = spawn(?MODULE, sample_receiver, []),
    spawn(?MODULE, sample_sender, [Preceiver, "Test message."]).

We create a sender and a receiver as separate processes in the start() function using the spawn function, which creates a process based on a module function along with the parameters passed to that function upon process creation. The ?MODULE macro refers to the current module; you can think of Erlang macros as C++ preprocessor directives. The sample_sender() function uses the ! operator to send a message to the process identified by a particular pid (process identifier). The sample_receiver() function uses the receive expression to wait for a message and is blocked until a message is received. The message is printed on the standard output using the built-in io:format Erlang function. You need to export all three functions from the sample module and run the demo using the following line of code from the Erlang REPL:

sample:start().

In this particular example, the processes run in the same Erlang VM. However, if the processes are started on a remote machine, then several concerns are further raised. The most important issues to solve are as follows:

  • How do we exchange the process identifiers among the processes? How are the processes aware of each other?
  • How can you prevent a third party from tampering with the communication among the processes?

The answer to the first question is the register() built-in function, which allows you to map a symbolic name to a process identifier. This mapping is stored in the process registry of the Erlang node, and when a process needs to communicate with another remote process, it must know the name of the node where the other process resides along with the symbolic name of the remote process. The rest is handled by Erlang behind the scenes.
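
The following minimal sketch shows registration in practice. The registry_demo module, the echo_server name, and the message format are hypothetical. A message is addressed to a registered process on a given node with the {Name, Node} ! Message form; passing node() addresses the local node, while a remote node name would require a distributed setup with matching cookies.

```erlang
-module(registry_demo).
-export([start/0, echo_loop/0, ping/1]).

%% Spawn the echo process and register it under a symbolic name.
start() ->
    Pid = spawn(?MODULE, echo_loop, []),
    true = register(echo_server, Pid),
    Pid.

%% Reply to every {From, Msg} request until a stop message arrives.
echo_loop() ->
    receive
        {From, Msg} ->
            From ! {echo, Msg},
            echo_loop();
        stop ->
            ok
    end.

%% Send a request to the process registered as echo_server on Node
%% and wait up to one second (an assumed timeout) for the reply.
ping(Node) ->
    {echo_server, Node} ! {self(), hello},
    receive
        {echo, Reply} -> Reply
    after 1000 ->
        timeout
    end.
```

After calling registry_demo:start(), the call registry_demo:ping(node()) reaches the process by name without the caller ever handling its pid.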

The answer to the second question is the Erlang cookies that we mentioned in the earlier chapters when we talked about RabbitMQ clustering. Erlang cookies are stored in an .erlang.cookie file and are used by the Erlang nodes as a shared secret. A node is not obliged to use the same cookie for all other remote nodes; a different cookie can be specified for communication with a particular remote node. This can be accomplished using the erlang:set_cookie() function, which takes the remote node name and the cookie as arguments. To retrieve the current cookie used by the node, you can use the erlang:get_cookie() function. In case no cookie is in use, the function will return nocookie.

Our brief primer of the Erlang language should be sufficient in order to make use of the utilities provided by the language for further troubleshooting of your RabbitMQ instances. You can retrieve the name of the current node with the following command:

node().

You can also retrieve the names and the ports of the nodes that are registered with the EPMD (Erlang Port Mapper Daemon) process running on the same host:

net_adm:names().

Assuming that we have started our three-node cluster on the same machine, we should observe the following output:

{ok,[{"rabbit",25672},
     {"instance1",25701},
     {"instance2",25702}]}

The ports that you see for each node are the ports assigned to the Erlang processes for each RabbitMQ instance (in the previous case, 20000 + the port number of the RabbitMQ instance).

We can also use the rpc:call() function to execute a function on a particular local or remote Erlang node (and this could be the node of a RabbitMQ instance), either to run commands on that node or to retrieve information about its processes.
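
As a minimal sketch of rpc:call/4, the following hypothetical rpc_demo module asks a node for the names of its running applications using the standard application:which_applications/0 function. Only the calling node needs this module; the function executed remotely is part of Erlang/OTP.

```erlang
-module(rpc_demo).
-export([apps_on/1]).

%% Return the names of the applications running on Node,
%% or the reason why the remote call failed.
apps_on(Node) ->
    case rpc:call(Node, application, which_applications, []) of
        {badrpc, Reason} ->
            {error, Reason};
        Apps ->
            {ok, [Name || {Name, _Description, _Vsn} <- Apps]}
    end.
```

To query a RabbitMQ node, you would call, for example, rpc_demo:apps_on('instance1@DOMAIN') (a placeholder node name) from a shell that was started with its own short name and uses the same Erlang cookie as the target node; otherwise, the call is expected to return an error.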

The Erlang crash dump

The Erlang crash dump file is created in the current working directory of a RabbitMQ instance when it crashes. The crash dump file contains useful statistics collected at the time of the crash, along with information about the processes affected by the crash. The reason for the node failure is indicated by the line starting with the word Slogan. For example, the following line indicates that there is a problem with the startup of a node (without providing more details as part of the reason):

Slogan: init terminating in do_boot ()

You can use the knowledge gained from the previous section to inspect the information collected in the crash dump or, better, use the Crashdump Viewer GUI utility to inspect it. To start the utility, invoke the following command from the Erlang REPL:

crashdump_viewer:start().

After the tool is started, you will be prompted to select the crash dump file. After the file is selected, the tool will divide the information from the file into proper sections and tables for easier inspection, as follows:

[Figure: The Erlang crash dump]

We will expand further on the concept of troubleshooting when we discuss the internal architecture of the message broker. If you get an error that contains: init terminating in do_boot(), then there are several things that might be the root cause of the problem (make sure that you analyze the crash dump for more information on the problem):

  • Insufficient permissions on some of the RabbitMQ folders and files.
  • Corrupt RabbitMQ database. In this case, delete the contents of the %APPDATA%\RabbitMQ folder (in Windows) and restore it using a recent backup, if this is at all possible.
  • Mismatch between your Erlang installation and your OS architecture (32/64-bit). Check the version of your Erlang installation and reinstall it if it does not match.