10

Troubleshooting Concurrency Issues

All non-trivial programs have bugs. When you realize that there are some anomalies in a program, starting a debugging session is usually not the first thing you should do. This chapter is about some of the techniques you can use for troubleshooting without using a debugger. You may find, especially when dealing with concurrent programs, that debuggers sometimes do not offer much help and the solution relies on careful code reading, log reading, and understanding stack traces.

In this chapter, we will explore the following:

  • How to read stack traces
  • How to detect failures using additional code to monitor program behavior and sometimes heal the program
  • Debugging anomalies using timeouts and stack traces

Technical requirements

The source code for this particular chapter is available on GitHub at https://github.com/PacktPublishing/Effective-Concurrency-in-Go/tree/main/chapter10.

Reading stack traces

If you are lucky, your program panics when something goes wrong, and prints out lots of diagnostic information. You would be lucky because if you have the output of a panicked program, you can usually figure out what went wrong by just looking at it together with the source code. So, let’s take a look at some stack traces. The first example is a deadlock-prone implementation of the dining philosophers problem, with just two philosophers:

func philosopher(firstFork, secondFork *sync.Mutex) {
    for {
        firstFork.Lock()
        secondFork.Lock() // line: 10
        secondFork.Unlock()
        firstFork.Unlock()
    }
}

func main() {
    forks := [2]sync.Mutex{}
    go philosopher(&forks[1], &forks[0]) // line: 18
    go philosopher(&forks[0], &forks[1]) // line: 19
    select {} // line: 20
}

This program eventually deadlocks because the two goroutines acquire the same two mutexes in opposite orders. When that happens, the runtime detects that every goroutine in the program is blocked and prints a stack trace. The stack trace starts with the root cause (in this case, deadlock):

fatal error: all goroutines are asleep - deadlock!

Then it lists all active goroutines, starting from the one that caused the panic. In the case of deadlock, this can be any one of the deadlocked goroutines. The following stack trace starts with the empty select statement in main (line 20). It shows that there is a goroutine waiting for that select statement:

goroutine 1 [select (no cases)]:
main.main()
/home/bserdar/github.com/Writing-Concurrent-Programs-in-Go/chapter10/stacktrace/deadlock/main.go:20 +0xa5

The second goroutine stack shows the path the goroutine followed:

goroutine 17 [semacquire]:
sync.runtime_SemacquireMutex(0x0?, 0x1?, 0x0?)
/usr/local/go/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc00010e000)
/usr/local/go/src/sync/mutex.go:171 +0x165
sync.(*Mutex).Lock(...)
/usr/local/go/src/sync/mutex.go:90
main.philosopher(0xc00010e008, 0xc00010e000)
/home/bserdar/github.com/Writing-Concurrent-Programs-in-Go/chapter10/stacktrace/deadlock/main.go:10 +0x66
created by main.main
/home/bserdar/github.com/Writing-Concurrent-Programs-in-Go/chapter10/stacktrace/deadlock/main.go:18 +0x65

You can see that the first entry is from the runtime package, the unexported runtime_SemacquireMutex function, called with three arguments. The arguments displayed with question marks are values that the runtime could not capture reliably because they were passed in registers instead of being pushed onto the stack. We can be quite sure that at least the first argument is not correct because, if you look at the source code pointed to by the stack trace (.../go/src/runtime/sema.go:77), it is the address of a uint32 value. (If you test this code yourself, the line numbers may not match what is displayed here. The important point is that you can still check both the Go standard library functions and your own functions by looking at the line numbers printed in your environment.) This function is called by Mutex.lockSlow. If you check the source code, you will see that Mutex.lockSlow does not take any arguments, but the stack trace shows one. That argument is the receiver of the lockSlow method, which is the address of the mutex it is called on. So, here we can see that the mutex that is the subject of this call is at address 0xc00010e000. Moving down to the next entry, we see that this method was called by Mutex.Lock. The next entry shows where this Mutex.Lock is called in our program: at line 10, which corresponds to the secondFork.Lock() line. The next entry in the stack trace also shows that this goroutine was created by main, at line 18.

Taking note of the arguments passed to the functions, we see that the main.philosopher function gets two arguments: the addresses of the two mutexes. Since the lockSlow method is passed the mutex at address 0xc00010e000, we can infer that it is &forks[0]. So, this goroutine is blocked while trying to lock &forks[0].

The third goroutine follows a similar path, but this time the philosopher goroutine is started at line 19, corresponding to the second philosopher. Following similar reasoning, you can see that this goroutine is trying to lock the mutex at 0xc00010e008, which is &forks[1]:

goroutine 18 [semacquire]:
sync.runtime_SemacquireMutex(0x0?, 0x0?, 0x0?)
/usr/local/go/src/runtime/sema.go:77 +0x25
sync.(*Mutex).lockSlow(0xc00010e008)
/usr/local/go/src/sync/mutex.go:171 +0x165
sync.(*Mutex).Lock(...)
/usr/local/go/src/sync/mutex.go:90
main.philosopher(0xc00010e000, 0xc00010e008)
/home/bserdar/github.com/Writing-Concurrent-Programs-in-Go/chapter10/stacktrace/deadlock/main.go:10 +0x66
created by main.main
/home/bserdar/github.com/Writing-Concurrent-Programs-in-Go/chapter10/stacktrace/deadlock/main.go:19 +0x9b

This stack trace shows where the deadlock occurred. The first goroutine is waiting to lock &forks[0], which implies it already locked &forks[1]. The second goroutine is waiting to lock &forks[1], which implies it already locked &forks[0], hence the deadlock.
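
For completeness (this variation is mine and is not part of the chapter's repository), the usual way to break such a cycle is to make every goroutine acquire the locks in the same global order. With the philosopher function unchanged, it is enough to start both goroutines with the forks in the same order:

func main() {
    forks := [2]sync.Mutex{}
    // Both goroutines now acquire forks[0] first and forks[1] second.
    // With a single global lock order, a cyclic wait can no longer form.
    go philosopher(&forks[0], &forks[1])
    go philosopher(&forks[0], &forks[1])
    select {}
}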

Now, let’s take a look at a more interesting panic. The following program contains a race, and it panics occasionally:

func main() {
    wg := sync.WaitGroup{}
    wg.Add(2)
    ll := list.New()
    // Goroutine that fills the list
    go func() {
        defer wg.Done()
        for i := 0; i < 1000000; i++ {
            ll.PushBack(rand.Int()) // line 18
        }
    }()
    // Goroutine that empties the list
    go func() {
        defer wg.Done()
        for i := 0; i < 1000000; i++ {
            if ll.Len() > 0 {
                ll.Remove(ll.Front())
            }
         }
    }()
    wg.Wait()
}

This program contains two goroutines: one adding elements to the end of a shared linked list, and the other removing elements from the beginning of the list. The program may run to completion, but sometimes it will panic with a stack trace that looks like this:

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x459570]
goroutine 17 [running]:
container/list.(*List).insert(...)
/usr/local/go/src/container/list/list.go:96
container/list.(*List).insertValue(...)
/usr/local/go/src/container/list/list.go:104
container/list.(*List).PushBack(...)
/usr/local/go/src/container/list/list.go:152
main.main.func1()
/home/bserdar/github.com/Writing-Concurrent-Programs-in-Go/chapter10/stacktrace/listrace/main.go:18 +0x170
created by main.main
/home/bserdar/github.com/Writing-Concurrent-Programs-in-Go/chapter10/stacktrace/listrace/main.go:15 +0xcd
exit status 2

It is a segmentation violation error, which means the program tried to access parts of memory it is not allowed to. In this case, the panic says it is addr=0x0, so the program tried to access the contents of a nil pointer. The stack trace shows how this happened: main.go:18 is where List.PushBack is called. Following the stack trace from bottom to top, we see that List.PushBack calls List.insertValue, which then calls List.insert. The nil pointer access happened in List.insert, at line 96, whose source code is available as follows:

92: func (l *List) insert(e, at *Element) *Element {
93:    e.prev = at
94:    e.next = at.next
95:    e.prev.next = e
96:    e.next.prev = e  // This is where panic happens
97:    e.list = l

Now, some simple deductive reasoning: line 96 can panic if e is nil, or e.next is nil. Looking at the source, e cannot be nil because otherwise, it would have panicked before line 96. Then, e.next must be nil. Is there a bug in the standard library code then, because the assignment is done without a nil check?

In these situations, it helps to learn a bit more about the underlying code rather than just inserting a nil check. If you look at the comments in the source code, you'll see the following:

// To simplify the implementation, internally a list l is implemented
// as a ring, such that &l.root is both the next element of the last
// list element (l.Back()) and the previous element of the first list
// element (l.Front()).

Since the list is implemented as a ring, the next and prev pointers cannot be nil. Even if there is only one node in the list, its next and prev pointers are not nil; they point back to the list's root element. So, somewhere else in the code, these pointers must be set to nil. Browsing the source for assignments of nil, we find the following:

func (l *List) remove(e *Element) {
    e.prev.next = e.next
    e.next.prev = e.prev
    e.next = nil // avoid memory leaks
    e.prev = nil // avoid memory leaks
    e.list = nil
    l.len--
}

There it is! When an element is removed from the list, it is detached from the ring by setting its pointers to nil. This behavior suggests a race where insert and remove are running concurrently: the next pointer of a node is set to nil by List.remove, but that removed node is then used as the at argument for List.insert, causing the panic. The solution is to create a mutex and move all list operations inside critical sections guarded by that mutex (not to add nil checks).
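
A minimal sketch of that fix, keeping the structure of the racy program above (this is my illustration, not the chapter's listing), could look like the following:

package main

import (
    "container/list"
    "math/rand"
    "sync"
)

func main() {
    var (
        wg sync.WaitGroup
        mu sync.Mutex // guards every access to ll
    )
    wg.Add(2)
    ll := list.New()
    // Goroutine that fills the list
    go func() {
        defer wg.Done()
        for i := 0; i < 1000000; i++ {
            mu.Lock()
            ll.PushBack(rand.Int())
            mu.Unlock()
        }
    }()
    // Goroutine that empties the list
    go func() {
        defer wg.Done()
        for i := 0; i < 1000000; i++ {
            mu.Lock()
            if ll.Len() > 0 {
                ll.Remove(ll.Front())
            }
            mu.Unlock()
        }
    }()
    wg.Wait()
}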

As I tried to show here, it is always advisable to fully understand the situation in which a panic happens. Most of the time, this requires researching the assumptions behind the underlying data structures. As in the linked-list example, if the data structure is written so that it cannot contain nil pointers, you should not simply add a nil check when you see one; instead, try to understand why you ended up with a nil pointer.

Detecting failures and healing

Most software systems will fail despite the effort spent on testing them. This suggests there are limits to what can be achieved by testing. These limitations stem from several facts about non-trivial systems. Any non-trivial system interacts with its environment, and it is simply not practical (and in many cases, not possible) to enumerate all the environments in which the system will run. Also, it is usually possible to test a system to make sure it behaves as expected, but it is much harder to develop tests to make sure the system does not behave unexpectedly. Concurrency adds further complexity: a program that was successfully tested for a particular scenario may fail for that same scenario when put into production.

In other words, no matter how much you test your programs, all sufficiently complex programs will eventually fail. So, it makes sense to architect systems for graceful failure and quick recovery. Part of this architecture is the infrastructure to detect anomalies and, if possible, heal them. Cloud computing and container technologies provide many tools to detect program failures and orchestration tools to restart them. Other monitoring, alerting, and automated recovery tools are also available for non-cloud-based deployments with traditional executables.

Some of these anomalies are cumulative failures that grow over time until all the resources are consumed. Memory leaks and goroutine leaks are such failures. If you notice repeated program restarts with out-of-memory errors, you should search for a memory or goroutine leak. The Go standard library provides the tooling for this:

  • Use the runtime/pprof package to add profiling support to your program and run it in a controlled environment to replicate the leak. It makes sense to add a flag to enable profiling, so you can turn profiling on and off without recompilation. You can use the CPU or heap profile to determine the source of the leak.
  • Use the net/http/pprof package to publish profiles over an HTTP server, so you can observe how your program uses memory as it is running (a minimal setup is sketched after this list).
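
As a rough sketch of the second option (the listen address and the shape of main are assumptions made for illustration), importing net/http/pprof for its side effects registers the profiling handlers on the default HTTP mux:

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/ handlers on the default mux
)

func main() {
    // Serve the profiling endpoints on a separate port; localhost:6060
    // is an arbitrary choice for this example.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... the rest of the program runs as usual ...
    select {}
}

You can then fetch a live profile with go tool pprof (for example, go tool pprof http://localhost:6060/debug/pprof/heap) or view the goroutine dump at /debug/pprof/goroutine?debug=1 in a browser.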

Sometimes, these anomalies are not bugs but are caused by reliance on other systems. For example, your program may rely on a response from a service that sometimes takes a very long time to return, or that fails to return at all. This is especially problematic if there is no way to cancel that call. Most network-based systems eventually time out, but that timeout may be unacceptably long for your program. It is also possible that your program calls out to a service or a third-party library that simply hangs. A practical solution may be to terminate the program gracefully and let the orchestration system start a new instance of the program.

The first problem is detecting failure. Let’s come up with a non-trivial realistic example: suppose we have a program that calls a SlowFunc function, which sometimes takes a long time to complete. Also, suppose there is no way to cancel that function. But we don’t want our program to wait indefinitely for the results of SlowFunc, so we develop the following scheme:

  • If the call to SlowFunc succeeds within a given duration (CallTimeout), we return the result.
  • If the call to SlowFunc lasts longer than CallTimeout, we return a Timeout error. Since there is no way to cancel SlowFunc, it will continue running in a separate goroutine until completion.
  • There may be many calls to SlowFunc that take a long time, so we want to limit the number of active concurrent calls to a given number. If all the available concurrent calls are waiting for SlowFunc to complete, then the function should fail immediately with a Busy error.
  • If the monitor stays full and none of the active calls completes within a given timeout (AlertTimeout), we raise an alert.

Let’s start developing this scheme as a generic type, Monitor[Req,Rsp any], where Req and Rsp are the request and response types, respectively:

type Monitor[Req, Rsp any] struct {
    // The amount of time to wait for the function
    // to return
    CallTimeout time.Duration
    // The amount of time to wait to raise an alert
    // when all concurrent calls are in progress
    AlertTimeout time.Duration
    // The alert channel
    Alert chan struct{}
    // Closing this channel stops the monitor goroutine
    Done chan struct{}
    // The function we are monitoring
    SlowFunc func(Req) (Rsp, error)
    // This channel keeps track of concurrent calls
    // to SlowFunc
    active chan struct{}
    // A signal to this channel means that the active
    // channel is full
    full chan struct{}
    // When an instance of SlowFunc returns, it will
    // generate a heartbeat
    heartBeat chan struct{}
}
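
The Call method shown next also returns two sentinel errors, ErrBusy and ErrTimeout, whose declarations are not part of this listing. A minimal definition, assuming they are plain package-level sentinel errors, could be the following:

import "errors"

var (
    // ErrBusy is returned immediately when the maximum number of
    // concurrent calls is already in progress.
    ErrBusy = errors.New("monitor: too many concurrent calls")
    // ErrTimeout is returned when SlowFunc does not complete within
    // CallTimeout.
    ErrTimeout = errors.New("monitor: call timed out")
)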

The Monitor.Call method calls Monitor.SlowFunc, implementing the timeout scheme described previously. It can produce one of three outcomes: a valid response (with or without an error from SlowFunc), a Timeout error after Monitor.CallTimeout, or a Busy error immediately:

func (mon *Monitor[Req, Rsp]) Call(ctx context.Context, req Req) (Rsp, error) {
    var (
        rsp Rsp
        err error
    )
    // If the monitor cannot accept a new call, return ErrBusy
    // immediately, but also start the alert timer
    select {
    case mon.active <- struct{}{}:
    default:
        // Start the alert timer
        select {
        case mon.active <- struct{}{}:
        case mon.full <- struct{}{}:
            return rsp, ErrBusy
        default:
            return rsp, ErrBusy
        }
    }
    // Call the function in a separate goroutine
    complete := make(chan struct{})
    go func() {
        defer func() {
            // Notify the monitor that the function returned
            <-mon.active
            select {
            case mon.heartBeat <- struct{}{}:
            default:
            }
            // Notify the caller that the call is complete
            close(complete)
        }()
        rsp, err = mon.SlowFunc(req)
    }()
    // Wait for the result or a timeout
    select {
    case <-time.After(mon.CallTimeout):
        return rsp, ErrTimeout
    case <-complete:
        return rsp, err
    }
}

Let’s dissect this method. Upon entry, the method attempts to send to mon.active. This is a non-blocking send, so it succeeds only if the number of active concurrent calls is smaller than the maximum allowed. If the maximum number of concurrent calls is already in progress, the default case is chosen, which attempts to send to mon.active again. This is a typical way of emulating priority between channels; here, mon.active is given priority. If the send to mon.active cannot proceed, the method tries to send to mon.full. That send can only succeed if there is a goroutine waiting to receive from mon.full, and, as will become clear shortly, that is only possible if mon.active is full but an alert timer has not been started yet. If the timer has already been started, the goroutine controlling it is no longer listening on mon.full, so the default case is selected and the call returns ErrBusy. If the send to mon.full succeeds, then this call is the one that starts the timer, and it also returns ErrBusy.

The second part is the actual call to mon.SlowFunc, which is done in a separate goroutine. Since there is no way to cancel mon.SlowFunc, this goroutine returns only when mon.SlowFunc returns. When that happens, several things take place: first, a receive from mon.active removes one entry from it, so the monitor can accept another call. Second, a non-blocking send to the mon.heartBeat channel stops the alert timer if one is running. The send has to be non-blocking because, as will be shown later, the monitor goroutine listens on mon.heartBeat only while the alert timer is active; if the timer is not active, nobody would ever receive from that channel.

In the final section, we wait for the result of mon.SlowFunc. If the complete channel is closed, the results are ready and we can return them. If the timeout strikes first, we return ErrTimeout. If mon.SlowFunc returns after that (which will most likely happen), its result is thrown away.

The interesting piece is the monitor goroutine itself. It is embedded in a NewMonitor function:

func NewMonitor[Req, Rsp any](callTimeout time.Duration,
    alertTimeout time.Duration,
    maxActive int,
    call func(Req) (Rsp, error)) *Monitor[Req, Rsp] {
    mon := &Monitor[Req, Rsp]{
        CallTimeout:  callTimeout,
        AlertTimeout: alertTimeout,
        SlowFunc:     call,
        Alert:        make(chan struct{}, 1),
        active:       make(chan struct{}, maxActive),
        Done:         make(chan struct{}),
        full:         make(chan struct{}),
        heartBeat:    make(chan struct{}),
    }
    go func() {
        var timer *time.Timer
        for {
            if timer == nil {
                select {
                case <-mon.full:
                    timer = time.NewTimer(mon.AlertTimeout)
                case <-mon.Done:
                    return
                }
            } else {
                select {
                case <-timer.C:
                    mon.Alert <- struct{}{}
                case <-mon.heartBeat:
                    if !timer.Stop() {
                        <-timer.C
                    }
                case <-mon.Done:
                    return
                }
                timer = nil
            }
        }
    }()
    return mon
}

The goroutine has two states: one where timer == nil and one where it is not. When timer == nil, fewer than maxActive concurrent calls are in progress, so there is no need for an alert timer. In this state, we listen on the mon.full channel. As we saw previously, if one of the calls sends to mon.full, the monitor creates a new timer and we enter the second state. In the second state, we listen on the timer channel and the heartBeat channel (and not on mon.full, which is why the non-blocking send is necessary in Monitor.Call). If a heartbeat arrives before the timer fires, we stop the timer and set it to nil, putting the goroutine back in the first state. If the timer fires first, we raise an alert.
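
One thing the listing does not show is how the monitor is shut down. Since both the monitor goroutine above and the alert-handling goroutine shown below exit when Done is closed, a minimal Stop method along these lines (an assumption on my part, not part of the chapter's code) would be sufficient:

// Stop shuts down the monitor goroutine and any alert handlers
// listening on Done. It assumes Stop is called at most once.
func (mon *Monitor[Req, Rsp]) Stop() {
    close(mon.Done)
}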

To use the monitor, you have to initialize it once and call SlowFunc through the monitor:

// Initialize the monitor with a 50-millisecond call timeout, a
// 5-second alert timeout, and at most 10 concurrent calls.
// The target function is SlowFunc.
var myMonitor = NewMonitor[*Request, *Response](50*time.Millisecond, 5*time.Second, 10, SlowFunc)
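
For this call to compile, there must be Request and Response types and a SlowFunc with a matching signature somewhere in the program. Hypothetical stand-ins, used here only to make the example concrete, might look like this:

// Hypothetical request/response types and a slow dependency used only
// for illustration; they are not part of the chapter's code.
type Request struct{ Query string }

type Response struct{ Result string }

func SlowFunc(req *Request) (*Response, error) {
    time.Sleep(2 * time.Second) // simulate a slow downstream call
    return &Response{Result: "done: " + req.Query}, nil
}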

We have to set up a goroutine to handle alerts:

go func() {
    for {
        select {
        case <-myMonitor.Alert:
            // Handle alert
        case <-myMonitor.Done:
            return
        }
    }
}()

Then, call the target function through the monitor:

response, err := myMonitor.Call(ctx, request)

In this scenario, there really isn’t much you can do to recover from the error. When an alert is triggered, you can send an email or a chat message to notify someone, or simply print out a log and terminate, so a fresh new process can be restarted.

Sometimes it makes sense to restart a failing goroutine. When a goroutine becomes unresponsive for a long time, a simple monitor can cancel the goroutine and create a new instance of it:

func restart(done chan struct{}, f func(done, heartBeat chan struct{}), timeout time.Duration) {
    for {
        funcDone := make(chan struct{})
        heartBeat := make(chan struct{})
        // Start the func in a new goroutine
        go func() {
            f(funcDone, heartBeat)
        }()
        // Expect a heartbeat before the timeout
        timer := time.NewTimer(timeout)
        retry := false
        for !retry {
            select {
            case <-done:
                close(funcDone)
                return
            case <-heartBeat:
                // The heartbeat came before the timeout, reset the timer
                if !timer.Stop() {
                    <-timer.C
                }
                timer.Reset(timeout)
            case <-timer.C:
                fmt.Println("Timeout, restarting func")
                // Attempt to stop the current function
                close(funcDone)
                // Break out of the for loop so a new goroutine can start
                retry = true
            }
        }
    }
}
…
// Run longRunningFunc with 100 millisecond timeout
restart(doneCh, longRunningFunc, 100*time.Millisecond)

This will restart longRunningFunc if it fails to deliver a heartbeat signal every 100 milliseconds. This can be because longRunningFunc failed, or because it is waiting for another long-running process that became unresponsive.
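
For the restart scheme to work, the monitored function must emit heartbeats and honor its done channel. A hypothetical longRunningFunc compatible with restart (the actual work is simulated with a sleep here) might look like this:

// longRunningFunc simulates a unit of work, sends a heartbeat after each
// unit, and returns when its done channel is closed.
func longRunningFunc(done, heartBeat chan struct{}) {
    for {
        // Placeholder for the real work.
        time.Sleep(10 * time.Millisecond)
        select {
        case heartBeat <- struct{}{}:
        case <-done:
            // The restart loop gave up on this instance; return instead
            // of leaking the goroutine.
            return
        }
    }
}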

Debugging anomalies

Concurrent algorithms have a way of working when observed and failing when not. Many times, a program that runs just fine in a debugger fails mysteriously in production environments. Sometimes, such failures come with a stack trace, and you can track it back to why it happened. But sometimes, failures are much more subtle with no clear indication of what went wrong.

Consider the monitor in the previous section. You might want to find out why SlowFunc hangs. You cannot really run it in a debugger and step through the code because you simply have no control over which invocation of the function hangs. But what you can do is print a stack trace when it happens. This is the nature of most anomalies in concurrent programs: you don’t know when it is going to happen, but you can usually tell that it did. So, you can print all sorts of diagnostic information to backtrack how the program got there. For instance, you can print the stack trace when the monitor raises an alert:

import (
    "runtime/pprof"
    …
)
...
go func() {
    select {
    case <-mon.Alert:
        pprof.Lookup("goroutine").WriteTo(os.Stderr, 1)
    case <-mon.Done:
        return
    }
}()

This will give the stack trace of all goroutines at the time the alert is raised, so you can see which goroutines called SlowFunc and what they are waiting for.

Detecting a deadlock is easy if the deadlock includes all active goroutines. When the runtime realizes that no goroutine can proceed, it prints a stack trace and terminates. It is not so trivial if at least one goroutine is still alive. A common scenario is a server in which handling a request results in a deadlock that includes only a few goroutines. Since the deadlock does not block all the goroutines, the runtime will never detect this situation, and all the goroutines in the deadlock will leak. If you suspect such a leak, it may make sense to add some instrumentation to diagnose the problem:

func timeoutStackTrace(timeout time.Duration) (done func()) {
    completed := make(chan struct{})
    done = func() {
        close(completed)
    }
    go func() {
        select {
        case <-time.After(timeout):
            pprof.Lookup("goroutine").WriteTo(os.Stderr, 1)
        case <-completed:
            return
        }
    }()
    return
}

The timeoutStackTrace function waits until the done function is called or the timeout expires. If the timeout expires first, it prints the stack trace of all active goroutines, so you can try to find the reason for the delay. It can be used as follows:

func (svc MyService) handler(w http.ResponseWriter, r *http.Request) {
    // Print a stack trace if the call does not complete
    // in 20 seconds
    defer timeoutStackTrace(20 * time.Second)()
    ...
}

As you can see, if you suspect a problem such as a deadlock or unresponsive goroutines, printing out a stack trace when you detect such a situation can be an effective troubleshooting technique.

Dealing with a race condition is usually harder. The best course of action is usually to develop a unit test replicating the situation in which you suspect a race and to run it with the Go race detector (use the -race flag). The race detector adds the necessary instrumentation to the program to validate memory operations and reports a race when it detects one. Since it relies on code instrumentation, the race detector can detect races only when they actually happen: if it reports a race, there is one, but if it does not report a race, that doesn’t mean there isn’t one. So, make sure you run your tests with race detection enabled for a while to increase the chance of triggering the race. Many race conditions manifest themselves as corrupt data structures, like the list example shown earlier in this chapter. This will require a lot of code reading (including reading the standard library or third-party library code) to identify the source of the problem as a race condition. However, once you realize that you are dealing with a race condition and not another bug, you can insert fmt.Printf calls or panic statements at critical points in the code to catch the moment something that should never happen does happen.
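
For example, a test along these lines (a sketch of mine; the package and names are assumptions) reproduces the linked-list race from earlier in the chapter and will be flagged when run with go test -race:

package listrace

import (
    "container/list"
    "sync"
    "testing"
)

// TestListRace exercises the same unsynchronized list access as the
// program shown earlier in the chapter.
func TestListRace(t *testing.T) {
    ll := list.New()
    var wg sync.WaitGroup
    wg.Add(2)
    go func() {
        defer wg.Done()
        for i := 0; i < 1000; i++ {
            ll.PushBack(i)
        }
    }()
    go func() {
        defer wg.Done()
        for i := 0; i < 1000; i++ {
            if ll.Len() > 0 {
                ll.Remove(ll.Front())
            }
        }
    }()
    wg.Wait()
}

The race detector will report the conflicting accesses to the list’s internal fields even on runs where the test does not crash.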

Summary

This chapter showed some techniques that can be useful for troubleshooting concurrent programs. The key idea when dealing with such programs is that you can usually tell something bad happened only after it happens. So, adding code that generates alerts with diagnostic information can be a lifesaver. Many times, this is simply additional logging, or Printf statements and panics (a.k.a. the poor man’s debugger). Add such code to your programs and keep it alive in production. Immediate failure is almost always better than incorrect computation.

Further reading

The Go development environment comes with numerous tools for diagnostics.
