8

Monitoring System Resources

In the last chapter, we learned how we can manage tasks that are running on our server. We now know how to see what's running in the background, how to enable or disable a unit from starting at boot time, and also how to schedule tasks to run in the future. But in order for us to be able to effectively manage the tasks that our servers carry out, we also need to keep an eye on system resources. If we run out of RAM, fill up our disk, or overload our CPU, then a server that normally processes tasks very efficiently might come to a screeching halt. In this chapter, we'll take a look at these resources and how to monitor them.

Our discussion on resource management will include:

  • Viewing disk usage
  • Monitoring memory usage
  • Understanding load average
  • Viewing resource usage with htop

One resource that is extremely important on our servers is storage, and keeping track of such things as available disk space is critical as even the most powerful server you can purchase would be unable to function without free disk space. We'll take a look at some ways to monitor disk usage in the next section.

Viewing disk usage

Keeping an eye on your storage is always important, as no one enjoys getting a call in the middle of the night saying that a server encountered an issue, especially not something that could've been easily avoided, such as a filesystem growing too close to being full. Managing storage on Linux systems is simple once you master the related tools, the most useful of which I'll go over in this section. In particular, we'll look at tools we can use to answer the question "what's using up all the disk space?", which is the most common question that comes up when dealing with disk usage.

First, let's look at the df command.

Using df

The df command is likely always going to be your starting point in situations where you don't already know which volume or mount point is becoming full. When executed, it gives you a high-level overview, so it's not necessarily useful when you want to figure out who or what in particular is hogging all your space. However, when you just want to list all your mounted volumes and see how much space is left on each, df fits the bill. By default, it shows you the information in 1 KB blocks. I find it easier to add the -h option (which produces human-readable output) so that the information is a bit easier to read. Go ahead and give it a try:

df -h

This should produce an output that looks something like the following:

Figure 8.1: Output from the df -h command

The output will look different depending on the types of disks and mount points on your system. In the screenshot, you'll see that the root filesystem is located on /dev/mapper/ubuntu--vg-ubuntu--lv. We know this because under the Mounted on column, we see that the mount point is set to a single forward slash (/). As we discussed in Chapter 4, Navigating and Essential Commands, this single forward slash refers to the beginning of the filesystem (also referred to as the root filesystem). In my case, this is an LVM volume, which is why we have a device with such a long name, beginning with /dev/mapper. Let's not worry about LVM for now; we'll cover it in Chapter 9, Managing Storage Volumes. For now, just keep in mind that the single forward slash refers to the beginning of the filesystem, and the device name on the left refers to the actual device that's mounted there.

The actual device name varies from one server to another, and also depends on whether you chose to utilize LVM during installation. Instead of a long path beginning with /dev/mapper, you may instead see the device name as /dev/sda1, /dev/xvda1, /dev/nvme0n1p1, or other variations. The device name is derived from the type of the underlying storage hardware, such as the /dev/nvme... naming convention used for NVMe drives, /dev/sdXN for standard SATA drives, and so on.

The actual type of the underlying hardware doesn't matter so much; what really matters is that you can identify which device is in the most danger of becoming full. In the example screenshot, the root filesystem is using 36% of its available space, so we aren't in danger of running out. There are some loopback devices at 100% usage (identified by a naming scheme of /dev/loopN), but those aren't actually a concern, as system processes may create loopback devices for various purposes as needed.
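
By the way, if those loopback devices clutter your output (on Ubuntu, they're typically snap packages mounted as squashfs filesystems), you can ask df to exclude that filesystem type. This is just a convenience, not a requirement:

df -h -x squashfs

The -x option excludes the filesystem type you give it, leaving you with a cleaner view of the volumes you actually care about.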

While investigating disk utilization, it's also important to check inode utilization. Checking inode utilization will be especially helpful in situations where it's being reported that your disk is full, yet the df -h command shows plenty of free space available. It can definitely be very confusing the first time you run into this situation. In such a scenario, it may be that you've run out of inodes, and your disk isn't actually full from a free space perspective.

But what exactly is an inode, and why would such a thing cause a disk to be reported as full when it's actually not? Think of an inode as a type of database object, containing metadata for the actual items you're storing. Information stored in inodes includes details such as the owner of the file, permissions, the last modified date, and the type (whether it is a directory or a file). While metadata is certainly a good thing to have, the problem with inodes is that you can only have a limited number of them on any storage device.
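
If you'd like to see this metadata for yourself, every file's inode can be inspected with standard tools. For example, ls -i prints a file's inode number, and stat prints the metadata stored in the inode (the /etc/hostname file is just an example here; any file will do):

ls -i /etc/hostname
stat /etc/hostname

In the output of stat, you'll see the inode number alongside the owner, permissions, and timestamps, which is exactly the metadata we've been discussing.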

The inode limit is usually extremely high and very hard to reach, though. On servers used for workloads that result in storing hundreds of thousands of files, however, the inode limit can become a real problem. I'll show you some output from one of my servers to help illustrate this:

df -i 

The output of this command is as follows:

Figure 8.2: Output from the df -i command

As you can see, the -i option of df gives us information regarding inodes instead of the actual space used. In this example, the root filesystem on the example server has a total of 6291456 inodes available, of which 74361 are used and 6217095 are free. If you have a system that's reporting a full disk (though you see plenty of space is free when running df -h), it may actually be an issue with your volume running out of inodes. In this case, the problem would not be the size of the files on your disk, but rather the sheer number of files you're storing. In my experience, I've seen this happen because of mail server queues backing up (millions of stuck emails, with each email being a file), as well as unmaintained log directories. It may seem as though having to contend with an inode limitation is unbecoming of a legendary platform such as Linux; however, as I mentioned earlier, this limit is very hard to reach unless something is very, very wrong. So, in summary, in a situation where you're getting an error message that no space is available even though you clearly have plenty, check the inodes.
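
If you do run out of inodes, the culprit is usually a directory containing an enormous number of small files. As a rough sketch for hunting it down (this assumes the problem volume is mounted at /; adjust the path to suit), you can count files per top-level directory:

sudo find / -xdev -type f | cut -d/ -f2 | sort | uniq -c | sort -rn | head

The -xdev option keeps find from wandering onto other mounted filesystems, and the rest of the pipeline tallies how many files live under each top-level directory, with the worst offenders sorted first.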

Diving deeper into disk usage

The next step in investigating what's gobbling up your disk space is finding out which files in particular are using it all up. At this stage, there is a multitude of tools you can use to investigate. The first I'll mention is the du command, which is able to show you how much space a directory is using. Using du against directories and sub-directories will help you narrow down the problem. Like df, we can also use the -h option with du to make our output easier to read. By default, du will scan the current working directory your shell is attached to and give you a list of each item within the directory, how much space each item consumes, as well as a summary at the end.

The du command is only able to scan directories that its calling user has permission to scan. If you run this as a non-root user, then you may not be getting the full picture. Also, the more files and sub-directories that are within your current working directory, the longer this command will take to execute. If you have an idea where the resource hog might be, try to cd into a directory further in the tree to narrow your search down and reduce the amount of time the command will take.

The output of du -h can often be more verbose than you actually need in order to pinpoint your culprit and can fill several screens. To simplify it, my favorite variation of this command is the following:

du -hsc * 

Basically, you would run du -hsc * within a directory that's as close as possible to where you think the problem is. The -h option, as we know, gives us human-readable output (essentially, giving us output in the form of megabytes, gigabytes, and so on). The -s option gives us a summary, and -c provides us with the total amount of space used within our current working directory. The following screenshot shows this output from my desktop:

Figure 8.3: Example output from du -hsc *

To make the example more interesting, I took that screenshot from my personal desktop rather than a server, but the command and its syntax are exactly the same. As you can see, the information provided by du -hsc * is a nice, concise summary. From the output, we can clearly see how much space each of the directories within our working directory currently takes. For example, I have 38 GB used in my projects directory right now. Not only am I storing the files for this book in that folder, I'm also storing raw video recordings there for my YouTube channel, so this directory can become quite large at times.

At this point, we know which directories at the top level of our current working directory are using the most space. But we still need to narrow this down to what in particular within those directories is responsible for using that space. To dive deeper, we could cd into any of those large directories and run the du command again. After a few runs, we should be able to narrow down the largest files within these directories and make a decision on what we want to do with them. Perhaps we can clean out unnecessary files or add another disk. Once we know what is using up our space, we can decide what we're going to do about it.
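
As a variation on that workflow, you can have the sizes sorted for you rather than eyeballing the output. The following sketch (assuming GNU coreutils, which Ubuntu provides, and using /var purely as an example starting point) prints the largest subdirectories first:

sudo du -h --max-depth=1 /var | sort -hr | head

Here, --max-depth=1 limits the report to the immediate subdirectories, and sort -hr sorts the human-readable sizes in descending order.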

At this point in reading this book, you're probably under the impression that I have some sort of strange fixation on saving the best for last. You'd be right. I'd like to finish off this section by introducing you to one of my favorite applications, the NCurses Disk Usage utility (or more simply, ncdu). The ncdu command is one of those things that administrators who constantly find themselves dealing with disk space issues learn to love and appreciate. In one go, this command not only gives you a summary of what is eating up all your space, it also gives you the ability to traverse the results without having to run a command over and over while manually walking your directory tree. You simply execute it once, and then you can navigate the results and drill down as far as you need.

To use ncdu, you will need to install it as it doesn't come with Ubuntu by default:

sudo apt install ncdu

Once installed, simply execute ncdu in your shell from any starting directory of your choosing. When done, press q on your keyboard to quit. Like du, ncdu is only able to scan directories that the calling user has access to, so you may need to run it as root to get an accurate portrayal of your disk usage.

You may want to consider using the -x option with ncdu. This option will limit it to the current filesystem, meaning it won't scan network mounts or additional storage devices; it'll just focus on the device you started the scan on. This can save you from scanning areas that aren't related to your issue.
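
For example, to scan the entire root filesystem while staying on that one device, you might run something like the following:

sudo ncdu -x /

Running it with sudo ensures directories that your user can't read are included in the results, giving you the complete picture.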

When executed, ncdu will scan every directory from its starting point onward. When finished, it will give you a menu-driven layout allowing you to browse through your results:

Figure 8.4: ncdu in action

Again, I took this screenshot from my desktop, from within my home directory. What ncdu does is show you your disk usage from your current directory down, ordering the results by placing the items with the highest usage toward the top. To move around inside ncdu, move your selection (indicated with a long white highlight) with the up and down arrow keys. If you press Enter on a directory, ncdu switches to showing you the summary of that directory, and you can continue to drill down as far as you need. In fact, you can actually delete items and entire folders by pressing d. Therefore, ncdu not only allows you to find what is using up your space, it allows you to take action as well!

Sometimes, it's obvious what's taking up space on a disk, and ncdu may not always be necessary. Generally speaking, you'll start out your investigation with df -h, to see which storage volume is the one that's running out of space. Then, you'll go into that directory and run another command, such as du -hsc *, to see which directory is using up the most space. If you don't immediately know from the output of du what the underlying issue is, then consider using a tool such as ncdu to dive down even deeper.

Although monitoring storage is critical, we also need to keep an eye on free memory. Next up, we'll take a look at how to monitor the memory of our server.

Monitoring memory usage

I forget things all the time. I regularly forget where my car keys are, even though they're almost always right there in my pocket the entire time. I even forget to use sudo for commands that normally require it, despite more than 18 years of working with Linux. Thankfully, computers have a better memory than I do, but if we don't manage it effectively, the memory on our servers will be just as useless as I am when I forget to put freshly washed laundry in the dryer.

Understanding how Linux manages memory can actually be a somewhat complex topic, as working out how much memory is truly free can be a hurdle for newcomers to overcome. You'll soon see that the way Linux manages memory on your server is actually fairly straightforward once explained.

Understanding server memory

For the purpose of monitoring memory usage on our server, we have the free command at our disposal, which we can use to see how much memory is being consumed at any given time. Running the free command with no options will result in the output being shown in terms of kilobytes:

Figure 8.5: Output of the free command

My favorite variation of this command is free -m, which shows the amount of memory in use in terms of megabytes. You can also use free -g to show the output in terms of gigabytes, but the output won't be precise enough on most servers. In my opinion, adding the -m option makes the free command much more readable:

Figure 8.6: Output of the free -m command

Since everything is broken down into megabytes, it's much easier to read, at least for me.

At first glance, it may appear as though this server has only 179 MB free. You'll see this in the first row and third column, under free. In actuality, the number you'll really want to pay attention to is the one under available, which is 1613 MB in this case. That's how much memory is actually free. Since this server has 1987 MB of total RAM available (you'll see this on the first row, under total), this means that most of the RAM is free, and this server is not really working that hard at all.

Some additional explanation is necessary to truly understand these numbers. You could very well stop reading this section right now as long as you take away from it that the available column represents how much memory is free for your applications to use. However, it's not quite that simple. Technically, when you look at the output, the server really does have 179 MB free. The amount of memory listed under available is legitimately being used by the system in the form of a cache but would be freed up in the event that any application needed to use it. If an application starts and needs a decent chunk of memory in order to run, the kernel will provide it with some memory from this cache.

Linux, like most modern systems, subscribes to the belief that "unused RAM is wasted RAM." RAM that isn't being used by any process is given to what is known as a disk cache, which is utilized to make your server run more efficiently. When data needs to be written to a storage device, it's not written right away. Instead, this data is written to the disk cache (a portion of RAM that's set aside) and then synchronized to the storage device later in the background. The reason this makes your server more efficient is that data stored in RAM can be written and retrieved much faster than data on disk, and applications and services can synchronize data to the disk in the background without forcing you to wait for it. This cache also works for reading data: when you first open a file, its contents are cached, and the system will then retrieve the contents from RAM if you read the same file again, which is more efficient than loading it from the storage volume each time. If you just recently saved a new file and retrieve it right away, it's likely still in the cache and will be retrieved from there, rather than from the disk directly.
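
If you'd like to see this cache behavior for yourself on a test machine, you can compare the buff/cache column of free before and after asking the kernel to drop its clean caches. To be clear, this is a demonstration only; dropping caches on a production server will temporarily slow it down while the cache is rebuilt:

free -m
sync
echo 3 | sudo tee /proc/sys/vm/drop_caches
free -m

After the caches are dropped, the free column grows and buff/cache shrinks, illustrating that the cached memory really was reclaimable all along.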

To understand all of the columns shown in Figure 8.6, I'll outline the meaning of each in the following table:

Column       Meaning
total        The total amount of RAM installed on the server.
used         The memory that is used (from any source). This is calculated
             as follows: used = total - free - buffers - cache.
free         The memory not being used by anything, the cache or otherwise.
shared       The memory used by tmpfs as well as other shared resources.
buff/cache   The amount of memory being used by the buffers and cache.
available    The memory that is free for applications to use.

You may have noticed in Figure 8.6 that memory usage is also listed for a resource called Swap. Let's take a look at that as well. We will dedicate the next section entirely to it so that we ensure we understand what it is, and what it does for us.

Managing swap

Swap is one of those things we never want to use but always want to make sure is available. It's kind of like car insurance: no one is excited to buy it, but we do want to have it in case something bad happens. There's even some debate among administrators on whether or not swap is still relevant today. It's definitely relevant, regardless of what anyone says, as it's a safety net of sorts. (And since disk space is cheaper nowadays, dedicating some of our storage to this task isn't really a big deal, so we may as well.)

So what is it? Swap is basically a partition or a file that acts as RAM in situations where your server's memory is saturated. If we manage a server properly, we hope to never need it, as swap is stored on your hard disk, which is orders of magnitude slower than RAM. But if something goes wrong on your server and your memory usage skyrockets, swap may save you from having your server go down. The Out of Memory (OOM) Killer may also activate itself when memory is full, to kill a misbehaving process that's using the majority of your memory, but as much as possible, we don't want to rely on that and instead ensure adequate swap in case memory is exhausted.

The way swap is configured by default in Ubuntu has changed a bit since the time the first edition of this book was published. With Ubuntu 16.04 and earlier, a swap partition was automatically created for you if you chose the default partitioning scheme during installation, and if you didn't create a swap partition when partitioning your server, the installer would yell at you for it. However, a swap partition is no longer created by default in modern versions of Ubuntu; instead, the installer will create a swap file for you automatically. You may still see swap partitions on older installations, but going forward, a swap file is the best way to handle it, at least until something better comes along. If you need more swap space, you can delete the swap file and recreate it at a larger size. That's definitely easier than having to resize your partition tables in order to enlarge swap, which is potentially dangerous if you make a mistake during the process.

Therefore, I'm not going to talk about creating a swap partition in this edition of the book, as there's no reason to do so anymore.

The swap file for your server is declared in the /etc/fstab file (we'll discuss the /etc/fstab file in more detail in Chapter 9, Managing Storage Volumes). In most cases, you would've had a swap file created for you during installation. You could, of course, add a swap file later if for some reason you don't have one. In the case of some cloud instance providers, you may not get a swap file by default. In that situation, you would create a swap file yourself (we'll discuss the process later in this section) and then use the swapon command to activate it:

sudo swapon -a

When run, the swapon -a command will find your swap file or partition in /etc/fstab (if it's declared there) and activate it for use. The inverse of this command is swapoff -a, which deactivates your swap. It's rare that you'd need to disable swap, unless, of course, you were planning on deleting your swap file in order to create a larger one. If you find out that your server has an inadequately sized swap file, that may be a course of action you would take.

While having swap is generally a good idea, there are actually some applications that prefer that the server doesn't have it at all. This is rare, but Kubernetes is a good example: its installation will complain if you do have swap enabled. Not very many applications run better if the server has no swap file, and in the case of a Kubernetes cluster, the individual servers within such a cluster would be a special case anyway, each dedicated to the task of running containers (which is what Kubernetes does; more on that in Chapter 18, Container Orchestration).

When you check your free memory (hint: execute free -m), you'll see swap listed whether you have it or not, but when swap is deactivated, you will see all zeros for the size totals.
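
You can also list the active swap areas directly with the swapon command:

swapon --show

If swap is active, you'll see each swap file or partition along with its size and how much of it is currently in use; if nothing is printed at all, no swap is active.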

So, how do you actually create a swap file? To do so, you'll first create the file to be used as swap. This can be stored anywhere, but /swapfile is typically ideal. You can use the fallocate command for this, which creates a file of exactly the size you specify:

sudo fallocate -l 4G /swapfile

Here, I'm creating a 4 GB swap file, but feel free to make yours whatever size you want in order to fit your needs. Next, we need to prepare this file to be used as swap. First, we'll fix the permissions, as this file should be a bit more restrictive than most:

sudo chmod 0600 /swapfile

Then, we can use the mkswap command to convert this file into an actual swap file:

sudo mkswap /swapfile

Now, we have a handy-dandy swap file stored in our root filesystem. Next, we'll need to activate it. As always, it's recommended that we add this to our /etc/fstab file. What follows is an example entry:

/swapfile   none   swap   sw   0 0

From this point, we can activate our new swap file with the swapon command that I mentioned earlier:

sudo swapon -a

Now, the swap file is active and in use. While I certainly hope you won't need to resort to using swap, I know from experience that it's only a matter of time. Knowing how to add and activate swap when you need it is definitely a good practice, but for the most part, you should be fine because, by default on most platforms, you'll have swap created for you when setting up Ubuntu for the first time during installation. If you do need to create it manually for whatever reason, I always recommend a bare minimum of 2 GB on servers, but if you can manage to create a larger one for this purpose, that's even better.
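
As a quick sketch of the enlargement process I mentioned earlier (this assumes your swap file lives at /swapfile and that 8 GB is the new size you're after), you would deactivate swap, grow the file, and then reformat and reactivate it:

sudo swapoff -a
sudo fallocate -l 8G /swapfile
sudo mkswap /swapfile
sudo swapon -a

Since the /etc/fstab entry refers to the same path, no further changes are needed there.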

How much swap is being used is something you should definitely keep an eye on. When the memory starts to get full, the server will start to utilize the swap file that was created during installation. It's normal for a small portion of swap to be utilized even when the majority of the RAM is free. But if a decent chunk of swap is being used, it should be investigated (perhaps a process is using a larger than normal amount of memory).

You can actually control at which point your server will begin to utilize swap. How frequently a Linux server utilizes swap is referred to as its swappiness. By default, the swappiness value on a Linux server is typically set to 60. You can verify this with the following command:

cat /proc/sys/vm/swappiness

The higher the swappiness value, the more likely your server will utilize swap. If the swappiness value is set to 100, your server will use swap as much as possible. If you set it to 0, the kernel will avoid swapping for as long as it possibly can. This value correlates roughly to the percentage of RAM being used before swap kicks in. For example, if you set swappiness to 20, swap will be used when RAM becomes (roughly) 80 percent full. If you set it to 50, swap will start being used when half your RAM is being used, and so on.

To change this value on the fly, you can execute the following command:

sudo sysctl vm.swappiness=30 

This method doesn't set swappiness permanently, however. When you execute that command, swappiness will immediately be set to the new value and your server will act accordingly. Once you reboot, though, the swappiness value will revert back to the default. To make the change permanent, open the following file with your text editor:

/etc/sysctl.conf 

A line in that file corresponding to swappiness will typically not be included by default, but you can add it manually. To do so, add a line such as the following to the end of the file and save it:

vm.swappiness = 30 
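
To apply everything in /etc/sysctl.conf without waiting for a reboot, and to then confirm the running value, you can run the following:

sudo sysctl -p
cat /proc/sys/vm/swappiness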

Changing this value is one of many techniques within the realm of performance tuning. While the default value of 60 is probably fine for most, there may be a situation where you're running a performance-minded application and can't afford to have it swap any more than it absolutely has to. In such a situation, you would try different values for swappiness and use whichever one works best during your performance tests.

In the next section, we'll take a look at another important metric to keep an eye on: load average. The load average gives us an idea of how busy the CPU(s) might be, so we can better understand how to tell when our server is overwhelmed and we may need to take action.

Understanding load average

Another very important topic to understand when monitoring performance is load average, which is a series of numbers that represents your server's trend in CPU utilization over a given time. You've probably already seen this series of numbers before, as there are several places in which the load average appears. If you run the htop utility, for example, the load average is shown on the screen. In addition, if you execute the uptime command, you can see the load average in the output of that command as well. You can also view your load average by reading the virtual file where the kernel exposes it:

cat /proc/loadavg

Personally, I habitually use the uptime command to view the load average. This command not only gives me the load average but also tells me how long the server has been running.

The load average is broken down into three sections, each representing 1 minute, 5 minutes, and 15 minutes respectively. A typical load average may look something like the following:

0.36, 0.29, 0.31

In this example, we have a load average of 0.36 in the 1-minute section, 0.29 in the 5-minute section, and 0.31 in the 15-minute section. In particular, each number represents the average number of tasks that were waiting for attention from the CPU during that time period. Therefore, these numbers are really good. The server isn't that busy, since virtually no task is waiting for the CPU at any one moment (each number is less than 1). This is contrary to something such as overall CPU percentages, which you may have seen in task managers on other platforms. While viewing your CPU usage percentage can be useful, the problem is that your CPUs will constantly bounce from a high percentage of usage to a low percentage of usage, which you can see for yourself by just running htop for a while. When a task does some sort of processing, you might see your cores shoot up to 100 percent and then right back down to a lower number. That really doesn't tell you much, though. With load averages, you're seeing the trend of usage over three given time frames, which is more accurate in determining whether your server's CPUs are running efficiently or are choking on a workload they just can't handle.

The main question, though, is when you should be worried, which really depends on what kind of CPUs are installed on your server. Your server will have one or more CPUs, each with one or more cores. To Linux, each of these cores, whether they are physical or virtual, is the same thing (a CPU). In my case, the machine I took the earlier output from has a CPU with four cores. The more CPUs your server has, the more tasks it's able to handle at any given time, which also means it can handle a higher load average.

When a load average for a particular time period is equal to the number of CPUs on the system, that means your server is at capacity. It's handling a consistent number of tasks that are equal to the number of tasks it can handle. For example, if you have an 8-core CPU and the load average is 8 for a given time frame, then the CPU is 100% at its available capacity for that time frame. If your load average is consistently more than the number of cores you have available, that's when you'd probably want to look into the situation. It's fine for your server to be at capacity every now and then, but if it always is, that's a cause for alarm.
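
Since the load average is only meaningful relative to the number of CPUs, it's handy to view both together. One quick way to do so (nproc is part of GNU coreutils, so it's available on Ubuntu by default):

nproc
uptime

If the load averages reported by uptime are consistently higher than the number that nproc prints, your server is over capacity.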

I'd hate to use a cliché example in order to fully illustrate this concept, but I can't resist, so here goes. A load average on a Linux server is equivalent to the check-out area at a supermarket. A supermarket will have several registers open, where customers can pay to finalize their purchases and move along. In my experience, you would have something like 20 check-out registers but only two cashiers working at any one time, but for this example, we'll assume each register has a cashier operating it.

Each cashier is only able to handle one customer at a time. If there are more customers waiting to check out than there are cashiers, the lines will start to back up and customers will get frustrated. In a situation where there are four cashiers and four customers being helped at a time, the cashiers would be at capacity, which is not really a big deal since no one is waiting. What can add to this problem is a customer who is paying by check and/or using a few dozen coupons, which makes the checkout process much longer (similar to a resource-intensive process). If there were four cashiers and six customers waiting, then there would be two more customers than the store is able to handle at the same time. This is essentially how load average works. Each cashier is a CPU, and each customer is a process that needs CPU time.

Just like the cashiers, each CPU can only handle one task at a time, with some tasks hogging the CPU longer than others. If there are exactly as many tasks as there are CPUs, there's no cause for concern. But if the lines start to back up, we may want to investigate what is taking so long. To take action, we may hire an additional cashier (add a new CPU) or ask a disgruntled customer to leave (kill a process).

Let's take a look at another example load average:

1.87, 1.53, 1.22

In this situation, we shouldn't be concerned, because our hypothetical server has four CPUs, and none of them have been at capacity within the 1-, 5-, or 15-minute time periods. Even though the load is consistently higher than 1, we have CPU resources to spare, so it's no big deal. If we had one of those awesome new Threadripper CPUs from AMD, which can have 32 cores (or maybe more by the time you're reading this), then those numbers would represent an extremely low load. Going back to our supermarket comparison, the load average in the previous example would be equivalent to having four cashiers with an average of almost two customers being assisted during any given minute. If this server only had one CPU, we would probably want to figure out what's causing the line to back up.

While having a low load average is usually a good thing, it can actually represent a really big problem depending on the context. When we deploy servers, we do so to get some sort of work done.

Whether that "work" is to host an application or run jobs to process data, our servers need to be doing some sort of work; otherwise, we're wasting money by having them. If the load average of your server drops to an abnormally low value, that might mean that a service that would normally be running all the time has failed and exited. For example, if you have a database server that constantly has a load within the 1.x range that suddenly drops to 0.x, that might mean that you either legitimately have less traffic or the database server service is no longer running. This is why it's always a good idea to develop baselines for your server, in order to gauge what is normal and what isn't. Most of the time, a baseline refers to typical resource usage. If the resource usage is drastically higher or even lower than the baseline, that's a potential cause for concern either way.

Overall, load averages are something you'll become very familiar with as a Linux administrator if you haven't already. As a snapshot in time of how heavily utilized your server is, it will help you to understand when your server is running efficiently and when it's having trouble. If a server is having trouble keeping up with the workload you've given it, it may be time to consider increasing the number of cores (if you can) or scaling out the workload to additional servers. When troubleshooting utilization, planning for upgrades, or designing a cluster, the process always starts with understanding your server's load average so you can plan your infrastructure to run efficiently for its designated purpose.

Now that we've gone over the important resources that we need to monitor to ensure our server remains healthy, let's take a look at a useful utility we can utilize that will make resource usage even easier to understand.

Viewing resource usage with htop

When wanting to view the overall performance of your server, nothing beats htop. Although not typically installed by default, htop is one of those utilities that I recommend everyone installs as soon as possible, since it's indispensable when wanting to check on the resource utilization of your server. If you don't already have htop installed, all you need to do is install it with apt:

sudo apt install htop 

When you run htop at your shell prompt, you will see the htop application in all its glory. In some cases, it may be beneficial to run htop as root, since doing so does give you additional options such as being able to kill processes, though this is not required:

Figure 8.7: Running htop

At the top of the htop display, you'll see a progress meter for each of your cores (the server used for my screenshot only has one core), as well as meters for memory and swap. In addition, the upper portion will also show your Uptime, Load average, and the number of Tasks you have running. The lower section of htop's display will show you a list of processes running on your server, with fields showing you useful information such as how much memory or CPU is being consumed by each process, as well as the command being run, the user running it, and its Process ID (PID). We discussed PIDs in Chapter 7, Controlling and Managing Processes. To scroll through the list of processes, you can press Page Up or Page Down or use your arrow keys. In addition, htop features mouse support, so you are also able to click on columns at the top in order to sort the list of processes by that column. For example, if you click on MEM% or CPU%, the process list will be sorted by memory or CPU usage respectively. The contents of the display are updated every 2 seconds.

The htop utility is also customizable. If you prefer a different color scheme, for example, you can press F2 to enter Setup mode, navigate to Colors on the left, and then you can switch your color scheme to one of the six that are provided. Other options include the ability to add additional meters, add or remove columns, and more. One tweak I find especially helpful on multicore servers is the ability to add an average CPU bar. Normally, htop shows you a meter for each core on your server, but if you have more than one, you may be interested in the average as well. To do so, enter Setup mode again (F2), then with Meters highlighted, arrow to the right to highlight CPU average and then press F5 to add it to the left column. There are other meters you can add as well, such as Load average, Battery, and more.

Depending on your environment, function keys may not work correctly in terminal programs such as htop, because those keys may be mapped to something else. For example, F10 to quit htop may not work if F10 is mapped to a function within your terminal emulator, and using a virtual machine solution such as VirtualBox may also prevent some of these keys from working normally.

Here's an example of htop configured with a meter for the CPU average:

Figure 8.8: htop with a meter for CPU average added

When you open htop, you will see a list of processes for every user on the system. When you have a situation where you don't already know which user/process is causing extreme load, this is ideal. However, a very useful trick (if you want to watch a specific user) is to press u on your keyboard, which will open up the Show processes of: menu. In this menu, you can select a specific user by highlighting it with the up or down arrow keys and then pressing Enter to only show processes for that user. This will greatly narrow down the list of processes.
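
If you already know which user you're interested in before launching htop, you can skip the menu entirely by starting htop with the -u option (jdoe here is just a placeholder username):

htop -u jdoe

This launches htop with the process list already filtered down to that user's processes.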

Another useful view is the tree view, which allows you to see a list of processes organized by their parent/child relationships, rather than just a flat list. In practice, it's common for a process to be spawned by another process. In fact, every process in Linux (aside from the very first one) is spawned from another process, and this view shows that relationship directly. In a situation where you are stopping a process only to have it immediately re-spawn, you would need to know what the parent of that process is in order to stop it from resurrecting itself. Pressing F5 will switch htop to tree view mode, and pressing it again will disable the tree view.

With the tree view activated, htop will appear similar to the following:

Figure 8.9: htop with tree view activated

As I've mentioned, htop updates its stats every 2 seconds by default. Personally, I find this to be ideal, but if you want to change how fast it refreshes, you can call htop with the -d option and pass a delay value, which is entered in tenths of seconds. For example, to run htop but have it update every 7 seconds, start htop with the following command:

htop -d 70 

To kill a process with htop, use your up and down arrow keys to highlight the process you wish to kill and press F9. A new menu will appear, giving you a list of signals you are able to send to the process with htop. SIGTERM, as we discussed before, will attempt to gracefully terminate the process. SIGKILL will terminate it uncleanly. Once you highlight the signal you wish to send, you can send it by pressing Enter or cancel the process with Esc.

As you can see, htop can be incredibly useful and has (for the most part) replaced the legacy top command that was popular in the past. The top command is available by default in Ubuntu Server and is worth a look, if only as a comparison to htop. Like htop, the top command gives you a list of processes running on your server, as well as their resource usage. There are no pretty meters and there is less customization possible, but the top command serves the same purpose. In most cases, though, htop is probably your best bet going forward.

Summary

In this chapter, we learned how to monitor our server's resource usage. We began with a look at the commands we can use to investigate disk usage, and we also learned how to monitor memory usage as well. We also discussed swap, including what it is, why you'd want to have it, as well as how to create a swap file manually should the need to do so come up. We then took a look at load average and closed out the chapter by checking out htop, which is my favorite utility for getting an overall look at resource usage on servers.

In Chapter 9, Managing Storage Volumes, we'll take a closer look at storage. In this chapter, we learned how to see how much is being used, but in the next we'll look at more advanced concepts surrounding storage, such as formatting volumes, adding additional volumes, and even LVM. See you there!
