Chapter 10. SysDB

One of the most impressive EOS features that makes it stand out from the competition is SysDB. Simply put, SysDB is a System Database on the switch that holds all the state, variables, and any other important information so that processes may access it. Doesn’t sound too earth-shattering, does it? Read on.

Traditionally, switches (and every other networking device out there) were built using monolithic code. Naturally, when I read the word monolithic, I think of apes dancing around the monolith in Stanley Kubrick’s masterpiece, 2001. That’s actually not a bad analogy, aside from the whole “spark of humanity” thing.

Networking devices have been around for decades now, and many of them are very mature products, running very mature code. Executives like to use the word mature to describe something that’s been around long enough to have all of the bugs worked out. Developers don’t always agree with the usage of this word.

The problem is that some of this code has been around for decades too. In keeping with our monolithic analogy, imagine a switch that was first brought to market in, say, the year 2001. Now imagine that this switch is still in production 12 years later, and the software is up to around version 13, only instead of calling it version 13, let’s call it something else, say, version 215. You know, because the number 13 is bad luck in many cultures, especially those that worship monolithic code.

Anyway, imagine that the initial software written for the switch consumed about 10 MB of disk space. Now imagine that every year, for the next 12 years, more and more features were added to that code. Imagine that the switch became so full of features that after those 12 years, the code grew to consume 100 MB of disk space.

Technology advances at a frightening pace. Features expected as standard offerings today might not have been conceivable 12 years before. New hardware becomes more complex, which requires more complex code, all of which is added on to the original code base. If the original coders didn’t anticipate 12 years of growth, then they may not have made the original code easily expandable. Maybe, in the past 12 years, memory architectures changed. If you’re as old as I am, you might remember a time when 640 KB was supposedly more memory than a computer would ever need. Today, my Mac Pro has 24 GB of RAM. Things change. I can’t imagine running a DOS machine today. From where I sit in July 2012, Windows XP was all the rage about a decade ago. Would you want a switch based on Windows XP?

That’s not all, though. Even if the code was written well, and all those years of added routines were also written well, the fact remains that it’s a single giant chunk of code that gets shoveled into memory. If you load an image that supports Spanning Tree Protocol (STP), but you don’t need STP, the STP code is still in memory, sitting there, consuming space. Worse, what would happen if that one piece of code crashed? Because that code is just one routine in the giant pile of monolithic code, the entire switch would crash. Bummer.

So there must be a better way, right? Sure there is! Linux changed everything in the server world, and it’s doing so in the networking world as well. With Linux, the switch runs a kernel and a process manager. The process manager manages each process (hence the clever name) and restarts any that crash. With this model, should STP crash, it wouldn’t take the rest of the switch down with it. Or would it?

To properly protect a Linux system from misbehaving processes, each process should be written to use its own user virtual address space. Without getting into a server architecture discussion, understand that this type of memory is protected space that is only addressable by the process that owns it. If written using this model, then should STP crash, only STP is affected; the rest of the switch continues to run unaffected.
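To make the isolation idea concrete, here is a minimal sketch in Python rather than anything resembling switch code. It assumes a POSIX system (Linux or similar). The child program and the names `CHILD_SOURCE` and `parent_state` are made up for illustration. The child builds some private state and then dies a hard death, while the parent carries on with its own memory untouched:

```python
import signal
import subprocess
import sys

# Hypothetical child program: it builds private state, then crashes hard
# by sending SIGKILL to itself.
CHILD_SOURCE = """
import os, signal
topology = {"root": "this switch"}    # lives only in the child's address space
os.kill(os.getpid(), signal.SIGKILL)  # simulate an abrupt crash
"""


def run_isolated_child():
    """Run the crashing child in its own process and return its exit status."""
    result = subprocess.run([sys.executable, "-c", CHILD_SOURCE])
    return result.returncode


# State owned by the parent; the child cannot reach it.
parent_state = {"interfaces_up": 2}

status = run_isolated_child()

# On POSIX, a negative return code means the child was killed by that signal.
print("child exit status:", status)
print("parent state after child crash:", parent_state)
```

The parent only learns about the crash through the exit status. That is the same position a process manager is in when one of its processes dies.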

There are two problems with this paradigm. First, many vendors don’t properly utilize user virtual address space, so one misbehaving process often affects other processes. Second, if STP crashes, your network will reconverge when STP comes back online. This is because STP lost all of its known network topology, timer values, and such when it crashed, so it must start as if the switch had just been booted. Arista has solved both problems.

Arista code is well written, and follows strict rules regarding user virtual address space. If the STP process crashes, no other process will be affected. More impressively, though, all state information for all processes is stored in the central database called SysDB. If that doesn’t seem impressive, consider this: on an Arista switch, STP runs and stores all its state information in SysDB. When any process starts, the first thing it does is go to SysDB to retrieve state information. If it finds a state, it uses it. If not, it initializes. Let’s take that to the next logical conclusion.

When a switch is booted, STP loads, checks SysDB (just booted, no state found), determines its state (through convergence, etc.), and then writes that state to SysDB. For some reason, STP crashes. Because STP is isolated, no other processes are affected. When STP crashes, the process manager restarts STP. STP loads, after which it immediately checks SysDB, where it finds state information. STP loads that information, and no convergence is necessary.
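The check-then-restore sequence just described can be sketched in a few lines of Python. This is a toy model and not the real SysDB interface: `StateStore`, `StpAgent`, and the state fields are all names I invented for illustration. The point is only the startup logic: look in the central store first, and converge from scratch only if nothing is there.

```python
class StateStore:
    """Toy stand-in for a central state database (not the real SysDB API)."""

    def __init__(self):
        self._data = {}

    def get(self, agent_name):
        return self._data.get(agent_name)

    def put(self, agent_name, state):
        self._data[agent_name] = dict(state)


class StpAgent:
    """Hypothetical agent that restores its state from the store on startup."""

    def __init__(self, store):
        self.store = store
        self.converged_from_scratch = False
        saved = store.get("Stp")
        if saved is not None:
            # State found: reuse it, no convergence needed.
            self.state = saved
        else:
            # Nothing stored (fresh boot): converge, then publish the result.
            self.state = self._converge()
            self.converged_from_scratch = True
            store.put("Stp", self.state)

    def _converge(self):
        # Placeholder for the real work of learning the topology.
        return {"mst1_root": "this switch", "Et5": "forwarding", "Et6": "forwarding"}


store = StateStore()
first_boot = StpAgent(store)     # converges and writes state
after_crash = StpAgent(store)    # a "restarted" agent finds the state waiting
print(first_boot.converged_from_scratch, after_crash.converged_from_scratch)
```

The store outlives the agent, so the second instance never has to run `_converge()`.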

Don’t believe me? Let’s have a look. First, I’ll add a timestamp to my command prompt. By issuing the prompt %H[%D{%T}]%p command, the prompt will change to hostname[HH:MM:SS]command-mode:

SW1(config)#prompt %H[%D{%T}]%p
SW1[21:54:22](config)#

Sure, that’s a busy prompt, but it will help to illustrate how quickly the next few events take place.
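If the prompt tokens look cryptic, this short Python sketch shows how such a template expands. It is an illustration only, not how EOS implements the prompt command. The function name `render_prompt` is hypothetical, and it handles just the three tokens used above: %H for the hostname, %D{...} for a time format, and %p for the command mode.

```python
import re
from datetime import datetime


def render_prompt(template, hostname, mode, now):
    """Expand %H, %D{...}, and %p in a prompt template (illustrative only)."""

    def expand_time(match):
        # Treat %T as shorthand for HH:MM:SS so the result is portable.
        time_format = match.group(1).replace("%T", "%H:%M:%S")
        return now.strftime(time_format)

    # Expand the time token first so its inner %H is not mistaken for hostname.
    expanded = re.sub(r"%D\{([^}]*)\}", expand_time, template)
    return expanded.replace("%H", hostname).replace("%p", mode)


example = render_prompt(
    "%H[%D{%T}]%p", "SW1", "(config)#", datetime(2012, 7, 1, 21, 54, 22)
)
print(example)  # SW1[21:54:22](config)#
```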

Warning

What I’m about to do is something that you shouldn’t do in a production environment. Though the impact will be negligible (that’s the point of this example), the complexities of a production environment should never be taken for granted.

I’ve built a simple network between two switches, as shown in Figure 10-1. There are two switches, SW1 and SW2. SW1 is an Arista switch, and SW2 is a terrific example of a monolithic code switch from another vendor. The network has a loop built into it, and SW1 is the root of Spanning-Tree MST instance 1. The interface Gi1/0/16 is blocking on the non-Arista switch.

Figure 10-1. Simple network built for abusing EOS processes

With this network humming along nicely, and STP performing as it should, we’ll go in and attempt to destroy it by killing STP on the Arista switch.

There are a number of ways to kill a process in an Arista switch. First, the hard way, which involves dropping to bash, finding the process ID, and killing it with the -9 (SIGKILL) flag:

SW1[22:27:49]#bash

Arista Networks EOS shell

The process names all begin with capital letters, so remember that when searching for the process:

[admin@SW1 ~]$ ps -ef | grep Stp
root      1512  1505  0 10:13 ?        00:00:05 StpTopology     -d -i
root     12713 12712  0 22:26 ?        00:00:00 Stp             -d -i
admin    12760 12739  0 22:27 pts/2    00:00:00 grep --color=auto Stp

Now that we know the process ID, we can kill it. Remember, the process is owned by root, so you’ll need to use sudo:

[admin@SW1 ~]$ sudo kill -9 12713

The process manager sees the untimely death of the STP process so quickly that I couldn’t capture output with it missing:

[admin@SW1 ~]$ ps -ef | grep Stp
root      1512  1505  0 10:13 ?        00:00:05 StpTopology     -d -i
root     12777 12776  0 22:28 ?        00:00:00 Stp             -d -i
admin    12824 12739  0 22:30 pts/2    00:00:00 grep --color=auto Stp
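Picking the right PID out of that output by eye is easy to get wrong, since StpTopology and the grep command itself both match the search. As a sketch, here is how the selection could be done in Python against the first ps listing shown above. The function name `find_pids` is my own, and the parsing assumes the eight-column `ps -ef` layout seen in this output:

```python
# The first ps listing from above, pasted in as sample input.
PS_OUTPUT = """\
root      1512  1505  0 10:13 ?        00:00:05 StpTopology     -d -i
root     12713 12712  0 22:26 ?        00:00:00 Stp             -d -i
admin    12760 12739  0 22:27 pts/2    00:00:00 grep --color=auto Stp
"""


def find_pids(ps_output, process_name):
    """Return PIDs whose command name matches process_name exactly."""
    pids = []
    for line in ps_output.splitlines():
        # Columns: UID PID PPID C STIME TTY TIME CMD (CMD may contain spaces).
        fields = line.split(None, 7)
        if len(fields) < 8:
            continue
        command_name = fields[7].split()[0]
        if command_name == process_name:
            pids.append(int(fields[1]))
    return pids


print(find_pids(PS_OUTPUT, "Stp"))  # [12713]
```

Matching on the exact command name is what keeps StpTopology and grep out of the result.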

Of course, we could also just use the much friendlier killall command:

[admin@SW1 ~]$ sudo killall -9 Stp

That actually lets us kill STP with one command that can be called from within the CLI, so that helps to make my example flow better. Back in CLI, I can use the bash command along with the desired command. Watch the timestamps as I show the status of MST1, kill STP, then show the status of MST1 again:

SW1[22:35:33]#sho spanning-tree mst 1
##### MST1    vlans mapped:    1-1024
Bridge        address 001c.7308.80ae  priority     4097 (4096 sysid 1)
Root          this switch for MST1

Interface        Role       State      Cost      Prio.Nbr Type
---------------- ---------- ---------- --------- -------- -------------
Et5              designated forwarding 20000     128.5    P2p
Et6              designated forwarding 20000     128.6    P2p

SW1[22:35:34]#bash sudo killall -9 Stp
SW1[22:35:35]#sho spanning-tree mst 1
##### MST1    vlans mapped:    1-1024
Bridge        address 001c.7308.80ae  priority     4097 (4096 sysid 1)
Root          this switch for MST1

Interface        Role       State      Cost      Prio.Nbr Type
---------------- ---------- ---------- --------- -------- -------------
Et5              designated forwarding 20000     128.5    P2p
Et6              designated forwarding 20000     128.6    P2p

Within a second of killing the STP process, the process manager restarted it, and thanks to SysDB, STP never missed a beat. When the STP process started, it immediately requested the current state of the network from SysDB. Since there was a state found there, it simply used that instead of needing to start from an unknown network topology. Since this all happened well within the window of STP timers, the network didn’t reconverge, and the other switch didn’t even realize there was a problem.
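To put rough numbers on "well within the window," here is a small worked calculation in Python. The two timestamps come straight from the prompts in the transcript. The timer values are an assumption on my part: I'm using the common rapid spanning tree defaults of a 2-second hello time, with a neighbor discarding information after three missed hellos. The timers actually configured in this lab aren't shown, so treat the result as illustrative.

```python
from datetime import datetime

# Prompt timestamps from the transcript: just before the kill, and at the
# second show command.
before_kill = datetime.strptime("22:35:34", "%H:%M:%S")
after_restart = datetime.strptime("22:35:35", "%H:%M:%S")
outage_seconds = (after_restart - before_kill).total_seconds()

# ASSUMPTION: protocol defaults, not values shown in this chapter.
HELLO_TIME_SECONDS = 2
MISSED_HELLOS_BEFORE_AGING = 3
neighbor_patience_seconds = HELLO_TIME_SECONDS * MISSED_HELLOS_BEFORE_AGING

restart_is_invisible = outage_seconds < neighbor_patience_seconds
print(
    f"outage about {outage_seconds:.0f}s, "
    f"neighbor waits {neighbor_patience_seconds}s, "
    f"invisible to neighbor: {restart_is_invisible}"
)
```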

I should warn you; be careful when killing processes. If you’re a bit overzealous, you’ll likely annoy the process manager, which may scold you with the following message:

SW1[22:42:19]#Jul  1 22:35:36 SW1 ProcMgr-worker:
%PROCMGR-3-PROCESS_DELAYRESTART: 'Stp' (PID=12994) restarted too often!
Delaying restart for 120.0 (HH:MM:SS)
(until 2012-07-01 22:44:19.942820)

That’s one more reason not to play with this stuff in a production environment! Here’s what the other switch in my test network reported during my two minutes in the penalty box:

SW2#
Jul  1 18:31:43: %SW_MATM-4-MACFLAP_NOTIF: Host 001c.7308.80ae in
vlan 100 is flapping between port Gi1/0/15 and port Gi1/0/16
SW2#
Jul  1 18:31:58: %SW_MATM-4-MACFLAP_NOTIF: Host 001c.7308.80ae in
vlan 100 is flapping between port Gi1/0/15 and port Gi1/0/16
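The PROCMGR message above suggests that the process manager counts restarts and imposes a delay when they come too fast. I don't know the actual rules it uses, so the following Python sketch is a guess at the general technique and not a description of ProcMgr. The class name `RestartThrottle` and the thresholds (more than three crashes within 60 seconds) are hypothetical. Only the 120-second penalty comes from the log message.

```python
from collections import deque


class RestartThrottle:
    """Decide how long to wait before restarting a crashed process."""

    def __init__(self, max_restarts=3, window=60.0, penalty=120.0):
        self.max_restarts = max_restarts  # hypothetical threshold
        self.window = window              # hypothetical window, in seconds
        self.penalty = penalty            # 120 seconds, as seen in the log
        self._recent = deque()            # times of recent crashes

    def delay_before_restart(self, now):
        """Record a crash at time `now`; return seconds to wait before restart."""
        # Forget crashes that have aged out of the window.
        while self._recent and now - self._recent[0] > self.window:
            self._recent.popleft()
        self._recent.append(now)
        if len(self._recent) > self.max_restarts:
            # Too many crashes too quickly: start over after a penalty delay.
            self._recent.clear()
            return self.penalty
        return 0.0


throttle = RestartThrottle()
delays = [throttle.delay_before_restart(t) for t in (0, 5, 10, 15)]
print(delays)  # [0.0, 0.0, 0.0, 120.0]
```

Passing the clock in as an argument keeps the sketch deterministic. A real supervisor would read the time itself and then sleep for the returned delay.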

Warning

That’s worth another warning. Don’t kill processes in a production switch unless you really know what you’re doing, or Arista Support tells you to, or you’re looking to get fired anyway.

As a quick aside, if you start playing around with process killing, you’ll likely discover the agent command. With this command, you can kill any process without resorting to bash commands. To kill the STP process, the following command would have the same effect as the bash sudo killall -9 Stp command we used earlier:

VM-SW3#agent stp terminate
Stp was terminated

Be warned though, if you issue the agent process-name shutdown command in configuration mode, the process will not come back! I can’t tell you how many times I’ve used the agent Stp shutdown command instead of the agent Stp terminate command, and then spent a half hour trying to figure out why STP wasn’t working. Luckily, agent process-name shutdown commands appear in the running-config, so when I finally get to the show running-config stage of my panicked troubleshooting, I’m gently reminded that I’m an idiot.
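Since those shutdown commands land in the running-config, a quick scan of the config text can save the half hour of panic. Here is a minimal Python sketch of that check. The function name `shutdown_agents` and the sample configuration are made up for illustration, and the pattern assumes the command appears in the config in the same agent process-name shutdown form in which it was entered.

```python
import re

# Hypothetical running-config excerpt, used only as sample input.
SAMPLE_RUNNING_CONFIG = """\
hostname SW1
agent Stp shutdown
interface Ethernet5
"""


def shutdown_agents(running_config):
    """Return the names of agents that the config has shut down."""
    pattern = re.compile(r"^\s*agent\s+(\S+)\s+shutdown\s*$")
    names = []
    for line in running_config.splitlines():
        match = pattern.match(line)
        if match:
            names.append(match.group(1))
    return names


print(shutdown_agents(SAMPLE_RUNNING_CONFIG))  # ['Stp']
```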

We’ve seen what happens when STP gets killed, which illustrates the power and benefit of SysDB, but what happens if SysDB itself dies a horrible, unnatural death? Let’s find out!

SW1#bash sudo killall -9 Sysdb
SW1#Jun 30 17:36:47 SW1 Fru: %FWK-3-SOCKET_CLOSE_REMOTE: Connection
to Sysdb (pid:28194) at tbl://sysdb/ closed by peer (EOF)
Jun 30 17:36:47 SW1 Fru: %FWK-3-SELOR_PEER_CLOSED: Peer closed
socket connection. (tbl://sysdb/-in)(Sysdb (pid:28194))

Jun 30 17:36:47 SW1 Fru: %FWK-3-SELOR_EXIT: Process exiting.
Jun 30 17:36:47 SW1 Launcher: %FWK-3-SOCKET_CLOSE_REMOTE: Connection
to Sysdb (pid:28194) at tbl://sysdb/ closed by peer (EOF)

[--- Lots of processes screaming about SysDB dying removed ---]

Jun 30 17:36:47 SW1 PhyAeluros: %FWK-3-SELOR_PEER_CLOSED: Peer
closed socket connection. (tbl://sysdb/-in)(Sysdb (pid:28194))
Jun 30 17:36:47 SW1 PhyAeluros: %FWK-3-SELOR_EXIT: Process exiting.
Connection to Sysdb (pid:28194) at tbl://sysdb/ closed by peer (EOF)
Exiting because NboAttrLog connection has closed.

SW1 login:

In a nutshell, after I killed the process at the heart of the entire switch, the switch recovered gracefully. To be fair, gracefully means that all the processes needed to reinitialize, and the end result was the same as rebooting the switch, but consider this: on my Arista 7124SX, it takes 1 minute and 45 seconds from the point where I enter the reload now command to the point where I get a login prompt (it takes longer if I pull the power since the hardware needs to initialize). If I kill SysDB, I get a login prompt after 15 seconds.
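For the record, here is the arithmetic on those two timings as a few lines of Python. Both numbers come from my 7124SX as stated above, and other platforms will differ.

```python
# Timings quoted above for an Arista 7124SX.
reload_seconds = 1 * 60 + 45        # "reload now" to login prompt
sysdb_restart_seconds = 15          # killing SysDB to login prompt

seconds_saved = reload_seconds - sysdb_restart_seconds
speedup = reload_seconds / sysdb_restart_seconds

print(f"saved {seconds_saved} seconds, {speedup:.0f}x faster to a login prompt")
```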

Let’s have some fun and use that to our advantage. Since the running-config resides in SysDB, and SysDB is reinitialized when it’s killed, I’m going to load a completely new configuration on the switch without rebooting. Ever been frustrated by needing to reboot in order to load a completely new configuration? This isn’t a total cure, but it sure beats rebooting.

First, I’ll save the current startup-config to a file named OLD-Config:

SW1#copy startup-config OLD-Config

Next, I’ll remove the current startup-config with the write erase command:

SW1#write erase

I’ve written a simple configuration file that has nothing except the defaults and a changed hostname, just to show that the new configuration actually gets loaded. The new hostname will be GAD. The file name for this configuration is GAD-Config:

SW1#copy GAD-Config startup-config

At this point, my switch is running according to the configuration loaded in the running-config, but when it boots, it will load the configuration in startup-config, which has my new hostname installed. Remember, if I were to reboot the switch, it would take almost two minutes. Instead, I’ll kill SysDB and watch what happens:

SW1#bash sudo killall -9 Sysdb
SW1#Jun 30 18:01:25 SW1 Fru: %FWK-3-SOCKET_CLOSE_REMOTE: Connection
to Sysdb (pid:2359) at tbl://sysdb/ closed by peer (EOF)
Jun 30 18:01:25 SW1 Launcher: %FWK-3-SOCKET_CLOSE_REMOTE: Connection
 to Sysdb (pid:2359) at tbl://sysdb/ closed by peer (EOF)
Jun 30 18:01:25 SW1 Fru: %FWK-3-SELOR_PEER_CLOSED: Peer closed socket
connection. (tbl://sysdb/-in)(Sysdb (pid:2359))

[--- Processes whining about SysDB removed ---]

Jun 30 18:01:25 SW1 PhyAeluros: %FWK-3-SELOR_PEER_CLOSED: Peer closed
socket connection. (tbl://sysdb/-in)(Sysdb (pid:2359))
Jun 30 18:01:25 SW1 PhyAeluros: %FWK-3-SELOR_EXIT: Process exiting.

SW1 login:

Note

Another way to accomplish this, which is probably a lot cleaner, is to use the service ProcMgr restart command, which will have the same result, but allow the system to perform a graceful restart. My goal in this chapter was to show how EOS handles a catastrophic failure, and graceful restarts are boring, so I decided to kill SysDB with reckless abandon instead.

Notice that the hostname hasn’t changed? This is because the hostname shown at this stage is actually the Linux hostname. Once I log in, the CLI process will report the newly configured EOS hostname:

SW1 login: admin
Last login: Sat Jun 30 21:47:12 on ttyS0
GAD>

And just like that, in 15 seconds flat, I’ve reloaded my Arista switch with a new configuration. Try that with a monolithic code–based switch from another vendor! To be completely fair, the time until all interfaces are up, all protocols settle, and things like MLAG stabilize will depend on the switch platform and the configuration.

Note

If you decide to play around with killing SysDB, be warned that as stated, it has the same effect as rebooting the switch. That means that unsaved changes to the running-config will be lost. Since killing SysDB isn’t really graceful, there’s no warning about unsaved changes. But if you’re doing things like killing SysDB, then you must know what you’re doing, right?
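The relationship between the two configurations and a SysDB restart can be modeled in a few lines of Python. This is a toy model under my own assumptions and not EOS code: the class `ToySwitch` and its method names are invented, and each configuration is reduced to a simple dictionary. It shows both effects described above: the replaced startup-config takes over, and unsaved running-config changes vanish without warning.

```python
class ToySwitch:
    """Toy model of startup-config, running-config, and a state-db restart."""

    def __init__(self, startup_config):
        self.startup_config = dict(startup_config)
        # At boot, the running-config is loaded from the startup-config.
        self.running_config = dict(startup_config)

    def configure(self, key, value):
        # Changes made at the CLI affect only the running-config.
        self.running_config[key] = value

    def replace_startup(self, new_config):
        # Models: copy GAD-Config startup-config
        self.startup_config = dict(new_config)

    def restart_state_db(self):
        # Models killing SysDB: state is rebuilt from the startup-config,
        # and anything unsaved in the running-config is gone.
        self.running_config = dict(self.startup_config)


switch = ToySwitch({"hostname": "SW1"})
switch.configure("vlan", "100")               # unsaved change
switch.replace_startup({"hostname": "GAD"})   # new startup-config
switch.restart_state_db()
print(switch.running_config)  # {'hostname': 'GAD'}
```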

SysDB is the heart of Arista’s EOS, and as I’ve hopefully shown in this chapter, the resiliency of SysDB along with the power of Linux makes for an amazingly robust switch operating system that almost can’t be killed. Now if you’ll excuse me, I have to go try and piece together the running-config I lost from that last example.
