Chapter 7. SysDB

One of the most impressive features of Arista’s Extensible Operating System (EOS) that makes it stand out from the competition is SysDB. Simply put, SysDB is a System Database on the switch that holds all of the state, variables, and any other important information so that processes can access it. Doesn’t sound too earth-shattering, now does it? Read on.

Traditionally, switches (and every other networking device out there) were built using monolithic code. Naturally, when I read the word monolithic, I think of apes dancing around the monolith in Stanley Kubrick’s masterpiece, 2001: A Space Odyssey. That’s actually not a bad analogy, aside from the whole “spark of humanity” thing.

Networking devices have been around for decades now, and many of them are very mature products, running very mature code. Executives like to use the word mature to describe something that’s been around long enough to have all of the bugs worked out. Developers don’t always agree with the usage of this word.

The problem is that some of this code has been around for decades, too. In keeping with our monolithic analogy, imagine a switch that was first brought to market in, say, the year 2001. Now imagine that this switch is still in production 12 years later, and the software is up to around version 13, only instead of calling it version 13, let’s call it something else, say, version 15. You know, because the number 13 is bad luck in many cultures, especially those that worship monolithic code.

Anyway, imagine that the initial software written for the switch consumed about 10 MB of disk space. Now imagine that every year, for the next 12 years, more and more features were added to that code. Imagine that the switch became so full of features that after those 12 years, the code grew to consume 100 MB of disk space.

Technology advances at a frightening pace, and features expected as standard offerings today might not have been conceivable 12 years ago. New hardware becomes more complex, which requires more complex code, all of which is added on to the original code base. If the original coders didn’t anticipate 12 years of growth, they might not have made the original code easily expandable. Maybe, in the past 12 years, memory architectures changed. If you’re as old as I am, you might remember a time when 640 K was more memory than a computer would ever need. Today, my Mac Pro has 128 GB of RAM. Things change. I can’t imagine running a DOS machine today. Twelve years before this writing, Windows XP was all the rage. Would you want a switch based on Windows XP?

That’s not all, though. Even if the code was written well and all those years of added routines were also written well, the fact remains that it’s a single giant chunk of code that gets shoveled into memory. If you load an image that supports Spanning Tree Protocol (STP) but you don’t need STP, the STP process is still in memory, sitting there, consuming space. Not only would that process consume space, but what would happen if that one piece of code crashed? With that code being only a routine in the giant pile of monolithic code, the entire switch would crash. Bummer.

So, there must be a better way, right? Sure there is! Linux changed everything in the server world, and it’s doing so in the networking world, as well. With Linux, the switch runs a kernel and a process manager. The process manager manages each process (hence the clever name) and restarts them if they crash. With this model, should STP crash, it wouldn’t take the rest of the switch down with it. Or would it?

To properly protect a Linux system from misbehaving processes, each process should be written to use its own user virtual address space. Without getting into a server architecture discussion, understand that this type of memory is protected space that is addressable only by the process that owns it. If written using this model, should STP crash, only STP is affected; the rest of the switch continues to run unaffected.
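The two ideas so far, a process manager that restarts crashed processes and processes that can't take each other down, can be sketched in a few lines of Python. This is only an illustration and not how EOS is implemented: the supervise function, the CHILD script, and the marker file are hypothetical names invented for the example, and the sketch assumes a POSIX system where SIGKILL exists. The child kills itself with SIGKILL on its first run (much like a kill -9), and the supervisor, living in its own process, survives and restarts it.

```python
import os
import subprocess
import sys
import tempfile

# Hypothetical child "agent": on its first run it kills itself with SIGKILL;
# on later runs it exits cleanly. A marker file remembers the first run.
CHILD = """
import os, signal, sys
marker = sys.argv[1]
if not os.path.exists(marker):
    open(marker, "w").close()
    os.kill(os.getpid(), signal.SIGKILL)
"""


def supervise(cmd, max_restarts=3):
    """Run cmd, restarting it after every abnormal exit; return the restart count."""
    restarts = 0
    # A nonzero return code (negative when killed by a signal) means a crash.
    while subprocess.run(cmd).returncode != 0:
        if restarts == max_restarts:
            raise RuntimeError("agent keeps crashing; giving up")
        restarts += 1
    return restarts


with tempfile.TemporaryDirectory() as tmp:
    marker = os.path.join(tmp, "crashed-once")
    # The child dies once, yet the supervisor keeps running and restarts it.
    restart_count = supervise([sys.executable, "-c", CHILD, marker])

print("restarts needed:", restart_count)
```

Because the child runs in a separate process with its own address space, its violent death shows up in the supervisor as nothing more than a return code.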

There are two problems with this paradigm. First, many vendors don’t properly utilize user virtual address space, so one misbehaving process often affects other processes. Second, if STP crashes, your network will reconverge when STP comes back online. This is because STP lost all of its known network topology, timer values, and such when it crashed, so it must start as if the switch were just booted. Arista has solved both problems.

Arista code is well written and follows strict rules regarding user virtual address space. If the STP process crashes, no other process will be affected. More impressively, though, all state information for all processes is stored in the central database called SysDB. To further explain, consider this: on an Arista switch, STP runs and stores all of its state information in SysDB. When any process starts, the first thing it does is go to SysDB to retrieve state information. If it finds a state, it uses it. If it does not find state information, it initializes. Let’s take that to the next logical conclusion.

When a switch is booted, STP loads, checks SysDB (just booted, no state found), determines its state (through convergence, etc.), and then writes that state to SysDB. For some reason, STP crashes. Because STP is isolated, no other processes are affected. When STP crashes, the process manager restarts STP. STP loads, after which it immediately checks SysDB, where it finds state information. STP loads that information, and no convergence is necessary. This is a very big deal!
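The check-SysDB-first pattern described above can be modeled with a toy state store. The following Python sketch is an assumption-laden illustration, not Arista code: StateStore, StpAgent, and the converged_from_scratch flag are hypothetical names, and the "convergence" is a placeholder that simply returns port states.

```python
class StateStore:
    """Toy stand-in for SysDB: it outlives the agents that read and write it."""

    def __init__(self):
        self._state = {}

    def get(self, agent_name):
        return self._state.get(agent_name)

    def put(self, agent_name, state):
        self._state[agent_name] = dict(state)


class StpAgent:
    """Hypothetical agent that checks the store before doing any work."""

    def __init__(self, store):
        saved = store.get("Stp")
        if saved is None:
            # Nothing stored (fresh boot): converge, then publish the result.
            self.converged_from_scratch = True
            self.ports = self._converge()
            store.put("Stp", self.ports)
        else:
            # State found (restart after a crash): reuse it, no convergence.
            self.converged_from_scratch = False
            self.ports = dict(saved)

    def _converge(self):
        # Placeholder for real protocol convergence.
        return {"Et47": "forwarding", "Et48": "forwarding"}


sysdb = StateStore()
first = StpAgent(sysdb)              # boot: no state, so the agent converges
first_cold = first.converged_from_scratch
del first                            # the agent "crashes"; the store survives
second = StpAgent(sysdb)             # restart: state found and reused
print(first_cold, second.converged_from_scratch, second.ports)
```

The design point is that the state lives outside the agent, so losing the agent does not mean losing what it learned.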

Don’t believe me? Let’s have a look.

Warning

What I’m about to do is something that you shouldn’t do in a production environment. Though the impact will be negligible (that’s the point of this example), you should never take the complexities of a production environment for granted.

I’ve built a simple network between two switches, Arista-1 and Arista-2, as shown in Figure 7-1. Arista-1 is the root, and Arista-2 is in its default configuration.

Figure 7-1. Simple network built for abusing EOS processes

With this network humming along nicely and STP performing as it should, let’s go in and attempt to destroy it by killing STP on the Arista-1 switch. But first, let’s look at what Spanning Tree sees on both sides. Here’s Arista-1:

Arista-1(config)#sho spanning-tree  
MST0
  Spanning tree enabled protocol mstp
  Root ID    Priority    8192
             Address     2899.3a06.696f
             This bridge is the root

  Bridge ID  Priority     8192  (priority 8192 sys-id-ext 0)
             Address     2899.3a06.696f
             Hello Time  2.000 sec  Max Age 20 sec  Forward Delay 15 sec


Interface        Role       State      Cost      Prio.Nbr Type
---------------- ---------- ---------- --------- -------- --------------------
Et47             designated forwarding 2000      128.47   P2p          
Et48             designated forwarding 2000      128.48   P2p

And here’s the output from Arista-2:

Arista-2#sho spanning-tree
MST0
  Spanning tree enabled protocol mstp
  Root ID    Priority    8192
             Address     2899.3a06.696f
             Cost        0 (Ext) 2000 (Int)
             Port        47 (Ethernet47)
             Hello Time  2.000 sec  Max Age 20 sec  Forward Delay 15 sec

  Bridge ID  Priority    16384  (priority 16384 sys-id-ext 0)
             Address     2899.3a06.6b2b
             Hello Time  2.000 sec  Max Age 20 sec  Forward Delay 15 sec

Interface        Role       State      Cost      Prio.Nbr Type
---------------- ---------- ---------- --------- -------- --------------------
Et47             root       forwarding 2000      128.47   P2p        
Et48             alternate  discarding 2000      128.48   P2p

As we can see, Arista-1 is the root, and Et48 on Arista-2 is discarding, just as we expected.

Before we continue, I’d like to show you what the log entries look like when the Multiple Spanning Tree (MST) agent is initialized. To do that, I’m going to shut off MST entirely, restart it, and then show the logs:

Arista-1(config)#send log message [ Shutting down STP ]
Arista-1(config)#spanning-tree mode none

Jan 23 18:30:37 Arista-1 ConfigAgent: %SYS-6-LOGMSG_INFO: Message from admin on  
vty3 (10.0.0.100): [ Shutting down STP ]
Jan 23 18:30:45 Arista-1 StpTopology: %SPANTREE-6-DISABLED: STP is disabled.
Jan 23 18:30:45 Arista-1 Stp: %SPANTREE-6-STABLE_CHANGE: Stp is now not stable
Jan 23 18:30:45 Arista-1 Stp: %SPANTREE-6-INTERFACE_DEL: Interface Ethernet47 has
 been removed from instance MST0
Jan 23 18:30:45 Arista-1 Stp: %SPANTREE-6-INTERFACE_DEL: Interface Ethernet48 has
 been removed from instance MST0
Jan 23 18:30:46 Arista-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface
 Vlan50, changed state to down
Jan 23 18:30:46 Arista-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface
 Vlan50, changed state to up
Jan 23 18:31:15 Arista-1 Stp: %SPANTREE-6-STABLE_CHANGE: Stp state is now stable
Arista-1(config)#show spanning-tree
Spanning-tree has been disabled in the configuration.

Now, I start MST. Pay attention to the entries for Ethernet 47 and 48 in the log:

Arista-1(config)#send log message [ Starting STP ]
Arista-1(config)#spanning-tree mode mst
Arista-1(config)#show log last 5 min
Jan 23 18:33:25 Arista-1 ConfigAgent: %SYS-6-LOGMSG_INFO: Message from admin on
vty3 (10.0.0.100): [ Starting STP ]
Jan 23 18:33:31 Arista-1 StpTopology: %SPANTREE-6-MODE_CHANGE: Spanning tree mode
 is now mstp.
Jan 23 18:33:32 Arista-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface
 Vlan50, changed state to down
Jan 23 18:33:32 Arista-1 Stp: %SPANTREE-6-STABLE_CHANGE: Stp is now not stable
Jan 23 18:33:32 Arista-1 Stp: %SPANTREE-6-ROOTCHANGE: Root changed for instance
 MST0: new root interface is (none), new root bridge mac address is
 28:99:3a:06:69:6f (this switch)
Jan 23 18:33:32 Arista-1 Stp: %SPANTREE-6-INTERFACE_ADD: Interface Ethernet48 has
 been added to instance MST0
Jan 23 18:33:32 Arista-1 Stp: %SPANTREE-6-INTERFACE_ADD: Interface Ethernet47 has
 been added to instance MST0
Jan 23 18:33:32 Arista-1 Ebra: %LINEPROTO-5-UPDOWN: Line protocol on Interface
 Vlan50, changed state to up
Jan 23 18:33:32 Arista-1 Stp: %SPANTREE-6-INTERFACE_STATE: Interface Ethernet48
 instance MST0 moving from discarding to learning
Jan 23 18:33:32 Arista-1 Stp: %SPANTREE-6-INTERFACE_STATE: Interface Ethernet47
 instance MST0 moving from discarding to learning
Jan 23 18:33:33 Arista-1 Stp: %SPANTREE-6-INTERFACE_STATE: Interface Ethernet48
 instance MST0 moving from learning to forwarding
Jan 23 18:33:33 Arista-1 Stp: %SPANTREE-6-INTERFACE_STATE: Interface Ethernet47
 instance MST0 moving from learning to forwarding

The key entries here are the final four INTERFACE_STATE messages, which show that when MST is initialized, it moves the interfaces from discarding into learning and then from learning into forwarding.

Now, let’s wreak some havoc (as if that weren’t enough havoc). There are a number of ways to kill a process on an Arista switch, but I’m going to use one of the most ungraceful: killing everything that even resembles a process with the name “Stp” (MST is a form of the Spanning Tree Protocol and is thus controlled by the agent named Stp). I’m also going to kill it with extreme prejudice (that’s what the “-9” means). Seriously, don’t ever do this unless you’re in a lab environment.

Warning

That’s worth another warning. Don’t kill processes in a production switch unless you really know what you’re doing, Arista Support tells you to, or you’re looking to get fired anyway.

Arista-1(config)#send log message [ Killing STP ]
Arista-1(config)#bash sudo killall -9 Stp
Arista-1(config)#show log last 5 min
Jan 23 18:38:12 Arista-1 ConfigAgent: %SYS-6-LOGMSG_INFO: Message from admin on
 vty3 (10.0.0.100): [ Killing STP ]
Jan 23 18:38:18 Arista-1 Stp: %AGENT-6-INITIALIZED: Agent Stp 
 initialized; pid=8338
Jan 23 18:38:18 Arista-1 Stp: %SPANTREE-6-STABLE_CHANGE: Stp state is now stable
Jan 23 18:38:18 Arista-1 Stp: %SPANTREE-6-INTERFACE_ADD: Interface Ethernet47 has
 been added to instance MST0
Jan 23 18:38:18 Arista-1 Stp: %SPANTREE-6-INTERFACE_ADD: Interface Ethernet48 has
 been added to instance MST0

Notice what’s missing? There are no moving from discarding to learning or moving from learning to forwarding messages. That’s because the state was already known, so the interfaces didn’t need to go through that process.
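If you'd rather not eyeball the logs, the comparison can be automated. Here is a small Python sketch that pulls the port-state transitions out of log text; the regular expression is based only on the message format shown in the listings above, and the function and variable names are my own. The sample strings are excerpts from those two listings with the wrapped lines rejoined.

```python
import re

# Matches the port-state transitions logged by the Stp agent.
TRANSITION = re.compile(
    r"%SPANTREE-6-INTERFACE_STATE: Interface (\S+) instance \S+ "
    r"moving from (\w+) to (\w+)"
)


def transitions(log_text):
    """Return (interface, old_state, new_state) for each transition logged."""
    return TRANSITION.findall(log_text)


PREFIX = "Jan 23 18:33:3%d Arista-1 Stp: %%SPANTREE-6-INTERFACE_STATE: Interface "
COLD_START = "\n".join([
    PREFIX % 2 + "Ethernet48 instance MST0 moving from discarding to learning",
    PREFIX % 2 + "Ethernet47 instance MST0 moving from discarding to learning",
    PREFIX % 3 + "Ethernet48 instance MST0 moving from learning to forwarding",
    PREFIX % 3 + "Ethernet47 instance MST0 moving from learning to forwarding",
])

AFTER_KILL = "\n".join([
    "Jan 23 18:38:18 Arista-1 Stp: %AGENT-6-INITIALIZED: Agent Stp initialized; pid=8338",
    "Jan 23 18:38:18 Arista-1 Stp: %SPANTREE-6-STABLE_CHANGE: Stp state is now stable",
    "Jan 23 18:38:18 Arista-1 Stp: %SPANTREE-6-INTERFACE_ADD: Interface Ethernet47 has been added to instance MST0",
    "Jan 23 18:38:18 Arista-1 Stp: %SPANTREE-6-INTERFACE_ADD: Interface Ethernet48 has been added to instance MST0",
])

print("cold start transitions:", len(transitions(COLD_START)))
print("after kill transitions:", len(transitions(AFTER_KILL)))
```

An empty result for the post-kill log is the programmatic version of "notice what's missing."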

Within a second (much less, actually) of killing the Stp agent, the process manager restarted it, and thanks to SysDB, Spanning Tree (MST) never missed a beat. When the Stp agent started, it immediately requested the current state of the network from SysDB. Because there was a state found there, it simply used that instead of needing to start from an unknown network topology. Because this all happened well within the window of MST timers, the network didn’t reconverge, and the other switch didn’t even realize that there was a problem.

As a quick aside, if you begin playing around with agent killing, you’ll likely discover the agent command. With this command, you can kill any agent without resorting to Bash commands. To kill the Stp process, the following command (which is undocumented in newer versions of code like 4.17.2F) would have the same effect as the Bash sudo killall -9 Stp command we used earlier:

VM-SW3#agent stp terminate
Stp was terminated

That’s an undocumented command on EOS 4.21.1F, and it’s likely undocumented for a reason, so you probably shouldn’t do that. Other options include environment and shutdown, both of which you should also leave alone unless Arista TAC or your sales/customer engineer recommends otherwise.

We’ve seen what happens when we kill Spanning Tree, which illustrates the power and benefit of SysDB, but what happens if SysDB itself dies a horrible, unnatural death? Let’s find out! But first, let’s save our configuration because if years of video games have taught me anything, it’s to save your game before a big boss battle:

Arista-1(config)#wri
Copy completed successfully.

OK, let’s run into that cave with a resounding scream of Leeeroooy Ah-Jenkins!

Arista-1(config)#bash sudo killall -9 Sysdb
Arista-1(config)#
Connection to Arista-1 closed.
gad@[Lab]:~$

Uh…that can’t be good. A couple of seconds later, though, I can get right back in. Here’s what I would have seen if I’d done that on the console:

Arista-1#bash sudo killall -9 Sysdb
Arista-1#Jan 23 18:51:29 Arista-1 Lldp: %FWK-3-SOCKET_CLOSE_REMOTE: Connection to
 Sysdb (pid:9129) at tbl://sysdb/+n closed by peer (EOF)
Jan 23 18:51:29 Arista-1 Bfd: %FWK-3-SOCKET_CLOSE_REMOTE: Connection to Sysdb
 (pid:9129) at tbl://sysdb/+n closed by peer (EOF)
Jan 23 18:51:29 Arista-1 Capi: %FWK-3-SOCKET_CLOSE_REMOTE: Connection to Sysdb
 (pid:9129) at tbl://sysdb/+n closed by peer (EOF)
Jan 23 18:51:29 Arista-1 Bfd: %FWK-3-MOUNT_PEER_CLOSED: Peer closed socket
 connection. (tbl://sysdb/+n-in)(Sysdb (pid:9129))
Jan 23 18:51:29 Arista-1 SuperServer: %FWK-3-SOCKET_CLOSE_REMOTE: Connection to
 Sysdb (pid:9129) at tbl://sysdb/+n closed by peer (EOF)
Jan 23 18:51:29 Arista-1 Bfd: %FWK-3-MOUNT_CLOSED_EXIT: Process exiting.

[-- much output removed --]

Jan 23 18:51:37 Arista-1 Sysdb: %AGENT-6-INITIALIZED: Agent 'Sysdb' initialized;
 pid=11321
Jan 23 18:51:40 Arista-1 ConfigAgent: %AGENT-6-INITIALIZED: Agent 'ConfigAgent'
 initialized; pid=11336
Jan 23 18:52:06 Arista-1 ConfigAgent: %SYS-5-CONFIG_E: Enter configuration mode
 from console by root on UnknownTty (UnknownIpAddr)
Jan 23 18:52:11 Arista-1 ConfigAgent: %SYS-5-CONFIG_I: Configured from console
 by root on UnknownTty (UnknownIpAddr)
Jan 23 18:52:18 Arista-1 StageMgr: %AGENT-6-INITIALIZED: Agent 'StageMgr'
 initialized; pid=11335
Jan 23 18:52:18 Arista-1 Fru: %AGENT-6-INITIALIZED: Agent 'Fru' initialized;
 pid=11338
Jan 23 18:52:18 Arista-1 Launcher: %AGENT-6-INITIALIZED: Agent 'Launcher'
 initialized; pid=11339

In a nutshell, after killing the process at the heart of the entire switch, it recovered gracefully. To be fair, gracefully means that all the processes needed to reinitialize, and the end result was the same as rebooting the switch, but consider this: on my Arista 7280R, it takes 4 minutes and 15 seconds from the point at which I enter the reload now command to the point when I get a login. If I kill SysDB, I get a login prompt again after about 30 seconds.

Let’s have some fun and use that to our advantage. Because the running-config resides in SysDB, and SysDB is reinitialized when it’s killed, I’m going to load a completely new configuration on the switch without rebooting. Ever been frustrated by needing to reboot in order to load a completely new configuration? This isn’t a total cure, but it sure beats rebooting.

First, I save the current startup-config to a file named OLD-Config:

Arista-1#copy startup-config OLD-Config
Copy completed successfully.

Next, I remove the current startup-config with the write erase command:

Arista-1#write erase
Proceed with erasing startup configuration? [confirm]<cr>
Arista-1#

I’ve written a simple configuration file that has nothing except the defaults and a changed hostname, just to show that the new configuration actually is loaded. The new hostname will be GAD-1. The filename for this configuration is GAD-Config:

Arista-1#copy flash:GAD-Config startup-config
Copy completed successfully.

At this point, my switch is running according to the configuration loaded in the running-config, but when it boots, it will load the configuration in the startup-config, which has my new hostname installed. Remember, if I were to reboot the switch, it would take more than four minutes. Instead, I’ll kill SysDB from the console and watch what happens:

Arista-1#bash sudo killall -9 Sysdb
Arista-1#Jan 23 19:07:49 Arista-1 StageMgr: %FWK-3-SOCKET_CLOSE_REMOTE:
 Connection to Sysdb (pid:4137) at tbl://sysdb/+n closed by peer (EOF)
Jan 23 19:07:49 Arista-1 Fru: %FWK-3-SOCKET_CLOSE_REMOTE: Connection to
 Sysdb (pid:4137) at tbl://sysdb/+n closed by peer (EOF)
Jan 23 19:07:49 Arista-1 ConfigAgent: %FWK-3-SOCKET_CLOSE_REMOTE: Connection
 to Sysdb (pid:4137) at tbl://sysdb/+n closed by peer (EOF)
Jan 23 19:07:49 Arista-1 Launcher: %FWK-3-SOCKET_CLOSE_REMOTE: Connection
 to Sysdb (pid:4137) at

[--- Processes whining about SysDB removed ---]

Jan 23 19:20:27 Arista-1 SandMact: %FWK-3-MOUNT_PEER_CLOSED: Peer closed socket
 connection. (tbl://sysdb/+n-in)(Sysdb (pid:9618))
Jan 23 19:20:27 Arista-1 SandMact: %FWK-3-MOUNT_CLOSED_EXIT: Process exiting.
Jan 23 19:20:27 Arista-1 SandFap: %FWK-3-SOCKET_CLOSE_REMOTE: Connection to Sysdb
 (pid:9618) at tbl://sysdb/+n closed by peer (EPIPE when writing)

GAD-1 login:
Jan 23 19:23:56 GAD-1 SandL3Unicast: %HARDWARE-3-DISRUPTIVE_RESTART:
 Non-disruptive restart of SandL3Unicast forwarding agent failed. All hardware
 entries in Routing table will be deleted and re-programmed in the hardware.
 Reason: agent was restarted twice in quick succession.

GAD-1 login:

And just like that, in 15 seconds flat, I’ve reloaded my Arista switch with a new configuration. Try that with a monolithic code–based switch from another vendor! To be completely fair, the time until all interfaces are up, all protocols settle, and things like Multi-Chassis Link Aggregation Group (MLAG) stabilize will depend on the switch platform and the configuration.

Note

If you decide to play around with killing SysDB, be warned that, as stated, it has the same effect as rebooting the switch. This means that unsaved changes to the running-config will be lost. Because killing SysDB isn’t really graceful, there’s no warning about unsaved changes. But if you’re doing things like killing SysDB, you must know what you’re doing, right?
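Because nothing warns you about unsaved changes, it can be worth checking for them yourself first. The Python sketch below compares two configuration texts with the standard difflib module. It assumes you have already obtained the startup-config and running-config as strings (how you fetch them is up to you and isn't shown), and the sample configurations, including the vlan 50 line, are hypothetical.

```python
import difflib


def unsaved_changes(startup, running):
    """Return the lines that differ between startup-config and running-config."""
    diff = difflib.unified_diff(
        startup.splitlines(),
        running.splitlines(),
        fromfile="startup-config",
        tofile="running-config",
        lineterm="",
    )
    # Keep only added or removed lines; skip the ---/+++ file headers.
    return [
        line for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# Hypothetical configuration snippets, for illustration only.
startup = "hostname Arista-1\nspanning-tree mode mstp\n"
running = "hostname Arista-1\nspanning-tree mode mstp\nvlan 50\n"

pending = unsaved_changes(startup, running)
print(pending or "nothing unsaved")
```

Anything in the returned list is what you would lose by killing SysDB before saving.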

Oh—did you notice that scary message at the end? Here it is again:

Jan 23 19:23:56 GAD-1 SandL3Unicast: %HARDWARE-3-DISRUPTIVE_RESTART:
 Non-disruptive restart of SandL3Unicast forwarding agent failed. All hardware
 entries in Routing table will be deleted and re-programmed in the hardware.
 Reason: agent was restarted twice in quick succession.

Yikes! Be warned: when you play around with killing agents, bad things can (and probably will) happen, so don’t do that!

Oh, and I should probably mention that on EOS releases 4.14 and higher, there is a very sexy command called configure replace. This allows you to specify a new running-config and load it without reloading the switch, without killing SysDB, and without interrupting processes that have not changed their configurations! Here I am loading my original configuration back on my 7280R running version 4.21.1F:

GAD-1#configure replace flash:OLD-Config
Arista-1#

Note that the only obvious change is the hostname, but it loaded a new configuration in seven seconds with no downtime. That’s much better than killing SysDB! For more details on the configure replace command, see Chapter 9.

SysDB is the heart of Arista’s EOS, and as I’ve hopefully shown in this chapter, the resiliency of SysDB along with the power of Linux makes for an amazingly robust switch operating system that almost can’t be killed.

As a last note, SysDB is an agent, so it resides in memory. Its contents are built after every boot, which means that it does not survive a reboot (that’s what the startup-config is for). Finally, SysDB is actually a very robust agent and, the last time I asked, had never spontaneously crashed in a customer network, so you don’t need to worry about seeing that in the real world. Just keep me away from your switches, and you’ll be fine.

Conclusion

In Chapter 5, I said that the fabric was the heart of the switch, and that’s true when it comes to the hardware. When it comes to the software, though, the heart of EOS has to be the concept of SysDB, the concept upon which almost everything else in the operating system is based.
