Chapter 29. Troubleshooting

There are some pretty useful diagnostic tools on Arista switches, some of which we’ve already covered, such as tcpdump. Sometimes, we need to know more detail about what the switch is doing, and that’s where performance monitoring comes into play.

Logs

In Linux, you can follow live updates to a log file by using the tail -f filename command. You can do something similar when viewing your system log in EOS with the command-line interface (CLI) command show logging follow. Syslog messages are stored in Linux where any Linux person would expect to see them, namely, in /var/log/:

[admin@Arista ~]$ ls /var/log
EosInit           error.log              messages
NorCalInit        eventMon.db            nginx-access.log
Post              fneserver              nginx-error.log
agents            inotifyrun-local.log   ntpstats
btmp              inotifyrun-secure.log  ppp
cli               inotifyrun-sys.log     qt
cron              kernel.debug           secure
eos               lastlog                spooler
eos-console       libvirt                startup-config-output
eos-console-sync  logrot.log             tallylog
eos-monitor       maillog                wtmp
eos-monitor-sync  mcelog                 yum.log

/var/log/messages contains all of the syslog messages that you would see on the console. Note that you need sudo to view this file:

[admin@Arista ~]$ cd /var/log
[admin@Arista log]$ tail messages
tail: cannot open 'messages' for reading: Permission denied
[admin@Arista log]$ sudo tail messages
Nov 29 02:07:34 Arista Stp: %SPANTREE-6-STABLE_CHANGE: Stp state
is now stable
Nov 29 02:07:56 Arista Cli: %SYS-5-CONFIG_I: Configured from
console by admin on vty3 (10.0.0.100)
Nov 29 02:08:01 Arista CROND[10746]: (root) CMD
(/etc/archivecheck.sh &> /dev/null)
Nov 29 02:08:33 Arista Cli: %SYS-5-CONFIG_E: Enter configuration
mode from console by admin on vty3 (10.0.0.100)
[--output truncated--]

The filesystem in EOS resides in memory, so anything outside of flash: will not survive a reboot. Additionally, some switches can be shipped with solid-state drives (SSDs), and if you have a switch that contains one, the system automatically archives many of the system log files onto that drive, which is drive: in CLI and /mnt/drive in Linux:

[admin@Arista log]$ cd /mnt/drive
[admin@Arista drive]$ ls
aquota.user
archive
lost+found
var_archive.2016-11-21-03:20:02.dir
var_archive.2016-11-21-06:19:05.dir
var_archive.2016-11-21-14:30:02.dir
var_archive.2016-11-21-19:15:02.dir
var_archive.2016-11-27-02:30:02.dir
var_archive.2016-11-27-22:29:02.dir
var_archive.2016-11-27-22:45:02.dir

Performance Monitoring

A great tool on Linux systems is the top command. The top command produces output that automatically updates every few seconds (the default is three seconds in Linux). You can call this command from the CLI with the show process top command, or from Bash with the command top. Here’s a sample output from a live 7280SE-72. This is a very healthy switch with nothing unusual going on. Depending on the switch platform you’re using, the processes near the top might change:

Arista(config)#sho proc top once
top - 02:22:23 up  1:42,  1 user,  load average: 0.28, 0.29, 0.29
Tasks: 289 total,   1 running, 288 sleeping,   0 stopped,   0 zombie
%Cpu(s):  7.0 us,  1.2 sy,  0.0 ni, 91.5 id,  0.1 wa,  0.1 hi,  0.0 si,  0.0 st
KiB Mem:   3796192 total,  3640996 used,   155196 free,   225716 buffers
KiB Swap:        0 total,        0 used,        0 free,  1909320 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
 3853 root      20   0  985m 264m 118m S   7.9  7.1   5:14.88 SandFap
 3463 root      20   0  708m  97m  20m S   4.0  2.6   0:01.45 Aaa
 3882 root      20   0  698m 114m  37m S   4.0  3.1   0:15.62 SandMact
11881 admin     20   0 25892  12m 9028 R   4.0  0.3   0:00.11 top
    1 root      20   0  7900 3992 2500 S   0.0  0.1   0:02.22 systemd
    2 root      20   0     0    0    0 S   0.0  0.0   0:00.00 kthreadd
    3 root      20   0     0    0    0 S   0.0  0.0   0:00.06 ksoftirqd/0
    5 root      20   0     0    0    0 S   0.0  0.0   0:00.09 kworker/u:0
    6 root      rt   0     0    0    0 S   0.0  0.0   0:00.19 migration/0
    8 root      rt   0     0    0    0 S   0.0  0.0   0:00.19 migration/1
    9 root      20   0     0    0    0 S   0.0  0.0   0:00.00 kworker/1:0
   10 root      20   0     0    0    0 S   0.0  0.0   0:00.06 ksoftirqd/1
   11 root      20   0     0    0    0 S   0.0  0.0   0:00.07 kworker/0:1
   13 root      rt   0     0    0    0 S   0.0  0.0   0:00.52 migration/2

Because I’m a maniac with programming skills, I wrote a program called corecrusher that consumes 100% CPU resources for any number of CPU cores. Here I ran it for 30 seconds so that it would consume three cores. Notice that the CPU idle is at only 21.9% and you can see that there are three processes called corecrusher, each consuming 99.8% of a CPU core.

Arista(config)#sho proc top once

top - 02:32:00 up  1:51,  1 user,  load average: 1.41, 0.60, 0.40
Tasks: 293 total,   5 running, 288 sleeping,   0 stopped,   0 zombie
%Cpu(s): 77.4 us,  0.7 sy,  0.0 ni, 21.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st
KiB Mem:   3796192 total,  3646380 used,   149812 free,   226364 buffers
KiB Swap:        0 total,        0 used,        0 free,  1911628 cached

  PID USER      PR  NI  VIRT  RES  SHR S  %CPU %MEM    TIME+  COMMAND
12573 admin     20   0  5040  652  332 R  99.8  0.0   0:27.50 corecrusher
12574 admin     20   0  5040  652  332 R  99.8  0.0   0:27.48 corecrusher
12575 admin     20   0  5040  652  332 R  99.8  0.0   0:27.54 corecrusher
 3853 root      20   0  985m 264m 118m S   8.8  7.1   5:40.01 SandFap
12601 admin     20   0 25872  12m 9140 R   5.9  0.3   0:00.19 top
 3553 root      20   0  689m  99m  24m R   2.9  2.7   1:40.72 Smbus
    1 root      20   0  7900 3992 2500 S   0.0  0.1   0:02.26 systemd
    2 root      20   0     0    0    0 S   0.0  0.0   0:00.00 kthreadd
    3 root      20   0     0    0    0 S   0.0  0.0   0:00.06 ksoftirqd/0
    5 root      20   0     0    0    0 S   0.0  0.0   0:00.09 kworker/u:0
    6 root      rt   0     0    0    0 S   0.0  0.0   0:00.19 migration/0
    8 root      rt   0     0    0    0 S   0.0  0.0   0:00.19 migration/1

There are a variety of ways to sort this output, but the default is by CPU utilization. For 99% of what you’d use this tool for, that’s fine. The %CPU column is what determines the process’s place on the list, but remember that processes on a Linux system can bounce up and down on this list in milliseconds, and you will likely see just that as you watch.

I recommend doing some further reading on the top command online, but there are a couple of things to always look at when you use it. To begin, the very first line shows three numbers after the words load average:. These numbers will vary from switch to switch, and they’re not necessarily an indication of anything unless you know what they are normally. In other words, a 7280R switch in one environment might run with a load average of (for example) 0.9, whereas another might run at an average of 1.5. The load average is the average number of processes in the run queue over the interval in question (1 min, 5 mins, 15 mins), and there are so many things that contribute to these numbers that they’re not terribly useful on a switch, in my opinion. That being said, if you see a number like 25.99, there’s probably something bad happening.

The next bold line in the output shows this:

%Cpu(s): 77.4 us,  0.7 sy,  0.0 ni, 21.9 id,  0.0 wa,  0.0 hi,  0.0 si,  0.0 st

This is a very important line, though again, it’s only for the snapshot of time when the last top iteration ran. In other words, it’s not historical; it’s a snapshot. The numbers from left to right indicate the following values:

77.4%us

The combined CPUs spent 77.4% of their time on user processes (mostly agents on an Arista switch).

0.7%sy

The combined CPUs spent 0.7% of their time on system processes (kernel, etc.).

0.0%ni

The combined CPUs spent 0% of their time on user processes that have been niced (this is a Unix thing used for setting priorities that you really don’t need to worry about). Unless you’ve really messed around in Bash, this should probably always be 0, though there are some processes like khelper that pop up with a –20 nice from time to time in order to keep them high priority (a higher niceness means a lower priority).

21.9%id

The combined CPUs are spending 21.9% of their time completely idle. The bigger this number is, the happier the switch is. Note that this is not strictly true, because the forwarding of frames is done by the Application-Specific Integrated Circuit (ASIC) so even with 100% CPU utilization, the switch will likely be operating well.

0.0%wa

The CPU spent 0% of its time waiting for I/O to complete.

0.0%hi

The CPU spent 0% of its time servicing hardware interrupts.

0.0%si

The CPU spent 0% of its time servicing software interrupts.

0.0%st

The CPU spent 0% of its time stolen by a hypervisor.

On a healthy switch, idle should be high, whereas system and user should be low. On a busier switch, user might spike up and down as processes vie for the CPU’s attention. Remember, though, that this is just control-plane stuff like Open Shortest Path First (OSPF), Border Gateway Protocol (BGP), and Spanning Tree Protocol (STP). Packets are forwarded by ASIC, and not the CPU. Still, high CPU should be monitored, and you should contact Arista support if your CPU stays unreasonably high.

Tracing Agents (Debugging)

Debugging processes is significantly different in EOS than it is in other vendors’ industry-standard operating system environments. The amount of debug information available in EOS is staggering, but figuring out where to look can be a bit overwhelming if you’re used to using other vendors’ debug commands. After you get the hang of how it works in EOS, though, you’ll be amazed at the power. You’ll also quite possibly never do this unless TAC tells you to, but seeing as how there are limited places to see this documented, I figured it would be a good addition to this book.

First, EOS now has a debug command, but it’s fairly limited in scope, covering only some basic IP and OSPF topics:

Arista(config)#debug ip ?
  general  General debugging
  ospf     Ospf protocol

In EOS, the concept of debugging is mostly replaced by the concept of tracing. Almost any process (EOS agents) can be traced with—you guessed it—the trace command. You can’t just go typing trace ospf packets, though, so step away from the keyboard and keep reading.

Tracing in EOS can produce an overwhelming amount of data, so by default, the output of your trace does not go to the console or to your Secure Shell (SSH) sessions. Because of the large amount of data, output first must be directed to a file. In this section, I show you how to view the output to the screen via a couple of methods, but let’s take a look at how to get a trace started first.

Warning

Be careful when using trace. It can consume system resources, though unlike debug in other vendors’ operating systems, it does not take priority over other processes, nor does it stop everything else to deliver its output to the console. (Seriously, who thought that was a good idea?)

Using OSPF as an example, it quickly becomes obvious that you don’t trace protocols in EOS, but rather agents. In fact, all agents have hooks in them for the trace system, which is a pretty clever way for developers to gather information from their agents via built-in methods. To learn the name of the agent that you want to trace, issue the show agent names command:

Arista#sho agent names
Aaa
Acl
AgentMonitor
AradLanz
Arp
ArpInspection
Asu
Avago
Bfd
BugAlert
CapiApp
Cdp
Chl822X

[ -- lots of output removed in the interest of brevity --]

Vm
VmTracer
Vxlan
VxlanController
VxlanSwFwd
Wbem
Xcvr
XcvrSlice
Xmpp
ZeroTouch

Now we have a list of agents that we can trace. One of the frustrating things about this process is that it might not always be obvious what you’re looking for. For example, in this list, there are no agents named Ospf, Bgp, or Rip, and the agents that are listed don’t have very useful descriptions. That output, however, is from EOS 4.15.5M, and things have changed a bit on newer code.

On newer revisions of code (EOS 4.21.1F in this example), there are actually agents for BGP and OSPF:

Arista(config)#sho agent name | egrep -i "ospf|bgp"
Bgp
Ospf
Ospf-vrf
Ospf3
Ospf3-vrf

Those agents aren’t running by default, though.

Arista(config)#sho trace Bgp
% Agent 'Bgp' is not running

If you’re running FlexRoute (Chapter 20) on a supported Arista switch, those agents will be running:

Arista(config)#service routing protocols model multi-agent  
Arista(config)#sho trace Bgp
Global trace settings for agent Bgp
-----------------------------------------------
Tracing sent to stderr

date:     enabled
time:     enabled
[--output truncated--]

Back to a default switch, if you know a bit about routing, you might recognize the idea of a Routing Information Base (RIB), and there’s a process for that, so let’s see where that leads us. To find out details about what you can trace within an agent, use the show trace agent-name command. Let’s do this for the Rib process:

Arista(config)#sho trace Rib
Global trace settings for agent Rib
-----------------------------------------------
Tracing sent to stderr

date:     enabled
time:     enabled
PID:           disabled
TID:           enabled
facility name: enabled  (width 20)
trace level:   enabled
filename:      disabled (width 20)
line number:   disabled
function name: disabled (width 20)

Trace facility settings for agent Rib is  
-----------------------------------------------
AclAggregator        enabled  ............
AclApi               enabled  ............
AdjacencyHelper      enabled  ............
Agent                enabled  ............
AgentIdManager       enabled  ............
AgentLogging         enabled  ............
AgentStageCommonSm   enabled  ............
AgentStartupModeSm   enabled  ............
ArcherBackup         enabled  ............
ArcherEm             enabled  ............
ArrowATPServerMain   enabled  ............
ArrowArRTOp          enabled  ............
ArrowArRtOp          enabled  ............
ArrowArTableInterface enabled  ............
ArrowArUtils         enabled  ............
ArrowClientLib       enabled  ............
ArrowDB              enabled  ............
ArrowLocalTable      enabled  ............
ArrowProxy           enabled  ............
ArrowSerDes          enabled  ............
ArrowServer          enabled  ............
BfdStatAgentLib      enabled  ............
BfdStaticRouteSm     enabled  ............
BgpSmashInfo         enabled  ............
ConnectionManager    enabled  ............
Dash::Local<bool>    enabled  ............

[-- Wow, there's a lot of stuff (removed) --]

Rib                  enabled  ............
Rib::Adv             enabled  ............
Rib::Bgp             enabled  ............
Rib::Bgp::Keepalive  enabled  ............
Rib::Bgp::Normal     enabled  ............
Rib::Bgp::Notification enabled  ..........
Rib::Bgp::Open       enabled  ............
Rib::Bgp::Policy     enabled  ............
Rib::Bgp::Route      enabled  ............
Rib::Bgp::State      enabled  ............
Rib::Bgp::Task       enabled  ............
Rib::Bgp::Timer      enabled  ............
Rib::Bgp::Update     enabled  ............
Rib::DebugMessages   enabled  ............
Rib::DynPolicyRoutes enabled  ............
Rib::Isis            enabled  ............
Rib::Mio             enabled  ............
Rib::Normal          enabled  ............
Rib::Ospf            enabled  ............
Rib::Ospf3           enabled  ............
Rib::Parse           enabled  ............
Rib::Policy          enabled  ............
Rib::Rip             enabled  ............
Rib::Route           enabled  ............
Rib::RouteControl    enabled  ............
Rib::State           enabled  ............
Rib::Task            enabled  ............
Rib::Te              enabled  ............
Rib::Timer           enabled  ............
Rib::Tracing         enabled  ............

[-- Even more stuff removed --]

VxlanIntfAggregator  enabled  ............
Watchdog             enabled  ............
XFire                enabled  ............
XFireServer          enabled  ............

We found it! We’re so clever.

Note

If you are running FlexRoute, the Rib process will not be running.

Now that we know the name of the agent (Rib), we can begin to specify trace commands. The first step should always be to point the output to a file. You can put the file anywhere, but Arista recommends that it be stored on flash: or, better yet, drive: if your switch has an SSD drive, given that those locations usually have the most space and can survive a reboot. To do so, use the trace agent-name filename filename command in configuration mode:

Arista(config)#trace Rib filename flash:Rib.txt
Note

Notice how the agent names all begin with capital letters? That’s not a requirement on the command line for specifying the agent name. I’m just obsessive. If you specify a filename with a capital letter like I have, that filename will need to be consistent when referencing it later.

Now there are two paths you can take. You can either trace it all (generally a bad idea), or you can restrict what you want to see. Because we’re looking to trace BGP, we need to include only the items with BGP in the name, which I’ve taken the liberty of highlighting in bold. Actually, there are a couple more, so let’s try a different way to see them:

Arista(config)#sho trace Rib | grep Bgp
BgpSmashInfo         enabled  ............
GatedBgpHelper       enabled  ............
Rib::Bgp             enabled  ............
Rib::Bgp::Keepalive  enabled  ............
Rib::Bgp::Normal     enabled  ............
Rib::Bgp::Notification enabled  ............
Rib::Bgp::Open       enabled  ............
Rib::Bgp::Policy     enabled  ............
Rib::Bgp::Route      enabled  ............
Rib::Bgp::State      enabled  ............
Rib::Bgp::Task       enabled  ............
Rib::Bgp::Timer      enabled  ............
Rib::Bgp::Update     enabled  ............
RibBgpPlugin         enabled  ............
RibGatedBgp          enabled  ............

Note that these Rib::Bgp entries appear only if BGP is running.

To include only these entries, use the trace agent-name enable trace-facility-name levels command. Let me show you how to figure all that out. First, we know the agent name (Rib), so let’s begin there:

Arista(config)#trace Rib enable ?
  WORD  Trace facility name

The Trace facility name is one or more entries from that long list we got from using the show trace Rib command. Look at the leftmost lines in bold from that output. They all begin with Rib::Bgp::. To include them all, we can use a wildcard and specify Rib::Bgp::*, as shown in the next example.

Note

This is one more place where capitalization matters. If you specify rib::bgp::* (no caps), it won’t work. If you use caps the way you might think you should such as RIB::BGP::*, that also will not work. You need to specify exactly what’s listed from the output of show trace agent-name, including proper case and the right number of colons.

That leaves one more item to determine: levels.

Arista(config)#trace Rib enable Rib::Bgp::* ?
  all     Enable tracing at all levels
  levels  Enable tracing at one or more levels

This one is easy, just use all. Sure, you can drill down even further if you’d like, and to do so, specify the word levels with a question mark (?) to see what the possibilities are:

Arista(config)#trace Rib enable Rib::Bgp::* levels ?
  0         Enable tracing at level 0
  1         Enable tracing at level 1
  2         Enable tracing at level 2
  3         Enable tracing at level 3
  4         Enable tracing at level 4
  5         Enable tracing at level 5
  6         Enable tracing at level 6
  7         Enable tracing at level 7
  8         Enable tracing at level 8
  9         Enable tracing at level 9
  coverage  Enable tracing at level coverage
  function  Enable tracing at level function

Like I said, just use all. (But feel free to mess with each setting to your heart’s content. It’s not like I can stop you!)

Arista(config)#trace Rib enable Rib::Bgp::* all

Note that this will alter the output of the show trace Rib command to include references to what is being traced:

Arista(config)#sho trace Rib | grep Bgp
Trace facility settings for agent Rib is Rib::Bgp::Task/*cf,
Rib::Bgp::Normal/*cf,Rib::Bgp::Keepalive/*cf,Rib::Bgp::
Update/*cf,Rib::Bgp::State/*cf,Rib::Bgp::Timer/*cf,Rib::Bgp/*cf,Rib::Bgp::
Open/*cf,Rib::Bgp::Notification/*cf,Rib::Bgp::Route/*cf,Rib::Bgp::Policy/*cf
BgpSmashInfo         enabled  ............
GatedBgpHelper       enabled  ............
Rib::Bgp             enabled  0123456789cf
Rib::Bgp::Keepalive  enabled  0123456789cf
Rib::Bgp::Normal     enabled  0123456789cf
Rib::Bgp::Notification enabled  0123456789cf
Rib::Bgp::Open       enabled  0123456789cf
Rib::Bgp::Policy     enabled  0123456789cf
Rib::Bgp::Route      enabled  0123456789cf
Rib::Bgp::State      enabled  0123456789cf
Rib::Bgp::Task       enabled  0123456789cf
Rib::Bgp::Timer      enabled  0123456789cf
Rib::Bgp::Update     enabled  0123456789cf
RibBgpPlugin         enabled  ............
RibGatedBgp          enabled  ............
SrTeBgpPolicyStatus  enabled  ............

The numbers in the right side of the Rib::Bgp:: entries indicate the logging levels. Because we chose all for the logging level, they’re all included. Had we chosen only level 9, for example, only the 9 would be shown for those entries.

At this point, the trace is active, and provided there is something going on with the agent you’ve chosen, the file will begin to fill up:

Spine-1(config)#dir Rib*
Directory of flash:/Rib*

       -rwx       28464           Jan 31 14:15  Rib.txt

3440762880 bytes total (724103168 bytes free)

As a quick aside, the trace rib enable Rib::Bgp::* all command will be expanded in the running-config file to something that looks like this, which is also shown in the output of show trace Rib:

trace Rib setting Rib::Bgp::Task/*cf,Rib::Bgp::Normal/*cf,
Rib::Bgp::Keepalive/*cf,Rib::Bgp::Update/*cf,Rib::Bgp::State/*cf,
Rib::Bgp::Timer/*cf,Rib::Bgp::Open/*cf,Rib::Bgp::Notification/*cf,
Rib::Bgp::Route/*cf,Rib::Bgp::Policy/*cf

Don’t worry about that. It’s just all of the items you included by using the asterisk wildcard on one line, separated by commas.

A file that’s quietly filling with trace output isn’t very useful by itself, so let’s see what’s in it. First, you can monitor the output directly with the trace monitor agent-name command. Press Ctrl-C to break out of this output. This is not like using the terminal monitor command in that other company’s operating system; when you issue the trace monitor agent-name command, you cannot use the CLI until you break the output:

Arista(config)#trace monitor Rib
--- Monitoring /mnt/flash/Rib.txt ---
14:17:08.654507 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option
 TTL(16) value 1
14:17:08.654538 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option
 VirtualRoutingandForwarding(39): not supported
14:17:08.654572 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option
 TOS(17) value 192
14:17:08.654624 task_addr_local: task BGP_65006.10.10.6.2+179 address 10.10.6.1
14:17:08.654658 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option
 DontRoute(5) value 1
14:17:08.654689 BGP NORMAL: bgp_need_peer_resolution: timeout is 0, resolution
 disabled
14:17:08.654842 task_connect: task BGP_65006.10.10.6.2+179 addr 10.10.6.2+179:
 Operation now in progress
14:17:08.654899 task_timer_uset: timer BGP_65006.10.10.6.2+179_Connect <OneShot
 Set Processing> set to offset 2:28 at 14:19:36
14:17:08.654930 bgp_set_connect_timer: 10.10.6.2 (AS 65006) timer started with
 timeout 148
14:17:08.654986 task_timer_dispatch: returned from BGP_65006.10.10.6.2+179_
Connect, rescheduled in 2:28 at 14:19:36
14:17:11.654925 task_process_sockets: processing BGP_65006.10.10.6.2+179 fd 101
 event 8220

The method I prefer to use is more Unix centric. Using the tail filename command shows you the last few lines of a file. For a constantly updating file like this one, the tail –f filename command continuously updates until broken (again, with Ctrl-C):

Arista(config)#bash tail -f /mnt/flash/Rib.txt
14:19:03.700397 BGP TASK: bgp_recv_msg_job: processed 1 messages. yield_now no
14:19:03.700426 task_job_delete: delete background job recv msgq job for task
 BGP.0.0.0.0+179
14:19:03.700518 task_timer_dispatch: calling BGP_65002.10.10.2.2+60402_HoldTime,
 late by 0.000
14:19:03.700580 task_timer_uset: timer BGP_65002.10.10.2.2+60402_HoldTime
 <OneShot Set Processing> set to offset 2:59.999796 at 14:22:03
14:19:03.700629 task_timer_dispatch: returned from BGP_65002.10.10.2.2+60402_
HoldTime, rescheduled in 2:59.999 at 14:22:03
14:19:07.550984 task_timer_dispatch: calling BGP.0.0.0.0+179_PolicyCache_Cleanup,
 late by 0.000
14:19:07.551150 task_job_create: create background job DedupPtrCache Purge for
 task BGP.0.0.0.0+179
14:19:07.551185 task_timer_dispatch: returned from BGP.0.0.0.0+179_PolicyCache_
Cleanup, timer now inactive
14:19:07.551223 task_job_delete: delete background job DedupPtrCache Purge for
 task BGP.0.0.0.0+179
14:19:07.551273 task_timer_uset: timer BGP.0.0.0.0+179_PolicyCache_Cleanup
 <OneShot> set to offset 10 at 14:19:17

I like this method because I can pipe using grep (which doesn’t work with trace from the CLI):

Arista(config)#trace monitor Rib | grep 65006
% Invalid input

Here’s what I’m looking for by using Bash commands from Cli:

Arista(config)#bash tail -f /mnt/flash/Rib.txt | grep 65006
14:22:04.655769 task_timer_dispatch: calling BGP_65006.10.10.6.2+179_Connect,
 late by 0.000
14:22:04.655874 bgp_connect_timeout: BGP_65006.10.10.6.2+179_Connect
14:22:04.655937 bgp_event: peer 10.10.6.2 (AS 65006) old state Active event
 ConnectRetry new state Connect
14:22:04.655981 bgp_trap_backward: sending trap for 10.10.6.2 (AS 65006)
14:22:04.656247 task_set_socket: task BGP_65006.10.10.6.2+179 socket 101
14:22:04.656296 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option
 NonBlocking(8) value 1
14:22:04.656339 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option
 ReUseAddress(3) value 1
14:22:04.656380 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option
 TTL(16) value 1
14:22:04.656419 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option
 VirtualRoutingandForwarding(39): not supported
14:22:04.656458 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option
 TOS(17) value 192
14:22:04.656553 task_addr_local: task BGP_65006.10.10.6.2+179 address 10.10.6.1
14:22:04.656593 task_set_option: task BGP_65006.10.10.6.2+179 socket 101 option
 DontRoute(5) value 1

Turn It Off!

Traces stay running even when you’re not looking at them, and they will run, forever consuming disk space, unless you shut them off. Here’s a directory listing from my switch where I left a few traces running for only an hour!

Agent(config)#dir *.txt
Directory of flash:/*.txt

       -rwx    122612063           Jan 7 03:50  RIB-Debug.txt
       -rwx    152724544           Jan 11 00:22  Rib.txt
       -rwx    189626048           Jan 11 00:22  STP.txt

1862512640 bytes total (0 bytes free)

That’s 45 megabytes of file space chewed up by something that I don’t need. Let’s not forget that those traces are still running, which means they’re wasting system resources.

Note

Though I did run out of disk space, that’s more likely due to the fact that this is my test lab switch, which has 4 different versions of EOS on the flash: drive, each of which is over 500 MB. If you run these Trace files in your home directory (which runs in memory, remember), then you may cause more significant harm by filling up a filesystem in EOS.

Unfortunately, at least as of EOS version 4.21.1F, I know of no way to disable all traces at once like you can with the undebug all command in other operating systems. To disable tracing, use the no trace agent-name enable * all command. This works regardless of what other trace facility names you may have chosen, such as Rib::Bgp::.

Remember that you might have multiple traces running, too, and you’ll need to stop them all individually.

Arista(config)#no trace Sysdb enable * all
Arista(config)#no trace Rib enable * all

Don’t forget that although you’ve stopped the traces, the files still reside on disk. If you want to save them elsewhere, you can copy them off using a variety of protocols such as FTP, SCP, TFTP, and more. Here’s a list:

Arista(config)#copy flash:Rib.txt ?
  boot-extensions  Copy to boot extensions configuration
  certificate:     Destination file path
  drive:           Destination file path
  extension:       Destination file path
  file:            Destination file path
  flash:           Destination file path
  ftp:             Destination file path
  http:            Destination file path
  https:           Destination file path
  running-config   Update (merge with) current system configuration
  scp:             Destination file path
  sftp:            Destination file path
  sslkey:          Destination file path
  startup-config   Copy to startup configuration
  system:          Destination file path
  terminal:        Destination file path
  tftp:            Destination file path

Usually, I’ve gotten what I’ve needed from them, so I just delete them, instead:

Arista(config)#del Rib.txt
Arista(config)#del Sysdb.txt

With the traces disabled, leaving the filename entries cluttering the configuration doesn’t make any sense, so as a final step, we can clean those up too:

Arista(config)#no trace Rib filename flash:/Rib.txt
Arista(config)#no trace Stp filename flash:/Sysdb.txt

Tracing can seem a bit overwhelming if you’re used to other vendors’ offerings and the well-known debug commands, but I think if you spend some time getting used to this format, you’ll quickly discover that it’s more powerful than debug and certainly more flexible. The fact that it won’t lock up the switch pushing logs to the console is a nice bonus, too.

Honestly, though, most people I know only use Trace when TAC tells them to.

CLI Standalone Mode

Here’s a neat instance of something not going quite according to plan. In this case, I have a switch with roughly 400,000 static routes (because I’m a menace and that’s what I do). I did a config replace on it with a base configuration so that I could work on another chapter, which has tied up the CPU removing all of those CLI commands. When I SSH to the switch during that process, check out what happens:

gad@[AW-Lab]:~$ switch-01
Warning: Permanently added 'student-01,10.0.0.1' (ECDSA) to the list of
 known hosts.

Last login: Wed Jan 30 21:29:28 2019 from 10.1.0.100



Cannot connect to CLI. Entering standalone shell.


Note: Standalone CLI does not share data with other agents, but it can
run certain commands such as 'bash' to troubleshoot the system within 
authentication and authorization constraints.

Standalone>

Whoa! In all the years that I’ve been working on Arista switches, I’d never seen this. Let’s see what we can see!

Standalone>en
Standalone#sho run
! Command: show running-config
! device: Student-01 (unknown-hardware, EOS-4.21.1F)
!
! boot system flash:/EOS-4.21.1F.swi
!
prompt Standalone%R%v%P
!
transceiver qsfp default-mode 4x10G
!
spanning-tree mode mstp
!
no aaa root
!
no ip routing
!
end

Well that’s new. What about interfaces?

Standalone#sho int
Standalone#

Nothing! There’s not even a management interface! The management interface is a kernel interface in Linux, and so it is not controlled by the ASIC, and I can’t even see that! Some poking around has found that this is a special Cli mode that loads without access to SysDB or any other agents. Here’s the relevant help output from the Cli command in Linux:

[admin@switch-02 ~]$ Cli --help | grep -A1 "without Sysdb"
  -l, --standalone      run in standalone mode (without Sysdb). Implies
                        --disable-guards.(also see --standalone-guards

At this point, I mutter, “fascinating” in my best Mr. Spock voice and then try to trigger standalone mode myself. I couldn’t get that to work and quickly gave up because I had other things to do.

What’s the point of this? To show you that if you do see this, not to panic. After about a minute, I was able to SSH to the switch normally, even with my standalone shell still active. I just thought it was an interesting (fascinating, even) curiosity and thought I’d share it.

Bottom line: if you see this, don’t panic. Wait a minute or two and try to SSH again. I bet it will work. If it doesn’t after a reasonable amount of time, something more serious might be afoot, at which point it might be time to call TAC. Or reboot. Maybe panic. It all depends on your environment and caffeine intake.

Arista Support

Arista Support, or TAC (Technical Assistance Center), is a fantastic resource, and my experience with it has been far superior to that other company that has nothing to do with crystallized cottonseed oil. TAC has helped me when I’ve discovered bugs, it has helped me diagnose configuration problems, and on the rare occasion when we had a switch go bad under warranty, TAC shipped us a new one in no time. After you’re registered with Arista Support as a valid customer, opening a TAC case is as simple as calling, or if the issue isn’t critical, you can use email, as well. And with your valid and registered email address, you can download documentation, EOS revisions, whitepapers, and a host of other useful documents from the Arista.com website.

There is also a fabulous online forum and repository of knowledge at the EOS Central website. At this site, you can post a question to the forum, and you might be surprised to find one of the company founders answering it. There is also a development blog and some tech tips that are well worth a look.

Of course, you can always send me an email, too, but I’m a cranky, old, recalcitrant nerd who doesn’t have a lot of free time, so you’d probably do better emailing Arista. That is, of course, unless you want to tell someone how much you like this book, in which case I’m probably the better choice.

Conclusion

The concept of troubleshooting is one that I am asked about quite a bit, and it’s a tough topic to write about within a single chapter. The idea of writing about common things that go wrong is really what people want, but in my experience, I don’t really see common things that go wrong aside from my favorite thing to troubleshoot, which is user error. My favorite complaint from students and entrenched engineers alike is, “I did everything right, but it still doesn’t work.”

..................Content has been hidden....................

You can't read the all page of ebook, please click here login for view all page.
Reset
3.137.188.120