Chapter 1. Types of Network Automation

Automation has taken the world of networking by storm in the past couple of years. The speed and agility that automation provides to companies is unprecedented compared to the “old way” of doing things manually. In addition to speed, automation can bring many other benefits to networking, such as consistency in actions and reduction of human error.

Automation can be applied to many different processes and technologies in different ways in the world of networking. In this book, when we refer to network automation, we are referring to the automated management, configuration, testing, deployment, and operation of any kind of network environment, whether on premises or in the cloud. This chapter approaches automation by subdividing it into the following types and providing use cases and tools for each of them:

• Data-driven automation

• Task-based automation

• End-to-end automation

This division is not an industry standard but is a reflection of the most commonly seen outcomes from network automation.


Note

This is not a deeply technical chapter but rather an introductory one that describes automation architectures and common use cases. It lays the foundations for the upcoming chapters, which go into more detail on some of these topics.


Figure 1-1 provides an overview of the use cases described for each of these three types of automation in this chapter.


Figure 1-1 Automation Use Cases Overview

Data-Driven Automation

Everyone is talking about data, automation folks included. Data is undoubtedly among the most important components in a system. But what actually is data? Data is facts or statistics collected about something.

Data can be used to understand the state of a system, used to make decisions based on reasoning and/or calculations, or stored and used later for comparisons, among numerous other possibilities.

The possible insights derived by analyzing data also make data an invaluable asset for network automation solutions.

In Chapter 2, “Data for Network Automation,” you will see how to capture data for networking purposes, and in Chapter 3, “Using the Data of Your Network,” you will learn how to prepare and use the captured data.

What Data-Driven Automation Is

Data-driven automation involves automating actions based on data indicators. As a simple example, you could use data-driven automation to have fire sprinklers self-activate when heat levels are abnormally high. Data-driven automation is the prevailing type of automation in the networking world. Firewalls can be configured to block traffic (action) from noncompliant endpoints (compliance data), and monitoring systems can trigger alarms (action) based on unexpected network conditions such as high CPU usage on networking equipment (metric data).

Data-driven automation can take many shapes and forms, but it always starts from data. Data can be expressed in many formats, and we cover common industry formats in Chapter 2.

There are push and pull models in data-driven automation architectures. In push models, devices send information to a central place. The sensors in Figure 1-2, for example, send measurements to the management station, which receives them and can take action based on those readings. Examples of networking protocols that use a push model include Syslog, NetFlow, and SNMP traps.


Figure 1-2 Push Model Sensor Architecture

Pull models (see Figure 1-3) are the opposite of push models. In a pull model, the management station polls the sensors for readings and can take action based on the results. A networking protocol that utilizes a pull model is SNMP polling.


Figure 1-3 Pull Model Sensor Architecture

To better understand the difference between these two models, consider a real-life example involving daily news. With a pull model, you would go to a news website to read news articles. In contrast, with a push model, you would subscribe to a news feed, and the publisher would push notifications to your phone as news stories are published.

Push models are popular today because they allow for faster actions, as you don’t have to pull data at a specific time; Chapter 2 covers model-driven telemetry, which is an example of a push model. Modern push protocols allow you to subscribe to specific information, which will be sent to you as soon as it is available or at configurable fixed time intervals. However, in some use cases, the pull model is preferred, such as when the computing power available is limited or there are network bandwidth constraints. In these cases, you would pull only the data you think you need at the time you need it. The problem with a pull model is that it may not always give you the information you need. Consider a scenario in which you are polling a temperature measurement from a network device; failing to poll it regularly or after an event that would increase it could have disastrous consequences.
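To make the contrast concrete, the following Python sketch models both flows with a toy sensor and management station. The class names and the 80.0°C alert threshold are invented for illustration; real implementations would sit on top of protocols such as SNMP or model-driven telemetry.

```python
# A toy sketch contrasting pull and push collection models.
# Sensor, Manager, and the 80.0 C threshold are illustrative, not real APIs.

class Sensor:
    def __init__(self, name):
        self.name = name
        self.temperature = 25.0  # current reading in Celsius

class Manager:
    def __init__(self, threshold=80.0):
        self.threshold = threshold
        self.alerts = []

    def poll(self, sensors):
        """Pull model: the manager asks each sensor on its own schedule."""
        for s in sensors:
            self.check(s.name, s.temperature)

    def receive(self, name, reading):
        """Push model: a sensor sends a reading as soon as it has one."""
        self.check(name, reading)

    def check(self, name, reading):
        if reading > self.threshold:
            self.alerts.append((name, reading))

mgr = Manager()
core = Sensor("core-sw-1")
core.temperature = 92.5
mgr.poll([core])                  # pull: manager initiated
mgr.receive("edge-rt-2", 85.1)    # push: device initiated
print(mgr.alerts)
```

Note that in both models the action taken is the same; the difference is purely in which side initiates the data transfer.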

You have probably been using data-driven automation for years. The following sections guide you through use cases for adding automation to a network.

Data-Driven Automation Use Cases

Even if they don’t use the term automation, companies use data-driven automation all the time. There are opportunities for automation in many IT processes. The following sections describe some use cases for data-driven automation. Later in the book you will see case studies of some of these use cases transformed and applied to global-scale production systems.

Monitoring Devices

One of the quickest and easiest ways of getting into automation is by monitoring device configurations or metrics.

Monitoring device metrics is a well-established field, and you can use a number of tools that can collect network statistics and create dashboards and reports for you. We cover some of these tools later in this chapter; in addition, you can easily create your own custom solutions. One of the reasons to create your own tools is that off-the-shelf tools often give you too much information—and not necessarily the information you need. You might need to build your own tool, for example, if you need to monitor hardware counters, because industry solutions are not typically focused on specific detailed implementations. By collecting, parsing, and displaying information that is relevant for your particular case, you can more easily derive actionable insights.

Monitoring device configurations is just as important as monitoring device metrics, as it prevents configuration drift, a situation in which production devices’ configurations change over time so that they differ from each other or from their backed-up baselines. With the passing of time, point-in-time changes are made to deal with fixes or tests. Some of the changes are not reflected everywhere, and thus drift occurs. Configuration drift can become a problem in the long run, and it can be a management nightmare.

The following are some of the protocols commonly used in network device monitoring:

• Simple Network Management Protocol (SNMP) (metrics)

• Secure Shell Protocol (SSH) (configuration)

• NetFlow (metrics)

• Syslog (metrics)

Each of these protocols serves a different purpose. For example, you might use SNMP to collect device statistics (such as CPU usage or memory usage), and you might use SSH to collect device configurations. Chapter 3 looks more closely at how to collect data and which methods to use.

Automating network monitoring tasks does not simply add speed and agility but allows networking teams to get updated views of their infrastructure with minimal effort. In many cases, automating network monitoring reduces the administrative burden.

To put this into perspective, let’s look at an example of a network operations team that needs to verify that the time and time zone settings in all its network devices are correct so that logs are correlatable. In order to achieve this, the network operator will have to follow these steps:

Step 1. Connect to each network device using SSH or Telnet.

Step 2. Verify time and time zone configurations and status.

a. If these settings are correct, proceed to the next device.

b. If these settings are incorrect, apply needed configuration modifications.

The operator would need to follow this process for every single device in the network. Then, it would be a good idea to revisit all devices to ensure that the changes were applied correctly. Opening a connection and running a command on 1000 devices would take a lot of time—over 40 hours if operators spend 2 or 3 minutes per device.

This process could be automated so that a system, rather than an operator, iterates over the devices, verifies the time and time zone settings, and applies configuration modifications when required. What would take 40 hours as a manual process might take an automated system less than 2 hours, depending on the tool used. In addition, using automation would reduce the possibility of human error in this process. Furthermore, the automation tool could later monitor time data continuously and alert you when time is not in sync. From this example, you can clearly see some of the benefits of automation.
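As a sketch of what such an automated loop could look like, the following Python fragment compares each device’s already-collected settings to a desired state and generates remediation commands. The inventory, desired state, and command syntax are invented for illustration and are not tied to a specific platform.

```python
# A minimal sketch of the time-zone verification loop, assuming each
# device's current settings have already been collected (e.g., over SSH).
# The inventory and command syntax are illustrative.

DESIRED = {"timezone": "UTC", "ntp_server": "10.0.0.1"}

inventory = {
    "sw-branch-1": {"timezone": "UTC", "ntp_server": "10.0.0.1"},
    "sw-branch-2": {"timezone": "CET", "ntp_server": "10.0.0.9"},
}

def remediation_commands(current):
    """Return the config lines needed to reach the desired state."""
    cmds = []
    if current["timezone"] != DESIRED["timezone"]:
        cmds.append(f"clock timezone {DESIRED['timezone']} 0")
    if current["ntp_server"] != DESIRED["ntp_server"]:
        cmds.append(f"ntp server {DESIRED['ntp_server']}")
    return cmds

for host, settings in inventory.items():
    cmds = remediation_commands(settings)
    if cmds:
        print(f"{host}: applying {cmds}")  # a real tool would push the config
    else:
        print(f"{host}: compliant")
```

The same compare-and-remediate loop generalizes to any configuration attribute you can collect and express as a desired state.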

The process used in the time verification example could also be used with other types of verification, such as for routing protocols or interface configurations. You will see plenty of examples throughout this book of applying data-driven automation using tools such as Python and Ansible.

Compliance Checking

Most network engineers must deal with compliance requirements. Systems must be built for regulatory compliance (for example, PCI DSS, HIPAA) as well as for company policy. In many cases, compliance checking has become so cumbersome that some companies have dedicated roles for it, with individuals going around monitoring and testing systems and devices for compliance.

Today, we can offload some of the burden of compliance checking to automated systems. These systems systematically gather information from target systems and devices and compare it to the desired state. Based on the results, a range of actions can be taken, from a simple alert to more complex remediation measures.

Figure 1-4 illustrates a simple data-driven automated flow:

Step 1. The management system polls device configuration by using a protocol such as SSH.

Step 2. The management system compares the device configuration against predefined template rules.

Step 3. The management system pushes a configuration change if the compliance check fails.


Figure 1-4 Compliance Workflow
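Step 2 of this workflow can be as simple as a set comparison. The following Python sketch checks a retrieved configuration against predefined template rules; both the rules and the sample configuration are invented for illustration.

```python
# A sketch of step 2 of the compliance workflow: comparing a retrieved
# configuration against predefined template rules. Rules and config
# lines are illustrative.

required_lines = [
    "service password-encryption",
    "no ip http server",
    "logging host 10.1.1.100",
]

device_config = """\
hostname branch-sw-1
service password-encryption
ip http server
logging host 10.1.1.100
"""

def compliance_check(config, rules):
    """Return the rules missing from the running configuration."""
    present = set(config.splitlines())
    return [rule for rule in rules if rule not in present]

missing = compliance_check(device_config, required_lines)
print(missing)  # a real system would now alert or push these lines
```

Real compliance engines support richer rule languages (regular expressions, hierarchical matching), but the compare-against-template core is the same.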

Automated compliance tools can typically generate reports that keep you aware of the network’s compliance status. In addition, integrations with other network solutions can help you secure your network. For example, through integration with an endpoint management system, you can quarantine noncompliant endpoints, reducing the network’s exposure in the event of a cyberattack.

How are you doing your compliance checks today? If you are already using a tool for compliance checks, you are already on the automation train! If you are doing these checks manually, you might consider implementing one of the tools presented later in this chapter to help you with the task.

Optimization

Organizations are eager to optimize their networks—to make them better and more efficient. Optimization is an activity that we often do manually by understanding what is currently happening and trying to improve it. Possible actions are changing routing metrics or scaling elastic components up and down. Some automation tools can do such optimization automatically. By retrieving data from the network constantly, they can perceive what is not running as efficiently as possible and then take action. A good example of automated network optimization is software-defined wide area networking (SD-WAN). SD-WAN software constantly monitors WAN performance and adjusts the paths used to access resources across the WAN. It can monitor metrics such as latency to specific resources and link bandwidth availability. It is a critical technology in a time when many of the resources accessed are outside the enterprise domain (such as with SaaS or simply on the Internet).

Imagine that you have two WAN links, and your company uses VoIP technologies for critical operations that, when disrupted, can cost the business millions of dollars. Say that you configure a policy where you indicate that packet loss must be less than 3% and latency cannot exceed 50ms. An SD-WAN controller will monitor the WAN links for predefined metrics and compare them to your policy. When the controller detects that your main WAN link is experiencing poor performance for voice traffic, it shifts this type of traffic to your second link. This type of preventive action would be considered an optimization.
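A minimal sketch of that policy evaluation, with invented link names and measurements, might look like the following; a real SD-WAN controller measures these metrics continuously and steers traffic per application class.

```python
# A toy evaluation of the voice policy described above: packet loss
# below 3% and latency at most 50 ms. Link names and measurements
# are made up for illustration.

POLICY = {"max_loss_pct": 3.0, "max_latency_ms": 50.0}

links = {
    "wan1": {"loss_pct": 4.2, "latency_ms": 38.0},
    "wan2": {"loss_pct": 0.5, "latency_ms": 22.0},
}

def meets_policy(metrics, policy):
    return (metrics["loss_pct"] < policy["max_loss_pct"]
            and metrics["latency_ms"] <= policy["max_latency_ms"])

def pick_voice_link(links, policy):
    """Prefer the first link that satisfies the policy."""
    for name, metrics in links.items():
        if meets_policy(metrics, policy):
            return name
    return None

print(pick_voice_link(links, POLICY))  # wan1 violates the loss target
```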

As another example of optimization, say that you have some tools that gather the configurations from your network devices and systems. They compare them to configuration best practices and return to you possible enhancements or even configuration issues. Optimization tools today can even go a step further: They can also collect device metrics (for example, link utilization, client density) and compare them to baselines of similar topologies. These tools can, for example, collect data from several customers, anonymize that data, and use it as a baseline. Then, when the tool compares device metrics against the baseline, it might be able to produce insights that would otherwise be very difficult to spot.

Predictive Maintenance

Isn’t it a bummer when a device stops working due to either hardware or software issues? Wouldn’t it be nice to know when something is going to fail ahead of time so you could prepare for it?

With predictive maintenance, you can gather data from your network and model systems that attempt to predict when something is going to break. This involves artificial intelligence, which is a subset of automation. Predictive maintenance is not magic; it’s just pattern analysis of metrics and logs to identify failure events.

Take, as an example, router fans. By collecting rotation speed over time (see Table 1-1), you might find that before failure, router fans rotate considerably more slowly than they do when they’re brand new. Armed with this information, you could model a system to detect such behavior.

Table 1-1 Router Fan Rotations per Minute (RPM)


Predictive models typically use more than one parameter to make predictions. In machine learning terms, the parameters are called features. You may use many features combined to train predictive models (for example, fan RPM, system temperature, device Syslog data). Such a model takes care of prioritizing the relevant features in order to make correct predictions.

Bear in mind that modeling such prediction systems requires a fair amount of data. The data needed depends on the complexity of the use case and the algorithms used. Some models require millions of data points in order to show accurate predictions.
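In the spirit of Table 1-1, the following sketch flags a fan whose recent average RPM has dropped well below its as-new baseline. Real predictive models are trained on many features and far more data, so treat this purely as an illustration of the pattern being detected; the 20% threshold is an arbitrary assumption.

```python
# An illustrative pattern check: flag a fan whose recent average RPM
# has dropped well below its as-new baseline. The 20% threshold is an
# assumption, not a vendor figure.

def fan_degraded(baseline_rpm, recent_rpms, drop_threshold=0.20):
    """Flag the fan if the recent average falls >20% below baseline."""
    avg = sum(recent_rpms) / len(recent_rpms)
    return (baseline_rpm - avg) / baseline_rpm > drop_threshold

print(fan_degraded(10000, [9900, 9850, 9920]))  # healthy
print(fan_degraded(10000, [7600, 7400, 7100]))  # likely pre-failure
```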

Predictive maintenance is still in its infancy. However, it has been showing improved results lately, thanks to increased computing power at attractive prices. You will see more on machine learning techniques to achieve this sort of use case in Chapter 3.

Another prediction example is load prediction, which can be seen as a form of predictive maintenance. Some tools collect metrics from network devices and interfaces and then try to predict whether a link will be saturated in the future.

Say that you have a branch site with a 1Gbps link to the Internet. Depending on the time of the day, the link may be more or less saturated with user traffic. Over time, users have been using more external resources because the company has adopted SaaS. You could use a tool to constantly monitor this link’s bandwidth and get warnings when you might run into saturation on the link. Such warnings would provide valuable insight, as you would be able to act on this information before you run into the actual problem of not having enough bandwidth for users.
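A naive version of such a prediction is a straight-line trend. The following sketch fits a least-squares line through invented weekly peak-utilization samples and extrapolates to the 1Gbps capacity; production tools use far richer models, so this only illustrates the idea.

```python
# A naive sketch of load prediction: fit a straight line through weekly
# peak-utilization samples and estimate when the 1 Gbps link saturates.
# The samples are invented.

def weeks_until_saturation(samples_mbps, capacity_mbps=1000):
    """Least-squares slope over weekly samples, extrapolated to capacity."""
    n = len(samples_mbps)
    xs = range(n)
    mean_x = sum(xs) / n
    mean_y = sum(samples_mbps) / n
    slope = (sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(xs, samples_mbps))
             / sum((x - mean_x) ** 2 for x in xs))
    if slope <= 0:
        return None  # utilization is flat or falling
    return (capacity_mbps - samples_mbps[-1]) / slope

peaks = [500, 540, 580, 620, 660]  # Mbps, one sample per week
print(weeks_until_saturation(peaks))
```

With these samples the trend grows 40Mbps per week, so the tool would warn roughly eight and a half weeks ahead of saturation, which is exactly the kind of early insight the text describes.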

Troubleshooting

Network engineers spend a good portion of their time troubleshooting. Common steps include the following:

Step 1. Connect to a device.

Step 2. Run show commands.

Step 3. Analyze the output of these commands.

Step 4. Reconfigure features on the device.

Step 5. Test for success.

Step 6. Repeat steps 1–5 for each of the other devices you need to troubleshoot.

Take a moment to consider how much knowledge you need in order to correlate logs to a root cause or to know what protocol is causing loss of connectivity for a specific user. Now consider that you might need to do this in the middle of the night, after a long workday.

Automation can lighten the load in troubleshooting. By automating some of the aforementioned steps, you can save time and solve problems more easily. Some tools (such as DNA Center, covered later in this chapter) already have troubleshooting automation embedded. The way they work is by deriving insights from collected data and communicating to the operator only the end result. Such a tool can handle steps 1, 2, and 3; then you, as the operator, could act on the analysis. Steps 4 and 5 could also be automated, but most troubleshooting tools do not operate in a closed loop, a state in which the tool collects, analyzes, and acts on the analysis without any input from a human. Rather, with most troubleshooting automation tools, a human still presses the final button to apply the changes, even if the only interaction is a simple confirmation.

Visualize the following troubleshooting scenario:

• Reported issues:

• Applications and users are reporting loss of connectivity.

• Symptoms:

• Several switches are experiencing high CPU usage.

• Packet drop counters are increasing.

• Troubleshooting steps:

• Collect log messages from affected devices by using show commands.

• Verify log messages from affected devices.

• Trace back activity logs to the time when the problem was first reported.

• Correlate logs by using other troubleshooting techniques, such as traceroute.

After the users report problems, you could follow the troubleshooting steps yourself, but you would need to narrow down the issue first. You could narrow it down by collecting metrics from a network management system or from individual devices. After this initial step, you could troubleshoot the specific devices. While you do all this, users are still experiencing issues.

An automated system is more agile than a human at troubleshooting, and it can make your life easier. An automated system consistently collects logs and metrics from the network, and it can detect a problem when it sees values that do not match the expected values. The system can try to match the symptoms it sees to its knowledge base to identify possible signature problems. When it finds a match, it reports its discoveries to the network operator or even mitigates the problem itself. Most tools report findings rather than resolve them, due to the criticality of some networks.
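The signature-matching step can be sketched as a simple set overlap between observed symptoms and a knowledge base; the signatures and the two-symptom overlap rule below are invented for illustration.

```python
# A sketch of matching observed symptoms against a small knowledge base
# of problem signatures. The signatures are illustrative, not from a
# real product.

KNOWLEDGE_BASE = {
    "layer2-loop": {"high_cpu", "packet_drops", "mac_flapping"},
    "duplex-mismatch": {"crc_errors", "late_collisions"},
}

def match_signatures(symptoms, kb, min_overlap=2):
    """Return candidate problems sharing at least min_overlap symptoms."""
    return [name for name, signature in kb.items()
            if len(signature & symptoms) >= min_overlap]

observed = {"high_cpu", "packet_drops"}
print(match_signatures(observed, KNOWLEDGE_BASE))
```

As the Note below this scenario stresses, the quality of the knowledge base determines the quality of the matches.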

In this specific scenario, the underlying issue could be that a Layer 2 loop is being caused by a misplaced cable connection. Receiving a notification of a possible Layer 2 loop is much more helpful and actionable than hearing that “some users are reporting loss of connectivity.” This example illustrates the real power of automation: enabling and empowering people to do a better job.


Note

Keep in mind that with troubleshooting automation, it is crucial to have a good knowledge base. Without such a knowledge base, deriving insights from collected data will not be effective.


Task-Based Automation

Tasks are actions. In networking, you have to perform a whole bunch of actions across many devices several times. These actions can be simple or complex, entailing several steps. The need to perform actions can be business derived (for example, implementing a new feature) or technology derived (such as a version upgrade). The bigger the change, the higher the operational burden. Task-based automation attempts to reduce this burden and improve the speed and reliability of tasks.

We discuss how to implement task-based automation using Ansible in Chapter 4, “Ansible Basics,” and Chapter 5, “Using Ansible for Network Automation.”

What Task-Based Automation Is

Task-based automation involves automating tasks you would otherwise have to do manually. This category of automation does not need a data input or an automatic trigger to happen; most of the time, it is triggered by an operator.

Network operators often need to create new VLANs for new services. To do this, you have to connect to every device that requires the new VLAN in order to configure it. With task-based automation, you could have a system do this VLAN configuration task for you. Depending on the automation system chosen, you would need inputs such as which devices to configure, what to configure, and when to configure it.
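As a sketch, the task reduces to generating the same small command list for every target. The device names and VLAN details below are invented, and a real tool would then open a session to each device and push the commands.

```python
# A sketch of a VLAN task: given the inputs mentioned above (which
# devices, what to configure), generate the per-device command list an
# automation tool would push. Names and VLAN details are illustrative.

def vlan_commands(vlan_id, vlan_name):
    return [f"vlan {vlan_id}", f"name {vlan_name}"]

targets = ["access-sw-1", "access-sw-2", "access-sw-3"]
job = {host: vlan_commands(30, "GUEST-WIFI") for host in targets}

for host, cmds in job.items():
    print(host, cmds)  # a real tool would open a session and apply these
```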

Task-based automation is where most organizations start when developing in-house solutions. These are the typical steps to start on your own automation journey:

Step 1. Survey what tasks are commonly executed.

Step 2. Assess the feasibility of automating them and the expected return value.

Step 3. Choose an appropriate tool.

Step 4. Develop use cases on the chosen tool.

As an exercise, think of a task that you commonly do or that your team commonly does. How would you automate it? What tool would you use? You can refer to the “Tools” section at the end of this chapter to help make that decision.

Task-Based Automation Use Cases

This section guides you through some common use cases for task-based automation. Don’t be scared if it seems complex at first. Most complex tasks can be broken down into smaller and less complex subtasks.

Interaction

Interacting with a component is the most generic use case. You can automate almost any interaction you do manually, including these:

• Opening a website

• Interacting with an application

• Inserting data in a database

• Creating a spreadsheet in Excel

If you currently do something repeatedly, chances are it is worth thinking about automating it.

As an example of an interaction, say that you are constantly given Excel sheets with network parameters (for example, hostname, IP address, location) from newly deployed devices in branches. After receiving the sheets by email, you manually insert the parameters in your network management system database via the web GUI available at a URL. This type of activity is time-consuming and makes poor use of your intellectual capital. You could automate this flow of interactions and liberate yourself from the administrative burden.

Data Collection

Collecting data is a day-to-day activity for many people. There are many data types and formats, and we cover common networking data types in Chapter 2. There are also many ways of collecting data, and many engineers still do it manually—connecting to devices and systems one by one to gather information. This approach works, but it can be improved. A better way is to use automation, as described in this section (and also in Chapter 2).

Say that you want to collect data on a number of connected devices in your network. For the sake of this example, assume that all devices have either LLDP or CDP configured. One way to solve this task would be to use show command output for either LLDP or CDP (depending on the device vendor) to learn the hostnames of the neighbor devices, as shown in Example 1-1.

Example 1-1 show cdp neighbors Output on a Cisco Switch

Pod6-RCDN6-Border#show cdp neighbors
Capability Codes: R - Router, T - Trans Bridge, B - Source Route Bridge
                  S - Switch, H - Host, I - IGMP, r - Repeater, P - Phone,
                  D - Remote, C - CVTA, M - Two-port Mac Relay

Device ID        Local Intrfce     Holdtme    Capability  Platform  Port ID
Switch-172-31-63-161.cisco.com
                 Gig 1/0/13        166             R S I  WS-C3850- Gig 1/0/23
Switch-172-31-63-161.cisco.com
                 Gig 1/0/14        149             R S I  WS-C3850- Gig 1/0/24
Pod6-SharedSvc.cisco.com
                 Gig 1/0/2         152             R S I  WS-C3650- Gig 1/0/24

Total cdp entries displayed : 3

You could connect to every device on the network, issue this command, and save the output and later create some document based on the information you gather. However, this activity could take a lot of time, depending on the number of devices on the network. In any case, it would be a boring and time-consuming activity.

With automation, if you have a list of target systems, you can use a tool to connect to the systems, issue commands, and register the output, all very quickly. Furthermore, the tool could automatically parse the verbose show command output and then save or display only the required hostnames.
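For instance, a short Python function can reduce the Example 1-1 output to just the neighbor hostnames. This parsing relies on the Device ID lines starting in the first column, which holds for this sample output but should be verified against your own devices.

```python
# A sketch of parsing the show cdp neighbors output from Example 1-1
# to keep only the neighbor hostnames, as the text describes.

raw = """\
Device ID        Local Intrfce     Holdtme    Capability  Platform  Port ID
Switch-172-31-63-161.cisco.com
                 Gig 1/0/13        166             R S I  WS-C3850- Gig 1/0/23
Switch-172-31-63-161.cisco.com
                 Gig 1/0/14        149             R S I  WS-C3850- Gig 1/0/24
Pod6-SharedSvc.cisco.com
                 Gig 1/0/2         152             R S I  WS-C3650- Gig 1/0/24
"""

def neighbor_hostnames(output):
    """Lines with no leading space (after the header) hold Device IDs."""
    hosts = []
    for line in output.splitlines()[1:]:  # skip the header row
        if line and not line[0].isspace():
            hosts.append(line.strip())
    return sorted(set(hosts))

print(neighbor_hostnames(raw))
```

Deduplicating the two entries for the same switch leaves just two unique neighbors, which is the kind of distilled output you would save to a document.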

This type of automation is easily achievable and liberates people to focus on tasks that only humans can do.


Note

You will learn later in this book how to collect and parse this type of information from network devices by using tools such as Python and Ansible.


Configuration

Applying point-in-time configurations to devices and systems is another common networking task. Typically, you need the desired configurations and a list of targets to apply the configurations to. The configurations may differ, depending on the target platform, and this can be challenging with a manual process.

Automating configuration activities can greatly improve speed and reduce human error. As in the previous use case, the automation tool connects to the targets, but now, instead of retrieving information, it configures the devices. In addition, you can automate the verification of the configuration.

The following steps show the typical configuration workflow:

Step 1. Connect to a device.

Step 2. Issue precheck show commands.

Step 3. Apply the configuration by entering the appropriate commands.

Step 4. Verify the applied configurations by using show commands.

Step 5. Issue postcheck show commands.

Step 6. Repeat steps 1–5 for all the devices on the list of target platforms.
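The workflow above can be sketched as a small orchestration function. Here run_commands is a stub standing in for a real SSH session, and the interface commands are invented for illustration.

```python
# A sketch of the six-step configuration workflow, with the device
# interactions stubbed out. run_commands stands in for a real session.

def run_commands(host, commands):
    # Stub: a real implementation would send these over SSH.
    return {cmd: f"<output of {cmd} on {host}>" for cmd in commands}

def configure_device(host, config_cmds, check_cmds):
    precheck = run_commands(host, check_cmds)    # step 2: prechecks
    run_commands(host, config_cmds)              # step 3: apply config
    postcheck = run_commands(host, check_cmds)   # steps 4-5: verify
    return {"host": host, "pre": precheck, "post": postcheck}

result = configure_device(
    "edge-rt-1",
    config_cmds=["interface Gi0/1", "description uplink-to-core"],
    check_cmds=["show interfaces description"],
)
print(result["host"], list(result["pre"]))
```

Step 6 is then just a loop calling configure_device over the list of targets, and comparing pre and post snapshots gives you the success test.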

You can do many kinds of configurations. These are the most common examples of configurations on network devices:

• Shutting down interfaces

• Tuning protocols (based on timers or features, for example)

• Changing interface descriptions

• Adding and removing access lists

In addition to using the simple workflow shown earlier, you can configure more complex workflows, such as these:

• You can configure devices in a specific order.

• You can take actions depending on the success or failure of previous actions.

• You can take time-based actions.

• You can take rollback actions.

In enterprise networking, the tool most commonly used for network automation is Ansible. Some of the reasons for its dominance in this domain are its extensive ecosystem that supports most types of devices, its agentless nature, and its low entry-level knowledge requirements. Chapter 5 illustrates how to access and configure devices using Ansible.

An automation tool may even allow you to define intent (that is, what you want to be configured), and the tool generates the required configuration commands for each platform. This level of abstraction enables more flexibility and loose coupling between components. (Keep this in mind when choosing an automation tool.)
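As a sketch of this idea, a single intent description can be rendered into different command syntaxes by per-platform functions. The intent schema and both renderers below are invented for illustration, not taken from a real tool.

```python
# A sketch of intent-based configuration: describe what you want once,
# and let per-platform renderers generate the actual commands. The
# intent schema and renderers are invented.

intent = {"interface": "eth1", "description": "uplink", "enabled": True}

def render_ios(i):
    cmds = [f"interface {i['interface']}",
            f"description {i['description']}"]
    cmds.append("no shutdown" if i["enabled"] else "shutdown")
    return cmds

def render_junos(i):
    base = f"set interfaces {i['interface']}"
    cmds = [f"{base} description {i['description']}"]
    if not i["enabled"]:
        cmds.append(f"{base} disable")
    return cmds

print(render_ios(intent))
print(render_junos(intent))
```

The intent stays the same while the renderers absorb the platform differences, which is the loose coupling the text recommends looking for in a tool.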

Provisioning

Provisioning is different from configuration, but the two terms are often confused. Provisioning, which occurs before configuration, involves creating or setting up resources. After a resource is provisioned, it is made available and can be configured. For example, a user could provision a virtual machine. After that virtual machine boots and is available, the user could configure it with the software required for a particular use case.

As you know, the cloud has become an extension of company networks. In addition to public clouds (such as Amazon Web Services, Google Cloud Platform, and Microsoft Azure), organizations also use private clouds (for example, OpenStack or VMware vCloud suite). In the cloud, provisioning new resources is a daily activity. You can accomplish provisioning by using a graphical user interface, the command line, or application programming interfaces (APIs). Provisioning can also entail creating new on-premises resources such as virtual machines.

As you might have guessed by now, using a graphical interface is significantly slower and more error prone than using automation for provisioning. Automating the provisioning of resources is one of the cloud’s best practices—for several reasons:

Resource dependencies: Resources may have dependencies, and keeping track of them is much easier for automation than for humans. For example, a virtual machine depends on previously created network interface cards. Network engineers may forget about or disregard such dependencies when doing provisioning activities manually.

Cost control: When you use on-premises resources, you tend to run them all the time. However, you don’t always need some resources, and provisioning and deprovisioning resources in the cloud as you need them can be a cost control technique. The cloud allows for this type of agility, as you only pay for what you use.

Ease of management: By automating the provisioning of resources, you know exactly what was provisioned, when, and by whom. Management is much easier when you have this information.

A number of tools (for example, Terraform and Ansible) facilitate cloud provisioning through text-based resource definitions. See the “Tools” section, later in this chapter, for further details on these tools.
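The dependency-tracking point above can be illustrated with a depth-first topological sort that computes a safe provisioning order; the resource names are invented, and real tools such as Terraform build an equivalent dependency graph from the resource definitions themselves.

```python
# A sketch of why tools track resource dependencies: given "resource
# depends on" edges, compute a safe provisioning order. Resource names
# are illustrative; no cycle detection is included.

def provisioning_order(deps):
    """Depth-first topological sort: provision dependencies first."""
    order, done = [], set()
    def visit(resource):
        if resource in done:
            return
        for dep in deps.get(resource, []):
            visit(dep)
        done.add(resource)
        order.append(resource)
    for resource in deps:
        visit(resource)
    return order

deps = {
    "vm-web-1": ["nic-web-1", "disk-web-1"],
    "nic-web-1": ["subnet-a"],
    "disk-web-1": [],
    "subnet-a": ["vnet-prod"],
    "vnet-prod": [],
}
print(provisioning_order(deps))
```

The network comes first and the virtual machine last, mirroring the NIC-before-VM example in the text.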

Reporting

Reports are a necessary evil, and they serve many purposes. Generating reports is often seen as a tedious process.

In networking, reports can take many shapes and forms, depending on the industry, technology, and target audience. The basic process for generating a report involves three steps:

Step 1. Gather the required information.

Step 2. Parse the information in the desired format.

Step 3. Generate the report.

This process can vary, but the three main steps always occur.

Network engineers often need to run reports on software versions. Due to vendors’ software life cycle policies, companies fairly often need to get an updated view of their software status. This type of report typically consists of the following:

• Devices (platforms and names)

• Current software versions (names and end dates)

• Target software versions (names)

Reports on the hardware life cycle are also common in networking. Vendors support specific hardware platforms for a specific amount of time, and it is important to keep an updated view of the hardware installed base in order to plan possible hardware refreshes.

In addition, configuration best practices and security advisory impact reports are common in networking.

Automation plays a role in reporting. Many tools have automatic report generating capabilities, no matter the format you might need (such as HTML, email, or markdown). The steps required to gather and parse information can be automated just like the steps described earlier for data collection; in fact, reporting features are typically embedded in data collection automation tools. However, there are cases in which you must provide the data input for the report generation. In such cases, you must automate the data gathering and parsing with another tool.

For the example of software versioning mentioned earlier, you could use a tool to connect to network devices to gather the current software version of each device. To get the target versions for the devices, the tool could use vendors’ websites or API exposed services. With such information, the automation tool could compile a report just as a human would.
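The three report steps can be sketched in a few lines of Python. Here the gathered data is hardcoded (a real tool would collect it from devices and vendor APIs, as described above), and the report is rendered as a markdown table with hypothetical devices and versions.

```python
# A sketch of the three report steps for the software-version example:
# the gathered data is hardcoded, and the report is plain markdown.
# Device names, platforms, and versions are invented.

inventory = [
    {"device": "core-sw-1", "platform": "C9300",
     "current": "17.3.4", "target": "17.9.4"},
    {"device": "edge-rt-1", "platform": "ASR1001",
     "current": "17.9.4", "target": "17.9.4"},
]

def version_report(rows):
    lines = ["| Device | Platform | Current | Target | Upgrade? |",
             "|---|---|---|---|---|"]
    for r in rows:
        needs = "yes" if r["current"] != r["target"] else "no"
        lines.append(f"| {r['device']} | {r['platform']} | {r['current']} "
                     f"| {r['target']} | {needs} |")
    return "\n".join(lines)

print(version_report(inventory))
```

Swapping the rendering function is all it takes to emit HTML or an email body instead, which is why reporting is usually the easiest step to tailor.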


Note

Most of the time, tailoring your own tool is the way to go in reporting. Each company has its own requirements for reports, and it can be difficult to find a tool that meets all of those requirements.


End-to-End Automation

So far, we have described several use cases and scenarios, but all of them have focused on single tasks or components. You know that a network is an ecosystem composed of many different platforms, pieces of vendor equipment, systems, endpoints, and so on. When you need to touch several of these components in order to achieve a goal, you can use end-to-end automation.


Note

End-to-end automation can be complex because different components might have different ways of being accessed, as well as different configurations and metrics. Do not let this put you off; just tread lightly. End-to-end automation is the most rewarding type of automation.


What End-to-End Automation Is

End-to-end automation is not a unique type of automation but a combination of data-driven and task-based automation. It can also be seen as a combination of several use cases of the same type. It is called end-to-end because its goal is to automate a flow from a service perspective. The majority of the time, your needs will fall within this category. Because data-driven and task-based automation are the building blocks for an end-to-end flow, you do need to understand those types as well.

Figure 1-5 shows the topology used in the following example. In this topology, to configure a new Layer 3 VPN service, several components must be touched:

• LAN router or switch

• CPE router

• PE router

• P router

All of these components can be from the same vendor, or they can be from different vendors. In addition, they can have the same software version, or they can have different versions. In each of those components, you must configure the underlay protocol, BGP, route targets, and so on. This would be considered an end-to-end automation scenario.

Images

Figure 1-5 Layer 3 VPN Topology


Tip

Think of an end-to-end process you would like to automate. Break it down into smaller data-driven or task-based processes. Your automation job will then be much easier.


End-to-End Automation Use Cases

All of the use cases described in this section are real-life use cases. They were implemented in companies worldwide, at different scales.


Note

All of the use cases discussed so far in this chapter could potentially be applied to an end-to-end scenario (for example, monitoring an entire network, running a compliance report of all devices).


Migration

Migrations—either from an old platform to a new platform or with a hardware refresh—are often time-consuming tasks, and some consider them to be a dreadful activity. They are complex and error prone, and most of the time, migrations impact business operations. Automation can help you make migrations smoother.

Consider the following scenario: You need to migrate end-of-life devices in a data center to new ones. The new ones are already racked and cabled and have management connectivity in a parallel deployment. There is a defined method for carrying out the migration:

Step 1. Collect old device configurations.

Step 2. Create new device configurations.

Step 3. Execute and collect the output of the precheck commands.

Step 4. Configure new devices with updated configurations.

Step 5. Execute and collect the output of postcheck commands.

Step 6. Shift the traffic from the old devices to the new devices.

Step 7. Decommission the old devices.

This is a fairly straightforward migration scenario, but depending on the number of devices and the downtime accepted by the business requirements, it might be challenging.

An engineer would need to use SSH to reach each old device to collect its configuration, make the required changes in a text editor, use SSH to reach each of the new devices, configure each device, and verify that everything works. Then the engineer can shift the traffic to the new infrastructure by changing routing metrics on other devices and verify that everything is working as expected.

You could automate steps of this process, and, in fact, you have already seen examples of automating these steps:

• Collect the old configurations.

• Generate the new configurations.

• Apply changes on the devices.

• Verify the configurations/services.

Remember that migrations are often tense and prone to errors. Having an automated step-by-step process can mean the difference between a rollback and a green light, and it can help you avoid catastrophic network outages.
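As a rough sketch, the step-by-step flow can be modeled as a runner that undoes completed steps when a check fails. The step and rollback callables here are stand-ins for real device interactions (SSH sessions, configuration pushes, pre/postchecks):

```python
# Hedged sketch of an automated migration flow with rollback. Each step is a
# stand-in callable that returns True on success; a failure rolls back the
# steps already completed, in reverse order.
def migrate(steps, rollback):
    """Run named migration steps in order; roll back completed steps on failure."""
    completed = []
    for name, step in steps:
        if not step():                        # step returns True on success
            for prev in reversed(completed):  # undo in reverse order
                rollback(prev)
            return f"rolled back at step: {name}"
        completed.append(name)
    return "migration complete"

# Hypothetical steps for illustration: the postcheck fails, so the flow rolls back.
log = []
steps = [
    ("collect old configs", lambda: True),
    ("configure new devices", lambda: True),
    ("postcheck", lambda: False),
]
result = migrate(steps, rollback=lambda name: log.append(f"rollback {name}"))
```

The same skeleton applies whether the steps are implemented with a framework such as Ansible or with custom scripts.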


Note

A rollback strategy can be embedded into an automated process and automatically triggered if any check, pre or post, is different from what is expected.


Configuration

We covered a configuration use case earlier, when we looked at task-based automation. However, it is important to highlight that end-to-end configuration is just as valuable, if not more so. Automating service configuration involves a collection of automated tasks.

Instead of simply configuring a VLAN on a list of devices, say that you want to run an application on a newly created virtual machine and enable it to communicate. Using the data center topology shown in Figure 1-6, you need to take these steps to achieve the goal:

Step 1. Create a virtual machine (component A).

Step 2. Install the application.

Step 3. Configure Layer 2 networking (component B).

Step 4. Configure Layer 3 networking (components C and D).

Images

Figure 1-6 Simplified Data Center Topology

These steps are simplified, of course. In a real scenario, you normally have to touch more components to deploy a new virtual machine and enable it to communicate in a separate VLAN. The complexity of the configuration depends on the complexity of the topology. Nonetheless, it is clear that to achieve the simple goal of having an application in a private data center communicate with the Internet, you require an end-to-end workflow. Achieving this manually would mean touching many components, with possibly different syntax and versions; it would be challenging. You would need to know the correct syntax and where to apply it.

An end-to-end automated workflow can ease the complexity and reduce human errors in this configuration. You would define syntax and other characteristics once, when creating the workflow, and from then on, you could apply it with the press of a button.

The more complex and extensive a workflow is, the more likely it is that a human will make a mistake. End-to-end configurations benefit greatly from automation.
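One way such a workflow reduces syntax errors is by defining the intent once and rendering per-component syntax from templates. The following toy sketch uses hypothetical templates for two platforms; it is a conceptual illustration, not a complete workflow engine:

```python
# Define the service intent a single time; hypothetical per-platform templates
# render the actual commands for each component in the workflow.
INTENT = {"vlan_id": 100, "name": "app-net"}

TEMPLATES = {
    "ios": "vlan {vlan_id}\n name {name}",
    "nxos": "vlan {vlan_id}\n  name {name}",
}

def render(platform, intent):
    """Render configuration for one component from the shared intent."""
    return TEMPLATES[platform].format(**intent)

# The same intent produces the correct syntax for each component.
configs = {platform: render(platform, INTENT) for platform in TEMPLATES}
```

In production, templating engines such as Jinja2 typically play this role, but the principle is the same: syntax is defined once, at workflow-creation time.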


Note

You will see a more in-depth explanation on how to achieve automation in this example in Chapter 6, “Network DevOps.”


Provisioning

As mentioned earlier, automating provisioning of resources on the cloud is key for taking advantage of cloud technology.

Most companies do not use a single cloud but a combination of services on multiple clouds as well as on-premises devices. Using more than one cloud prevents an organization from being locked to one vendor, helps it survive failures on any of the clouds, and makes it possible to use specialized services from particular vendors.

Provisioning resources across multiple clouds is considered end-to-end automation, especially if you also consider the need for the services to communicate with each other. Provisioning multiple different resources in a single cloud, as with a three-tier application, would also fall within this end-to-end umbrella.

Let’s consider a disaster recovery scenario. Say that a company has its services running on an on-premises data center. However, it is concerned about failing to meet its service-level agreements (SLAs) if anything happens to the data center. The company needs to keep its costs low.

A possible way of addressing this scenario is to use the cloud and automation. The company could use a provisioning tool to replicate its on-premises architecture to the cloud in case something happens to the data center. This would be considered a cold standby architecture.

The company could also use the same type of tool to scale out its architecture. If the company has an architecture hosting services in a region and needs to replicate it to more regions, one possible way to do this is by using a provisioning tool. (A region in this case refers to a cloud region, which is a cloud provider’s presence in a specific geographic area.)


Note

Remember that when we refer to the cloud, it can be a public cloud or a private cloud. Provisioning use cases apply equally to both types.


Testing

Like migration, testing can be an end-to-end use case composed of different tasks, such as monitoring, configuration, compliance verification, and interaction. Also like migration, testing can be automated. By having automated test suites, you can verify the network status and the characteristics of the network quickly and in a reproducible manner.

Network testing encompasses various categories, including the following:

• Functional testing

• Performance testing

• Disaster recovery testing

• Security testing

Functional testing tries to answer the question “Does it work?” It is the most common type of test in networking. If you have found yourself using ping from one device to another, that is a type of functional testing; you use ping to verify whether you have connectivity—that is, whether it works. More complex functional tests may verify whether the whole network fulfills its purpose end to end. For example, you might use traceroute to verify that traffic from Device A to Device B follows the expected network path, or you might use curl to verify that your web servers serve the expected content.
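A functional test suite can be modeled as a runner over named checks. In this hedged sketch, the checks are stand-in callables; real checks would wrap ping, traceroute, or curl:

```python
# Minimal functional-test runner: each check is a zero-argument callable
# returning True/False (in practice, a wrapper around ping, traceroute, or curl).
def run_functional_tests(checks):
    """Run each named check and record pass/fail."""
    return {name: ("pass" if check() else "fail") for name, check in checks.items()}

# Stand-in checks for illustration; real ones would probe the network.
results = run_functional_tests({
    "device-b-reachable": lambda: True,
    "web-content-ok": lambda: False,
})
```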

Performance testing is focused on understanding whether something is working to a specific standard. This type of testing is used extensively on WAN links, where operators test the latency of all their service provider links. Another common use of performance testing is to stress a device to make sure it complies with the vendor specifications.

Disaster recovery testing consists of simulating a disaster-like situation, such as a data center outage, to verify that the countermeasures in place take effect and that the service is unaffected or only minimally affected. This is a more niche use case because you wouldn’t test countermeasures you don’t have in place (for example, a dual data center architecture).

Finally, security testing is another common type of testing in networking. The goal of this type of testing is to understand the security status of the network. Are the devices secure? Are any ports open that shouldn’t be? Is everything using strong encryption? Are the devices updated with the latest security patches?

Although all the tests described previously can be done manually—and have been handled that way in the past—moving from manual testing to automated testing increases enterprise agility, enabling faster reactions to any identified issues. Consider a system that periodically verifies connectivity to your critical assets—for example, your payment systems or backend infrastructure—and that measures delay, jitter, and page load times. When you make a network change, such as replacing a distribution switch that breaks, the system can quickly test whether everything is working and allow you to compare the current performance against previously collected data. Performing this task manually would be error prone and time-consuming.
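Comparing current measurements against previously collected data can be as simple as a tolerance check. This sketch assumes hypothetical metric names and a 20% tolerance:

```python
# Compare current measurements against a stored baseline; a metric "passes"
# if it is within the given tolerance of its baseline value.
def within_baseline(current, baseline, tolerance=0.2):
    """Flag each metric as within (True) or outside (False) its baseline."""
    return {metric: abs(current[metric] - baseline[metric]) <= tolerance * baseline[metric]
            for metric in baseline}

# Hypothetical values: latency is close to baseline, jitter is not.
baseline = {"latency_ms": 20.0, "jitter_ms": 2.0}
current = {"latency_ms": 21.0, "jitter_ms": 5.0}
verdict = within_baseline(current, baseline)
```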

Advanced automated testing systems can trigger remediation actions or alarms.

Tools

There are numerous tools on the market today. The following sections present many of the tools that are worth mentioning in the context of network automation:

• DNA Center

• Cloud event-driven functions

• Terraform

• Ansible

• Chef

• Grafana

• Kibana

• Splunk

• Python

Some of these tools are not fixed tools but rather modular tools that can be tailored and adapted to achieve data-driven network automation goals. This book focuses on such modular tools.

Choosing the right tool for the job is as important in automation as it is in other areas. When choosing a tool, it is important to evaluate your needs against what the tool was designed to do. In addition, the staff responsible for using and maintaining the tool should be comfortable with the choice.


Note

For each of the following tools, we highlight in bold the connection to previously illustrated use cases.


DNA Center

DNA Center (DNAC) is a Cisco-developed platform for network management, monitoring, and automation. It allows you to provision and configure Cisco network devices, and it uses artificial intelligence (AI) and machine learning (ML) to proactively monitor, troubleshoot, and optimize the network.

DNAC is software installed on a proprietary appliance, so you cannot install it outside that hardware.

If you are looking to automate a network environment with only Cisco devices, this platform is a quick win. Unfortunately, it does not support all device platforms—not even within Cisco’s portfolio.

In the context of automation, DNAC ingests network data (SNMP, Syslog, NetFlow, streaming telemetry) from devices, parses it, and enables you to quickly complete the following tasks:

Set up alerts based on defined thresholds: You can set up alerts based on metrics pulled from network devices (for example, CPU, interface, or memory utilization). Alerts can take distinct formats, such as email or webhooks.

Baseline network behaviors: Baselining involves identifying typical behavior, such as a specific switch uplink carrying 10 Gbps of traffic on Mondays but only 7 Gbps on Tuesdays. Discovering a good baseline value can be challenging, but baselines are crucial for identifying erroneous behaviors. DNAC offers baselining based on ML techniques. You will see more on this topic in Chapter 3.

Aggregate several network metrics to get a sense of network health: Based on a vast amount of captured information, DNAC displays a consolidated view of the status of a network to give a sense of network health. It can quickly provide insights into devices and clients, without requiring an understanding of metrics or logs.

Act on outliers: Outliers can exist for many reasons. An attacker might have taken control of one of your endpoints, or equipment might have broken. Being able to quickly apply preventive measures can make a difference. DNAC allows you to automate what actions are taken in specific situations.

Verify compliance status and generate reports: DNAC enables you to view and export network compliance verification and reports from an appliance as an out-of-the-box functionality.

DNAC is not as modular as some other tools, and enhancing its out-of-the-box functionalities is not supported in most cases.
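Outside DNAC, the threshold-alerting idea can be sketched generically in a few lines of Python; the metric names and threshold values below are made up for illustration:

```python
# A simplified sketch of threshold-based alerting on polled metrics; the
# metric names and thresholds are hypothetical.
def check_thresholds(metrics, thresholds):
    """Return an alert message for every metric above its threshold."""
    return [f"ALERT: {name}={value} exceeds threshold {thresholds[name]}"
            for name, value in metrics.items()
            if name in thresholds and value > thresholds[name]]

alerts = check_thresholds(
    {"cpu_percent": 95, "memory_percent": 40},
    {"cpu_percent": 80, "memory_percent": 75},
)
```

A real system would poll these metrics over SNMP or streaming telemetry and deliver the alerts via email or webhooks.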

Cloud Event-Driven Functions

If you have a cloud presence, networking is a part of it. One way of automating in the cloud is to use serverless functions. Cloud functions are event driven, a very specific type of data-driven automation. Each cloud vendor has its own implementation, which makes multivendor environments very challenging: code developed for one cloud is not portable to another, even when both follow similar guidelines.

Event driven means actions are derived from events, or changes of state. Events can be messages, actions, alarms, and so on. An event-driven architecture is typically an asynchronous architecture.

AWS Lambda, Google Cloud Functions, and Azure Functions are examples of this technology.

Say that you want to scale out your architecture (virtual machines or containers) when the CPU utilization of the current instances is high. You can trigger an event when the CPU hits a defined threshold, and the cloud function provisions and configures new resources to handle the load.

Event-driven functions also shine at interacting with many different components. They are typically used to automate the integration between otherwise independent components. For example, when someone tries to connect to a virtual machine but provides invalid credentials, this action can generate an event. The event can then be sent to a cloud function that creates a security alert or a ticket on a network management platform.
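A provider-agnostic sketch of such a handler follows. Real cloud functions receive vendor-specific event payloads; the event shape and field names here are assumptions for illustration:

```python
# Hedged, provider-agnostic sketch of an event-driven handler. The event
# structure ("type", "resource") is hypothetical; each cloud vendor defines
# its own payload format.
def handle_event(event):
    """React to a failed-login event by requesting a (simulated) ticket."""
    if event.get("type") == "auth_failure":
        return {"action": "open_ticket",
                "summary": f"Invalid credentials on {event['resource']}"}
    return {"action": "ignore"}

result = handle_event({"type": "auth_failure", "resource": "vm-42"})
```

In AWS Lambda, Google Cloud Functions, or Azure Functions, this logic would live in the function entry point, and the returned action would call a ticketing or alerting API.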

Cloud event-driven functions are not limited to interacting with cloud components, although they run on the cloud. They can trigger actions anywhere, including on premises.

Terraform

Terraform is an open-source infrastructure as code (IaC) tool with a focus on provisioning and management. Terraform uses a declarative model, in which you define the intended desired state, and Terraform figures out the best way to achieve it; that is, Terraform gathers the current state, compares it to the desired state, and derives a finite set of steps to reach it. Terraform works best in cloud environments, where representing the infrastructure as code is easier.

Terraform uses the concept of a provider, which is an abstraction layer for interaction with the infrastructure. It supports many providers, including the following, and the list keeps expanding:

• Amazon Web Services

• Microsoft Azure

• Google Cloud Platform

• IBM Cloud

• VMware vSphere

• Cisco Systems

Terraform is command-line interface (CLI) based, so you just need to install it on a workstation to begin using it. You do not need a dedicated server, although it is common to use dedicated machines in production deployments. Another option is to use HashiCorp’s SaaS offering, which is often preferred by large enterprises.

To use Terraform, you define your desired resources in a .tf file. (You will see how to do this later in this section.)

For more complex workflows, in which your files become increasingly large, you can use modules. A Terraform module is a collection of files that you can import to abstract otherwise complex or extensive logic.

It is important to note that Terraform stores the state of your infrastructure and refreshes that state before performing any operation on it. That state is used to calculate the difference between what you currently have and what is defined in your desired .tf file.
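The declarative model can be illustrated with a toy plan function that diffs two states represented as dictionaries. This is a conceptual sketch, not how Terraform is actually implemented:

```python
# Toy model of the declarative workflow: given the stored state and the
# desired state (both as simple dicts here), compute what must change.
def plan(current_state, desired_state):
    """Return the create/destroy actions needed to reach the desired state."""
    return {
        "create": sorted(k for k in desired_state if k not in current_state),
        "destroy": sorted(k for k in current_state if k not in desired_state),
    }

# Hypothetical resource names for illustration
actions = plan(
    current_state={"aws_instance.old": {}},
    desired_state={"aws_instance.ec2name": {}},
)
```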

Using Terraform to create a virtual machine on Amazon Web Services entails the following steps, which are illustrated with examples:

Step 1. Create a .tf configuration file (see Example 1-2).

Step 2. Initialize Terraform (see Example 1-3).

Step 3. Optionally verify changes that are to be applied (see Example 1-4).

Step 4. Apply the changes (see Example 1-5).

Step 5. Optionally verify whether the applied changes are as expected (see Example 1-6).


Note

The following examples assume that Terraform is installed, and an AWS account is configured.


Example 1-2 Step 1: Creating a Configuration File

$ cat conf.tf
terraform {
  required_providers {
    aws = {
      source  = "hashicorp/aws"
    }
  }
}

provider "aws" {
  profile = "default"
  region  = "us-west-1"
}

resource "aws_instance" "ec2name" {
  ami           = "ami-08d9a394ac1c2994c"
  instance_type = "t2.micro"
}

Example 1-3 Step 2: Initializing Terraform

$ terraform init
Initializing the backend...

Initializing provider plugins...
- Finding latest version of hashicorp/aws...
- Installing hashicorp/aws v3.22.0...
- Installed hashicorp/aws v3.22.0 (signed by HashiCorp)

Terraform has created a lock file .terraform.lock.hcl to record the provider
selections it made above. Include this file in your version control repository
so that Terraform can guarantee to make the same selections by default when
you run "terraform init" in the future.

Terraform has been successfully initialized!

You may now begin working with Terraform. Try running "terraform plan" to see
any changes that are required for your infrastructure. All Terraform commands
should now work.

If you ever set or change modules or backend configuration for Terraform,
rerun this command to reinitialize your working directory. If you forget, other
commands will detect it and remind you to do so if necessary.

Example 1-4 Step 3: (Optional) Verifying the Changes That Are to Be Applied

$ terraform plan

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_instance.ec2name will be created
  + resource "aws_instance" "ec2name" {
      + ami                          = "ami-08d9a394ac1c2994c"
      + arn                          = (known after apply)
      + associate_public_ip_address  = (known after apply)
      + availability_zone            = (known after apply)
      + cpu_core_count               = (known after apply)
      + cpu_threads_per_core         = (known after apply)
      + get_password_data            = false
      + host_id                      = (known after apply)
      + id                           = (known after apply)
      + instance_state               = (known after apply)
      + instance_type                = "t2.micro"
      # Output omitted #

      + ebs_block_device {
          + device_name           = (known after apply)
          + volume_id             = (known after apply)
          + volume_size           = (known after apply)
          + volume_type           = (known after apply)
          # Output omitted #
        }

      + metadata_options {
          + http_endpoint               = (known after apply)
          + http_put_response_hop_limit = (known after apply)
          + http_tokens                 = (known after apply)
        }

      + network_interface {
          + delete_on_termination = (known after apply)
          + device_index          = (known after apply)
          + network_interface_id  = (known after apply)
        }
      + root_block_device {
          + delete_on_termination = (known after apply)
          + device_name           = (known after apply)
          + encrypted             = (known after apply)
          + iops                  = (known after apply)
          + kms_key_id            = (known after apply)
          + throughput            = (known after apply)
          + volume_id             = (known after apply)
          + volume_size           = (known after apply)
          + volume_type           = (known after apply)
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

------------------------------------------------------------------------

Note: You didn't specify an "-out" parameter to save this plan, so Terraform
can't guarantee that exactly these actions will be performed if
"terraform apply" is subsequently run.

Example 1-5 Step 4: Applying the Changes

$ terraform apply

An execution plan has been generated and is shown below.
Resource actions are indicated with the following symbols:
  + create

Terraform will perform the following actions:

  # aws_instance.ec2name will be created
  + resource "aws_instance" "ec2name" {
      + ami                          = "ami-08d9a394ac1c2994c"
      + arn                          = (known after apply)
      + associate_public_ip_address  = (known after apply)
      + availability_zone            = (known after apply)
      + cpu_core_count               = (known after apply)
      + cpu_threads_per_core         = (known after apply)
      + get_password_data            = false
      + host_id                      = (known after apply)
      + id                           = (known after apply)
      + instance_type                = "t2.micro"
      + source_dest_check            = true
      + subnet_id                    = (known after apply)
      # Output omitted #

      + ebs_block_device {
          + delete_on_termination = (known after apply)
          + device_name           = (known after apply)
          + iops                  = (known after apply)
          + volume_id             = (known after apply)
          + volume_size           = (known after apply)
          + volume_type           = (known after apply)
        }

      + metadata_options {
          + http_endpoint               = (known after apply)
          + http_put_response_hop_limit = (known after apply)
          + http_tokens                 = (known after apply)
        }

      + network_interface {
          + delete_on_termination = (known after apply)
          + device_index          = (known after apply)
          + network_interface_id  = (known after apply)
        }

      + root_block_device {
          + delete_on_termination = (known after apply)
          + device_name           = (known after apply)
          + encrypted             = (known after apply)
          + iops                  = (known after apply)
          + kms_key_id            = (known after apply)
          + throughput            = (known after apply)
          + volume_id             = (known after apply)
          + volume_size           = (known after apply)
          + volume_type           = (known after apply)
        }
    }

Plan: 1 to add, 0 to change, 0 to destroy.

Do you want to perform these actions?
  Terraform will perform the actions described above.
  Only 'yes' will be accepted to approve.

  Enter a value: yes

aws_instance.ec2name: Creating...
aws_instance.ec2name: Still creating... [10s elapsed]
aws_instance.ec2name: Creation complete after 19s [id=i-030cd63bd0a9041d6]

Apply complete! Resources: 1 added, 0 changed, 0 destroyed.

Example 1-6 Step 5: (Optional) Verifying Whether the Applied Changes Are as Expected

$ terraform show

# aws_instance.ec2name:
resource "aws_instance" "ec2name" {
    ami                          = "ami-08d9a394ac1c2994c"
    arn                          = "arn:aws:ec2:us-west-1:054654082556:instance/i-
030cd63bd0a9041d6"
    associate_public_ip_address  = true
    availability_zone            = "us-west-1b"
    cpu_core_count               = 1
    cpu_threads_per_core         = 1
    disable_api_termination      = false
    ebs_optimized                = false
    get_password_data            = false
    hibernation                  = false
    id                           = "i-030cd63bd0a9041d6"
    instance_state               = "running"
    instance_type                = "t2.micro"
    ipv6_address_count           = 0
    ipv6_addresses               = []
    monitoring                   = false
    primary_network_interface_id = "eni-04a667a8dce095463"
    private_dns                  = "ip-172-31-6-154.us-west-1.compute.internal"
    private_ip                   = "172.31.6.154"
    public_dns                   = "ec2-54-183-123-73.us-west-1.compute.amazonaws.com"
    public_ip                    = "54.183.123.73"
    secondary_private_ips        = []
    security_groups              = [
        "default",
    ]
    source_dest_check            = true
    subnet_id                    = "subnet-5d6aa007"
    tenancy                      = "default"
    volume_tags                  = {}
    vpc_security_group_ids       = [
        "sg-8b17eafd",
    ]

    credit_specification {
        cpu_credits = "standard"
    }

    enclave_options {
        enabled = false
    }

    metadata_options {
        http_endpoint               = "enabled"
        http_put_response_hop_limit = 1
        http_tokens                 = "optional"
    }

    root_block_device {
        delete_on_termination = true
        device_name           = "/dev/xvda"
        encrypted             = false
        iops                  = 100
        throughput            = 0
        volume_id             = "vol-0c0628e825e60d591"
        volume_size           = 8
        volume_type           = "gp2"
    }
}

This example shows that you can provision a virtual machine in just a few simple steps: initialize, optionally plan, and apply. It is just as easy to provision many virtual machines or different resource types by using Terraform.

Another advantage of using Terraform for provisioning operations is that it is not proprietary to a cloud vendor. Therefore, you can use similar syntax to provision resources across different environments, which is not the case when you use the cloud vendors’ provisioning tools. Cloud vendors’ tools typically have their own syntax and are not portable between clouds. Furthermore, Terraform supports other solutions that are not specific to the cloud, such as Cisco ACI, Cisco Intersight, and Red Hat OpenShift.

Ansible

Ansible is an open-source IaC configuration management tool, although it can be used for a wide variety of actions, such as systems management, application deployment, and orchestration. Ansible is developed in Python and has its own syntax, using the YAML format (see Example 1-7).

Ansible is agentless, which means you can configure resources without having to install any software agent on them. It commonly accesses resources by using SSH, but other protocols can be configured (for example, NETCONF or Telnet). Unlike Terraform, Ansible can use a declarative or a procedural approach; in this book, we focus on the declarative approach, as it is the recommended one. Ansible uses a push model.

Ansible can run from a workstation, but it is common to install it in a local server, from which it can more easily reach all the infrastructure it needs to manage.


Note

The following example assumes that Ansible is installed.


Example 1-7 Using Ansible to Configure a New TACACS+ Key

$ cat inventory.yml
all:
  children:
    ios:
      hosts:
        switch_1:
          ansible_host: "10.0.0.1"
          ansible_network_os: cisco.ios.ios
      vars:
        ansible_connection: ansible.netcommon.network_cli
        ansible_user: "username"
        ansible_password: "cisco123"

$ cat playbook.yml
---
- name: Ansible Playbook to change TACACS+ keys for IOS Devices
  hosts: ios
  connection: local
  gather_facts: no

  tasks:
    - name: Gather all IOS facts
      cisco.ios.ios_facts:
        gather_subset: all

    - name: Set the global TACACS+ server key
      cisco.ios.ios_config:
        lines:
        - "tacacs-server key new_key"

$ ansible-playbook playbook.yml -i inventory.yml
PLAY [Ansible Playbook to change TACACS+ keys for IOS Devices]
*********************************************************************

TASK [Gather all IOS facts]
*********************************************************************
ok: [switch_1]

TASK [Set the global TACACS+ server key]
*********************************************************************
changed: [switch_1]

PLAY RECAP *********************************************************************
switch_1        : ok=2    changed=1    unreachable=0    failed=0    skipped=0    rescued=0    ignored=0

In this example, Ansible gathers device information and changes the TACACS+ key to a new one. There is a single Cisco device in the inventory file, and you can see that the key change succeeds for that specific device.

You could use the facts gathered to determine whether to make the change or not; for example, you might want to execute the change only if a source interface is defined. (You will learn more about such complex playbook workflows in the following chapters.)

Ansible is very extensible. It can also be used to provision resources, interact with systems, collect information, and even perform network migrations.


Note

Do not be worried if you don’t yet fully understand Example 1-7. Chapters 4 and 5 describe Ansible’s installation and concepts (including playbooks, inventory, and roles), as well as how to achieve common networking tasks.


Chef

Chef is a configuration management tool written in Ruby. It allows you to define IaC but, unlike the tools described so far, uses a pull model. Chef uses a procedural language.

Chef configurations are described in recipes (see Example 1-8). Collections of recipes are stored in cookbooks.

Typically, a cookbook represents a single task (for example, configuring a web server). You can use a community cookbook instead of building your own; community cookbooks cover most of the common tasks.

Chef’s architecture consists of three components (see Figure 1-7):

Nodes: A node represents a managed system. Nodes can be of multiple types and are the ultimate targets you want to configure.

Chef server: The Chef server provides a centralized place to store cookbooks and their respective recipes, along with metadata for the registered nodes. Nodes and the Chef server interact directly.

Workstation: A user uses a workstation to interact with Chef. A workstation is also where you develop and test cookbooks.

Chef is targeted at managing servers, as it requires the installation of an agent on each node. Nonetheless, it can manage network infrastructure that allows the installation of the agent (for example, Cisco Nexus 9000 switches).

Images

Figure 1-7 Chef Architecture


Note

The following example assumes that the Chef architecture is in place.


Example 1-8 Chef Recipe to Configure New Web Servers, Depending on Operating System Platform

package 'Install webserver' do
  case node[:platform]
  when 'redhat', 'centos'
    package_name 'httpd'
    action :install
  when 'ubuntu', 'debian'
    package_name 'apache2'
    action :install
  end
end

Chef managed systems (nodes) are constantly monitored to ensure that they are in the desired state. This helps avoid configuration drift.
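The idea behind drift detection can be sketched in a few lines of Python. This is an illustration of the concept only, not Chef itself; the configuration keys and values are made up:

```python
# Hypothetical sketch of configuration drift detection: compare the
# desired configuration of a node against what is actually running
# and report any setting that is missing or has changed.

def find_drift(desired: dict, running: dict) -> dict:
    """Return the settings that differ between desired and actual state."""
    drift = {}
    for key, value in desired.items():
        if running.get(key) != value:
            drift[key] = {"desired": value, "actual": running.get(key)}
    return drift

desired = {
    "ntp server": "10.0.0.1",
    "snmp-server community": "readonly",
    "logging host": "10.0.0.50",
}

running = {
    "ntp server": "10.0.0.1",
    "snmp-server community": "public",  # this value drifted
    # "logging host" is missing entirely
}

print(find_drift(desired, running))
```

A configuration management tool runs this kind of comparison continuously and then remediates the differences, which is what keeps nodes from drifting away from their desired state.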

Kibana

Kibana is an open-source tool for data visualization and monitoring. It is simple to install and use.

Kibana is part of the ELK stack, a collection of three tools: Elasticsearch, Logstash, and Kibana. Logstash extracts information from logs (network or application logs) collected from different sources and stores it in Elasticsearch. Kibana is the visualization engine for the stack. A fourth common component is Beats, a data shipper: Beats agents installed in the infrastructure capture data and ship it to Logstash.

It is a common misconception that Kibana collects data directly from a network; it does not. It relies on the other stack members to do the collection.

You can use Kibana to set up dashboards and alerts and to do analytics based on parsing rules that you define (see Figure 1-8). It is a tool focused on log data processing.

Kibana offers powerful visualization tools, such as histograms and pie charts, but it is most often used together with Elasticsearch as a visualization and query engine for collected logs.

Images

Figure 1-8 Kibana Search Dashboard

Grafana

Grafana is an open-source tool for data visualization, analytics, and monitoring. It can be deployed on premises or in the cloud, and there is also a managed SaaS offering. Whereas Kibana focuses on log data, Grafana is focused on metrics. You can use it to set up dashboards, alerts, and graphs from network data. You might use it, for example, to visualize the following:

• CPU utilization

• Memory utilization

• I/O operations per second

• Remaining disk space

Figure 1-9 shows an example of a Grafana dashboard that shows these metrics and more.

Images

Figure 1-9 Grafana Dashboard

Grafana does not store or collect data. It receives data from other sources where these types of metrics are stored, typically time series databases (for example, InfluxDB). However, it can also integrate with other database types (such as PostgreSQL) and cloud monitoring systems (such as AWS CloudWatch). For data to be populated in these databases, you must use other tools (for example, an SNMP-based collector).
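As a rough illustration of how such a metric reaches a time series database, the following Python sketch formats a CPU utilization sample as an InfluxDB line protocol string. The measurement name, tags, and values are hypothetical, and in practice a collector would generate and write these points for you:

```python
import time

def to_line_protocol(measurement: str, tags: dict, fields: dict,
                     timestamp_ns: int) -> str:
    """Format one data point in InfluxDB line protocol:
    measurement,tag=value field=value timestamp
    (escaping of special characters is omitted for brevity)."""
    tag_str = ",".join(f"{k}={v}" for k, v in tags.items())
    field_str = ",".join(f"{k}={v}" for k, v in fields.items())
    return f"{measurement},{tag_str} {field_str} {timestamp_ns}"

# Hypothetical sample: CPU utilization of a router, tagged by hostname
point = to_line_protocol(
    "cpu_utilization",
    {"host": "router1", "region": "emea"},
    {"percent": 73.5},
    int(time.time() * 1e9),  # InfluxDB expects nanosecond timestamps
)
print(point)
```

A poller would write such points to the database, and Grafana would then query them to render dashboard panels like the ones in Figure 1-9.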

Because it is able to pull data from many sources, Grafana can provide a consolidated view that other tools fail to provide.

Some may not consider Grafana a network automation solution, but its monitoring capabilities, which you can use to track errors or application and equipment behaviors, along with its alerting capabilities, earn it a spot on this list.

Splunk

Splunk is a distributed system that aggregates, parses, and analyzes data. Like Grafana, it focuses on monitoring, alerting, and data visualization. There are three different Splunk offerings: a free version, an enterprise version, and a SaaS cloud version.

Although known in the networking industry for its security capabilities, Splunk is on this list of automation tools due to its data ingestion capabilities. It can ingest many different types of data, including the following networking-related types:

• Windows data (registry, event log, and filesystem data)

• Linux data (Syslog data and statistics)

• SNMP

• Syslog

• NetFlow/IPFIX

• Application logs

The Splunk architecture consists of three components:

• Splunk forwarder

• Splunk indexer

• Splunk search head

The Splunk forwarder is a software agent that is installed on endpoints to collect and forward logs to the Splunk indexer. The Splunk forwarder is needed only when the endpoint that is producing the logs does not send them automatically, as in the case of an application. In the case of a router, you could point the Syslog destination directly at the indexer, bypassing the need for a forwarder.

There are two types of Splunk forwarders:

Universal forwarder: The most commonly used type is the universal forwarder, which is more lightweight than its counterpart. Universal forwarders do not do any preprocessing but send the data as they collect it (that is, raw data). This results in more transmitted data.

Heavy forwarder: Heavy forwarders perform parsing and send only indexed data to the indexer. They reduce transmitted data and decentralize the processing. However, this type of forwarder requires host machines to have processing capabilities.

The Splunk indexer transforms collected data into events and stores them. The transformations can entail many different operations, such as applying timestamps, adding source information, or doing user-defined operations (for example, filtering unwanted logs). The Splunk indexer works with universal forwarders; with heavy forwarders, the Splunk indexer only stores the events and does not perform any transformation. You can use multiple Splunk indexers to improve ingestion and transformation performance.
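The kind of transformation the indexer performs can be sketched in Python. This is a conceptual illustration only, not Splunk's actual processing pipeline; the event fields and the filter rule are made up:

```python
from datetime import datetime, timezone
from typing import Optional

def index_event(raw_log: str, source: str) -> Optional[dict]:
    """Turn a raw log line into a stored event: apply a timestamp,
    add source information, and filter unwanted entries."""
    if "DEBUG" in raw_log:  # user-defined filter: drop noisy debug logs
        return None
    return {
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "source": source,
        "raw": raw_log,
    }

events = [
    index_event("%LINK-3-UPDOWN: Interface Gi0/1, changed state to down",
                "router1"),
    index_event("DEBUG: keepalive sent", "router1"),
]
stored = [e for e in events if e is not None]
print(len(stored))  # prints 1: the DEBUG line was filtered out
```

With a heavy forwarder, this kind of parsing happens before the data leaves the endpoint; with a universal forwarder, the indexer does it after ingestion.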

The Splunk search head provides a GUI, a CLI, and an API for users to search and query for specific information. In a distributed architecture with multiple indexers, the search head queries several indexers and aggregates the results before displaying them to the user. In addition, you can define alerts based on received data and generate reports.

Figure 1-10 shows an example of a Splunk dashboard that shows software variations among a firewall installation base along with the number of devices and their respective logical context configurations.

Images

Figure 1-10 Splunk Firewall Information Dashboard

Python

Although Python is not an automation tool per se, it provides the building blocks to make an automation tool. Many of the automation tools described in this book are actually built using Python (for example, Ansible).

Other programming languages could also be used for automation, but Python has a lot of prebuilt libraries that make it a perfect candidate. The following are some examples of Python libraries used to automate networking tasks:

• Selenium

• Openpyxl

• Pyautogui

• Paramiko

• Netmiko

The main advantage of Python is that it can work with any device type, any platform, any vendor, and any system with any version. However, its advantage is also its drawback: The many possible customizations can make it complex and hard to maintain.

When using Python, you are to some extent building a tool rather than using a prebuilt one, which takes more effort to develop and maintain.

Example 1-9 defines a device with all its connection properties, such as credentials and port. The example uses the Netmiko Python module to connect to the defined device and issue the show ip int brief command, followed by the Cisco CLI commands to create a new interface VLAN 10 with the IP address 10.10.10.1. Finally, the example issues the same show command again to verify that the change was applied.

You can see that this example defines the device type. However, the Python library supports a huge variety of platforms. In the event that your platform is not supported, you could use another library and still use Python. This is an example of the flexibility mentioned earlier. A similar principle applies to the commands issued, as you can replace the ones shown in this example with any desired commands.

Example 1-9 Using Python to Configure and Verify the Creation of a New Interface VLAN

from netmiko import ConnectHandler

device = {
    "device_type": "cisco_ios",
    "ip": "10.201.23.176",
    "username": "admin",
    "password": "cisco123",
    "port": 22,
    "secret": "cisco123",
}

ssh_connect = ConnectHandler(**device)

print("BEFORE")
result = ssh_connect.send_command("show ip int brief")
print(result)

configcmds = ["interface vlan 10", "ip add 10.10.10.1 255.255.255.0"]
ssh_connect.send_config_set(configcmds)

print("AFTER")
result = ssh_connect.send_command("show ip int brief")
ssh_connect.disconnect()
print("")
print(result)

Figure 1-11 shows a sample execution of the script in Example 1-9.

Images

Figure 1-11 Sample Python Script Execution to Change an Interface IP Address

Summary

We have covered three types of automation in this chapter:

Data-driven automation: Actions triggered by data

Task-based automation: Manually triggered tasks

End-to-end automation: A combination of the other two types

This chapter describes a number of use cases to spark your imagination about what you might want to automate in your company. Now you know several situations where automation can enable you to achieve better results.

This chapter also takes a look at several automation tools. It describes what they are and in which use cases they shine. In later chapters we will dive deeper into some of these tools.

Review Questions

You can find answers to these questions in Appendix A, “Answers to Review Questions.”

1. You must append a prefix to the hostname of each of your devices, based on the device’s physical location. You can determine the location based on its management IP address. What type of automation is this?

a. Data-driven

b. Task-based

c. End-to-end

2. What use case is described in the question 1 scenario?

a. Optimization

b. Configuration

c. Data collection

d. Migration

3. Lately, your network has been underperforming in terms of speed, and it has been bandwidth constrained. You decide to monitor a subset of key devices that you think are responsible for this behavior. In this scenario, which model is more suitable for data collection?

a. Push model

b. Pull model

4. Which model do SNMP traps use?

a. Push model

b. Pull model

5. Describe configuration drift and how you can address it in the context of network devices.

6. Which of these tools can help you to achieve a configuration use case in an environment with equipment from multiple vendors and different software versions?

a. Splunk

b. Ansible

c. DNA Center

d. Grafana

e. Kibana

f. Terraform

7. The number of users accessing your web applications has grown exponentially in the past couple months, and your on-premises infrastructure cannot keep up with the demand. Your company is considering adopting cloud services. Which of the following tools can help with automated provisioning of resources in the cloud, including networking components? (Choose two.)

a. Splunk

b. Cloud event-driven functions

c. DNA Center

d. Terraform

e. Python

8. The number of servers in your company has kept growing in the past 5 years. Currently, you have 1000 servers, and it has become impossible to manually manage them. Which of the following tools could you adopt to help manage your servers? (Choose two.)

a. Kibana

b. Terraform

c. DNA Center

d. Chef

e. Ansible

9. Your company has not been using the log data from devices except when an issue comes up, and someone needs to investigate the logs during the troubleshooting sessions. There is no centralized logging, even when performing troubleshooting, and engineers must connect to each device individually to view the logs. Which of the following tools could address this situation by centralizing the log data and providing dashboards and proactive alerts? (Choose two.)

a. Kibana

b. Grafana

c. Splunk

d. DNA Center

10. You are using Terraform to provision your cloud infrastructure in AWS. Which of the following is a valid step but is not required?

a. Initialize

b. Plan

c. Apply

d. Create

11. You are tasked with automating the configuration of several network devices. Due to the very strict policy rules of your company’s security department, you may not install any agent on the target devices. Which of the following tools could help you achieve automation in this case?

a. Chef

b. Terraform

c. Ansible

d. DNA Center

12. True or false: Kibana excels at building visualization dashboards and setting up alerts based on metric data.

a. True

b. False

13. True or false: You can use cloud event-driven functions to interact with on-premises components.

a. True

b. False
