For this chapter, we will build a very simple network management application, and then write different types of tests and check their coverage. This network management application is focused on digesting alarms, also referred to as network events. This is different from certain other network management tools that focus on gathering SNMP alarms from devices.
For reasons of simplicity, this correlation engine doesn't contain complex rules, but instead contains simple mapping of network events onto equipment and customer service inventory. We'll explore this in the next few paragraphs as we dig through the code.
With the following steps, we will build a simple network management application.
network.py
to store the network application.class Event(object): def __init__(self, hostname, condition, severity, event_time): self.hostname = hostname self.condition = condition self.severity = severity self.id = -1 def __str__(self): return "(ID:%s) %s:%s - %s" % (self.id, self.hostname, self.condition, self.severity)
hostname
: It is assumed that all network alarms originate from pieces of equipment that have a hostname.condition
: Indicates the type of alarm being generated. Two different alarming conditions can come from the same device.severity
: 1
indicates a clear, green status; and 5
indicates a faulty, red status.id
: The primary key value used when the event is stored in a database.network.sql
to contain the SQL code.CREATE TABLE EVENTS ( ID INTEGER PRIMARY KEY, HOST_NAME TEXT, SEVERITY INTEGER, EVENT_CONDITION TEXT );
network.py
.from springpython.database.core import* class EventCorrelator(object): def __init__(self, factory): self.dt = DatabaseTemplate(factory) def __del__(self): del(self.dt) def process(self, event): stored_event, is_active = self.store_event(event) affected_services, affected_equip = self.impact(event) updated_services = [ self.update_service(service, event) for service in affected_services] updated_equipment = [ self.update_equipment(equip, event) for equip in affected_equip] return (stored_event, is_active, updated_services, updated_equipment)
The __init__
method
contains some setup code to create a DatabaseTemplate
. This is a Spring Python utility class used for database operations. See http://static.springsource.org/spring-python/1.2.x/sphinx/html/dao.html for more details. We are also using sqlite3 as our database engine, since it is a standard part of Python.
The process
method contains some simple steps to process an incoming event.
EVENTS
table. This includes evaluating whether or not it is an active event, meaning that it is actively impacting a piece of equipment.store_event
algorithm.def store_event(self, event): try: max_id = self.dt.query_for_int("""select max(ID) from EVENTS""") except DataAccessException, e: max_id = 0 event.id = max_id+1 self.dt.update("""insert into EVENTS (ID, HOST_NAME, SEVERITY, EVENT_CONDITION) values (?,?,?,?)""", (event.id, event.hostname, event.severity, event.condition)) is_active = self.add_or_remove_from_active_events(event) return (event, is_active)
This method stores every event that is processed. This supports many things including data mining and post mortem analysis of outages. It is also the authoritative place where other event-related data can point back using a foreign key.
store_event
method looks up the maximum primary key value from the EVENTS
table.event.id
.EVENTS
table.def add_or_remove_from_active_events(self, event): """Active events are current ones that cause equipment and/or services to be down.""" if event.severity == 1: self.dt.update("""delete from ACTIVE_EVENTS where EVENT_FK in ( select ID from EVENTS where HOST_NAME = ? and EVENT_CONDITION = ?)""", (event.hostname,event.condition)) return False else: self.dt.execute("""insert into ACTIVE_EVENTS (EVENT_FK) values (?)""", (event.id,)) return True
When a device fails, it sends a severity
5
event. This is an active event and in this method, a row is inserted into the ACTIVE_EVENTS
table, with a foreign key pointing back to the EVENTS
table. Then we return back True
, indicating this is an active event.
ACTIVE_EVENTS
to the SQL script.CREATE TABLE ACTIVE_EVENTS ( ID INTEGER PRIMARY KEY, EVENT_FK, FOREIGN KEY(EVENT_FK) REFERENCES EVENTS(ID) );
This table makes it easy to query what events are currently causing equipment failures.
Later, when the failing condition on the device clears, it sends a severity
1
event. This means that severity
1
events are never active, since they aren't contributing to a piece of equipment being down. In our previous method, we search for any active events that have the same hostname and condition, and delete them. Then we return False
, indicating this is not an active event.
def impact(self, event): """Look up this event has impact on either equipment or services.""" affected_equipment = self.dt.query( """select * from EQUIPMENT where HOST_NAME = ?""", (event.hostname,), rowhandler=DictionaryRowMapper()) affected_services = self.dt.query( """select SERVICE.* from SERVICE join SERVICE_MAPPING SM on (SERVICE.ID = SM.SERVICE_FK) join EQUIPMENT on (SM.EQUIPMENT_FK = EQUIPMENT.ID) where EQUIPMENT.HOST_NAME = ?""", (event.hostname,), rowhandler=DictionaryRowMapper()) return (affected_services, affected_equipment)
EQUIPMENT
table to see if event.hostname
matches anything.SERVICE
table to the EQUIPMENT
table through a many-to-many relationship tracked by the SERVICE_MAPPING
table. Any service that is related to the equipment that the event was reported on is captured.EQUIPMENT
, SERVICE
, and SERVICE_MAPPING
.CREATE TABLE EQUIPMENT ( ID INTEGER PRIMARY KEY, HOST_NAME TEXT UNIQUE, STATUS INTEGER ); CREATE TABLE SERVICE ( ID INTEGER PRIMARY KEY, NAME TEXT UNIQUE, STATUS TEXT ); CREATE TABLE SERVICE_MAPPING ( ID INTEGER PRIMARY KEY, SERVICE_FK, EQUIPMENT_FK, FOREIGN KEY(SERVICE_FK) REFERENCES SERVICE(ID), FOREIGN KEY(EQUIPMENT_FK) REFERENCES EQUIPMENT(ID) );
update_service
method that stores or clears service-related events, and then updates the service's status based on the remaining active events.def update_service(self, service, event): if event.severity == 1: self.dt.update("""delete from SERVICE_EVENTS where EVENT_FK in ( select ID from EVENTS where HOST_NAME = ? and EVENT_CONDITION = ?)""", (event.hostname,event.condition)) else: self.dt.execute("""insert into SERVICE_EVENTS (EVENT_FK, SERVICE_FK) values (?,?)""", (event.id,service["ID"])) try: max = self.dt.query_for_int( """select max(EVENTS.SEVERITY) from SERVICE_EVENTS SE join EVENTS on (EVENTS.ID = SE.EVENT_FK) join SERVICE on (SERVICE.ID = SE.SERVICE_FK) where SERVICE.NAME = ?""", (service["NAME"],)) except DataAccessException, e: max = 1 if max > 1 and service["STATUS"] == "Operational": service["STATUS"] = "Outage" self.dt.update("""update SERVICE set STATUS = ? where ID = ?""", (service["STATUS"], service["ID"])) if max == 1 and service["STATUS"] == "Outage": service["STATUS"] = "Operational" self.dt.update("""update SERVICE set STATUS = ? where ID = ?""", (service["STATUS"], service["ID"])) if event.severity == 1: return {"service":service, "is_active":False} else: return {"service":service, "is_active":True}
Service-related events are active events related to a service. A single event can be related to many services. For example, what if we were monitoring a wireless router that provided Internet service to a lot of users, and it reported a critical error? This one event would be mapped as an impact to all the end users. When a new active event is processed, it is stored in SERVICE_EVENTS
for each related service.
Then, when a clearing event is processed, the previous service event must be deleted from the SERVICE_EVENTS
table.
SERVICE_EVENTS
to the SQL script.CREATE TABLE SERVICE_EVENTS ( ID INTEGER PRIMARY KEY, SERVICE_FK, EVENT_FK, FOREIGN KEY(SERVICE_FK) REFERENCES SERVICE(ID), FOREIGN KEY(EVENT_FK) REFERENCES EVENTS(ID) );
DROP TABLE IF EXISTS SERVICE_MAPPING; DROP TABLE IF EXISTS SERVICE_EVENTS; DROP TABLE IF EXISTS ACTIVE_EVENTS; DROP TABLE IF EXISTS EQUIPMENT; DROP TABLE IF EXISTS SERVICE; DROP TABLE IF EXISTS EVENTS;
INSERT into EQUIPMENT (ID, HOST_NAME, STATUS) values (1, 'pyhost1', 1); INSERT into EQUIPMENT (ID, HOST_NAME, STATUS) values (2, 'pyhost2', 1); INSERT into EQUIPMENT (ID, HOST_NAME, STATUS) values (3, 'pyhost3', 1); INSERT into SERVICE (ID, NAME, STATUS) values (1, 'service-abc', 'Operational'), INSERT into SERVICE (ID, NAME, STATUS) values (2, 'service-xyz', 'Outage'), INSERT into SERVICE_MAPPING (SERVICE_FK, EQUIPMENT_FK) values (1,1); INSERT into SERVICE_MAPPING (SERVICE_FK, EQUIPMENT_FK) values (1,2); INSERT into SERVICE_MAPPING (SERVICE_FK, EQUIPMENT_FK) values (2,1); INSERT into SERVICE_MAPPING (SERVICE_FK, EQUIPMENT_FK) values (2,3);
def update_equipment(self, equip, event): try: max = self.dt.query_for_int( """select max(EVENTS.SEVERITY) from ACTIVE_EVENTS AE join EVENTS on (EVENTS.ID = AE.EVENT_FK) where EVENTS.HOST_NAME = ?""", (event.hostname,)) except DataAccessException: max = 1 if max != equip["STATUS"]: equip["STATUS"] = max self.dt.update("""update EQUIPMENT set STATUS = ?""", (equip["STATUS"],)) return equip
Here, we need to find the maximum severity from the list of active events for a given host name. If there are no active events, then Spring Python raises a DataAccessException
and we translate that to a severity of 1
.
We check if this is different from the existing device's status. If so, we issue a SQL update. Finally, we return the record for the device, with its status updated appropriately.
This application uses a database-backed mechanism to process incoming network events, and checks them against the inventory of equipment and services to evaluate failures and restorations. Our application doesn't handle specialized devices or unusual types of services. This real-world complexity has been traded in for a relatively simple application, which can be used to write various test recipes.
Events typically map to a single piece of equipment and to zero or more services. A service can be thought of as a string of equipment used to provide a type of service to the customer. New failing events are considered active until a clearing event arrives. Active events, when aggregated against a piece of equipment, define its current status. Active events, when aggregated against a service, defines the service's current status.
18.223.170.223