No discussion of network management in IP networks would be complete without including the Simple Network Management Protocol (SNMP). I want to stress that SNMP is primarily used for fault management and, to a lesser extent, for configuration and performance management. It is definitely not the only tool required for a complete network-management system, but it is an important one.
SNMP is a UDP-based network protocol. It has been adapted to run over IPX, as well as IP. However, IP is by far the most common network protocol for SNMP.
SNMP has three general functions. It can request information from a remote device using a get command. It can be used to configure the remote device with a set command. Or, the remote device can send information to the network-management server without having been prompted, which is called a trap. In general, a trap can be sent for any reason deemed useful or appropriate for a particular device, but the main application is alerting the network-management server to a failure of some kind.
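For concreteness, here is a rough sketch of what the get and set operations look like from the manager's side. It assumes the freely available Net-SNMP command-line tools are installed; the hostname and the community strings ("public" and "private") are placeholders for your own values. Traps travel in the opposite direction and are not shown here.

```python
# Minimal manager-side get and set, shelling out to the Net-SNMP tools.
# Hostname and community strings below are placeholders, not real values.
import subprocess

def snmp_get(host: str, community: str, oid: str) -> str:
    """Read one variable from the remote agent (-Oqv prints only the value)."""
    out = subprocess.run(
        ["snmpget", "-v2c", "-c", community, "-Oqv", host, oid],
        capture_output=True, text=True, check=True)
    return out.stdout.strip()

def snmp_set(host: str, community: str, oid: str, type_code: str, value: str) -> str:
    """Write one variable on the remote agent (type_code 's' means string)."""
    out = subprocess.run(
        ["snmpset", "-v2c", "-c", community, host, oid, type_code, value],
        capture_output=True, text=True, check=True)
    return out.stdout.strip()

if __name__ == "__main__":
    # Ask the device how long its agent has been running.
    print(snmp_get("router1.example.com", "public", "SNMPv2-MIB::sysUpTime.0"))
    # Change its administrative contact (only works if write access is enabled).
    print(snmp_set("router1.example.com", "private", "SNMPv2-MIB::sysContact.0",
                   "s", "noc@example.com"))
```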
In general, two types of devices speak SNMP. The remote device that is managed has a relatively small engine called an SNMP agent. The agent is a piece of software that responds to get and set packets. It also monitors the functioning of the device it runs on and sends out trap packets whenever certain conditions are met.
The other general type of device is the SNMP server. This server is typically a relatively powerful computer whose only function is to monitor the network. The server polls remote devices using get and set commands. The IP address of the server is configured in the remote agents so that they will know where to send trap messages.
Many network engineers prefer not to use SNMP for configuration. This is because they believe the model has too many serious security problems, making it relatively easy for an attacker to reconfigure key pieces of network equipment. These weaknesses make SNMP a risky way to manage device configuration. If security is a concern, then turning off SNMP write access on your network devices is worthwhile.
There are several commercial SNMP server systems. They usually come with a number of complex features such as the ability to discover and map the network and display it graphically. Almost all modern network equipment includes an SNMP agent, at least as an optional feature.
The amount of information that can be exchanged with SNMP is enormous. Every device that has an SNMP agent keeps track of a few basic variables that the server can query with get commands. Thousands of other optional variables are appropriate for different types of devices. For example, a router with a Token Ring interface allows the server to poll for special parameters that are relevant to Token Ring. If this router doesn't have any Ethernet ports, then it doesn't make sense for it to keep track of collisions, since there will never be any. However, it does need to keep track of beacon events, for example.
This same router also has a number of special-purpose variables that are unique to this type of equipment and this particular vendor. All of these different variables are accessed by a large tree structure called the Management Information Base (MIB). People talk about "the MIB" and different vendor-specific "MIBs." However, it is all one large database. The only difference is that some parts of it are used on some types of devices, some parts of it are defined by particular hardware vendors, and others are globally relevant. Thus, I prefer to talk about vendor- or technology-specific "MIB extensions."
Every network-hardware vendor has its own set of MIB extensions. These extensions allow different vendors to implement special customizations that express how they handle different interface types, for example. They also allow the different vendors to give information on things such as CPU load and memory utilization in a way that is meaningful to their particular hardware configuration.
Three different revisions of SNMP are currently in popular use: SNMPv1, SNMPv2, and SNMPv3. The differences between these revisions are relatively subtle and primarily concern factors such as security. The important thing is to ensure that your SNMP server knows which version of SNMP the agent on each device expects to speak. Most networks wind up being a hybrid of these different SNMP flavors.
In general, a network is monitored with a combination of polling and trapping. Devices are polled on a schedule—every few minutes, for example. But you need a way to determine if something bad has happened in between polls. This requires the device to send trap packets whenever important events occur. On the other hand, traps alone are not sufficient because some failures prevent the remote device from sending a trap. If the failure you are concerned about loses the only network path from the remote device to the network-management server, then there is no way to deliver the trap. Thus, failures of this type can only be seen by polling, so any successful network-management system always uses a combination of polling and trapping.
Setting an appropriate polling interval is one of the most important network-management decisions. You want to poll as often as possible so that you will know as soon as something has failed. However, polling a device too frequently can have two bad side effects.
First, polling too often, particularly on slow WAN links, has the potential to cause serious bandwidth problems. For example, suppose each poll and each response is a 1500-byte packet. Then, each time you poll, you send a total of 3000 bytes through the network. If you poll each of 100 different remote devices through the same WAN serial interface (a common configuration in Frame Relay networks), then each poll cycle generates 300 kilobytes of traffic. Therefore, if you poll each of these devices once every 30 seconds, then polling alone generates an average of 10 kilobytes per second, or 80kbps, on the link.
These numbers are relatively small, but in a large network they can grow very quickly. If instead of polling 100 devices, you have a network with 100,000 devices, then that 80kbps becomes 80Mbps, which is a serious load even for a Fast Ethernet segment.
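The arithmetic scales linearly, so it is easy to check these figures with a few lines of code. The numbers below simply restate the assumptions from the example (1500-byte packets, a 30-second interval); they are not measurements from a real network.

```python
# Back-of-the-envelope polling load, using the assumptions from the text:
# one 1500-byte request and one 1500-byte response per device per poll.
POLL_BYTES = 1500 + 1500        # bytes on the wire per polled device
POLL_INTERVAL = 30              # seconds between poll cycles

def polling_load_bps(devices: int) -> float:
    """Average polling traffic in bits per second on a shared link."""
    bytes_per_cycle = POLL_BYTES * devices
    return bytes_per_cycle * 8 / POLL_INTERVAL

for n in (100, 100_000):
    print(f"{n:>7} devices -> {polling_load_bps(n) / 1000:,.0f} kbps")
# Prints roughly 80 kbps for 100 devices and 80,000 kbps (80 Mbps) for 100,000.
```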
The second problem with a short polling interval, however, is much more dangerous. Consider the example of 100 remote devices again, and suppose one of these devices is not available. The usual prescription is that the server will try three to five times, waiting a default timeout period for a response each time. The default timeout is usually between 1 and 5 seconds, so the server may have to wait between 3 and 25 seconds on this device before it can move on to the next one in the list. As a result, if there are several simultaneous problems, or a single problem affects several downstream devices, the management server can get stuck in its polling cycle. When this happens, it spends so much time trying to contact the devices that are not available that it can no longer effectively monitor the ones that are still up.
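To see why this matters, it helps to estimate how long one pass through the polling list takes when some devices do not answer. The sketch below assumes strictly sequential polling, three retries with a 3-second timeout for a dead device, and a 50-millisecond response from a healthy one; all of these figures are illustrative.

```python
# Rough worst-case length of one sequential poll cycle when some devices
# are unreachable. Retry count and timeout are mid-range assumptions.
def cycle_time(devices: int, dead: int,
               retries: int = 3, timeout: float = 3.0,
               normal_response: float = 0.05) -> float:
    """Seconds to work through the whole polling list once."""
    healthy = (devices - dead) * normal_response
    unreachable = dead * retries * timeout   # each dead device burns retries * timeout
    return healthy + unreachable

# 100 devices on a 30-second polling interval: a handful of dead devices
# is enough to blow the budget.
for dead in (0, 1, 5, 10):
    print(f"{dead:>2} unreachable devices -> cycle takes {cycle_time(100, dead):.0f} s")
```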
A number of different SNMP server vendors have come up with different ways of getting around this polling-interval problem. Some vendors allow the server to know about downstream dependencies—if a router fails, then the server stops trying to contact the devices behind it.
Another clever method for dealing with the same problem is to break up the queue of devices to be polled into a number of shorter queues. These shorter queues are then balanced so that they can poll every device within one polling cycle even if most devices in the list are unreachable. The most extreme example of this is when the queues contain only one poll each. This means that all polling is completely asynchronous, so no failure on one device can delay the polling for another device. This situation loses some of the efficiencies of using queues, however, and may consume significantly more memory and CPU resources on the server. Some servers can use some variation of both methods simultaneously for maximum efficiency.
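One way to picture the shorter-queues approach is as a pool of independent workers, each draining its own slice of the device list, so that an unreachable device delays only its own queue. The sketch below shows just the scheduling idea, not any vendor's implementation; poll_device() is a stand-in that merely sleeps.

```python
# Sketch of splitting one long polling list into several independent queues.
# poll_device() is a placeholder for a real SNMP poll; here it only sleeps.
import queue
import threading
import time

def poll_device(device: str) -> None:
    time.sleep(0.1)          # pretend this is an SNMP get with a short timeout
    print(f"polled {device}")

def worker(q: "queue.Queue[str]") -> None:
    while True:
        try:
            device = q.get_nowait()
        except queue.Empty:
            return
        poll_device(device)   # a slow or dead device only stalls this worker
        q.task_done()

def poll_all(devices: list[str], num_queues: int = 8) -> None:
    queues = [queue.Queue() for _ in range(num_queues)]
    for i, device in enumerate(devices):
        queues[i % num_queues].put(device)     # round-robin into short queues
    threads = [threading.Thread(target=worker, args=(q,)) for q in queues]
    for t in threads:
        t.start()
    for t in threads:
        t.join()

poll_all([f"device{i}" for i in range(40)])
```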
Whether the server discovers a problem by means of a poll or a trap, it then has to do something with this information. Most commercial network-management systems include a graphical-display feature that allows the network manager to see at a glance when there is a problem anywhere on the network. This idea sounds great, but in practice, it is less useful than it appears. The problem is that, in a very large network, a few devices are always in trouble, so the network manager simply gets used to seeing a certain amount of red flashing trouble indicators. To tell when a new failure has really occurred, somebody would have to watch the screen for changes constantly, which is a good way to strain your eyes, so network managers have developed other methods for dealing with this problem.
Some people don't look at the map, but look at a carefully filtered text-based list of problems. This list can be filtered and sorted by problem type. It is even possible to have these text messages sent automatically to the alphanumeric pagers of the appropriate network engineers.
Another popular system is to use the network-management software to open trouble tickets automatically. These tickets must be manually verified by staff on a help desk. If they see no real problem, they close the ticket. If they do see a real problem, then they escalate appropriately.
Any combination of solutions like this should work well, but beware of network-management solutions that are purely graphical because they are only useful in very small networks.
SNMP monitoring has many uses. Until now I have focused on fault management. But it can also generate useful performance-management data. For example, one of the simplest things you can do is set up the server to simply send a get message to find out the number of bytes that were sent by a particular interface. If this poll is done periodically—say, every five minutes—the data can be graphed to show the outbound utilization on the port. In this way, you can readily obtain large historical databases of trunk utilization for every trunk in the network. Usually, the only limitation on this sort of monitoring is the amount of physical storage on the server.
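As a minimal sketch of this idea, the loop below samples ifOutOctets every five minutes and converts each delta into a utilization figure. The interface speed (100Mbps here) and the snmp_get() helper are assumptions; a real poller would store the samples for graphing rather than print them, and would use the 64-bit ifHCOutOctets counter on fast interfaces.

```python
# Periodic utilization sampling from ifOutOctets. snmp_get() is a placeholder
# for whatever mechanism you use to read a single counter from the device.
import time

def snmp_get(host: str, oid: str) -> int:
    raise NotImplementedError("wrap your SNMP library or the Net-SNMP tools here")

def sample_utilization(host: str, if_index: int, interval: int = 300,
                       if_speed_bps: int = 100_000_000) -> None:
    """Print outbound utilization of one interface every `interval` seconds."""
    oid = f"IF-MIB::ifOutOctets.{if_index}"
    previous = snmp_get(host, oid)
    while True:
        time.sleep(interval)
        current = snmp_get(host, oid)
        delta = (current - previous) % 2**32    # crude handling of 32-bit counter wrap
        previous = current
        utilization = delta * 8 / (interval * if_speed_bps)
        print(f"{time.ctime()}  {host} if{if_index}  {utilization:.1%} outbound")
```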
Besides port utilization, of course, you can use this method to monitor anything for which there is a MIB variable. You can monitor router CPU utilization, dropped packets, and even physical temperature with some device types.
Another interesting, underused application of network-management information is to have automated processes that sift through the trap logs looking for interesting but noncritical events. For example, you might choose to ignore interface resets for switch ports that connect to end-user workstations. End users reboot their workstations frequently, so seeing such an event in the middle of the day is usually considered an extremely low priority. The network manager generally just ignores these events completely. But what if one port resets itself a thousand times a day? If you ignore all port-reset events, you will never know this information. This problem is actually fairly common.
It is a good idea to have automated scripts that go through the event logs every day looking for interesting anomalies like this. Some organizations have a set of scripts that analyze the logs every night and send a brief report to a network engineer. This sort of data can provide an excellent early warning of serious hardware or cabling problems. In fact, these reports highlight one of the most interesting and troubling aspects of network management: the problem is almost never that the information is not there. Rather, there is usually so much information that the server has to ignore almost all of it.
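A sketch of such a nightly script appears below. The trap-log format it parses is invented for illustration, as is the threshold; the point is simply to count events that would otherwise be ignored and surface only the outliers.

```python
# Nightly pass over a trap log looking for ports that reset far more often
# than normal. The log format and file path are invented for illustration:
# one event per line, e.g. "2024-05-01T10:12:33 switch17 linkDown Fa0/12".
from collections import Counter

THRESHOLD = 50     # resets per day worth a human's attention

def port_reset_report(log_path: str) -> list[str]:
    resets: Counter[tuple[str, str]] = Counter()
    with open(log_path) as log:
        for line in log:
            fields = line.split()
            if len(fields) >= 4 and fields[2] == "linkDown":
                resets[(fields[1], fields[3])] += 1
    return [f"{device} {port}: {count} resets"
            for (device, port), count in resets.most_common()
            if count > THRESHOLD]

if __name__ == "__main__":
    for line in port_reset_report("/var/log/snmptraps.log"):
        print(line)
```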
In a modestly sized network of a few thousand nodes, it is relatively common to receive at least one new event every second. A human being cannot even read all of the events as they come in, much less figure out what problems they might be describing. Instead, you have to come up with clever methods for filtering the events. Some events are important and are passed immediately to a human to act on. Other events are interesting, but not pressing, and are written to a log for future analysis. The rest are best considered mere noise and ignored.
The most sophisticated network-management servers are able to correlate these events to try to determine what is actually going on. For example, if the server sees that a thousand devices have suddenly gone down, one of which is the gateway to all others, then it is probably the gateway that has failed.
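The core of this kind of correlation can be surprisingly simple if the server already knows the dependency topology. The sketch below uses an invented three-device topology and reports only the failed devices whose own upstream path is still reachable.

```python
# Toy root-cause correlation: given which devices missed their polls and a map
# of which device each one sits behind, report only the real failures.
def root_causes(down: set[str], upstream: dict[str, str]) -> set[str]:
    """Return the failed devices that are not merely behind another failure."""
    causes = set()
    for device in down:
        parent = upstream.get(device)
        if parent is None or parent not in down:
            causes.add(device)   # its path toward the server is otherwise fine
    return causes

# Invented topology: two switches sit behind a branch router.
upstream = {"branch-sw1": "branch-rtr", "branch-sw2": "branch-rtr",
            "branch-rtr": "core-rtr"}
down = {"branch-rtr", "branch-sw1", "branch-sw2"}
print(root_causes(down, upstream))    # {'branch-rtr'}: the gateway is the real problem
```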
The server can in principle do even more clever event correlation by examining the noncritical events. For example, it might see that a router drops packets on a particular interface. There are many reasons for dropping packets. Perhaps there is a serious contention problem on the link. If, at the same time, the free memory on the same router is low and the CPU utilization is high, then this router is probably not powerful enough to handle the load. Perhaps this router has been configured to do too much processing of packets in the CPU instead of in the interface hardware. If the dropped packets occur when the router receives a large number of broadcast packets, then it may be a broadcast storm and not a router problem at all.
Setting up this sort of sophisticated event correlation can be extremely difficult and time consuming. Some relatively recent software systems are able to do much of this correlation out of the box. They tend to be rather expensive, but they are certainly more reliable than homemade systems.
In general, the network manager needs to monitor key devices such as switches and routers to see if they are working properly. The simplest and most common sort of polling is the standard ping utility. Since virtually every device that implements the IP protocol responds to ping, this is a good way to see if the device is currently up. However, a few devices, particularly firewalls, deliberately violate this rule for security reasons. A device that has disabled ping responses for security reasons will probably have SNMP disabled as well, so it has to be managed out-of-band anyway.
Ping is really not a great way to see what is going on with the device. If the device supports SNMP at all, it is better to ask it how long it has been up rather than simply ask whether it is there. This way, you can compare the response with the previous value. If the last poll showed that the device was up for several days and the current poll says that it was up for only a few minutes, then you know that it has restarted in the meantime. This may indicate a serious problem that you would otherwise have missed. The SNMP MIB variable for up time is called sysUpTime. It is conventional to call the difference between the current value and the previous value for the same parameter on the same device delta.
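A minimal sketch of this reboot check is shown below. The snmp_get() helper is a placeholder, and sysUpTime is reported in hundredths of a second (TimeTicks), which is why the print statement divides by 100.

```python
# Detecting a reboot from sysUpTime: if the current value is smaller than the
# value from the previous poll (delta < 0), the agent restarted in between.
def snmp_get(host: str, oid: str) -> int:
    raise NotImplementedError("wrap your SNMP library or the Net-SNMP tools here")

last_uptime: dict[str, int] = {}

def check_uptime(host: str) -> None:
    current = snmp_get(host, "SNMPv2-MIB::sysUpTime.0")   # TimeTicks (1/100 s)
    previous = last_uptime.get(host)
    if previous is not None and current - previous < 0:
        print(f"ALERT: {host} appears to have restarted "
              f"(was up {previous / 100:.0f}s, now up {current / 100:.0f}s)")
    last_uptime[host] = current
```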
Table 9-1 shows several different standard tests that are done by a network-management system. Some of these tests, such as the coldStart, linkUp, and linkDown events, are traps. Note that it is important to look even for good events such as linkUp because the device may have an interface that flaps. In this case, the traps saying that the interface has failed may be undelivered because of that failure.
Table 9-1. Standard network-management tests

| Parameter | MIB variable | Test | Comments |
|---|---|---|---|
| Reachability | ICMP (not SNMP) | Time > N, % not responded | All devices, including those that don't support SNMP |
| Reboot | coldStart | Trap | Indicates that the SNMP agent has restarted |
| Uptime | sysUpTime | delta < 0 | Time since the SNMP agent started running (reported in hundredths of a second) |
| Interface status (for every active interface on the device) | ifOperStatus | delta != 0 | Shows that the status of the interface has changed |
| | ifInOctets | Record | The number of bytes received |
| | ifInDiscards | delta > N | Incoming packets that had to be dropped |
| | ifInErrors | delta > N | Incoming packets with Layer 2 errors |
| | ifOutOctets | Record | The number of bytes sent |
| | ifOutDiscards | delta > N | Outgoing packets that had to be dropped |
| | ifOutErrors | delta > N | Outgoing packets sent with errors (should always be zero) |
| | ifInNUcastPkts | delta > N | Incoming multicast and broadcast packets |
| | ifOutNUcastPkts | delta > N | Outgoing multicast and broadcast packets |
| | linkDown | Trap | Indicates that an interface has gone down |
| | linkUp | Trap | Indicates that an interface has come up |
This set of variables tells the network manager just about everything she needs to know for most types of devices. Many other important MIB variables are specific to certain technologies, however. For example, parameters such as CPU and memory utilization are important, but these parameters are different for each different hardware vendor. Consult the hardware documentation for the appropriate names and values of these parameters.
For routers, one is usually interested in buffering and queuing statistics. Again, this information is in the vendor-specific MIB extensions. There are also certain technology-specific MIB extensions. For example, an 802.3 MIB extension includes a number of useful parameters for Ethernet statistics. Similarly, there are useful MIB variables for Token Ring, ATM, T1 circuits, Frame Relay, and so forth. In all of these cases, it is extremely useful to sit down and read through the MIB, looking at descriptions of each variable. In this way, one can usually find out if there is a convenient way to measure particular performance issues or to look for particular fault problems that may be unique to the network.
All of the SNMP polling I have discussed so far has been periodic and scheduled. However, the same SNMP server software can also do ad hoc queries. This means that the network manager can use the system to generate a single poll manually. This can be an extremely useful tool for fault isolation and troubleshooting. For example, this facility can quickly query a set of different devices to see which ones have high CPU loads, errors, or whatever you happen to be looking for. Using this facility is usually much faster than logging into all of these devices manually and poking around on the command line. In fact, the network-management software for many types of hardware makes it possible to do a large list of standard ad hoc queries on a device automatically.
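A sketch of such an ad hoc sweep appears below. The OID for CPU load is vendor specific, so the one shown is a placeholder, and snmp_get() again stands in for a real SNMP query with a short timeout.

```python
# Ad hoc sweep: ask every device in a list for one value and sort the results.
# The OID is a placeholder; CPU-load variables live in vendor MIB extensions.
def snmp_get(host: str, oid: str) -> int:
    raise NotImplementedError("wrap your SNMP library or the Net-SNMP tools here")

def busiest_devices(devices: list[str], oid: str, top: int = 10) -> list[tuple[str, int]]:
    readings = []
    for host in devices:
        try:
            readings.append((host, snmp_get(host, oid)))
        except Exception:
            readings.append((host, -1))     # unreachable: flag it rather than skip it
    return sorted(readings, key=lambda item: item[1], reverse=True)[:top]

# Example: which of these routers is working hardest right now?
print(busiest_devices(["rtr1", "rtr2", "rtr3"], "EXAMPLE-VENDOR-MIB::cpuLoad.0"))
```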
Many hardware vendors make SNMP software called instance managers. This software gives a relatively detailed, graphical view of the complete state of one device all at once. Usually, these instance managers also provide the ability to make configuration changes via SNMP as well.
For Ethernet switches, it is often true that the instance-manager software is the fastest and most efficient way to do basic configuration changes such as manipulating VLAN memberships.
This topic actually brings up one of the most serious issues with SNMP. With SNMP Version 1 and, to a lesser extent, with Versions 2 and 3, it is remarkably easy to subvert the security. It is not difficult to load publicly available SNMP server software onto a PC. This software can then be used to reconfigure key pieces of network equipment.
Even nonmalicious users and applications can cause problems. For example, some ill-behaved server-management software automatically attempts to discover the network path to the remote managed-server devices. In doing so, this software generally does detailed SNMP polling of key network devices. Once the path is discovered, this software then periodically polls these network devices as if it were managing the network instead of just the servers.
This situation might not sound like a problem, since the server-management software is only polling and not actually changing anything. However, remember that the SNMP agent running on a router or a switch is a software process that runs on the device's CPU and uses memory from its main memory pool. If a server program repeatedly polls this device, requesting large parts of its MIB, it can overload the device's CPU. Many such programs all requesting this data at the same time can cause network problems.
As I have stressed repeatedly throughout this book, there is no reason for any end device to ever have to know the topology of a well-built network. It may be necessary in these cases to implement access controls on the SNMP agents of key devices. These access controls have the effect of preventing the agent from speaking SNMP with any devices other than the officially sanctioned SNMP servers.
Some network engineers go further and actually block SNMP from passing through the network if it does not originate with the correct server. However, this measure is extreme. There may be well-behaved applications that happen to use SNMP to communicate with their well-behaved clients. In this case, the network should not prevent legitimate communication.
SNMP allows much flexibility in how network managers deal with the network. They can set up the network-management server to automatically poll a large list of devices on a schedule looking for well-defined measures of the network's health. If the results are within the expected range of results, then the server concludes that this part of the network works properly. If the result is outside of the expected range, then the server treats it as a problem and somehow prompts an engineer to investigate further. As a rule, it is best if these noninvasive monitoring and polling activities are done automatically without requiring any user intervention. It is a routine repetitive task—exactly the kind of thing that computers are good at. The server can do many different types of things automatically. It is particularly useful to download a copy of the configuration information for every device in the network. This downloading is usually scheduled to execute once per night or once per week in networks that seldom change.
Sometimes a network device fails completely and needs to be replaced. When this happens, it is necessary to configure the new device to look like the one it replaces. If the network-management server maintains an archive of recent software configurations for all network devices, then this task is relatively easy.
Another good reason to maintain automatic configuration backups is to note changes. For example, many organizations automatically download the configurations from every router and every switch each night. They then run a script that compares each new image to the previous night's backup. This information is encapsulated into a report that is sent to a network engineer. Usually, the only changes are the ones the engineer remembers making. But a report like this can be an extremely useful and efficient way to discover if somebody has tampered with the network.
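The comparison step is easy to do with nothing more than the standard difflib module, assuming the nightly download has already placed each device's configuration in a dated directory. The directory layout and file naming below are assumptions; the download step itself is not shown.

```python
# Nightly comparison of device configurations against the previous backup,
# using only the standard library. Paths and naming are illustrative.
import difflib
from pathlib import Path

def config_changes(today_dir: str, yesterday_dir: str) -> str:
    report = []
    for new_file in sorted(Path(today_dir).glob("*.cfg")):
        old_file = Path(yesterday_dir) / new_file.name
        old_lines = old_file.read_text().splitlines() if old_file.exists() else []
        new_lines = new_file.read_text().splitlines()
        diff = list(difflib.unified_diff(old_lines, new_lines,
                                         fromfile=str(old_file), tofile=str(new_file),
                                         lineterm=""))
        if diff:
            report.append("\n".join(diff))
    return "\n\n".join(report) or "No configuration changes detected."

if __name__ == "__main__":
    # E-mailing the result to an engineer is left to the scheduler that runs this.
    print(config_changes("/backups/2024-05-02", "/backups/2024-05-01"))
```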
All of these fully automatic processes are completely noninvasive. Some organizations also use a noninvasive suite of test scripts. These test scripts are executed automatically if the network-management software sees a potentially serious problem. The result of these automated tests can be helpful in isolating the problem quickly.
Sometimes network managers want to partially automate invasive procedures as well. For example, it is relatively common in a large network to have a script automatically change login passwords on routers. This way, every router can be changed in a single night. With any sort of invasive procedure, it is usually wise only to partially automate it. A user should start the script and monitor its progress. That person should then verify that the change is correct.
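A minimal sketch of this kind of partially automated change is shown below. The apply_password_change() function is hypothetical; the point is that a person starts the run, confirms each device, and the script stops as soon as anything looks wrong.

```python
# Semi-automated invasive change: a person starts the script, confirms each
# device, and can stop at any point. apply_password_change() is hypothetical.
def apply_password_change(host: str, new_password: str) -> bool:
    raise NotImplementedError("device login and configuration change go here")

def change_passwords(devices: list[str], new_password: str) -> None:
    for host in devices:
        answer = input(f"Change login password on {host}? [y/N] ")
        if answer.strip().lower() != "y":
            print(f"skipped {host}")
            continue
        if apply_password_change(host, new_password):
            print(f"changed {host}; please verify by logging in manually")
        else:
            print(f"FAILED on {host}; stopping here for investigation")
            break
```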
Some network managers go further and allow full automation of invasive procedures such as scheduled VLAN topology or filtering changes. In some cases, invasive scripts are run to reconfigure network devices automatically in response to certain failures. I do not recommend this type of automation; it is simply too dangerous in a complex network. Usually, the automated change assumes a well-defined starting point. However, it is possible to have an obscure problem in the network create a different initial configuration than what the automated procedure expects. This configuration could cause the scripted changes to give unexpected, perhaps disastrous, results. It is also possible to have an unexpected event while the reconfiguration is in progress. In this case, the network might be left in some strange state that requires extensive manual work to repair.
A large network is generally such a complex system that it is not wise to assume that it will behave in a completely predictable way. There are simply too many different things that can happen to predict every scenario reliably. If weird things happen, you should be in control, rather than allow a naïve program to continue reconfiguring the network and probably make the problem worse.