2.3 Failure Modes

Until I have discussed the various standard network topologies, it will be difficult to have an in-depth discussion of failure modes. But I can still talk about them in general. Obviously, the worst failure mode is a single point of failure for the entire network. But, as the previous section showed, the overall stability of the network may be governed by less obvious factors.

At the same time, it shows that adding redundancy at any point in the network drastically improves the stability of that component. In theory it would be nice to do detailed calculations like the ones shown earlier. Then you could look for the points where the weighted failure rates are highest. But in a large network this is often not practical. There may be thousands of components to consider. This is where the simpler qualitative method described earlier is useful.

What the quantitative analysis of the last section shows, though, is that every failure that can affect a large number of users is a serious problem. Even worse, it shows that the probability of a failure grows quickly with each additional possible point of failure. The qualitative analysis just finds the problem spots; it doesn't make the consequences clear. Having a single point of failure in your network that affects a large number of users is not always a serious problem, particularly if that failure never actually happens. But the more points like this you have, the more likely it is that one of them will fail.
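To see how quickly that risk compounds, here is a rough back-of-the-envelope sketch. It assumes the failure points are independent, and the 2% monthly failure probability is purely illustrative:

```python
# Rough illustration with made-up numbers: the probability that at least
# one of n independent points of failure breaks during a given period,
# if each one fails with probability p over that period.

def p_any_failure(n: int, p: float) -> float:
    """P(at least one failure) = 1 - P(no failures) = 1 - (1 - p)^n."""
    return 1.0 - (1.0 - p) ** n

# Example: each point has a 2% chance of failing in a given month.
for n in (1, 5, 10, 25, 50):
    print(f"{n:3d} points of failure -> {p_any_failure(n, 0.02):.1%} chance of an outage")
# Climbs quickly: 2.0%, 9.6%, 18.3%, 39.7%, 63.6%
```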

Suppose you have a network with 100,000 elements that can fail. This may sound like a high number, but in practice it isn't out of the ordinary for a large-scale LAN. Remember that the word "element" includes every hub, switch, cable, fiber, card in every network device, and even your patch panels.

If the average MTBF for these 100,000 elements is 100,000 hours (which is probably a little low), then on average you can expect about one element somewhere in the network to fail every hour, or roughly two dozen per day. Even if there is redundancy, the elements will still break and need to be replaced; it just won't affect production traffic. Most of these failures will affect very small numbers of users. But the point is that, the larger your network, the more you need to understand what can go wrong, and the more you will need to design around these failure modes.
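The arithmetic behind that estimate is simple enough to sketch, using the same numbers quoted above:

```python
# Expected failure rate for a population of independent elements:
# aggregate rate = (number of elements) / (MTBF of a single element).

elements = 100_000       # network elements that can fail
mtbf_hours = 100_000     # average MTBF per element

failures_per_hour = elements / mtbf_hours
print(f"{failures_per_hour:.1f} failures per hour")       # 1.0
print(f"{failures_per_hour * 24:.0f} failures per day")   # 24
```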

So far I have only discussed so-called hard failures. In fact, most LAN problems aren't the result of hard failures. There are many kinds of failures that happen even when the network hardware is still operating. These problems fall into a few general categories: congestion, traffic anomalies, software problems, and human error.

2.3.1 Congestion

Congestion is the most obvious sort of soft problem on a network. Everybody has experienced a situation where the network simply cannot handle all of the traffic that is passing through it. Some packets are dropped; others are delayed.

In dealing with congestion, it is important to understand your traffic flows. In Figure 2-5, user traffic from the various user floors flows primarily to the Internet, the application servers, and the mainframe. But there is very little floor-to-floor traffic. This allows you to look for the bottlenecks where there might not be enough bandwidth. In this example all traffic flows through the two Core VLANs. Is there sufficient capacity there to deal with all of the traffic?
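As a rough way of checking a bottleneck like this, you can add up the estimated peak traffic from each floor and compare the total against the trunk capacity. The sketch below uses entirely made-up numbers; the real figures have to come from your own traffic measurements.

```python
# Hypothetical bottleneck check: does the Core trunk have enough capacity
# for the sum of the peak traffic from every user floor? All of the numbers
# here are invented for illustration.

peak_mbps_per_floor = {
    "floor_1": 120, "floor_2": 95, "floor_3": 210, "floor_4": 80,
}
core_trunk_capacity_mbps = 1000   # e.g., a single Gigabit Ethernet trunk

total_offered = sum(peak_mbps_per_floor.values())
utilization = total_offered / core_trunk_capacity_mbps
print(f"Offered load {total_offered} Mbps is {utilization:.0%} of trunk capacity")
if utilization > 0.5:
    # Leave headroom for bursts and for failover onto a redundant trunk.
    print("Consider more trunk capacity or better load sharing")
```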

Congestion is what happens when traffic hits a bottleneck in the network. If there is simply not enough downstream capacity to carry all of the incoming traffic, then some of it has to be dropped. But before dropping packets, most network equipment will attempt to buffer them.

Buffering basically means that the packets are temporarily stored in the network device's memory in the hopes that the incoming burst will relent. The usual example is a bucket with a hole in the bottom. If you pour water into the bucket, gradually it will drain out through the bottom.

Suppose first that the amount you pour in is less than the total capacity of the bucket. In this case the water will gradually drain out. The bucket has changed a sudden burst of water into a gradual trickle.

On the other hand, you could just continue pouring water until the bucket overflows. An overflow of data means that packets have to be dropped; there simply isn't enough memory to keep them all. The solution may be just to get a bigger bucket. But if the incoming stream is relentless, then it doesn't matter how big the bucket is: it will never be able to drain in a controlled manner.

This is similar to what happens in a network when too much data hits a bottleneck. If the burst is short, the chances are good that the network will be able to cope with it easily. But a relentless flow that exceeds the capacity of a network link means that a lot of packets simply can't be delivered and have to be dropped.
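The bucket analogy translates almost directly into code. Here is a minimal sketch of a drop-tail buffer at a bottleneck link; the buffer size and rates are arbitrary.

```python
# Minimal drop-tail buffer (the "bucket") at a bottleneck, per time step.
# arrivals: packets arriving each step; drain_rate: packets the downstream
# link can carry per step; buffer_size: how many packets can be queued.

def simulate(arrivals, drain_rate=5, buffer_size=20):
    queued, delivered, dropped = 0, 0, 0
    for incoming in arrivals:
        queued += incoming
        if queued > buffer_size:          # the bucket overflows
            dropped += queued - buffer_size
            queued = buffer_size
        sent = min(queued, drain_rate)    # water draining out the hole
        delivered += sent
        queued -= sent
    return delivered, dropped

# A short burst fits in the buffer and drains out smoothly:
print(simulate([20, 0, 0, 0]))    # (20, 0) delivered, dropped
# A relentless flow above the drain rate overflows, no matter the size:
print(simulate([10] * 6))         # (30, 15)
```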

Some network protocols deal well with congestion. Connection-based protocols such as TCP can detect that packets have been dropped. This allows them to back off and send at a slower rate, usually settling just below the peak capacity of the network. But other protocols cannot detect congestion, and they simply lose data.
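As a very stylized illustration of that back-off behavior, the sketch below is loosely modeled on TCP's additive-increase, multiplicative-decrease congestion window. Real TCP is considerably more involved, and the capacity figure here is arbitrary.

```python
# Stylized additive-increase/multiplicative-decrease sender, loosely modeled
# on TCP congestion control. link_capacity is in packets per round trip.

def aimd(link_capacity=50, rounds=200):
    window, history = 1.0, []
    for _ in range(rounds):
        if window > link_capacity:   # loss detected: treat it as congestion
            window /= 2              # back off sharply
        else:
            window += 1              # no loss: probe for more bandwidth
        history.append(window)
    return history

h = aimd()
print(min(h[100:]), max(h[100:]))  # sawtooth between roughly half and full capacity
```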

Lost data can actually make the congestion problem worse. In many applications, if the data is not received within a specified time period, or if only some of it is received, then it will be sent again. This is clearly a good idea if you are the application. But if you are the network, it has the effect of making a bad problem worse.

Ultimately, if data is just not getting through at all for some applications, they can time out. This means that the applications decide that they can't get their jobs done, so they disconnect themselves. If many applications disconnect, it can allow the congestion to dissipate somewhat. But often the applications or their users will instead attempt to reconnect. And again, this connection-setup traffic can add to the congestion problem.

Congestion is typically encountered anywhere on a network that connections from many devices or groups of devices converge. So the first common place to see congestion is on the local Ethernet or Token Ring segment. If many devices all want to use the network at the same time, the Data Link protocol provides a method (collision detection and backoff for Ethernet, token passing for Token Ring) for regulating traffic. This means that some devices will have to wait.

Worse congestion problems can occur at points in the network where traffic from many segments converges. In LANs this happens primarily at trunks. In networks that include some WAN elements, it is common to see congestion at the point where LAN traffic reaches the WAN.

The ability to control congestion through the Core of a large-scale LAN is one of the most important features of a good design. This requires a combination of careful monitoring and a scalable design that makes it easy to relieve bottlenecks by moving them or adding capacity. In many networks congestion problems are also mitigated with a traffic-prioritization system. This issue is discussed in detail in Chapter 10.

Unlike several of the other design decisions I have discussed, congestion is an ongoing issue. At some point there will be a new application or a new server, or an old one will be removed. People will change the way they use existing services, and that will change the traffic patterns as well. So there must be ongoing performance monitoring to ensure that congestion problems don't creep up on the network.
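One simple form of that monitoring is to compute link utilization from interface byte counters and flag trunks that are creeping toward saturation. The sketch below assumes you already have two readings of a counter such as SNMP's ifOutOctets; the numbers themselves are made up.

```python
# Average utilization of a link over one polling interval, computed from
# two readings of its output-octets counter.

def utilization(octets_before: int, octets_after: int,
                interval_seconds: float, link_speed_bps: int) -> float:
    """Return utilization as a fraction between 0 and 1."""
    bits_sent = (octets_after - octets_before) * 8   # ignores counter wrap
    return bits_sent / (link_speed_bps * interval_seconds)

# Two readings of a Gigabit trunk's counter, taken 60 seconds apart
# (the numbers are invented for illustration):
busy = utilization(9_500_000_000, 13_100_000_000, 60, 1_000_000_000)
print(f"{busy:.0%} average utilization")   # 48%
if busy > 0.7:
    print("Trunk is creeping toward saturation; plan more capacity")
```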

2.3.2 Traffic Anomalies

By traffic anomalies, I mean that otherwise legitimate packets on the network have somehow caused a problem. This is distinct from congestion, which refers only to loading problems. This category includes broadcast storms and any case where a packet has confused a piece of equipment. Another example is a server sending out an erroneous dynamic routing packet or ICMP packet that causes a router to become confused about the topology of the network. These issues are discussed further in Chapter 6.

But perhaps the most common and severe examples are those where automatic fault-recovery systems, such as Spanning Tree at Layer 2, or dynamic routing protocols, such as Open Shortest Path First (OSPF) at Layer 3, become confused. This is usually referred to as a convergence problem. The result can be routing loops, or just slow, unreliable response across the network.

The most common reason for convergence problems at either Layer 2 or Layer 3 is complexity. Try to make the job easy for these processes by understanding what they do. The more paths that are available, the harder it becomes to find the best one. The more neighbors there are, the harder it is to decide which one to pass a particular packet to.

A broadcast storm is a special type of problem. It gets mentioned frequently, and a lot of switch manufacturers include features for limiting broadcast storms. But what is it really? Well, a broadcast packet is a perfectly legitimate type of packet that is sent to every other station on the same network segment or VLAN. The most common example of a broadcast is an IP ARP packet. This is where a station knows the IP address of a device, but not the MAC address. To address the Layer 2 destination part of the frame properly, it needs the MAC address. So it sends out a request to everybody on the local network asking for this information, and the station that owns (or is responsible for forwarding) this IP address responds.
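To make the mechanics concrete, here is a sketch that builds an ARP request frame by hand. The addresses are made up, and in practice the host's IP stack does all of this for you.

```python
import socket
import struct

# Build an ARP "who has this IP?" request by hand (addresses are made up).
def arp_request(src_mac: bytes, src_ip: str, target_ip: str) -> bytes:
    broadcast = b"\xff" * 6                    # Layer 2 broadcast address
    ethertype_arp = struct.pack("!H", 0x0806)  # EtherType for ARP
    ether_header = broadcast + src_mac + ethertype_arp

    arp_body = struct.pack(
        "!HHBBH6s4s6s4s",
        1,                          # hardware type: Ethernet
        0x0800,                     # protocol type: IPv4
        6, 4,                       # hardware/protocol address lengths
        1,                          # operation: request
        src_mac,
        socket.inet_aton(src_ip),
        b"\x00" * 6,                # target MAC is unknown; that is the question
        socket.inet_aton(target_ip),
    )
    return ether_header + arp_body

frame = arp_request(b"\x02\x00\x00\x00\x00\x01", "10.1.1.10", "10.1.1.1")
print(len(frame), "bytes, delivered to every station on the segment")
```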

But there are many other types of broadcasts. A storm usually happens when one device sends out a broadcast and another tries to be helpful by forwarding that broadcast back onto the network. If several devices all behave the same way, then they see the rebroadcasts from one another and rebroadcast them again. The LAN is instantly choked with broadcasts.

The way a switch attempts to resolve this sort of problem usually involves a simple mechanism of counting the number of broadcast packets per second. If the count exceeds a certain threshold, the switch starts throwing broadcasts away so that they can't choke off the network. But clearly the problem hasn't gone away. The broadcast storm is just being kept in check until it dies down on its own.
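The storm-control mechanism itself is simple enough to sketch: count broadcasts per interval and start discarding once a threshold is crossed. Real switches typically implement this in hardware on a per-port basis; the threshold here is arbitrary.

```python
import time

class StormControl:
    """Count broadcasts per one-second window; drop everything over the limit."""

    def __init__(self, max_broadcasts_per_second: int = 500):
        self.limit = max_broadcasts_per_second
        self.window_start = 0.0
        self.count = 0

    def allow(self, now: float) -> bool:
        """Return True if this broadcast should be forwarded, False to drop it."""
        if now - self.window_start >= 1.0:     # start a new one-second window
            self.window_start, self.count = now, 0
        self.count += 1
        return self.count <= self.limit

# During a storm the excess broadcasts are simply discarded; the storm itself
# continues until the rebroadcasting devices stop on their own.
storm_control = StormControl(max_broadcasts_per_second=500)
forwarded = sum(storm_control.allow(time.time()) for _ in range(10_000))
print(f"forwarded {forwarded} of 10000 broadcasts in this burst")
```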

Containment is the key to traffic anomalies. Broadcast storms cannot cross out of a broadcast domain (which usually means a VLAN, but not necessarily). OSPF convergence problems can be dealt with most easily by making the areas small and simple in structure. Similarly, Spanning Tree problems are generally caused by too many interconnections. So in all cases, keeping the region of interest small and simple helps enormously.

This doesn't mean that the network has to be small, but it does support the hierarchical design models I discuss later in this book.

2.3.3 Software Problems

"Software problems" is a polite term for bugs in the network equipment. It happens. Sometimes a router or switch will simply hang, and sometimes it will start to misbehave in some peculiar way.

Routers and switches are extremely complex specialized computers. So software bugs are a fact of life. But most network equipment is remarkably bug-free. It is not uncommon to encounter a bug or two during initial implementation phases of a network. But a network that avoids using too many novel features and relies on mature products from reputable vendors is generally going to see very few bugs.

Design flaws are much more common than bugs. Bugs that affect Core pieces of code, such as standard IP routing or OSPF, are rare in mature products. Rarer still are bugs that cannot be worked around by means of simple design changes.

2.3.4 Human Error

Unfortunately, the most common sort of network problem is where somebody changed something, either deliberately or accidentally, and it had unforeseen consequences. There are so many different ways to shoot oneself in the foot that I won't bother to detail them here. Even if I did, no doubt tomorrow we'd all go out and find new ones.

There are design decisions that can limit human error. The most important of these is to work on simplicity. The easier it is to understand how the network is supposed to work, the less likely it is that somebody will misunderstand it. Specifically, it is best to build the design from simple, easily understood building blocks. Wherever possible, these blocks should be identical, or at least very similar. One of the best features of the otherwise poor design shown in Figure 2-5 is that it has an identical setup for all of the user floors. A new technician doesn't need to remember special tricks for each area; they are all the same.

The best rule of thumb for deciding whether a design is sufficiently simple is to imagine that something has failed in the middle of the night and somebody is on the phone in a panic wanting answers about how to fix it. If most of the network is designed using a few simple, easily remembered rules, the chances are good that you'll be able to tell them what they need to know without having to race to the site to find your spreadsheets and drawings.