2.1 Defining Reliability

There are two main components to my definition of reliability. The first is fault tolerance. This means that devices can break down without affecting service. In practice, you might never see any failures in your key network devices. But if there is no inherent fault tolerance to protect against such failures, then the network is taking a great risk at the business's expense.

The second key component to reliability is more a matter of performance and capacity than of fault tolerance. The network must meet its peak load requirements sufficiently to support the business requirements. At its heaviest times, the network still has to work. So peak load performance must be included in the concept of network reliability.

It is important to note that the network must be more reliable than any device attached to it. If the user can't get to the server, the application will not work—no matter how good the software or how stable the server. In general, a network will support many users and many servers. So it is critically important that the network be more reliable than the best server on it.

Suppose, for example, that a network has one server and many workstations. This was the standard network design when mainframes ruled the earth. In this case, the network is useless without a server. Many companies would install backup systems in case key parts of their mainframe failed. But this sort of backup system is not worth the expense if the thing that fails most often is the connection to the workstations.

Now, jump to the modern age of two- and three-tiered client-server architectures. In this world there are many servers supporting many applications. They are still connected to the user workstations by a single network, though. So this network has become the single most important technological element in the company. If a server fails, it may have a serious effect on the business. The business response to this risk is to provide a redundant server of some kind. But if the network fails, then several servers may become inaccessible. In effect, the stability of the network is as important as all of the business applications combined.
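The argument above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming that failures are independent and that a user needs both the network and a server to be up at the same time. The availability figures and the function names are illustrative assumptions, not measurements.

```python
def series(*availabilities: float) -> float:
    """Availability of components that must ALL be up (independent failures assumed)."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def redundant_pair(a: float) -> float:
    """Availability of two identical components where EITHER one is enough."""
    return 1.0 - (1.0 - a) ** 2


# Hypothetical figures: both the server and the network are up 99.9% of the time.
server = 0.999
network = 0.999

single_server = series(network, server)
paired_servers = series(network, redundant_pair(server))

print(f"one server behind the network:  {single_server:.6f}")
print(f"two servers behind the network: {paired_servers:.6f}")
```

Under these assumptions, adding the redundant server helps, but the result can never exceed the availability of the network itself. That is the sense in which the network has to be more reliable than the best server attached to it.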

2.1.1 Failure Is a Reliability Issue

In most cases, it's easiest to think about reliability in terms of how frequently the network fails to meet the business requirements, and how badly it fails. For the time being, I won't restrict this discussion to simple metrics like availability because this neglects two important ways that a network can fail to meet business requirements.

First, there are failures that are very short in duration, but which interrupt key applications for much longer periods. Second, a network can fail to meet important business requirements without ever becoming unavailable. For example, if a key application is sensitive to latency, then a slow network will be considered unreliable even if it never breaks.

In the first case, some applications and protocols are extremely sensitive to short failures. Sometimes a short failure can mean that an elaborate session setup must be repeated. In worse cases, a short failure can leave a session hung on the server. When this happens, the session must be reset by either automatic or manual procedures, resulting in considerable delays and user frustration. The worst situation is when that brief network outage causes loss of critical application data. Perhaps a stock trade will fail to execute, or the confirmation will go missing, causing it to be resubmitted and executed a second time. Either way, the short network outage could cost millions of dollars. At the very least, it will cause user aggravation and loss of productivity.

Availability is not a useful metric in these cases. A short but critical outage would not affect overall availability by very much, but it is nonetheless a serious problem.
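To see how little a short outage moves the availability figure, consider the following sketch. The 30-second outage and the 30-day measurement period are hypothetical values chosen only for illustration.

```python
def availability(outage_seconds: float, period_seconds: float) -> float:
    """Fraction of the measurement period during which the network was up."""
    return 1.0 - outage_seconds / period_seconds


SECONDS_PER_30_DAYS = 30 * 24 * 60 * 60  # 2,592,000 seconds

# Hypothetical case: one 30-second outage in a 30-day period.
short_outage = availability(30, SECONDS_PER_30_DAYS)
print(f"{short_outage:.5%}")  # prints 99.99884%
```

A figure this close to 100 percent looks excellent on a report, yet the same 30 seconds may have dropped every open session on a key application.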

Lost productivity is often called a soft expense. This is really an accounting issue. The costs are real, and they can severely affect corporate profits. For example, suppose a thousand people are paid an average of $20/hour. If there is a network glitch of some sort that sends them all to the coffee room for 15 minutes, then that glitch just cost the company at least $5,000 (not counting the cost of the coffee). In fact, these people are supposed to be creating net profit for the company when they are working. So it is quite likely that there is an additional impact in lost revenue, which could be considerably larger. If spending $5,000 to $10,000 could have prevented this brief outage, it would almost certainly have been worth the expense. If the outage happens repeatedly, then multiply this amount of money by the failure frequency. Brief outages can be extremely expensive.
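The arithmetic in this example is simple enough to write down as a short sketch. The function name is an assumption made for illustration, and the figure of 12 outages per year is hypothetical. Only the direct labor cost is counted here, not the lost revenue.

```python
def outage_labor_cost(people: int, hourly_rate: float, outage_minutes: float) -> float:
    """Direct labor cost of an outage that idles everyone for its duration."""
    return people * hourly_rate * (outage_minutes / 60.0)


# Figures from the text: 1,000 people at $20/hour, idle for 15 minutes.
single_outage = outage_labor_cost(1000, 20.0, 15)
print(single_outage)  # prints 5000.0

# Hypothetical failure frequency: the same glitch once a month.
outages_per_year = 12
print(single_outage * outages_per_year)  # prints 60000.0
```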

2.1.2 Performance Is a Reliability Issue

The network exists to transport data from one place to another. If it is unable to transport the volume of data required, or if it doesn't transfer that data quickly enough, then it doesn't meet the business requirements. It is always important to distinguish between these two factors. The first is called bandwidth, and the second latency.

Simply put, bandwidth is the amount of data that the network can transmit per unit time. Latency, on the other hand, is the length of time it takes to send that data from end to end. The best analogy for these is to think of transporting real physical "stuff."

Suppose a company wants to send grain from New York to Paris. They could put a few bushels on the Concorde and get it there very quickly (low latency, low bandwidth, and high cost per unit). Or they could fill a cargo ship with millions of bushels, and it will be there next week (high latency, high bandwidth, and low cost per unit). Latency and bandwidth are not always linked this way. But the trade-off with cost is fairly typical. Speeding things up costs money. Any improvement in bandwidth or latency that doesn't cost more is generally just done without further thought.

Also note that the Concorde is not infinitely fast, and the cargo ship doesn't have infinite capacity. Similarly, the best network technology will always have limitations. Sometimes you just can't get any better than what you already have.

Here the main concern should be with fulfilling the business requirements. If they absolutely have to get a small amount of grain to Paris in a few hours, and the urgency outweighs any expense, they would certainly choose the Concorde option. But it is more likely that they have to deliver a very large amount cost-effectively. So they would choose the significantly slower ship. And that's the point here. The business requirements, not the technology, determine the best way.
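The same trade-off can be sketched for network links. The model below is a deliberate simplification: it treats delivery time as the one-way latency plus the size divided by the bandwidth, and it ignores protocol overhead, congestion, and retransmission. The two links and their figures are hypothetical.

```python
def transfer_time(size_bytes: float, latency_s: float, bandwidth_bps: float) -> float:
    """Approximate time to deliver a payload: latency plus serialization time."""
    return latency_s + (size_bytes * 8) / bandwidth_bps


# Hypothetical links, playing the roles of the Concorde and the cargo ship.
quick_link = {"latency_s": 0.010, "bandwidth_bps": 10e6}  # 10 ms, 10 Mbps
bulk_link = {"latency_s": 0.500, "bandwidth_bps": 1e9}    # 500 ms, 1 Gbps

small = 1_000            # a 1 KB transaction
large = 10_000_000_000   # a 10 GB bulk transfer

for name, size in (("small", small), ("large", large)):
    quick = transfer_time(size, **quick_link)
    bulk = transfer_time(size, **bulk_link)
    print(f"{name}: quick link {quick:.4f} s, bulk link {bulk:.4f} s")
```

With these numbers the low-latency link delivers the small transaction far sooner, while the high-bandwidth link wins easily on the bulk transfer. Which link is "better" depends entirely on what the business needs to send.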

If the business requirements say that the network has to pass so many bytes of data between 9:00 A.M. and 5:00 P.M., and the network is not able to do this, then it is not reliable. It does not fulfill its objectives. Alternatively, the network might be able to pass all of the required data, but only by buffering some of it during the peak periods. This means that there is so much data already passing through the network that some packets are stored temporarily in the memory of some device while they wait for an opening.

This is similar to trying to get onto a very busy highway. Sometimes you have to wait on the on-ramp for a space in the stream of traffic to slip your car into. The result is that it will take longer to get to your destination. The general congestion on the highway will likely also mean that you can't go as fast. The highway is still working, but it isn't getting you where you want to go as fast as you want to get there.

If this happens in a network, it may be just annoying, or it may cause application timeouts and lost data, just as if there were a physical failure. Although it hasn't failed, the network is still considered unreliable because it does not reliably deliver the required volume of data in the required time. Put another way, it is unreliable because the users cannot do their jobs.
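A rough sketch of this buffering effect follows. It models a single link with a fixed capacity per time interval and an unlimited buffer, which is a simplifying assumption (real devices have finite buffers and drop packets when they fill). The traffic figures and the timeout value are hypothetical.

```python
from typing import List


def queueing_delays(arrivals: List[float], capacity: float) -> List[float]:
    """Delay, in intervals, seen by data arriving at the end of each interval."""
    backlog = 0.0
    delays = []
    for offered in arrivals:
        # Whatever the link cannot send in this interval waits in the buffer.
        backlog = max(0.0, backlog + offered - capacity)
        delays.append(backlog / capacity)
    return delays


capacity = 100.0                            # units the link can send per interval
arrivals = [60, 80, 150, 150, 90, 40, 40]   # offered load, with a midday peak

delays = queueing_delays(arrivals, capacity)
print(delays)  # prints [0.0, 0.0, 0.5, 1.0, 0.9, 0.3, 0.0]

# Hypothetical application timeout, expressed in the same interval units.
timeout = 0.8
at_risk = [i for i, d in enumerate(delays) if d > timeout]
print(at_risk)  # prints [3, 4]
```

The total offered load (610 units) is well under the total capacity over the day (700 units), so nothing is ever lost in this model. Even so, an application with the timeout shown would fail during two of the seven intervals.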

Another important point in considering reliability is the difference between similar failures at different points in the network. If a highway used by only a few cars each day gets washed away by bad weather, the chances are that this will not have a serious impact on the region. But if the one major bridge connecting two densely populated regions were to collapse, it would be devastating. In this case one would have to ask why there was only one bridge in the region. Similar conclusions apply to critical network links.

This is the key to my definition of reliability. I mean what the end users mean when they say they can't rely on the network to get their jobs done. Unfortunately, this doesn't provide a useful way of measuring anything. Many people have tried to establish metrics based on the number of complaints or on user responses to questionnaires. But the results are terribly unreliable. So, in practice, the network architect needs to establish a model of the user requirements (most likely a different model for each user group) and determine how well these requirements are met.

Usually, this model can be relatively simple. It will include things like:

What end-to-end latency can the users tolerate for each application?
What are the throughput (bandwidth) requirements for each application (sustained and burst)?
What length of outage can the users tolerate for each application?

These factors can all be measured, in principle. The issue of reliability can then be separated from subjective factors that affect a user's perception of reliability.
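One way to capture such a model is as a small set of per-application records that measurements can be checked against. The sketch below is only an illustration: the class names, field names, units, and the sample "trading" figures are all assumptions, not part of any standard.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Requirement:
    application: str
    max_latency_ms: float       # tolerable end-to-end latency
    min_sustained_mbps: float   # sustained throughput needed
    min_burst_mbps: float       # burst throughput needed
    max_outage_s: float         # longest tolerable outage


@dataclass(frozen=True)
class Measurement:
    latency_ms: float
    sustained_mbps: float
    burst_mbps: float
    longest_outage_s: float


def unmet(req: Requirement, seen: Measurement) -> List[str]:
    """Return the names of the requirements that the measurements fail to meet."""
    problems = []
    if seen.latency_ms > req.max_latency_ms:
        problems.append("latency")
    if seen.sustained_mbps < req.min_sustained_mbps:
        problems.append("sustained throughput")
    if seen.burst_mbps < req.min_burst_mbps:
        problems.append("burst throughput")
    if seen.longest_outage_s > req.max_outage_s:
        problems.append("outage length")
    return problems


# Hypothetical requirement and measurement for one user group.
trading = Requirement("trading", max_latency_ms=50, min_sustained_mbps=2,
                      min_burst_mbps=10, max_outage_s=1)
observed = Measurement(latency_ms=120, sustained_mbps=5,
                       burst_mbps=12, longest_outage_s=0.5)

print(unmet(trading, observed))  # prints ['latency']
```

In this example the network never went down for longer than the users could tolerate and had plenty of throughput, but it would still be judged unreliable for this application because of latency alone.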