10.2 RADIUS and Availability

High availability has become the latest buzzword in Internet service. Advertisements abound for network operation centers (NOCs) with triple-capacity electric generators, dual HVAC systems, geographical dispersion, waterless combustion control, and other facilities to handle problems. While these certainly are methods to obtain and retain high availability, it seems that sometimes people lose sight of the point of such exercises: to maintain the existence and offering of services when other systems on all "sides" of the service are failing. I say "sides" to refer to the hierarchical tree in which most systems reside: there are often machines relying on a specific box, that box relies on other boxes, and it also may work in tandem with others. There are several strategies for planning for failure, which is the main tenet of high availability. The one most disaster-planning experts use is to account for what would be a worst-case scenario for your implementation. There are several questions to ask yourself when designing a highly available system:

- What does the system look like when it is behaving normally?
- Where are the potential points of failure?
- How do I plan for those failures?
- How will I monitor the system so that I catch problems early?
These questions give you a fairly complete estimate of your implementation's weak points, both inside and outside of your control. Let's step through each of the questions with regard to designing a RADIUS implementation.

10.2.1 Determining Normal System Behavior

To establish a proper and accurate baseline for your system, there are two types of requirements you need to consider: explicit requirements, which are those mandated by your users or your company's management; and derived (or implicit) requirements, which mainly stem from the explicit requirements. For example, you may be required to make all reasonable efforts to have service restored within 15 minutes of downtime. The 15-minute window is an explicit requirement. However, you may also require that your systems have hot-swap hard drives so that you can indeed swap out a dead disk within 15 minutes. The hot-swap requirement is derived from the explicit one. Let's take a look at each of these now.

10.2.1.1 Explicit requirements

Some RADIUS implementations must deal with a constant, heavy stream of users needing service. In this case, a measurement called packets per second is used, which is simply a threshold of how many packets per second can be received and processed by the server systems. A few calculations are in order to determine what this qualification should be.
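A rough sizing calculation can be sketched in a few lines of code. All of the input figures below (port count, packets per call, call length, peak utilization) are illustrative assumptions, not numbers from the text; substitute your own measurements.

```python
def radius_pps(ports: int, packets_per_session: int,
               avg_session_minutes: float, peak_utilization: float) -> float:
    """Estimate peak RADIUS packets per second.

    ports: total dial-in ports across all POPs
    packets_per_session: auth + accounting packets per call
    avg_session_minutes: mean call length
    peak_utilization: fraction of ports busy at peak (0..1)
    """
    # At steady state, sessions complete (and begin) at this rate:
    sessions_per_minute = ports * peak_utilization / avg_session_minutes
    return sessions_per_minute * packets_per_session / 60.0

# Example: 5,000 ports, 6 packets per call (Access-Request/Accept plus
# Accounting Start/Response and Stop/Response), 20-minute average call,
# 90% peak utilization:
peak = radius_pps(5000, 6, 20, 0.9)
print(round(peak, 1))  # -> 22.5
```

Even a large regional ISP, under these assumptions, generates only a couple dozen RADIUS packets per second at peak; the danger is the retransmission cascade under overload described next, not the steady-state rate.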
Can you handle that? It may seem like a simple question, but it's one that must be answered. It's also important to consider the number of packets per transaction, as you may be faced with a "transactions per second" constraint. In this case, RADIUS call-check may be used, which can cause more packets to constitute a single transaction. Of course, each implementation is different, so careful consideration of your environment and your requirements is prudent. Some administrators believe that the primary factors in designing these systems are, of course, availability and response time. Consider the effect of a slow system: if a server is under heavy load, it will take a while to respond to a request. The client system will probably resend the message after a timeout period, which adds another packet to the load. You can begin to see the cascading effect of packet load, which may ultimately result in a system crash. Another requirement to consider is the mean time between system failures, or MTBF. You may be familiar with this statistic from its presence on nearly all hard disk drives. For smaller RADIUS setups, in which downtime is measured in under one thousand minutes per year (close to 17 hours), MTBF may not be as important a measure as more active metrics like packets per second and MTRS (more on that later). For larger providers, however, a figure for acceptable MTBF should be determined as part of a high-availability plan. A common figure for which companies strive is 99.999% uptime. Most systems administrators install some sort of real-time monitoring for their systems, but the key to maintaining five-nines is tackling problems before they become problems. Diligence in examining log files and using accounting data to plan and analyze usage trends should both be part of a disaster-prevention plan. However, the techniques established in this chapter aren't meant to provide for five-nines uptime.
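To make figures like 99.999% concrete, it helps to convert an availability percentage into an annual downtime budget. A minimal sketch:

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability_pct: float) -> float:
    """Annual downtime budget, in minutes, for a given uptime percentage."""
    return MINUTES_PER_YEAR * (1 - availability_pct / 100)

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% uptime -> {downtime_minutes(pct):.1f} minutes down per year")
```

Five-nines allows barely five minutes of downtime per year; a much more modest 99.9% still permits under nine hours, which for many shops is well within the "under one thousand minutes" range mentioned above.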
Designing for that level of availability requires consultants and expensive equipment. Finally, while examining explicit requirements for high availability, we have to admit that all systems are going to fail at some point. There is no such thing as 100% uptime over any practical, meaningful length of time. The mean time to restore service, or MTRS, measures how long it takes you or your systems to recover from an outage. This could be the most critical of all your requirements. Customers can tolerate some downtime, and few businesses run operations so critical that any downtime at all is a catastrophic nightmare. But your customers, fellow employees, vendors, and partners will all scream if an outage is prolonged. It's likely your company earns more brownie points responding to problems quickly than it does ensuring no downtime ever occurs. There are a few ways to minimize the time it takes to restore service. With hardware, it is always good practice to keep hot spares of critical components around: hubs, switches, routers, hard drives, processors, memory, and disk controllers are all parts with limited (albeit long) lives. It's also advisable to keep copies of critical system software, including the operating system and your RADIUS implementation software, close to the machines on which they're installed. Regular data backups (and, perhaps even more importantly, regular test restorations) should be conducted. Remember: the value of a backup is reduced to zero if it cannot be restored.

10.2.1.2 Derived requirements

The processing power of your server is one requirement that can be driven by a threshold or directive set to maintain service availability. For example, if you need to process 25 packets per second, you require a fairly powerful server.
But there may be a better solution than one big machine: since you have multiple POPs spread around a state while your RADIUS servers sit in one main location, you may investigate using several systems in parallel and building a RADIUS server cluster. Network connectivity is another often-neglected area. With heavy loads of traffic, it's critical to ensure your network connections are rock solid. Quality server-grade network interface cards (NICs) from tier-1 manufacturers such as 3Com and Intel are a must in this situation, and in most cases a second NIC bound to another IP address on the same machine will allow the box to spread the traffic load over the two cards. Additionally, if one dies, the other can still handle traffic and serve as an online backup until a new card can be installed.

10.2.2 Points of Failure

TechWeb's TechEncyclopedia defines redundancy as "...peripherals, computer systems and network devices that take on the processing or transmission load when other units fail." More granularly, redundancy is the inclusion of independent yet identical sets of components to provide an alternate way of performing a function. The concept of redundancy can be applied to individual computers, sets of computers, networks, and even an entire implementation design. Moreover, systems can be made as redundant as needed; as with everything, however, there is a "happy medium." I should discuss certain terminology in regard to redundancy before proceeding further. A technique to balance and distribute loads across servers working in tandem is called a "round robin" strategy. For example, let's say I have three RADIUS servers in one POP. The NAS is configured to send calls in order to one of the three RADIUS servers; the idea is that by choosing the next available server in the "list" upon dial-in, the traffic load is spread evenly among the three servers, so that no one server operates under a much heavier load than the others.
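The round-robin selection described above can be sketched in a few lines. A real NAS implements this internally; the server names here are placeholders.

```python
from itertools import cycle

# Hypothetical RADIUS servers in one POP (illustrative names).
SERVERS = ["radius1.example.net", "radius2.example.net", "radius3.example.net"]

def round_robin(servers):
    """Yield servers in strict rotation so load spreads evenly."""
    ring = cycle(servers)
    while True:
        yield next(ring)

picker = round_robin(SERVERS)
first_six = [next(picker) for _ in range(6)]
print(first_six)  # each server is chosen twice, in order
```

A production implementation would also skip servers that fail to respond, which is where round robin shades into the failover behavior discussed next.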
Second, "failover" describes an arrangement in which an administrator ensures service availability by enabling a service to cut over to a standby server when the primary server fails. This is most commonly found in groups of two servers in one geographic location, such as a particular POP in a city. There may be two RADIUS servers, for example: one configured to handle all of the requests normally and another to take over the duties if the first fails. Hence, the RADIUS service "fails over" to the backup server. (Failover systems have been known, however, not to fail back to the normal servers once the failure condition is resolved. This can inadvertently direct large amounts of traffic to your failover servers, which might not be designed to handle such a load. It's something to be aware of.) Redundancy is often found inside specific pieces of hardware, particularly servers. Most industrial-strength servers include the capability, if not the hardware, to maintain redundancy in critical machine subsystems: power, processors, and controllers. In most cases, the computer's BIOS will detect a failure of a critical component within milliseconds and automatically cut over to the auxiliary device, which is already installed in the machine's chassis. This is one of the benefits you receive for paying extra money for a standard server-class system instead of commissioning a regular personal computer. Hubs and switches are also critical devices that are often overlooked. Ports on these devices can suddenly fail, leaving you without any connectivity to your machine. Unfortunately, switches and other concentrators designed for heavy use in a datacenter are often very expensive, so you must weigh the cost against the benefit of keeping an identical spare switch on site. Ideally, your machine will have two network cards connected to two separate switches.
This eliminates two probable points of failure: the port on a switch (or the entire switch, in the event of a power loss to a rack or a unit) and the NIC inside the chassis. This is cheap insurance and not very difficult to configure. There is an issue to consider with the dual-NIC approach, however. You must have a way to route traffic between both cards. Otherwise, when one card (or concentrator port) fails, your traffic routes will fail to switch over to the functioning interface. This phenomenon is known as "convergence failure." To cure this, run a routing daemon such as gated across all of your interfaces. Another problem that tends to creep up is managing IP addresses. These numbers often change, and this can create havoc for system administrators trying to announce and provide them to the public. It also creates issues with proxy servers and other systems that see your servers as clients. To avoid the renumbering issue, use virtual loopback addresses. These function as aliases to your real address, so your public customers and other remote clients can use these numbers to reach your system no matter what physical addressing it uses.

10.2.3 Planning to Fail

Having multiple servers ready to take over in case of failure is one of the most effective ways of combating downtime. Unfortunately, having multiple servers increases the total cost of ownership of the entire implementation, and many times management wants to increase availability while spending as little money as possible. However, budgeting for high availability is much like budgeting for any type of insurance, whether business or personal: you pay money up front against the time you will need it, but when you need it, you need it badly. A fellow author and systems designer/administrator once told me about the concept of building for failure. I find that a healthy attitude to take.
Companies often build technology infrastructures involving systems critical to their day-to-day operation and only later discover the need for fault tolerance and increased uptime. Renovating an existing implementation to conform to strict availability standards is extremely expensive. Up-front planning, however, reduces much of this cost and allows you to take high availability to a level that otherwise may have been cost prohibitive. There are a few different levels, or "temperatures," of high availability (HA), ranging from inexpensive and slowest to recover up to most expensive and instantaneously available. It's easiest to delineate these temperatures into three groups, but the distinction made here should not be treated as a statement that other combinations of HA systems are not available. In fact, combinations are often necessary because of unique infrastructures and system peculiarities. The point of HA is to strategize your network layout and design to plan for every adverse network event and minimize downtime as much as possible. Cold standby servers offer the least protection from outages, but they are also the most cost-effective standby systems. Most often, the cold standby RADIUS server is actually a box performing another network duty (SMTP service, for example), but the administrator installs and configures the RADIUS server software on that machine and then shuts the service down. The problem with cold servers is that the administrator must know there is a problem with the primary servers, and he must actually perform the cutover to the standby server manually. While it's not expensive at all to keep a cold standby around, it provides very limited failover service and the longest downtime during an outage. The next step up on the availability thermometer is a warm standby server. Warm standby servers are most likely identical to the primary, in-service machines in both hardware configuration and software maintenance.
However, these servers are powered on and able to take over service for a primary server should it go down in a matter of seconds. Software APIs residing on both machines normally can make and receive calls to determine when the standby server should take over duties from the active server. Hot standby servers are the most expensive and most effective way to ensure your implementation has the most uptime possible. Hot backups generally run the system software actively, which means a method of synchronization is present between the active and standby servers to make sure session information and real-time data is mirrored between the two. However, the standby server is not contacted unless all primary servers have gone offline or are otherwise unable to perform service. 10.2.4 Proactive System ManagementAn equally important part of maintaining a RADIUS implementation with the least downtime possible is keeping up with your system and examining it on a daily (or sometimes even more often) basis. There is a glut of monitoring tools on the market now, and there are as many freely available open source tools that can be had for the simple price of compilation and configuration. Most of these tools profile various metrics of your system in two key areas: service statistics and system statistics. Service monitoring is designed to see two things: whether the service is functional, period, and then what kind of load under which the service is operating. The most effective way to test the first tenet is to have a packet generator send RADIUS packets emulating an incoming NAS connection. If a response is received from the RADIUS server, I know it's operating. Beyond that, I want to see some statistics about the environment in which the service is being provided.
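A liveness probe of this kind can be sketched in Python using the Access-Request format from RFC 2865. The server name, port, shared secret, and test credentials below are placeholders; any reply at all (Access-Accept or Access-Reject) proves the server is up.

```python
import hashlib
import os
import socket
import struct

ACCESS_REQUEST = 1  # RFC 2865 packet code

def hide_password(secret: bytes, authenticator: bytes, password: bytes) -> bytes:
    """Obfuscate a User-Password attribute per RFC 2865 section 5.2."""
    padded = password + b"\x00" * (-len(password) % 16)
    out, prev = b"", authenticator
    for i in range(0, len(padded), 16):
        digest = hashlib.md5(secret + prev).digest()
        block = bytes(p ^ d for p, d in zip(padded[i:i + 16], digest))
        out += block
        prev = block
    return out

def build_access_request(secret: bytes, username: bytes, password: bytes,
                         identifier: int = 0) -> bytes:
    """Assemble a minimal Access-Request with User-Name and User-Password."""
    authenticator = os.urandom(16)
    hidden = hide_password(secret, authenticator, password)
    attrs = (bytes([1, 2 + len(username)]) + username +   # attribute 1: User-Name
             bytes([2, 2 + len(hidden)]) + hidden)        # attribute 2: User-Password
    header = struct.pack("!BBH", ACCESS_REQUEST, identifier, 20 + len(attrs))
    return header + authenticator + attrs

def probe(server: str, secret: bytes, timeout: float = 3.0) -> bool:
    """Return True if the RADIUS server answers an Access-Request at all."""
    pkt = build_access_request(secret, b"probe-user", b"probe-pass")
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.settimeout(timeout)
    try:
        sock.sendto(pkt, (server, 1812))
        sock.recvfrom(4096)   # any reply proves liveness
        return True
    except socket.timeout:
        return False
    finally:
        sock.close()
```

A cron-driven monitor might call `probe("radius1.example.net", b"s3cret")` every minute and page an administrator when it returns False; a rejected test user is fine for this purpose, since a Reject still demonstrates the server is processing packets.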
The same strategies for service monitoring can be applied to monitoring the health and activity of the hardware on which the service runs. It's important to determine a baseline with these metrics, as with any other metric, since the thresholds at which you want an alarm raised should be set at 25%-30% above your normal system activity. Here are a few key aspects of your system that need to be checked often.
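The baseline-plus-margin rule above can be sketched in a few lines; the 25% margin and the example figures are illustrative.

```python
def alarm_threshold(baseline: float, margin: float = 0.25) -> float:
    """Alarm level set 25%-30% above the metric's normal baseline."""
    return baseline * (1 + margin)

def should_alarm(current: float, baseline: float, margin: float = 0.25) -> bool:
    """True when a metric has climbed past its alarm level."""
    return current > alarm_threshold(baseline, margin)

# If normal load is 20 packets per second, alarm above 25:
print(alarm_threshold(20))     # -> 25.0
print(should_alarm(27, 20))    # -> True
```

The same two functions apply unchanged to CPU load, disk utilization, or memory pressure; only the baseline differs per metric.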
10.2.5 Case Studies in Deployment and Availability

Once you've focused on securing the availability of your hardware and software through redundancy, you should examine making the RADIUS service as a whole more available. It's important to remember, however, that consultants who specialize in designing highly available network topologies make six-figure salaries doing just that, so presenting every possible way to make a system highly available is beyond the scope of this chapter. Like all plans for failure, you as the designer must strive to reach the "sweet spot" between cost and results. In that spirit, I'll present two example network topologies that accomplish the most redundancy and availability without breaking the bank. I will cover the availability and redundancy strategies used in each design; then, you can take the best practices outlined here and use them as a starting point for your own design. And remember, part of being a designer is knowing when to bring in the big guns: don't be afraid to call a consultant if you realize that you're in over your head. It would simply be a waste of time, money, and system resources to continue on your own at that point.

10.2.5.1 Scenario 1: A small, regional ISP

Raleigh Internet, Inc., is a small Internet service provider operating within the Research Triangle Park region of North Carolina. The provider offers service to residents of the region and the surrounding counties. Raleigh Internet has created points of presence in three locations: at its head office in Durham, with 1,000 ports; in a co-located telco area in Chapel Hill, with 1,500 ports; and a rented set of 2,500 ports from a network conglomerate to serve Raleigh proper. The average user load is 35% at each POP, for a total active port count under normal load of 1,750 ports. The company wishes to provide as much service availability as possible, but the budget is certainly not unlimited, and 99.999% uptime is not an explicit requirement.
The ISP does need to maintain support for processing at 90% load (4,500 ports) across all its POPs without problems. The company maintains a single set of RADIUS servers in its Durham office, along with its arsenal of other service machines for mail, personal web pages, Usenet, and additional services. It doesn't want to maintain separate RADIUS servers able to perform authentication on their own at each POP because of the administrative overhead involved in change management: for example, what if a user in Raleigh changed his password and then went to work in Chapel Hill? How would the password change propagate from the Raleigh machine to the Chapel Hill server? In addition, Raleigh Internet needs to maintain the ability to continue to authenticate users in the event one server goes down. The solution for Raleigh Internet would look something like the topology depicted in Figure 10-1.

Figure 10-1. An availability solution for a small ISP

In this network design, each POP maintains two RADIUS machines that act solely as proxies. The proxy servers are configured to send packets to the main RADIUS machines at the Durham office. The Durham office contains three RADIUS servers working in tandem with a real copy of the authentication database. Having the database on a single cluster of machines makes change management and propagation less of a problem. This solution also allows for the contingency of a RADIUS server in any POP going down: the remaining server can still proxy to the cluster of real servers in Durham and continue processing AAA requests. The ISP decided against having real servers in each POP for two reasons. First, it didn't want to invest in expensive programming and consulting expenses to rectify the propagation problems. Second, the benefits of five-nines uptime, at least to management, were not worth the cost of ensuring more availability, since most of Raleigh Internet's customer base is in the residential market.
Their customers didn't demand such uptime since, for them, access to the Internet wasn't mission critical. In summary, here are the key strategies involved with this scenario:

- Lightweight proxy-only RADIUS servers at each POP, doubled up so one proxy can fail without stopping service
- A single cluster of real RADIUS servers in Durham holding the one authoritative copy of the authentication database, which keeps change management and propagation simple
- A deliberate decision not to pursue five-nines, because the cost outweighed the benefit for a largely residential customer base
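The proxy arrangement in this scenario can be sketched as follows. The host names are hypothetical, and a real deployment would do this in the proxy server's configuration rather than in application code; the point is only the selection logic.

```python
import random

# Hypothetical Durham cluster members (illustrative names).
DURHAM_CLUSTER = ["radius1.durham.example.net",
                  "radius2.durham.example.net",
                  "radius3.durham.example.net"]

def pick_home_server(cluster, down):
    """Choose a real server to proxy a request to, skipping known-down hosts."""
    candidates = [host for host in cluster if host not in down]
    if not candidates:
        raise RuntimeError("no Durham server reachable; AAA is unavailable")
    return random.choice(candidates)

# With radius2 marked dead, requests still reach the cluster:
server = pick_home_server(DURHAM_CLUSTER, down={"radius2.durham.example.net"})
```

Because every POP proxy holds the same short list, losing any one real server, or any one proxy, leaves the AAA path intact, which is exactly the contingency the design calls for.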
10.2.5.2 Scenario 2: A corporation with branch offices

Acme Machine Tools, LLC, is a midsize manufacturer of shop automation and general construction equipment with just over 2,000 employees. Acme has a main office in Chicago, with three branch offices in Tempe, Dallas, and Birmingham. The company has NAS gear in Tempe, Dallas, and Birmingham, each with 500 ports and a nominal active port load of about 75 ports. The company would like to support a 98% maximum load at each POP (490 ports each), for a maximum system-wide load of 1,470 active ports. The company has hired RADIUS administrators for each POP as well as one for the corporate office. Acme wants to create a service that will allow its employees in the branch-office cities to work from home and gain access to the corporate network by dialing in to each city. It also has a fleet of mobile workers who roam around the entire country making sales calls, and they need to be granted access as well. It is assumed that each employee who works from home will only dial in to the set of ports for his respective location (i.e., Jill from Tempe will only dial the Tempe number, since she does not travel). However, the sales fleet needs access to the corporate network, and from there the Internet, from wherever they happen to be. It is also assumed that the work-from-home option is not offered to employees in the Chicago area. The company wants as little administrative overhead as possible, although Acme's resources are a bit more extensive and its budget considerably larger than Raleigh Internet's. How is this best accomplished? Figure 10-2 illustrates the most effective solution.

Figure 10-2. Availability solution for a midsize corporation

Let's take a closer look at this solution. I have placed two fully functional RADIUS servers in each city's POP, with one configured as the primary, always-on server and the other configured as a backup for failover purposes in case the primary goes down.
Based on the assumptions previously listed, I know that the users who work from home in each city are the only ones who will dial that POP's number. By that assumption, I can simply sort the users who need dial-in access by their city of residence and configure only those users on each city's RADIUS servers. So the RADIUS servers in each city's POP will authenticate the users they know about. But that leaves out the fleet of mobile workers. How will they gain access? I have placed a three-node parallel-processing cluster of RADIUS servers, the core of the network, at the corporate head office in Chicago. These servers know about every user with remote-access privileges in the entire company, so they are effectively root servers, much in the same way that there are root DNS servers for the global Internet. The individual RADIUS servers in Birmingham, Dallas, and Tempe are configured as clients of the root RADIUS cluster in Chicago. When a mobile user dials a POP, and the POP does not explicitly know about that user, the individual RADIUS server forwards the request to the root servers. So the mobile users are happy, the work-from-home users are happy, and your administrators are happy, too: user management in this design is a cinch, since the servers in Tempe, Dallas, and Birmingham all depend on the Chicago root nodes. All the administrators have to do is keep the corporate servers updated, which automatically gives the clients up-to-date information. In summary, here are the key strategies involved with this scenario:

- A primary/backup failover pair of fully functional RADIUS servers at each branch POP
- User accounts partitioned by home city, so each POP authenticates only its own work-from-home users
- A root cluster in Chicago that knows every remote-access user, to which branch servers forward requests for unknown (mobile) users
- Centralized account maintenance: administrators update only the Chicago cluster, and the branch servers inherit the change
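The branch-office routing decision can be sketched as follows. The host names and user names are illustrative placeholders, not Acme's real systems.

```python
# Hypothetical routing logic for a branch-office RADIUS server: users
# homed to this POP authenticate locally; anyone else (the mobile sales
# fleet) is proxied to the root cluster in Chicago.

ROOT_SERVERS = ("radius1.chicago.example.com", "radius2.chicago.example.com")
TEMPE_USERS = {"jill", "raj"}  # accounts configured on the Tempe servers

def route_request(username, local_users, root_servers=ROOT_SERVERS):
    """Return 'local' or the root server to forward the request to."""
    if username in local_users:
        return "local"
    # Unknown here: must be a roaming user known only to the root cluster.
    return root_servers[0]

print(route_request("jill", TEMPE_USERS))        # -> local
print(route_request("mobile.rep", TEMPE_USERS))  # forwarded to Chicago
```

This mirrors the DNS analogy in the text: the branch server answers what it knows and defers everything else upward, so adding or removing a mobile user touches only the Chicago cluster.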