13.5 Tuning After Deployment

Tuning does not necessarily end at the development stage. Many applications (agent applications, services, servlets and servers, multiuser applications, enterprise systems, and so on) need constant performance monitoring after deployment to ensure that no degradation takes place. In this section, I discuss tuning the deployed application. This is mainly relevant to enterprise systems that are being administered. Shrinkwrapped or similar software is normally tuned the same way after deployment as before, using standard profiling tools.

Monitoring the application is the primary tuning activity after deployment. The application should be built with hooks that enable tools to connect to it and gather statistics and response times. The application should be constantly monitored, and all performance logs retained. Monitoring should record as many parameters as possible throughout the system, but not so many that the monitoring itself significantly slows the running application. Almost any act of measuring a system affects its performance, but performance logs normally pay off enormously, and a decrease in performance of a few percent should be acceptable.

Individual records in the performance logs should include at least the following six categories:

Time (including offset time from a reference server)
User identifier
Transaction identifier
Application name, type, class, or group
Software component or subsystem
Hardware resource
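
As an illustration only, the six categories could be carried in a single log record like the one sketched here. The type name, field names, and the pipe-separated line format are all hypothetical, and the sketch assumes that the offset is the local clock minus the reference server's clock. It uses a Java record, so it needs Java 16 or later.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical log record holding the six categories listed above.
record PerfLogRecord(
        Instant time,                  // local timestamp of the event
        Duration offsetFromReference,  // assumed: local clock minus reference server clock
        String userId,
        String transactionId,
        String application,            // application name, type, class, or group
        String component,              // software component or subsystem
        String hardwareResource) {

    // The event time expressed on the reference server's clock.
    Instant referenceTime() {
        return time.minus(offsetFromReference);
    }

    // One pipe-separated line per record; the offset is written in milliseconds.
    String toLogLine() {
        return String.join("|",
                time.toString(),
                Long.toString(offsetFromReference.toMillis()),
                userId, transactionId, application, component, hardwareResource);
    }
}
```

Storing the offset alongside the local time lets logs from different machines be placed on one timeline later, even if their clocks have drifted.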

A standard set of performance logs should be used to give a background system measurement and kept as a reference. Other logs can be compared against that standard. Periodically, the standard should be regenerated, as most enterprise applications change their performance characteristics over time. Ideally, the standard logs can be automatically compared against the current logs, and any significant change in behavior is automatically identified and causes an alert to be sent to the administrators. Trends away from the standard should also trigger a notification; sometimes performance degrades slowly but consistently because of a gradually depleting resource.
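
One possible way to automate the comparison is sketched below, assuming that each log has already been reduced to an array of response times in milliseconds. The class name, method names, and thresholds are hypothetical. The check flags either a large change in the mean relative to the standard, or a steady upward trend within the current samples.

```java
import java.util.Arrays;

// Hypothetical comparison of current response times against a stored standard.
final class BaselineCheck {

    // Relative change of the current mean against the baseline mean (0.15 means 15% slower).
    static double relativeChange(double[] baselineMs, double[] currentMs) {
        double base = Arrays.stream(baselineMs).average().orElseThrow();
        double cur = Arrays.stream(currentMs).average().orElseThrow();
        return (cur - base) / base;
    }

    // Least-squares slope of the samples against their index, in ms per sample.
    // Returns 0 when there are fewer than two samples.
    static double trendSlope(double[] samplesMs) {
        int n = samplesMs.length;
        if (n < 2) {
            return 0.0;
        }
        double meanX = (n - 1) / 2.0;
        double meanY = Arrays.stream(samplesMs).average().orElseThrow();
        double num = 0.0;
        double den = 0.0;
        for (int i = 0; i < n; i++) {
            num += (i - meanX) * (samplesMs[i] - meanY);
            den += (i - meanX) * (i - meanX);
        }
        return num / den;
    }

    // Alert on a large shift in the mean, or on a sustained upward drift.
    static boolean shouldAlert(double[] baselineMs, double[] currentMs,
                               double maxRelativeChange, double maxSlope) {
        return Math.abs(relativeChange(baselineMs, currentMs)) > maxRelativeChange
                || trendSlope(currentMs) > maxSlope;
    }
}
```

The slope test is what catches slow, consistent degradation: the mean may still be within tolerance while each sample is a little worse than the last.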

Administrators should note every single change to the system: every patch, every upgrade, every configuration change, etc. These changes are the source of most performance problems in production. Patches are cheaper short-term fixes than upgrades, but they usually add to the complexity of the application and increase maintenance costs. Upgrades and rereleases are more expensive in the short term, but cheaper overall.

Administrators should listen to users. Users are the most sensitive barometer of application performance. However, you should double-check users' assertions. A user may be wrong, or might have hit a known system problem or a temporary administrative shutdown. Measure the performance yourself. Repeat the measurements several times, and compute both the average and the variation. Ensure that caching effects do not skew measurements of a reported problem.
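
A minimal sketch of such a repeated measurement follows. The class and method names are hypothetical, and the task being timed stands in for whatever operation the user reported. The first run is kept separate from the rest because it is the one most likely to be affected by cold caches. Whether the cold or the warm figure matches the user's experience depends on the problem reported.

```java
import java.util.Arrays;

// Hypothetical helper for timing an operation several times.
final class RepeatedMeasurement {

    // Runs the task `runs` times and returns each elapsed time in milliseconds.
    static double[] time(Runnable task, int runs) {
        double[] ms = new double[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            ms[i] = (System.nanoTime() - start) / 1_000_000.0;
        }
        return ms;
    }

    static double mean(double[] xs) {
        return Arrays.stream(xs).average().orElseThrow();
    }

    // Sample standard deviation (n - 1 denominator); needs at least two values.
    static double stdDev(double[] xs) {
        double m = mean(xs);
        double sumSq = 0.0;
        for (double x : xs) {
            sumSq += (x - m) * (x - m);
        }
        return Math.sqrt(sumSq / (xs.length - 1));
    }

    // All samples after the first run, which may have hit cold caches.
    static double[] withoutFirst(double[] xs) {
        return Arrays.copyOfRange(xs, 1, xs.length);
    }
}
```

A typical use would be to call `time` with the reported operation, report the first sample on its own, and then report `mean` and `stdDev` of `withoutFirst`. A large standard deviation suggests interference from something else on the system.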

When looking for reasons why performance may have changed, consider any recent changes such as an increase in the number of users, other applications added to the system, code changes on the client or server, hardware changes, etc. In addition to user response time measurements, look at where the distributed code is executing, what volumes of data are being used, and where the code is spending most of its time.

Many factors can easily produce misleading or temporarily different measurements of the application. Distributed garbage collection may have kicked in, system clocks may have become unsynchronized, background processes may have been triggered, and relative processor power may have changed, causing obscure effects. Consider whether anyone else is using the processors, and if so, what they are doing and why.

You need to differentiate between:

Occasional sudden slowness, e.g., from background processes starting up
General slowness, perhaps reflecting that the application was not tuned for the current load, or that the systems or networks are saturated
A sudden slowdown that continues, often the result of a change to the system

Each of these characteristic changes in performance indicates a different set of problems.
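
The three patterns can be told apart roughly from a series of response times. The sketch below is a simple heuristic, not a definitive detector. The class name, the `slowFactor` threshold, and the `minSustained` run length are hypothetical parameters chosen for illustration.

```java
// Hypothetical heuristic for labeling a series of response times.
final class SlowdownPattern {

    enum Kind { NORMAL, OCCASIONAL, GENERAL, SUSTAINED }

    // A sample counts as slow when it exceeds baselineMs * slowFactor.
    static Kind classify(double[] samplesMs, double baselineMs,
                         double slowFactor, int minSustained) {
        double limit = baselineMs * slowFactor;
        int slow = 0;
        int firstSlow = -1;
        for (int i = 0; i < samplesMs.length; i++) {
            if (samplesMs[i] > limit) {
                slow++;
                if (firstSlow < 0) {
                    firstSlow = i;
                }
            }
        }
        if (slow == 0) {
            return Kind.NORMAL;
        }
        if (slow == samplesMs.length) {
            return Kind.GENERAL;       // slow throughout the window
        }
        // Every sample from the first slow one to the end is slow: a step change.
        if (slow == samplesMs.length - firstSlow && slow >= minSustained) {
            return Kind.SUSTAINED;
        }
        return Kind.OCCASIONAL;        // isolated slow samples
    }
}
```

The `minSustained` parameter keeps a single slow sample at the end of the window from being reported as a lasting slowdown.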