8.5 Clustering Objects and Counting I/O Operations

Clustering is a technique that takes advantage of locality (usually on the disk) to improve performance. It is useful when you have objects stored on disk and can arrange where objects are in relation to each other. For example, suppose you store serialized objects on disk but need fast access to some of them. The most basic example of clustering is arranging the serialization of the objects so that you can selectively deserialize exactly the subset of objects you need, in as few disk accesses, file openings, and object deserializations as possible.

Suppose you want to serialize a table of objects. Perhaps they cannot all fit into memory at the same time, or they are persistent, or there are other reasons for serialization. It may be that 10% of the objects in the table are accessed frequently while the other 90% are accessed only infrequently, and the application can accept slight delays when accessing the less frequently required objects. In this scenario, rather than serializing the whole table, you may be better off serializing the 10% of frequently used objects into one file (which can be deserialized in one long call) and the other 90% into one or more other files, with an object table index allowing individual objects to be read in as needed. Alternatively, it may be that objects are grouped in some way in your application so that referencing one of the table objects automatically requires certain other related objects. In this case, you want to cluster these groups of objects so they are deserialized together.

If you need to manage objects on disk for persistence, sharing, memory, or any other reason, you should consider using an object-storage system (such as an object database). The serialization provided with Java is very basic and provides little in the way of simple systemwide customization. For example, if you have a collection of objects on disk, you typically want to read the collection into memory down to one or two levels (i.e., only the collection elements, not any objects held in the instance variables of the collection elements). With serialization, you generally get the transitive closure[12] of the collection, which is almost certainly much more than you want. Serialization supports reading to certain levels in only a very rudimentary way: basically, you have to do the reading yourself, though serialization gives you hooks that let you customize on a per-class basis. The ability to tune to this level of granularity is really what you need for any sort of disk-based object storage beyond the most basic, and you usually do get those extra tuning capabilities in various object-storage systems.
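To make the hot/cold split concrete, here is a minimal sketch of the layout described above. All the names (ClusteredStore, writeHot, writeCold, readCold) are hypothetical illustrations, not a standard API: the frequently used objects are written as a single list that one readObject() call brings back in, while each infrequently used object is serialized independently with a byte-offset index so that it can be fetched without deserializing its neighbors.

    import java.io.*;
    import java.util.*;

    // Hypothetical store (not a standard API): hot objects go in one file
    // read back with a single call; cold objects are written individually
    // with a byte-offset index so each can be read in on demand.
    public class ClusteredStore {
        private final Map<String, long[]> index = new HashMap<>(); // id -> {offset, length}

        // Serialize the frequently used objects as one list:
        // one long call to write, one long call to read back.
        public static void writeHot(File file, List<? extends Serializable> hot)
                throws IOException {
            try (ObjectOutputStream out =
                     new ObjectOutputStream(new FileOutputStream(file))) {
                out.writeObject(new ArrayList<Serializable>(hot));
            }
        }

        // Serialize each infrequently used object independently,
        // recording where its bytes lie in the file.
        public void writeCold(File file, Map<String, ? extends Serializable> cold)
                throws IOException {
            long offset = 0;
            try (FileOutputStream out = new FileOutputStream(file)) {
                for (Map.Entry<String, ? extends Serializable> e : cold.entrySet()) {
                    ByteArrayOutputStream buf = new ByteArrayOutputStream();
                    try (ObjectOutputStream oos = new ObjectOutputStream(buf)) {
                        oos.writeObject(e.getValue());
                    }
                    byte[] bytes = buf.toByteArray();
                    out.write(bytes);
                    index.put(e.getKey(), new long[] { offset, bytes.length });
                    offset += bytes.length;
                }
            }
        }

        // Fetch a single cold object: seek to its offset and
        // deserialize only it, leaving its neighbors on disk.
        public Object readCold(File file, String id)
                throws IOException, ClassNotFoundException {
            long[] entry = index.get(id);
            byte[] bytes = new byte[(int) entry[1]];
            try (RandomAccessFile raf = new RandomAccessFile(file, "r")) {
                raf.seek(entry[0]);
                raf.readFully(bytes);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes))) {
                return in.readObject();
            }
        }
    }

Each cold object deliberately gets its own ObjectOutputStream: a single shared stream writes one stream header and back-references between objects, which would make deserializing from an arbitrary offset impossible. In a real application, the index itself would also need to be persisted alongside the data file.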
At a lower level, you should be aware that the system reads data from the disk one page at a time (page size is system-dependent, normally 4 or 8 KB). This means that if you cluster data (of whatever type) on the disk so that items needed together are physically close together, reading that data into memory is also sped up. Typically, the most control you have over clustering objects is to write related data near each other in the same file and hope that the filesystem is not too fragmented. Defragmenting the disks on occasion can help.

Clustering should reduce the number of disk I/O operations you need to execute. Consequently, measuring the number of disk I/O operations executed is essential to determine whether you have clustered usefully.[13] The simplest technique for measuring I/O is to monitor the number of reads, writes, opens, and closes performed. This is complicated by the way I/O classes are wrapped one around another, but you can always find the lowest-level class that actually does the I/O (usually one of FileInputStream, FileOutputStream, and RandomAccessFile in the java.io package). You can identify all the methods that actually execute I/O fairly easily if you have the JDK source: simply find all source files containing the word "native." If you look through java.io for these and examine the names of the native methods, you will find that in almost every case, the only classes applicable to you are FileInputStream, FileOutputStream, and RandomAccessFile.

Now the difficult part is wrapping these calls so that you can monitor them. Native methods declared private are straightforward to handle: just redefine the java.io class to count the times they are called internally. Native methods that are protected or have no access modifier are handled similarly: just ensure you do the same redefinition for subclasses and package members. But methods defined with the public modifier need to be tracked in every class that calls them, which can be difficult and tiresome, though not impossible.
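Where redefining the java.io classes is impractical, the wrapping approach can be sketched as follows. CountingInputStream is an illustrative name, not a JDK class: interpose it directly around the lowest-level stream so that every read passed down to it, however buffered or layered above, is tallied at the point the disk is actually hit.

    import java.io.FilterInputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.util.concurrent.atomic.AtomicLong;

    // Hypothetical counting wrapper (not a JDK class): tally read calls
    // and bytes transferred by the stream that actually performs the I/O.
    public class CountingInputStream extends FilterInputStream {
        public static final AtomicLong readCalls = new AtomicLong();
        public static final AtomicLong bytesRead = new AtomicLong();

        public CountingInputStream(InputStream in) {
            super(in);
        }

        public int read() throws IOException {
            readCalls.incrementAndGet();
            int b = in.read();
            if (b >= 0) {
                bytesRead.incrementAndGet();
            }
            return b;
        }

        // FilterInputStream.read(byte[]) delegates here, so this
        // override also counts the one-argument bulk reads.
        public int read(byte[] buf, int off, int len) throws IOException {
            readCalls.incrementAndGet();
            int n = in.read(buf, off, len);
            if (n > 0) {
                bytesRead.addAndGet(n);
            }
            return n;
        }
    }

Wrapping at construction time, for example new BufferedInputStream(new CountingInputStream(new FileInputStream(file))), counts the bulk reads the buffer issues against the underlying FileInputStream, i.e., the operations that actually reach the disk, rather than the many small reads satisfied from the buffer above.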
The simplest alternative would be to use the debug interface to count the number of hits on each method. Unfortunately, you cannot set a breakpoint on a native method, so this is not possible. The result is that it takes some effort to identify every I/O call in an application. If you have consistently used your own I/O classes, the java.io buffered classes, and the java.io Reader and Writer classes, it may be enough to wrap the I/O calls to FileOutputStream and FileInputStream from those classes. If you have done nonstandard things, you need to put in more effort.

One other way to determine how many I/O operations you have executed is to call Runtime.getRuntime().traceMethodCalls(true) before the test starts, capture the method trace, and filter out the native calls you have identified. Unfortunately, this is optional functionality in the JDK (Java specifies that the traceMethodCalls() method must exist in Runtime, but it does not have to do anything), so you are lucky if you use a system that supports it. The only one I am aware of is the Symantec development environment, and in that case, you have to be in the IDE and running in debug mode. Running the Symantec VM outside the IDE does not seem to enable this feature. Some profilers may also help produce a trace of all I/O operations.

I recommend that all basic I/O calls have logging statements next to them, capable of reporting the amount of I/O performed (both the number of I/O operations and the number of bytes transferred). I/O is typically so costly that one null call or if statement (when logging is not turned on) is insignificant next to each I/O performed. On the other hand, it is incredibly useful to be able to determine at any time whether I/O is causing a performance problem. I/O performance typically depends on the configuration of the system and on resources outside the application, so if an unusual configuration makes I/O dramatically more expensive, the problem can easily be missed in testing and be difficult to diagnose (especially remotely) unless you have an I/O-monitoring capability built into your application.
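As a sketch of that logging recommendation, the guard below costs a single boolean test per I/O call when tracing is off. The IOLog class and the myapp.io.trace property name are illustrative assumptions, not part of any standard library.

    // Hypothetical logging hook for I/O call sites. The class name and
    // the "myapp.io.trace" system property are illustrative assumptions.
    public final class IOLog {
        // One cheap volatile read per call site when tracing is off.
        public static volatile boolean enabled =
            Boolean.getBoolean("myapp.io.trace");

        private static long operations;
        private static long bytes;

        // Record one I/O operation and the bytes it transferred.
        public static synchronized void record(String op, long byteCount) {
            operations++;
            if (byteCount > 0) {
                bytes += byteCount;
            }
            System.err.println("I/O " + op + ": " + byteCount + " bytes"
                + " (total ops=" + operations + ", total bytes=" + bytes + ")");
        }

        private IOLog() {}
    }

    // At each I/O call site, the disabled cost is a single test:
    //     int n = in.read(buf);
    //     if (IOLog.enabled) IOLog.record("read", n);

Because the flag is read from a system property, the same build can run silently in production and report full I/O counts at a problem site simply by restarting with the property set, which is exactly the remote-diagnosis situation described above.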