8.6 Compression

A colleague of mine once installed a compression utility on his desktop machine that compressed the entire disk. The utility worked as a type of disk driver: accesses to the disk went through the utility, and every read and write was decompressed or compressed transparently to the rest of the system, and to the user. My colleague was expecting the system to run slower, but needed the extra disk space and was willing to put up with a slower system.

What he actually found was that his system ran faster! It turned out that the major bottleneck to his system was disk throughput, and by making most files smaller (averaging half the previous size), everything was moving between memory and disk much quicker. The CPU had plenty of spare cycles necessary to handle the compression-decompression procedures because it was waiting for disk transfers to complete.

This illustrates how the overhead of compression can be outweighed by the benefits of reducing I/O. The system described obviously had a disk that was relatively too slow in comparison to the CPU processing power. But this is quite common. Disk throughput has not improved nearly as fast as CPUs have increased in speed, and this divergent trend is set to continue for some time. The same is true for networks. Although networks do tend to have a huge jump in throughput with each generation, this jump tends to be offset by the much larger volumes of data being transferred. Furthermore, network-mounted disks are also increasingly common, and the double performance hit from accessing a disk over a network is surely a prime candidate for increasing speed using compression.

On the other hand, if a system has a fully loaded CPU, adding compression can make things worse. This means that when you control the environment (servers, servlets, etc.), you can probably specify precisely, by testing, whether or not to use compression in your application to improve performance. When the environment is unknown, the situation is more complex. One suggestion is to write I/O wrapper classes that handle compressed and uncompressed I/O automatically on the fly. Your application can then test whether any particular I/O destination has better performance using compression, and then automatically use compression when called for.

One final thing to note about compressed data is that it is not always necessary to decompress the data in order to work with it. As an example, if you are using 2-Ronnies compression,^[14] the text "Hello. Have you any eggs? No, we haven't any eggs" is compressed into "LO. F U NE X? 9, V FN NE X."

^[14] "The Two Ronnies" was a British comedy show that featured very inventive comedy sketches, many based on word play. One such sketch involved a restaurant scene where all the characters spoke only in letters and numbers, joining the letters up in such a way that they sounded like words. The mapping for some of the words to letters was as follows: have = F, you = U, any = NE, eggs = X, hello = LO, no = 9, yes = S, we = V, have = F, haven't = FN, ham = M, and = N

Now, if I want to search the text to see if it includes the phrase "any eggs," I do not actually need to decompress the compressed text. Instead, I compress the search string "any eggs" using 2-Ronnies compression into "NE X", and I can now use that compressed search string to search directly on the compressed text.

When applied to objects or data, this technique requires some effort. You need to ensure that any small data chunk compresses in the same way both on its own and as part of a larger volume of data containing that data chunk. If this is not the case, you may need to break objects and searchable data into fields that are individually compressed.

There are several advantages to this technique of searching directly against compressed data:

There is no need to decompress a large amount of data.
Searches are actually quicker because the search is against a smaller volume of data.
More data can be held in memory simultaneously (since it is compressed), which can be especially important for searching through large volumes of disk-stored data.

It is rarely possible to search for compressed substrings directly in compressed data because of the way most compression algorithms use tables covering the whole dataset. However, this scheme has been used to selectively query for data locations. For this usage, unique data keys are compressed separately from the rest of the data. A pointer is stored next to the compressed key. This produces a compressed index table that can be searched without decompressing the keys. The compression algorithm is separately applicable for each key. This scheme allows compressed keys to be searched directly to identify the location of the corresponding data.