4.14 Timing Cryptographic Primitives

4.14.1 Problem

You want to compare the efficiency of two similar cryptographic primitives and would like to ensure that you do so in a fair manner.

4.14.2 Solution

Time operations by calculating how many cycles it takes to process each byte, so that you can compare numbers across processors of different speeds more fairly.

Focus on the expected average case, best case, and worst case.

4.14.3 Discussion

When you're looking at timing information, you usually have one of two motivations: either you're interested in comparing the performance of two algorithms, or you'd like to get a sense of how much data you'll actually be able to pump through a particular machine.

Measuring bytes per second is a useful thing when you're comparing the performance of multiple algorithms on a single box, but it gives no real indication of performance on other machines. Therefore, cryptographers prefer to measure how many processor clock cycles it takes to process each byte, because doing so allows for comparisons that are more widely applicable. For example, such comparisons will generally hold fast on the same line of processors running at different speeds.

If you're directly comparing the speed of an algorithm on a 2GHz Pentium 4 against the published speed of the same algorithm run on a 800 MHz Pentium 3, the first one will always be faster when measured in bytes per second. However, if you convert the numbers from bytes per second to cycles per byte, you'll see that, if you run the same implementation of an algorithm on a P3 and a P4, the P3 will generally be faster by 25% or so, just because instructions on a P4 take longer to execute on average than they do on a P3.

If you know the speed of an algorithm in bytes per second, you can calculate the number of cycles per byte simply by dividing by the clock speed in hertz (giving you bytes per cycle) and taking the reciprocal (getting cycles per byte). If you know the speed measured in gigabytes per second, you can divide by the clock speed in gigahertz, then take the reciprocal. For example, you can process data at 0.2 gigabytes per second on a 3 GHz CPU as follows:

.2/3 = 0.066666666666666666 (bytes processed per cycle)

1/0.066666666666666666 = 15.0 cycles per byte

For many different reasons, it can be fairly difficult to get timing numbers that are completely accurate. Often, internal clocks that the programmer can read are somewhat asynchronous from the core processor clock. More significantly, there's often significant overhead that can be included in timing results, such as the cost of context switches and sometimes timing overhead.

Some CPUs, such as AMD's Athlon, are advertised such that the actual clock speed is not obvious. For example, the Athlon 2000 runs at roughly 1666 MHz, significantly less than the 2000 MHz one might suspect.

Generally, you'll want to find out how quickly a primitive or algorithm can process a fixed amount of data, and you'd like to know how well it does that in a real-world environment. For that reason, you generally shouldn't worry much about subtracting out things that aren't relevant to the underlying algorithm, such as context switches and procedure call overhead. Instead, we recommend running the algorithm many times and averaging the total time to give a good indication of overall performance.

In the following sections we'll discuss timing basics, then look at the particulars of timing cryptographic code.

4.14.3.1 Timing basics

You need to be able to record the current time with as much precision as possible. On a modern x86 machine, it's somewhat common to see people using inline assembly to call the RDTSC instruction directly, which returns the number of clock cycles since boot as a 64-bit value. For example, here's some inline assembly for GCC on 32-bit x86 platforms (only!) that reads the counter, placing it into a 64-bit unsigned long long that you pass in by address:

#define current_stamp(a) asm volatile("rdtsc" : "=a"(((unsigned int *)(a))[0]),\
 "=d"(((unsigned int *)a)[1]))

The following program uses the above macro to return the number of ticks since boot:

#include <stdio.h>
   
int main(int argc, char *argv[  ]) {
  spc_uint64_t x;
   
  current_stamp(&x);
  printf("%lld ticks since boot (when I read the clock).\n", x);
  return 0;
}

RDTSC is fairly accurate, although processor pipelining issues can lead to this technique's being a few cycles off, but this is rarely a big deal.

On Windows machines, you can read the same thing using QueryPerformanceCounter( ), which takes a pointer to a 64-bit integer (the LARGE_INTEGER or _ _int64 type).

You can get fairly accurate timing just by subtracting two subsequent calls to current_stamp( ). For example, you can time how long an empty for loop with 10,000 iterations takes:

#include <stdio.h>
   
int main(int argc, char *argv[  ]) {
  spc_uint64_t start, finish, diff;
  volatile int i;
   
  current_stamp(&start);
  for (i = 0;  i < 10000;  i++);
  current_stamp(&finish);
  diff = finish - start;
  printf("That loop took %lld cycles.\n", diff);
  return 0;
}

On an Athlon XP, compiling with GCC 2.95.4, the previous code will consistently give 43-44 cycles without optimization turned on and 37-38 cycles with optimization turned on. Generally, if i is declared volatile, the compiler won't eliminate the loop, even when it can figure out that there are no side effects.

Note that you can expect some minimal overhead in gathering the timestamp to begin with. You can calculate the fixed timing overhead by timing nothing:

int main(int argc, char *argv[  ]) {
  spc_uint64_t start, finish, diff;
   
  current_stamp(&start);
  current_stamp(&finish);
  diff = finish - start;
  printf("Timing overhead takes %lld cycles.\n", diff);
  return 0;
}

On an Athlon XP, the overhead is usually reported as 0 cycles and occasionally as 1 cycle. This isn't really accurate, because the two store operations in the first timestamp call take about 2 to 4 cycles. The problem is largely due to pipelining and other complex architectural issues, and it is hard to work around. You can explicitly introduce pipeline stalls, but we've found that doesn't always work as well as expected. One thing to do is to time the processing of a large amount of data. Even then, you will get variances in timing because of things not under your control, such as context switches. In short, you can get within a few cycles of the truth, and beyond that you'll probably have to take some sort of average.

A more portable but less accurate way of getting timing information on Unix-based platforms is to ask the operating system for the clock using the gettimeofday( ) function. The resolution varies depending on your underlying hardware, but it's usually very good. It can be implemented using RDTSC but might have additional overhead. Nonetheless, on most operating systems, gettimeofday( ) is very accurate.

Other Ways to Get the Time

On many machines, there are other ways to get the time. One way is to use the POSIX times( ) function, which has the advantage that you can separate the time your process spends in the kernel from the time spent in user space running your code. While times( ) is obsoleted on many systems, getrusage( ) does the same thing.

Another alternative is the ISO C89 standard function, clock( ). However, other timers we discuss generally provide resolution that is as good as or better than this function.

Here's a macro that will use gettimeofday( ) to put the number of microseconds since January 1, 1970 into an unsigned 64-bit integer (if your compiler does not support a 64-bit integer type, you'll have to store the two 32-bit values separately and diff them properly; see below).

#include <sys/time.h>
#define current_time_as_int64(a) {                               \
          struct timeval t;                                      \
          gettimeofday(&t, 0);                                   \
          *a = (spc_uint64_t)((t.tv_sec * 1000000) + t.tv_usec); \
        }

Attackers can often force the worst-case performance for functionality with well-chosen inputs. Therefore, you should always be sure to determine the worst-case performance characteristics of whatever it is you're doing, and plan accordingly.

The gettimeofday( )-based macro does not compute the same thing the RDTSC version does! The former returns the number of microseconds elapsed, while the latter returns the number of cycles elapsed.

You'll usually be interested in the number of seconds elapsed. Therefore, you'll need to convert the result of the gettimeofday( ) call to a number of cycles. To perform this conversion, divide by the clock speed, represented as a floating-point number in gigahertz.

Because you care about elapsed time, you'll want to subtract the starting time from the ending time to come up with elapsed time. You can transform a per-second representation to a per-cycle representation once you've calculated the total running time by subtracting the start from the end. Here's a function to do both, which requires you to define a constant with your clock speed in gigahertz:

#define MY_GHZ 1.6666666666666667  /* We're using an Athlon XP 2000 */
   
spc_uint64_t get_cycle_count(spc_uint64_t start, spc_uint64_t end) {
  return (spc_uint64_t)((end - start) / (doublt)MY_GHZ);
}

4.14.3.2 Timing cryptographic code

When timing cryptographic primitives, you'll generally want to know how many cycles it takes to process a byte, on average. That's easy: just divide the number of bytes you process by the number of cycles it takes to process. If you wish, you can remove overhead from the cycle count, such as timing overhead (e.g., a loop).

One important thing to note about timing cryptographic code is that some types of algorithms have different performance characteristics as they process more data. That is, they can be dominated by per-message overhead costs for small message sizes. For example, most hash functions such as SHA1 are significantly slower (per byte) for small messages than they are for large messages.

You need to figure out whether you care about optimal performance or average-case performance. Most often, it will be the latter. For example, if you are comparing the speeds of SHA1 and some other cryptographic hash function such as RIPEMD-160, you should ask yourself what range of message sizes you expect to see and test for values sampled throughout that range.

[ Team LiB ]