A message digest (also known as a cryptographic checksum or cryptographic hashcode) is nothing more than a number - a special number that is effectively a hashcode produced by a function that is very difficult to reverse.
A digital signature is (most often) a message digest encrypted with someone's private key to certify the contents. This process of encryption is called signing. Digital signatures can perform two different functions, both very important to the security of your system:
Integrity - A digital signature indicates whether a file or a message has been modified.
Authentication - A digital signature makes possible mathematically verifying the name of the person who signed the message.
A third function that is quite valuable in some contexts is called non-repudiation. Non-repudiation means that after you have signed and sent a message, you cannot later claim that you did not sign the original message. You cannot repudiate your signature, because the message was signed with your private key (which, presumably, no one else has).
We'll outline the first two of these concepts here, and refer to them in later chapters (and especially in Chapter 9, Integrity Management). Non-repudiation is not of major concern to us here, although it is important in many other contexts, especially that of email.
A simple hash function takes some input, usually of indefinite length, and produces a small number that is significantly shorter than the input. The function is many to one, in that many (possibly infinite) inputs may generate the same output value. The function is also deterministic in that the same output value is always generated for identical inputs. Hash functions are often used in mechanisms that require fast lookup for various inputs, such as symbol tables in compilers and spelling checkers.
A message digest is also a hash function. It takes a variable length input - often an entire disk file - and reduces it to a small value (typically 128 to 512 bits). Give it the same input, and it always produces the same output. And, because the output is very much smaller than the potential input, for at least one of the output values there must be more than one input value that can produce it; we would expect that to be true for all possible output values for a good message digest algorithm.
There are two other important properties of good message digest algorithms. The first is that the algorithm cannot be predicted or reversed. That is, given a particular output value, we cannot come up with an input to the algorithm that will produce that output, either by trying to find an inverse to the algorithm, or by somehow predicting the nature of the input required. With at least 128 bits of output, a brute force attack is pretty much out of the question, as there will be 1.7 x 1038 possible input values of the same length to try, on average, before finding one that generates the correct output. Compare this with some of the figures given in "Strength of RSA" earlier in this chapter, and you'll see that this task is beyond anything anyone would be able to try with current technology. With numbers as large as these, the idea that any two different documents produced at random during the course of human history would have the same 128-bit message digest is unlikely!
The second useful property of message digest algorithms is that a small change in the input results in a significant change in the output. Change a single input bit, and roughly half of the output bits should change. This is actually a consequence of the first property, because we don't want the output to be predictable based on the input. However, this aspect is a valuable property of the message digest all by itself.
Given the way that a message digest works, you can understand how it can be used as an authentication system for anyone who is distributing digital documents: simply publish your documents electronically, distribute them on the Internet, and for each document also publish its message digest. Then, if you want to be sure that the copy of the document you download from the Internet is an unaltered copy of the original, simply recalculate the document's message digest and compare it with the one for the document that you published. If they match, you know you've got the same document as the original.
In fact, the Computer Emergency Response Team (CERT) does this via the Internet when they distribute patches and bug fixes for security-related problems. The following is a portion of a 1994 message from CERT advising recipients to replace the FTPprograms on their computers with more secure versions:
Date: Thu, 14 Apr 94 16:00:00 EDT Subject: CERT Advisory CA-94:08.ftpd.vulnerabilities To: [email protected] From: [email protected] (CERT Advisory) Organization: CERT Coordination Center Address: Software Engineering Institute Carnegie Mellon University Pittsburgh, Pennsylvania 15213-3890 ======================================================================= CA-94:08 CERT Advisory April 14, 1994 ftpd Vulnerabilities ----------------------------------------------------------------------- The CERT Coordination Center has received information concerning two vulnerabilities in some ftpd implementations. The first is a vulnerability with the SITE EXEC command feature of the FTP daemon (ftpd) found in versions of ftpd that support the SITE EXEC feature. This vulnerability allows local or remote users to gain root access. The second vulnerability involves a race condition found in the ftpd implementations listed in Section I. below. This vulnerability allows local users to gain root access. Sites using these implementations are vulnerable even if they do not support anonymous FTP. . . . II. Impact Anyone (remote or local) can gain root access on a host running a vulnerable FTP daemon. Support for anonymous FTP is not required to exploit this vulnerability. III. Solution Affected sites can solve both of these problems by upgrading to the latest version of ftpd. These versions are listed below. Be certain to verify the checksum information to confirm that you have retrieved a valid copy. If you cannot install the new version in a timely manner, you should disable FTP service until you have corrected this problem. It is not sufficient to disable anonymous FTP. You must disable the FTP daemon. For wuarchive ftpd, you can obtain version 2.4 via anonymous FTP from wuarchive.wustl.edu, in the "/packages/wuarchive-ftpd" directory. If you are currently running version 2.3, a patch file is available. BSD SVR4 File Checksum Checksum MD5 Digital Signature ----------------- -------- --------- ----------------------------- wu-ftpd2.4.tar.Z 38213 181 20337 362 cdcb237b71082fa23706429134d8c32e patch_2.3-2.4.Z 09291 8 51092 16 5558a04d9da7cdb1113b158aff89be8f . . .
As you can tell from the tone of the message, CERT considers the security problem to be extremely serious: anyone on the Internet could break into a computer running a particular ftpd program and become the superuser! CERT had a fix for this bug but rather than distribute it to each site individually, they identified a principal site with the fix (wuarchive.wustl.edu) and advised people to download the fix.
MD5 is a commonly used message-digest algorithm. CERT publishes the MD5 "digital signature" of the files so that people can verify the authenticity of the patch before they install it. After receiving the message from CERT and downloading the patch, system administrators are supposed to compute the MD5 of the binary. This process provides two indications.
If the MD5 of the binary matches the one published in the CERT message, a system administrator knows the file wasn't damaged during the download.
More importantly, if the two MD5 codes match, a system administrator can be certain that hackers haven't broken into the computer wuarchive.wustl.edu and replaced the patch with a program that contains security holes.
Although CERT's interest in security is commendable, there is an important flaw in this approach. Nothing guarantees that the CERT message itself isn't forged. In other words, any hacker with sufficient skill to break into an anonymous FTP repository and play switcheroo with the binaries might also be able to send a forged email message to CERT's mailing lists telling system administrators to install new (and faulty) software. Unsuspecting administrators would receive the email message, download the patch, check the MD5 codes, and install the software - creating a new security hole for the hackers to exploit.
CERT's problem, then, is this: while the MD5 code in the email message is a signature for the patch, there is no signature for the message from CERT itself! There is no way to verify that the CERT message is authentic.
CERT personnel are aware of the problem that alerts can be forged. For this reason, CERT uses digital signatures for all its alerts available via anonymous FTP from the computer ftp.cert.org. For each alert, you find a file with an .asc file extension, which contains a digital signature. Unfortunately, CERT (at this time) does not distribute its alerts with the digital signature at the bottom of the message, which would make things easier for everybody.
As we pointed out in the last section, a message digest function is only half of the solution to creating a reliable digital signature. The other half is public key encryption - only run in reverse.
Recall that when we introduced public key encryption earlier in this chapter, we said that it depends on two keys:
A key that is used for encrypting a secret message. Normally, the public key is widely published.
A key that is kept secret, for decrypting messages once they are received.
By using a little bit of mathematical gymnastics, you can run the public key algorithm in reverse. That is, you could encrypt messages with your secret key; these messages can then be decrypted by anyone who possesses your public key. Why would anyone want to do this? Each public key has one and only one matching secret key. If a particular public key can decrypt a message, you can be sure that the matching secret key was used to encrypt it. And that is how digital signatures work.
When you apply your secret key to a message, you are signing it. By using your secret key and the message digest function, you are able to calculate a digital signature for the message you're sending. In principle, a public key algorithm could be used without a message digest algorithm: we could encrypt the whole message with our private key. However, every public key algorithm in use requires a large amount of processor time to encrypt even moderate-size inputs. Thus, to sign a multi-megabyte file might take hours or days if we only used the public key encryption algorithm.
Instead, we use a fast message digest algorithm to calculate a hash value, and then we sign that small hash value with our secret key. When you, the recipient, get that small value, you can decrypt the hash value using our public key. You can also recreate the hash value from the input. If those two values match, you are assured that you got the same file we signed.
The most common digital signature in use today is the combination of the MD5 message digest algorithm and the RSA public key encryption mechanism. Another likely possibility is to use the SHA (Secure Hash Algorithm) and ElGamal public-key mechanism; together, these two algorithms form the NIST DSA - Digital Signature Algorithm.
There are many message-digest functions available today. All of them work in roughly the same way, but they differ in speed and specific features. Details of these functions may be found in the references in Appendix D.
One of the most widely used message digest functions is the MD5 function, which was developed by Ronald Rivest, is distributed by RSA Data Security, and may be used freely without license costs. It is based on the MD4 algorithm, which in turn was based on the MD2 algorithm.
The MD2, MD4, and MD5 message digest functions all produce a 128-bit number from a block of text of any length. Each of them pads the text to a fixed-block size, and then each performs a series of mathematical operations on successive blocks of the input.
MD2 was designed by Ronald Rivest and published in RFC 1319. There are no known weaknesses in it, but it is very slow. To create a faster message-digest, Rivest developed MD4, which was published in Internet RFCS 1186 and 1320. The MD4 algorithm was designed to be fast, compact, and optimized for machines with "little-endian" architectures.
Some potential attacks against MD4 were published in the cryptographic literature, so Dr. Rivest developed the MD5 algorithm, published in RFC 1321.[21] It was largely a redesign of MD4, and includes one more round of internal operations and several significant algorithmic changes. Because of the changes, MD5 is somewhat slower than MD4. However, it is more widely accepted and used than the MD4 algorithm.
[21] Internet RFCs are a form of open standards documents. They can be downloaded or mailed, and they describe a common set of protocols and data structures for interpretability.
As of early 1996, significant flaws have been discovered in MD4. As a result, the algorithm should not be used.
The Secure Hash Algorithm was developed by NIST with some assistance by the NSA. The algorithm appears to be closely related to the MD4 algorithm, except that it produces an output of 160 bits instead of 128. Analysis of the algorithm reveals that some of the differences from the MD4 algorithm are similar in purpose to the improvements added to the MD5 algorithm (although different in nature).
The HAVAL algorithm is a modification of the MD5 algorithm, developed by Yuliang Zheng, Josef Pieprzyk, and Jennifer Seberry. It can be modified to produce output hash values of various lengths, from 92 bits to 256. It also has an adjustable number of "rounds" (application of the internal algorithm). The result is that HAVAL can be made to run faster than MD5, although there may be some corresponding decrease in the strength of the output. Alternatively, HAVAL can be tuned to produce larger and potentially more secure hash codes.[22]
[22] You should note that merely having longer hash values does not necessarily make a message digest algorithm more secure.
SNEFRU was designed by Ralph Merkle to produce either 128-bit or 256-bit hash codes. The algorithm can also be run with a variable number of "rounds" of the internal algorithm. However, analysis by several cryptographers has shown that SNEFRU has weaknesses that can be exploited, and that you can find arbitrary messages that hash to a given 128-bit value if the 4-round version is used. Dr. Merkle currently recommends that only 8-round SNEFRU be used, but this algorithm is significantly slower than the MD5 or HAVAL algorithms.
For the sake of completeness, we will describe two other types of "signature" functions.
A checksum is a function that is calculated over an input to determine if that input has been corrupted. Most often, checksums are used to verify that data communications over a modem or network link have not undergone "bit-rot," or random changes from noise. They may also be built into storage controllers to perform checks on data moved to and from media: if a checksum doesn't agree with the data, then there may be a problem on the disk or tape.
Checksums are usually calculated as simple linear or polynomial functions over their input, and result in small values (16 or 32 bits). CRC polynomials, or cyclic-redundancy checksums, are a particular form of checksum that are commonly used. The sum command in UNIX will generate a CRC checksum, although there appear to be at least three major versions of sum available on modern UNIX systems, and they do not generate the same values!
Checksums are easy to calculate, and are simple to fool. You can alter a file in such a way that it has the same simple checksum as before the alteration. In fact, many "hacker toolkits" circulating in the hacker underground have tools to recreate sum output for system commands after they have been modified! Thus, checksums should never be used as a verification against malicious tampering.
Message Authentication Codes, or MACS [23] are basically message digests with a password thrown in. The intent is that the MAC cannot be recreated by someone with the same input unless that person also knows the secret key (password). These may or may not be safer than a simple message digest - depending on the algorithm used, the strength of the key, and the length of the output MAC.
[23] Not to be confused with the Mandatory Access Control, whose acronym is also MAC.
One simple form of MAC appends the message to the key and then generates a message digest. Because the key is part of the input, it alters the message digest in a way that can be recreated. Because two keys will generate very different output for the same data input, we achieve our goal of a password-dependent MAC.
A second form of MAC uses some form of stream encryption method, such as RC4 or DES in CFB mode. The key in this case is the encryption password, and the MAC is the last block of bits from the encryption algorithm. As the encryption output depends on all the bits of input and the secret password, the last block of the output will be different for a different input or a different password. However, if the encryption block size is small (e.g., 64 bits), the MAC may be more susceptible to brute-force guessing attacks than a larger message digest value would be.
A public key digital signature may be thought of as a MAC, too, as it depends on the message digest output and the secret key. A change in either will result in a change in the overall value of the function.