Cryptography in the ZIP2 System

This describes the use of cryptography in the ZIP2 file format. Unlike the other documents in this set, it is designed to be more self-contained, to facilitate peer-review from those interested in cryptopgraphy and security, without the need to be familiar with all the concepts in this file format specification. So, necessary details about other parts of the system are summarized here in-line.

Table of Contents

Requirements

Knowing nothing else about what the ZIP2 program does or what the file format looks like, the issues should be understood (indeed, should be driven by) a set of requirements.

In addition, it will be useful to understand a few things about the file format.

The ZIP2 file is written in a series of chunks. Each chunk is written or read independantly, and updates to the file format specification is done by defining new chunk types. Each chunk is defined by its chunk ID, not by any specific offset within the physical file. Chunks may be re-arranged or moved freely, without updating their contents or any cross-references.

Encrypting affects the payload of a chunk, without regard to what kind of information that chunk contains—strored file data, or any kind of meta-data about the stored file or the archive itself.

The meta-data for the chunk, including its type and length, is not encrypted.

The final bits stored in a ZIP2 file’s payload are determined by some original content to be stored and a series of transformations. The transformations can include compression, encryption, and others. A transformation, by definition, is a reversable process that accepts a stream of input and produces a stream of output. The transformation algorithm has access to the chunk’s header, also.

The specification of cryptographic algorithms can focus on a transformation that only does this encryption step. Any other preparation or conditioning of the data can be described as a transformation that preceeds it. So we meerly need to concern ourselves with meeting preconditions on the data (e.g. a common precondition is that the length is a multiple of the block size).

Finally, just what do we want to accomplish with these security features? It is important to understand what problem is being solved.

CrypHash — Cryptographic Hash Function

Updated 2 July 2003

Everything in this specification that needs a cryptographic hashing function refers to this definition. It’s refered to as CrypHash so as not to be confused with other meanings of “hash”.

The cryptographic hashing function used by ZIP2 is SHAd-256 as defined in 6.4.2 of Practical Cryptography.

SHAd-256(m) is defined as SHA-256(SHA-256(m)). That is, hash the result of the hash again. SHA-256 is defined by NIST in draft FIPS PUB 182-2.

After reading Ferguson and Schneier, I hope/expect that something better will come along eventually. So, this specification takes care to specify the cryptographic hash function only in this section, and all uses of such a function refer to this section. That is, which function is used can be globally changed in this specification simply by redefining CrypHash without editing any other text.

In case future versions of this specification want to make such a change, and preserve interoperability and backward compatibility, a META-b(70) chunk in the file specifies globally (for the entire file) which hash function is used. Initially, only one is defined.

Block Cipher

Updated 14 July 2003

This describes the usage of a 128-bit block cipher in general, and should be applicable to any such cipher and key size. Extensions to specify additional ciphers, as well as future enhancents to this specification, can all use this section if the only change is to the underlying block cipher choice.

The block cipher is used in Counter mode. A keystream is generated as long as necessary by encrypting concecutive 128-bit integers, described below. If the message is not an exact multiple of the block size, the excess key bytes are discarded. The ciphertext is the XOR of the keystream with the plaintext.

The initial value of the counter, used to generate the first block of the keystream, is computed as CrypHash (hash-id || chunk ID || length || alg-id || 0x00), taking the first 12 bytes, and then appending 4 bytes of zeros; where Chunk ID is the same sequence of bytes that would be used to identify this Chunk in the TOCN, and length is a uintV containing the length in bytes of the plaintext, alg-id is different for each underlying block cipher, 0x00 is a zero byte and || is concatenation.

To increment the counter, treat it as a 128-bit integer in big-endian format.

Note that the Chunk ID includes an indication as to which encryption algorithm and parameters are being used, so this cannot be altered by an attacker to force a different decoding.

The hash-id is there to be changed when the definition of CrypHash changes. It matches the sequence of bytes used to specify CrypHash in the META-b(70) chunk (it is currently the single byte 0x00).

The CrypHash algorithm is used even though only 96 bits are needed, rather than using a shorter hash algorithm, because this is the only hash used throughout this specification. This prevents the need to include multiple algorithms in the code.

The alg-id is arguably not necessary, since the Chunk ID encodes this information indirectly, too. It’s included for good measure, and to assure that different block ciphers will get different initialization vectors.

The 0x00 byte identifies this specific usage of the initialization vector. If future specifications provide additional things, some of which require a similiar hash of the Chunk ID, this byte will not be zero for other uses. Each use will be guaranteed to give different hash results.

The last 32 bits are zeroed to prevent overlaps in generated counter values, when different chunks are encrypted using the same key and none of the chunks is longer than 232 blocks. If a Chunk is longer than 32 gigabytes, adding the overhead of an intermediate session key is not a big deal!

The ciphertext is not larger than the plaintext in keeping with requirement R1. The meta-data of the plaintext, though not encrypted, is linked with the encryption to prevent an attacker from changing the meaning of the ciphertext.

Rijndael (AES)

Updated 5 September 2003

Rijndael-256 is used as described above with an alg-id of the byte 66 (hex 42).

The derived key is formed with a parameter of “Rijndael256”.

Unique Initialization Vector

Updated 6 September 2003

The block cipher specification describes how the initial value for the counter (for encrypting in CTR mode) is generated from the Chunk ID. To prevent generating the same keystream, it is necessary for this Initialization Vector to be unique for a given derived key. This section discusses this issue.

Split Archives

Within a single ZIP2 file, every Chunk has a unique Chunk ID. So, in the simplest application, we will always get a different initialization vector.

Within a multi-part archive (a ZIP2 archive spanning multiple physical files), some Chunk IDs may be unique to a file but not unique to the file set. Specifically, there are two broad catigories of Chunks—those that are numbered with unique IDs; and those that appear only once in a file. The latter kind may appear once in each file, thus the duplication of the Chunk ID.

A general mechanism is the explicit introduction of stated key material. A Chunk may be encrypted using key B, where the definition of key B is itself encrypted using key A. This allows a intermediate-key mechanism that can be used anywhere a key is used. Using a different intermediate key in each portion file will avoid the issue of repeated Chunk IDs. Potentially, every Chunk can have a different intermediate key, completely avoiding any issues concerning the uniqueness of initialization vectors.

However, the archive is increased by the size of the key. A generational key mechanism (below) is provided because it is much shorter—just a (possibly) one-byte parameter.

Updated Files

A related issue concerns the updating of files. Although it’s true that a Chunk ID can’t be repeated within an archive, what about different versions of the same archive?

For example, a chunk with id DATA#42 may be used to store the contents of file foo.html. The archive is distributed. Later, the content of foo.html is changed, and the archive updated to match. A new copy is distributed, with a different chunk called DATA#42. Someone who intercepted both the distributions will have two chunks with the same ID, taken from two different ZIP2 archives. These two are special in that the other content has not been completely re-encoded, so the various keys are still the same, and the new version will be using the same derived key as the old.

In the case of DATA Chunks, it would be a simple matter to never re-use the instance number when updating a file. But other archive content may be encrypted too, such as the directory listing. Such things may have fixed or well-known instance numbers.

It would be safest to simply re-encode the entire archive for the revised version. Any intermediate keys would be changed. If a key were generated directly from a password, introduce an intermediate key so it can be changed without changing the password.

If this is not desirable (say, you are only sending a delta of the changes, or changing a subset of a complete set of media), then the same principle can be used but only on the updated chunks: An updated chunk uses a different intermediate key than the old version, but unchanged chunks continue using the old version.

Generational Keys

To facilitate the use of more intermediate keys,a generational key capability is provided. This is a key definition that takes one key for input along with a generation number and produces another key as output. This would increase the size of the archive only by the size of the generational-key definition, which is a chunk that contains the instance number of the input key and the generation number (2 bytes plus chunk definition overhead of 5 bytes, total 7 bytes). This is much shorter than storing 16 bytes of key material (with the same 5-byte overhead).

The generational key algorithm is as follows: CrypHash ("generational" || generation || basekey). See the details on the KEYD page.

Recommendations

When updating an archive, and writing a new chunk with a same Chunk ID as an existing encrypted chunk, increment the generation number. If the chunk to be updated is encrypted with a generational key, then form a new generational key referencing the same base and the next generation number. If this new generational key is already in the archive, use that instance number, otherwise add it to the archive and note the assigned instance number. The instance number of the new generational key is used for the encryption on the updated chunk.

If updating an existing encrypted chunk that is not using a generational key but a key of some other type, create a new generational key referencing the original key as the base and the number 1 for the generation.

When adding a chunk with an ID that is not present now but you suspect was present at one time in the past, the issues are the same as with an update. Use the largest generation number present in the archive as the “worst case” starting point, or introduce a new derived key.

For multi-part archives, always introduce an intermediate key (either generational or derived) for each portion file. All chunks that may be repeated in different portion files (e.g. those that have instance numbers of zero) should use the per-portion derived key rather than the base key.

When updating an archive, if the implementation can’t be sure of the strategy used before, it can avoid any problems by always introducing a new derived key with a random value.

Message Authentication

A fundimental concept of the ZIP2 file structure is that information is stored in separate Chunks, not in a single record. In particular, information about a file stored in the archive (including its content) are in various Chunks related by a common identifier, just as a relationship is stored in a relational database by having multiple records share a value in a key field.

So just encrypting the file content Chunk leaves other information about the file unprotected. In fact, any Chunk may be encrypted, so some information about the file can be exposed and other information kept secret, as fits the need. But what about deleting chunks? The very structure of the file promotes separate Chunks to describe a file with no explicit reference from one to another; they are related simply by being present and sharing a common key (the Instance Number).

An active attacker could delete Chunks, and thus remove possibly vital meta-information about the file such as security attributes. Or, an attacker could replace such a Chunk with one that’s not encrypted at all, to substitute his own information.

To prevent this, all the related Chunks need to be tied together so that altering or deleting one will cause message authentication to fail.

This is done using a KHSH Chunk. To summarize, all the various chunks that contribute to the extracted file’s meaning are listed, along with a keyed hash of their combined content. It requires the same key to verify the keyed hash (or to create it) as it does to decrypt those chunks.

When a user extracts an encrypted file from the ZIP2 archive, every chunk that’s referenced in order to perform that extraction is noted, and must be authenticated using the same origin key (all of them if multiple origin keys are required). If a Chunk listed in the KHSH manifest is missing or doesn’t check, then it is considered a security error.

Keys

The origin key is the key material provided by the user to extract a file or otherwise read protected content in a ZIP2 archive. This may be a password, a private asymetric key, a hardware token, or whatever.

The origin key might not be directly used to encrypt Chunks. The origin key may unlock another key, or lead to a chain of such keys, until finally the base key is what is supplied to the algorithms listed in this document.

This allows multiple passwords to access overlapping content without unnecessary redundancy. For example, three different origin keys may unlock base key 1, and one of those origin keys (but not the others) also unlocks base key 2. A Chunk encrypted using base key 1 can be read by any of three users entrusted with their individual passwords, but only one of those users can decrypt a Chunk encrypted using base key 2.

The same mechanism extends to facilitating a hybrid system, where the origin key is an asymetric “private key” and the base key is a symetric key.

The same base key may be provided to several different steps in the overall process. For example, content will be encrypted using AES and also authenticated using a keyed hash. The first thing that these algorithms do with the provided key material (the base key) is generate a derived key to ensure that different aspects of the system don’t use the same actual key.

A derived key is created by applying a CrypHash hash to a parameter followed by the base key. For example, a derived key “foo” is computed as CrypHash("foo" || basekey).

Asymetric (public/private) Keys

Chunks may be encrypted using a recipient’s public key, to be decrypted using the corresponding private key. Such keys are defined using Elliptic Curve Cryptography (ECC) systems, because such keys are short.

According to our fundimental requirement (R1), RSA keys are extremely undesirable. The message size is the same as the key size, and this will be many times longer than a symetric key with the same security. Since the only thing encoded using asymetric techniques is the symetric key, this will always waste a great deal of space.

ECC using GF(2n) has much smaller key size and thus message size than RSA, and the margin is growing larger: With ECC we need to add two bits to double the complexity, to keep up with Moore’s Law. With RSA, it takes 5 bits or so to double the complexity, and we must also keep up with improvments in factoring methods at about the same pace. So, with RSA we need to add 10 bits per year to maintain the same level of security, while with ECC we need to add 2.

Details of the group parameters TBA.

References

Applied Cryptography by Bruce Schneier, ISBN 0-471-11709-9

Practical Cryptography by Niels Ferguson and Bruce Schneier, ISBN 0-471-22357-3


Valid HTML 4.01!

Page content copyright 2003 by John M. Dlugosz. Home:http://www.dlugosz.com, email:mailto:john@dlugosz.com