ZIP2 File Structure

Data Type Definitions

lpstring

All text fields are encoded as Unicode in UTF-8 format, length-prefixed with a uintV specifying the number of bytes (not characters) that follow.

For example, “Hello” would be encoded as hex 05 48 65 6C 6C 6F (the leading 05 is the length), “Długosz” (see the footnote) as hex 08 44 C5 82 75 67 6F 73 7A, and the complete text of Moby Dick as hex D2 54 F1 43 41 6C 6C 20 6D 65 20 49 73 68 6D 61 65 6C [1,201,374 bytes omitted for brevity] 61 67 6F 2E, where the three bytes D2 54 F1 are the length.
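As a rough illustration, the following Python sketch builds an lpstring. The uintV layout used for the length prefix is not defined in this section, so the helper `encode_uintv` (a hypothetical name) assumes the 1-, 2- and 3-byte forms that can be inferred from the hex examples in this document, and handles nothing longer.

```python
def encode_uintv(value: int) -> bytes:
    """Assumed layout, inferred from this document's hex examples only:
    0xxxxxxx (7 bits), 10xxxxxx + 1 byte (14 bits), 110xxxxx + 2 bytes (21 bits)."""
    if value < 0:
        raise ValueError("uintV is unsigned")
    if value < 0x80:
        return bytes([value])
    if value < 0x4000:
        return bytes([0x80 | (value >> 8), value & 0xFF])
    if value < 0x200000:
        return bytes([0xC0 | (value >> 16), (value >> 8) & 0xFF, value & 0xFF])
    raise NotImplementedError("longer uintV forms are outside this sketch")


def encode_lpstring(text: str) -> bytes:
    """Prefix the UTF-8 bytes of `text` with their byte count (not character count)."""
    data = text.encode("utf-8")
    return encode_uintv(len(data)) + data


print(encode_lpstring("Hello").hex(" ").upper())  # 05 48 65 6C 6C 6F
```

The length is taken after UTF-8 encoding, so a string with non-ASCII characters gets a prefix larger than its character count.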

offset

A value that is an offset specifies the byte offset in an archive. In the normal case where the archive is simply saved as a file (as opposed to being embedded in a larger data stream), this is the file seek position.

If the archive is embedded in a larger data stream, the offset is still the same numeric value, since it is relative to the archive’s position in this larger file. That is, the first byte of the signature is defined as being offset zero.

sintV

A variable-length signed value. It may represent an integer of any length, and require from 1 to an arbitrary number of bytes. The encoding is detailed in a separate document.

uintV

A variable-length unsigned value. It may represent an integer of any length, and require from 1 to an arbitrary number of bytes. The encoding is detailed in a separate document.
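The exact uintV encoding is defined elsewhere, but the hex examples in this document (2A for 42, 80 C8 for 200, DE 84 80 for 2,000,000) are consistent with a big-endian value whose leading bits give the byte count. The sketch below decodes only those three inferred forms; the function name `decode_uintv` and the treatment of longer forms are assumptions, not part of the specification text.

```python
def decode_uintv(buf: bytes, pos: int = 0) -> tuple:
    """Return (value, next_pos). Covers only the forms inferred from the
    examples in this document; longer forms raise NotImplementedError."""
    first = buf[pos]
    if first < 0x80:                      # 0xxxxxxx: 7-bit value, 1 byte
        return first, pos + 1
    if first < 0xC0:                      # 10xxxxxx + 1 byte: 14-bit value
        return ((first & 0x3F) << 8) | buf[pos + 1], pos + 2
    if first < 0xE0:                      # 110xxxxx + 2 bytes: 21-bit value
        value = ((first & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2]
        return value, pos + 3
    raise NotImplementedError("uintV forms longer than 3 bytes are not covered")


print(decode_uintv(bytes.fromhex("DE 84 80")))  # (2000000, 3)
```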

Header

The header is a short segment that allows a file to be easily identified as a ZIP2 file, locates the optional index, and provides for embedding a ZIP2 data stream in a larger file.

Signature

4 bytes containing “ZIP2”.

Usage Text

At most 128 bytes (and as few as 0 bytes) of UTF-8 text, intended to explain what to do with this file type, not to be a comment about the contents of this specific archive. For example, “See http://www.dlugosz.com/ZIP2/ for information on the ZIP2 file format.”

Separator

3 bytes containing hex 1A 04 FE. The first two bytes terminate text on most systems (^Z or ^D), so the Signature and Usage Text can be typed (or cat'ed) as a text file. The third byte makes sure the high bits have not been lost, and is also a byte that is never present in a UTF-8 string. This means that the end of the usage string can be located unambiguously by scanning for a hex FE byte.

Offsets

Two uintV values giving the locations of other interesting things in the file.

Offset to ENDX

The first uintV is the offset of the last item in the archive, where an item is either a chunk or a fill byte. Anything after that item shall be ignored, so this facilitates embedding this archive in a larger data stream.

This value may be zero (which is efficiently encoded as a single zero byte) indicating that it is not being used.

This points to the beginning of the last chunk, rather than to the end of the whole archive (that is, its size), to make it easy to read that last chunk, which might be an ENDX chunk.

Offset to TOCN

This optional value provides a known location for the TOCN chunk, because having to search for it would defeat its purpose, and because the chunk can be moved and relocated during updates.

This uintV may be zero to indicate that it is not being used. Otherwise, it must be the offset of the TOCN of this file.

Checksum

A one-byte hash of everything from the first byte of the signature through the byte that precedes this one. See checksum for a complete definition.
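The header layout described above can be read back with a short sketch like the following. It assumes the uintV forms inferred from the examples in this document (1 to 3 bytes), uses hypothetical names (`parse_header`, the dictionary keys), and returns the checksum byte without verifying it, since the checksum algorithm is defined in a separate document.

```python
def decode_uintv(buf: bytes, pos: int) -> tuple:
    """Assumed uintV layout (inferred from examples): returns (value, next_pos)."""
    first = buf[pos]
    if first < 0x80:
        return first, pos + 1
    if first < 0xC0:
        return ((first & 0x3F) << 8) | buf[pos + 1], pos + 2
    if first < 0xE0:
        return ((first & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2], pos + 3
    raise NotImplementedError("longer uintV forms are not covered by this sketch")


def parse_header(buf: bytes) -> dict:
    if buf[:4] != b"ZIP2":
        raise ValueError("missing ZIP2 signature")
    fe = buf.index(0xFE, 4)            # FE cannot occur inside UTF-8 text
    if fe < 6 or buf[fe - 2:fe] != b"\x1A\x04":
        raise ValueError("malformed separator")
    usage = buf[4:fe - 2]
    if len(usage) > 128:
        raise ValueError("usage text longer than 128 bytes")
    endx, pos = decode_uintv(buf, fe + 1)
    tocn, pos = decode_uintv(buf, pos)
    return {
        "usage": usage.decode("utf-8"),
        "endx_offset": endx,           # 0 means "not used"
        "tocn_offset": tocn,           # 0 means "not used"
        "checksum": buf[pos],          # not verified here
        "header_size": pos + 1,
    }


smallest = bytes.fromhex("5A 49 50 32 1A 04 FE 00 00 C2")
print(parse_header(smallest)["header_size"])  # 10
```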

In-Place Updates

An archive may be updated after it is initially written. If you update the Offset values in the header, the size may change due to the variable-length encoding. Other than rewriting the entire file to move things down, there are three ways to deal with this efficiently.

You can encode the offsets using a non-minimal length encoding. That is, suppose the initial value fits in a 2-byte form. You can choose to encode the value in a 4- or 5-byte form, so that larger numbers can be rewritten in the same location.

You can include several fill bytes after the header, before the first chunk. This allows for planned growth of the header.

You can move the first chunk (only) out of the way, or split it. That is, you can make room without having to rewrite the entire file.
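The first option, a non-minimal encoding, might be sketched as follows. The text mentions 4- and 5-byte forms, but their layout is not given here, so this sketch is limited to the 1- to 3-byte forms inferred from the examples in this document; `encode_uintv_padded` is a hypothetical helper name.

```python
def encode_uintv_padded(value: int, width: int) -> bytes:
    """Encode `value` in exactly `width` bytes (1 to 3), padding with leading
    zero bits. Layout is an assumption inferred from this document's examples."""
    if not 1 <= width <= 3:
        raise NotImplementedError("only 1- to 3-byte forms are sketched here")
    if not 0 <= value < 1 << (7 * width):
        raise ValueError("value does not fit in the requested width")
    prefix = (0x00, 0x80, 0xC0)[width - 1]
    raw = value.to_bytes(width, "big")
    return bytes([prefix | raw[0]]) + raw[1:]


# Reserve a 3-byte slot for an offset that currently needs only 2 bytes,
# then overwrite it later with a larger value without moving anything.
slot = encode_uintv_padded(200, 3)
print(slot.hex(" ").upper())                              # C0 00 C8
print(encode_uintv_padded(70000, 3).hex(" ").upper())     # C1 11 70
```

Because both encodings occupy the same number of bytes, the header size and every later offset stay unchanged when the value is rewritten.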

Header Example 1

0000: 5A 49 50 32   ; "ZIP2"
0004: 53 65 65 20 68 74 74 70  ; See http
000C: 3A 2F 2F 77 77 77 2E 64  ; ://www.d
0014: 6C 75 67 6F 73 7A 2E 63  ; lugosz.c
001C: 6F 6D 2F 5A 49 50 32 2F  ; om/ZIP2/
0024: 20 66 6F 72 20 69 6E 66  ;  for inf
002C: 6F 72 6D 61 74 69 6F 6E  ; ormation
0034: 20 6F 6E 20 74 68 65 20  ;  on the
003C: 5A 49 50 32 20 66 69 6C  ; ZIP2 fil
0044: 65 20 66 6F 72 6D 61 74  ; e format
004C: 2E                       ; .
004D: 1A 04 FE  ; separator
0050: DE 84 80  ; offset to ENDX = 2,000,000
0053: 80 C8  ; offset to TOCN = 200
0055: 4E  ; checksum

Header Example 2

This is the smallest possible header, 10 bytes. The usage string is empty, and all optional features are unused.

0000: 5A 49 50 32   ; "ZIP2"
0004: 1A 04 FE  ; separator
0007: 00  ; offset to ENDX
0008: 00  ; offset to TOCN
0009: C2  ; checksum

See the code samples to generate these headers.

Fill Bytes

To facilitate deleting or resizing chunks in-place, fill bytes may be used between chunks. A fill byte has the value of 0.

For regions larger than 4 bytes, use an XXXX chunk to mark the unused area in the file. The fill bytes are designed for gaps smaller than the minimum size of a regular chunk (which therefore can’t hold an XXXX chunk).

Since the chunk begins with a length, and a single zero byte is read as a length of zero, this elegantly fits the definition of a degenerate chunk. Code that walks the chunks in a file will easily handle the presence of fill bytes (see Listing 1 ????).
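A chunk walker along these lines might look like the following sketch. It relies on the uintV forms inferred from the examples in this document (1 to 3 bytes), and the names `walk_chunks` and `decode_uintv` are illustrative only.

```python
def decode_uintv(buf: bytes, pos: int) -> tuple:
    """Assumed uintV layout (inferred from examples): returns (value, next_pos)."""
    first = buf[pos]
    if first < 0x80:
        return first, pos + 1
    if first < 0xC0:
        return ((first & 0x3F) << 8) | buf[pos + 1], pos + 2
    if first < 0xE0:
        return ((first & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2], pos + 3
    raise NotImplementedError("longer uintV forms are not covered by this sketch")


def walk_chunks(buf: bytes, start: int = 0, end: int = None):
    """Yield (offset, body) for each chunk between start and end.
    A fill byte reads as a Size of zero and is simply stepped over."""
    pos = start
    end = len(buf) if end is None else end
    while pos < end:
        offset = pos
        size, pos = decode_uintv(buf, pos)
        if size == 0:
            continue                     # fill byte: degenerate, empty chunk
        yield offset, buf[pos:pos + size]
        pos += size


# Two chunks from the examples in this document with two fill bytes between them.
data = bytes.fromhex("04 00 01 2A 6F  00 00  05 40 01 2B 0D 10")
for offset, body in walk_chunks(data):
    print(offset, body.hex(" ").upper())
```

No special case is needed for fill bytes beyond skipping the zero-length result, which is the point the paragraph above makes.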

Chunks

After the non-chunk header, the rest of the file consists of chunks and possibly small numbers of fill bytes between chunks.

Each chunk contains a size, a type and instance number, various optional fields, content, and a checksum.

The chunk fields that are present are in this order: Size, Type (includes flags), Subtype, Instance Number, Instance Range, Part Number, Instance Sizes, Experimental ID, Payload Specification, Payload, and finally a Checksum.

Size

The first thing in the chunk is a uintV stating the number of bytes that follow in this chunk (not counting the bytes making up this field).

Yes, that’s right, the size comes before the type. This is logical, elegant, and efficient. Code can read or traverse chunks by reading a single uintV first, then reading (or skipping) the rest.

Type

The type is a 2-byte tag that identifies the semantic meaning and certain properties of the chunk. For example, a DATA chunk holds the actual compressed file data, and an INDX chunk holds file names. When encountering a chunk, this field tells you that it is a DATA or INDX or whatnot.

The type field records several flags in the high-order bits, and an enumeration for the different named types in the low-order bits.

For more on each chunk type, see the following detail documents:

name            code (Subtype)   description
8D3N            1 (69, 70)       “8.3”-style short name aliases
AGNT            6                identifies creator and user-defined chunks
CMNT            7 (64, 65)       comments and icons for the corresponding files
COMP            8 (64)           compression algorithm
CRYP            8 (65)           encryption algorithm
DATA, DATA-nd   1                stored file contents
DEAT            17               used by transacted updates
DICT            5 (65)           data to seed a dictionary-based compressor
ENDX            4                end-of-archive marker
FATR            1 (65)           FAT file system attributes
FORM            8                payload specification advanced transforms
HASH            11 (64)          error detection for chunks or files
ICON            13               small bitmap to display
IDEN            14               identifies this file or part
INDX            2                directory of files stored
KEYD            16               Key definition
KHSH            11 (66, 67, 68)  authentication for chunks or files
JRNL            31 (64)          used by transacted updates
MACF            1 (66)           Macintosh Finder info
META            10
ROOT            15               supplement to INDX, file source or destination
RTFM            12               comment for the entire archive
SECW            1 (67)           Win32 security information
SIGN            11 (65)          digital signatures for chunks or files
SIZE            7 (67)           file size information
TIME            1 (64)           advanced timestamp information for files
TOCN            3                chunk concordance
TREE            5 (64)           data to seed a frequency-based compressor
UNIX            1 (68)           UNIX file system attributes
UPDP            9                updatable data parameters
XXXX            0                unused areas in the file

Flags

The Flags are part of the Type field. Some of the flags indicate structural details of the chunk that follows. Some indicate options, and others change the semantics of the base type.

The flags all have 1-letter abbreviations, and relevant flags are noted as lowercase letters following the base name. For example, a DATA-rp chunk indicates a DATA chunk with the r (range) and p (part) flags set. The same chunk might also have the y flag set (which is typical for DATA chunks) but it wasn't mentioned because it wasn't relevant to the discussion. Of course, sometimes it's necessary to note the complete value of the Type field, and this is done by noting all the flags.

[rpcdnxya|UUb#####]

The last 5 bits of the second byte, shown above as #####, hold the Type Number. That is, different values here indicate whether it is DATA or INDX or ENDX or whatever (see table above).

The 2-bit UU field indicates:
00 neither instance-indexed nor user-defined.
01 instance indexed (the i flag).
10 user-defined.
11 totally different interpretation of the Type field than what is documented here. A hedge for future expansion or meaning; not defined now.

the i flag

Indicates that the Instance Sizes field is present. This is set on most chunk types that can allow more than one instance, and is cleared on those that don’t need it because all the instances are the same size or contain internal indications of their lengths.

Note that you can consistently treat the i and u flags as ordinary bit position flags, just like the others, provided you never try to set both at the same time.

Also, it is illegal to set the i flag without also setting the r flag. Doing so is reserved for future meanings.

the u flag

When set, this is a “user defined” chunk rather than one of the standard types like DATA, INDX, etc. Details TBA.

the p flag

Indicates that the Part Number field is present, and that a chunk has been divided into multiple parts. Set on any chunk that is so divided.

the c flag

When the p flag is set, this flag means “to be continued” and is set on all parts except the last one. This flag may only be set if the p flag is also set; other bit combinations are reserved for future use.

the d flag

Indicates “redundant”, or that the information in this chunk is not strictly necessary but can be computed using other information or found in other files in a multi-part archive.

This flag changes the semantic meaning and content of the chunk’s definition, and will always be documented separately as a different type.

the n flag

Indicates “pointer”, or that the chunk contains a reference to the real information, located elsewhere. Often used with the d flag.

This flag changes the semantic meaning and content of the chunk’s definition, and will always be documented separately as a different type.

the x flag

Indicates that the Experimental ID field is present. Will be set only during development of new chunk types or user-defined chunks.

the y flag

Indicates that the Payload Specification field is present, and that the payload is to be processed as indicated. Any chunk can have this flag set to use this feature, unless otherwise noted.

the a flag

Indicates that the chunk’s instance number matches the same instance number of a DATA chunk, and this information supplements the information for that same file. Some chunk types have this flag always specified, to indicate that they are used this way.

the r flag

Indicates that the Instance Range field is present, and the chunk contains several instances concatenated together. This flag may be set on any chunk that uses this feature. All chunk types may do this, unless otherwise noted.

the b flag

Indicates that the Subtype field is present.
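The bit layout [rpcdnxya|UUb#####] and the flag rules above can be sketched as a small decoder. The function name `decode_type` and the choice to raise an error on reserved combinations are assumptions made for illustration; a real reader might prefer to skip such chunks.

```python
FIRST_BYTE_FLAGS = "rpcdnxya"   # most significant bit first


def decode_type(b0: int, b1: int) -> tuple:
    """Split the 2-byte Type field into (set of flag letters, Type Number)."""
    flags = {letter for bit, letter in enumerate(FIRST_BYTE_FLAGS)
             if b0 & (0x80 >> bit)}
    uu = b1 >> 6
    if uu == 0b11:
        raise ValueError("UU = 11 is reserved for future expansion")
    if uu == 0b01:
        flags.add("i")          # instance-indexed
    if uu == 0b10:
        flags.add("u")          # user-defined
    if b1 & 0x20:
        flags.add("b")          # Subtype field present
    if "c" in flags and "p" not in flags:
        raise ValueError("c without p is reserved")
    if "i" in flags and "r" not in flags:
        raise ValueError("i without r is reserved")
    return flags, b1 & 0x1F


print(decode_type(0x98, 0x02))  # flags r, d, n (set order may vary) and Type Number 2
```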

chunk field Subtype

The Subtype is a uintV value. Subtypes are used to extend the range of Type numbers, logically organize related types, and provide a single value that applies to all instances of the Payload.

Sometimes a Type/Subtype constant is used for a specific purpose. In that case, the table above shows a name for a specific type and subtype combined. Other times the name applies to the Type value only, and the Subtype is considered a parameter of that type.

chunk field Instance Number

The Instance Number is a uintV that identifies which occurrence of a particular chunk type this is.

For example, one file’s data may be stored in chunk DATA#1 (the #n indicates the instance number), and another file will be stored in chunk DATA#2. Each file is stored in a different DATA chunk, and each has its own number.

The chunk instance number is unique to the type, but different types may have the same number without conflict. Chunks are referenced by number, not by position in the file (or even which file they are in, in spanning archives), so this number is very important. The number 0 is used for chunk types that only occur once in a file.

This field is always present, on every chunk.

chunk field Instance Range

The Instance Range is a uintV specifying the number of instances recorded in this chunk.

Suppose you had chunks DATA#10, DATA#11, and so on up through DATA#42. Instead of 33 individual chunks, you could store one chunk, called DATA#10-42. This can save the overhead of the chunk definition, and allow “solid packing” of small files.

The file actually stores the starting number and length, rather than the starting number and ending number, because it may encode shorter. For example, in DATA#80-180 the Instance Number will be 80, and the Instance Range will be 101 (and 101 encodes as 1 byte, but 180 encodes as 2 bytes).

chunk field Part Number

This field is a uintV specifying the part number.

A chunk’s content may be arbitrarily chopped up into smaller chunks. When this happens, each is given the same instance number and consecutive part numbers, starting with 0. For example, DATA chunk #3 may contain compressed content of a file, and it is chopped up into DATA#3p0 and DATA#3p1 (and each may be in a different file, for spanning archives).

chunk field Instance Sizes

If the chunk contains more than one instance, there needs to be some way of knowing where one ends and the next one begins. Sometimes it is apparent because they are all the same size or “fully structured”, meaning the content can be read and the reader knows when to stop because of the nature of the content. But sometimes the length must be known before reading.

When necessary (indicated by the i flag), this index is present for this purpose. It is a list of (Instance Range - 1) sintV's specifying the delta size for each item.

This is explained in depth on its own page (to be written).

chunk field Experimental ID

This uintV is an aid for developers. In order to allow for continued development of ZIP2 code, without confusing the population of ZIP2 files with incompatibilities during development or after a new release, there is a standard way to mark a chunk as experimental.

During development of a new chunk type, set the x flag and supply a number in this field to indicate which revision of your definition is here. When the chunk is ratified, remove this. This is used for development only, and is not a general version tracking mechanism for allowing chunk definitions to change! Once the chunk is ratified, it may not be changed in a way that is not strictly backward compatible.

chunk field Payload Specification

This field consists of 2 uintV's specifying the Decompression and Decryption, respectively. If absent, the payload data is taken as-is. If present, the Payload content is run through the indicated decryption and decompression to produce the data stored in this chunk. The decryption value indicates an instance of a CRYP chunk. That is performed, and then the result is transformed using the instance of the COMP chunk indicated by the decompression value. Either value may be 0 to indicate “none”.

For more advanced transformations, the Payload Specification can point to a FORM chunk, which can have any number of transformations of more than these two types, and in any order.

To do this, use a value of 0x7F for the first uintV, and the Instance Number of a FORM chunk as the second.
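To make the ordering concrete, here is a small sketch that turns the two Payload Specification values into the list of steps a reader would apply. The function name and the step strings are hypothetical; the sketch only restates the rules above and does not perform any decryption or decompression.

```python
FORM_MARKER = 0x7F   # first value that redirects to a FORM chunk


def payload_steps(first: int, second: int) -> list:
    """Return the read-side steps implied by a Payload Specification,
    in the order they are applied (decrypt first, then decompress)."""
    if first == FORM_MARKER:
        return [f"apply transforms from FORM#{second}"]
    steps = []
    if second:                       # second value: Decryption (CRYP instance)
        steps.append(f"decrypt using CRYP#{second}")
    if first:                        # first value: Decompression (COMP instance)
        steps.append(f"decompress using COMP#{first}")
    return steps


print(payload_steps(3, 2))     # ['decrypt using CRYP#2', 'decompress using COMP#3']
print(payload_steps(0x7F, 5))  # ['apply transforms from FORM#5']
```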

chunk Payload (the actual content)

What more can be said?

chunk field Checksum

A one-byte hash of everything from the first byte of the Size field through the byte that precedes this one. See checksum for a complete definition.

This field is always present, on every chunk.
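Putting the field order together, a reader for a single chunk might be sketched as below. Several things here are assumptions: the uintV layout is the 1- to 3-byte form inferred from the examples in this document, the names (`parse_chunk`, the dictionary keys) are invented for illustration, the checksum is returned but not verified, and chunks with the i flag are rejected because the sintV encoding of the Instance Sizes field is not described in this section.

```python
def decode_uintv(buf: bytes, pos: int) -> tuple:
    """Assumed uintV layout (inferred from examples): returns (value, next_pos)."""
    first = buf[pos]
    if first < 0x80:
        return first, pos + 1
    if first < 0xC0:
        return ((first & 0x3F) << 8) | buf[pos + 1], pos + 2
    if first < 0xE0:
        return ((first & 0x1F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2], pos + 3
    raise NotImplementedError("longer uintV forms are not covered by this sketch")


def parse_chunk(buf: bytes, pos: int = 0) -> dict:
    size, p = decode_uintv(buf, pos)
    if size == 0:
        raise ValueError("fill byte, not a chunk")
    end = p + size                                # one past the checksum byte
    b0, b1 = buf[p], buf[p + 1]
    p += 2
    flags = "".join(f for bit, f in enumerate("rpcdnxya") if b0 & (0x80 >> bit))
    chunk = {"type_number": b1 & 0x1F, "flags": flags}
    if b1 & 0x20:                                 # b flag
        chunk["subtype"], p = decode_uintv(buf, p)
    chunk["instance"], p = decode_uintv(buf, p)   # always present
    if "r" in flags:
        chunk["range"], p = decode_uintv(buf, p)
    if "p" in flags:
        chunk["part"], p = decode_uintv(buf, p)
    if (b1 >> 6) == 0b01:                         # i flag
        raise NotImplementedError("Instance Sizes (sintV list) not sketched")
    if "x" in flags:
        chunk["experimental_id"], p = decode_uintv(buf, p)
    if "y" in flags:
        decompression, p = decode_uintv(buf, p)
        decryption, p = decode_uintv(buf, p)
        chunk["payload_spec"] = (decompression, decryption)
    chunk["payload"] = buf[p:end - 1]             # whatever is left before the checksum
    chunk["checksum"] = buf[end - 1]              # not verified here
    return chunk


print(parse_chunk(bytes.fromhex("04 00 01 2A 6F")))
```

The payload is never given its own length: it is whatever remains between the last header field and the final checksum byte.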

Examples

Chunk Example: empty DATA chunk

Here is the smallest possible chunk, weighing in at 5 bytes. This is how an empty file would be saved.

0000: 04  ; length is 4 bytes, not counting the length field itself
0001: 00 01  ; Type is DATA, no flags.
0003: 2A  ; Instance #42
0004: 6F  ; checksum

Chunk Example: empty DATA-p chunk

Here is a more realistic example, though it is one byte longer. If a program were “streaming” output to a ZIP2 file, and didn't know how long the data was going to be until it encountered the end, it would buffer the data and emit one chunk at a time. Then, it would emit a final (possibly empty) part that is not marked “to be continued”, to terminate the multi-part series.

This is part 13 of DATA#43 (the fourteenth and last part, since parts are numbered from 0), which happens to be empty.

0000: 05  ; length is 5 more bytes
0001: 40 01  ; Type is DATA-p
0003: 2B  ; Instance #43
0004: 0D  ; Part 13.  This field is present because the p flag is set.
0005: 10  ; checksum

Chunk Example: INDX-nd

0000: 07  ; length is 7 more bytes
0001: 98 02  ; Type is INDX-rnd
0003: 01  ; Instance #1 through...
0004: 03  ; Range is 3, so this gives #1-3.
0005: 80 CB  ; Payload (2 bytes long)
0007: 98  ; checksum

The Payload is always the rest of the chunk, after taking out the various header fields and checksum. In this case, it is only two bytes. In case you are wondering, this chunk says that the real INDX instances 1, 2, and 3 can be found in file sequence number 203 of a multi-file archive.

If this handy record is present in every file of the set, then the program can prompt the user to put in disc number 203 rather than checking every single one of them when it wants to read the very important INDX#0. It is conceivable that a multi-file archive will put the complete index in the last file, but the user launches the program with the first disc loaded. 203 files seems a bit extreme, but the number was chosen so the Payload wouldn't be only one byte in the example.

Footnotes

Długosz

The second character, ł, is a Polish slashed lower-case L (Unicode U+0142). This is the two bytes hex C5 82 encoded in UTF-8. It sounds like an English “w”, by the way.


Page content copyright 2003 by John M. Dlugosz. Home:http://www.dlugosz.com, email:mailto:john@dlugosz.com