COMP Chunk

See more on Chunks in general.

This is a Compression Algorithm specification, used to configure parameters for a compression algorithm. COMP is a special case of FORM, subtype 64 (hex 40).

COMP Chunk structure

Flags

Flag                       Usage
a (correlated)             Cleared.
b (subtype)                Set.
r (range of instances)     Used in the general manner.
p (multi-part) & c         Used in the general manner.
y (payload specification)  Used in the general manner.
i (instance sizes)         Set when r is set, Clear otherwise.

Although the y (payload specification) flag is allowed, it may not refer to itself! Generally, compressing this chunk is pointless, so any use of the y flag will probably be for other purposes, such as data integrity of this critical information.

Subtype

Subtype is set to 64 (hex 40).

Instance Number

The Instance Number is matched against the compression value in any Payload Specification.

That is, when some chunk specifies a value for how its payload is compressed, that value refers to a COMP chunk of that number.

Reserved and Predefined Instance Numbers

Instance numbers from hex 40 through hex FFF (64 through 4095) are reserved. An archive shall not contain instances in this range.

Pre-defined COMP definitions
Instance  Meaning
0x40      Deflation with no dictionary
0x41      Burrows-Wheeler Transform
0x42      UTR-6 wrapped in Deflation
0x43      Deflation with C/C++ dictionary
0x43      Deflation with Perl 5 dictionary
0x44      Expandable dictionary of agent names, designed especially for AGNT chunks.
0x45      Deflation with a dictionary of common ACL structures, designed especially for SECW chunks.
0x46      Structured Delta with 1 uintV. Good for SIZE chunks.
0x47      Structured Delta with 3 uintVs. Good for TIME chunks.
...more to be added
0x7F      Use Extended Payload Specification

COMP Chunk Payload

The Payload of a COMP chunk contains parameters for compressing some other chunk’s payload.

The payload is a list of records.

Each record begins with a uintV indicating the decompression algorithm number. This is followed by any parameters needed to decompress the data. Which parameters are needed (if any) varies by algorithm.

If more than one record is present, each is performed in turn to decompress the data (see examples).
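For concreteness, here is a minimal sketch (in Python) of a driver that applies a parsed record list in turn. Only Deflation (0x100) is wired to a real codec here; fetch_dict is a hypothetical helper, not part of this spec, that returns the payload of the named DICT instance.

import zlib

def apply_records(records, data, fetch_dict):
    # `records` is the parsed (algorithm, parameter) list from the COMP
    # payload, in the order given; each stage decompresses the previous
    # stage's output.
    for algo, param in records:
        if algo == 0x100:                      # Deflation
            if param:                          # 0 means no dictionary
                d = zlib.decompressobj(zdict=fetch_dict(param))
            else:
                d = zlib.decompressobj()
            data = d.decompress(data) + d.flush()
        else:
            raise NotImplementedError("algorithm %#x not implemented" % algo)
    return data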

Compression Algorithm Identifiers

The algorithm identifiers are in the range hex 40 through hex FFF, for consistency with other pre-defined values in the ZIP2 specification. In the future there may be a chunk type for defining algorithms other than the built-in ones, so it makes sense to use the same numbering system. However, since there are several levels of indirection, the numbers don’t start at 65. Rather, they start at 256 (hex 100), to make it easier for people to keep them straight. This decision adds one byte to the total size of a chunk that defines a compression method, but that is considered insignificant to the overall archive.

Code    Algorithm                            Parameters
0x100   Deflation                            uintV specifying a DICT instance
0x101   UTR-6                                -none-
0x102   Huffman Encoding                     uintV specifying a TREE instance
0x103   Burrows-Wheeler Transform            -none-
0x104   Run-length, repeat, delta
0x105   Diff of another chunk
0x106   Extensible substitution dictionary   fatal/non-fatal mode

If a non-standard algorithm is added to a program, use numbers outside the reserved range. Thus far, there is no way to define or document the meaning of such an algorithm identifier within the archive containing its use, so its meaning will be implicit in the augmented code.

Algorithms may be registered by submitting documentation about them. A number will then be assigned to the algorithm. Even though it may be a private algorithm that is not generally supported, the number will be unique and unambiguous.

Deflation

The parameter data is a uintV specifying a DICT instance that will be used as a pre-load dictionary. Use a value of 0 for no dictionary.

The code for zlib (written by Jean-loup Gailly and Mark Adler) is available on a huge number of platforms. It’s well tested, well documented, and actively maintained. It’s thought to be free of patent problems. It’s already standardized as RFC 1950. See http://www.gzip.org/zlib/ for details.
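As an illustration of how the dictionary parameter is used, this Python fragment round-trips data through zlib with a preset dictionary. The dictionary contents here are made up, standing in for the payload of the DICT instance named by the parameter.

import zlib

# made-up pre-load dictionary, standing in for a DICT payload
preset = b"int char return void static struct #include <stdio.h>"
sample = b"static int count; int main(void) { return 0; }"

co = zlib.compressobj(zdict=preset)
packed = co.compress(sample) + co.flush()

do = zlib.decompressobj(zdict=preset)    # must be the same dictionary
assert do.decompress(packed) + do.flush() == sample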

UTR-6

This is a method for representing Unicode data more compactly than UTF-8 when many non-“basic Latin” characters are used. Simply encoding Unicode data as UTF-8 requires a single byte only for ASCII characters; it takes two bytes for the “General Scripts” area, and more bytes otherwise, including for CJKV Ideographs. UTR-6 encodes General Scripts as one byte per character by “windowing” the desired character set into a range of one-byte values.

For details of the algorithm, see Unicode Technical Report #6 Revision 3.1, “A Standard Compression Scheme for Unicode”, located at http://www.unicode.org/unicode/reports/tr6/tr6-3.1.html.

UTR-6 may be directly useful for character data (such as a directory chunk or UTF-8 file data) in languages that use many non-ASCII characters, when the data contains little redundancy or patterning. More typically, though, UTR-6 will be used as part of a chain of algorithms. For example, the result of UTR-6 may then be deflated.

Huffman Encoding

The parameter data is a uintV specifying a TREE instance that defines the frequency table.

Using data from a TREE chunk, a Huffman tree is created so that the most common bytes have the smallest bit sequences. The output data is a set of blocks. Each block is a length (uintV) followed by the result of looking up the bit sequence for each byte of input, with trailing zero bits added to make the output a multiple of 8 bits. A single zero byte (which looks like a block length of zero) indicates the end of data.

The length is needed because the trailing padding bits could otherwise be confused for more data when decoding. The length is a uintV of the number of bytes in the unencoded input. That is chosen over the number of bits in the output so it may be emitted before knowing how long the output will be. The decoder must stop when the count of decoded bytes reaches this expected value, even if there are bits remaining in the encoded data stream. At that point the rest of the bits in the current byte are flushed, and the length of the next block begins with the following byte.

The wrapping blocks are used because the length is needed. By using multiple blocks, you don’t need to know the length of the input in advance (which would otherwise be a requirement). You can compress each buffer-full independently, and output a zero when you reach the end.

The bytes of encoded data are created from each group of 8 bits, the first bit in the stream being the high bit in the byte.
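Here is a minimal sketch of the block framing in Python. The code table is a toy stand-in for one derived from a TREE chunk, and write_uintV implements only the one- and two-byte uintV forms implied by the hex-dump example later on this page.

def write_uintV(n):
    # only the one- and two-byte forms implied by the examples on this page
    if n < 0x80:
        return bytes([n])
    if n < 0x4000:
        return bytes([0x80 | (n >> 8), n & 0xFF])
    raise ValueError("longer forms omitted from this sketch")

def encode_blocks(data, codes, block_size=1000):
    out = bytearray()
    for start in range(0, len(data), block_size):
        block = data[start:start + block_size]
        out += write_uintV(len(block))        # count of unencoded input bytes
        bits = "".join(codes[b] for b in block)
        bits += "0" * (-len(bits) % 8)        # pad with trailing zero bits
        # first bit in the stream becomes the high bit of each byte
        out += bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))
    out.append(0)                             # zero length = end of data
    return bytes(out)

# toy code table: 'a' -> 0, 'b' -> 10, 'c' -> 11
print(encode_blocks(b"aabcab", {ord("a"): "0", ord("b"): "10", ord("c"): "11"}).hex())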

To form the Huffman tree, start with the TREE data set as leaf nodes. This forms an initial list of 256 nodes. Remove any nodes whose weight is zero.

While there is more than one node in the list, find the two nodes with the smallest weights. Remove them from the list and form a new non-leaf node that points to these two nodes, the zero branch being the node that was closer to the beginning of the list, and the one branch being the other one. Its weight is the sum of the weights of the combined nodes. Add this new node to the list in the lowest unused position (for values < 256), or append to the list if there are no more empty slots.

∴ That is simple, but how optimal is it? Is there a (still simple) way to form a tree that does a better job of minimizing the sum of the (path length times weight of leaf)?

The bit string emitted for a byte is defined as the branch choices to get from the root to that byte’s leaf node.
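Here is a sketch of the construction in Python, following the rules above: zero-weight leaves removed, the node at the earlier list position becoming the zero branch, and new nodes placed in the lowest unused slot. Where ties among more than two candidates are not pinned down by the text, the sketch picks one reasonable reading (smaller weight first, then earlier position); the sample frequencies are made up.

def build_tree(weights):
    # leaves start in the slot matching their byte value;
    # zero-weight entries are removed (their slots start out empty)
    slots = [[w, b] if w else None for b, w in enumerate(weights)]
    while sum(n is not None for n in slots) > 1:
        live = sorted((i for i, n in enumerate(slots) if n is not None),
                      key=lambda i: (slots[i][0], i))
        i, j = sorted(live[:2])               # earlier position is branch 0
        node = [slots[i][0] + slots[j][0], (slots[i], slots[j])]
        slots[i] = slots[j] = None
        if None in slots:
            slots[slots.index(None)] = node   # lowest unused position
        else:
            slots.append(node)                # no empty slots left
    return next(n for n in slots if n is not None)

def codes_from(node, prefix=""):
    weight, payload = node
    if isinstance(payload, tuple):            # non-leaf: recurse both branches
        return {**codes_from(payload[0], prefix + "0"),
                **codes_from(payload[1], prefix + "1")}
    return {payload: prefix or "0"}           # leaf: payload is the byte value

weights = [0] * 256
for b in b"abracadabra":                      # made-up sample frequencies
    weights[b] += 1
print(codes_from(build_tree(weights)))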

Burrows-Wheeler Transform

This is what libbzip2 does (see http://sourceware.cygnus.com/bzip2/index.html; bad link?). It is a complete compression method centered on the Burrows-Wheeler Transform.

∴The documentation on libbzip2 indicates that some improvements are possible but were omitted to maintain backward compatibility in the bzip file format. However, such improvements could be made for use within the ZIP2 file. This needs to be investigated.

The code for libbzip2 (by Julian Seward) is written in ANSI C and works on a variety of UNIX platforms and Windows. A Macintosh version has been found at http://persephone.cps.unizar.es/~spd/bzip2/. It’s well tested, well documented, and actively maintained. It’s thought to be free of patent problems.

An upcoming version of zlib is supposed to include BWT as a major new feature. Since it’s still in progress, it could be written to learn from libbzip2 and improve upon it. If ready in time, it could be a better candidate for use in this specification.

This algorithm gives better results (around 5% smaller) than deflation for large files. For small files, deflation works better. When a suitable dictionary is available, small files deflate much better.

Run-Length, Repeat, Delta

The general-purpose Deflation method fails to give optimum results in certain cases. This simple method can be applied before deflation to eliminate the pathological cases and, in less severe cases, to enhance its performance.

For example, given a megabyte file containing mostly the letter "M", a standard ZIP file held 2201 bytes of file data. However, that ZIP file was itself zipped down to 208 bytes, a further reduction of 91%! Clearly, the first pass did not compress as well as it might have. This simple example illustrates the limitations of a dictionary-based compression system.
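The effect is easy to reproduce with zlib; exact sizes will differ from the ZIP figures above, but the second pass still shrinks the first pass’s output dramatically:

import zlib

data = b"M" * 1048576                # a megabyte of mostly one letter
once = zlib.compress(data)
twice = zlib.compress(once)
print(len(once), len(twice))         # the second pass shrinks it much further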

∴Re-think this before documenting (it's in the old MS Word doc from draft 1). Combine with the "shuffle" data streams, or split everything up into individual algorithms instead?

Extensible Substitution Dictionary

This replaces known strings with short codes, and escapes out anything that may be confused with such a code.

It relies on a dictionary known to the recipient. Unlike Deflation, the dictionary may be updated and still be backward compatible. This is designed to compress AGNT chunks (though you can use it for anything that might benefit from the same features). As agents are registered, the pre-defined DICT instance will be updated. However, in non-fatal mode, two AGNT strings compressed this way may still be tested for equality even if they contain a substitution code that’s not in the dictionary. In fact, the nature of the AGNT data is to inform the program about the implementation that created the archive; so if a substitution code is unknown, the program can rightly conclude that it knows nothing about that agent. The dictionary would be updated at the same time the program is enhanced to deal with features particular to that agent.
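Since the technical details are documented elsewhere, the following Python sketch only illustrates the idea; the escape byte, the code range, and the table contents are all invented for the example and are not part of this spec.

ESCAPE = 0x1B                        # invented escape byte
CODES = range(0x01, 0x1B)            # invented range of substitution codes
TABLE = {0x01: b"Info-ZIP", 0x02: b"PKZIP"}   # hypothetical agent names

def expand(data, table, fatal=True):
    out = bytearray()
    it = iter(data)
    for b in it:
        if b == ESCAPE:
            out.append(next(it))     # escaped literal byte
        elif b in CODES:
            if b in table:
                out += table[b]
            elif fatal:
                raise ValueError("unknown substitution code %#x" % b)
            else:
                out.append(b)        # keep the opaque code; two strings with
                                     # the same unknown code still compare equal
        else:
            out.append(b)            # ordinary byte passes through
    return bytes(out)

print(expand(b"\x01 2.3", TABLE))               # b'Info-ZIP 2.3'
print(expand(b"\x05", TABLE, fatal=False))      # b'\x05' kept as-is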

∴Link to a page containing technical details of the algorithm.

Usage Notes

Note that variations on compressing are not recorded here if they are not needed during decompressing. For example, the deflation algorithm has parameters for a speed/compression tradeoff level, a speed/memory tradeoff level, history window size, and a strategy. However, when decompressing, none of that matters; the output is compatible no matter which parameters were used. An implementation may provide a way to specify parameters to the compression algorithm when adding files to the archive, but they are not recorded here. Only the information necessary for the decompressor to extract the file is recorded.

Examples

Composition

To define COMP#7 to decode data that has been encoded using UTR-6 and then deflated using dictionary DICT#65, list the algorithms in the order in which to apply them during decoding. The payload would contain the list (0x100, 0x41, 0x101). That would be represented as:

81 00  ; 0x100 = Deflation algorithm
41     ; parameter to the algorithm: DICT instance 0x41
81 01  ; 0x101 = UTR-6 algorithm
       ; takes no parameters.

which would be the payload of COMP#7.
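To check the encoding, this Python fragment parses that payload back into numbers, assuming the one-byte (0xxxxxxx) and two-byte (10xxxxxx xxxxxxxx) uintV forms implied by the hex dump; longer forms are out of scope here.

def read_uintV(buf, pos):
    b = buf[pos]
    if b < 0x80:                               # 0xxxxxxx: 7-bit value
        return b, pos + 1
    if b < 0xC0:                               # 10xxxxxx xxxxxxxx: 14-bit value
        return ((b & 0x3F) << 8) | buf[pos + 1], pos + 2
    raise ValueError("longer forms not handled in this sketch")

payload = bytes([0x81, 0x00, 0x41, 0x81, 0x01])   # the COMP#7 payload above

algo, pos = read_uintV(payload, 0)       # 0x100: Deflation
param, pos = read_uintV(payload, pos)    # 0x41: its DICT instance parameter
algo2, pos = read_uintV(payload, pos)    # 0x101: UTR-6, no parameters
print(hex(algo), hex(param), hex(algo2))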

COMP-nd Chunk

A part number, as with DATA.

COMP-n Chunk

A URL.



Page content copyright 2003 by John M. Dlugosz. Home: http://www.dlugosz.com, email: john@dlugosz.com