Multi-File Archives

The ZIP2 File Structure defines a physical file format containing information pertaining to an archive. However, an archive might be split into multiple physical files. There are special issues to consider in making such a set of files provide all the desired features. Handling this well is one of the things missing from other archive formats.

Features

Multi-file or “spanning” archives are fully supported. This support is not an afterthought grafted onto an existing design, nor does it simply chop a large file into arbitrary segments. Code to support multi-file archives is part of the reference implementation.

Instead of only being able to list and extract files in a multi-file archive, a ZIP2 archive can be updated with files added, replaced, or removed.

A ZIP2 archive can have files extracted or updated without having to read all the portion files. Only those portion files containing relevant data need to be accessed.

If portion files are on different pieces of removable media, the user must be prompted to insert each one as needed. The design prevents unnecessary reading or scanning of portion files, and the program can prompt for the exact piece of media needed.

Missing or damaged portion files are tolerated if not needed for the task at hand. Since the program only looks at the required portions, it is not even aware that some other portion is missing.

An archive can easily be repackaged into portions of a desired size. For example, copying a large archive to removable media can put as much as fits on each piece, even when different amounts of space are free on each piece. A server can keep one copy but present a logical view of different sized portions on demand; that is, the user can choose to download one 10-meg file, ten 1-meg files, a hundred 100K files, or any other desired way of slicing it.
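
As a rough illustration of filling media with different amounts of free space, the following sketch assigns chunks to portions greedily. It is only a sketch under stated assumptions: the function name plan_portions is hypothetical, chunk sizes are assumed to be known up front, chunks are assumed not to be split across portions, and header and cross-reference overhead is ignored.

```python
from typing import List


def plan_portions(chunk_sizes: List[int], capacities: List[int]) -> List[List[int]]:
    """Greedily place chunks, in order, onto media with the given free space.

    Returns one list of chunk indices per portion used.
    Assumptions (not from the specification): chunks are never split,
    and per-portion header overhead is ignored.
    """
    plan: List[List[int]] = []
    next_chunk = 0
    for free in capacities:
        placed: List[int] = []
        # Keep adding chunks while the next one still fits on this medium.
        while next_chunk < len(chunk_sizes) and chunk_sizes[next_chunk] <= free:
            placed.append(next_chunk)
            free -= chunk_sizes[next_chunk]
            next_chunk += 1
        plan.append(placed)
        if next_chunk == len(chunk_sizes):
            return plan
    raise ValueError("not enough free space for all chunks")
```

For example, chunks of sizes 4, 3, 5, 2 and 6 placed on media with 8, 10 and 6 units free end up as three portions holding chunks 0 and 1, chunks 2 and 3, and chunk 4.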

Portion Files

When a single logical archive is split into multiple physical files, each file is known as a portion.

Each portion file is made up of a Non-Chunk Header followed by Chunks, and may be embedded in a larger data stream. In short, the file format is exactly like that of a regular one-piece archive, except that it doesn’t contain all the chunks needed to define the archive. Collectively, all the portion files contain all the chunks, and optionally a few extra ones to cross-reference chunks between files.

Each portion file is assigned a sequence number with the first file being number 1, and continuing up to the last portion file in the archive.

Instance Numbers between Portions

There are three distinct kinds of chunks: those with an instance number of 0, those with a positive instance number that contain real information, and those that are pointers or cross-references to chunks in other portions.

Chunks having the special instance number of zero may exist in more than one portion; that is, each portion may have its own.

Non-pointer chunks with positive instance numbers must be unique within the whole archive, not just unique within a file. So, an implementation that sees a chunk somewhere can assume that it won’t see the same one (with a possibly contradictory meaning) somewhere else.

Pointer chunks have the n flag set. These chunks are stand-ins for a chunk having the same type and instance number that exists somewhere else. The purpose of these pointer chunks is to indicate the location of that real chunk. So, the instance number must be unique within a file, but each file can contain its own copy. Contradictory information in different copies is not fatal.
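
The three kinds of chunks and the archive-wide uniqueness rule can be sketched as follows. This is a minimal illustration, not the reference implementation: ChunkRef, ChunkKind, classify and duplicated_real_chunks are hypothetical names, a chunk is reduced to its type, instance number and n flag, and checking the n flag before the instance number is an assumption.

```python
from collections import defaultdict
from dataclasses import dataclass
from enum import Enum
from typing import Dict, List, Tuple


class ChunkKind(Enum):
    SHARED_ZERO = "instance number 0, each portion may have its own"
    REAL = "positive instance number, unique within the whole archive"
    POINTER = "n flag set, stands in for a real chunk elsewhere"


@dataclass(frozen=True)
class ChunkRef:
    # Hypothetical stand-in for a parsed chunk header.
    type: str
    instance: int
    n_flag: bool = False


def classify(chunk: ChunkRef) -> ChunkKind:
    # Assumption: the n flag is checked first, then the instance number.
    if chunk.n_flag:
        return ChunkKind.POINTER
    if chunk.instance == 0:
        return ChunkKind.SHARED_ZERO
    return ChunkKind.REAL


def duplicated_real_chunks(
    portions: Dict[int, List[ChunkRef]]
) -> Dict[Tuple[str, int], List[int]]:
    """Map (type, instance) to portion numbers for real chunks seen more than once."""
    seen: Dict[Tuple[str, int], List[int]] = defaultdict(list)
    for portion_seq, chunks in portions.items():
        for chunk in chunks:
            if classify(chunk) is ChunkKind.REAL:
                seen[(chunk.type, chunk.instance)].append(portion_seq)
    return {key: where for key, where in seen.items() if len(where) > 1}
```

An empty result from duplicated_real_chunks means the uniqueness rule holds for the portions examined; pointer chunks and instance-zero chunks are deliberately ignored by the check, since each file may carry its own.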

File Names

Each file in a multi-part archive should have a consistent name. This allows the implementation to look at the original file’s name and calculate the name for some other needed portion without having to search for it and do more general matching. Given the name of one portion file, the name of any portion file can be unambiguously computed.

Primary Naming Convention

The names should match in case and representation. That is, they should be binary identical except for the sequence number. Again, this is to allow easy unambiguous computing of the portion name rather than a search and match operation.

The name consists of a base name, a dot, the sequence number, another dot, and the extension "ZIP2" or "zip2".

There is no whitespace on either side of a dot, and the sequence number either has no leading zeros or is always padded to the same length.

For example,

	source backup.001.zip2
	source backup.002.zip2
	source backup.003.zip2
	...
	source backup.099.zip2
	source backup.100.zip2
	
	Reference.1.zip2
	Reference.2.zip2
	...
	Reference.9.zip2
	Reference.10.zip2
	...
	Reference.1038.zip2
	Reference.1039.zip2
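
Computing one portion name from another might be sketched like this. The names PortionName, parse_portion_name and sibling_name are hypothetical, and the sketch simply reuses the digit count found in the given name as the padding width for the whole set.

```python
import re
from typing import NamedTuple

# Primary convention: base name, dot, sequence digits, dot, "ZIP2" or "zip2".
_PORTION_RE = re.compile(r"^(?P<base>.+)\.(?P<seq>[0-9]+)\.(?P<ext>ZIP2|zip2)$")


class PortionName(NamedTuple):
    base: str   # kept exactly as found, so names stay binary identical
    seq: int    # portion sequence number, starting at 1
    width: int  # number of digits in the name (the padding width, if padded)
    ext: str    # "ZIP2" or "zip2", kept exactly as found


def parse_portion_name(name: str) -> PortionName:
    m = _PORTION_RE.match(name)
    if m is None:
        raise ValueError(f"not a primary-convention portion name: {name!r}")
    digits = m.group("seq")
    return PortionName(m.group("base"), int(digits), len(digits), m.group("ext"))


def sibling_name(name: str, seq: int) -> str:
    """Name of portion `seq`, assuming the digit count seen in `name` is the padding width."""
    p = parse_portion_name(name)
    return f"{p.base}.{seq:0{p.width}d}.{p.ext}"
```

With these definitions, sibling_name("source backup.001.zip2", 100) gives "source backup.100.zip2", and sibling_name("Reference.9.zip2", 10) gives "Reference.10.zip2".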

Having no padding in the sequence number has the advantage of not having to guess how many portions you will need ahead of time. If you started with 001, what happens when you reach 999? But padding allows the list of file names to sort correctly under a simple string comparison. Without padding, Windows Explorer, for example, would put 12 before 2.
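
The sorting trade-off is easy to reproduce. The following short sketch compares a plain string sort with a sort on the numeric sequence field; the helper seq_key is a hypothetical name and assumes names that follow the primary convention.

```python
unpadded = ["Reference.2.zip2", "Reference.12.zip2", "Reference.1.zip2"]
padded = ["backup.002.zip2", "backup.012.zip2", "backup.001.zip2"]


def seq_key(name: str) -> int:
    # The sequence number is the second-to-last dot-separated field.
    return int(name.rsplit(".", 2)[1])


plain_unpadded = sorted(unpadded)             # simple string comparison
numeric_unpadded = sorted(unpadded, key=seq_key)
plain_padded = sorted(padded)                 # padding makes string order numeric
```

Here plain_unpadded places Reference.12.zip2 before Reference.2.zip2, while numeric_unpadded and plain_padded both come out in sequence order.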

If a high-numbered file is opened first (typically the user may insert the last disc of a set to start) and its sequence number has no visible leading zeros, then it is ambiguous how to form a smaller number. Starting with file number 1039, is part 5 going to be "5" or "0005"? When it’s ambiguous the implementation must check for both possibilities. But once it knows one way or the other, it should only check for the correct name.
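
One way to enumerate the names to check is sketched below. The function candidate_names is hypothetical; it assumes that a padded set uses exactly as many digits as the name that was opened, which matches the "5" versus "0005" example but is not stated as a general rule.

```python
from typing import List


def candidate_names(base: str, digits_seen: str, wanted: int, ext: str = "zip2") -> List[str]:
    """Possible names for portion `wanted`, given the sequence digits of the opened file.

    Assumption: if the set is padded, the padding width equals len(digits_seen).
    """
    width = len(digits_seen)
    unpadded = f"{base}.{wanted}.{ext}"
    padded = f"{base}.{wanted:0{width}d}.{ext}"
    if digits_seen.startswith("0"):
        # A visible leading zero proves the set is padded.
        return [padded]
    if padded == unpadded:
        # The wanted number has at least as many digits: nothing to guess.
        return [unpadded]
    # No visible leading zeros and a shorter number: both forms are possible.
    return [unpadded, padded]
```

Opening Reference.1039.zip2 and asking for part 5 yields both Reference.5.zip2 and Reference.0005.zip2. Once one of them is found, a caller would remember which form matched and compute later names directly.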

Restricted Naming Convention

Sometimes files may be stored on media that do not allow multiple dots or long names, for example plain DOS FAT-16 without LFN support, or ISO 9660 Level 1 CDs. In that situation, this alternate naming convention is used.

The base name is 8 or fewer characters, each an ASCII upper-case letter, a digit, or the underscore character.

The base name is followed by a dot and a 3-digit sequence number, padded with zeros. If there are more than 999 portions, the hundreds digit goes from 9 to A, and continues through to Z. If there are more than 3599 portions, you can’t use this naming convention.
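
The extended hundreds digit can be sketched as follows. The function names are hypothetical, and the sketch reads the rule as: the first character counts hundreds using 0 through 9 and then A through Z, so portion 1000 becomes A00 and portion 3599 becomes Z99. Requiring at least one character in the base name is also an assumption.

```python
import re
import string

_HUNDREDS = string.digits + string.ascii_uppercase  # "0".."9" then "A".."Z"
MAX_RESTRICTED_SEQ = 3599  # Z99 is the last representable portion


def restricted_suffix(seq: int) -> str:
    """Three-character sequence field for the restricted convention."""
    if not 1 <= seq <= MAX_RESTRICTED_SEQ:
        raise ValueError(f"sequence number {seq} cannot use the restricted convention")
    hundreds, rest = divmod(seq, 100)
    return f"{_HUNDREDS[hundreds]}{rest:02d}"


def parse_restricted_suffix(suffix: str) -> int:
    """Inverse of restricted_suffix."""
    if (len(suffix) != 3 or suffix[0] not in _HUNDREDS
            or any(c not in string.digits for c in suffix[1:])):
        raise ValueError(f"bad restricted sequence field: {suffix!r}")
    return _HUNDREDS.index(suffix[0]) * 100 + int(suffix[1:])


def restricted_name(base: str, seq: int) -> str:
    # Base name: 1 to 8 ASCII upper-case letters, digits, or underscores.
    if not re.fullmatch(r"[A-Z0-9_]{1,8}", base):
        raise ValueError(f"base name not allowed: {base!r}")
    return f"{base}.{restricted_suffix(seq)}"
```

Under this reading, restricted_name("BACKUP", 999) is BACKUP.999 and restricted_name("BACKUP", 1000) is BACKUP.A00.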

Bad Naming Convention

Giving each portion the identical name, with the portions residing on different pieces of media or in different file paths, is strongly discouraged. It should be possible for multiple portions to be copied to the same directory without conflict. An implementation may use the fully-qualified file name to keep track of the different portions in use, without use of any other media identifier for removable media.

Other Names

Even if the portion files don’t follow these naming conventions, it should not be a fatal error in an interactive program. Any time an implementation needs a file, the user may be able to override the file name and location and identify what the desired file is really called. In non-interactive mode, such naming information would have to be specified ahead of time (e.g. as command-line arguments); otherwise it is indeed an error, since there is no user to prompt.
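
The difference between interactive and non-interactive behaviour might look like this in outline. Everything here is a hypothetical sketch: resolve_portion_path, the overrides mapping (standing in for names given ahead of time) and the prompt callback (standing in for asking the user) are illustrative names, not part of any defined interface.

```python
import os
from typing import Callable, Dict, Optional


def resolve_portion_path(
    seq: int,
    computed_name: str,
    overrides: Dict[int, str],
    prompt: Optional[Callable[[int, str], str]] = None,
    exists: Callable[[str], bool] = os.path.exists,
) -> str:
    """Pick the file to open for portion `seq`.

    `overrides` holds names supplied ahead of time; `prompt` is None in
    non-interactive mode, otherwise it asks the user for the real name.
    """
    candidate = overrides.get(seq, computed_name)
    if exists(candidate):
        return candidate
    if prompt is None:
        # Non-interactive: there is no user to ask, so this is an error.
        raise FileNotFoundError(f"portion {seq} not found at {candidate!r}")
    # Interactive: let the user say what the file is really called.
    return prompt(seq, candidate)
```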

Knowledge of File Contents

When an implementation is aware of the existence of a particular information-bearing (that is, non-pointer) chunk, it is said to have noticed that chunk.

An implementation notices a chunk if it exists in the current portion file, or if a -n (pointer) variation of that instance exists in the current portion file. The order of chunks in a single file does not matter, as updating a file may write a chunk to any free area within the file. So if a file is opened, the implementation should be aware of the existence of all chunks in that file, either by using a TOCN or by walking the blocks through to the end of the file.

An implementation is not required to remember all the chunks that are in every portion file it has read in the current program invocation, though it certainly may do so. It must be aware of the chunks in the current file (in the case of removable media, this is formally the most recent one inserted), and it may assume that various optional -a flagged chunks will be noticed in the same portion that contains the associated DATA chunk.
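
The notion of noticing can be pictured as building a table from the chunk headers of the current portion. This is only a sketch: ChunkHeader and notice are hypothetical names, and a header is reduced to its type, instance number and pointer flag.

```python
from typing import Dict, Iterable, NamedTuple, Tuple


class ChunkHeader(NamedTuple):
    # Hypothetical reduced view of a chunk header.
    type: str
    instance: int
    is_pointer: bool  # True for the -n (pointer) variation


def notice(current_portion: Iterable[ChunkHeader]) -> Dict[Tuple[str, int], bool]:
    """Map each noticed (type, instance) to whether the real chunk is in this file.

    A chunk is noticed if the chunk itself or a pointer to it appears in
    the current portion. Chunk order is irrelevant.
    """
    noticed: Dict[Tuple[str, int], bool] = {}
    for header in current_portion:
        key = (header.type, header.instance)
        is_local = not header.is_pointer
        noticed[key] = noticed.get(key, False) or is_local
    return noticed
```

An entry mapped to False is known to exist but lives in some other portion, which is the case where the program would prompt for that portion if the chunk is actually needed.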

A META chunk can modify this and tell the program that certain chunks can be assumed to exist and to look for them in other portions if necessary, even if they have not been noticed yet.

-nd flagged chunks

-n flagged chunks



Page content copyright 2003 by John M. Dlugosz. Home: http://www.dlugosz.com, email: john@dlugosz.com