Spaces:

shashankkandimalla
/

ocr-text-processing

Running

App Files Files Community

ocr-text-processing / ocr-application /share /doc /xz /xz-file-format.txt

shashankkandimalla

Upload folder using huggingface_hub

288007d verified 10 months ago

raw

history blame contribute delete

44.3 kB


	The .xz File Format
	===================

	Version 1.2.0 (2024-01-19)


	0. Preface
	0.1. Notices and Acknowledgements
	0.2. Getting the Latest Version
	0.3. Version History
	1. Conventions
	1.1. Byte and Its Representation
	1.2. Multibyte Integers
	2. Overall Structure of .xz File
	2.1. Stream
	2.1.1. Stream Header
	2.1.1.1. Header Magic Bytes
	2.1.1.2. Stream Flags
	2.1.1.3. CRC32
	2.1.2. Stream Footer
	2.1.2.1. CRC32
	2.1.2.2. Backward Size
	2.1.2.3. Stream Flags
	2.1.2.4. Footer Magic Bytes
	2.2. Stream Padding
	3. Block
	3.1. Block Header
	3.1.1. Block Header Size
	3.1.2. Block Flags
	3.1.3. Compressed Size
	3.1.4. Uncompressed Size
	3.1.5. List of Filter Flags
	3.1.6. Header Padding
	3.1.7. CRC32
	3.2. Compressed Data
	3.3. Block Padding
	3.4. Check
	4. Index
	4.1. Index Indicator
	4.2. Number of Records
	4.3. List of Records
	4.3.1. Unpadded Size
	4.3.2. Uncompressed Size
	4.4. Index Padding
	4.5. CRC32
	5. Filter Chains
	5.1. Alignment
	5.2. Security
	5.3. Filters
	5.3.1. LZMA2
	5.3.2. Branch/Call/Jump Filters for Executables
	5.3.3. Delta
	5.3.3.1. Format of the Encoded Output
	5.4. Custom Filter IDs
	5.4.1. Reserved Custom Filter ID Ranges
	6. Cyclic Redundancy Checks
	7. References


	0. Preface

	This document describes the .xz file format (filename suffix
	".xz", MIME type "application/x-xz"). It is intended that this
	this format replace the old .lzma format used by LZMA SDK and
	LZMA Utils.


	0.1. Notices and Acknowledgements

	This file format was designed by Lasse Collin
	<lasse.collin@tukaani.org> and Igor Pavlov.

	Special thanks for helping with this document goes to
	Ville Koskinen. Thanks for helping with this document goes to
	Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius.

	This document has been put into the public domain.


	0.2. Getting the Latest Version

	The latest official version of this document can be downloaded
	from <https://xz.tukaani.org/format/xz-file-format.txt>.

	Specific versions of this document have a filename
	xz-file-format-X.Y.Z.txt where X.Y.Z is the version number.
	For example, the version 1.0.0 of this document is available
	at <https://xz.tukaani.org/format/xz-file-format-1.0.0.txt>.


	0.3. Version History

	Version Date Description

	1.2.0 2024-01-19 Added RISC-V filter and updated URLs in
	Sections 0.2 and 7. The URL of this
	specification was changed.

	1.1.0 2022-12-11 Added ARM64 filter and clarified 32-bit
	ARM endianness in Section 5.3.2,
	language improvements in Section 5.4

	1.0.4 2009-08-27 Language improvements in Sections 1.2,
	2.1.1.2, 3.1.1, 3.1.2, and 5.3.1

	1.0.3 2009-06-05 Spelling fixes in Sections 5.1 and 5.4

	1.0.2 2009-06-04 Typo fixes in Sections 4 and 5.3.1

	1.0.1 2009-06-01 Typo fix in Section 0.3 and minor
	clarifications to Sections 2, 2.2,
	3.3, 4.4, and 5.3.2

	1.0.0 2009-01-14 The first official version


	1. Conventions

	The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD",
	"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this
	document are to be interpreted as described in [RFC-2119].

	Indicating a warning means displaying a message, returning
	appropriate exit status, or doing something else to let the
	user know that something worth warning occurred. The operation
	SHOULD still finish if a warning is indicated.

	Indicating an error means displaying a message, returning
	appropriate exit status, or doing something else to let the
	user know that something prevented successfully finishing the
	operation. The operation MUST be aborted once an error has
	been indicated.


	1.1. Byte and Its Representation

	In this document, byte is always 8 bits.

	A "null byte" has all bits unset. That is, the value of a null
	byte is 0x00.

	To represent byte blocks, this document uses notation that
	is similar to the notation used in [RFC-1952]:

	+-------+
	\| Foo \| One byte.
	+-------+

	+---+---+
	\| Foo \| Two bytes; that is, some of the vertical bars
	+---+---+ can be missing.

	+=======+
	\| Foo \| Zero or more bytes.
	+=======+

	In this document, a boxed byte or a byte sequence declared
	using this notation is called "a field". The example field
	above would be called "the Foo field" or plain "Foo".

	If there are many fields, they may be split to multiple lines.
	This is indicated with an arrow ("--->"):

	+=====+
	\| Foo \|
	+=====+

	+=====+
	---> \| Bar \|
	+=====+

	The above is equivalent to this:

	+=====+=====+
	\| Foo \| Bar \|
	+=====+=====+


	1.2. Multibyte Integers

	Multibyte integers of static length, such as CRC values,
	are stored in little endian byte order (least significant
	byte first).

	When smaller values are more likely than bigger values (for
	example file sizes), multibyte integers are encoded in a
	variable-length representation:
	- Numbers in the range [0, 127] are copied as is, and take
	one byte of space.
	- Bigger numbers will occupy two or more bytes. All but the
	last byte of the multibyte representation have the highest
	(eighth) bit set.

	For now, the value of the variable-length integers is limited
	to 63 bits, which limits the encoded size of the integer to
	nine bytes. These limits may be increased in the future if
	needed.

	The following C code illustrates encoding and decoding of
	variable-length integers. The functions return the number of
	bytes occupied by the integer (1-9), or zero on error.

	#include <stddef.h>
	#include <inttypes.h>

	size_t
	encode(uint8_t buf[static 9], uint64_t num)
	{
	if (num > UINT64_MAX / 2)
	return 0;

	size_t i = 0;

	while (num >= 0x80) {
	buf[i++] = (uint8_t)(num) \| 0x80;
	num >>= 7;
	}

	buf[i++] = (uint8_t)(num);

	return i;
	}

	size_t
	decode(const uint8_t buf[], size_t size_max, uint64_t *num)
	{
	if (size_max == 0)
	return 0;

	if (size_max > 9)
	size_max = 9;

	*num = buf[0] & 0x7F;
	size_t i = 0;

	while (buf[i++] & 0x80) {
	if (i >= size_max \|\| buf[i] == 0x00)
	return 0;

	num \|= (uint64_t)(buf[i] & 0x7F) << (i 7);
	}

	return i;
	}


	2. Overall Structure of .xz File

	A standalone .xz files consist of one or more Streams which may
	have Stream Padding between or after them:

	+========+================+========+================+
	\| Stream \| Stream Padding \| Stream \| Stream Padding \| ...
	+========+================+========+================+

	The sizes of Stream and Stream Padding are always multiples
	of four bytes, thus the size of every valid .xz file MUST be
	a multiple of four bytes.

	While a typical file contains only one Stream and no Stream
	Padding, a decoder handling standalone .xz files SHOULD support
	files that have more than one Stream or Stream Padding.

	In contrast to standalone .xz files, when the .xz file format
	is used as an internal part of some other file format or
	communication protocol, it usually is expected that the decoder
	stops after the first Stream, and doesn't look for Stream
	Padding or possibly other Streams.


	2.1. Stream

	+-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+
	\| Stream Header \| Block \| Block \| ... \| Block \|
	+-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+

	+=======+-+-+-+-+-+-+-+-+-+-+-+-+
	---> \| Index \| Stream Footer \|
	+=======+-+-+-+-+-+-+-+-+-+-+-+-+

	All the above fields have a size that is a multiple of four. If
	Stream is used as an internal part of another file format, it
	is RECOMMENDED to make the Stream start at an offset that is
	a multiple of four bytes.

	Stream Header, Index, and Stream Footer are always present in
	a Stream. The maximum size of the Index field is 16 GiB (2^34).

	There are zero or more Blocks. The maximum number of Blocks is
	limited only by the maximum size of the Index field.

	Total size of a Stream MUST be less than 8 EiB (2^63 bytes).
	The same limit applies to the total amount of uncompressed
	data stored in a Stream.

	If an implementation supports handling .xz files with multiple
	concatenated Streams, it MAY apply the above limits to the file
	as a whole instead of limiting per Stream basis.


	2.1.1. Stream Header

	+---+---+---+---+---+---+-------+------+--+--+--+--+
	\| Header Magic Bytes \| Stream Flags \| CRC32 \|
	+---+---+---+---+---+---+-------+------+--+--+--+--+


	2.1.1.1. Header Magic Bytes

	The first six (6) bytes of the Stream are so called Header
	Magic Bytes. They can be used to identify the file type.

	Using a C array and ASCII:
	const uint8_t HEADER_MAGIC[6]
	= { 0xFD, '7', 'z', 'X', 'Z', 0x00 };

	In plain hexadecimal:
	FD 37 7A 58 5A 00

	Notes:
	- The first byte (0xFD) was chosen so that the files cannot
	be erroneously detected as being in .lzma format, in which
	the first byte is in the range [0x00, 0xE0].
	- The sixth byte (0x00) was chosen to prevent applications
	from misdetecting the file as a text file.

	If the Header Magic Bytes don't match, the decoder MUST
	indicate an error.


	2.1.1.2. Stream Flags

	The first byte of Stream Flags is always a null byte. In the
	future, this byte may be used to indicate a new Stream version
	or other Stream properties.

	The second byte of Stream Flags is a bit field:

	Bit(s) Mask Description
	0-3 0x0F Type of Check (see Section 3.4):
	ID Size Check name
	0x00 0 bytes None
	0x01 4 bytes CRC32
	0x02 4 bytes (Reserved)
	0x03 4 bytes (Reserved)
	0x04 8 bytes CRC64
	0x05 8 bytes (Reserved)
	0x06 8 bytes (Reserved)
	0x07 16 bytes (Reserved)
	0x08 16 bytes (Reserved)
	0x09 16 bytes (Reserved)
	0x0A 32 bytes SHA-256
	0x0B 32 bytes (Reserved)
	0x0C 32 bytes (Reserved)
	0x0D 64 bytes (Reserved)
	0x0E 64 bytes (Reserved)
	0x0F 64 bytes (Reserved)
	4-7 0xF0 Reserved for future use; MUST be zero for now.

	Implementations SHOULD support at least the Check IDs 0x00
	(None) and 0x01 (CRC32). Supporting other Check IDs is
	OPTIONAL. If an unsupported Check is used, the decoder SHOULD
	indicate a warning or error.

	If any reserved bit is set, the decoder MUST indicate an error.
	It is possible that there is a new field present which the
	decoder is not aware of, and can thus parse the Stream Header
	incorrectly.


	2.1.1.3. CRC32

	The CRC32 is calculated from the Stream Flags field. It is
	stored as an unsigned 32-bit little endian integer. If the
	calculated value does not match the stored one, the decoder
	MUST indicate an error.

	The idea is that Stream Flags would always be two bytes, even
	if new features are needed. This way old decoders will be able
	to verify the CRC32 calculated from Stream Flags, and thus
	distinguish between corrupt files (CRC32 doesn't match) and
	files that the decoder doesn't support (CRC32 matches but
	Stream Flags has reserved bits set).


	2.1.2. Stream Footer

	+-+-+-+-+---+---+---+---+-------+------+----------+---------+
	\| CRC32 \| Backward Size \| Stream Flags \| Footer Magic Bytes \|
	+-+-+-+-+---+---+---+---+-------+------+----------+---------+


	2.1.2.1. CRC32

	The CRC32 is calculated from the Backward Size and Stream Flags
	fields. It is stored as an unsigned 32-bit little endian
	integer. If the calculated value does not match the stored one,
	the decoder MUST indicate an error.

	The reason to have the CRC32 field before the Backward Size and
	Stream Flags fields is to keep the four-byte fields aligned to
	a multiple of four bytes.


	2.1.2.2. Backward Size

	Backward Size is stored as a 32-bit little endian integer,
	which indicates the size of the Index field as multiple of
	four bytes, minimum value being four bytes:

	real_backward_size = (stored_backward_size + 1) * 4;

	If the stored value does not match the real size of the Index
	field, the decoder MUST indicate an error.

	Using a fixed-size integer to store Backward Size makes
	it slightly simpler to parse the Stream Footer when the
	application needs to parse the Stream backwards.


	2.1.2.3. Stream Flags

	This is a copy of the Stream Flags field from the Stream
	Header. The information stored to Stream Flags is needed
	when parsing the Stream backwards. The decoder MUST compare
	the Stream Flags fields in both Stream Header and Stream
	Footer, and indicate an error if they are not identical.


	2.1.2.4. Footer Magic Bytes

	As the last step of the decoding process, the decoder MUST
	verify the existence of Footer Magic Bytes. If they don't
	match, an error MUST be indicated.

	Using a C array and ASCII:
	const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' };

	In hexadecimal:
	59 5A

	The primary reason to have Footer Magic Bytes is to make
	it easier to detect incomplete files quickly, without
	uncompressing. If the file does not end with Footer Magic Bytes
	(excluding Stream Padding described in Section 2.2), it cannot
	be undamaged, unless someone has intentionally appended garbage
	after the end of the Stream.


	2.2. Stream Padding

	Only the decoders that support decoding of concatenated Streams
	MUST support Stream Padding.

	Stream Padding MUST contain only null bytes. To preserve the
	four-byte alignment of consecutive Streams, the size of Stream
	Padding MUST be a multiple of four bytes. Empty Stream Padding
	is allowed. If these requirements are not met, the decoder MUST
	indicate an error.

	Note that non-empty Stream Padding is allowed at the end of the
	file; there doesn't need to be a new Stream after non-empty
	Stream Padding. This can be convenient in certain situations
	[GNU-tar].

	The possibility of Stream Padding MUST be taken into account
	when designing an application that parses Streams backwards,
	and the application supports concatenated Streams.


	3. Block

	+==============+=================+===============+=======+
	\| Block Header \| Compressed Data \| Block Padding \| Check \|
	+==============+=================+===============+=======+


	3.1. Block Header

	+-------------------+-------------+=================+
	\| Block Header Size \| Block Flags \| Compressed Size \|
	+-------------------+-------------+=================+

	+===================+======================+
	---> \| Uncompressed Size \| List of Filter Flags \|
	+===================+======================+

	+================+--+--+--+--+
	---> \| Header Padding \| CRC32 \|
	+================+--+--+--+--+


	3.1.1. Block Header Size

	This field overlaps with the Index Indicator field (see
	Section 4.1).

	This field contains the size of the Block Header field,
	including the Block Header Size field itself. Valid values are
	in the range [0x01, 0xFF], which indicate the size of the Block
	Header as multiples of four bytes, minimum size being eight
	bytes:

	real_header_size = (encoded_header_size + 1) * 4;

	If a Block Header bigger than 1024 bytes is needed in the
	future, a new field can be added between the Block Header and
	Compressed Data fields. The presence of this new field would
	be indicated in the Block Header field.


	3.1.2. Block Flags

	The Block Flags field is a bit field:

	Bit(s) Mask Description
	0-1 0x03 Number of filters (1-4)
	2-5 0x3C Reserved for future use; MUST be zero for now.
	6 0x40 The Compressed Size field is present.
	7 0x80 The Uncompressed Size field is present.

	If any reserved bit is set, the decoder MUST indicate an error.
	It is possible that there is a new field present which the
	decoder is not aware of, and can thus parse the Block Header
	incorrectly.


	3.1.3. Compressed Size

	This field is present only if the appropriate bit is set in
	the Block Flags field (see Section 3.1.2).

	The Compressed Size field contains the size of the Compressed
	Data field, which MUST be non-zero. Compressed Size is stored
	using the encoding described in Section 1.2. If the Compressed
	Size doesn't match the size of the Compressed Data field, the
	decoder MUST indicate an error.


	3.1.4. Uncompressed Size

	This field is present only if the appropriate bit is set in
	the Block Flags field (see Section 3.1.2).

	The Uncompressed Size field contains the size of the Block
	after uncompressing. Uncompressed Size is stored using the
	encoding described in Section 1.2. If the Uncompressed Size
	does not match the real uncompressed size, the decoder MUST
	indicate an error.

	Storing the Compressed Size and Uncompressed Size fields serves
	several purposes:
	- The decoder knows how much memory it needs to allocate
	for a temporary buffer in multithreaded mode.
	- Simple error detection: wrong size indicates a broken file.
	- Seeking forwards to a specific location in streamed mode.

	It should be noted that the only reliable way to determine
	the real uncompressed size is to uncompress the Block,
	because the Block Header and Index fields may contain
	(intentionally or unintentionally) invalid information.


	3.1.5. List of Filter Flags

	+================+================+ +================+
	\| Filter 0 Flags \| Filter 1 Flags \| ... \| Filter n Flags \|
	+================+================+ +================+

	The number of Filter Flags fields is stored in the Block Flags
	field (see Section 3.1.2).

	The format of each Filter Flags field is as follows:

	+===========+====================+===================+
	\| Filter ID \| Size of Properties \| Filter Properties \|
	+===========+====================+===================+

	Both Filter ID and Size of Properties are stored using the
	encoding described in Section 1.2. Size of Properties indicates
	the size of the Filter Properties field as bytes. The list of
	officially defined Filter IDs and the formats of their Filter
	Properties are described in Section 5.3.

	Filter IDs greater than or equal to 0x4000_0000_0000_0000
	(2^62) are reserved for implementation-specific internal use.
	These Filter IDs MUST never be used in List of Filter Flags.


	3.1.6. Header Padding

	This field contains as many null byte as it is needed to make
	the Block Header have the size specified in Block Header Size.
	If any of the bytes are not null bytes, the decoder MUST
	indicate an error. It is possible that there is a new field
	present which the decoder is not aware of, and can thus parse
	the Block Header incorrectly.


	3.1.7. CRC32

	The CRC32 is calculated over everything in the Block Header
	field except the CRC32 field itself. It is stored as an
	unsigned 32-bit little endian integer. If the calculated
	value does not match the stored one, the decoder MUST indicate
	an error.

	By verifying the CRC32 of the Block Header before parsing the
	actual contents allows the decoder to distinguish between
	corrupt and unsupported files.


	3.2. Compressed Data

	The format of Compressed Data depends on Block Flags and List
	of Filter Flags. Excluding the descriptions of the simplest
	filters in Section 5.3, the format of the filter-specific
	encoded data is out of scope of this document.


	3.3. Block Padding

	Block Padding MUST contain 0-3 null bytes to make the size of
	the Block a multiple of four bytes. This can be needed when
	the size of Compressed Data is not a multiple of four. If any
	of the bytes in Block Padding are not null bytes, the decoder
	MUST indicate an error.


	3.4. Check

	The type and size of the Check field depends on which bits
	are set in the Stream Flags field (see Section 2.1.1.2).

	The Check, when used, is calculated from the original
	uncompressed data. If the calculated Check does not match the
	stored one, the decoder MUST indicate an error. If the selected
	type of Check is not supported by the decoder, it SHOULD
	indicate a warning or error.


	4. Index

	+-----------------+===================+
	\| Index Indicator \| Number of Records \|
	+-----------------+===================+

	+=================+===============+-+-+-+-+
	---> \| List of Records \| Index Padding \| CRC32 \|
	+=================+===============+-+-+-+-+

	Index serves several purposes. Using it, one can
	- verify that all Blocks in a Stream have been processed;
	- find out the uncompressed size of a Stream; and
	- quickly access the beginning of any Block (random access).


	4.1. Index Indicator

	This field overlaps with the Block Header Size field (see
	Section 3.1.1). The value of Index Indicator is always 0x00.


	4.2. Number of Records

	This field indicates how many Records there are in the List
	of Records field, and thus how many Blocks there are in the
	Stream. The value is stored using the encoding described in
	Section 1.2. If the decoder has decoded all the Blocks of the
	Stream, and then notices that the Number of Records doesn't
	match the real number of Blocks, the decoder MUST indicate an
	error.


	4.3. List of Records

	List of Records consists of as many Records as indicated by the
	Number of Records field:

	+========+========+
	\| Record \| Record \| ...
	+========+========+

	Each Record contains information about one Block:

	+===============+===================+
	\| Unpadded Size \| Uncompressed Size \|
	+===============+===================+

	If the decoder has decoded all the Blocks of the Stream, it
	MUST verify that the contents of the Records match the real
	Unpadded Size and Uncompressed Size of the respective Blocks.

	Implementation hint: It is possible to verify the Index with
	constant memory usage by calculating for example SHA-256 of
	both the real size values and the List of Records, then
	comparing the hash values. Implementing this using
	non-cryptographic hash like CRC32 SHOULD be avoided unless
	small code size is important.

	If the decoder supports random-access reading, it MUST verify
	that Unpadded Size and Uncompressed Size of every completely
	decoded Block match the sizes stored in the Index. If only
	partial Block is decoded, the decoder MUST verify that the
	processed sizes don't exceed the sizes stored in the Index.


	4.3.1. Unpadded Size

	This field indicates the size of the Block excluding the Block
	Padding field. That is, Unpadded Size is the size of the Block
	Header, Compressed Data, and Check fields. Unpadded Size is
	stored using the encoding described in Section 1.2. The value
	MUST never be zero; with the current structure of Blocks, the
	actual minimum value for Unpadded Size is five.

	Implementation note: Because the size of the Block Padding
	field is not included in Unpadded Size, calculating the total
	size of a Stream or doing random-access reading requires
	calculating the actual size of the Blocks by rounding Unpadded
	Sizes up to the next multiple of four.

	The reason to exclude Block Padding from Unpadded Size is to
	ease making a raw copy of Compressed Data without Block
	Padding. This can be useful, for example, if someone wants
	to convert Streams to some other file format quickly.


	4.3.2. Uncompressed Size

	This field indicates the Uncompressed Size of the respective
	Block as bytes. The value is stored using the encoding
	described in Section 1.2.


	4.4. Index Padding

	This field MUST contain 0-3 null bytes to pad the Index to
	a multiple of four bytes. If any of the bytes are not null
	bytes, the decoder MUST indicate an error.


	4.5. CRC32

	The CRC32 is calculated over everything in the Index field
	except the CRC32 field itself. The CRC32 is stored as an
	unsigned 32-bit little endian integer. If the calculated
	value does not match the stored one, the decoder MUST indicate
	an error.


	5. Filter Chains

	The Block Flags field defines how many filters are used. When
	more than one filter is used, the filters are chained; that is,
	the output of one filter is the input of another filter. The
	following figure illustrates the direction of data flow.

	v Uncompressed Data ^
	\| Filter 0 \|
	Encoder \| Filter 1 \| Decoder
	\| Filter n \|
	v Compressed Data ^


	5.1. Alignment

	Alignment of uncompressed input data is usually the job of
	the application producing the data. For example, to get the
	best results, an archiver tool should make sure that all
	PowerPC executable files in the archive stream start at
	offsets that are multiples of four bytes.

	Some filters, for example LZMA2, can be configured to take
	advantage of specified alignment of input data. Note that
	taking advantage of aligned input can be beneficial also when
	a filter is not the first filter in the chain. For example,
	if you compress PowerPC executables, you may want to use the
	PowerPC filter and chain that with the LZMA2 filter. Because
	not only the input but also the output alignment of the PowerPC
	filter is four bytes, it is now beneficial to set LZMA2
	settings so that the LZMA2 encoder can take advantage of its
	four-byte-aligned input data.

	The output of the last filter in the chain is stored to the
	Compressed Data field, which is is guaranteed to be aligned
	to a multiple of four bytes relative to the beginning of the
	Stream. This can increase
	- speed, if the filtered data is handled multiple bytes at
	a time by the filter-specific encoder and decoder,
	because accessing aligned data in computer memory is
	usually faster; and
	- compression ratio, if the output data is later compressed
	with an external compression tool.


	5.2. Security

	If filters would be allowed to be chained freely, it would be
	possible to create malicious files, that would be very slow to
	decode. Such files could be used to create denial of service
	attacks.

	Slow files could occur when multiple filters are chained:

	v Compressed input data
	\| Filter 1 decoder (last filter)
	\| Filter 0 decoder (non-last filter)
	v Uncompressed output data

	The decoder of the last filter in the chain produces a lot of
	output from little input. Another filter in the chain takes the
	output of the last filter, and produces very little output
	while consuming a lot of input. As a result, a lot of data is
	moved inside the filter chain, but the filter chain as a whole
	gets very little work done.

	To prevent this kind of slow files, there are restrictions on
	how the filters can be chained. These restrictions MUST be
	taken into account when designing new filters.

	The maximum number of filters in the chain has been limited to
	four, thus there can be at maximum of three non-last filters.
	Of these three non-last filters, only two are allowed to change
	the size of the data.

	The non-last filters, that change the size of the data, MUST
	have a limit how much the decoder can compress the data: the
	decoder SHOULD produce at least n bytes of output when the
	filter is given 2n bytes of input. This limit is not
	absolute, but significant deviations MUST be avoided.

	The above limitations guarantee that if the last filter in the
	chain produces 4n bytes of output, the chain as a whole will
	produce at least n bytes of output.


	5.3. Filters

	5.3.1. LZMA2

	LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose
	compression algorithm with high compression ratio and fast
	decompression. LZMA is based on LZ77 and range coding
	algorithms.

	LZMA2 is an extension on top of the original LZMA. LZMA2 uses
	LZMA internally, but adds support for flushing the encoder,
	uncompressed chunks, eases stateful decoder implementations,
	and improves support for multithreading. Thus, the plain LZMA
	will not be supported in this file format.

	Filter ID: 0x21
	Size of Filter Properties: 1 byte
	Changes size of data: Yes
	Allow as a non-last filter: No
	Allow as the last filter: Yes

	Preferred alignment:
	Input data: Adjustable to 1/2/4/8/16 byte(s)
	Output data: 1 byte

	The format of the one-byte Filter Properties field is as
	follows:

	Bits Mask Description
	0-5 0x3F Dictionary Size
	6-7 0xC0 Reserved for future use; MUST be zero for now.

	Dictionary Size is encoded with one-bit mantissa and five-bit
	exponent. The smallest dictionary size is 4 KiB and the biggest
	is 4 GiB.

	Raw value Mantissa Exponent Dictionary size
	0 2 11 4 KiB
	1 3 11 6 KiB
	2 2 12 8 KiB
	3 3 12 12 KiB
	4 2 13 16 KiB
	5 3 13 24 KiB
	6 2 14 32 KiB
	... ... ... ...
	35 3 27 768 MiB
	36 2 28 1024 MiB
	37 3 29 1536 MiB
	38 2 30 2048 MiB
	39 3 30 3072 MiB
	40 2 31 4096 MiB - 1 B

	Instead of having a table in the decoder, the dictionary size
	can be decoded using the following C code:

	const uint8_t bits = get_dictionary_flags() & 0x3F;
	if (bits > 40)
	return DICTIONARY_TOO_BIG; // Bigger than 4 GiB

	uint32_t dictionary_size;
	if (bits == 40) {
	dictionary_size = UINT32_MAX;
	} else {
	dictionary_size = 2 \| (bits & 1);
	dictionary_size <<= bits / 2 + 11;
	}


	5.3.2. Branch/Call/Jump Filters for Executables

	These filters convert relative branch, call, and jump
	instructions to their absolute counterparts in executable
	files. This conversion increases redundancy and thus
	compression ratio.

	Size of Filter Properties: 0 or 4 bytes
	Changes size of data: No
	Allow as a non-last filter: Yes
	Allow as the last filter: No

	Below is the list of filters in this category. The alignment
	is the same for both input and output data.

	Filter ID Alignment Description
	0x04 1 byte x86 filter (BCJ)
	0x05 4 bytes PowerPC (big endian) filter
	0x06 16 bytes IA64 filter
	0x07 4 bytes ARM filter [1]
	0x08 2 bytes ARM Thumb filter [1]
	0x09 4 bytes SPARC filter
	0x0A 4 bytes ARM64 filter [2]
	0x0B 2 bytes RISC-V filter

	[1] These are for little endian instruction encoding.
	This must not be confused with data endianness.
	A processor configured for big endian data access
	may still use little endian instruction encoding.
	The filters don't care about the data endianness.

	[2] 4096-byte alignment gives the best results
	because the address in the ADRP instruction
	is a multiple of 4096 bytes.

	If the size of Filter Properties is four bytes, the Filter
	Properties field contains the start offset used for address
	conversions. It is stored as an unsigned 32-bit little endian
	integer. The start offset MUST be a multiple of the alignment
	of the filter as listed in the table above; if it isn't, the
	decoder MUST indicate an error. If the size of Filter
	Properties is zero, the start offset is zero.

	Setting the start offset may be useful if an executable has
	multiple sections, and there are many cross-section calls.
	Taking advantage of this feature usually requires usage of
	the Subblock filter, whose design is not complete yet.


	5.3.3. Delta

	The Delta filter may increase compression ratio when the value
	of the next byte correlates with the value of an earlier byte
	at specified distance.

	Filter ID: 0x03
	Size of Filter Properties: 1 byte
	Changes size of data: No
	Allow as a non-last filter: Yes
	Allow as the last filter: No

	Preferred alignment:
	Input data: 1 byte
	Output data: Same as the original input data

	The Properties byte indicates the delta distance, which can be
	1-256 bytes backwards from the current byte: 0x00 indicates
	distance of 1 byte and 0xFF distance of 256 bytes.


	5.3.3.1. Format of the Encoded Output

	The code below illustrates both encoding and decoding with
	the Delta filter.

	// Distance is in the range [1, 256].
	const unsigned int distance = get_properties_byte() + 1;
	uint8_t pos = 0;
	uint8_t delta[256];

	memset(delta, 0, sizeof(delta));

	while (1) {
	const int byte = read_byte();
	if (byte == EOF)
	break;

	uint8_t tmp = delta[(uint8_t)(distance + pos)];
	if (is_encoder) {
	tmp = (uint8_t)(byte) - tmp;
	delta[pos] = (uint8_t)(byte);
	} else {
	tmp = (uint8_t)(byte) + tmp;
	delta[pos] = tmp;
	}

	write_byte(tmp);
	--pos;
	}


	5.4. Custom Filter IDs

	If a developer wants to use custom Filter IDs, there are two
	choices. The first choice is to contact Lasse Collin and ask
	him to allocate a range of IDs for the developer.

	The second choice is to generate a 40-bit random integer
	which the developer can use as a personal Developer ID.
	To minimize the risk of collisions, Developer ID has to be
	a randomly generated integer, not manually selected "hex word".
	The following command, which works on many free operating
	systems, can be used to generate Developer ID:

	dd if=/dev/urandom bs=5 count=1 \| hexdump

	The developer can then use the Developer ID to create unique
	(well, hopefully unique) Filter IDs.

	Bits Mask Description
	0-15 0x0000_0000_0000_FFFF Filter ID
	16-55 0x00FF_FFFF_FFFF_0000 Developer ID
	56-62 0x3F00_0000_0000_0000 Static prefix: 0x3F

	The resulting 63-bit integer will use 9 bytes of space when
	stored using the encoding described in Section 1.2. To get
	a shorter ID, see the beginning of this Section how to
	request a custom ID range.


	5.4.1. Reserved Custom Filter ID Ranges

	Range Description
	0x0000_0300 - 0x0000_04FF Reserved to ease .7z compatibility
	0x0002_0000 - 0x0007_FFFF Reserved to ease .7z compatibility
	0x0200_0000 - 0x07FF_FFFF Reserved to ease .7z compatibility


	6. Cyclic Redundancy Checks

	There are several incompatible variations to calculate CRC32
	and CRC64. For simplicity and clarity, complete examples are
	provided to calculate the checks as they are used in this file
	format. Implementations MAY use different code as long as it
	gives identical results.

	The program below reads data from standard input, calculates
	the CRC32 and CRC64 values, and prints the calculated values
	as big endian hexadecimal strings to standard output.

	#include <stddef.h>
	#include <inttypes.h>
	#include <stdio.h>

	uint32_t crc32_table[256];
	uint64_t crc64_table[256];

	void
	init(void)
	{
	static const uint32_t poly32 = UINT32_C(0xEDB88320);
	static const uint64_t poly64
	= UINT64_C(0xC96C5795D7870F42);

	for (size_t i = 0; i < 256; ++i) {
	uint32_t crc32 = i;
	uint64_t crc64 = i;

	for (size_t j = 0; j < 8; ++j) {
	if (crc32 & 1)
	crc32 = (crc32 >> 1) ^ poly32;
	else
	crc32 >>= 1;

	if (crc64 & 1)
	crc64 = (crc64 >> 1) ^ poly64;
	else
	crc64 >>= 1;
	}

	crc32_table[i] = crc32;
	crc64_table[i] = crc64;
	}
	}

	uint32_t
	crc32(const uint8_t *buf, size_t size, uint32_t crc)
	{
	crc = ~crc;
	for (size_t i = 0; i < size; ++i)
	crc = crc32_table[buf[i] ^ (crc & 0xFF)]
	^ (crc >> 8);
	return ~crc;
	}

	uint64_t
	crc64(const uint8_t *buf, size_t size, uint64_t crc)
	{
	crc = ~crc;
	for (size_t i = 0; i < size; ++i)
	crc = crc64_table[buf[i] ^ (crc & 0xFF)]
	^ (crc >> 8);
	return ~crc;
	}

	int
	main()
	{
	init();

	uint32_t value32 = 0;
	uint64_t value64 = 0;
	uint64_t total_size = 0;
	uint8_t buf[8192];

	while (1) {
	const size_t buf_size
	= fread(buf, 1, sizeof(buf), stdin);
	if (buf_size == 0)
	break;

	total_size += buf_size;
	value32 = crc32(buf, buf_size, value32);
	value64 = crc64(buf, buf_size, value64);
	}

	printf("Bytes: %" PRIu64 "\n", total_size);
	printf("CRC-32: 0x%08" PRIX32 "\n", value32);
	printf("CRC-64: 0x%016" PRIX64 "\n", value64);

	return 0;
	}


	7. References

	LZMA SDK - The original LZMA implementation
	https://7-zip.org/sdk.html

	LZMA Utils - LZMA adapted to POSIX-like systems
	https://tukaani.org/lzma/

	XZ Utils - The next generation of LZMA Utils
	https://xz.tukaani.org/xz-utils/

	[RFC-1952]
	GZIP file format specification version 4.3
	https://www.ietf.org/rfc/rfc1952.txt
	- Notation of byte boxes in section "2.1. Overall conventions"

	[RFC-2119]
	Key words for use in RFCs to Indicate Requirement Levels
	https://www.ietf.org/rfc/rfc2119.txt

	[GNU-tar]
	GNU tar 1.35 manual
	https://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html
	- Node 9.4.2 "Blocking Factor", paragraph that begins
	"gzip will complain about trailing garbage"
	- Note that this URL points to the latest version of the
	manual, and may some day not contain the note which is in
	1.35. For the exact version of the manual, download GNU
	tar 1.35: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.35.tar.gz