compsort
Optimize the ordering of files in archives for better compression. For relevant archives, compression can often be improved by 10% or more, bringing the benefits of larger compression dictionaries without their larger memory usage.
Background
Compression algorithms are limited by their window/dictionary size: they cannot see redundancy that is farther apart than their dictionary.
While working with some archives, I noticed that some files were very similar but spread all over the archive, therefore preventing efficient compression.
It is possible to increase the dictionary size but that comes at the cost of increased memory usage during compression and, more importantly, decompression.
Attempts to order files by hand proved that this could bring tangible benefits, but also that filenames were not a good input for heuristics: sorting by filename or extension frequently led to larger archives.
Approach
The current code uses a degenerate hash chain to speed up the comparison of byte sequences between files. This is used to calculate how similar two files are.
The process is CPU-intensive (hours) and memory-hungry (gigabytes), but this is a one-time cost and the result supports incremental additions of files to the dataset.
Counting repetitions
There have been several approaches to count repeated data between two files.
cat file1 file2 | xz | wc -c
The first idea was to compress every pair of files, record the size, and compare it to the size of each file compressed independently.
While conceptually simple and a good proof of concept, this was slow: xz compression is not fast, and more importantly the approach is quadratic with no way to improve performance.
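For illustration only, a minimal OCaml sketch of that idea could look like the following (it assumes the unix library; xz_size and shared_estimate are made-up names, not part of compsort):

```ocaml
(* Illustrative sketch only: estimate how much data two files share by
   comparing compressed sizes, as in the proof of concept above. *)
let xz_size (files : string list) : int =
  let cmd =
    Printf.sprintf "cat %s | xz -c | wc -c"
      (String.concat " " (List.map Filename.quote files))
  in
  let ic = Unix.open_process_in cmd in
  let size = int_of_string (String.trim (input_line ic)) in
  ignore (Unix.close_process_in ic);
  size

(* Bytes saved by compressing the two files together rather than separately:
   a rough measure of their shared content. *)
let shared_estimate file1 file2 =
  xz_size [ file1 ] + xz_size [ file2 ] - xz_size [ file1; file2 ]
```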
Modified xz-utils
A subsequent approach was to create an archive with all the files and compress it with xz using a very large dictionary. The xz version used was modified to output every repetition ("match") it found. This was followed by a large amount of processing to map positions in the archive back to files.
The need for a patched xz (a one-liner, but still a patch) greatly complicated deployment and the code sat untouched for several months.
Then, in 2024, the xz backdoor was implemented and found. It was clear that providing a forked xz was not going to help.
Xz parsing
An idea was then to compress the archive as usual and then parse the compressed output. Unfortunately, parsing LZMA requires an almost-complete decompressor, which is not trivial.
Degenerate hash chains
The current approach is based on hash chains, for which I didn't easily find literature, especially because search results for "hash" and "chains" are swamped by cryptocurrency and blockchain results.
I call it degenerate as it's not a proper implementation: it's not even a chain. It has been good enough so far, however.
- Consider two files X and Y.
- Create an array A of length 2 ** 24.
- For each position i of X, compute the hash of X[i..i+15] and truncate it to 24 bits; use that as an index to store i in A.
- For each position j of Y, compute a truncated hash the same way, look it up in A to retrieve a stored position i, and test whether the data at X[i] matches the data at Y[j].
There are many issues with the above but it works well enough in practice. It's quadratic, but each iteration is fast and the whole set of comparisons is embarrassingly parallel.
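A rough OCaml sketch of the scheme is shown below; the hash function (OCaml's generic Hashtbl.hash) and the constants are placeholders, not the actual compsort implementation:

```ocaml
(* Rough sketch of the "degenerate hash chain" described above; the hash
   function and constants are placeholders, not the real implementation. *)
let table_bits = 24 (* A has 2^24 entries *)
let window = 16     (* length of the compared byte sequences, X[i..i+15] *)

(* Hash of the [window] bytes starting at [pos], truncated to [table_bits] bits. *)
let truncated_hash (data : bytes) (pos : int) : int =
  Hashtbl.hash (Bytes.sub data pos window) land ((1 lsl table_bits) - 1)

(* Count positions of [y] whose window matches the window recorded in the
   table for [x]. Colliding positions simply overwrite each other, which is
   what makes this "degenerate": there is no chain to follow. *)
let count_matches (x : bytes) (y : bytes) : int =
  let a = Array.make (1 lsl table_bits) (-1) in
  for i = 0 to Bytes.length x - window do
    a.(truncated_hash x i) <- i
  done;
  let count = ref 0 in
  for j = 0 to Bytes.length y - window do
    let i = a.(truncated_hash y j) in
    if i >= 0 && Bytes.equal (Bytes.sub x i window) (Bytes.sub y j window)
    then incr count
  done;
  !count
```

Because colliding entries overwrite each other and hashes are truncated, the count is only approximate, which is part of the "many issues" mentioned above.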
lz4?
A future experiment will involve using lz4 instead of xz. Lz4 is very fast partly because it does not encode symbols: as far as I understand, it only detects repetitions, which is exactly what is wanted here.
Clustering
TBD
(Roughly: a clustering algorithm (as used in bioinformatics), with the distance being similar to a Jaccard index computed from the repetition counts above and the file sizes.)
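As a purely hypothetical sketch of such a distance (the clustering code itself is still TBD, and all names here are invented):

```ocaml
(* Hypothetical Jaccard-like distance between two files, where [overlap] is
   the repetition count from the previous section (bytes of shared data) and
   [size_a]/[size_b] are the file sizes. 0. means everything is shared,
   1. means nothing in common. *)
let distance ~overlap ~size_a ~size_b =
  let shared = float_of_int overlap in
  let union = float_of_int (size_a + size_b) -. shared in
  if union <= 0. then 0. else 1. -. (shared /. union)
```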
Benchmarks
Compression usually improves; when it doesn't, the result is only very slightly worse.
Tests are done by compressing with xz a cpio archive created with the files in the various orders.
Results also include values for compression using a 1G dictionary, which is enough to cover each whole archive at once and serves as a hint of a lower bound for the sizes.
Benchmark script
The benchmarking is quite ad-hoc at the moment. Assuming you have an initrd directory in the current directory and built sources:
X=initrd
# List the files to include in the archive.
find "${X}" -type f > "${X}.list"
# Re-order the list with compsort; keep only the first field (the file path).
./_build/install/default/bin/compsort overlaps.db $(cat ${X}.list) | cut -f1 -d' ' > ${X}.buckets
# Create and compress one archive per ordering.
cat ${X}.buckets | cpio --create -H newc | xz -kfv > "${X}_buckets.cpio.xz"
cat ${X}.list | cpio --create -H newc | xz -kfv > "${X}_find.cpio.xz"
# Compare the resulting sizes.
ls -l "${X}"_*xz
You can use a directory you already have on your system (e.g. a package from your distribution, extracted to its own directory). If you don't have good test data, git fetch the test-data branch of this repository and run:
git show 2016ca68d7612ced77a2f5c5cfdb0cf927a7c4a5 | tar xJv
Keep in mind that the process is time-consuming: it is quadratic in the number of files and also impacted by file size (maybe linear in the total file size). There are two steps; the second is also quadratic in the number of files but much, much faster.
A quick way to reduce the resources required is to replace cat ${X}.list above with head -n 500 ${X}.list. Obviously this will give partial results but can still be interesting.
Results
initrd
An Ubuntu initrd on my machine.
Version | Size (bytes) | Compared to default | Relative improvement compared to 1G dictionary |
---|---|---|---|
Default order, default dictionary | 32399424 | 0% | 0% |
Current implementation, default dictionary | 28667796 | -11.5% | 90% |
Default order, 1G dictionary | 28275764 | -12.7% | 100% |
Firmware
This dataset is roughly linux-firmware.git.
Version | Size (bytes) | Compared to default | Relative improvement compared to 1G dictionary |
---|---|---|---|
Default order, default dictionary | 512306496 | 0% | 0% |
Current implementation, default dictionary | 241043200 | -6.0% | 63% |
Default order, 1G dictionary | 232100920 | -9.4% | 100% |
initrd-rpi4
An Ubuntu initrd for the Raspberry Pi 4.
Version | Size (bytes) | Compared to default | Relative improvement compared to 1G dictionary |
---|---|---|---|
Default order, default dictionary | 10400604 | 0% | 0% |
Current implementation, default dictionary | 10402760 | +0.0% | N/A |
Default order, 1G dictionary | 10274160 | -1.2% | 100% |
Visualization
The bettercomp program generates two images named bettercomp-<algorithm>.png and bettercomp-noop.png (names are hard-coded at the moment). They help see how the files are ordered: noop shows the order before re-ordering and the other image the order after.
Dark pixels at (x,y) mean that files x and y (indexed by their position in the input file list) compress poorly together, while white pixels indicate that they compress well together.
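For illustration (this is not the actual bettercomp code), such an image could be produced from a matrix of overlap counts roughly like this:

```ocaml
(* Illustrative sketch, not the actual bettercomp code: write an ASCII PGM
   where pixel (x, y) is whiter when files x and y share more data. *)
let write_pgm filename (overlap : int array array) =
  let n = Array.length overlap in
  let max_overlap =
    Array.fold_left (fun acc row -> Array.fold_left max acc row) 1 overlap
  in
  let oc = open_out filename in
  Printf.fprintf oc "P2\n%d %d\n255\n" n n;
  for y = 0 to n - 1 do
    for x = 0 to n - 1 do
      Printf.fprintf oc "%d " (overlap.(y).(x) * 255 / max_overlap)
    done;
    output_char oc '\n'
  done;
  close_out oc
```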
python3-django-horizon
Shows patterns in the original order and how files are better grouped after processing. The data is fairly small and fits in a default xz dictionary, so compression doesn't improve much, but the visual patterns are interesting.
Original order:
"{width=512 height=512px}
Improved order:
"{width=512 height=512px}
firmware
The patterns below are less obvious, but you can see how islands are created, especially in the bottom right, with files that compress well together but not at all with others.
Original order:
"{width=512 height=512px}
Improved order:
"{width=512 height=512px}
Acknowledgments
I had lots of good discussions about this in #tukaani on Libera.chat, especially with Lasse Collin.
On discuss.ocaml.org, Vincent Laviron helped me improve the performance of the overlap computation.