compsort
Optimize the ordering of files in archives for better compression. For relevant archives, compression can often be improved by 10% or more, bringing the benefits of larger compression dictionaries without their larger memory usage.
Background
Compression algorithms are limited by their window/dictionary size: they cannot see redundancy that is farther apart than their dictionary.
While working with some archives, I noticed that some files were very similar but spread all over the archive, therefore preventing efficient compression.
It is possible to increase the dictionary size but that comes at the cost of increased memory usage during compression and, more importantly, decompression.
Attempts to order files by hand proved that this could bring tangible benefits, but also that filenames were not a good input for heuristics: sorting by filename or extension frequently led to larger archives.
Approach
The current code uses a degenerate hash chain to speed up the comparison of byte sequences between files. This is used to calculate how similar two files are.
The process is CPU-intensive (hours) and memory-hungry (gigabytes), but this is a one-time cost and the result supports incremental additions of files to the dataset.
Counting repetitions
There have been several approaches to count repeated data between two files.
cat file1 file2 | xz | wc -c
The first idea was to compress every pair of files, record the size, and compare it to the size of each file compressed independently.
While conceptually simple and a good proof of concept, this was slow: xz compression is not fast, and more importantly the approach is quadratic with no way to improve performance.
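For illustration only, a minimal OCaml sketch of that idea could look like the following (it assumes the unix library; xz_size and shared_estimate are made-up names, not part of compsort):

```ocaml
(* Illustrative sketch only: estimate how much data two files share by
   comparing compressed sizes, as in the proof of concept above. *)
let xz_size (files : string list) : int =
  let cmd =
    Printf.sprintf "cat %s | xz -c | wc -c"
      (String.concat " " (List.map Filename.quote files))
  in
  let ic = Unix.open_process_in cmd in
  let size = int_of_string (String.trim (input_line ic)) in
  ignore (Unix.close_process_in ic);
  size

(* Bytes saved by compressing the two files together rather than separately:
   a rough measure of their shared content. *)
let shared_estimate file1 file2 =
  xz_size [ file1 ] + xz_size [ file2 ] - xz_size [ file1; file2 ]
```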
Modified xz-utils
A subsequent approach was to create an archive with all the files and compress it with xz using a very large dictionary. The xz version used was modified to output every repetition ("match") it found. This was followed by a large amount of processing to map positions in the archive back to files.
The need for a patched xz (a one-liner, but still a patch) greatly complicated deployment and the code sat untouched for several months.
Then, in 2024, the xz backdoor was implemented and found. It was clear that providing a forked xz was not going to help.
Xz parsing
An idea was then to compress the archive as usual and then parse the compressed output. Unfortunately, parsing LZMA requires an almost-complete decompressor, which is not trivial.
Degenerate hash chains
The current approach is based on hash chains, for which I didn't easily find literature, especially because search results for "hash" and "chains" are swamped by cryptocurrency and blockchain results.
I call it degenerate as it's not a proper implementation: it's not even a chain. It has been good enough so far, however.
- Consider two files X and Y.
- Create an array A of length 2 ** 24.
- For each position i of X, compute the hash of X[i..i+15] and truncate it to 24 bits; use that as an index to store i in A.
- For each position j of Y, compute a truncated hash the same way, look it up in A to retrieve a stored position i, and test whether the data at X[i] matches the data at Y[j].
There are many issues with the above but it works well enough in practice. It's quadratic, but each iteration is fast and the whole set of comparisons is embarrassingly parallel.
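A rough OCaml sketch of the scheme is shown below; the hash function (OCaml's generic Hashtbl.hash) and the constants are placeholders, not the actual compsort implementation:

```ocaml
(* Rough sketch of the "degenerate hash chain" described above; the hash
   function and constants are placeholders, not the real implementation. *)
let table_bits = 24 (* A has 2^24 entries *)
let window = 16     (* length of the compared byte sequences, X[i..i+15] *)

(* Hash of the [window] bytes starting at [pos], truncated to [table_bits] bits. *)
let truncated_hash (data : bytes) (pos : int) : int =
  Hashtbl.hash (Bytes.sub data pos window) land ((1 lsl table_bits) - 1)

(* Count positions of [y] whose window matches the window recorded in the
   table for [x]. Colliding positions simply overwrite each other, which is
   what makes this "degenerate": there is no chain to follow. *)
let count_matches (x : bytes) (y : bytes) : int =
  let a = Array.make (1 lsl table_bits) (-1) in
  for i = 0 to Bytes.length x - window do
    a.(truncated_hash x i) <- i
  done;
  let count = ref 0 in
  for j = 0 to Bytes.length y - window do
    let i = a.(truncated_hash y j) in
    if i >= 0 && Bytes.equal (Bytes.sub x i window) (Bytes.sub y j window)
    then incr count
  done;
  !count
```

Because colliding entries overwrite each other and hashes are truncated, the count is only approximate, which is part of the "many issues" mentioned above.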
lz4?
A future experiment will involve using lz4 instead of xz. Lz4 is very fast partly because it does not encode symbols: as far as I understand, it only detects repetitions, which is exactly what is wanted here.
Clustering
TBD
(Roughly: a clustering algorithm (as used in bioinformatics), with the distance being similar to a Jaccard index computed from the repetition counts above and the file sizes.)
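As a purely hypothetical sketch of such a distance (the clustering code itself is still TBD, and all names here are invented):

```ocaml
(* Hypothetical Jaccard-like distance between two files, where [overlap] is
   the repetition count from the previous section (bytes of shared data) and
   [size_a]/[size_b] are the file sizes. 0. means everything is shared,
   1. means nothing in common. *)
let distance ~overlap ~size_a ~size_b =
  let shared = float_of_int overlap in
  let union = float_of_int (size_a + size_b) -. shared in
  if union <= 0. then 0. else 1. -. (shared /. union)
```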
Benchmarks
Compression usually improves; when it doesn't, the result is only very slightly worse.
Tests are done by compressing with xz a cpio archive created with the files in the various orders.
Results also include values for compression using a 1G dictionary, which is enough to cover each whole archive at once and serves as a hint of a lower bound for the sizes.
Benchmark script
The benchmarking is quite ad-hoc at the moment. Assuming you have an initrd directory in the current directory and built sources:
X=initrd
# List the files to include in the archive.
find "${X}" -type f > "${X}.list"
# Re-order the list with compsort; keep only the first field (the file path).
./_build/install/default/bin/compsort overlaps.db $(cat ${X}.list) | cut -f1 -d' ' > ${X}.buckets
# Create and compress one archive per ordering.
cat ${X}.buckets | cpio --create -H newc | xz -kfv > "${X}_buckets.cpio.xz"
cat ${X}.list | cpio --create -H newc | xz -kfv > "${X}_find.cpio.xz"
# Compare the resulting sizes.
ls -l "${X}"_*xz
You can use a directory you already have on your system (e.g. a package from your distribution, extracted to its own directory). If you don't have good test data, git fetch the test-data branch of this repository and run:
git show 2016ca68d7612ced77a2f5c5cfdb0cf927a7c4a5 | tar xJv
Keep in mind that the process is time-consuming: it is quadratic in the number of files and also impacted by file size (maybe linear in the total file size). There are two steps; the second is also quadratic in the number of files but much, much faster.
A quick way to reduce the resources required is to replace cat ${X}.list above with head -n 500 ${X}.list. Obviously this will give partial results but can still be interesting.
Results
initrd
An Ubuntu initrd on my machine.
Version | Size (bytes) | Compared to default | Relative improvement compared to 1G dictionary |
---|---|---|---|
Default order, default dictionary | 32399424 | 0% | 0% |
Current implementation, default dictionary | 28667796 | -11.5% | 90% |
Default order, 1G dictionary | 28275764 | -12.7% | 100% |
Firmware
This dataset is roughly linux-firmware.git.
Version | Size (bytes) | Compared to default | Relative improvement compared to 1G dictionary |
---|---|---|---|
Default order, default dictionary | 512306496 | 0% | 0% |
Current implementation, default dictionary | 241043200 | -6.0% | 63% |
Default order, 1G dictionary | 232100920 | -9.4% | 100% |
initrd-rpi4
An Ubuntu initrd for the Raspberry Pi 4.
Version | Size (bytes) | Compared to default | Relative improvement compared to 1G dictionary |
---|---|---|---|
Default order, default dictionary | 10400604 | 0% | 0% |
Current implementation, default dictionary | 10402760 | +0.0% | N/A |
Default order, 1G dictionary | 10274160 | -1.2% | 100% |
Visualization
The bettercomp program generates two images named bettercomp-<algorithm>.png and bettercomp-noop.png (names are hard-coded at the moment). They help see how the files are ordered: noop shows the order before re-ordering and the other image the order after.
Dark pixels at (x,y) mean that files x and y (indexed by their position in the input file list) compress poorly together, while white pixels indicate that they compress well together.
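For illustration (this is not the actual bettercomp code), such an image could be produced from a matrix of overlap counts roughly like this:

```ocaml
(* Illustrative sketch, not the actual bettercomp code: write an ASCII PGM
   where pixel (x, y) is whiter when files x and y share more data. *)
let write_pgm filename (overlap : int array array) =
  let n = Array.length overlap in
  let max_overlap =
    Array.fold_left (fun acc row -> Array.fold_left max acc row) 1 overlap
  in
  let oc = open_out filename in
  Printf.fprintf oc "P2\n%d %d\n255\n" n n;
  for y = 0 to n - 1 do
    for x = 0 to n - 1 do
      Printf.fprintf oc "%d " (overlap.(y).(x) * 255 / max_overlap)
    done;
    output_char oc '\n'
  done;
  close_out oc
```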
python3-django-horizon
Shows patterns in the original order and how files are better grouped after processing. The data is fairly small and fits in a default xz dictionary, so compression doesn't improve much, but the visual patterns are interesting.
Original order:
"{width=512 height=512px}
Improved order:
"{width=512 height=512px}
firmware
The patterns below are less obvious, but you can see how islands are created, especially in the bottom right, with files that compress well together but not at all with others.
Original order:
"{width=512 height=512px}
Improved order:
"{width=512 height=512px}
Acknowledgments
I had lots of good discussions about this in #tukaani on Libera.chat, especially with Lasse Collin.
On discuss.ocaml.org, Vincent Laviron helped me improve the performance of the overlap computation.