A potential bz2 to xz filelist/hublist format transition...
Posted: 23 Aug 2010, 14:59
... likely does not make sense.
Since bz2 replaced the old proprietary hublist and filelist compression formats a few years ago, a new general-purpose, standardized compression format, xz, has emerged. It promises an average 15% improvement in compression ratio over bz2, at the cost of several times more computation time and more memory than bzip2's 900KiB. Fedora has used xz as its packaging format since release 13, the venerable GNU tar has supported it as the --lzma compression option since version 1.20, Linux has used it as a compressed kernel image format since 2.6.30, and Blizzard uses it for its MPQ format.
Adding support for such a new compression algorithm would be a potentially reasonable change, so I investigated it with the help of filelists provided by eMTee, FleetCommand, and iceman50. However, based on the results listed below, I concluded that, at least for filelists, xz/LZMA2 provides too little benefit at too high a cost, in both execution time and memory, to justify implementing it in DC.
The filelist test set contains 440 lists with a total uncompressed size of 1771.7MiB. Filesizes shown are summed across all compressed filelists. Execution times are from an Athlon XP 3000+ at 2GHz running 32-bit xz (XZ Utils) 4.999.9beta/liblzma 4.999.9beta, single-threaded:
- bzip2 -9: 553.1 MiB, 13:52.41 minutes (832.41 seconds)
- .xz -6: 547.3 MiB (1.1% improvement), 48:22.61 minutes (2902.61 seconds, 249% increase)
- .xz -6e: 545.6 MiB (1.4% improvement), 58:29.16 minutes (3509.16 seconds, 322% increase)
- .xz -9e: 540.6 MiB (2.2% improvement), 1:03:12.03 hours (3792.03 seconds, 356% increase)
- .tar.bz2 -9: 550.6 MiB, 13:59.60 minutes (839.60 seconds)
- .tar.xz -6: 539.6 MiB, 54:22.40 minutes (3262.40 seconds)
- .tar.xz -6e: 537.9 MiB, 1:05:18.67 hours (3918.67 seconds)
- .tar.xz -9e: 520.8 MiB, 1:28:22.66 hours (5302.66 seconds)
- bzip2 doesn't gain much from tar here, only 3MiB.
- the time difference between -6 and -9, modulo the -e, is small even though -9e achieves noticeably better compression than -6e. However, requiring 674MiB RAM (-9) is excessive; I included the -9e cases as the most favorable case for xz, time notwithstanding.
- the -e flag added about 10 minutes to -6 (-6e vs. -6), a 21% increase in time, for only about a 0.3% further decrease in filesize.
- the absolute times aren't that interesting, since everyone has a different CPU. That's why I focus on relative times/speeds in these comments.
- xz certainly can do spectacularly well (I've seen cases of this) and I believe it does average about 15% better than bzip2, but for DC++ XML filelists either it has issues or bzip2 is unusually/surprisingly effective.
- the -6 setting on xz requires 94MiB of RAM to implement an 8MiB LZMA dictionary. The -9 setting requires 674MiB to implement a 64MiB LZMA dictionary.
- These results are only from a single CPU; i3/5/7s, Core 2s, AMD K8, etc. may behave differently timewise.
- allowing 'solid' compression (.tar.bz2/xz) seems to help xz a lot more than bzip2 (absolute differences of roughly 8MiB to 20MiB rather than under 3MiB), but this isn't directly relevant to potential usage in DC as a hublist/filelist representation. All compressors found solid archives more time-consuming to create, by anywhere from a negligible 7 seconds (bzip2) to several minutes (xz).
- In general, doing these experiments with sample sizes of n=1 (arguably what's happening here, though there are lots of filelists) doesn't produce statistically significant results. I want to be clear that I'm not claiming any, just a set of highly suggestive results.
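For anyone who wants to repeat the comparison on their own filelists, here is a minimal sketch in Python using the standard-library bz2 and lzma modules. It is not the method used for the numbers above (those came from the bzip2 and xz command-line tools); the function names and the "filelists" directory are hypothetical, and in-memory one-shot compression will not time exactly like the CLI tools. It compresses each list separately, sums the compressed sizes, and reports size and time relative to bzip2 -9.

```python
import bz2
import lzma
import time
from pathlib import Path


def measure(name, compress, blobs):
    """Compress every blob separately; return (name, total bytes, seconds)."""
    start = time.perf_counter()
    total = sum(len(compress(blob)) for blob in blobs)
    return name, total, time.perf_counter() - start


def compare(blobs):
    """Run bzip2 -9 and two xz presets over the same inputs."""
    codecs = [
        ("bzip2 -9", lambda d: bz2.compress(d, 9)),
        ("xz -6", lambda d: lzma.compress(d, preset=6)),
        ("xz -6e", lambda d: lzma.compress(d, preset=6 | lzma.PRESET_EXTREME)),
    ]
    return [measure(name, fn, blobs) for name, fn in codecs]


def report(results):
    """Print sizes and times relative to the first (baseline) row."""
    _, base_size, base_time = results[0]
    for name, size, secs in results:
        saved = 100.0 * (base_size - size) / base_size
        slower = 100.0 * (secs - base_time) / base_time
        print(f"{name:9s} {size:12d} bytes ({saved:+.1f}% smaller), "
              f"{secs:8.2f} s ({slower:+.0f}% time)")


if __name__ == "__main__":
    # Hypothetical directory of uncompressed XML filelists; adjust as needed.
    blobs = [p.read_bytes() for p in sorted(Path("filelists").glob("*.xml"))]
    if blobs:
        report(compare(blobs))
```

Adding a preset=9 | lzma.PRESET_EXTREME entry to the codec list would cover the -9e case, bearing in mind the memory requirement noted above.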