
A potential bz2 to xz filelist/hublist format transition...

Posted: 23 Aug 2010, 14:59
by cologic
... likely does not make sense.

Since bz2 was adopted in place of the old proprietary hublist and filelist compression formats a few years ago, a new general-purpose, standardized compression format, xz, has emerged, promising an average 15% compression-ratio improvement over bz2 at the cost of a few times more computation time and far more memory than bzip2's 900KiB. Fedora has used xz as its packaging format since release 13, the venerable GNU tar has supported its LZMA predecessor via the --lzma option since version 1.20, Linux has used it as a compressed kernel image format since 2.6.30, and Blizzard uses it in its MPQ format.

Adding support for a new such compression algorithm would be a potentially reasonable change, so I investigated it with the help of filelists provided by eMTee, FleetCommand, and iceman50. However, I concluded, based on the results listed below, that, at least for filelists, xz/lzma2 provides too little benefit at too great a cost, in both execution time and memory, to justify implementing it in DC.

The filelist test set contains 440 lists with a total uncompressed size of 1771.7MiB. Filesizes shown are summed across all compressed filelists. Execution times are from an Athlon XP 3000+ at 2GHz running 32-bit xz (XZ Utils) 4.999.9beta/liblzma 4.999.9beta, single-threaded (a minimal sketch of the per-file measurement appears after the comments below):
  • bzip2 -9: 553.1 MiB, 13:52.41 minutes (832.41 seconds)
  • .xz -6: 547.3 MiB (1.1% improvement), 48:22.61 minutes (2902.61 seconds, 249% increase)
  • .xz -6e: 545.6 MiB (1.4% improvement), 58:29.16 minutes (3509.16 seconds, 322% increase)
  • .xz -9e: 540.6 MiB (2.2% improvement), 1:03:12.03 hours (3792.03 seconds, 356% increase)
Out of curiosity I ran a solid-archive version of these tests, though it doesn't directly apply to potential DC usage.
  • .tar.bz2 -9: 550.6 MiB, 13:59.60 minutes (839.60 seconds)
  • .tar.xz -6: 539.6 MiB, 54:22.40 minutes (3262.40 seconds)
  • .tar.xz -6e: 537.9 MiB, 1:05:18.67 hours (3918.67 seconds)
  • .tar.xz -9e: 520.8 MiB, 1:28:22.66 hours (5302.66 seconds)
Comments:
  1. bzip2 doesn't gain much from tar here, only about 2.5MiB.
  2. the time difference between -6 and -9, modulo the -e, is negligible, even though -9e achieves noticeably better compression than -6e. However, the 674MiB of RAM that -9 requires is excessive; I included the -9e cases as the most favorable case for xz, time notwithstanding.
  3. the -6e flag took 10 additional minutes, a 21% increase over -6, for only a 0.3% decrease in filesize.
  4. the absolute times aren't that interesting, since everyone has a different CPU. That's why I focus on relative times/speeds in these comments.
  5. xz certainly can do spectacularly well (I've seen cases of it), and I believe it does average about 15% better than bzip2, but on DC++ XML filelists either it has issues or bzip2 is unusually, surprisingly effective.
  6. the -6 setting on xz requires 94MiB of RAM to implement an 8MiB LZMA dictionary. The -9 setting requires 674MiB to implement a 64MiB LZMA dictionary.
  7. These results are from only a single CPU - i3/i5/i7s, Core 2s, AMD K8s, etc. may behave differently timewise.
  8. allowing 'solid' compression (.tar.bz2/.tar.xz) seems to help xz a lot more than .bz2 (absolute differences around 10MiB rather than 5MiB), but this isn't directly relevant to potential usage in DC as a hublist/filelist representation. Solid archives also took every compressor longer to create, by anywhere from a negligible 7 seconds (bzip2) to several minutes (xz).
  9. In general, doing these experiments with sample sizes of n=1 (arguably what's happening here, though there are lots of filelists) doesn't produce statistically significant results. I want to be clear that I'm not claiming any, just a set of highly suggestive results.
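To make that methodology concrete, each data point above reduces to compressing one filelist and recording output size and wall-clock time. A minimal sketch (the filename and xz flags are illustrative; bz2 is Python's standard module, while xz is invoked as an external binary since Python has no binding for it):

Code: Select all

# Hypothetical single measurement: one filelist, bzip2 -9 in-process and
# xz -6 via the external binary; sizes in bytes, times in wall-clock seconds.
from bz2 import compress
from subprocess import Popen, PIPE
from time import time

raw = file('files.xml', 'rb').read()

begin = time()
bz2_size = len(compress(raw, 9))
bz2_time = time() - begin

begin = time()
process = Popen(['xz', '--stdout', '-6'], stdin=PIPE, stdout=PIPE)
xz_size = len(process.communicate(raw)[0])
xz_time = time() - begin

print 'bzip2 -9: %d bytes in %.2fs; xz -6: %d bytes in %.2fs'%(
    bz2_size, bz2_time, xz_size, xz_time)
The full automation script appears later in this thread.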

Re: A potential bz2 to xz filelist/hublist format transition

Posted: 23 Aug 2010, 18:09
by cologic
Quicksilver suggested trying zlib. I'm not going to repeat the solid-archiving tests, so these are sums of times and sizes of the same 440 individually compressed XML filelists:
  • bzip2 -9: 553.1 MiB, 13:52.41 minutes (832.41 seconds); this is the status quo
  • gzip -4: 686.1 MiB (24.0% larger than status quo), 2:37.02 minutes (157.02 seconds, 81.1% decrease from status quo)
  • gzip -5: 676.3 MiB (22.3% larger than status quo), 2:33.40 minutes (153.40 seconds, 81.6% decrease from status quo)
  • gzip -6: 673.5 MiB (21.8% larger than status quo), 2:42.23 minutes (162.23 seconds, 80.5% decrease from status quo)
  • gzip -7: 666.0 MiB (20.4% larger than status quo), 2:56.78 minutes (176.78 seconds, 78.8% decrease from status quo)
  • gzip -8: 663.4 MiB (19.9% larger than status quo), 3:39.78 minutes (219.78 seconds, 73.6% decrease from status quo)
  • gzip -9: 663.4 MiB (19.9% larger than status quo), 3:40.93 minutes (220.93 seconds, 73.5% decrease from status quo)
These tests were not entirely CPU-bound, though: only about 80%-85% of the wall-clock time was CPU usage per se, as contrasted with 99%+ on all the bzip2 and xz tests. The hard disk on this machine (the same Athlon XP 3000+ at 2GHz) is old and slow; presumably, on modern machines with faster I/O, the speed gains at low gzip settings would be greater, since each decrease in gzip setting saw another couple of percent of wall-clock time move from CPU to waiting on I/O. By gzip -4, where total time actually increased (random fluctuation, presumably; this is why n=1 sampling is bad), I was seeing "gzip -4v *.xml 106.16s user 6.20s system 71% cpu 2:37.02 total". (A sketch of measuring the CPU-bound share from Python follows below.)

I should also note that just copying the raw XML files into each new directory to be gzipped took about 2:33 minutes, so gzip -5 is essentially the speed of cp on this machine.
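For reference, here's roughly how one could separate child-process CPU time from wall-clock time in Python rather than reading it off the shell's time builtin; os.times() indices 2 and 3 are the CPU seconds consumed by reaped child processes. The filename is illustrative:

Code: Select all

# Sketch: how CPU-bound is one external gzip run? os.times()[2:4] gives
# (children user, children system) CPU seconds, updated once the child exits.
import os
from subprocess import call
from time import time

out = file('files.xml.gz', 'wb')
cpu_before = sum(os.times()[2:4])
wall_before = time()
assert call(['gzip', '-5', '-c', 'files.xml'], stdout=out) == 0
cpu_used = sum(os.times()[2:4]) - cpu_before
wall_used = time() - wall_before
out.close()

print 'gzip -5: %.1f%% of wall-clock time spent on CPU'%(100*cpu_used/wall_used)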

Re: A potential bz2 to xz filelist/hublist format transition

Posted: 24 Aug 2010, 08:45
by Quicksilver
A last test that might yield interesting results would be trying xmill.

XMill is designed specifically for compressing XML, so I would assume it will produce smaller files than lzma while still not costing that much CPU/RAM.
Though even if it gets results that good, I wouldn't want it, as it's not widespread enough; people would need to implement it in their own language before being able to use it.


Another interesting thing would be separating filelists by size. I heard that gzip/zlib often performs better than bz2 on smaller XML files. It might be interesting to see whether that is also true for DC filelists and, if it is, where they break even.
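A sketch of what that break-even test could look like, bucketing the filelists by uncompressed size (powers of two) and comparing zlib and bz2 totals per bucket; the 'raw' directory of .xml filelists is an assumption matching the script posted later in this thread:

Code: Select all

# Bucket filelists by uncompressed size and total the zlib and bz2 output
# per bucket, to find where (or whether) zlib catches up with bz2.
import bz2, zlib
from os import listdir, path

def size_bucket(n):
    # power-of-two bucket: 0 means under 2KiB, 1 under 4KiB, and so on
    bucket = 0
    while n >= 2048:
        n >>= 1
        bucket += 1
    return bucket

buckets = {}
for filename in listdir('raw'):
    if not filename.endswith('.xml'):
        continue
    raw = file(path.join('raw', filename), 'rb').read()
    b = size_bucket(len(raw))
    z, bz, count = buckets.get(b, (0, 0, 0))
    buckets[b] = (z + len(zlib.compress(raw, 9)),
                  bz + len(bz2.compress(raw, 9)), count + 1)

for b in sorted(buckets):
    z, bz, count = buckets[b]
    print 'under %6d KiB: %3d files, zlib/bz2 size ratio %.3f'%(
        2**(b + 1), count, float(z)/bz)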

Re: A potential bz2 to xz filelist/hublist format transition

Posted: 12 Sep 2010, 00:50
by cologic
Quicksilver wrote:A last test that might yield interesting results would be trying xmill.

XMill is designed specifically for compressing XML, so I would assume it will produce smaller files than lzma while still not costing that much CPU/RAM.
Though even if it gets results that good, I wouldn't want it, as it's not widespread enough; people would need to implement it in their own language before being able to use it.


Another interesting thing would be separating filelists by size. I heard that gzip/zlib often performs better than bz2 on smaller XML files. It might be interesting to see whether that is also true for DC filelists and, if it is, where they break even.
First, xmill has apparently been abandoned for a while; I used xmill 0.9.1 from 2004, the last version I can find any evidence of. It compiles without significant issue under gcc 4.3.2. I compiled with -O2 and configured it to use the system bzip2 and zlib libraries, so those parts are identical. For runtime options, I either ran the default (for bzip2) or used -z -<5 through 9>, avoiding -4 per the previous data. It appears to operate as a preprocessor to zlib and bzip2; I tried both versions. It also appears to support "ppmdi", but if that's anything like regular PPM, it's a very high-cost, somewhat effective method, more so than xz/lzma2/7zip, so I didn't try it.

Results:
  • bzip2 -9: 553.1 MiB, 13:52.41 minutes (832.41 seconds); status quo
  • xmill/bzip2 -9: 542.6 MiB (1.9% smaller than status quo), 14:53.14 minutes (893.14 seconds)
  • xmill/gzip -5: 593.3 MiB (7.3% larger than status quo), 3:40.74 minutes (220.74 seconds)
  • xmill/gzip -6: 591.2 MiB (6.9% larger than status quo), 4:01.47 minutes (241.47 seconds)
  • xmill/gzip -7: 590.4 MiB (6.8% larger than status quo), 4:15.80 minutes (255.80 seconds)
  • xmill/gzip -8: 589.6 MiB (6.6% larger than status quo), 5:23.42 minutes (323.42 seconds)
  • xmill/gzip -9: 589.5 MiB (6.6% larger than status quo), 5:43.29 minutes (343.29 seconds)
It's definitely recognizing the XML; some per-file verbose sample output follows (a toy sketch of this container splitting appears after the comments below):

Code: Select all

   Structure:  874342 ==>     7305 (0.835485%)
 Whitespaces:       0 ==> Small...
     Special:      56 ==> Small...
        Sum:   874342 ==>     7305 (0.835485%)

//# <- @Version
          0:        2 ==> Small...

//# <- @CID
          0:       40 ==> Small...

//# <- @Base
          0:        2 ==> Small...

//# <- @Generator
          0:       12 ==> Small...

//# <- @Name
          0:  2256195 ==>   663870 (29.424318%)

//# <- @Size
          0:   602688 ==>   270583 (44.896032%)

//# <- @TTH
          0:  3016280 ==>  1891634 (62.714138%)


Header:          Uncompressed:      237   Compressed:      294 (124.050633%)
Structure:       Uncompressed:   874342   Compressed:     7305 (0.835485%)
Data:            Uncompressed:  5875163   Compressed:  2826087 (48.102274%)
                                          --------------------
Sum:                                      Compressed:  2833686
As for your last suggestion: will get to that hopefully soon.

Comments:
  1. xmill/zlib has very different time/compression tradeoffs than zlib alone: for example, -8 and -9 are very similar in zlib but differ meaningfully under xmill/zlib, and xmill/zlib -7 is relatively less worthwhile than zlib -7. In general, xmill/zlib seems to smooth out the zlib time/compression curve, such that none of the settings is unusually great, but none is nearly worthless either (as gzip -8 is).
  2. xmill with zlib still can't match bzip2 in size, though it only takes about a third of the time. It comes much closer than raw zlib does. It looks like a reasonable use of xmill-like programs might be to compromise between zlib (too inefficient) and bzip2 (too slow), a domain where I'm not sure there are many competitors.
  3. xmill tended to add an overhead of a couple of minutes over the entire test set, even compiled with -O2. That's not bad, but given the mediocre results it seems unwarranted (even aside from maintenance issues).
  4. bzip2 has proven a remarkably durably useful compromise between speed and compressed size.
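To illustrate the idea behind those per-container numbers: xmill's preprocessing amounts to routing values into per-container streams (all @TTH values together, all @Size values together, and so on) so that each stream is self-similar before the backend compressor sees it. A toy sketch of just that splitting step, with bz2 applied per container; this is not xmill's actual container format, and files.xml is a stand-in filename:

Code: Select all

# Toy container split: collect each XML attribute's values into its own
# stream, then compress each stream separately, mimicking xmill's layout.
import bz2
from xml.sax import parse
from xml.sax.handler import ContentHandler

class ContainerSplitter(ContentHandler):
    def __init__(self):
        ContentHandler.__init__(self)
        self.containers = {}    # attribute name -> list of values

    def startElement(self, name, attrs):
        for attr in attrs.getNames():
            self.containers.setdefault(attr, []).append(attrs.getValue(attr))

splitter = ContainerSplitter()
parse('files.xml', splitter)

for attr, values in sorted(splitter.containers.items()):
    joined = '\n'.join(values).encode('utf-8')
    print '@%s: %d ==> %d bytes'%(attr, len(joined), len(bz2.compress(joined, 9)))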

Re: A potential bz2 to xz filelist/hublist format transition

Posted: 13 Sep 2010, 01:59
by darkKlor
Interesting stuff. gzip could be good for slow-CPU environments where you don't want a background task slowing the system, or for high-speed hubs, such as LANs, where the emphasis is on getting the list out rather than keeping the size down. Of course, the differences in terms of seconds are very small, so perhaps not a huge benefit?

Somebody managing to write a GPU implementation of Tiger would be more interesting. I asked the bloke that created Tiger if he thought it would do any good and all he bothered saying was that he submitting Tiger as part of the current NIST competition.

Re: A potential bz2 to xz filelist/hublist format transition

Posted: 13 Sep 2010, 18:21
by cologic
darkKlor wrote:Interesting stuff. gzip could be good for slow-CPU environments where you don't want a background task slowing the system, or for high-speed hubs, such as LANs, where the emphasis is on getting the list out rather than keeping the size down. Of course, the differences in terms of seconds are very small, so perhaps not a huge benefit?

Somebody managing to write a GPU implementation of Tiger would be more interesting. I asked the bloke that created Tiger if he thought it would do any good and all he bothered saying was that he submitting Tiger as part of the current NIST competition.
The CPU tested is, short of an Atom, slower than almost any processor one is likely to encounter, having been released in 2003. Despite this, the maximum time difference from bzip2 -9 encountered over 1.8GiB of raw XML input was 680 seconds. Stipulating that bzip2's compression time probably scales approximately linearly with filelist size, 4% of that 680-second difference is 27 seconds; so on a 7-year-old CPU, on a fairly large filelist of 75MiB (4% of 1.8GiB), gzip -5 would probably save less than 30 seconds, and anything 'faster' looks limited by hard disk speed. At that point, I'd look at other bottlenecks before assuming that gzip would be worthwhile:
  • DC client new-version adoption rates tend to be low, such that one would probably have to wait a year or more before one could assume that, even in a smallish hub, one would only ever need to serve out a .gz filelist. Otherwise, it merely adds work, albeit not much: compress the same filelist twice, once with bzip2 -9 and once with gzip (a sketch of this follows after this list).
  • Further, by the time new clients are adopted, CPUs will have become faster again, or at least more multithreaded (which suffices either to render background processing less onerous or to speed up compression, depending on the tradeoff one wishes to make). Even Atoms are dual-core now.
  • Smartphones running something akin to DroidDC are an interesting test case but I'd suggest they benefit from bzip2: they tend to have limited bandwidth (no Ethernet usually, either 802.11g/n, 3G, or, in a year or so, LTE - all relatively slow compared to the classic LAN) such that saving 20% filesize with bzip2 over gzip on compressed filelists matters; and they tend to have highly constrained storage, maxing out around 16GiB internally and maybe a 32GiB or 64GiB SD card, making large filelists unlikely regardless.
  • The most plausible case I can imagine satisfying the required intersection of fast networks (at least 100Mb/s, preferably 1Gb/s) and readily accessible multi-TiB storage is some Atom-based NAS-like machine (or an Atom-based desktop - I've seen a few people build them, but they know the tradeoffs). Given that Atoms are becoming dual-core, I wouldn't worry about them too much.
  • An alternative is someone running an OP client or another such case of wanting to download and decompress lots of filelists. bzip2 compression and decompression times are nigh-symmetrical, so this could actually trigger some painful CPU usage on slow machines. However, unlike the compression done automatically, this is entirely voluntary on the part of that user: if their computer is unable to handle such things, they're free not to download and decompress hundreds of megabytes of filelists. It'd probably blow DC++'s heap anyway.
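The dual-compression transition cost mentioned in the first bullet is small in code as well; a sketch, with illustrative filenames:

Code: Select all

# Serve both formats during a transition: write .bz2 for old clients and
# .gz for new ones from the same raw filelist.
import bz2, gzip

raw = file('files.xml', 'rb').read()
file('files.xml.bz2', 'wb').write(bz2.compress(raw, 9))
gz = gzip.open('files.xml.gz', 'wb', 5)
gz.write(raw)
gz.close()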
Regarding Tiger implementations on the GPU specifically, I don't know, but people have implemented MD6 and probably other hash algorithms on GPUs. This guy muses a bit about it, but without concluding much or apparently doing anything with it.

Also, generically, I haven't seen a system in years where the hash algorithm wasn't outrunning the disk by a factor of at least 2. I'm sure that with nice RAID arrays or, more recently, SSDs, that might change, though - what have people's experiences been hashing nontrivially-sized shares from SSDs?

I'd assume they'd usually be paired with pretty fast, modern CPUs, so the TTH algorithm would remain noticeably faster. Even on a 1.83GHz Intel Core 2 under Windows Vista in 32-bit mode, Crypto++ benchmarks put Tiger at 214MiB/s, whereas even the fastest SSDs only slightly exceed that rate. Newer CPUs should do much better. It's possible that putting TTH on the GPU would help, but I'm not sure it'd be preferable overall.
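As a rough way to check the hash-versus-disk claim on one's own machine: Python's hashlib has no Tiger, so SHA-1 stands in below purely for an order-of-magnitude comparison against one's disk read rate; the filename is illustrative:

Code: Select all

# Measure in-memory hash throughput; compare the result against the raw
# read rate of the disk or SSD holding the share.
import hashlib
from time import time

raw = file('files.xml', 'rb').read()    # read first, so only hashing is timed
begin = time()
hashlib.sha1(raw).hexdigest()
elapsed = time() - begin
print 'hash throughput: %.1f MiB/s'%(len(raw)/(1024.0*1024.0)/elapsed)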

Interesting that Tiger is going to be submitted in the NIST rounds - it will be nice to see what sort of cryptanalysis gets done on it and how strong it looks. It is something of a problem that, though enough cryptographers have attacked it to give some sense of its strength, it's not enough of a mainstream hash to have lots of cryptanalytic results, positive or negative. Anything to encourage that is welcome.

Re: A potential bz2 to xz filelist/hublist format transition

Posted: 14 Sep 2010, 12:46
by darkKlor
asked the bloke that created Tiger if he thought it would do any good and all he bothered saying was that he submitting Tiger as part of the current NIST competition.
crap, sorry typing fail... should read: 'he is not submitting Tiger' :(

Re: A potential bz2 to xz filelist/hublist format transition

Posted: 17 Sep 2010, 15:29
by Quicksilver
Tiger is not good enough for SHA-3.

It has too few rounds / gives people headaches...
And it's slower than the other functions submitted there...

Re: A potential bz2 to xz filelist/hublist format transition

Posted: 31 Oct 2010, 23:56
by cologic
I wrote a script to automate all of this (parts of it weren't automated before), added support for running FreeArc, and have started it running. It's not estimated to complete for several days (it runs each compression configuration on each filelist three times to smooth out timing quirks, is quite exhaustive with regard to such configurations - e.g. 24 for FreeArc - and is running on a 7-year-old CPU), but for now, here it is:

Code: Select all

#!/usr/bin/python

def time(func):
    # On Debian Linux, clock() has resolution at best of 0.01 seconds. time()
    # appears to resolve reliably in the microsecond range and certainly in the
    # millisecond time domain that small zlib compression calls invoke. Calling
    # an empty function, this uses up to 4 microseconds on an Athlon XP 2.1GHz,
    # safely orders of magnitude less than even a quick zlib.compress call.

    # On Windows, a different tradeoff exists and one might want to switch from
    # time.time() to time.clock(); per http://docs.python.org/library/time.html
    # the latter achieves microsecond accuracy on that platform. Finally, while
    # clock() variously measures wall-clock time or execution time depending upon
    # the operating system, time.time() always measures wall time.

    # See http://mail.python.org/pipermail/python-list/2007-January/1121263.html
    # and http://stackoverflow.com/questions/85451/python-time-clock-vs-time-time-accuracy
    # for more information.
    from time import time
    begin = time()
    func(); func(); ret = func()
    return (time() - begin)/3, ret

# Map input strings to output (compressed) lengths.
def subprocess_invoke(args, stdin):
    # Generic adapter to capture size of stdout. This probably adds some timing
    # overhead, but that's unavoidable (since Python doesn't have libraries for
    # e.g. freearc or xmill)
    from subprocess import Popen, PIPE
    process = Popen(args, stdin=PIPE, stdout=PIPE, stderr=None, shell=False)

    # Without this, some external programs hang on .wait()
    output_len = len(process.communicate(stdin)[0])
    process.stdin.close()

    assert process.wait() == 0
    return output_len

def gzip_size(raw, level):
    from zlib import compress
    return len(compress(raw, level))

def bzip2_size(raw, level):
    from bz2 import compress
    return len(compress(raw, level))

def xz_size(raw, level):
    return subprocess_invoke(['xz', '--stdout', '-%d'%level], raw)

def xz_extreme_size(raw, level):
    return subprocess_invoke(['xz', '--stdout', '-%d'%level, '--extreme'], raw)

def freearc_size(pathname, level):
    # Tested using version 0.666 generic Linux binary from website

    # FreeArc appears not to support stdin/out, so this becomes ugly.
    # The RuntimeWarning about tempnam being a security vulnerability is
    # legitimate but irrelevant here.
    from os import tempnam, stat, unlink
    from stat import ST_SIZE
    archive_name = tempnam()+'.arc'
    # Better not have any other 'arc' executables earlier in the path...
    subprocess_invoke(['arc', 'a', '-m%s'%level, '-lc', archive_name, pathname], '')
    size = stat(archive_name)[ST_SIZE]
    unlink(archive_name)
    return size

def xmill_zlib_size(raw, level):
    # Tested using version 0.91, compiled with -O2 -fomit-frame-pointer on g++ 4.3.2
    return subprocess_invoke(['xcmill', '-c', '-z', '-%d'%level], raw)

def xmill_bzip2_size(raw):
    # Tested using version 0.91, compiled with -O2 -fomit-frame-pointer on g++ 4.3.2
    return subprocess_invoke(['xcmill', '-c'], raw)

# Unavailable compressors can be elided.
compressors = [('raw', lambda _,__:len(_)), ('status_quo', lambda _,__:bzip2_size(_, 9))]
compressors += [('zlib_%d'%i, (lambda i:(lambda _,__:gzip_size(_, i)))(i)) for i in xrange(1, 10)]
compressors += [('bzip2_%d'%i, (lambda i:(lambda _,__:bzip2_size(_, i)))(i)) for i in xrange(1, 10)]
compressors += [('xmill_zlib_%d'%i, (lambda i:(lambda _,__:xmill_zlib_size(_, i)))(i)) for i in xrange(1,10)]
compressors += [('xmill_bzip2', lambda _,__:xmill_bzip2_size(_))]

# My 1GiB RAM test machine can't use the -9 settings for these given the other memory overhead.
compressors += [('xz_%d'%i, (lambda i:(lambda _,__:xz_size(_, i)))(i)) for i in xrange(1, 9)]
compressors += [('xz_e_%d'%i, (lambda i:(lambda _,__:xz_extreme_size(_, i)))(i)) for i in xrange(1, 9)]
compressors += [('freearc_%dx'%i, (lambda i:(lambda _,__:freearc_size(__, str(i)+'x')))(i)) for i in xrange(1, 9)]
compressors += [('freearc_%d'%i, (lambda i:(lambda _,__:freearc_size(__, str(i))))(i)) for i in xrange(1, 9)]
compressors += [('freearc_%dp'%i, (lambda i:(lambda _,__:freearc_size(__, str(i)+'p')))(i)) for i in xrange(1, 9)]

# Write as it's being collected, to allow some additional robustness to interruption.
def collect_data(raw_dir):
    from os import path, listdir

    for filename in listdir(raw_dir):
        assert filename.endswith('.xml')
        basename = filename[:-4]
        pathname = path.join(raw_dir, filename)

        # One must have enough RAM to hold both the
        # uncompressed filelist and the compressed filelist.
        raw = file(pathname, 'rb').read()

        comp_stats_file = [basename]
        # Iterate through compressors, accumulate results
        for comp_name, comp_size_func in compressors:
            comp_time, comp_size = time(lambda:comp_size_func(raw, pathname))
            comp_stats_file += [comp_size, comp_time]
        yield comp_stats_file

def dir_size(raw_dir):
    from os import path, listdir, stat
    from stat import ST_SIZE

    raw_sizes = [stat(path.join(raw_dir, filename))[ST_SIZE]
                 for filename in listdir(raw_dir) if filename.endswith('.xml')]
    return sum(raw_sizes), len(raw_sizes)


def fmt_int_padding(num, denom):
    n_l, d_l = map(lambda _:len(str(_)), [num, denom])
    assert num >= 0 and denom >= 0 and num <= denom and n_l <= d_l
    return '%s%d'%(' '*(d_l - n_l), num)

def fmt_int_ratio(num, denom):
    return '%s/%d'%(fmt_int_padding(num, denom), denom)

def fmt_percent(x):
    return '%5.2f%%'%(x*100)

def write_csv(raw_dir):
    from csv import writer
    from time import time
    from datetime import datetime, timedelta

    f = file('compressed_comparison.csv', 'wb')
    csv_file = writer(f)

    summary_line = ['filelist_name']
    for compressor_name, _ in compressors:
        summary_line += ['%s_size'%compressor_name, '%s_time'%compressor_name]
    csv_file.writerow(summary_line)

    total_raw_size, total_files = dir_size(raw_dir)
    finished_files = 0
    finished_raw_size = 0
    start_time = time()

    for comp_stats_file in collect_data(raw_dir):
        basename = comp_stats_file[0]
        # For this to work, 'raw' must be the first compressor. Could remove that
        # dependency, but *shrug*.
        finished_files += 1
        finished_raw_size += comp_stats_file[1]

        # This is just linear extrapolation. It should work fairly well.
        consumed_time = time() - start_time
        remaining_raw_size = total_raw_size - finished_raw_size
        est_remaining_time = consumed_time*(float(remaining_raw_size)/finished_raw_size)
        est_completion = datetime.now() + timedelta(0, est_remaining_time)

        fmt = '%s files; %s (%s) bytes; est. done %s (%s)'
        print fmt%(fmt_int_ratio(finished_files, total_files),
                   fmt_int_padding(finished_raw_size, total_raw_size),
                   fmt_percent(float(finished_raw_size)/total_raw_size),
                   est_completion.strftime('%m-%d %H:%M'),
                   basename.split('.')[0])

        csv_file.writerow(comp_stats_file)

    f.close()

write_csv('raw')
When it's done, I'll have a big .csv and any analysis thus far suggested (including QS's per-filelist-size, etc) can be run on that.
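As a preview of that analysis, summing the per-compressor columns back out of the CSV is straightforward; a sketch that follows the column layout write_csv produces above:

Code: Select all

# Total bytes and seconds per compressor, summed over all filelists.
# Columns alternate <name>_size, <name>_time after the filelist name.
from csv import reader

f = file('compressed_comparison.csv', 'rb')
rows = reader(f)
header = rows.next()

totals = {}
for row in rows:
    for i in xrange(1, len(header), 2):
        name = header[i][:-len('_size')]
        size, seconds = totals.get(name, (0, 0.0))
        totals[name] = (size + int(row[i]), seconds + float(row[i + 1]))
f.close()

for name, (size, seconds) in sorted(totals.items()):
    print '%-16s %12d bytes %10.2f s'%(name, size, seconds)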

Re: A potential bz2 to xz filelist/hublist format transition

Posted: 08 Nov 2010, 23:19
by cologic
I've committed an updated version of this script, a version which actually completed (see the giant comment block starting with "My 1GiB RAM test machine ..." for details), to the ADC project repository.

I haven't run any analysis on it yet; it only finished yesterday, because I had to restart it after about 3 days' running due to the memory issues alluded to in the aforementioned comment block.