Page 1 of 1

The incorrectness of BLOM implementations

Posted: 11 Oct 2012, 12:51
by Ayo
As I was implementing BLOM for ncdc, I compared my results with the test vectors of Jucy (provided by Quicksilver) and of DC++, as provided by poy. I figured out why DC++ and Jucy give different results for the last test vector, and also figured out that both results are actually incorrect.

The algorithm, according to the spec:
The client constructs the bloom filter by creating a bit array of m bits set to 0 (zero). For each file it then sets to "1" k positions constructed from the file hash. Seeing the file hash as a stream of bits (starting from the lowest bit of the first byte, ending at the highest bit of the last byte), the client should use h bits starting at the first bit of the first byte to create an integer and apply modulo m to get the position in the bit array, then redo the process k times advancing the starting position by h each time.
The bold part is where the problem lies.

Pseudocode of what Jucy does:

Code: Select all

index = 0;
for(i=0; i<k; i++)
  index += is_bit_in_hash_set(i) << i;
And the same for DC++

Code: Select all

index = 0;
for(i=0; i<k; i++)
  index |= is_bit_in_hash_set(i) << i;
Both are logically correct. However, in both implementations, the shift operation is performed with 32 bit arithmetic, whereas k (and thus, i) may be larger than 32. Since the resulting value of the incorrect shift has more than one bit set, + and | give different values for 'index' and thus result in different bloom filters. The fix is to ensure that the shift operation is performed with 64 bit arithmetic, in which case the last test vector gives the following filter:

Code: Select all

AAAAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
The practical implication of this bug is that, if a hub requests a bloom filter with h>32, no client will currently respond with a correct filter. Are there any hubs that request a filter with h>32? And if so, do they rely on the buggy behaviour?

Re: The incorrectness of BLOM implementations

Posted: 11 Oct 2012, 16:49
by poy
good catch, i've fixed it in rev 3074 of DC++ and rev 619 of ADCH++.

note that ADCH++ is currently not affected as it hardcodes h to be 24.

fortunately, there isn't any other hub that implements BLOM as far as i'm aware. :) and even if they did, they'd probably set h=24 anyway as it's the value recommended in the spec.

Re: The incorrectness of BLOM implementations

Posted: 11 Oct 2012, 16:59
by klondike
Another thing I have realized on the code is that you store the data in a vector<bool> this data type is not guaranted to use a single bit for each bool and usually uses a byte for each bool value. You'd probably want to give a look to this one instead http://www.boost.org/doc/libs/1_51_0/li ... itset.html Or a bit_vector http://www.sgi.com/tech/stl/bit_vector.html

Re: The incorrectness of BLOM implementations

Posted: 11 Oct 2012, 21:48
by poy
vector<bool> is specialized in C++, see section 23.3.7 of <http://www.open-std.org/jtc1/sc22/wg21/ ... /n3337.pdf>.

Re: The incorrectness of BLOM implementations

Posted: 11 Oct 2012, 23:06
by klondike
It's not specialized on all the C++ implementations in fact I recall it isn't on gcc ones.
3 There is no requirement that the data be stored as a contiguous allocation of bool values. A space-optimized representation of bits is recommended instead.
bit_vector and dynamic_bitset instead should be. But meh, not that important on client side (maybe a little on server side though).

Re: The incorrectness of BLOM implementations

Posted: 12 Oct 2012, 06:26
by Ayo
poy wrote:good catch, i've fixed it in rev 3074 of DC++ and rev 619 of ADCH++.
Thanks for the quick fix.
poy wrote:note that ADCH++ is currently not affected as it hardcodes h to be 24.

fortunately, there isn't any other hub that implements BLOM as far as i'm aware. :) and even if they did, they'd probably set h=24 anyway as it's the value recommended in the spec.
Glad to hear that, that means we don't have to handle any incompatibilities on the client side. Future hubs that wish to use h>32 will have to deal with it, though.

I've pushed BLOM to the ncdc repo yesterday and the feature is available since version 1.13-2-g5e91e52, the implementation is quite small and available at http://g.blicky.net/ncdc.git/tree/src/bloom.c

Re: The incorrectness of BLOM implementations

Posted: 12 Oct 2012, 12:17
by poy
klondike wrote:It's not specialized on all the C++ implementations in fact I recall it isn't on gcc ones.
3 There is no requirement that the data be stored as a contiguous allocation of bool values. A space-optimized representation of bits is recommended instead.
bit_vector and dynamic_bitset instead should be. But meh, not that important on client side (maybe a little on server side though).
it's just a set of bits masked together as ints in GCC as far as i can tell - see bits/stl_bvector.hpp.
Ayo wrote:Glad to hear that, that means we don't have to handle any incompatibilities on the client side. Future hubs that wish to use h>32 will have to deal with it, though.
somehow i can foresee problems with m overflowing when h goes above 32 but we'll see when that happens... :)

Re: The incorrectness of BLOM implementations

Posted: 21 Oct 2012, 13:51
by Quicksilver
For the record: Jucy 0.86 fixed that problem for h between 32 and 64.