The incorrectness of BLOM implementations

Here is the sub forum used for talking about ideas, implementations and suggestions or typical guidelines.

Further info on extension or the protocol is found at our Wiki
Locked
Ayo
Junior Member
Posts: 27
Joined: 23 Feb 2011, 13:50

The incorrectness of BLOM implementations

Post by Ayo » 11 Oct 2012, 12:51

As I was implementing BLOM for ncdc, I compared my results with the test vectors of Jucy (provided by Quicksilver) and of DC++, as provided by poy. I figured out why DC++ and Jucy give different results for the last test vector, and also figured out that both results are actually incorrect.

The algorithm, according to the spec:
The client constructs the bloom filter by creating a bit array of m bits set to 0 (zero). For each file it then sets to "1" k positions constructed from the file hash. Seeing the file hash as a stream of bits (starting from the lowest bit of the first byte, ending at the highest bit of the last byte), the client should use h bits starting at the first bit of the first byte to create an integer and apply modulo m to get the position in the bit array, then redo the process k times advancing the starting position by h each time.
The bold part is where the problem lies.

Pseudocode of what Jucy does:

Code: Select all

index = 0;
for(i=0; i<k; i++)
  index += is_bit_in_hash_set(i) << i;
And the same for DC++

Code: Select all

index = 0;
for(i=0; i<k; i++)
  index |= is_bit_in_hash_set(i) << i;
Both are logically correct. However, in both implementations, the shift operation is performed with 32 bit arithmetic, whereas k (and thus, i) may be larger than 32. Since the resulting value of the incorrect shift has more than one bit set, + and | give different values for 'index' and thus result in different bloom filters. The fix is to ensure that the shift operation is performed with 64 bit arithmetic, in which case the last test vector gives the following filter:

Code: Select all

AAAAAAACAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAABAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
The practical implication of this bug is that, if a hub requests a bloom filter with h>32, no client will currently respond with a correct filter. Are there any hubs that request a filter with h>32? And if so, do they rely on the buggy behaviour?

poy
Member
Posts: 78
Joined: 26 Nov 2008, 17:04

Re: The incorrectness of BLOM implementations

Post by poy » 11 Oct 2012, 16:49

good catch, i've fixed it in rev 3074 of DC++ and rev 619 of ADCH++.

note that ADCH++ is currently not affected as it hardcodes h to be 24.

fortunately, there isn't any other hub that implements BLOM as far as i'm aware. :) and even if they did, they'd probably set h=24 anyway as it's the value recommended in the spec.

klondike
Member
Posts: 73
Joined: 14 Nov 2010, 13:06

Re: The incorrectness of BLOM implementations

Post by klondike » 11 Oct 2012, 16:59

Another thing I have realized on the code is that you store the data in a vector<bool> this data type is not guaranted to use a single bit for each bool and usually uses a byte for each bool value. You'd probably want to give a look to this one instead http://www.boost.org/doc/libs/1_51_0/li ... itset.html Or a bit_vector http://www.sgi.com/tech/stl/bit_vector.html

poy
Member
Posts: 78
Joined: 26 Nov 2008, 17:04

Re: The incorrectness of BLOM implementations

Post by poy » 11 Oct 2012, 21:48

vector<bool> is specialized in C++, see section 23.3.7 of <http://www.open-std.org/jtc1/sc22/wg21/ ... /n3337.pdf>.

klondike
Member
Posts: 73
Joined: 14 Nov 2010, 13:06

Re: The incorrectness of BLOM implementations

Post by klondike » 11 Oct 2012, 23:06

It's not specialized on all the C++ implementations in fact I recall it isn't on gcc ones.
3 There is no requirement that the data be stored as a contiguous allocation of bool values. A space-optimized representation of bits is recommended instead.
bit_vector and dynamic_bitset instead should be. But meh, not that important on client side (maybe a little on server side though).

Ayo
Junior Member
Posts: 27
Joined: 23 Feb 2011, 13:50

Re: The incorrectness of BLOM implementations

Post by Ayo » 12 Oct 2012, 06:26

poy wrote:good catch, i've fixed it in rev 3074 of DC++ and rev 619 of ADCH++.
Thanks for the quick fix.
poy wrote:note that ADCH++ is currently not affected as it hardcodes h to be 24.

fortunately, there isn't any other hub that implements BLOM as far as i'm aware. :) and even if they did, they'd probably set h=24 anyway as it's the value recommended in the spec.
Glad to hear that, that means we don't have to handle any incompatibilities on the client side. Future hubs that wish to use h>32 will have to deal with it, though.

I've pushed BLOM to the ncdc repo yesterday and the feature is available since version 1.13-2-g5e91e52, the implementation is quite small and available at http://g.blicky.net/ncdc.git/tree/src/bloom.c

poy
Member
Posts: 78
Joined: 26 Nov 2008, 17:04

Re: The incorrectness of BLOM implementations

Post by poy » 12 Oct 2012, 12:17

klondike wrote:It's not specialized on all the C++ implementations in fact I recall it isn't on gcc ones.
3 There is no requirement that the data be stored as a contiguous allocation of bool values. A space-optimized representation of bits is recommended instead.
bit_vector and dynamic_bitset instead should be. But meh, not that important on client side (maybe a little on server side though).
it's just a set of bits masked together as ints in GCC as far as i can tell - see bits/stl_bvector.hpp.
Ayo wrote:Glad to hear that, that means we don't have to handle any incompatibilities on the client side. Future hubs that wish to use h>32 will have to deal with it, though.
somehow i can foresee problems with m overflowing when h goes above 32 but we'll see when that happens... :)

Quicksilver
Member
Posts: 56
Joined: 17 Aug 2009, 21:32

Re: The incorrectness of BLOM implementations

Post by Quicksilver » 21 Oct 2012, 13:51

For the record: Jucy 0.86 fixed that problem for h between 32 and 64.

Locked