Hash set extension

Pretorian
Site Admin
Posts: 214
Joined: 21 Jul 2009, 10:21

Post by Pretorian » 10 Aug 2009, 23:19

BASE does not allow multiple items to share one distinct hash. The intention here is to allow automatic searches and matching for a set of files or even directory structures. The following proposal adds hash sets.

The file list needs to be modified so that it can contain a hash set attribute. The attribute holds a hash in the same way as the TTH attribute does for files (if TTH is the chosen algorithm), but is named "hashset". The previously negotiated hash method should be assumed. That means that the Directory element may contain a 'hashset' attribute and the File element may contain a 'hashset' attribute. If file list size is a concern, the shorter name "set" should be used instead.
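As a rough illustration of the paragraph above, the following sketch builds a tiny file list fragment in which both a Directory and its File elements carry a 'hashset' attribute. The Name, Size and TTH attributes follow the usual ADC file list layout; the assumption that a File's 'hashset' holds the hash of the set it belongs to, and every value shown, are placeholders for illustration only.

```python
import xml.etree.ElementTree as ET

# All names, sizes and hash strings below are placeholders, not real values.
SET_HASH = "PLACEHOLDERSETHASH"

listing = ET.Element("FileListing")
directory = ET.SubElement(listing, "Directory", {"Name": "text", "hashset": SET_HASH})
for name, tth in (("file1.txt", "PLACEHOLDERTTH1"), ("file2.txt", "PLACEHOLDERTTH2")):
    # Assumption: each member file keeps its own TTH and also names its set.
    ET.SubElement(
        directory,
        "File",
        {"Name": name, "Size": "0", "TTH": tth, "hashset": SET_HASH},
    )

xml_text = ET.tostring(listing, encoding="unicode")
print(xml_text)
```

If the shorter attribute name is preferred, "hashset" would simply be replaced by "set" in the dictionaries above.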

There are various sources a hash set can be created from;
  1. Selected files' names concatenated.
  2. Selected files' names and their absolute paths concatenated (in terms of path in the directory tree of the file list).
  3. Selected files' hashes concatenated.
  4. Selected files' content concatenated.
  5. Randomly generated data.
  6. A combination of the above items.
Now, while all of these items will surely create enough unique content to base the hash set on, some are more practical than others. If the set is to be part of a Directory, "selected files" above is to be read as "the files in the directory" as well as the "directory name".

Investigation;
  1. Likely to be sufficiently unique, especially with a large number of files.
  2. While more unique than the previous item, it may fail to match if two users have the same two files "file1.txt" and "file2.txt" but one has them at "shared/text" and the other has them at "downloaded/text".
  3. Likely to be more unique than item 1, as these values are based on the files' content.
  4. While likely to be as unique as item 3, it is not viable to (re-)hash that much content.
  5. Not likely to be unique unless a good seed is used. The random range needs to be sufficiently large to generate a trustworthy hash.
  6. Items 1, 3 and 4 are likely to be similar in their uniqueness (or at least show only slight variations). Therefore, it is unlikely that any combination will yield a more unique hash.
Conclusion;
It seems items 1, 3 and 4 are similarly unique. Item 4 is unlikely to be used due to its inefficiency. Between items 1 and 3, item 3 should yield the more unique base data. Therefore;
Selected files' hashes shall be concatenated.
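To make the difference between item 1 and item 3 concrete, here is a minimal sketch comparing a set hash built from file names with one built from file hashes. SHA-256 is used purely as a stand-in for the negotiated hash (TTH is not in Python's standard library), and the function names and sample data are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    # Stand-in for the negotiated hash method (e.g. TTH); an assumption.
    return hashlib.sha256(data).hexdigest()


def set_from_names(files: dict) -> str:
    # Item 1: concatenate the selected files' names.
    return digest("".join(files).encode("utf-8"))


def set_from_hashes(files: dict) -> str:
    # Item 3: concatenate the selected files' hashes.
    joined = "".join(digest(content) for content in files.values())
    return digest(joined.encode("ascii"))


user_a = {"file1.txt": b"hello", "file2.txt": b"world"}
user_b = {"file1.txt": b"hello", "file2.txt": b"WORLD"}  # same names, other content

print(set_from_names(user_a) == set_from_names(user_b))    # True
print(set_from_hashes(user_a) == set_from_hashes(user_b))  # False
```

The name-based variant cannot tell the two users' sets apart, while the hash-based one can, which is the reasoning behind preferring item 3.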

Pietry
Senior Member
Posts: 328
Joined: 04 Dec 2007, 07:25
Location: Bucharest
Contact:

Re: Hash set extension

Post by Pietry » 11 Aug 2009, 07:19

I agree that the hash set should include the hashes of all files, and perhaps a master hash for all the files.
This extension can bring torrent-like functionality: one could search for and download a whole pack of files, such as a complete directory holding a distribution. This has been talked about on DCDev if I recall correctly, and it's good if this extension brings this functionality to ADC.
Just someone

arnetheduck
Newbie
Posts: 8
Joined: 17 Mar 2009, 13:37

Re: Hash set extension

Post by arnetheduck » 19 Aug 2009, 18:36

while it's up, there was some talk about adding out-of-hash files to the set as well, i e the files that contribute to the hash would be a subset of the whole set of files so that non-critical files or files that change (sfv for example?) don't modify the set hash...

en_dator
Member
Posts: 72
Joined: 01 Apr 2008, 19:24

Re: Hash set extension

Post by en_dator » 19 Aug 2009, 19:47

then filename should not be ^filename$ but instead ^(relativepath)?filename$

as there is a need for the hashset to be able to contain a directory structure.
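One possible concrete reading of ^(relativepath)?filename$ is sketched below. The pattern, the "/" separator and the helper name split_entry are assumptions made for illustration; the thread does not define them.

```python
import re

# Hypothetical concrete form of ^(relativepath)?filename$:
# zero or more "segment/" parts followed by a name without slashes.
ENTRY = re.compile(r"^(?P<relativepath>(?:[^/]+/)+)?(?P<filename>[^/]+)$")


def split_entry(entry: str) -> tuple:
    """Split a set entry into (relative path, file name)."""
    match = ENTRY.match(entry)
    if match is None:
        raise ValueError(f"not a relative file entry: {entry!r}")
    return match.group("relativepath") or "", match.group("filename")


print(split_entry("readme.txt"))           # ('', 'readme.txt')
print(split_entry("docs/api/index.html"))  # ('docs/api/', 'index.html')
```

An entry without a path still matches, so plain file names remain valid while nested directories can be expressed.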

Pretorian
Site Admin
Posts: 214
Joined: 21 Jul 2009, 10:21

Re: Hash set extension

Post by Pretorian » 31 May 2010, 15:46

I would like to make the following addendum to the specification;

Signal feature HSET so the other client knows you support this extension.

The attribute set shall be used for all files that are in the hash set generation. The hash set is generated by concatenating all hashes in an ascending string order.

The attribute setex (set generation excluded) shall be used for all files that are in a given set but are NOT part of the hash set generation.

I.e., if files A, B and C shall be included in a set, but a varying C shall be allowed, then set hash == hash(hash of A + hash of B) or set hash == hash(hash of B + hash of A) depending on string order. Files A and B shall include the attribute set and file C shall include the attribute setex.
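The addendum can be sketched roughly as follows. This is not a confirmed implementation: SHA-256 stands in for the negotiated hash, the hashes are concatenated in their textual form, and "ascending string order" is taken to mean Python's default string ordering. The thread does not settle any of these points.

```python
import hashlib


def digest(data: bytes) -> str:
    # Stand-in for the negotiated hash method (e.g. TTH); an assumption.
    return hashlib.sha256(data).hexdigest()


def set_hash(files) -> str:
    """files: iterable of (file_hash, attribute), attribute "set" or "setex"."""
    # Only files marked "set" contribute; "setex" files are skipped.
    included = sorted(h for h, attribute in files if attribute == "set")
    return digest("".join(included).encode("ascii"))


hash_a, hash_b = digest(b"content of A"), digest(b"content of B")
c_old, c_new = digest(b"C, version 1"), digest(b"C, version 2")

before = set_hash([(hash_a, "set"), (hash_b, "set"), (c_old, "setex")])
after = set_hash([(hash_b, "set"), (c_new, "setex"), (hash_a, "set")])
print(before == after)  # True
```

Because the hashes are sorted before concatenation and C is excluded, neither the listing order nor a change to C alters the set hash.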

iOCTAGRAM
Newbie
Posts: 1
Joined: 21 Oct 2017, 18:19

Re: Hash set extension

Post by iOCTAGRAM » 21 Oct 2017, 18:53

Is there a precise algorithm description available so I can reimplement it? I know it's an ancient and obsolete misfeature abandoned in favor of dcls, but I need exactly it.

I've got a folder TTH response from FlyLink DC r503-x64-19663 for a TTH search request, and I need to repeat the calculations with debug prints. I saved the complete remote filelist and am trying to independently recalculate the folder TTH, but no matter what I try it doesn't match.

This is so ancient I don't know who may recall the precise algorithm. I recall something about sorting, concatenating and hashing. I tried to sort or not sort hashes. I tried to reverse endianness or not reverse endianness. I tried to reverse endianness before or after sorting. I tried raw TIGER or TTH. I just can't make it match.
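The combinations described above can be enumerated systematically, as in this sketch. It only builds the candidate concatenations; the final hashing step (raw Tiger or TTH) is left out because neither is in Python's standard library. Reading "reverse endianness" as byte reversal within each 8-byte word is an assumption, and the sample values are placeholders rather than real hashes.

```python
from itertools import product


def reverse_words(value: bytes, width: int = 8) -> bytes:
    # Assumed meaning of "reverse endianness": flip bytes in each 8-byte word.
    return b"".join(value[i:i + width][::-1] for i in range(0, len(value), width))


def candidate_inputs(hashes):
    """Yield ((sorted?, swap mode), concatenated bytes) for each variant."""
    for do_sort, swap in product((False, True), ("none", "before", "after")):
        if not do_sort and swap == "after":
            continue  # identical to "before" when nothing is sorted
        items = list(hashes)
        if swap == "before":
            items = [reverse_words(h) for h in items]
        if do_sort:
            items.sort()
        if swap == "after":
            items = [reverse_words(h) for h in items]
        yield (do_sort, swap), b"".join(items)


# Placeholder 24-byte values (the size of a Tiger digest), not real hashes.
sample = [bytes(range(24, 48)), bytes(range(24))]
candidates = dict(candidate_inputs(sample))
print(len(candidates))  # 5
```

Each candidate would then be fed to the hash function under test and compared with the folder TTH received from the remote client.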

I downloaded the FlyLink sources. Surprisingly, I can't even find where the TTH gets calculated for folders. Everything is so asynchronous; I spent several hours, but I still can't see the code responsible for it. I tried with the StrongDC sources, no luck. But it didn't originate in FlyLinkDC; it was inherited long ago, and then nobody touched this code. Can anybody point me to where exactly in the sources it's implemented?