
Re: Segmented downloading : Why the big deal?

Posted: 01 Jan 2011, 21:44
by arnetheduck
Actually, you don't need any "special" tools - just plot timings of a read or write loop to see how long it takes to process a file with and without segments, or reread a file multiple times...no rocket science there. Then do another loop that opens and closes said file repeatedly and plot how many iterations you can do in a minute (try looking at a clock for a full 60 seconds to get a good feeling of just how long that is...) to measure how "expensive" that operation is. Hypothesis is a strong word here; I'd say it's closer to relaying rumor or hearsay - like telling a good story or reciting the Bible.
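The open/close measurement, for example, is only a few lines of C++ (the file name below is just a placeholder for whatever existing file you want to hammer):

Code:
#include <chrono>
#include <cstdio>
#include <iostream>

int main() {
    // Count how many open/close cycles of an existing file finish in one minute.
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    long long cycles = 0;
    while (clock::now() - start < std::chrono::minutes(1)) {
        std::FILE* f = std::fopen("testfile.bin", "rb"); // placeholder path
        if (!f) { std::perror("fopen"); return 1; }
        std::fclose(f);
        ++cycles;
    }
    std::cout << "open/close cycles per minute: " << cycles << '\n';
    return 0;
}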

Re: Segmented downloading : Why the big deal?

Posted: 02 Jan 2011, 13:18
by Quicksilver
@andyhhp I would disagree with you here. Doing the job of the OS in a program is an inner-platform effect antipattern.
This is something that should never be done without good reason! Here might be a good reason, though...


@arnetheduck
Hypothesis is exactly the correct word for an unproven idea. I avoided the word theory as that would have been too strong, but hypothesis seems exactly right. I even stated the two necessary preconditions:
1. an implementation that closes the file.
2. a caching algorithm that reacts to file closing in a plausible way.
Sounds perfectly fine for a hypothesis...

The measurements you are talking about seem to require modification of the source code, so I see this as a job rather for you or any other DC++ devs. I just wanted to provide a plausible explanation of how segmented downloading could be bad for HDDs, if a careless implementation and a bad caching algorithm came together. I see segmented downloads as a must for any client out there, with no way around it. And now please stop trying to make me look foolish with comparisons to the Bible and rumor relaying.

Re: Segmented downloading : Why the big deal?

Posted: 02 Jan 2011, 16:03
by arnetheduck
But it's not even very plausible...why would it evict perfectly good data from the cache when it doesn't need to? In fact, a file that's been opened and closed is very likely to be opened again soon...consider DLLs, consider source code files being recompiled, consider MP3s on replay, browser cache files (same images being reloaded)...etc etc...

Hearsay because you're repeating what others are saying without any additional facts...and you're certainly capable of doing better, for example by doing said experiment; you know how to program just as well as I do...

Re: Segmented downloading : Why the big deal?

Posted: 02 Jan 2011, 17:03
by andyhhp
@Quicksilver: I didn't wish to imply that I thought application buffering and operating-system-level buffering together was a good idea. I just wished to state that write buffering itself (irrespective of whether it is implemented at the application or OS level) is a good idea. I would completely agree that buffering at both the application and OS level is a bad idea.

Windows at least allows you to pass flags to specify what sort of buffering you would like the operating system to do, including "don't do any buffering for me - I will do it myself". The C standard provides the setvbuf/setbuf functions in stdio.h to alter FILE* buffering. (I am not certain, but I believe the WinAPI flags just use the stdio functions behind the scenes; it has been a long time since I researched the topic.)
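As an illustration only (the path is made up, and FILE_FLAG_NO_BUFFERING additionally requires sector-aligned reads and writes):

Code:
#include <windows.h>
#include <cstdio>

int main() {
    // WinAPI level: ask the OS to skip its own caching entirely.
    HANDLE h = CreateFileA("D:\\incomplete\\part.bin", GENERIC_WRITE, 0, NULL,
                           OPEN_ALWAYS,
                           FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, NULL);
    if (h != INVALID_HANDLE_VALUE)
        CloseHandle(h);

    // C stdio level: turn off the FILE* buffer with setvbuf.
    std::FILE* f = std::fopen("D:\\incomplete\\part.bin", "wb");
    if (f) {
        setvbuf(f, NULL, _IONBF, 0); // _IONBF = unbuffered
        std::fclose(f);
    }
    return 0;
}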

@arnetheduck: The problem with your argument is the definition of "when it doesn't need to". The argument of temporal locality of data does not work for file handles. If I call close() on a file handle, I am telling the operating system that I am truly and utterly done with the file. It is perfectly reasonable for the OS to use this as an indication to free up the cache. Also, how much of a file do you expect the OS to cache? There is no way that an OS is going to cache all of a 700MB file in memory, even on a machine with 4GB of RAM. The OS is constantly looking for any excuse to free up areas of its cache so the memory can be given to other applications without them taking as many page faults.

As for your examples:

DLLs: The common DLLs are resident in memory for all processes and are demand paged into each process's address space (which is trivial kernel overhead and no disk activity).

Source code files: This is why GCC only outputs the intermediate files if you specifically request them. Otherwise, they are just kept in internal buffers in memory (taking the "let me buffer it myself" approach).

MP3s: It is the job of the media application to cache the file handles, especially if the track is on repeat. That way, any caching won't be flushed. The same argument applies to browsers with cached content, except that this content tends to get cached in memory. This is one reason why browsers have huge memory usage in comparison to other applications.

~Andrew

Re: Segmented downloading : Why the big deal?

Posted: 02 Jan 2011, 18:31
by Big Muscle
Flow84 wrote:
Quicksilver wrote: Solution: 1. Change the implementation of DC++ to not close the file, or
2. enlarge the segment size...
I have tested solution one and a dynamic segment size in FlowLib. (Thanks, Hackward, for giving me some pointers.)
Solution one gave me a huge performance improvement :)

What I do (I don't know if this solution is checked into SVN) is have a global file handler.
When a user-to-user connection starts, I set the segment size to 1 MiB.
When I receive data to save, I call Write in the global file handler.
If the file is not already open, I open it and add an object (including the file handle and a last-used timestamp) to a list.
Then I lock the specific part of the file I want to write to and write that part.
Then I update the last-used timestamp.
The global file handler has a thread that tries to close unused files (not used for X seconds).
I also have a trigger on file completion (yes, I know when I have all the content of the file) that forces the file handle closed when the file is completed.
It could be very similar when using a memory-mapped file: you open one global file mapping handle and then only open/close views for each segment. The global file handle is closed when the file is finished. But this is mainly about downloading - what about the situation on the uploader's side?
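Roughly like this, as a sketch (it assumes the mapping was created up front with CreateFileMapping over the full file size, and that segment offsets are aligned to the allocation granularity):

Code:
#include <windows.h>
#include <cstring>

// Write one segment through a short-lived view; the mapping itself stays open
// for the whole download and is only closed when the file is finished.
bool writeSegment(HANDLE mapping, unsigned long long offset,
                  const char* data, size_t len) {
    ULARGE_INTEGER off;
    off.QuadPart = offset; // must be a multiple of the allocation granularity
    char* view = static_cast<char*>(MapViewOfFile(mapping, FILE_MAP_WRITE,
                                                  off.HighPart, off.LowPart, len));
    if (!view)
        return false;
    std::memcpy(view, data, len);
    UnmapViewOfFile(view); // the view goes away, the mapping handle stays open
    return true;
}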

As I'm checking the RevConnect code now, memory-mapped files were used only in its first versions. Later they were replaced with its own implementation of SharedFileStream (i.e. only one global handle used for all segments).
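Something along these lines, as a rough sketch of that shared-handle idea (the names and the 30-second timeout are made up, not RevConnect's actual code, and it assumes the target file was pre-allocated):

Code:
#include <chrono>
#include <cstddef>
#include <cstdint>
#include <fstream>
#include <map>
#include <mutex>
#include <string>

// One stream per path, shared by all segments, closed again after sitting idle.
class SharedFileCache {
    struct Entry {
        std::fstream stream;
        std::chrono::steady_clock::time_point lastUsed;
    };
    std::map<std::string, Entry> files;
    std::mutex lock;

public:
    // Write one segment at the given offset, opening the file on first use.
    void writeSegment(const std::string& path, std::uint64_t offset,
                      const char* data, std::size_t len) {
        std::lock_guard<std::mutex> g(lock);
        Entry& e = files[path];
        if (!e.stream.is_open())
            e.stream.open(path, std::ios::in | std::ios::out | std::ios::binary);
        e.stream.seekp(static_cast<std::streamoff>(offset));
        e.stream.write(data, static_cast<std::streamsize>(len));
        e.lastUsed = std::chrono::steady_clock::now();
    }

    // Call periodically (e.g. from a timer thread) to close idle handles.
    void closeIdle(std::chrono::seconds maxIdle = std::chrono::seconds(30)) {
        std::lock_guard<std::mutex> g(lock);
        const auto now = std::chrono::steady_clock::now();
        for (auto it = files.begin(); it != files.end();) {
            if (now - it->second.lastUsed > maxIdle)
                it = files.erase(it); // the destructor closes the handle
            else
                ++it;
        }
    }
};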

Re: Segmented downloading : Why the big deal?

Posted: 02 Jan 2011, 18:44
by Quicksilver
arnetheduck wrote: Hearsay because you're repeating what others are saying without any additional facts...and you're certainly capable of doing better, for example by doing said experiment; you know how to program just as well as I do...
I have been trying to give an explanation for what others are saying - i.e. given that it's real that segmented downloads are bad for HDDs, can we think up an explanation for it?

And no, arne, I don't know how to program C++, or specifically DC++, as well as you do. For me it would be a journey of hours just to find the responsible part of the source and set up an environment that can compile DC++. It's definitely not worth it for me.
I thought such a problem might exist and coded defensively in my own implementation. I have to, as I want my code to work well under any OS, and like andyhhp I see it as rather plausible that a cache removes closed media files.

Re: Segmented downloading : Why the big deal?

Posted: 02 Jan 2011, 20:13
by Flow84
Big Muscle wrote: What about the situation on the uploader's side?
I don't know if this is perfect, but the global file handler does the same here.
It keeps the file handle for X time if it is not told otherwise.

You could probably add an upper limit on the number of file handles to keep, but I'm not :)

Re: Segmented downloading : Why the big deal?

Posted: 03 Jan 2011, 05:36
by cologic
Quicksilver wrote: given that it's real that segmented downloads are bad for HDDs...
What is your evidence for this? If you're going to repeat your previous assertions on the topic, see my previous, Latin reply.

Re: Segmented downloading : Why the big deal?

Posted: 03 Jan 2011, 15:36
by Big Muscle
Flow84 wrote:
Big Muscle wrote: What about the situation on the uploader's side?
I don't know if this is perfect, but the global file handler does the same here.
It keeps the file handle for X time if it is not told otherwise.

You could probably add an upper limit on the number of file handles to keep, but I'm not :)
I will bring SharedFileStream back into StrongDC++. It could improve performance at the beginning and at the end of segments. But I don't think it can be used correctly for uploads, because it's not possible to say when the downloader will stop requesting segments, so you don't know when the file can be closed.

RevConnect's code for SharedFileStream can be seen here - http://reverseconnect.cvs.sourceforge.n ... xt%2Fplain

I was also thinking about the FILE_FLAG_RANDOM_ACCESS flag. Does it make sense to use it? Segments are still read sequentially, and it can't be said whether the next requested segment will follow the current one or not, but there is a bigger probability that it will.
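For comparison, the two hints would look something like this (only an illustration; the function name and share mode aren't taken from any client's code):

Code:
#include <windows.h>

// FILE_FLAG_SEQUENTIAL_SCAN hints the cache manager to read ahead aggressively;
// FILE_FLAG_RANDOM_ACCESS hints it not to bother.
HANDLE openForUpload(const char* path, bool expectSequential) {
    DWORD hint = expectSequential ? FILE_FLAG_SEQUENTIAL_SCAN
                                  : FILE_FLAG_RANDOM_ACCESS;
    return CreateFileA(path, GENERIC_READ, FILE_SHARE_READ, NULL,
                       OPEN_EXISTING, hint, NULL);
}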

Re: Segmented downloading : Why the big deal?

Posted: 03 Jan 2011, 18:54
by Quicksilver
cologic wrote:
Quicksilver wrote: given that it's real that segmented downloads are bad for HDDs...
What is your evidence for this? If you're going to repeat your previous assertions on the topic, see my previous, Latin reply.
There is no proof of that... the only evidence is users complaining about more failed HDDs. So the evidence is slim, but it exists.
You can call this explanation a Gedankenexperiment. The whole point of this is that if users complain about something they perceive, we shouldn't dismiss it as impossible when we can come up with a reasonable explanation for such a perception.


Also, to come back to and repeat the point about MTBF hours of HDDs: MTBF figures are presented by the manufacturer without preconditions.
But we know that our users are filesharers, who potentially put more strain on an HDD than that manufacturer's average user. I know my math there was a rough estimation, though given data from more and larger hubs I imagine we could come up with an MTTF for filesharers that would be much more meaningful for us than anything a manufacturer provides. Counting both hubs which forbid segmenting and normal hubs, we might even get an empirical check of the hypothesis. But of course this seems like a lot of work compared to measuring the caching with a modified DC++ version.