As a *rough* figure, what would you expect the bandwidth of
a disk drive (spinning rust) to do as a function of number of
discrete files being accessed, concurrently?
E.g., if you can monitor the rough throughput of each
stream and sum them, will they sum to 100% of the drive's
bandwidth? 90%? 110%? etc.
[Note that drives have read-ahead and write caches so
the speed of the media might not bleed through to the
application layer. And, filesystem code also throws
a wrench in the works. Assume caching in the system
is disabled/ineffective.]
Said another way, what's a reasonably reliable way of
determining when you are I/O bound by the hardware
and when more threads won't result in more performance?
You know that you can't actually get data off the media faster than the fundamental data rate of the media.
As you mention, cache can give an apparent rate faster than the media, but you
seem to be willing to assume that caching doesn't affect your rate, and each chunk will only be returned once.
Pathological access patterns can reduce this rate dramatically, and the worst case
can result in rates of only a few percent of it if you force significant seeks between each sector read (and overload the buffering so it can't hold larger reads for a given stream).
Non-pathological access can often result in near 100% of the access rate.
The best test of whether you are I/O bound: if the I/O system is constantly in use, and every I/O request has another pending when it finishes, then you are totally I/O bound.
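To make that test concrete: a minimal sketch, assuming Linux, where /proc/diskstats exposes per-device "time spent doing I/O" and "I/Os currently in progress" counters (the device name "sda" below is just a placeholder). If the device is ~100% busy over your sampling window and always has a request in flight, more threads won't buy more throughput.

/* Minimal sketch: estimate disk utilization from /proc/diskstats (Linux).
 * Classic field layout: after major, minor and the device name come 11
 * counters; the 9th is "I/Os currently in progress" and the 10th is
 * "time spent doing I/O" in ms (newer kernels append more fields, which
 * this sketch simply ignores).
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int read_stats(const char *dev, unsigned long *inflight, unsigned long *io_ms)
{
    char line[512], name[64];
    unsigned long v[13];
    FILE *f = fopen("/proc/diskstats", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "%lu %lu %63s %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu",
                   &v[0], &v[1], name, &v[2], &v[3], &v[4], &v[5], &v[6],
                   &v[7], &v[8], &v[9], &v[10], &v[11], &v[12]) == 14 &&
            strcmp(name, dev) == 0) {
            *inflight = v[10];          /* I/Os currently in progress */
            *io_ms    = v[11];          /* ms spent doing I/O since boot */
            fclose(f);
            return 0;
        }
    }
    fclose(f);
    return -1;
}

int main(void)
{
    const char *dev = "sda";            /* placeholder device name */
    unsigned long q0, t0, q1, t1;

    if (read_stats(dev, &q0, &t0))
        return 1;
    sleep(1);                           /* sample over a 1 s window */
    if (read_stats(dev, &q1, &t1))
        return 1;
    printf("%s: ~%.0f%% busy, %lu request(s) in flight\n",
           dev, (t1 - t0) / 10.0, q1);  /* ms busy out of 1000 ms, as a percentage */
    return 0;
}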
On 10/15/2021 17:08, Don Y wrote:
As a *rough* figure, what would you expect the bandwidth of
a disk drive (spinning rust) to do as a function of number of
discrete files being accessed, concurrently?
E.g., if you can monitor the rough throughput of each
stream and sum them, will they sum to 100% of the drive's
bandwidth? 90%? 110%? etc.
[Note that drives have read-ahead and write caches so
the speed of the media might not bleed through to the
application layer. And, filesystem code also throws
a wrench in the works. Assume caching in the system
is disabled/ineffective.]
Said another way, what's a reasonably reliable way of
determining when you are I/O bound by the hardware
and when more threads won't result in more performance?
If caching is disabled things can get really bad quite quickly:
think of updating directory entries to reflect modification/access
dates, file sizes, scattering, etc.; think also of allocation
table accesses.
E.g. in dps on a larger disk partition
(say >100 gigabytes) the first CAT (cluster allocation table)
access after boot takes some noticeable time, a second maybe;
then it stops being noticeable at all as the CAT is updated
rarely and on a modified-area basis only (this on a processor
capable of 20 Mbytes/second) (dps needs the entire CAT to allocate
new space in order to do its (enhanced) worst fit scheme).
IOW, if you torture the disk with constant seeks and scattered
accesses you can slow it down anywhere from somewhat to a lot; it depends
on way too many factors to be worth wondering about.
Said another way, what's a reasonably reliable way of
determining when you are I/O bound by the hardware
and when more threads won't result in more performance?
Just try it out for some time and make your pick. Recently
I did dfs (a distributed file system, over TCP) for dps
and had to watch much of this going on; at some point you
reach something between 50 and 100% of the hardware limit,
depending on the file sizes you copy and whatever other
overhead you can think of.
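The "try it out" measurement can be as small as this: N reader threads, each streaming a different large file and counting bytes, so the per-stream rates can be summed and compared against the drive's single-stream figure. A rough POSIX sketch; the file names are placeholders, O_DIRECT (Linux) is used to keep the page cache from inflating the numbers, and NTHREADS is the knob to sweep.

/* Rough sketch: N threads each stream one file and count bytes, so the
 * per-stream rates can be summed and compared to the drive's
 * single-stream bandwidth.  File names below are placeholders.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define NTHREADS 4
#define BLOCK    (1 << 20)          /* 1 MiB per read, O_DIRECT-aligned */

struct stream { const char *path; double mbytes_per_s; };

static void *reader(void *arg)
{
    struct stream *s = arg;
    void *buf;
    long long total = 0;
    ssize_t n;
    struct timespec t0, t1;
    int fd = open(s->path, O_RDONLY | O_DIRECT);

    if (fd < 0 || posix_memalign(&buf, 4096, BLOCK))
        return NULL;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((n = read(fd, buf, BLOCK)) > 0)      /* stream the whole file */
        total += n;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    s->mbytes_per_s = total / 1e6 / secs;
    free(buf);
    close(fd);
    return NULL;
}

int main(void)
{
    /* hypothetical test files, one per concurrent stream */
    struct stream s[NTHREADS] = {
        { "/data/f0" }, { "/data/f1" }, { "/data/f2" }, { "/data/f3" }
    };
    pthread_t tid[NTHREADS];
    double sum = 0;

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, reader, &s[i]);
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        printf("%s: %.1f MB/s\n", s[i].path, s[i].mbytes_per_s);
        sum += s[i].mbytes_per_s;
    }
    printf("sum of streams: %.1f MB/s\n", sum);
    return 0;
}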
On 10/15/2021 8:46 AM, Dimiter_Popoff wrote:
....
Just try it out for some time and make your pick. Recently
I did dfs (a distributed file system, over TCP) for dps
and had to watch much of this going on; at some point you
reach something between 50 and 100% of the hardware limit,
depending on the file sizes you copy and whatever other
overhead you can think of.
I think the cost of any extra complexity in the algorithm
(to dynamically try to optimize number of threads) is
hard to justify -- given no control over the actual
media. I.e., it seems like it's best to just aim for
"simple" and live with whatever throughput you get...
On 10/15/2021 8:38 AM, Richard Damon wrote:
You know that you can't actually get data off the media faster than
the fundamental data rate of the media.
Yes, but you don't know that rate *and* that rate varies based on
"where" you're accesses land on the physical medium (e.g., ZDR,
shingled drives, etc.)
As you mention, cache can give an apparent rate faster than the media,
but you seem to be willing to assume that caching doesn't affect your
rate, and each chunk will only be returned once.
Cache in the filesystem code will be counterproductive. Cache in
the drive may be a win for some accesses and a loss for others
(e.g., if the drive read ahead thinking the next read was going to
be sequential with the last -- and that proves to be wrong -- the
drive may have missed an opportunity to respond more quickly to the
ACTUAL access that follows).
[I'm avoiding talking about reads AND writes just to keep the
discussion complexity manageable -- to avoid having to introduce
caveats with every statement]
Pathological access patterns can reduce this rate dramatically, and the
worst case can result in rates of only a few percent of it if
you force significant seeks between each sector read (and overload the
buffering so it can't hold larger reads for a given stream).
Exactly. But, you don't necessarily know where your next access will
take you. This variation in throughput is what makes defining
"i/o bound" tricky; if the access patterns at some instant (instant
being a period over which you base your decision) make the drive
look slow, then you would opt NOT to spawn a new thread to take
advantage of excess throughput. Similarly, if the drive "looks" serendipitously fast, you may spawn another thread and its
accesses will eventually conflict with those of the first thread
to lower overall throughput.
Non-pathological access can often result in near 100% of the access rate.
The best test of whether you are I/O bound: if the I/O system is
constantly in use, and every I/O request has another pending when it
finishes, then you are totally I/O bound.
But, if you make that assessment when the access pattern is "unfortunate", you erroneously conclude the disk is at its capacity. And, vice versa.
Without control over the access patterns, it seems like there is no
reliable strategy for determining when another thread can be
advantageous (?)
On 10/15/21 12:00 PM, Don Y wrote:
On 10/15/2021 8:38 AM, Richard Damon wrote:
You know that you can't actually get data off the media faster than the
fundamental data rate of the media.
Yes, but you don't know that rate *and* that rate varies based on
"where" you're accesses land on the physical medium (e.g., ZDR,
shingled drives, etc.)
But all of these still have a 'maximum' rate, so you can still define a maximum. It does mean that the 'expected' rate you can get is more variable.
As you mention, cache can give an apparent rate faster than the media, but you seem to be willing to assume that caching doesn't affect your rate, and each chunk will only be returned once.
Cache in the filesystem code will be counterproductive. Cache in
the drive may be a win for some accesses and a loss for others
(e.g., if the drive read ahead thinking the next read was going to
be sequential with the last -- and that proves to be wrong -- the
drive may have missed an opportunity to respond more quickly to the
ACTUAL access that follows).
[I'm avoiding talking about reads AND writes just to keep the
discussion complexity manageable -- to avoid having to introduce
caveats with every statement]
Yes, the drive might try to read ahead and hurt itself, or it might not. That is mostly out of your control.
Non-pathological access can often result in near 100% of the access rate.
The best test of whether you are I/O bound: if the I/O system is constantly in use, and every I/O request has another pending when it finishes, then you are totally I/O bound.
But, if you make that assessment when the access pattern is "unfortunate", you erroneously conclude the disk is at its capacity. And, vice versa.
Without control over the access patterns, it seems like there is no
reliable strategy for determining when another thread can be
advantageous (?)
Yes, adding more threads might change the access pattern. It will TEND to make
the pattern less sequential, and thus push things toward that pathological case (so
more threads actually decrease the rate at which you can do I/O and thus slow down
your I/O-bound rate). It is possible that more threads happen to make things more sequential: if the system can see that one thread wants sector
N and another wants sector N+1, something can schedule the reads together and drop a seek.
Predicting that sort of behavior can't be done 'in the abstract'. You need to think about the details of the system.
As a general principle, if the I/O system is saturated, the job is I/O bound.
Adding more threads will only help if you have the resources to queue up more requests and can optimize the order of servicing them to be more efficient with
I/O. Predicting that means you need to know and have some control over the access pattern.
Note, part of this is being able to trade memory to improve I/O speed. If you know that EVENTUALLY you will want the next sector after the one you are reading, reading it now and caching it will be a win, but only if you will be
able to use that data before you need to claim that memory for other uses. This
sort of improvement really does require knowing details you are trying to assume you don't know, so you are limiting your ability to make accurate decisions.
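One hedged illustration of trading memory for I/O: if the application itself knows it will want the next region of a file, it can say so instead of hoping the drive guesses right. On POSIX systems that hint is posix_fadvise(POSIX_FADV_WILLNEED); the chunk size and the crunch() hook below are made-up placeholders, and whether the prefetch pays off depends on exactly the memory trade-off above.

/* Sketch: application-directed read-ahead with posix_fadvise().
 * Only worthwhile if the prefetched chunk will be consumed before
 * the memory holding it is wanted for something else.
 */
#include <fcntl.h>
#include <unistd.h>

enum { CHUNK = 4 * 1024 * 1024 };       /* work unit: 4 MiB (assumed) */

int process_file(const char *path, void (*crunch)(const char *, ssize_t))
{
    char buf[64 * 1024];
    off_t off = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    for (;;) {
        /* ask the kernel to start fetching the NEXT chunk while we
         * read and crunch the current one */
        posix_fadvise(fd, off + CHUNK, CHUNK, POSIX_FADV_WILLNEED);

        ssize_t got = 0, n;
        while (got < CHUNK &&
               (n = pread(fd, buf, sizeof buf, off + got)) > 0) {
            crunch(buf, n);             /* caller-supplied processing */
            got += n;
        }
        if (got < CHUNK)                /* short chunk => end of file */
            break;
        off += CHUNK;
    }
    close(fd);
    return 0;
}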
Don Y <blockedofcourse@foo.invalid> wrote:
As a *rough* figure, what would you expect the bandwidth of
a disk drive (spinning rust) to do as a function of number of
discrete files being accessed, concurrently?
E.g., if you can monitor the rough throughput of each
stream and sum them, will they sum to 100% of the drive's
bandwidth? 90%? 110%? etc.
[Note that drives have read-ahead and write caches so
the speed of the media might not bleed through to the
application layer. And, filesystem code also throws
a wrench in the works. Assume caching in the system
is disabled/ineffective.]
Said another way, what's a reasonably reliable way of
determining when you are I/O bound by the hardware
and when more threads won't result in more performance?
Roughly speaking, a drive spinning at 7500 rpm divided by 60 is 125 revolutions a second; a seek takes half a revolution and the next file
is another half a revolution away on average, which gets you 125 files a second, roughly speaking, depending on the performance of the drive, if my numbers are not too far off.
This is plenty to support a dozen Windows VMs on average, if it were not for Windows updates that saturate the disks with hundreds of little file updates at once, causing Microsoft SQL timeouts for the VMs.
On 10/17/2021 1:27 PM, Brett wrote:
Don Y <blockedofcourse@foo.invalid> wrote:
As a *rough* figure, what would you expect the bandwidth of
a disk drive (spinning rust) to do as a function of number of
discrete files being accessed, concurrently?
E.g., if you can monitor the rough throughput of each
stream and sum them, will they sum to 100% of the drive's
bandwidth? 90%? 110%? etc.
[Note that drives have read-ahead and write caches so
the speed of the media might not bleed through to the
application layer. And, filesystem code also throws
a wrench in the works. Assume caching in the system
is disabled/ineffective.]
Said another way, what's a reasonably reliable way of
determining when you are I/O bound by the hardware
and when more threads won't result in more performance?
Roughly speaking, a drive spinning at 7500 rpm divided by 60 is 125
revolutions a second; a seek takes half a revolution and the next file
is another half a revolution away on average, which gets you 125 files a
second, roughly speaking, depending on the performance of the drive, if my
numbers are not too far off.
You're assuming files are laid out contiguously -- that no seeks are needed "between sectors".
But, seek time can be 10, 15+ ms (on my enterprise drives, it's 4; but average rotational delay is 2). And, if the desired sector lies on a "distant" cylinder, you can scale that almost linearly.
On 10/17/2021 1:27 PM, Brett wrote:
Don Y <blockedofcourse@foo.invalid> wrote:
As a *rough* figure, what would you expect the bandwidth of
a disk drive (spinning rust) to do as a function of number of
discrete files being accessed, concurrently?
E.g., if you can monitor the rough throughput of each
stream and sum them, will they sum to 100% of the drive's
bandwidth? 90%? 110%? etc.
[Note that drives have read-ahead and write caches so
the speed of the media might not bleed through to the
application layer. And, filesystem code also throws
a wrench in the works. Assume caching in the system
is disabled/ineffective.]
Said another way, what's a reasonably reliable way of
determining when you are I/O bound by the hardware
and when more threads won't result in more performance?
Roughly speaking, a drive spinning at 7500 rpm divided by 60 is 125
revolutions a second; a seek takes half a revolution and the next file
is another half a revolution away on average
For a 7200 rpm (some are as slow as 5400, some as fast as 15K) drive,
AVERAGE rotational delay is 8.3+ ms/2 = ~4ms.
I.e., looking at the disk's specs is largely useless unless you know how
the data on it is laid out.
And, for writes, shingled drives throw all of that down the toilet.
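Ballpark arithmetic for the random-access case (illustrative numbers only, not any particular drive's spec; the 9 ms average seek is an assumption):

/* Back-of-envelope random-access rate for a spinning drive.
 * Sequential reads pay neither the seek nor the extra rotation,
 * which is why the two regimes differ by orders of magnitude.
 */
#include <stdio.h>

int main(void)
{
    double rpm       = 7200.0;
    double rot_ms    = 60.0 * 1000.0 / rpm / 2.0;  /* avg rotational delay ~4.2 ms */
    double seek_ms   = 9.0;                        /* assumed average seek */
    double per_io_ms = seek_ms + rot_ms;           /* cost of one scattered access */

    printf("~%.0f random accesses per second\n", 1000.0 / per_io_ms);  /* ~76 */
    return 0;
}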
You're assuming files are laid out contiguously -- that no seeks are needed "between sectors".
This is the typical case anyway; most files are contiguously allocated,
even on popular filesystems which have long forgotten how to do worst
fit allocation and have to defragment their disks not so infrequently.
But I think they have to access at least 3 locations to get to a file:
the directory entry, some kind of FAT-like thing, then the file.
Unlike dps, where 2 accesses are enough. And of course dps does
worst fit allocation, so defragmenting is just unnecessary.
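Not dps's code, obviously, but the worst-fit idea is easy to sketch: always carve new space out of the largest free extent, so a new or growing file gets the longest contiguous runway (the structures below are hypothetical).

#include <stddef.h>

struct extent { unsigned long start, len; };    /* free run, in clusters */

/* Worst fit: allocate `want` clusters from the LARGEST free extent.
 * Returns the starting cluster, or (unsigned long)-1 if nothing fits. */
unsigned long alloc_worst_fit(struct extent *free_list, size_t nfree,
                              unsigned long want)
{
    size_t best = (size_t)-1;

    for (size_t i = 0; i < nfree; i++)          /* find the largest big-enough extent */
        if (free_list[i].len >= want &&
            (best == (size_t)-1 || free_list[i].len > free_list[best].len))
            best = i;

    if (best == (size_t)-1)
        return (unsigned long)-1;               /* no extent is big enough */

    unsigned long start = free_list[best].start;
    free_list[best].start += want;              /* carve from the front... */
    free_list[best].len   -= want;              /* ...leaving the rest to grow into */
    return start;
}

First fit, by contrast, grabs the first hole that is merely big enough, which is exactly how small, recently freed holes end up splitting files.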
On 10/17/2021 3:09 PM, Dimiter_Popoff wrote:
You're assuming files are laid out contiguously -- that no seeks are
needed
"between sectors".
This is the typical case anyway, most files are contiguously allocated.
I'm not sure that is the case for files that have been modified on a
medium.
....
Even on popular filesystems which have long forgotten how to do worst
fit allocation and have to defragment their disks not so infrequently.
But I think they have to access at least 3 locations to get to a file;
the directory entry, some kind of FAT-like thing, then the file.
Unlike dps, where 2 accesses are enough. And of course dps does
worst fit allocation, so defragmenting is just unnecessary.
I think directories are cached. And, possibly entire drive structures (depending on how much physical RAM you have available).
...
E.g., my disk sanitizer times each (fixed size) access to profile the
drive's performance as well as looking for trouble spots on the media.
But, things like recal cycles or remapping bad sectors introduce
completely unpredictable blips in the throughput. So much so that
I've had to implement a fair bit of logic to identify whether a
"delay" was part of normal operation *or* a sign of an exceptional
event.
[But, the sanitizer has a very predictable access pattern so
there are no filesystem/content-specific issues involved; just
process sectors as fast as possible. (Also, there is no
need to have multiple threads per spindle; just a thread *per*
spindle -- plus some overhead threads.)
And, the sanitizer isn't as concerned with throughput, since the
human operator is the bottleneck (I can crank out a 500+GB drive
every few minutes).]
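The timing side of that kind of profiling is easy to sketch (an illustration, not the sanitizer itself): time each fixed-size read from the raw device and flag anything far off the running average as a blip worth a second look. The device path is a placeholder and the 5x threshold is arbitrary.

/* Sketch of per-access timing while streaming a raw device with
 * fixed-size reads (illustrative only).  Flags accesses that take
 * much longer than the running average -- possible remaps, recals,
 * or other "blips".
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLK (1 << 20)                   /* fixed-size access: 1 MiB */

static double ms_now(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec * 1e3 + t.tv_nsec / 1e6;
}

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/sdX";  /* placeholder */
    int fd = open(dev, O_RDONLY);
    char *buf = malloc(BLK);
    double avg = 0.0;

    if (fd < 0 || !buf)
        return 1;
    for (long blk = 0; ; blk++) {
        double t0 = ms_now();
        ssize_t n = read(fd, buf, BLK);
        double dt = ms_now() - t0;

        if (n <= 0)
            break;                              /* end of device (or error) */
        if (blk > 16 && dt > 5.0 * avg)         /* crude "blip" detector */
            printf("block %ld: %.1f ms (running avg %.1f ms)\n", blk, dt, avg);
        avg = (blk == 0) ? dt : 0.95 * avg + 0.05 * dt;   /* EWMA */
    }
    free(buf);
    close(fd);
    return 0;
}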
I'll mock up some synthetic loads and try various thread-spawning
strategies to see the sorts of performance I *might* be able
to get -- with different "preexisting" media (to minimize my
impact on that).
I'm sure I can round up a dozen or more platforms to try -- just
from stuff I have lying around here! :>
The devil is not that black (literally translating a Bulgarian saying),
as you see. Worst fit allocation is of course crucial to getting
such figures; the mainstream OSes don't do it and things there
must be much worse.
Even on popular filesystems which have long forgotten how to do worst
fit allocation and have to defragment their disks not so infrequently.
But I think they have to access at least 3 locations to get to a file;
the directory entry, some kind of FAT-like thing, then the file.
Unlike dps, where 2 accesses are enough. And of course dps does
worst fit allocation, so defragmenting is just unnecessary.
I think directories are cached. And, possibly entire drive structures
(depending on how much physical RAM you have available).
Well of course they must be caching them, especially since there are gigabytes of RAM available.
I know what dps does: it caches longnamed
directories, which coexist with the old 8.4 ones in the same filesystem
and work faster than the 8.4 ones, which typically don't get cached
(these were done to work well even on floppies; a directory entry
update writes back only the sector(s) it occupies, i.e. both if it
crosses a sector boundary, etc.). Then in dps the CAT (cluster allocation tables) are cached all
the time (do that for a 500G partition and enjoy reading all
4 megabytes each time the CAT is needed to allocate new space...
it can be done; in fact the caches are enabled upon boot explicitly
on a per LUN/partition basis).
E.g., my disk sanitizer times each (fixed size) access to profile the
drive's performance as well as looking for trouble spots on the media.
But, things like recal cycles or remapping bad sectors introduce
completely unpredictable blips in the throughput. So much so that
I've had to implement a fair bit of logic to identify whether a
"delay" was part of normal operation *or* a sign of an exceptional
event.
[But, the sanitizer has a very predictable access pattern so
there are no filesystem/content-specific issues involved; just
process sectors as fast as possible. (Also, there is no
need to have multiple threads per spindle; just a thread *per*
spindle -- plus some overhead threads.)
And, the sanitizer isn't as concerned with throughput, since the
human operator is the bottleneck (I can crank out a 500+GB drive
every few minutes).]
I did something similar many years ago, when the largest drive
a nukeman had was 200 (230 IIRC) megabytes, i.e. before
magnetoresistive heads came to the world. It did develop
bad sectors and did not do much internally about it (1993).
So I wrote the "lockout" command (still available; I see I have
recompiled it for power, last change 2016 - can't remember if it
did anything useful, nor why I did that). It accessed, sector by
sector, the LUN it was told to and built the lockout CAT on
its filesystem (the LCAT being ORed into the CAT prior to the LUN being
usable for the OS). Took quite some time on that drive but
did the job back then.
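The idea in miniature (not the dps lockout code, just an illustration with made-up structures): probe each sector, set a bit for every unreadable one, then OR that bitmap into the allocation map so bad sectors look permanently allocated.

/* Miniature of the "lockout" idea: probe every sector, set a bit for
 * each one that can't be read, then OR the result into the allocation
 * bitmap so bad sectors are never handed out again.
 */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define SECTOR 512u

/* Scan `nsectors` of the raw device and build the lockout bitmap. */
int build_lockout(const char *dev, uint8_t *lockout, uint64_t nsectors)
{
    char buf[SECTOR];
    int fd = open(dev, O_RDONLY);

    if (fd < 0)
        return -1;
    for (uint64_t s = 0; s < nsectors; s++) {
        if (pread(fd, buf, SECTOR, (off_t)s * SECTOR) != (ssize_t)SECTOR)
            lockout[s / 8] |= (uint8_t)(1u << (s % 8));   /* mark bad */
    }
    close(fd);
    return 0;
}

/* Merge the lockout map into the allocation map, so bad sectors look
 * permanently allocated to the allocator. */
void apply_lockout(uint8_t *alloc_map, const uint8_t *lockout, uint64_t nsectors)
{
    for (uint64_t i = 0; i < (nsectors + 7) / 8; i++)
        alloc_map[i] |= lockout[i];
}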
I'll mock up some synthetic loads and try various thread-spawning
strategies to see the sorts of performance I *might* be able
to get -- with different "preexisting" media (to minimize my
impact on that).
I'm sure I can round up a dozen or more platforms to try -- just
from stuff I have lying around here! :>
I think this will give you plenty of an idea how to go about it.
Once you know the limit you can run at some reasonable figure
below it and be happy. Getting more precise figures about all
that is neither easy nor will it buy you anything.
On 10/18/2021 9:25 AM, Dimiter_Popoff wrote:
The devil is not that black (literally translating a Bulgarian saying),
as you see. Worst fit allocation is of course crucial to getting
such figures; the mainstream OSes don't do it and things there
must be much worse.
I think a lot depends on the amount of "churn" the filesystem
experiences, in normal operation. E.g., the "system" disk on
the workstation I'm using today has about 800G in use of 1T total.
But, the vast majority of it is immutable -- binaries, libraries,
etc. So, there's very low fragmentation (because I "build" the
disk in one shot, instead of incrementally revising and
"updating" its contents)
By contrast, the other disks in the machine all see a fair bit of
turnover as things get created, revised and deleted.
By contrast, the other disks in the machine all see a fair bit of
turnover as things get created, revised and deleted.
Now this is where the worst fit allocation strategy becomes the game
changer. A newly created file is almost certainly contiguously
allocated; fragmentation occurs when it is appended to (and the
space past its last block has already been allocated in the meantime).
I think I saw somewhere (never really got interested) that mainstream
operating systems of today do just first fit, which means once you
delete a file, no matter how small, its space will be allocated as
part of the next request, etc. No idea why they do it (if they do so;
my memory on that is not very certain) in such a primitive manner,
but here they are.
I think this will give you plenty of an idea how to go about it.
Once you know the limit you can run at some reasonable figure
below it and be happy. Getting more precise figures about all
that is neither easy nor will it buy you anything.
I suspect "1" is going to end up as the "best compromise". So,
I'm treating this as an exercise in *validating* that assumption.
I'll see if I can salvage some of the performance monitoring code
from the sanitizer to give me details from which I might be able
to ferret out "opportunities". If I start by restricting my
observations to non-destructive synthetic loads, then I can
pull a drive and see how it fares in a different host while
running the same code, etc.
On 10/18/2021 1:46 PM, Don Y wrote:
I think this will give you plenty of an idea how to go about it.
Once you know the limit you can run at some reasonable figure
below it and be happy. Getting more precise figures about all
that is neither easy nor will it buy you anything.
I suspect "1" is going to end up as the "best compromise". So,
I'm treating this as an exercise in *validating* that assumption.
I'll see if I can salvage some of the performance monitoring code
from the sanitizer to give me details from which I might be able
to ferret out "opportunities". If I start by restricting my
observations to non-destructive synthetic loads, then I can
pull a drive and see how it fares in a different host while
running the same code, etc.
Actually, '2' turns out to be marginally better than '1'.
Beyond that, it's hard to generalize without controlling some
of the other variables.
'2' wins because there is always the potential to make the
disk busy, again, just after it satisfies the access for
the 1st thread (which is now busy using the data, etc.)
But, if the first thread finishes up before the second thread's
request has been satisfied, then the presence of a THIRD thread
would just be clutter. (i.e., the work performed correlates
with the number of threads that can have value)
On 10/25/21 7:19 AM, Don Y wrote:
On 10/18/2021 1:46 PM, Don Y wrote:
I think this will give you plenty of an idea how to go about it.
Once you know the limit you can run at some reasonable figure
below it and be happy. Getting more precise figures about all
that is neither easy nor will it buy you anything.
I suspect "1" is going to end up as the "best compromise". So,
I'm treating this as an exercise in *validating* that assumption.
I'll see if I can salvage some of the performance monitoring code
from the sanitizer to give me details from which I might be able
to ferret out "opportunities". If I start by restricting my
observations to non-destructive synthetic loads, then I can
pull a drive and see how it fares in a different host while
running the same code, etc.
Actually, '2' turns out to be marginally better than '1'.
Beyond that, it's hard to generalize without controlling some
of the other variables.
'2' wins because there is always the potential to make the
disk busy, again, just after it satisfies the access for
the 1st thread (which is now busy using the data, etc.)
But, if the first thread finishes up before the second thread's
request has been satisfied, then the presence of a THIRD thread
would just be clutter. (i.e., the work performed correlates
with the number of threads that can have value)
Actually, 2 might be slower than 1, because the new request from the second thread is apt to need a seek, while a single thread making all the calls is more apt to sequentially read much more of the disk.
The controller, if not given a new chained command, might choose to automatically start reading the next sector of the cylinder, which is likely to be the next one asked for.
The real optimum is likely a single process doing asynchronous I/O, queuing
up a series of requests and then distributing the data as it comes in to processing threads to do whatever crunching needs to be done.
These threads then send the data to a single thread that does asynchronous writes of the results.
You can easily tell if the input or output processes are I/O bound or not and use that to adjust the number of crunching threads in the middle.
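A hedged sketch of that shape, reads only: one loop keeps a handful of POSIX AIO reads outstanding and hands completed buffers to a crunch() hook (which in a real program would feed the worker threads, with a symmetric writer on the far side). The queue depth, block size, file name and crunch() are all assumptions.

/* Sketch of the single-reader pipeline front end: keep QD reads in
 * flight with POSIX AIO and hand each completed buffer to crunch().
 * Illustrative only -- queue depth, sizes and the hook are made up.
 */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define QD  4                           /* outstanding read requests */
#define BLK (1 << 20)                   /* 1 MiB per request */

static long long processed;
static void crunch(const char *buf, ssize_t len)   /* stand-in worker hook */
{
    (void)buf;
    processed += len;
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "input.dat";  /* placeholder */
    int fd = open(path, O_RDONLY);
    struct aiocb cb[QD];
    const struct aiocb *wait[QD];
    char *buf[QD];
    off_t next = 0;
    int live = 0;

    if (fd < 0)
        return 1;
    memset(cb, 0, sizeof cb);
    for (int i = 0; i < QD; i++) {                  /* prime the queue */
        buf[i] = malloc(BLK);
        cb[i].aio_fildes = fd;
        cb[i].aio_buf    = buf[i];
        cb[i].aio_nbytes = BLK;
        cb[i].aio_offset = next;
        next += BLK;
        if (aio_read(&cb[i]) == 0)
            live++;
        else
            cb[i].aio_fildes = -1;                  /* slot never started */
    }
    while (live > 0) {
        int nw = 0;
        for (int i = 0; i < QD; i++)                /* wait for any completion */
            if (cb[i].aio_fildes != -1)
                wait[nw++] = &cb[i];
        aio_suspend(wait, nw, NULL);

        for (int i = 0; i < QD; i++) {
            if (cb[i].aio_fildes == -1 || aio_error(&cb[i]) == EINPROGRESS)
                continue;
            ssize_t n = aio_return(&cb[i]);
            if (n > 0) {
                crunch(buf[i], n);                  /* hand off the data */
                cb[i].aio_offset = next;            /* immediately re-arm */
                next += BLK;
                if (aio_read(&cb[i]) == 0)
                    continue;
            }
            cb[i].aio_fildes = -1;                  /* EOF or error: retire */
            live--;
        }
    }
    printf("processed %lld bytes with up to %d reads in flight\n", processed, QD);
    close(fd);
    return 0;
}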
Dimiter_Popoff <dp@tgi-sci.com> wrote:
Now this is where the worst fit allocation strategy becomes the game
changer. A newly created file is almost certainly contiguously
allocated; fragmentation occurs when it is appended to (and the
space past its last block has been already allocated in the meantime).
I think I saw somewhere (never really got interested) that mainstream
operating systems of today do just first fit - which means once you
delete a file, no matter how small, its space will be allocated as
part of the next request etc., no idea why they do it (if they do so,
my memory on that is not very certain) in such a primitive manner
but here they are.
There was some research and first fit turned out to be pretty good.
But it is implemented slightly differently: there is a moving pointer,
and the search advances it. So, after deallocation you wait until the
moving pointer arrives at the hole. Way better than immediately filling
the hole: there is a reasonable chance that holes will coalesce into a
bigger free area. Another point is that you refuse allocation
when the disc is too full (say at 95% utilization). The two things
together mean that normally fragmentation is not a problem.
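That is essentially "next fit". A tiny sketch of the rotating pointer over a cluster bitmap (hypothetical structures, just to show the idea); the 95%-full refusal is then just a check on a free-cluster count before searching.

#include <stdbool.h>
#include <stddef.h>

#define NCLUSTERS 1024

static bool   used[NCLUSTERS];       /* true = cluster allocated */
static size_t cursor;                /* where the last search stopped */

/* Find `want` contiguous free clusters; return the first one, or -1. */
long alloc_next_fit(size_t want)
{
    size_t scanned = 0, run = 0, start = cursor;

    while (scanned < NCLUSTERS + want) {         /* one full lap, plus wrap slack */
        size_t i = (start + scanned) % NCLUSTERS;
        if (i == 0)
            run = 0;                             /* runs don't wrap past the end */
        if (!used[i]) {
            if (++run == want) {
                size_t first = i + 1 - want;
                for (size_t j = first; j <= i; j++)
                    used[j] = true;              /* claim the run */
                cursor = (i + 1) % NCLUSTERS;    /* resume AFTER this block */
                return (long)first;
            }
        } else {
            run = 0;
        }
        scanned++;
    }
    return -1;                                   /* disc (too) full */
}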