As a *rough* figure, what would you expect the bandwidth of
a disk drive (spinning rust) to do as a function of number of
discrete files being accessed, concurrently?
E.g., if you can monitor the rough throughput of each
stream and sum them, will they sum to 100% of the drive's
bandwidth? 90%? 110%? etc.
[Note that drives have read-ahead and write caches so
the speed of the media might not bleed through to the
application layer. And, filesystem code also throws
a wrench in the works. Assume caching in the system
is disabled/ineffective.]
Said another way, what's a reasonably reliable way of
determining when you are I/O bound by the hardware
and when more threads won't result in more performance?
You know that you can't actually get data off the media faster than the fundamental data rate of the media.
As you mention, cache can give an apparent rate faster than the media, but you
seem to be willing to assume that caching doesn't affect your rate, and each chunk will only be returned once.
Pathological access patterns can reduce this rate dramatically, and the worst case
can result in rates of only a few percent of it if you force significant seeks between each sector read (and overload the buffering so it can't hold larger reads for a given stream).
Non-pathological access can often result in near 100% of the access rate.
The best test of whether you are I/O bound: if the I/O system is constantly in use, and every I/O request has another pending when it finishes, then you are totally I/O bound.
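To make that test concrete: a minimal sketch, assuming Linux, where /proc/diskstats exposes per-device "time spent doing I/O" and "I/Os currently in progress" counters (the device name "sda" below is just a placeholder). If the device is ~100% busy over your sampling window and always has a request in flight, more threads won't buy more throughput.

/* Minimal sketch: estimate disk utilization from /proc/diskstats (Linux).
 * Classic field layout: after major, minor and the device name come 11
 * counters; the 9th is "I/Os currently in progress" and the 10th is
 * "time spent doing I/O" in ms (newer kernels append more fields, which
 * this sketch simply ignores).
 */
#include <stdio.h>
#include <string.h>
#include <unistd.h>

static int read_stats(const char *dev, unsigned long *inflight, unsigned long *io_ms)
{
    char line[512], name[64];
    unsigned long v[13];
    FILE *f = fopen("/proc/diskstats", "r");

    if (!f)
        return -1;
    while (fgets(line, sizeof line, f)) {
        if (sscanf(line, "%lu %lu %63s %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu %lu",
                   &v[0], &v[1], name, &v[2], &v[3], &v[4], &v[5], &v[6],
                   &v[7], &v[8], &v[9], &v[10], &v[11], &v[12]) == 14 &&
            strcmp(name, dev) == 0) {
            *inflight = v[10];          /* I/Os currently in progress */
            *io_ms    = v[11];          /* ms spent doing I/O since boot */
            fclose(f);
            return 0;
        }
    }
    fclose(f);
    return -1;
}

int main(void)
{
    const char *dev = "sda";            /* placeholder device name */
    unsigned long q0, t0, q1, t1;

    if (read_stats(dev, &q0, &t0))
        return 1;
    sleep(1);                           /* sample over a 1 s window */
    if (read_stats(dev, &q1, &t1))
        return 1;
    printf("%s: ~%.0f%% busy, %lu request(s) in flight\n",
           dev, (t1 - t0) / 10.0, q1);  /* ms busy out of 1000 ms, as a percentage */
    return 0;
}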
On 10/15/2021 17:08, Don Y wrote:
As a *rough* figure, what would you expect the bandwidth of
a disk drive (spinning rust) to do as a function of number of
discrete files being accessed, concurrently?
E.g., if you can monitor the rough throughput of each
stream and sum them, will they sum to 100% of the drive's
bandwidth? 90%? 110%? etc.
[Note that drives have read-ahead and write caches so
the speed of the media might not bleed through to the
application layer. And, filesystem code also throws
a wrench in the works. Assume caching in the system
is disabled/ineffective.]
Said another way, what's a reasonably reliable way of
determining when you are I/O bound by the hardware
and when more threads won't result in more performance?
If caching is disabled things can get really bad quite quickly:
think of updating directory entries to reflect modification/access
dates, file sizes, scattering, etc.; think also of allocation
table accesses.
E.g. in dps on a larger disk partition
(say >100 gigabytes) the first CAT (cluster allocation table)
access after boot takes some noticeable time, a second maybe;
then it stops being noticeable at all as the CAT is updated
rarely and on a modified-area basis only (this on a processor
capable of 20 Mbytes/second) (dps needs the entire CAT to allocate
new space in order to do its (enhanced) worst fit scheme).
IOW, if you torture the disk with constant seeks and scattered
accesses you can slow it down anywhere from somewhat to a lot; it depends
on way too many factors to be worth wondering about.
Said another way, what's a reasonably reliable way of
determining when you are I/O bound by the hardware
and when more threads won't result in more performance?
Just try it out for some time and make your pick. Recently
I did dfs (a distributed file system, over TCP) for dps
and had to watch much of this going on; at some point you
reach something between 50 and 100% of the hardware limit,
depending on the file sizes you copy and whatever other
overhead you can think of.
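The "try it out" measurement can be as small as this: N reader threads, each streaming a different large file and counting bytes, so the per-stream rates can be summed and compared against the drive's single-stream figure. A rough POSIX sketch; the file names are placeholders, O_DIRECT (Linux) is used to keep the page cache from inflating the numbers, and NTHREADS is the knob to sweep.

/* Rough sketch: N threads each stream one file and count bytes, so the
 * per-stream rates can be summed and compared to the drive's
 * single-stream bandwidth.  File names below are placeholders.
 */
#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define NTHREADS 4
#define BLOCK    (1 << 20)          /* 1 MiB per read, O_DIRECT-aligned */

struct stream { const char *path; double mbytes_per_s; };

static void *reader(void *arg)
{
    struct stream *s = arg;
    void *buf;
    long long total = 0;
    ssize_t n;
    struct timespec t0, t1;
    int fd = open(s->path, O_RDONLY | O_DIRECT);

    if (fd < 0 || posix_memalign(&buf, 4096, BLOCK))
        return NULL;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    while ((n = read(fd, buf, BLOCK)) > 0)      /* stream the whole file */
        total += n;
    clock_gettime(CLOCK_MONOTONIC, &t1);

    double secs = (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9;
    s->mbytes_per_s = total / 1e6 / secs;
    free(buf);
    close(fd);
    return NULL;
}

int main(void)
{
    /* hypothetical test files, one per concurrent stream */
    struct stream s[NTHREADS] = {
        { "/data/f0" }, { "/data/f1" }, { "/data/f2" }, { "/data/f3" }
    };
    pthread_t tid[NTHREADS];
    double sum = 0;

    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&tid[i], NULL, reader, &s[i]);
    for (int i = 0; i < NTHREADS; i++) {
        pthread_join(tid[i], NULL);
        printf("%s: %.1f MB/s\n", s[i].path, s[i].mbytes_per_s);
        sum += s[i].mbytes_per_s;
    }
    printf("sum of streams: %.1f MB/s\n", sum);
    return 0;
}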
On 10/15/2021 8:46 AM, Dimiter_Popoff wrote:
....
Just try it out for some time and make your pick. Recently
I did dfs (a distributed file system, over TCP) for dps
and had to watch much of this going on; at some point you
reach something between 50 and 100% of the hardware limit,
depending on the file sizes you copy and whatever other
overhead you can think of.
I think the cost of any extra complexity in the algorithm
(to dynamically try to optimize number of threads) is
hard to justify -- given no control over the actual
media. I.e., it seems like it's best to just aim for
"simple" and live with whatever throughput you get...
On 10/15/2021 8:38 AM, Richard Damon wrote:
You know that you can't actually get data off the media faster than
the fundamental data rate of the media.
Yes, but you don't know that rate *and* that rate varies based on
"where" you're accesses land on the physical medium (e.g., ZDR,
shingled drives, etc.)
As you mention, cache can give an apparent rate faster than the media,
but you seem to be willing to assume that caching doesn't affect your
rate, and each chunk will only be returned once.
Cache in the filesystem code will be counterproductive. Cache in
the drive may be a win for some accesses and a loss for others
(e.g., if the drive read ahead thinking the next read was going to
be sequential with the last -- and that proves to be wrong -- the
drive may have missed an opportunity to respond more quickly to the
ACTUAL access that follows).
[I'm avoiding talking about reads AND writes just to keep the
discussion complexity manageable -- to avoid having to introduce
caveats with every statement]
Pathological access patterns can reduce this rate dramatically, and the
worst case can result in rates of only a few percent of it if
you force significant seeks between each sector read (and overload the
buffering so it can't hold larger reads for a given stream).
Exactly. But, you don't necessarily know where your next access will
take you. This variation in throughput is what makes defining
"i/o bound" tricky; if the access patterns at some instant (instant
being a period over which you base your decision) make the drive
look slow, then you would opt NOT to spawn a new thread to take
advantage of excess throughput. Similarly, if the drive "looks" serendipitously fast, you may spawn another thread and its
accesses will eventually conflict with those of the first thread
to lower overall throughput.
Non-pathological access can often result in near 100% of the access rate.
The best test of whether you are I/O bound: if the I/O system is
constantly in use, and every I/O request has another pending when it
finishes, then you are totally I/O bound.
But, if you make that assessment when the access pattern is "unfortunate", you erroneously conclude the disk is at its capacity. And, vice versa.
Without control over the access patterns, it seems like there is no
reliable strategy for determining when another thread can be
advantageous (?)
On 10/15/21 12:00 PM, Don Y wrote:
On 10/15/2021 8:38 AM, Richard Damon wrote:
You know that you can't actually get data off the media faster than the
fundamental data rate of the media.
Yes, but you don't know that rate *and* that rate varies based on
"where" you're accesses land on the physical medium (e.g., ZDR,
shingled drives, etc.)
But all of these still have a 'maximum' rate, so you can still define a maximum. It does mean that the 'expected' rate you can get is more variable.
As you mention, cache can give an apparent rate faster than the media, but you seem to be willing to assume that caching doesn't affect your rate, and each chunk will only be returned once.
Cache in the filesystem code will be counterproductive. Cache in
the drive may be a win for some accesses and a loss for others
(e.g., if the drive read ahead thinking the next read was going to
be sequential with the last -- and that proves to be wrong -- the
drive may have missed an opportunity to respond more quickly to the
ACTUAL access that follows).
[I'm avoiding talking about reads AND writes just to keep the
discussion complexity manageable -- to avoid having to introduce
caveats with every statement]
Yes, the drive might try to read ahead and hurt itself, or it might not. That is mostly out of your control.
Non-pathological access can often result in near 100% of the access rate.
The best test of whether you are I/O bound: if the I/O system is constantly in use, and every I/O request has another pending when it finishes, then you are totally I/O bound.
But, if you make that assessment when the access pattern is "unfortunate", you erroneously conclude the disk is at its capacity. And, vice versa.
Without control over the access patterns, it seems like there is no
reliable strategy for determining when another thread can be
advantageous (?)
Yes, adding more threads might change the access pattern. It will TEND to make
the pattern less sequential, and thus push things toward that pathological case (so
more threads actually decrease the rate at which you can do I/O and thus slow down
your I/O-bound rate). It is possible that more threads happen to make things more sequential: if the system can see that one thread wants sector
N and another wants sector N+1, something can schedule the reads together and drop a seek.
Predicting that sort of behavior can't be done 'in the abstract'. You need to think about the details of the system.
As a general principle, if the I/O system is saturated, the job is I/O bound.
Adding more threads will only help if you have the resources to queue up more requests and can optimize the order of servicing them to be more efficient with
I/O. Predicting that means you need to know and have some control over the access pattern.
Note, part of this is being able to trade memory to improve I/O speed. If you know that EVENTUALLY you will want the next sector after the one you are reading, reading it now and caching it will be a win, but only if you will be
able to use that data before you need to claim that memory for other uses. This
sort of improvement really does require knowing details you are trying to assume you don't know, so you are limiting your ability to make accurate decisions.
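One hedged illustration of trading memory for I/O: if the application itself knows it will want the next region of a file, it can say so instead of hoping the drive guesses right. On POSIX systems that hint is posix_fadvise(POSIX_FADV_WILLNEED); the chunk size and the crunch() hook below are made-up placeholders, and whether the prefetch pays off depends on exactly the memory trade-off above.

/* Sketch: application-directed read-ahead with posix_fadvise().
 * Only worthwhile if the prefetched chunk will be consumed before
 * the memory holding it is wanted for something else.
 */
#include <fcntl.h>
#include <unistd.h>

enum { CHUNK = 4 * 1024 * 1024 };       /* work unit: 4 MiB (assumed) */

int process_file(const char *path, void (*crunch)(const char *, ssize_t))
{
    char buf[64 * 1024];
    off_t off = 0;
    int fd = open(path, O_RDONLY);

    if (fd < 0)
        return -1;
    for (;;) {
        /* ask the kernel to start fetching the NEXT chunk while we
         * read and crunch the current one */
        posix_fadvise(fd, off + CHUNK, CHUNK, POSIX_FADV_WILLNEED);

        ssize_t got = 0, n;
        while (got < CHUNK &&
               (n = pread(fd, buf, sizeof buf, off + got)) > 0) {
            crunch(buf, n);             /* caller-supplied processing */
            got += n;
        }
        if (got < CHUNK)                /* short chunk => end of file */
            break;
        off += CHUNK;
    }
    close(fd);
    return 0;
}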
Don Y <blockedofcourse@foo.invalid> wrote:
As a *rough* figure, what would you expect the bandwidth of
a disk drive (spinning rust) to do as a function of number of
discrete files being accessed, concurrently?
E.g., if you can monitor the rough throughput of each
stream and sum them, will they sum to 100% of the drive's
bandwidth? 90%? 110%? etc.
[Note that drives have read-ahead and write caches so
the speed of the media might not bleed through to the
application layer. And, filesystem code also throws
a wrench in the works. Assume caching in the system
is disabled/ineffective.]
Said another way, what's a reasonably reliable way of
determining when you are I/O bound by the hardware
and when more threads won't result in more performance?
Roughly speaking, a drive spinning at 7500 rpm divided by 60 is 125 revolutions a second; a seek takes half a revolution and the next file
is another half a revolution away on average, which gets you 125 files a second, roughly speaking, depending on the performance of the drive, if my numbers are not too far off.
This is plenty to support a dozen Windows VMs on average, if it were not for Windows updates that saturate the disks with hundreds of little file updates at once, causing Microsoft SQL timeouts for the VMs.
On 10/17/2021 1:27 PM, Brett wrote:
Don Y <blockedofcourse@foo.invalid> wrote:
As a *rough* figure, what would you expect the bandwidth of
a disk drive (spinning rust) to do as a function of number of
discrete files being accessed, concurrently?
E.g., if you can monitor the rough throughput of each
stream and sum them, will they sum to 100% of the drive's
bandwidth? 90%? 110%? etc.
[Note that drives have read-ahead and write caches so
the speed of the media might not bleed through to the
application layer. And, filesystem code also throws
a wrench in the works. Assume caching in the system
is disabled/ineffective.]
Said another way, what's a reasonably reliable way of
determining when you are I/O bound by the hardware
and when more threads won't result in more performance?
Roughly speaking, a drive spinning at 7500 rpm divided by 60 is 125
revolutions a second; a seek takes half a revolution and the next file
is another half a revolution away on average, which gets you 125 files a
second, roughly speaking, depending on the performance of the drive, if my
numbers are not too far off.
You're assuming files are laid out contiguously -- that no seeks are needed "between sectors".
But, seek time can be 10, 15+ ms (on my enterprise drives, it's 4; but average rotational delay is 2). And, if the desired sector lies on a "distant" cylinder, you can scale that almost linearly.
On 10/17/2021 1:27 PM, Brett wrote:
Don Y <blockedofcourse@foo.invalid> wrote:
As a *rough* figure, what would you expect the bandwidth of
a disk drive (spinning rust) to do as a function of number of
discrete files being accessed, concurrently?
E.g., if you can monitor the rough throughput of each
stream and sum them, will they sum to 100% of the drive's
bandwidth? 90%? 110%? etc.
[Note that drives have read-ahead and write caches so
the speed of the media might not bleed through to the
application layer. And, filesystem code also throws
a wrench in the works. Assume caching in the system
is disabled/ineffective.]
Said another way, what's a reasonably reliable way of
determining when you are I/O bound by the hardware
and when more threads won't result in more performance?
Roughly speaking, a drive spinning at 7500 rpm divided by 60 is 125
revolutions a second; a seek takes half a revolution and the next file
is another half a revolution away on average
For a 7200 rpm (some are as slow as 5400, some as fast as 15K) drive,
AVERAGE rotational delay is 8.3+ ms/2 = ~4ms.
I.e., looking at the disk's specs is largely useless unless you know how
the data on it is laid out.
And, for writes, shingled drives throw all of that down the toilet.
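Ballpark arithmetic for the random-access case (illustrative numbers only, not any particular drive's spec; the 9 ms average seek is an assumption):

/* Back-of-envelope random-access rate for a spinning drive.
 * Sequential reads pay neither the seek nor the extra rotation,
 * which is why the two regimes differ by orders of magnitude.
 */
#include <stdio.h>

int main(void)
{
    double rpm       = 7200.0;
    double rot_ms    = 60.0 * 1000.0 / rpm / 2.0;  /* avg rotational delay ~4.2 ms */
    double seek_ms   = 9.0;                        /* assumed average seek */
    double per_io_ms = seek_ms + rot_ms;           /* cost of one scattered access */

    printf("~%.0f random accesses per second\n", 1000.0 / per_io_ms);  /* ~76 */
    return 0;
}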
You're assuming files are laid out contiguously -- that no seeks are needed "between sectors".
This is the typical case anyway; most files are contiguously allocated,
even on popular filesystems which have long forgotten how to do worst
fit allocation and have to defragment their disks not so infrequently.
But I think they have to access at least 3 locations to get to a file:
the directory entry, some kind of FAT-like thing, then the file.
Unlike dps, where 2 accesses are enough. And of course dps does
worst fit allocation, so defragmenting is just unnecessary.
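Not dps's code, obviously, but the worst-fit idea is easy to sketch: always carve new space out of the largest free extent, so a new or growing file gets the longest contiguous runway (the structures below are hypothetical).

#include <stddef.h>

struct extent { unsigned long start, len; };    /* free run, in clusters */

/* Worst fit: allocate `want` clusters from the LARGEST free extent.
 * Returns the starting cluster, or (unsigned long)-1 if nothing fits. */
unsigned long alloc_worst_fit(struct extent *free_list, size_t nfree,
                              unsigned long want)
{
    size_t best = (size_t)-1;

    for (size_t i = 0; i < nfree; i++)          /* find the largest big-enough extent */
        if (free_list[i].len >= want &&
            (best == (size_t)-1 || free_list[i].len > free_list[best].len))
            best = i;

    if (best == (size_t)-1)
        return (unsigned long)-1;               /* no extent is big enough */

    unsigned long start = free_list[best].start;
    free_list[best].start += want;              /* carve from the front... */
    free_list[best].len   -= want;              /* ...leaving the rest to grow into */
    return start;
}

First fit, by contrast, grabs the first hole that is merely big enough, which is exactly how small, recently freed holes end up splitting files.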
On 10/17/2021 3:09 PM, Dimiter_Popoff wrote:
You're assuming files are laid out contiguously -- that no seeks are
needed
"between sectors".
This is the typical case anyway, most files are contiguously allocated.
I'm not sure that is the case for files that have been modified on a
medium.
....
Even on popular filesystems which have long forgotten how to do worst
fit allocation and have to defragment their disks not so infrequently.
But I think they have to access at least 3 locations to get to a file;
the directory entry, some kind of FAT-like thing, then the file.
Unlike dps, where 2 accesses are enough. And of course dps does
worst fit allocation, so defragmenting is just unnecessary.
I think directories are cached. And, possibly entire drive structures (depending on how much physical RAM you have available).
...
E.g., my disk sanitizer times each (fixed size) access to profile the
drive's performance as well as looking for trouble spots on the media.
But, things like recal cycles or remapping bad sectors introduce
completely unpredictable blips in the throughput. So much so that
I've had to implement a fair bit of logic to identify whether a
"delay" was part of normal operation *or* a sign of an exceptional
event.
[But, the sanitizer has a very predictable access pattern so
there are no filesystem/content-specific issues involved; just
process sectors as fast as possible. (Also, there is no
need to have multiple threads per spindle; just a thread *per*
spindle -- plus some overhead threads.)
And, the sanitizer isn't as concerned with throughput, since the
human operator is the bottleneck (I can crank out a 500+GB drive
every few minutes).]
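The timing side of that kind of profiling is easy to sketch (an illustration, not the sanitizer itself): time each fixed-size read from the raw device and flag anything far off the running average as a blip worth a second look. The device path is a placeholder and the 5x threshold is arbitrary.

/* Sketch of per-access timing while streaming a raw device with
 * fixed-size reads (illustrative only).  Flags accesses that take
 * much longer than the running average -- possible remaps, recals,
 * or other "blips".
 */
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <unistd.h>

#define BLK (1 << 20)                   /* fixed-size access: 1 MiB */

static double ms_now(void)
{
    struct timespec t;
    clock_gettime(CLOCK_MONOTONIC, &t);
    return t.tv_sec * 1e3 + t.tv_nsec / 1e6;
}

int main(int argc, char **argv)
{
    const char *dev = argc > 1 ? argv[1] : "/dev/sdX";  /* placeholder */
    int fd = open(dev, O_RDONLY);
    char *buf = malloc(BLK);
    double avg = 0.0;

    if (fd < 0 || !buf)
        return 1;
    for (long blk = 0; ; blk++) {
        double t0 = ms_now();
        ssize_t n = read(fd, buf, BLK);
        double dt = ms_now() - t0;

        if (n <= 0)
            break;                              /* end of device (or error) */
        if (blk > 16 && dt > 5.0 * avg)         /* crude "blip" detector */
            printf("block %ld: %.1f ms (running avg %.1f ms)\n", blk, dt, avg);
        avg = (blk == 0) ? dt : 0.95 * avg + 0.05 * dt;   /* EWMA */
    }
    free(buf);
    close(fd);
    return 0;
}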
I'll mock up some synthetic loads and try various thread-spawning
strategies to see the sorts of performance I *might* be able
to get -- with different "preexisting" media (to minimize my
impact on that).
I'm sure I can round up a dozen or more platforms to try -- just
from stuff I have lying around here! :>
The devil is not that black (literally translating a Bulgarian saying),
as you see. Worst fit allocation is of course crucial to getting
such figures; the mainstream OSes don't do it and things there
must be much worse.
Even on popular filesystems which have long forgotten how to do worst
fit allocation and have to defragment their disks not so infrequently.
But I think they have to access at least 3 locations to get to a file;
the directory entry, some kind of FAT-like thing, then the file.
Unlike dps, where 2 accesses are enough. And of course dps does
worst fit allocation, so defragmenting is just unnecessary.
I think directories are cached. And, possibly entire drive structures
(depending on how much physical RAM you have available).
Well of course they must be caching them, especially since there are gigabytes of RAM available.
I know what dps does: it caches longnamed
directories, which coexist with the old 8.4 ones in the same filesystem
and work faster than the 8.4 ones, which typically don't get cached
(these were done to work well even on floppies; a directory entry
update writes back only the sector(s) it occupies, i.e. both if it
crosses a sector boundary, etc.). Then in dps the CAT (cluster allocation tables) are cached all
the time (do that for a 500G partition and enjoy reading all
4 megabytes each time the CAT is needed to allocate new space...
it can be done; in fact the caches are enabled upon boot explicitly
on a per LUN/partition basis).
E.g., my disk sanitizer times each (fixed size) access to profile the
drive's performance as well as looking for trouble spots on the media.
But, things like recal cycles or remapping bad sectors introduce
completely unpredictable blips in the throughput. So much so that
I've had to implement a fair bit of logic to identify whether a
"delay" was part of normal operation *or* a sign of an exceptional
event.
[But, the sanitizer has a very predictable access pattern so
there are no filesystem/content-specific issues involved; just
process sectors as fast as possible. (Also, there is no
need to have multiple threads per spindle; just a thread *per*
spindle -- plus some overhead threads.)
And, the sanitizer isn't as concerned with throughput, since the
human operator is the bottleneck (I can crank out a 500+GB drive
every few minutes).]
I did something similar many years ago, when the largest drive
a nukeman had was 200 (230 IIRC) megabytes, i.e. before
magnetoresistive heads came to the world. It did develop
bad sectors and did not do much internally about it (1993).
So I wrote the "lockout" command (still available; I see I have
recompiled it for power, last change 2016 - can't remember if it
did anything useful, nor why I did that). It accessed, sector by
sector, the LUN it was told to and built the lockout CAT on
its filesystem (the LCAT being ORed into the CAT prior to the LUN being
usable for the OS). Took quite some time on that drive but
did the job back then.
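The idea in miniature (not the dps lockout code, just an illustration with made-up structures): probe each sector, set a bit for every unreadable one, then OR that bitmap into the allocation map so bad sectors look permanently allocated.

/* Miniature of the "lockout" idea: probe every sector, set a bit for
 * each one that can't be read, then OR the result into the allocation
 * bitmap so bad sectors are never handed out again.
 */
#include <fcntl.h>
#include <stdint.h>
#include <unistd.h>

#define SECTOR 512u

/* Scan `nsectors` of the raw device and build the lockout bitmap. */
int build_lockout(const char *dev, uint8_t *lockout, uint64_t nsectors)
{
    char buf[SECTOR];
    int fd = open(dev, O_RDONLY);

    if (fd < 0)
        return -1;
    for (uint64_t s = 0; s < nsectors; s++) {
        if (pread(fd, buf, SECTOR, (off_t)s * SECTOR) != (ssize_t)SECTOR)
            lockout[s / 8] |= (uint8_t)(1u << (s % 8));   /* mark bad */
    }
    close(fd);
    return 0;
}

/* Merge the lockout map into the allocation map, so bad sectors look
 * permanently allocated to the allocator. */
void apply_lockout(uint8_t *alloc_map, const uint8_t *lockout, uint64_t nsectors)
{
    for (uint64_t i = 0; i < (nsectors + 7) / 8; i++)
        alloc_map[i] |= lockout[i];
}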
I'll mock up some synthetic loads and try various thread-spawning
strategies to see the sorts of performance I *might* be able
to get -- with different "preexisting" media (to minimize my
impact on that).
I'm sure I can round up a dozen or more platforms to try -- just
from stuff I have lying around here! :>
I think this will give you plenty of an idea how to go about it.
Once you know the limit you can run at some reasonable figure
below it and be happy. Getting more precise figures about all
that is neither easy nor will it buy you anything.
On 10/18/2021 9:25 AM, Dimiter_Popoff wrote:
The devil is not that black (literally translating a Bulgarian saying),
as you see. Worst fit allocation is of course crucial to getting
such figures; the mainstream OSes don't do it and things there
must be much worse.
I think a lot depends on the amount of "churn" the filesystem
experiences, in normal operation. E.g., the "system" disk on
the workstation I'm using today has about 800G in use of 1T total.
But, the vast majority of it is immutable -- binaries, libraries,
etc. So, there's very low fragmentation (because I "build" the
disk in one shot, instead of incrementally revising and
"updating" its contents)
By contrast, the other disks in the machine all see a fair bit of
turnover as things get created, revised and deleted.
By contrast, the other disks in the machine all see a fair bit of
turnover as things get created, revised and deleted.
Now this is where the worst fit allocation strategy becomes the game
changer. A newly created file is almost certainly contiguously
allocated; fragmentation occurs when it is appended to (and the
space past its last block has already been allocated in the meantime).
I think I saw somewhere (never really got interested) that mainstream
operating systems of today do just first fit, which means once you
delete a file, no matter how small, its space will be allocated as
part of the next request, etc. No idea why they do it (if they do so;
my memory on that is not very certain) in such a primitive manner,
but here they are.
I think this will give you plenty of an idea how to go about it.
Once you know the limit you can run at some reasonable figure
below it and be happy. Getting more precise figures about all
that is neither easy nor will it buy you anything.
I suspect "1" is going to end up as the "best compromise". So,
I'm treating this as an exercise in *validating* that assumption.
I'll see if I can salvage some of the performance monitoring code
from the sanitizer to give me details from which I might be able
to ferret out "opportunities". If I start by restricting my
observations to non-destructive synthetic loads, then I can
pull a drive and see how it fares in a different host while
running the same code, etc.
On 10/18/2021 1:46 PM, Don Y wrote:
I think this will give you plenty of an idea how to go about it.
Once you know the limit you can run at some reasonable figure
below it and be happy. Getting more precise figures about all
that is neither easy nor will it buy you anything.
I suspect "1" is going to end up as the "best compromise". So,
I'm treating this as an exercise in *validating* that assumption.
I'll see if I can salvage some of the performance monitoring code
from the sanitizer to give me details from which I might be able
to ferret out "opportunities". If I start by restricting my
observations to non-destructive synthetic loads, then I can
pull a drive and see how it fares in a different host while
running the same code, etc.
Actually, '2' turns out to be marginally better than '1'.
Beyond that, it's hard to generalize without controlling some
of the other variables.
'2' wins because there is always the potential to make the
disk busy, again, just after it satisfies the access for
the 1st thread (which is now busy using the data, etc.)
But, if the first thread finishes up before the second thread's
request has been satisfied, then the presence of a THIRD thread
would just be clutter. (i.e., the work performed correlates
with the number of threads that can have value)
On 10/25/21 7:19 AM, Don Y wrote:
On 10/18/2021 1:46 PM, Don Y wrote:
I think this will give you plenty of an idea how to go about it.
Once you know the limit you can run at some reasonable figure
below it and be happy. Getting more precise figures about all
that is neither easy nor will it buy you anything.
I suspect "1" is going to end up as the "best compromise". So,
I'm treating this as an exercise in *validating* that assumption.
I'll see if I can salvage some of the performance monitoring code
from the sanitizer to give me details from which I might be able
to ferret out "opportunities". If I start by restricting my
observations to non-destructive synthetic loads, then I can
pull a drive and see how it fares in a different host while
running the same code, etc.
Actually, '2' turns out to be marginally better than '1'.
Beyond that, it's hard to generalize without controlling some
of the other variables.
'2' wins because there is always the potential to make the
disk busy, again, just after it satisfies the access for
the 1st thread (which is now busy using the data, etc.)
But, if the first thread finishes up before the second thread's
request has been satisfied, then the presence of a THIRD thread
would just be clutter. (i.e., the work performed correlates
with the number of threads that can have value)
Actually, 2 might be slower than 1, because the new request from the second thread is apt to need a seek, while a single thread making all the calls is more apt to sequentially read much more of the disk.
The controller, if not given a new chained command, might choose to automatically start reading the next sector of the cylinder, which is likely to be the next one asked for.
The real optimum is likely a single process doing asynchronous I/O, queuing
up a series of requests and then distributing the data as it comes in to processing threads to do whatever crunching needs to be done.
These threads then send the data to a single thread that does asynchronous writes of the results.
You can easily tell if the input or output processes are I/O bound or not and use that to adjust the number of crunching threads in the middle.
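A hedged sketch of that shape, reads only: one loop keeps a handful of POSIX AIO reads outstanding and hands completed buffers to a crunch() hook (which in a real program would feed the worker threads, with a symmetric writer on the far side). The queue depth, block size, file name and crunch() are all assumptions.

/* Sketch of the single-reader pipeline front end: keep QD reads in
 * flight with POSIX AIO and hand each completed buffer to crunch().
 * Illustrative only -- queue depth, sizes and the hook are made up.
 */
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define QD  4                           /* outstanding read requests */
#define BLK (1 << 20)                   /* 1 MiB per request */

static long long processed;
static void crunch(const char *buf, ssize_t len)   /* stand-in worker hook */
{
    (void)buf;
    processed += len;
}

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "input.dat";  /* placeholder */
    int fd = open(path, O_RDONLY);
    struct aiocb cb[QD];
    const struct aiocb *wait[QD];
    char *buf[QD];
    off_t next = 0;
    int live = 0;

    if (fd < 0)
        return 1;
    memset(cb, 0, sizeof cb);
    for (int i = 0; i < QD; i++) {                  /* prime the queue */
        buf[i] = malloc(BLK);
        cb[i].aio_fildes = fd;
        cb[i].aio_buf    = buf[i];
        cb[i].aio_nbytes = BLK;
        cb[i].aio_offset = next;
        next += BLK;
        if (aio_read(&cb[i]) == 0)
            live++;
        else
            cb[i].aio_fildes = -1;                  /* slot never started */
    }
    while (live > 0) {
        int nw = 0;
        for (int i = 0; i < QD; i++)                /* wait for any completion */
            if (cb[i].aio_fildes != -1)
                wait[nw++] = &cb[i];
        aio_suspend(wait, nw, NULL);

        for (int i = 0; i < QD; i++) {
            if (cb[i].aio_fildes == -1 || aio_error(&cb[i]) == EINPROGRESS)
                continue;
            ssize_t n = aio_return(&cb[i]);
            if (n > 0) {
                crunch(buf[i], n);                  /* hand off the data */
                cb[i].aio_offset = next;            /* immediately re-arm */
                next += BLK;
                if (aio_read(&cb[i]) == 0)
                    continue;
            }
            cb[i].aio_fildes = -1;                  /* EOF or error: retire */
            live--;
        }
    }
    printf("processed %lld bytes with up to %d reads in flight\n", processed, QD);
    close(fd);
    return 0;
}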
Dimiter_Popoff <dp@tgi-sci.com> wrote:
Now this is where the worst fit allocation strategy becomes the game
changer. A newly created file is almost certainly contiguously
allocated; fragmentation occurs when it is appended to (and the
space past its last block has been already allocated in the meantime).
I think I saw somewhere (never really got interested) that mainstream
operating systems of today do just first fit - which means once you
delete a file, no matter how small, its space will be allocated as
part of the next request etc., no idea why they do it (if they do so,
my memory on that is not very certain) in such a primitive manner
but here they are.
There was some research and first fit turned out to be pretty good.
But it is implemented slightly differently: there is a moving pointer,
and the search advances it. So, after deallocation you wait until the
moving pointer arrives at the hole. Way better than immediately filling
the hole: there is a reasonable chance that holes will coalesce into a
bigger free area. Another point is that you refuse allocation
when the disc is too full (say at 95% utilization). The two things
together mean that normally fragmentation is not a problem.
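That is essentially "next fit". A tiny sketch of the rotating pointer over a cluster bitmap (hypothetical structures, just to show the idea); the 95%-full refusal is then just a check on a free-cluster count before searching.

#include <stdbool.h>
#include <stddef.h>

#define NCLUSTERS 1024

static bool   used[NCLUSTERS];       /* true = cluster allocated */
static size_t cursor;                /* where the last search stopped */

/* Find `want` contiguous free clusters; return the first one, or -1. */
long alloc_next_fit(size_t want)
{
    size_t scanned = 0, run = 0, start = cursor;

    while (scanned < NCLUSTERS + want) {         /* one full lap, plus wrap slack */
        size_t i = (start + scanned) % NCLUSTERS;
        if (i == 0)
            run = 0;                             /* runs don't wrap past the end */
        if (!used[i]) {
            if (++run == want) {
                size_t first = i + 1 - want;
                for (size_t j = first; j <= i; j++)
                    used[j] = true;              /* claim the run */
                cursor = (i + 1) % NCLUSTERS;    /* resume AFTER this block */
                return (long)first;
            }
        } else {
            run = 0;
        }
        scanned++;
    }
    return -1;                                   /* disc (too) full */
}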