Forum: >>> Magnum BBS <<<

Stealing a Great Idea from the 6600

From John Savard@21:1/5 to All on Wed Apr 17 15:19:03 2024

Not that I expect Mitch Alsup to approve!

The 6600 had several I/O processors with a 12-bit word length that
were really one processor, basicallty using SMT.

Well, if I have a processor with an ISA that involves register banks
of 32 registers each... an alternate instruction set involving
register banks of 8 registers each would let me allocate either one
compute thread or four threads with the I/O processor instruction set.

And what would the I/O processor instruction set look like?

Think of the PDP-11 or the 9900 but give more impiortance to
floating-point. So I've come up with this format for a part of the
instruction set:

0 : 1 bit
(First two bits of opcode: 00, 01, or 10 but not 11): 2 bits
(remainder of opcode): 5 bits
(mode, not 11): 2 bits
(destination register): 3 bits
(source register): 3 bits

is the format of register-to-register instructions;

but memory-to-register instructions are load-store:

0: 1 bit
(first two bits of opcode: 00, 01, or 10 but not 11): 2 bits
(remainder of load/store opcode): 3 bits
(base register): 2 bits
(mode: 11): 2 bits
(destination register): 3 bits
(index register): 3 bits

(displacement): 16 bits

If the index register is zero, the instruction refers to memory, but
is not indexed, as usual.

An almost complete instruction set, using 3/8 of the available opcode
space. Subroutine call and branch instructions, of course, are still
also needed.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to John Savard on Wed Apr 17 21:50:26 2024

John Savard <quadibloc@servername.invalid> writes:

Not that I expect Mitch Alsup to approve!

The 6600 had several I/O processors with a 12-bit word length that
were really one processor, basicallty using SMT.

Well, if I have a processor with an ISA that involves register banks
of 32 registers each... an alternate instruction set involving
register banks of 8 registers each would let me allocate either one
compute thread or four threads with the I/O processor instruction set.

And what would the I/O processor instruction set look like?

On the Burroughs B4900, it looked a lot like an 8085. In fact,
it was an 8085.

Think of the PDP-11 or the 9900 but give more impiortance to
floating-point.

Why on earth would an I/O processor use floating point?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to All on Wed Apr 17 23:32:20 2024

While I much admire CDC 6600 PPs and how much work those puppies did
allowing the big number crunchers to <well> crunch numbers::

With modern technology allowing 32-128 CPUs on a single die--there is
no reason to limit the width of a PP to 12-bits (1965:: yes there was
ample reason:: 2024 no reason whatsoever.) There is little reason to
even do 32-bit PPs when it cost so little more to get a 64-bit core.

In 2005-6 I was looking into a Verilog full x86-64 core {less FP} so
that those smaller CPUs could run ISRs and kernel codes to offload the
big CPUs from I/O duties. Done in Verilog meant anyone could compile it
onto another die so the I/O CPUs were out on the PCIe tree nanoseconds
away from the peripherals rather than microseconds away. Close enough
to perform the DMA activities on behalf of the devices; and consuming interrupts so the bigger cores did not see any of them (except timer).

As Scott stated:: there does not seem to be any reason to need FP on a
core only doing I/O and kernel queueing services.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to All on Wed Apr 17 21:14:39 2024

On Wed, 17 Apr 2024 23:32:20 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

With modern technology allowing 32-128 CPUs on a single die--there is
no reason to limit the width of a PP to 12-bits (1965:: yes there was
ample reason:: 2024 no reason whatsoever.) There is little reason to
even do 32-bit PPs when it cost so little more to get a 64-bit core.

Well, I'm not. The PP instruction set I propose uses 16-bit and 32-bit instructions, and so uses the same bus as the main instruction set.

As Scott stated:: there does not seem to be any reason to need FP on a
core only doing I/O and kernel queueing services.

That's true.

This isn't about cores, though. Instead, a core running the main ISA
of the processor will simply have the option to replace one
regular-ISA thread by four threads which use 8 registers instead of
32, allowing SMT with more threads.

So we're talking about the same core. The additional threads will get
to execute instructions 1/4 as often as regular threads, so their
performance is reduced, matching an ISA that gives them fewer
registers.

Since the design is reminiscent of the 6600 PPs, these threads might
be used for I/O tasks, but nothing stops them from being used for
other purposes for which access to the FP capabilities of the chip may
be relevant.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to John Savard on Thu Apr 18 03:34:31 2024

On Wed, 17 Apr 2024 15:19:03 -0600, John Savard wrote:

The 6600 had several I/O processors with a 12-bit word length that were really one processor, basicallty using SMT.

Originally these “PPUs” (“Peripheral Processor Units”) were for running the OS, while the main CPU was primarily dedicated to running user
programs.

Aparently this idea did not work out so well, and in later versions of the
OS, more code ran on the CPU instead of the PPUs.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu Apr 18 16:59:18 2024

Lawrence D'Oliveiro wrote:

On Wed, 17 Apr 2024 15:19:03 -0600, John Savard wrote:

The 6600 had several I/O processors with a 12-bit word length that were
really one processor, basicallty using SMT.

Originally these “PPUs” (“Peripheral Processor Units”) were for running
the OS,

Including polling DMA performed by the PPS.

while the main CPU was primarily dedicated to running user
programs.

Aparently this idea did not work out so well, and in later versions of the OS, more code ran on the CPU instead of the PPUs.

Imagine that 10×12-bit CPUs, running 1/10 frequency of the main CPU, having
a hard time performing OS workloads while the 50× faster CPU cores perform user workloads.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to John Savard on Thu Apr 18 16:55:37 2024

John Savard wrote:

On Wed, 17 Apr 2024 23:32:20 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

With modern technology allowing 32-128 CPUs on a single die--there is
no reason to limit the width of a PP to 12-bits (1965:: yes there was
ample reason:: 2024 no reason whatsoever.) There is little reason to
even do 32-bit PPs when it cost so little more to get a 64-bit core.

Well, I'm not. The PP instruction set I propose uses 16-bit and 32-bit instructions, and so uses the same bus as the main instruction set.

As Scott stated:: there does not seem to be any reason to need FP on a
core only doing I/O and kernel queueing services.

That's true.

This isn't about cores, though. Instead, a core running the main ISA
of the processor will simply have the option to replace one
regular-ISA thread by four threads which use 8 registers instead of
32, allowing SMT with more threads.

The hard thing is to run the Operating System in the PPs using the same compiled code in either a big core or in a little core. The big cores
are on a CPU centric die, the little ones out on device oriented dies.
In 7nm a MIPS R2000 is less than 0.07mm^2 using std cells. At this size
every device can have its own core.

So we're talking about the same core. The additional threads will get
to execute instructions 1/4 as often as regular threads, so their
performance is reduced, matching an ISA that gives them fewer
registers.

I knew you were talking about it that way, I was trying to get you to
change your mind and use the same ISA in the device cores as you use
in the CPU cores so you can run the same OS code and even a bit of the
device drivers as well.

Since the design is reminiscent of the 6600 PPs, these threads might
be used for I/O tasks, but nothing stops them from being used for
other purposes for which access to the FP capabilities of the chip may
be relevant.

Yes, exactly, and it is for those other purposes that you want these
device cores to operate on the same ISA as the big cores. This way if
anything goes wrong, you can simply lob the code back to a CPU centric
core and finish the job.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to All on Thu Apr 18 23:42:15 2024

On Thu, 18 Apr 2024 16:55:37 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

Yes, exactly, and it is for those other purposes that you want these
device cores to operate on the same ISA as the big cores. This way if >anything goes wrong, you can simply lob the code back to a CPU centric
core and finish the job.

If the design has P-cores and E-cores, both will have the same *pair*
of ISAs.

Code written in the big ISA will run on both kinds of core, and code
written in the little ISA will also run on both kinds of core, but use
less resources on whichever core it is placed.

So I won't have _that_ problem.

Each core can just switch between compute duty with N threads, and I/O
service duty with 4*N threads - or anywhere in between.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to quadibloc@servername.invalid on Fri Apr 19 01:38:45 2024

On Thu, 18 Apr 2024 23:42:15 -0600, John Savard
<quadibloc@servername.invalid> wrote:

Each core can just switch between compute duty with N threads, and I/O >service duty with 4*N threads - or anywhere in between.

So I hope it is clear now I'm talking about SMT threads, not cores.
Threads are orthogonal to cores.

But I did make one oversimplification that could be confusing.

The full instruction set assumes banks of 32 registers, one each for
integer and floats, the reduced instruction set assumes banks of 8
registers, one each for integer and floats.

So one thread of the full ISA can be replaced by four threads of the
reduced ISA, both use the same number of registes.

That's all right for an in-order design. But in real life, computers
are out-of-order. So the *rename* registers would have to be split up.

Since the reduced ISA threads are four times greater in number, their instructions have four times longer to finish executing before their
thread gets a chance to execute again. So presumably reduced ISA
threads will need less agressive OoO, and 1/4 the rename registers
might be adequate, but there's obviously no guarantee that this would
indeed be an ideal fit.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to John Savard on Fri Apr 19 18:40:45 2024

John Savard wrote:

On Thu, 18 Apr 2024 23:42:15 -0600, John Savard <quadibloc@servername.invalid> wrote:

Each core can just switch between compute duty with N threads, and I/O >>service duty with 4*N threads - or anywhere in between.

So I hope it is clear now I'm talking about SMT threads, not cores.
Threads are orthogonal to cores.

That was already clear.

But I did make one oversimplification that could be confusing.

The full instruction set assumes banks of 32 registers, one each for
integer and floats, the reduced instruction set assumes banks of 8
registers, one each for integer and floats.

So one thread of the full ISA can be replaced by four threads of the
reduced ISA, both use the same number of registes.

So how does a 32-register thread "call" an 8 register thread ?? or vice
versa ??

What ABI model does the compiler use ??

When an 8-register thread takes an exception, is it handled by a 8-reg
thread or a 32-register thread ??

That's all right for an in-order design. But in real life, computers
are out-of-order. So the *rename* registers would have to be split up.

In K9 we unified the x86 register files into a single file to simplify
HW maintenance of the OoO state.

Since the reduced ISA threads are four times greater in number, their instructions have four times longer to finish executing before their
thread gets a chance to execute again.

Now all that forwarding logic is wasting its gates of delay and area
without adding any performance.

Now all those instruction schedulers are sitting around doing nothing.

So presumably reduced ISA
threads will need less agressive OoO, and 1/4 the rename registers
might be adequate, but there's obviously no guarantee that this would
indeed be an ideal fit.

LoL.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to All on Sat Apr 20 01:09:53 2024

On Fri, 19 Apr 2024 18:40:45 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

So presumably reduced ISA
threads will need less agressive OoO, and 1/4 the rename registers
might be adequate, but there's obviously no guarantee that this would
indeed be an ideal fit.

LoL.

Well, yes. The fact that pretty much all serious high-performance
designs these days _are_ OoO basically means that my brilliant idea is
DoA.

Of course, instead of replacing 1 full-ISA thread with 4 light-ISA
threads, one could use a different number, based on what is optimum
for a given implementation. But that ratio would now vary from one
chip to another, being model-dependent.

So it's not *totally* destroyed, but this is still a major blow.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to quadibloc@servername.invalid on Sat Apr 20 01:12:25 2024

On Sat, 20 Apr 2024 01:09:53 -0600, John Savard
<quadibloc@servername.invalid> wrote:

On Fri, 19 Apr 2024 18:40:45 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

So presumably reduced ISA
threads will need less agressive OoO, and 1/4 the rename registers
might be adequate, but there's obviously no guarantee that this would
indeed be an ideal fit.

LoL.

Well, yes. The fact that pretty much all serious high-performance
designs these days _are_ OoO basically means that my brilliant idea is
DoA.

Of course, instead of replacing 1 full-ISA thread with 4 light-ISA
threads, one could use a different number, based on what is optimum
for a given implementation. But that ratio would now vary from one
chip to another, being model-dependent.

So it's not *totally* destroyed, but this is still a major blow.

And, hey, I'm not the first guy to get sunk because of forgetting what
lies under the tip of the iceberg that's above the water.

That also happened to the captain of the _Titanic_.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to All on Sat Apr 20 01:06:33 2024

On Fri, 19 Apr 2024 18:40:45 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

So how does a 32-register thread "call" an 8 register thread ?? or vice
versa ??

That sort of thing would be done by supervisor mode instructions,
similar to the ones used to start additional threads on a given core,
or start threads on a new core.

Since the lightweight ISA has the benefit of having fewer registers
allocated, it's not the same as, slay, a "thumb mode" which offers
more compact code as its benefit. Instead, this is for use in classes
of threads that are separate from ordinary code.

I/O processing threads being one example of this.

The intent of this kind of lightweight ISA is to reduce the temptation
to decide "oh, we've got to put special smaller cores in the SoC/on
the motherboard to perform this specialized task, because the main CPU
is overkill". Because now you're using a smaller slice of the main
CPU, so it's not a waste to do it there any more.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to John Savard on Sat Apr 20 17:07:11 2024

John Savard wrote:

On Sat, 20 Apr 2024 01:09:53 -0600, John Savard <quadibloc@servername.invalid> wrote:

And, hey, I'm not the first guy to get sunk because of forgetting what
lies under the tip of the iceberg that's above the water.

That also happened to the captain of the _Titanic_.

Concer-tina-tanic !?!

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Sat Apr 20 22:03:21 2024

BGB wrote:

On 4/20/2024 12:07 PM, MitchAlsup1 wrote:

John Savard wrote:

On Sat, 20 Apr 2024 01:09:53 -0600, John Savard
<quadibloc@servername.invalid> wrote:

And, hey, I'm not the first guy to get sunk because of forgetting what
lies under the tip of the iceberg that's above the water.

That also happened to the captain of the _Titanic_.

Concer-tina-tanic !?!

Seems about right.
Seems like a whole lot of flailing with designs that seem needlessly complicated...

Meanwhile, has looked around and noted:
In some ways, RISC-V is sort of like MIPS with the field order reversed,

They, in effect, Litle-Endian-ed the fields.

and (ironically) actually smaller immediate fields (MIPS was using a lot
of Imm16 fields. whereas RISC-V mostly used Imm12).

Yes, RISC-V took a step back with the 12-bit immediates. My 66000, on
the other hand, only has 12-bit immediates for shift instructions--
allowing all shifts to reside in one Major OpCode; the rest inst[31]=1
have 16-bit immediates (universally sign extended).

But, seemed to have more wonk:
A mode with 32x 32-bit GPRs; // unnecessary
A mode with 32x 64-bit GPRs;
Apparently a mode with 32x 32-bit GPRs that can be paired to 16x 64-bits
as needed for 64-bit operations?...

Repeating the mistake I made on Mc 88100....

Integer operations (on 64-bit registers) that give UB or trap if values
are outside of signed Int32 range;

Isn't it just wonderful ??

Other operations that sign-extend the values but are ironically called "unsigned" (apparently, similar wonk to RISC-V by having signed-extended Unsigned Int);
Branch operations are bit-sliced;
....

I had preferred a different strategy in some areas:
Assume non-trapping operations by default;

Assume trap/"do the expected thing" under a user accessible flag.

Sign-extend signed values, zero-extend unsigned values.

Another mistake I mad in Mc 88100.

Do you sign extend the 16-bit displacement on an unsigned LD ??

Though, this is partly the source of some operations in my case assuming
33 bit sign-extended: This can represent both the signed and unsigned
32-bit ranges.

These are some of the reasons My 66000 is 64-bit register/calculation only.

One could argue that sign-extending both could save 1 bit in some cases.
But, this creates wonk in other cases, such as requiring an explicit
zero extension for "unsigned int" to "long long" casts; and more cases
where separate instructions are needed for Int32 and Int64 cases (say,
for example, RISC-V needed around 4x as many Int<->Float conversion
operators due to its design choices in this area).

It also gets difficult when you consider EADD Rd,Rdouble,Rexponent ??
is it a FP calculation or an integer calculation ?? If Rdouble is a
constant is the constant FP or int, if Rexponent is a constant is it
double or int,..... Does it raise FP overflow or integer overflow ??

Say:
RV64:
Int32<->Binary32, UInt32<->Binary32
Int64<->Binary32, UInt64<->Binary32
Int32<->Binary64, UInt32<->Binary64
Int64<->Binary64, UInt64<->Binary64
BJX2:
Int64<->Binary64, UInt64<->Binary64

My 66000:
int64_t -> { uint64_t, float, double }
uint64_t -> { int64_t, float, double }
float -> { uint64_t, int64_t, double }
double -> { uint64_t, int64_t, float }

With the Uint64 case mostly added because otherwise one needs a wonky
edge case to deal with this (but is rare in practice).

The separate 32-bit cases were avoided by tending to normalize
everything to Binary64 in registers (with Binary32 only existing in SIMD
form or in memory).

I saved LD and ST instructions by leaving float 32-bits in the registers.

Annoyingly, I did end up needing to add logic for all of these cases to
deal with RV64G.

No rest for the wicked.....

Currently no plans to implement RISC-V's Privileged ISA stuff, mostly
because it would likely be unreasonably expensive.

The sea of control registers or the sequencing model applied thereon ??
My 66000 allows access to all control registers via memory mapped I/O
space.

It is in theory
possible to write an OS to run in RISC-V mode, but it would need to deal
with the different OS level and hardware-level interfaces (in much the
same way, as I needed to use a custom linker script for GCC, as my stuff
uses a different memory map from the one GCC had assumed; namely that of
RAM starting at the 64K mark, rather than at the 16MB mark).

In some cases in my case, there are distinctions between 32-bit and
64-bit compare-and-branch ops. I am left thinking this distinction may
be unnecessary, and one may only need 64 bit compare and branch.

No 32-bit stuff, thereby no 32-bit distinctions needed.

In the emulator, the current difference ended up mostly that the 32-bit version sees if the 32-bit and 64-bit version would give a different
result and faulting if so, since this generally means that there is a
bug elsewhere (such as other code is producing out-of-range values).

Saving vast amounts of power {{{not}}}

For a few newer cases (such as the 3R compare ops, which produce a 1-bit output in a register), had only defined 64-bit versions.

Oh what a tangled web we.......

One could just ignore the distinction between 32 and 64 bit compare in hardware, but had still burnt the encoding space on this. In a new ISA design, I would likely drop the existence of 32-bit compare and use exclusively 64-bit compare.

In many cases, the distinction between 32-bit and 64-bit operations, or between 2R and 3R cases, had ended up less significant than originally thought (and now have ended up gradually deprecating and disabling some
of the 32-bit 2R encodings mostly due to "lack of relevance").

I deprecated all of them.

Though, admittedly, part of the reason for a lot of separate 2R cases existing was that I had initially had the impression that there may have
been a performance cost difference between 2R and 3R instructions. This
ended up not really the case, as the various units ended up typically
using 3R internally anyways.

So, say, one needs an ALU with, say:
2 inputs, one output;

you forgot carry, and inversion to perform subtraction.

Ability to bit-invert the second input
along with inverting carry-in, ...
Ability to sign or zero extend the output.

So, My 66000 integer adder has 3 carry inputs, and I discovered a way to perform these that takes no more gates of delay than the typical 1-carry
in 64-bit integer adder. This gives me a = -b -c; for free.

So, say, operations:
ADD / SUB (Add, 64-bit)
ADDSL / SUBSL (Add, 32-bit, sign extent) // nope
ADDUL / SUBUL (Add, 32-bit, zero extent) // nope
AND
OR
XOR
CMPEQ // 1 ICMP inst
CMPNE
CMPGT (CMPLT implicit)
CMPGE (CMPLE implicit)
CMPHI (unsigned GT)
CMPHS (unsigned GE)
....

Where, internally compare works by performing a subtract and then
producing a result based on some status bits (Z,C,S,O). As I see it,
ideally these bits should not be exposed at the ISA level though (much
pain and hair results from the existence of architecturally visible ALU status-flag bits).

I agree that these flags should not be exposed through ISA; and I did not.
On the other hand multi-precision arithmetic demands at least carry {or
some other means which is even more powerful--such as CARRY.....}

Some other features could still be debated though, along with how much simplification could be possible.

If I did a new design, would probably still keep predication and jumbo prefixes.

I kept predication but not the way most predication works.
My work on Mc 88120 and K9 taught me the futility of things in the
instruction stream that provide artificial boundaries. I have a suspicion
that if you have the FPGA capable of allowing you to build a 8-wide
machine, you would do the jumbo stuff differently, too.

Explicit bundling vs superscalar could be argued either way, as
superscalar isn't as expensive as initially thought, but in a simpler
form is comparably weak (the compiler has an advantage that it can
invest more expensive analysis into this, reorder instructions, etc; but
this only goes so far as the compiler understands the CPU's pipeline,

Compilers are notoriously unable to outguess a good branch predictor.

ties the code to a specific pipeline structure, and becomes effectively
moot with OoO CPU designs).

OoO exists, in a practical sense, to abstract the pipeline out of the
compiler; or conversely, to allow multiple implementations to run the
same compiled code optimally on each implementation.

So, a case could be made that a "general use" ISA be designed without
the use of explicit bundling. In my case, using the bundle flags also requires the code to use an instruction to signal to the CPU what configuration of pipeline it expects to run on, with the CPU able to
fall back to scalar (or superscalar) execution if it does not match.

Sounds like a bridge too far for your 8-wide GBOoO machine.

For the most part, thus far nearly everything has ended up as "Mode 2", namely:
3 lanes;
Lane 1 does everything;
Lane 2 does Basic ALU ops, Shift, Convert (CONV), ...
Lane 3 only does Basic ALU ops and a few CONV ops and similar.
Lane 3 originally also did Shift, dropped to reduce cost.
Mem ops may eat Lane 3, ...

Try 6-lanes:
1,2,3 Memory ops + integer ADD and Shifts
4 FADD ops + integer ADD and FMisc
5 FMAC ops + integer ADD
6 CMP-BR ops + integer ADD

Where, say:
Mode 0 (Default):
Only scalar code is allowed, CPU may use superscalar (if available).
Mode 1:
2 lanes:
Lane 1 does everything;
Lane 2 does ALU, Shift, and CONV.
Mem ops take up both lanes.
Effectively scalar for Load/Store.
Later defined that 128-bit MOV.X is allowed in a Mode 1 core.

Modeless.

Had defined wider modes, and ones that allow dual-lane IO and FPU instructions, but these haven't seen use (too expensive to support in hardware).

Had ended up with the ambiguous "extension" to the Mode 2 rules of
allowing an FPU instruction to be executed from Lane 2 if there was not
an FPU instruction in Lane 1, or allowing co-issuing certain FPU
instructions if they effectively combine into a corresponding SIMD op.

In my current configurations, there is only a single memory access port.

This should imply that your 3-wide pipeline is running at 90%-95%
memory/cache saturation.

A second memory access port would help with performance, but is
comparably a rather expensive feature (and doesn't help enough to
justify its fairly steep cost).

For lower-end cores, a case could be made for assuming a 1-wide CPU with
a 2R1W register file, but designing the whole ISA around this limitation
and not allowing for anything more is limiting (and mildly detrimental
to performance). If we can assume cores with an FPU, we can probably
also assume cores with more than two register read ports available.

If you design around the notion of a 3R1W register file, FMAC and INSERT
fall out of the encoding easily. Done right, one can switch it into a 4R
or 4W register file for ENTER and EXIT--lessening the overhead of call/ret.

....

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to All on Sat Apr 20 17:59:12 2024

On Sat, 20 Apr 2024 22:03:21 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

BGB wrote:

Sign-extend signed values, zero-extend unsigned values.

Another mistake I mad in Mc 88100.

As that is a mistake the IBM 360 made, I make it too. But I make it
the way the 360 did: there are no signed and unsigned values, in the
sense of a Burroughs machine, there are just Load, Load Unsigned - and
Insert - instructions.

Index and base register values are assumed to be unsigned.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to quadibloc@servername.invalid on Sat Apr 20 18:01:49 2024

On Sat, 20 Apr 2024 01:06:33 -0600, John Savard
<quadibloc@servername.invalid> wrote:

On Fri, 19 Apr 2024 18:40:45 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

So how does a 32-register thread "call" an 8 register thread ?? or vice >>versa ??

That sort of thing would be done by supervisor mode instructions,
similar to the ones used to start additional threads on a given core,
or start threads on a new core.

Since the lightweight ISA has the benefit of having fewer registers >allocated, it's not the same as, slay, a "thumb mode" which offers
more compact code as its benefit. Instead, this is for use in classes
of threads that are separate from ordinary code.

I/O processing threads being one example of this.

Of course, though, there's nothing preventing using the lightweight
ISA as the basic for something that _could_ interoperate with the full
ISA. Keep all 32 registers in each bank, and have a sliding 8-register
window, or use bundles of instructions, say up to seven instructions,
using one of three groups of eight integer registers and one of four
groups of floating-point registers. (The fourth group of integer
registers is the base registers.)

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to quadibloc@servername.invalid on Sat Apr 20 18:06:22 2024

On Sat, 20 Apr 2024 17:59:12 -0600, John Savard
<quadibloc@servername.invalid> wrote:

On Sat, 20 Apr 2024 22:03:21 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

BGB wrote:

Sign-extend signed values, zero-extend unsigned values.

Another mistake I mad in Mc 88100.

As that is a mistake the IBM 360 made, I make it too. But I make it
the way the 360 did: there are no signed and unsigned values, in the
sense of a Burroughs machine, there are just Load, Load Unsigned - and
Insert - instructions.

Since there was only one set of arithmetic instrucions, that meant
that when you wrote code to operate on unsigned values, you had to
remember that the normal names of the condition code values were
oriented around signed arithmetic.

So during unsigned arithmetic, "overflow" didn't _mean_ overflow.
Instead, carry was overflow.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to John Savard on Sun Apr 21 00:43:21 2024

John Savard wrote:

On Sat, 20 Apr 2024 22:03:21 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

BGB wrote:

Sign-extend signed values, zero-extend unsigned values.

Another mistake I mad in Mc 88100.

As that is a mistake the IBM 360 made, I make it too. But I make it
the way the 360 did: there are no signed and unsigned values, in the
sense of a Burroughs machine, there are just Load, Load Unsigned - and
Insert - instructions.

Index and base register values are assumed to be unsigned.

I would use the term signless as opposed to unsigned.

Address arithmetic is ADD only and does not care about signs or
overflow. There is no concept of a negative base register or a
negative index register (or, for that matter, a negative displace-
ment), overflow, underflow, carry, ...

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Sun Apr 21 18:57:27 2024

BGB wrote:

On 4/20/2024 5:03 PM, MitchAlsup1 wrote:

BGB wrote:

Compilers are notoriously unable to outguess a good branch predictor.

Errm, assuming the compiler is capable of things like general-case
inlining and loop-unrolling.

I was thinking of simpler things, like shuffling operators between independent (sub)expressions to limit the number of register-register dependencies.

Like, in-order superscalar isn't going to do crap if nearly every
instruction depends on every preceding instruction. Even pipelining
can't help much with this.

Pipelining CREATED this (back to back dependencies). No amount of
pipelining can eradicate RAW data dependencies.

The compiler can shuffle the instructions into an order to limit the
number of register dependencies and better fit the pipeline. But, then,
most of the "hard parts" are already done (so it doesn't take much more
for the compiler to flag which instructions can run in parallel).

Compiler scheduling works for exactly 1 pipeline implementation and
is suboptimal for all others.

Meanwhile, a naive superscalar may miss cases that could be run in
parallel, if it is evaluating the rules "coarsely" (say, evaluating what
is safe or not safe to run things in parallel based on general groupings
of opcodes rather than the rules of specific opcodes; or, say,
false-positive register alias if, say, part of the Imm field of a 3RI instruction is interpreted as a register ID, ...).

Granted, seemingly even a naive approach is able to get around 20% ILP
out of "GCC -O3" output for RV64G...

But, the GCC output doesn't seem to be quite as weak as some people are claiming either.

ties the code to a specific pipeline structure, and becomes
effectively moot with OoO CPU designs).

OoO exists, in a practical sense, to abstract the pipeline out of the
compiler; or conversely, to allow multiple implementations to run the
same compiled code optimally on each implementation.

Granted, but OoO isn't cheap.

But it does get the job done.

So, a case could be made that a "general use" ISA be designed without
the use of explicit bundling. In my case, using the bundle flags also
requires the code to use an instruction to signal to the CPU what
configuration of pipeline it expects to run on, with the CPU able to
fall back to scalar (or superscalar) execution if it does not match.

Sounds like a bridge too far for your 8-wide GBOoO machine.

For sake of possible fancier OoO stuff, I upheld a basic requirement for
the instruction stream:
The semantics of the instructions as executed in bundled order needs to
be equivalent to that of the instructions as executed in sequential order.

In this case, the OoO CPU can entirely ignore the bundle hints, and
treat "WEXMD" as effectively a NOP.

This would have broken down for WEX-5W and WEX-6W (where enforcing a parallel==sequential constraint effectively becomes unworkable, and/or renders the wider pipeline effectively moot), but these designs are
likely dead anyways.

And, with 3-wide, the parallel==sequential order constraint remains in effect.

For the most part, thus far nearly everything has ended up as "Mode
2", namely:
   3 lanes;
     Lane 1 does everything;
     Lane 2 does Basic ALU ops, Shift, Convert (CONV), ...
     Lane 3 only does Basic ALU ops and a few CONV ops and similar.
       Lane 3 originally also did Shift, dropped to reduce cost.
     Mem ops may eat Lane 3, ...

Try 6-lanes:
   1,2,3 Memory ops + integer ADD and Shifts
   4     FADD   ops + integer ADD and FMisc
   5     FMAC   ops + integer ADD
   6     CMP-BR ops + integer ADD

As can be noted, my thing is more a "LIW" rather than a "true VLIW".

Mine is neither LIW or VLIW but it definitely is LBIO through GBOoO

So, MEM/BRA/CMP/... all end up in Lane 1.

Lanes 2/3 effectively ending up used for fold over most of the ALU ops turning Lane 1 mostly into a wall of Load and Store instructions.

Where, say:
   Mode 0 (Default):
     Only scalar code is allowed, CPU may use superscalar (if available).
   Mode 1:
     2 lanes:
       Lane 1 does everything;
       Lane 2 does ALU, Shift, and CONV.
     Mem ops take up both lanes.
       Effectively scalar for Load/Store.
       Later defined that 128-bit MOV.X is allowed in a Mode 1 core. >> Modeless.

Had defined wider modes, and ones that allow dual-lane IO and FPU
instructions, but these haven't seen use (too expensive to support in
hardware).

Had ended up with the ambiguous "extension" to the Mode 2 rules of
allowing an FPU instruction to be executed from Lane 2 if there was
not an FPU instruction in Lane 1, or allowing co-issuing certain FPU
instructions if they effectively combine into a corresponding SIMD op.

In my current configurations, there is only a single memory access port.

This should imply that your 3-wide pipeline is running at 90%-95%
memory/cache saturation.

If you mean that execution is mostly running end-to-end memory
operations, yeah, this is basically true.

Comparably, RV code seems to end up running a lot of non-memory ops in
Lane 1, whereas BJX2 is mostly running lots of memory ops, with Lane 2 handling most of the ALU ops and similar (and Lane 3, occasionally).

One of the things that I notice with My 66000 is when you get all the
constants you ever need at the calculation OpCodes, you end up with
FEWER instructions that "go random places" such as instructions that
<well> paste constants together. This leave you with a data dependent
string of calculations with occasional memory references. That is::
universal constants gets rid of the easy to pipeline extra instructions
leaving the meat of the algorithm exposed.

If you design around the notion of a 3R1W register file, FMAC and INSERT
fall out of the encoding easily. Done right, one can switch it into a 4R
or 4W register file for ENTER and EXIT--lessening the overhead of call/ret. >>

Possibly.

It looks like some savings could be possible in terms of prologs and
epilogs.

As-is, these are generally like:
MOV LR, R18
MOV GBR, R19
ADD -192, SP
MOV.X R18, (SP, 176) //save GBR and LR
MOV.X ... //save registers

Why not an instruction that saves LR and GBR without wasting instructions
to place them side by side prior to saving them ??

WEXMD 2 //specify that we want 3-wide execution here

//Reload GBR, *1
MOV.Q (GBR, 0), R18
MOV 0, R0 //special reloc here
MOV.Q (GBR, R0), R18
MOV R18, GBR

It is gorp like that that lead me to do it in HW with ENTER and EXIT.
Save registers to the stack, setup FP if desired, allocate stack on SP,
and decide if EXIT also does RET or just reloads the file. This would
require 2 free registers if done in pure SW, along with several MOVs...

//Generate Stack Canary, *2
MOV 0x5149, R18 //magic number (randomly generated)
VSKG R18, R18 //Magic (combines input with SP and magic numbers)
MOV.Q R18, (SP, 144)

...
function-specific stuff
...

MOV 0x5149, R18
MOV.Q (SP, 144), R19
VSKC R18, R19 //Validate canary
...

*1: This part ties into the ABI, and mostly exists so that each PE image
can get GBR reloaded back to its own ".data"/".bss" sections (with

Universal displacements make GBR unnecessary as a memory reference can
be accompanied with a 16-bit, 32-bit, or 64-bit displacement. Yes, you
can read GOT[#i] directly without a pointer to it.

multiple program instances in a single address space). But, does mean
that pretty much every non-leaf function ends up needing to go through
this ritual.

Universal constant solves the underlying issue.

*2: Pretty much any function that has local arrays or similar, serves to protect register save area. If the magic number can't regenerate a
matching canary at the end of the function, then a fault is generated.

My 66000 can place the callee save registers in a place where user cannot access them with LDs or modify them with STs. So malicious code cannot
damage the contract between ABI and core.

The cost of some of this starts to add up.

In isolation, not much, but if all this happens, say, 500 or 1000 times
or more in a program, this can add up.

Was thinking about that last night. H&P "book" statistics say that call/ret represents 2% of instructions executed. But if you add up the prologue and epilogue instructions you find 8% of instructions are related to calling
and returning--taking the problem from (at 2%) ignorable to (at 8%) a big ticket item demanding something be done.

8% represents saving/restoring only 3 registers vis stack and associated SP arithmetic. So, it can easily go higher.

....

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Sun Apr 21 23:31:55 2024

BGB wrote:

On 4/21/2024 1:57 PM, MitchAlsup1 wrote:

BGB wrote:

One of the things that I notice with My 66000 is when you get all the
constants you ever need at the calculation OpCodes, you end up with
FEWER instructions that "go random places" such as instructions that
<well> paste constants together. This leave you with a data dependent
string of calculations with occasional memory references. That is::
universal constants gets rid of the easy to pipeline extra instructions
leaving the meat of the algorithm exposed.

Possibly true.

RISC-V tends to have a lot of extra instructions due to lack of big
constants and lack of indexed addressing.

You forgot the "every one an his brother" design of the ISA>

And, BJX2 has a lot of frivolous register-register MOV instructions.

I empower you to get rid of them....
<snip>

If you design around the notion of a 3R1W register file, FMAC and INSERT >>>> fall out of the encoding easily. Done right, one can switch it into a 4R >>>> or 4W register file for ENTER and EXIT--lessening the overhead of
call/ret.

Possibly.

It looks like some savings could be possible in terms of prologs and
epilogs.

As-is, these are generally like:
   MOV    LR, R18
   MOV    GBR, R19
   ADD    -192, SP
   MOV.X R18, (SP, 176) //save GBR and LR
   MOV.X ... //save registers

Why not an instruction that saves LR and GBR without wasting instructions
to place them side by side prior to saving them ??

I have an optional MOV.C instruction, but would need to restructure the
code for generating the prologs to make use of them in this case.

Say:
MOV.C GBR, (SP, 184)
MOV.C LR, (SP, 176)

Though, MOV.C is considered optional.

There is a "MOV.C Lite" option, which saves some cost by only allowing
it for certain CR's (mostly LR and GBR), which also sort of overlaps
with (and is needed) by RISC-V mode, because these registers are in GPR
land for RV.

But, in any case, current compiler output shuffles them to R18 and R19
before saving them.

   WEXMD 2 //specify that we want 3-wide execution here

   //Reload GBR, *1
   MOV.Q (GBR, 0), R18
   MOV    0, R0 //special reloc here
   MOV.Q (GBR, R0), R18
   MOV    R18, GBR

Correction:

MOV.Q (R18, R0), R18

It is gorp like that that lead me to do it in HW with ENTER and EXIT.
Save registers to the stack, setup FP if desired, allocate stack on SP,
and decide if EXIT also does RET or just reloads the file. This would
require 2 free registers if done in pure SW, along with several MOVs...

Possibly.
The partial reason it loads into R0 and uses R0 as an index, was that I defined this mechanism before jumbo prefixes existed, and hadn't updated
it to allow for jumbo prefixes.

No time like the present...

Well, and if I used a direct displacement for GBR (which, along with PC,
is always BYTE Scale), this would have created a hard limit of 64 DLL's
per process-space (I defined it as Disp24, which allows a more
reasonable hard upper limit of 2M DLLs per process-space).

In my case, restricting myself to 32-bit IP relative addressing, GOT can
be anywhere within ±2GB of the accessing instruction and can be as big as
one desires.

Granted, nowhere near even the limit of 64 as of yet. But, I had noted
that Windows programs would often easily exceed this limit, with even a fairly simple program pulling in a fairly large number of random DLLs,
so in any case, a larger limit was needed.

Due to the way linkages work in My 66000, each DLL gets its own GOT.
So there is essentially no bounds on how many can be present/in-use.
A LD of a GOT[entry] gets a pointer to the external variable.
A CALX of GOT[entry] is a call through the GOT table using std ABI.
{{There is no PLT}}

One potential optimization here is that the main EXE will always be 0 in
the process, so this sequence could be reduced to, potentially:
MOV.Q (GBR, 0), R18
MOV.C (R18, 0), GBR

Early on, I did not have the constraint that main EXE was always 0, and
had initially assumed it would be treated equivalently to a DLL.

   //Generate Stack Canary, *2
   MOV    0x5149, R18 //magic number (randomly generated)
   VSKG   R18, R18 //Magic (combines input with SP and magic numbers) >>>    MOV.Q R18, (SP, 144)

   ...
   function-specific stuff
   ...

   MOV    0x5149, R18
   MOV.Q (SP, 144), R19
   VSKC   R18, R19 //Validate canary
   ...

*1: This part ties into the ABI, and mostly exists so that each PE
image can get GBR reloaded back to its own ".data"/".bss" sections (with

Universal displacements make GBR unnecessary as a memory reference can
be accompanied with a 16-bit, 32-bit, or 64-bit displacement. Yes, you
can read GOT[#i] directly without a pointer to it.

If I were doing a more conventional ABI, I would likely use (PC,
Disp33s) for accessing global variables.

Even those 128GB away ??

Problem is:
What if one wants multiple logical instances of a given PE image in a
single address space?

Not a problem when each PE has a different set of mapping tables (at least
the entries pointing at GOTs[*].

PC REL breaks in this case, unless you load N copies of each PE image,
which is a waste of memory (well, or use COW mappings, mandating the use
of an MMU).

ELF FDPIC had used a different strategy, but then effectively turned
each function call into something like (in SH):
MOV R14, R2 //R14=GOT
MOV disp, R0 //offset into GOT
ADD R0, R2 //adjust by offset
//R2=function pointer
MOV.L (R2, 0), R1 //function address
MOV.L (R2, 4), R3 //GOT
JSR R1

Which I do with::

CALX [IP,R0,#GOT+index<<3-.]

In the callee:
... save registers ...
MOV R3, R14 //put GOT into a callee-save register
...

In the BJX2 ABI, had rolled this part into the callee, reasoning that handling it in the callee (per-function) was less overhead than handling
it in the caller (per function call).

Though, on the RISC-V side, it has the relative advantage of compiling
for absolute addressing, albeit still loses in terms of performance.

Compiling and linking to absolute addresses works "really well" when one
needs to place different sections in different memory every time the application/kernel runs due to malicious codes trying to steal everything. ASLR.....

I don't imagine an FDPIC version of RISC-V would win here, but this is
only assuming there exists some way to get GCC to output FDPIC binaries
(most I could find, was people debating whether to add FDPIC support for RISC-V).

PIC or PIE would also sort of work, but these still don't really allow
for multiple program instances in a single address space.

Once you share the code and some of the data, the overhead of using different mappings for special stuff {GOT, local thread data,...} is

multiple program instances in a single address space). But, does mean
that pretty much every non-leaf function ends up needing to go through
this ritual.

Universal constant solves the underlying issue.

I am not so sure that they could solve the "map multiple instances of
the same binary into a single address space" issue, which is sort of the whole thing for why GBR is being used.

Otherwise, I would have been using PC-REL...

*2: Pretty much any function that has local arrays or similar, serves
to protect register save area. If the magic number can't regenerate a
matching canary at the end of the function, then a fault is generated.

My 66000 can place the callee save registers in a place where user cannot
access them with LDs or modify them with STs. So malicious code cannot
damage the contract between ABI and core.

Possibly. I am using a conventional linear stack.

Downside: There is a need either for bounds checking or canaries.
Canaries are the cheaper option in this case.

The cost of some of this starts to add up.

In isolation, not much, but if all this happens, say, 500 or 1000
times or more in a program, this can add up.

Was thinking about that last night. H&P "book" statistics say that call/ret >> represents 2% of instructions executed. But if you add up the prologue and >> epilogue instructions you find 8% of instructions are related to calling
and returning--taking the problem from (at 2%) ignorable to (at 8%) a big
ticket item demanding something be done.

8% represents saving/restoring only 3 registers vis stack and associated SP >> arithmetic. So, it can easily go higher.

I guess it could make sense to add a compiler stat for this...

The save/restore can get folded off, but generally only done for
functions with a larger number of registers being saved/restored (and
does not cover secondary things like GBR reload or stack canary stuff,
which appears to possibly be a significant chunk of space).

Goes and adds a stat for averages:
Prolog: 8% (avg= 24 bytes)
Epilog: 4% (avg= 12 bytes)
Body : 88% (avg=260 bytes)

With 959 functions counted (excluding empty functions/prototypes).

....

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to All on Sun Apr 21 19:16:04 2024

On Sun, 21 Apr 2024 18:57:27 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

BGB wrote:

Like, in-order superscalar isn't going to do crap if nearly every
instruction depends on every preceding instruction. Even pipelining
can't help much with this.

Pipelining CREATED this (back to back dependencies). No amount of
pipelining can eradicate RAW data dependencies.

This is quite true. However, in case an unsophisticated individual
might read this thread, I think that I shall clarify.

Without pipelining, it is not a problem if each instruction depends on
the one immediately previous, and so people got used to writing
programs that way, as it was simple to write the code to do one thing
before starting to write the code to begin doing another thing.

This remained true when the simplest original form of pipelining was
brought in - where fetching one instruction from memory was overlapped
with decoding the previous instruction, and executing the instruction
before that.

It's only when what was originally called "superpipelining" came
along, where the execute stages of multiple successive instructions
could be overlapped, that it was necessary to do something about
dependencies in order to take advantage of the speedup that could
provide.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to All on Mon Apr 22 07:49:30 2024

MitchAlsup1 wrote:

BGB wrote:

On 4/20/2024 5:03 PM, MitchAlsup1 wrote:
Like, in-order superscalar isn't going to do crap if nearly every
instruction depends on every preceding instruction. Even pipelining
can't help much with this.

Pipelining CREATED this (back to back dependencies). No amount of
pipelining can eradicate RAW data dependencies.

The compiler can shuffle the instructions into an order to limit the
number of register dependencies and better fit the pipeline. But,
then, most of the "hard parts" are already done (so it doesn't take
much more for the compiler to flag which instructions can run in
parallel).

Compiler scheduling works for exactly 1 pipeline implementation and
is suboptimal for all others.

Well, yeah.

OTOH, if your (definitely not my!) compiler can schedule a 4-wide static ordering of operations, then it will be very nearly optimal on 2-wide
and 3-wide as well. (The difference is typically in a bit more loop
setup and cleanup code than needed.)

Hand-optimizing Pentium asm code did teach me to "think like a cpu",
which is probably the only part of the experience which is still kind of relevant. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to All on Mon Apr 22 14:13:41 2024

On Sat, 20 Apr 2024 17:07:11 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

And, hey, I'm not the first guy to get sunk because of forgetting what
lies under the tip of the iceberg that's above the water.

That also happened to the captain of the _Titanic_.

Concer-tina-tanic !?!

Oh, dear. This discussion has inspired me to rework the basic design
of Concertina II _yet again_!

The new design, not yet online, will have the following features:

The code stream will continue to be divided into 256-bit blocks.

However, block headers wil be eliminated. Instead, this functionality
will be subsumed into the instruction set.

Case I:

Indicating that from 1 to 7 32-bit instruction slots in a block are
not used for instructions, but instead may contain pseudo-immediates
will be achieved by:

Placing a two-address register-to-register operate instruction in the
first instruction slot in a block. These instructions will have a
three-bit field which, if nonzero, indicates the amount of space
reserved.

To avoid waste, when such an instruction is present in any slot other
than the first, that field will have the following function:

If nonzero, it points to an instruction slot (slots 1 through 7, in
the second through eighth positions) and a duplicate copy of the
instruction in that slot will be placed in the instruction stream
immediately following the instruction with that field.

The following special conditions apply:

If the instruction slot contains a pair of 16-bit instructions, only
the first of those instructions is so inserted for execution.

The instruction slot may not be one that is reserved for
pseudo-immediates, except that it may be the _first_ such slot, in
which case, the first 16 bits of that slot are taken as a 16-bit
instruction, with the format indicated by the first bit (as opposed to
the usual 17th bit) of that instruction slot's contents.

So it's possible to reserve an odd multiple of 16 bits for
pseudo-immediates, so as to avoid waste.

Case II:

Instructions longer than 32 bits are specified by being of the form:

The first instruction slot:

11111
00
(3 bits) length in instruction slots, from 2 to 7
(22 bits) rest of the first part of the instruction

All remaining instruction slots:

11111
(3 bits) position within instruction, from 2 to 7
(24 bits) rest of this part of the instruction

This mechanism, however, will _also_ be used for VLIW functionality or
prefix functionality which was formerly in block headers.

In that case, the first instruction slot, and the remaining
instruction slots, no longer need to be contiguous; instead, ordinary
32-bit instructions or pairs of 16-bit instlructions can occur between
the portions of the ensemble of prefixed instructions formed by this
means.

And there is a third improvement.

When Case I above is in effect, the block in which space for
pseudo-immediates is reserved will be stored in an internal register
in the processor.

Subsequent blocks can contain operate instructions with
pseudo-immediate operands even if no space for pseudo-immediates is
reserved in those blocks. In that case, the retained copy of the last
block encountered in which pseudo-immediates were reserved shall be
referenced instead.

I think these changes will improve code density... or, at least, they
will make it appear that no space is obviously forced to be wasted,
even if no real improvement in code density results.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to quadibloc@servername.invalid on Mon Apr 22 16:22:11 2024

On Mon, 22 Apr 2024 14:13:41 -0600, John Savard
<quadibloc@servername.invalid> wrote:

On Sat, 20 Apr 2024 17:07:11 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

And, hey, I'm not the first guy to get sunk because of forgetting what
lies under the tip of the iceberg that's above the water.

That also happened to the captain of the _Titanic_.

Concer-tina-tanic !?!

Oh, dear. This discussion has inspired me to rework the basic design
of Concertina II _yet again_!

The new design, not yet online, will have the following features:

The code stream will continue to be divided into 256-bit blocks.

However, block headers wil be eliminated. Instead, this functionality
will be subsumed into the instruction set.

Case I:

Indicating that from 1 to 7 32-bit instruction slots in a block are
not used for instructions, but instead may contain pseudo-immediates
will be achieved by:

Placing a two-address register-to-register operate instruction in the
first instruction slot in a block. These instructions will have a
three-bit field which, if nonzero, indicates the amount of space
reserved.

To avoid waste, when such an instruction is present in any slot other
than the first, that field will have the following function:

If nonzero, it points to an instruction slot (slots 1 through 7, in
the second through eighth positions) and a duplicate copy of the
instruction in that slot will be placed in the instruction stream
immediately following the instruction with that field.

The following special conditions apply:

If the instruction slot contains a pair of 16-bit instructions, only
the first of those instructions is so inserted for execution.

The instruction slot may not be one that is reserved for
pseudo-immediates, except that it may be the _first_ such slot, in
which case, the first 16 bits of that slot are taken as a 16-bit
instruction, with the format indicated by the first bit (as opposed to
the usual 17th bit) of that instruction slot's contents.

So it's possible to reserve an odd multiple of 16 bits for
pseudo-immediates, so as to avoid waste.

Case II:

Instructions longer than 32 bits are specified by being of the form:

The first instruction slot:

11111
00
(3 bits) length in instruction slots, from 2 to 7
(22 bits) rest of the first part of the instruction

All remaining instruction slots:

11111
(3 bits) position within instruction, from 2 to 7
(24 bits) rest of this part of the instruction

This mechanism, however, will _also_ be used for VLIW functionality or
prefix functionality which was formerly in block headers.

In that case, the first instruction slot, and the remaining
instruction slots, no longer need to be contiguous; instead, ordinary
32-bit instructions or pairs of 16-bit instlructions can occur between
the portions of the ensemble of prefixed instructions formed by this
means.

And there is a third improvement.

When Case I above is in effect, the block in which space for >pseudo-immediates is reserved will be stored in an internal register
in the processor.

Subsequent blocks can contain operate instructions with
pseudo-immediate operands even if no space for pseudo-immediates is
reserved in those blocks. In that case, the retained copy of the last
block encountered in which pseudo-immediates were reserved shall be >referenced instead.

I think these changes will improve code density... or, at least, they
will make it appear that no space is obviously forced to be wasted,
even if no real improvement in code density results.

The page has now been updated to reflect this modified design.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to quadibloc@servername.invalid on Mon Apr 22 19:36:54 2024

On Mon, 22 Apr 2024 16:22:11 -0600, John Savard
<quadibloc@servername.invalid> wrote:

On Mon, 22 Apr 2024 14:13:41 -0600, John Savard ><quadibloc@servername.invalid> wrote:

The first instruction slot:

11111
00
(3 bits) length in instruction slots, from 2 to 7
(22 bits) rest of the first part of the instruction

All remaining instruction slots:

11111
(3 bits) position within instruction, from 2 to 7
(24 bits) rest of this part of the instruction

The page has now been updated to reflect this modified design.

And I thought I was on to something.

The functionality - pseudo-immediates and VLIW features - was all the
same, but now everything was so much simpler. The only thing that
needed to be in a header, the three-bit field that reserved space for pseudo-immediates, now had just three bits of overhead.

Everything else followed a normal instruction model, instead of a
complicated header.

But... if I use a header with 22 bits usable to turn instruction words
that have 22 bits available...

into instructions that are _longer_ than 32 bits...

well, guess what?

If I use half the opcode space for four-word instructions, then one
header with 22 bits available can add 7 bits to each of three
subsequent instructions.

However, 24 plus 7 is 31.

So I'm stuck at putting two instructions in three words even for a
modest extension of the instruction set...

never mind adding a whole bunch of bits for stuff like predication!

I can tease out a couple of extra bits, so that I have a 22-bit
starting word, but 26 bits in each following one, by replacing the
three bit "position" field with a field that just contains 0 in every instruction slot but the last one, indicated with a 1.

With 26 bits, to get 33 bits - all I need for a nice expansion of the instruction set to its "full" form - I need to add seven bits to each
one, so that now does allow one starting word to prefix three
instructions.

Still not great, but adequate. And the first word doesn't really need
a length field either, it just needs to indicate it's the first one.
Which is how I had worked something like this before.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to John Savard on Tue Apr 23 01:53:26 2024

John Savard wrote:

On Sat, 20 Apr 2024 17:07:11 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

And, hey, I'm not the first guy to get sunk because of forgetting what
lies under the tip of the iceberg that's above the water.

That also happened to the captain of the _Titanic_.

Concer-tina-tanic !?!

Oh, dear. This discussion has inspired me to rework the basic design
of Concertina II _yet again_!

I suggest it is time for Concertina III.......

The new design, not yet online, will have the following features:

The code stream will continue to be divided into 256-bit blocks.

Why not a whole cache line ??

However, block headers wil be eliminated. Instead, this functionality
will be subsumed into the instruction set.

Case I:

Indicating that from 1 to 7 32-bit instruction slots in a block are
not used for instructions, but instead may contain pseudo-immediates
will be achieved by:

Placing a two-address register-to-register operate instruction in the
first instruction slot in a block. These instructions will have a
three-bit field which, if nonzero, indicates the amount of space
reserved.

To avoid waste, when such an instruction is present in any slot other
than the first, that field will have the following function:

If nonzero, it points to an instruction slot (slots 1 through 7, in
the second through eighth positions) and a duplicate copy of the
instruction in that slot will be placed in the instruction stream
immediately following the instruction with that field.

The following special conditions apply:

If the instruction slot contains a pair of 16-bit instructions, only
the first of those instructions is so inserted for execution.

The instruction slot may not be one that is reserved for
pseudo-immediates, except that it may be the _first_ such slot, in
which case, the first 16 bits of that slot are taken as a 16-bit
instruction, with the format indicated by the first bit (as opposed to
the usual 17th bit) of that instruction slot's contents.

So it's possible to reserve an odd multiple of 16 bits for
pseudo-immediates, so as to avoid waste.

Case II:

Instructions longer than 32 bits are specified by being of the form:

The first instruction slot:

11111
00
(3 bits) length in instruction slots, from 2 to 7
(22 bits) rest of the first part of the instruction

All remaining instruction slots:

11111
(3 bits) position within instruction, from 2 to 7
(24 bits) rest of this part of the instruction

This mechanism, however, will _also_ be used for VLIW functionality or
prefix functionality which was formerly in block headers.

In that case, the first instruction slot, and the remaining
instruction slots, no longer need to be contiguous; instead, ordinary
32-bit instructions or pairs of 16-bit instlructions can occur between
the portions of the ensemble of prefixed instructions formed by this
means.

And there is a third improvement.

When Case I above is in effect, the block in which space for pseudo-immediates is reserved will be stored in an internal register
in the processor.

Subsequent blocks can contain operate instructions with
pseudo-immediate operands even if no space for pseudo-immediates is
reserved in those blocks. In that case, the retained copy of the last
block encountered in which pseudo-immediates were reserved shall be referenced instead.

I think these changes will improve code density... or, at least, they
will make it appear that no space is obviously forced to be wasted,
even if no real improvement in code density results.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to All on Mon Apr 22 20:19:36 2024

On Tue, 23 Apr 2024 01:53:26 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

I suggest it is time for Concertina III.......

If the old Concertina II were worth keeping...

Why not a whole cache line ??

That would be one way to allow the overhead of a block prefix to be
minimized.

But that starts to look like just having mode bits for an entire
program.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From George Neuner@21:1/5 to All on Mon Apr 22 23:09:43 2024

On Sun, 21 Apr 2024 00:43:21 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

Address arithmetic is ADD only and does not care about signs or
overflow. There is no concept of a negative base register or a
negative index register (or, for that matter, a negative displace-
ment), overflow, underflow, carry, ...

Stack frame pointers often point to the middle of the frame and need
to access data using both positive and negative displacements.

Some GC schemes use negative displacements to access object headers.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to quadibloc@servername.invalid on Mon Apr 22 20:22:12 2024

On Mon, 22 Apr 2024 19:36:54 -0600, John Savard
<quadibloc@servername.invalid> wrote:

I can tease out a couple of extra bits, so that I have a 22-bit
starting word, but 26 bits in each following one, by replacing the
three bit "position" field with a field that just contains 0 in every >instruction slot but the last one, indicated with a 1.

With 26 bits, to get 33 bits - all I need for a nice expansion of the >instruction set to its "full" form - I need to add seven bits to each
one, so that now does allow one starting word to prefix three
instructions.

Still not great, but adequate. And the first word doesn't really need
a length field either, it just needs to indicate it's the first one.
Which is how I had worked something like this before.

But fully half the opcode space is allocated to 16-bit instructions.
EVen though that half doesn't really play nice with other things, it's
too tempting a target to ignore. But the price would be losing the
fully parallel nature of decoding.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to quadibloc@servername.invalid on Tue Apr 23 00:54:00 2024

On Mon, 22 Apr 2024 20:22:12 -0600, John Savard
<quadibloc@servername.invalid> wrote:

But fully half the opcode space is allocated to 16-bit instructions.
EVen though that half doesn't really play nice with other things, it's
too tempting a target to ignore. But the price would be losing the
fully parallel nature of decoding.

After heading out to buy groceries, my head cleared enough to discard
the various complicated and bizarre schemes I was considering to deal
with the issue, and instead to drastically reduce the overhead for the instructions longer than 32 bits, now that this had become a major
concern due to also usiing this format for prefixed instructions as
well, in a simple and straightforward manner.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to George Neuner on Tue Apr 23 17:58:41 2024

George Neuner wrote:

On Sun, 21 Apr 2024 00:43:21 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

Address arithmetic is ADD only and does not care about signs or
overflow. There is no concept of a negative base register or a
negative index register (or, for that matter, a negative displace-
ment), overflow, underflow, carry, ...

Stack frame pointers often point to the middle of the frame and need
to access data using both positive and negative displacements.

Yes, one accesses callee saved registers with positive displacements
and local variables with negative accesses. One simply needs to know
where the former stops and the later begins. ENTER and EXIT know this
by the register count and by the stack allocation size.

Some GC schemes use negative displacements to access object headers.

Those are negative displacements not negative bases or indexes.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Tue Apr 23 22:55:50 2024

BGB wrote:

On 4/23/2024 1:54 AM, John Savard wrote:

On Mon, 22 Apr 2024 20:22:12 -0600, John Savard
<quadibloc@servername.invalid> wrote:

But fully half the opcode space is allocated to 16-bit instructions.
EVen though that half doesn't really play nice with other things, it's
too tempting a target to ignore. But the price would be losing the
fully parallel nature of decoding.

After heading out to buy groceries, my head cleared enough to discard
the various complicated and bizarre schemes I was considering to deal
with the issue, and instead to drastically reduce the overhead for the
instructions longer than 32 bits, now that this had become a major
concern due to also usiing this format for prefixed instructions as
well, in a simple and straightforward manner.

You know, one could just be like, say:
xxxx-xxxx-xxxx-xxx0 //16-bit op
xxxx-xxxx-xxxx-xxxx xxxx-xxxx-xxxx-xx01 //32-bit op
xxxx-xxxx-xxxx-xxxx xxxx-xxxx-xxxx-x011 //32-bit op
xxxx-xxxx-xxxx-xxxx xxxx-xxxx-xxxx-0111 //32-bit op
xxxx-xxxx-xxxx-xxxx xxxx-xxxx-xxxx-1111 //jumbo prefix (64+)

And call it "good enough"...

Then, say (6b registers):
zzzz-mmmm-nnnn-zzz0 //16-bit op (2R)
zzzz-tttt-ttss-ssss nnnn-nnpp-zzzz-xxx1 //32-bit op (3R)
iiii-iiii-iiss-ssss nnnn-nnpp-zzzz-xxx1 //32-bit op (3RI, Imm10)
iiii-iiii-iiii-iiii nnnn-nnpp-zzzz-xxx1 //32-bit op (2RI, Imm16)
iiii-iiii-iiii-iiii iiii-iipp-zzzz-xxx1 //32-bit op (Branch)

Or (5b registers):
zzzz-mmmm-nnnn-zzz0 //16-bit op (2R)
zzzz-zttt-ttzs-ssss nnnn-nzpp-zzzz-xxx1 //32-bit op (3R)
iiii-iiii-iiis-ssss nnnn-nzpp-zzzz-xxx1 //32-bit op (3RI, Imm11)
iiii-iiii-iiii-iiii nnnn-nzpp-zzzz-xxx1 //32-bit op (2RI, Imm16)
iiii-iiii-iiii-iiii iiii-iipp-zzzz-xxx1 //32-bit op (Branch)

....

Punt on the 16-bit instructions::

000110 CONDI rrrrr PRED xxthen xxelse
000111 ddddd rrrrr SHF rwidth roffst
001001 ddddd rrrrr DscLd MemOp rrrrr
001010 ddddd rrrrr I12Sd 2-OPR rrrrr
001100 ddddd rrrrr I12 3OP rrrrr rrrrr
001101 ddddd rrrrr I12Sd 1-OPERA TIONx
01100 bitnum rrrrr 16-bit-displacement
011010 CONDI rrrrr 16-bit-displacement
011110 26-bit-displacementtttttttttttt
011111 26-bit-displacementtttttttttttt
100000
to
101110 ddddd rrrrr 16-bit-displacement
110000
to
111100 ddddd rrrrr 16-bit-immediateeee

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to John Savard on Thu Apr 25 01:00:21 2024

On Sat, 20 Apr 2024 18:06:22 -0600, John Savard wrote:

Since there was only one set of arithmetic instrucions, that meant that
when you wrote code to operate on unsigned values, you had to remember
that the normal names of the condition code values were oriented around signed arithmetic.

I thought architectures typically had separate condition codes for “carry” versus “overflow”. That way, you didn’t need signed versus unsigned versions of add, subtract and compare; it was just a matter of looking at
the right condition codes on the result.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu Apr 25 02:50:09 2024

Lawrence D'Oliveiro wrote:

On Sat, 20 Apr 2024 18:06:22 -0600, John Savard wrote:

Since there was only one set of arithmetic instrucions, that meant that
when you wrote code to operate on unsigned values, you had to remember
that the normal names of the condition code values were oriented around
signed arithmetic.

I thought architectures typically had separate condition codes for “carry”
versus “overflow”. That way, you didn’t need signed versus unsigned versions of add, subtract and compare; it was just a matter of looking at
the right condition codes on the result.

Maybe now with 4-or-5-bit condition codes yes,
But the early machines (360) with 2-bit codes were already constricted.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to All on Thu Apr 25 03:28:36 2024

On Thu, 25 Apr 2024 02:50:09 +0000, MitchAlsup1 wrote:

Lawrence D'Oliveiro wrote:

On Sat, 20 Apr 2024 18:06:22 -0600, John Savard wrote:

Since there was only one set of arithmetic instrucions, that meant
that when you wrote code to operate on unsigned values, you had to
remember that the normal names of the condition code values were
oriented around signed arithmetic.

I thought architectures typically had separate condition codes for
“carry” versus “overflow”. That way, you didn’t need signed versus >> unsigned versions of add, subtract and compare; it was just a matter of
looking at the right condition codes on the result.

Maybe now with 4-or-5-bit condition codes yes,
But the early machines (360) with 2-bit codes were already constricted.

The DEC PDP-6, from around 1964, same time as the System/360, had separate carry and overflow flags.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to ldo@nz.invalid on Wed Apr 24 23:12:01 2024

On Thu, 25 Apr 2024 01:00:21 -0000 (UTC), Lawrence D'Oliveiro
<ldo@nz.invalid> wrote:

On Sat, 20 Apr 2024 18:06:22 -0600, John Savard wrote:

Since there was only one set of arithmetic instrucions, that meant that
when you wrote code to operate on unsigned values, you had to remember
that the normal names of the condition code values were oriented around
signed arithmetic.

I thought architectures typically had separate condition codes for �carry� >versus �overflow�. That way, you didn�t need signed versus unsigned
versions of add, subtract and compare; it was just a matter of looking at
the right condition codes on the result.

Yes; I thought that was the same as what I just said.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From George Neuner@21:1/5 to All on Thu Apr 25 17:01:55 2024

On Tue, 23 Apr 2024 17:58:41 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

George Neuner wrote:

On Sun, 21 Apr 2024 00:43:21 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

Address arithmetic is ADD only and does not care about signs or
overflow. There is no concept of a negative base register or a
negative index register (or, for that matter, a negative displace-
ment), overflow, underflow, carry, ...

Stack frame pointers often point to the middle of the frame and need
to access data using both positive and negative displacements.

Yes, one accesses callee saved registers with positive displacements
and local variables with negative accesses. One simply needs to know
where the former stops and the later begins. ENTER and EXIT know this
by the register count and by the stack allocation size.

Some GC schemes use negative displacements to access object headers.

Those are negative displacements not negative bases or indexes.

I was reacting to your message (quoted fully above) which,
paraphrased, says "address arithmetic is add only and there is no
concept of a negative displacement".

In one sense you are correct: the result of the calculation has to be considered as unsigned in the range 0..max_memory ... ie. there is no
concept of negative *address*.

However, the components being added to form the address, I believe are
a different matter.

I agree that negative base is meaningless.

However, negative index and negative displacement both do have
meaning. The inclusion of specialized index registers is debatable
[I'm in the GPR camp], but I do believe that index and displacement
*values* both always should be considered as signed.

YMMV.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Fri Apr 26 13:25:03 2024

BGB wrote:

On 4/25/2024 4:01 PM, George Neuner wrote:

On Tue, 23 Apr 2024 17:58:41 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

Agreed in the sense that negative displacements exist.

However, can note that positive displacements tend to be significantly
more common than negative ones. Whether or not it makes sense to have a negative displacement, depending mostly on the probability of greater
than half of the missed displacements being negative.

From what I can tell, this seems to be:
~ 10 bits, scaled.
~ 13 bits, unscaled.

So, say, an ISA like RISC-V might have had a slightly hit rate with
unsigned displacements than with signed displacements, but if one added
1 or 2 bits, signed would have still been a clear winner (or, with 1 or
2 fewer bits, unsigned a clear winner).

I ended up going with signed displacements for XG2, but it was pretty
close to break-even in this case (when expanding from the 9-bit unsigned displacements in Baseline).

Granted, all signed or all-unsigned might be better from an ISA design consistency POV.

If one had 16-bit displacements, then unscaled displacements would make sense; otherwise scaled displacements seem like a win (misaligned displacements being much less common than aligned displacements).

What we need is ~16-bit displacements where 82½%-91¼% are positive.

How does one use a frame pointer without negative displacements ??

[FP+disp] accesses callee save registers
[FP-disp] accesses local stack variables and descriptors

[SP+disp] accesses argument and result values

But, admittedly, main reason I went with unscaled for GBR-rel and PC-rel Load/Store, was because using scaled displacements here would have
required more relocation types (nevermind if the hit rate for unscaled
9-bit displacements is "pretty weak").

Though, did end up later adding specialized Scaled GBR-Rel Load/Store
ops (to improve code density), so it might have been better in
retrospect had I instead just went the "keep it scaled and add more
reloc types to compensate" option.

....

YMMV.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to mitchalsup@aol.com on Fri Apr 26 15:34:57 2024

mitchalsup@aol.com (MitchAlsup1) writes:

What we need is ~16-bit displacements where 82½%-91¼% are positive.

What are these funny numbers about?

Do you mean that you want number ranges like -11468..54067 (82.5%
positive) or -5734..59801 (91.25% positive)? Which one of those? And
why not, say -8192..57343 (87.5% positive)?

How does one use a frame pointer without negative displacements ??

You let it point to the lowest address you want to access. That moves
the problem to unwinding frame pointer chains where the unwinder does
not know the frame-specific difference between the frame pointer and
the pointer of the next frame.

An alternative is to have a frame-independent difference that leaves
enough room that, say 90% (or 99%, or whatever) of the frames don't
need negative offsets from that frame.

Likewise, if you have signed displacements, and are unhappy about the
skewed usage, you can let the frame pointer point at an offset from
the pointer to the next fram such that the usage is less skewed.

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to All on Fri Apr 26 14:59:43 2024

MitchAlsup1 wrote:

BGB wrote:

If one had 16-bit displacements, then unscaled displacements would
make sense; otherwise scaled displacements seem like a win (misaligned
displacements being much less common than aligned displacements).

What we need is ~16-bit displacements where 82½%-91¼% are positive.

How does one use a frame pointer without negative displacements ??

[FP+disp] accesses callee save registers
[FP-disp] accesses local stack variables and descriptors

[SP+disp] accesses argument and result values

A sign extended 16-bit offsets would cover almost all such access needs
so I really don't see the need for funny business.

But if you really want a skewed range offset it could use something like excess-256 encoding which zero extends the immediate then subtract 256
(or whatever) from it, to give offsets in the range -256..+65535-256.
So an immediate value of 0 equals an offset of -256.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Anton Ertl on Fri Apr 26 21:01:35 2024

Anton Ertl wrote:

mitchalsup@aol.com (MitchAlsup1) writes:

What we need is ~16-bit displacements where 82½%-91¼% are positive.

What are these funny numbers about?

In typical usages in MY 66000 ISA <only> one needs only 18 DW of
negative addressing and we have a 16-bit displacement. So, technically
it might get by at the 99% level with -32..+65500. Other usages might
need a few more on the negative end of things so 1/8..1/16 in the
negative direction, 7/8..15/16 in the positive.

Do you mean that you want number ranges like -11468..54067 (82.5%
positive) or -5734..59801 (91.25% positive)? Which one of those? And
why not, say -8192..57343 (87.5% positive)?

Roughly.

How does one use a frame pointer without negative displacements ??

You let it point to the lowest address you want to access. That moves
the problem to unwinding frame pointer chains where the unwinder does
not know the frame-specific difference between the frame pointer and
the pointer of the next frame.

An alternative is to have a frame-independent difference that leaves
enough room that, say 90% (or 99%, or whatever) of the frames don't
need negative offsets from that frame.

Likewise, if you have signed displacements, and are unhappy about the
skewed usage, you can let the frame pointer point at an offset from
the pointer to the next fram such that the usage is less skewed.

Such a hassle....

- anton

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Fri Apr 26 21:07:24 2024

BGB wrote:

On 4/26/2024 8:25 AM, MitchAlsup1 wrote:

BGB wrote:

On 4/25/2024 4:01 PM, George Neuner wrote:

On Tue, 23 Apr 2024 17:58:41 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

Agreed in the sense that negative displacements exist.

However, can note that positive displacements tend to be significantly
more common than negative ones. Whether or not it makes sense to have
a negative displacement, depending mostly on the probability of
greater than half of the missed displacements being negative.

From what I can tell, this seems to be:
~ 10 bits, scaled.
~ 13 bits, unscaled.

So, say, an ISA like RISC-V might have had a slightly hit rate with
unsigned displacements than with signed displacements, but if one
added 1 or 2 bits, signed would have still been a clear winner (or,
with 1 or 2 fewer bits, unsigned a clear winner).

I ended up going with signed displacements for XG2, but it was pretty
close to break-even in this case (when expanding from the 9-bit
unsigned displacements in Baseline).

Granted, all signed or all-unsigned might be better from an ISA design
consistency POV.

If one had 16-bit displacements, then unscaled displacements would
make sense; otherwise scaled displacements seem like a win (misaligned
displacements being much less common than aligned displacements).

What we need is ~16-bit displacements where 82½%-91¼% are positive.

I was seeing stats more like 99.8% positive, 0.2% negative.

After pulling out the calculator and thinking about the frames, My
66000 needs no more than 18 DW of negative addressing. This is just
over 0.2% as you indicate.

There was enough of a bias that, below 10 bits, if one takes all the remaining cases, zero extending would always win, until reaching 10
bits, when the number of missed reaches 50% negative (along with
positive displacements larger than 512).

So, one can make a choice: -512..511, or 0..1023, ...

In XG2, I ended up with -512..511, for pros or cons (for some programs,
this choice is optimal, for others it is not).

Where, when scaled for QWORD, this is +/- 4K.

If one had a 16-bit displacement, it would be a choice between +/- 32K,
or (scaled) +/- 256K, or 0..512K, ...

We looked at this in Mc88100 (scaling of the displacement). The drawback
was that the ISA and linker were slightly mismatched: The linker wanted
to use a single upper 16-bit LUI <if it were> over several LD/STs of potentially different sizes, and scaling of the displacement failed in
those regards; so we dropped scaled displacements.

For the special purpose "LEA.Q (GBR, Disp16), Rn" instruction, I ended
up going unsigned, where for a lot of the programs I am dealing with,
this is big enough to cover ".data" and part of ".bss", generally used
for arrays which need the larger displacements (the compiler lays things
out so that most of the commonly used variables are closer to the start
of ".data", so can use smaller displacements).

Not even an issue when one has universal constants.

How does one use a frame pointer without negative displacements ??

[FP+disp] accesses callee save registers
[FP-disp] accesses local stack variables and descriptors

[SP+disp] accesses argument and result values

In my case, all of these are [SP+Disp], granted, there is no frame
pointer and stack frames are fixed-size in BGBCC.

This is typically with a frame layout like:
Argument/Spill space
-- Frame Top
Register Save
(Stack Canary)
Local arrays/structs
Local variables
Argument/Spill Space
-- Frame Bottom

Contrast with traditional x86 layout, which puts saved registers and
local variables near the frame-pointer, which points near the top of the stack frame.

Though, in a majority of functions, the MOV.L and MOV.Q functions have a
big enough displacement to cover the whole frame (excludes functions
which have a lot of local arrays or similar, though overly large local
arrays are auto-folded to using heap allocation, but at present this
logic is based on the size of individual arrays rather than on the total combined size of the stack frame).

Adding a frame pointer (with negative displacements) wouldn't make a big difference in XG2 Mode, but would be more of an issue for (pure)
Baseline, where options are either to load the displacement into a
register, or use a jumbo prefix.

But, admittedly, main reason I went with unscaled for GBR-rel and
PC-rel Load/Store, was because using scaled displacements here would
have required more relocation types (nevermind if the hit rate for
unscaled 9-bit displacements is "pretty weak").

Though, did end up later adding specialized Scaled GBR-Rel Load/Store
ops (to improve code density), so it might have been better in
retrospect had I instead just went the "keep it scaled and add more
reloc types to compensate" option.

....

YMMV.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Fri Apr 26 21:16:28 2024

BGB wrote:

On 4/26/2024 8:25 AM, MitchAlsup1 wrote:

How does one use a frame pointer without negative displacements ??

[FP+disp] accesses callee save registers
[FP-disp] accesses local stack variables and descriptors

[SP+disp] accesses argument and result values

In my case, all of these are [SP+Disp], granted, there is no frame
pointer and stack frames are fixed-size in BGBCC.

I only have FP when the base language is block structured and scoped.
Not C, C++ or FORTRAN, but Algol, ADA, Pascal: Yes.

This is typically with a frame layout like:
Argument/Spill space
-- Frame Top
Register Save
(Stack Canary)
Local arrays/structs
Local variables
Argument/Spill Space
-- Frame Bottom

Previous Argument/Result space
{ Register Save area
Return Pointer }

Local Descriptors -------------------\

Local Variables |
Dynamically allocated Stack space <--/

My Argument/Result space

When safe stack is in use, Register Save area and return pointer are
placed on a separate stack not accessible with LD/ST instructions.

Contrast with traditional x86 layout, which puts saved registers and
local variables near the frame-pointer, which points near the top of the stack frame.

Though, in a majority of functions, the MOV.L and MOV.Q functions have a
big enough displacement to cover the whole frame (excludes functions
which have a lot of local arrays or similar, though overly large local
arrays are auto-folded to using heap allocation, but at present this
logic is based on the size of individual arrays rather than on the total combined size of the stack frame).

By making a Local Descriptor area on the stack, one can access the
descriptors off of FP and access the dynamic stuff via that pointer.
Both Local Descriptors and Local Variables may be allocated into
registers and not actually exist on the stack.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Sat Apr 27 20:37:34 2024

BGB wrote:

On 4/26/2024 1:59 PM, EricP wrote:

MitchAlsup1 wrote:

BGB wrote:

If one had 16-bit displacements, then unscaled displacements would
make sense; otherwise scaled displacements seem like a win
(misaligned displacements being much less common than aligned
displacements).

What we need is ~16-bit displacements where 82½%-91¼% are positive.

How does one use a frame pointer without negative displacements ??

[FP+disp] accesses callee save registers
[FP-disp] accesses local stack variables and descriptors

[SP+disp] accesses argument and result values

A sign extended 16-bit offsets would cover almost all such access needs
so I really don't see the need for funny business.

But if you really want a skewed range offset it could use something like
excess-256 encoding which zero extends the immediate then subtract 256
(or whatever) from it, to give offsets in the range -256..+65535-256.
So an immediate value of 0 equals an offset of -256.

Yeah, my thinking was that by the time one has 16 bits for Load/Store displacements, they could almost just go +/- 32K and call it done.

But, much smaller than this, there is an advantage to scaling the displacements.

In other news, got around to getting the RISC-V code to build in PIE
mode for Doom (by using "riscv64-unknown-linux-gnu-*").

Can note that RV64 code density takes a hit in this case:
RV64: 299K (.text)
XG2 : 284K (.text)

Is this indicative that your ISA and RISC-V are within spitting distance
of each other in terms of the number of instructions in .text ?? or not ??

So, apparently using this version of GCC and using "-fPIE" works in my
favor regarding code density...

I guess a question is what FDPIC would do if GCC supported it, since
this would be the closest direct analog to my own ABI.

What is FDPIC ?? Federal Deposit Processor Insurance Corporation ??
Final Dopey Position Independent Code ??

I guess some people are dragging their feet on FDPIC, as there is some
debate as to whether or not NOMMU makes sense for RISC-V, along with its associated performance impact if used.

In my case, if I wanted to go over to simple base-relocatable images,
this would technically eliminate the need for GBR reloading.

Checks:
Simple base-relocatable case actually currently generates bigger
binaries, I suspect because in this case it is less space-efficient to
use PC-rel vs GBR-rel.

Went and added a "pbostatic" option, which sidesteps saving and
restoring GBR (making the simplifying assumption that functions will
never be called from outside the current binary).

This saves roughly 4K (Doom's ".text" shrinks to 280K).

Would you be willing to compile DOOM with Brian's LLVM compiler and
show the results ??

....

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Sun Apr 28 01:45:59 2024

BGB wrote:

On 4/27/2024 3:37 PM, MitchAlsup1 wrote:

BGB wrote:

On 4/26/2024 1:59 PM, EricP wrote:

MitchAlsup1 wrote:

BGB wrote:

If one had 16-bit displacements, then unscaled displacements would >>>>>> make sense; otherwise scaled displacements seem like a win
(misaligned displacements being much less common than aligned
displacements).

What we need is ~16-bit displacements where 82½%-91¼% are positive. >>>>>
How does one use a frame pointer without negative displacements ??

[FP+disp] accesses callee save registers
[FP-disp] accesses local stack variables and descriptors

[SP+disp] accesses argument and result values

A sign extended 16-bit offsets would cover almost all such access needs >>>> so I really don't see the need for funny business.

But if you really want a skewed range offset it could use something like >>>> excess-256 encoding which zero extends the immediate then subtract 256 >>>> (or whatever) from it, to give offsets in the range -256..+65535-256.
So an immediate value of 0 equals an offset of -256.

Yeah, my thinking was that by the time one has 16 bits for Load/Store
displacements, they could almost just go +/- 32K and call it done.

But, much smaller than this, there is an advantage to scaling the
displacements.

In other news, got around to getting the RISC-V code to build in PIE
mode for Doom (by using "riscv64-unknown-linux-gnu-*").

Can note that RV64 code density takes a hit in this case:
   RV64: 299K (.text)
   XG2 : 284K (.text)

Is this indicative that your ISA and RISC-V are within spitting distance
of each other in terms of the number of instructions in .text ?? or not ?? >>

It would appear that, with my current compiler output, both BJX2-XG2 and RISC-V RV64G are within a few percent of each other...

If adjusting for Jumbo prefixes (with the version that omits GBR reloads):
XG2: 270K (-10K of Jumbo Prefixes)

Implying RISC-V now has around 11% more instructions in this scenario.

Based on Brian's LLVM compiler; RISC-V has about 40% more instructions
than My 66000, or My 66000 has 70% the number of instructions that
RISC-V has (same compilation flags, same source code).

It also has an additional 20K of ".rodata" that is likely constants,
which likely overlap significantly with the jumbo prefixes.

My 66000 has vastly smaller .rodata because constants are part of .text

So, apparently using this version of GCC and using "-fPIE" works in my
favor regarding code density...

I guess a question is what FDPIC would do if GCC supported it, since
this would be the closest direct analog to my own ABI.

What is FDPIC ?? Federal Deposit Processor Insurance   Corporation ??
                Final   Dopey   Position Independent Code ??

Required a little digging: "Function Descriptor Position Independent Code".

But, I think the main difference is that, normal PIC does calls like like:
LD Rt, [GOT+Disp]
BSR Rt

CALX [IP,,#GOT+#disp-.]

It is unlikely that %GOT can be represented with 16-bit offset from IP
so the 32-bit displacement form (,,) is used.

Wheres, FDPIC was typically more like (pseudo ASM):
MOV SavedGOT, GOT
LEA Rt, [GOT+Disp]
MOV GOT, [Rt+8]
MOV Rt, [Rt+0]
BSR Rt
MOV GOT, SavedGOT

Since GOT is not in a register but is an address constant this is also::

CALX [IP,,#GOT+#disp-.]

But, in my case, noting that function calls tend to be more common than
the functions themselves, and functions will know whether or not they
need to access global variables or call other functions, ... it made
more sense to move this logic into the callee.

No official RISC-V FDPIC ABI that I am aware of, though some proposals
did seem vaguely similar in some areas to what I was doing with PBO.

Where, they were accessing globals like:
LUI Xt, DispHi
ADD Xt, Xt, DispLo
ADD Xt, Xt, GP
LD Xd, Xt, 0

Granted, this is less efficient than, say:
MOV.Q (GBR, Disp33s), Rd

LDD Rd,[IP,,#GOT+#disp-.]

Though, people didn't really detail the call sequence or prolog/epilog sequences, so less sure how this would work.

Likely guess, something like:
MV Xs, GP
LUI Xt, DispHi
ADD Xt, Xt, DispLo
ADD Xt, Xt, GP
LD GP, Xt, 8
LD Xt, Xt, 0
JALR LR, Xt, 0
MV GP, Xs

Well, unless they have a better way to pull this off...

CALX [IP,,#GOT+#disp-.]

But, yeah, as far as I saw it, my "better solution" was to put this part
into the callee.

Main tradeoff with my design is:
From any GBR, one needs to be able to get to every other GBR;
We need to have a way to know which table entry to reload (not
statically known at compile time).

Resolved by linker or accessed through GOT in mine. Each dynamic
module gets its own GOT.

In my PBO ABI, this was accomplished by using base relocs (but, this is
N/A for ELF, where PE/COFF style base relocs are not a thing).

One other option might be to use a PC-relative load to load the index.
Say:
AUIPC Xs, DispHi //"__global_pbo_offset$" ?
LD Xs, DispLo
LD Xt, GP, 0 //get table of offsets
ADD Xt, Xt, Xs
LD GP, Xt, 0

In this case, "__global_pbo_offset$" would be a magic constant variable
that gets fixed up by the ELF loader.

LDD Rd,[IP,,#GOT+#disp-.]

I guess some people are dragging their feet on FDPIC, as there is some
debate as to whether or not NOMMU makes sense for RISC-V, along with
its associated performance impact if used.

In my case, if I wanted to go over to simple base-relocatable images,
this would technically eliminate the need for GBR reloading.

Checks:
Simple base-relocatable case actually currently generates bigger
binaries, I suspect because in this case it is less space-efficient to
use PC-rel vs GBR-rel.

Went and added a "pbostatic" option, which sidesteps saving and
restoring GBR (making the simplifying assumption that functions will
never be called from outside the current binary).

This saves roughly 4K (Doom's ".text" shrinks to 280K).

Would you be willing to compile DOOM with Brian's LLVM compiler and
show the results ??

Will need to download and build this compiler...

Might need to look into this.

Please do.

But, yeah, current standing for this is:
XG2 : 280K (static linked, Modified PDPCLIB + TestKern)
RV64G : 299K (static linked, Modified PDPCLIB + TestKern)
X86-64: 288K ("gcc -O3", dynamically linked GLIBC)
X64 : 1083K (VS2022, static linked MSVCRT)

But, MSVC is an outlier here for just how bad it is on this front.

To get more reference points, would need to install more compilers.

Could have provided an ARM reference point, except that the compiler
isn't compiling stuff at the moment (would need to beat on stuff a bit
more to try to get it to build; appears to be trying to build with static-linked Newlib but is missing symbols, ...).

But, yeah, for good comparison, one needs to have everything build with
the same C library, etc.

I am thinking it may be possible to save a little more space by folding
some of the stuff for "va_start()" into an ASM blob (currently, a lot of stuff is folded off into the function prolog, but probably doesn't need
to be done inline for every varargs function).

Mostly this would be the logic for spilling all of the argument
registers to a location on the stack and similar.

Part of ENTER already does this: A typical subroutine will use::

ENTER R27,R0,#local_stack_size

Where the varargs subroutine will use::

ENTER R27,R8,#local_stack_size
ADD Rva_ptr,SP,#local_stack_size+64

notice all we had to do was to specify 8 more registers to be stored;
and exit with::

EXIT R27,R0,#local_stack_size+64

Here we skip over the 8 register variable arguments without reloading
them.

....

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to BGB on Sun Apr 28 09:05:26 2024

BGB <cr88192@gmail.com> schrieb:

Still watching LLVM build (several hours later), kinda of an interesting
meta aspect in its behaviors.

Don't build it in debug mode.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to BGB on Sun Apr 28 20:11:56 2024

BGB <cr88192@gmail.com> schrieb:

"--target my66000-none-elf" or similar just gets it to complain about an unknown triple, not sure how to query for known targets/triples with clang.

Grepping around the CMakeCache.txt file in my build directory, I find

//Semicolon-separated list of experimental targets to build. LLVM_EXPERIMENTAL_TARGETS_TO_BUILD:STRING=My66000

This is documented in llvm/lib/Target/My66000/README .

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Sun Apr 28 19:24:51 2024

BGB wrote:

On 4/27/2024 8:45 PM, MitchAlsup1 wrote:

But, I think the main difference is that, normal PIC does calls like
like:
   LD Rt, [GOT+Disp]
   BSR Rt

    CALX   [IP,,#GOT+#disp-.]

It is unlikely that %GOT can be represented with 16-bit offset from IP
so the 32-bit displacement form (,,) is used.

Wheres, FDPIC was typically more like (pseudo ASM):
   MOV SavedGOT, GOT
   LEA Rt, [GOT+Disp]
   MOV GOT, [Rt+8]
   MOV Rt, [Rt+0]
   BSR Rt
   MOV GOT, SavedGOT

Since GOT is not in a register but is an address constant this is also::

    CALX   [IP,,#GOT+#disp-.]

So... Would this also cause GOT to point to a new address on the callee
side (that is dependent on the GOT on the caller side, and *not* on the
PC address at the destination) ?...

The module on the calling side has its GOT and the module on the called side has its own GOT where offsets to/in GOT are determined by linker making the module. There may be cases where multiple link edits on a final module have some of the functions in this module accessed via GOT in this module and in these cases one uses

CALA [IP,,#GOT+#disp-.] // LDD ip changes to LDA ip

In effect, the context dependent GOT daisy-chaining is a fundamental
aspect of FDPIC that is different from conventional PIC.

Yes, understood, and it happens.

But, in my case, noting that function calls tend to be more common
than the functions themselves, and functions will know whether or not
they need to access global variables or call other functions, ... it
made more sense to move this logic into the callee.

No official RISC-V FDPIC ABI that I am aware of, though some proposals
did seem vaguely similar in some areas to what I was doing with PBO.

Where, they were accessing globals like:
   LUI Xt, DispHi
   ADD Xt, Xt, DispLo
   ADD Xt, Xt, GP
   LD Xd, Xt, 0

Granted, this is less efficient than, say:
   MOV.Q (GBR, Disp33s), Rd

    LDD   Rd,[IP,,#GOT+#disp-.]

As noted, BJX2 can handle this in a single 64-bit instruction, vs 4 instructions.

Though, people didn't really detail the call sequence or prolog/epilog
sequences, so less sure how this would work.

Likely guess, something like:
   MV    Xs, GP
   LUI   Xt, DispHi
   ADD   Xt, Xt, DispLo
   ADD   Xt, Xt, GP
   LD    GP, Xt, 8
   LD    Xt, Xt, 0
   JALR LR, Xt, 0
   MV    GP, Xs

Well, unless they have a better way to pull this off...

    CALX   [IP,,#GOT+#disp-.]

Well, can you explain the semantics of this one...

But, yeah, as far as I saw it, my "better solution" was to put this
part into the callee.

Main tradeoff with my design is:
   From any GBR, one needs to be able to get to every other GBR;
   We need to have a way to know which table entry to reload (not
statically known at compile time).

Resolved by linker or accessed through GOT in mine. Each dynamic
module gets its own GOT.

The important thing is not associating a GOT with an ELF module, but
with an instance of said module.

Yes.

So, say, one copy of an ELF image, can have N separate GOTs and data
sections (each associated with a program instance).

In my PBO ABI, this was accomplished by using base relocs (but, this
is N/A for ELF, where PE/COFF style base relocs are not a thing).

One other option might be to use a PC-relative load to load the index.
Say:
   AUIPC Xs, DispHi //"__global_pbo_offset$" ?
   LD Xs, DispLo
   LD Xt, GP, 0   //get table of offsets
   ADD Xt, Xt, Xs
   LD GP, Xt, 0

In this case, "__global_pbo_offset$" would be a magic constant
variable that gets fixed up by the ELF loader.

    LDD   Rd,[IP,,#GOT+#disp-.]

Still going to need to explain the semantics here...

IP+&GOT+disp-IP is a 64-bit pointer into GOT where the external linkage
pointer resides.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Tim Rentsch@21:1/5 to mitchalsup@aol.com on Sun Apr 28 15:37:17 2024

mitchalsup@aol.com (MitchAlsup1) writes:

John Savard wrote:

On Sat, 20 Apr 2024 22:03:21 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

BGB wrote:

Sign-extend signed values, zero-extend unsigned values.

Another mistake I mad in Mc 88100.

As that is a mistake the IBM 360 made, I make it too. But I make it
the way the 360 did: there are no signed and unsigned values, in the
sense of a Burroughs machine, there are just Load, Load Unsigned - and
Insert - instructions.

Index and base register values are assumed to be unsigned.

I would use the term signless as opposed to unsigned.

What's the point of using a non-standard term when there is a
common and firmly established standard term? I don't see how the
non-standard term conveys anything different. Next thing you
know someone will want to say "signful" rather than "signed".

Address arithmetic is ADD only and does not care about signs or
overflow. There is no concept of a negative base register or a
negative index register (or, for that matter, a negative displace-
ment), overflow, underflow, carry, ...

Some people here have argued that (for some architectures), addresses
with the high-order bit set should be taken as negative rather than
positive. Or did you mean your comment to apply only to certain
architectures (IBM 360, Mc 88100, perhaps others?), and not to
all architectures?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Sun Apr 28 22:56:41 2024

BGB wrote:

On 4/28/2024 2:24 PM, MitchAlsup1 wrote:

Still going to need to explain the semantics here...

IP+&GOT+disp-IP is a 64-bit pointer into GOT where the external linkage
pointer resides.

OK.

Not sure I follow here what exactly is going on...

While I am sure I don't understand what is going on....

As noted, if I did a similar thing to the RISC-V example, but with my
own ISA (with the MOV.C extension):
MOV.Q (PC, Disp33), R0 // What data does this access ?
MOV.Q (GBR, 0), R18
MOV.C (R18, R0), GBR

It appears to me that you are placing an array of GOT pointers at
the first entry of any particular GOT ?!?

Whereas My 66000 uses IP relative access to the GOT the linker
(or LD.so) setup avoiding the indirection.

Then My 66000 does not have or need a pointer to GOT since it can
synthesize such a pointer at link time and then just use a IP relative
plus DISP32 to access said GOT.

So, say we have some external variables::

extern uint64_t fred, wilma, barney, betty;

AND we postulate that the linker found all 4 externs in the same module
so that it can access them all via 1 pointer. The linker assigns an
index into GOT and setups a relocation to that memory segment and when
LD.so runs, it stores a proper pointer in that index of GOT, call this
index fred_index.

And we access one of these::

if( fred at_work )

The compiler will obtain the pointer to the area fred is positioned via:

LDD Rfp,[IP,,#GOT+fred_index<<3] // *

and from here one can access barney, betty and wilma using the pointer
to fred and standard offsetting.

LDD Rfred,[Rfp,#0] // fred
LDD Rbarn,[Rfp,#16] // barney
LDD Rbett,[Rfp,#24] // betty
LDD Rwilm,[Rfp,#8] // wilma

These offsets are known at link time and possibly not at compile time.

(*) if the LDD through GOT takes a page fault, we have a procedure setup
so LD.so can run figure out which entry is missing, look up where it is (possibly load and resolve it) and insert the required data into GOT.
When control returns to LDD, the entry is now present, and we now have
access to fred, wilma, barney and betty.

Differing mostly in that it doesn't require base relocs.

The normal version in my case avoids the extra memory load, but uses a
base reloc for the table index.

....

{{ // this looks like stuff that should be accessible to LD.so

Though, the reloc format is at least semi-dense, eg, for a block of relocs:
{ DWORD rvaPage; //address of page (4K)
DWORD szRelocs; //size of relocs in block
}
With each reloc encoded as a 16-bit entry:
(15:12): Reloc Type
(11: 0): Address within Page (4K)

One downside is this format is less efficient for sparse relocs (current situation), where often there are only 1 or 2 relocs per page (typically
the PBO index fixups and similar).

One situation could be to have a modified format that partially omits
the block structuring, say:
0ddd: Advance current page position by ddd pages (4K);
0000: Effectively a NOP (as before)
1ddd..Cddd: Apply the given reloc.
These represent typical relocs, target dependent.
HI16, LO16, DIR32, HI32ADJ, ...
8ddd: Was assigned for PBO fixups;
Addd: Fixup for a 64-bit address, also semi common.
Dzzz/Ezzz: Extended Relocs
These ones are configurable from a larger set of reloc types.
Fzzz: Command-Escape
...

Where, say, rather than needing 1 block per 4K page, it is 1 block per
PE section.

Though, base relocs are a relatively small part of the size of the binary.

To some extent, the PBO reloc is magic in that it works by
pattern-matching the instruction that it finds at the given address. So,
in effect, is only defined for a limited range of instructions.

Contrast with, say, the 1/2/3/4/A relocs, which expect raw 16/32/64 bit values. Though, a lot of these are not currently used for BJX2 (does not
use 16-bit addressing nides, ...).

Here:
5/6/7/8/9/B/C, ended up used for BJX2 relocs in BJX2 mode.
For other targets, they would have other meanings.
D/E/F were reserved as expanded/escape-case relocs, in case I need to
add more. These would differ partly in that the reloc sub-type would be assigned as a sort of state-machine.

but not the program itself}}

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to BGB on Mon Apr 29 16:53:29 2024

BGB <cr88192@gmail.com> schrieb:

Meanwhile, got the My66000 LLVM/Clang compiler built so far as that it
at least seems to try to build something (and seems to know that the
target exists).

But, also tends to die in s storm of error messages, eg:

/tmp/m_swap-822054.s:6: Error: no such instruction: `bitr r1,r1,<8:48>'

You can only generate assembly code, so just use "-S".

If you want to assemble to object files, you can use my binutils
branch on github. I have not yet started on the linker (there
are still quite a few decisions to be made regarding relocations,
which is a topic that I do not enjoy too much.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Apr 30 15:45:21 2024

mitchalsup@aol.com (MitchAlsup1) writes:

Lawrence D'Oliveiro wrote:

On Sat, 20 Apr 2024 18:06:22 -0600, John Savard wrote:

Since there was only one set of arithmetic instrucions, that meant that
when you wrote code to operate on unsigned values, you had to remember
that the normal names of the condition code values were oriented around
signed arithmetic.

I thought architectures typically had separate condition codes for “carry”
versus “overflow”. That way, you didn’t need signed versus unsigned
versions of add, subtract and compare; it was just a matter of looking at
the right condition codes on the result.

Maybe now with 4-or-5-bit condition codes yes,
But the early machines (360) with 2-bit codes were already constricted.

The B3500 (contemporaneous with 360) had COMS toggles (2 bits) and OVERFLOW toggle (1 bit).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Kent Dickey@21:1/5 to terje.mathisen@tmsw.no on Thu Jun 13 16:06:07 2024

In article <v04tpb$pqus$1@dont-email.me>,
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

MitchAlsup1 wrote:

BGB wrote:

On 4/20/2024 5:03 PM, MitchAlsup1 wrote:
Like, in-order superscalar isn't going to do crap if nearly every
instruction depends on every preceding instruction. Even pipelining
can't help much with this.

Pipelining CREATED this (back to back dependencies). No amount of
pipelining can eradicate RAW data dependencies.

The compiler can shuffle the instructions into an order to limit the
number of register dependencies and better fit the pipeline. But,
then, most of the "hard parts" are already done (so it doesn't take
much more for the compiler to flag which instructions can run in
parallel).

Compiler scheduling works for exactly 1 pipeline implementation and
is suboptimal for all others.

Well, yeah.

OTOH, if your (definitely not my!) compiler can schedule a 4-wide static >ordering of operations, then it will be very nearly optimal on 2-wide
and 3-wide as well. (The difference is typically in a bit more loop
setup and cleanup code than needed.)

Hand-optimizing Pentium asm code did teach me to "think like a cpu",
which is probably the only part of the experience which is still kind of >relevant. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

This is a late reply, but optimal static ordering for N-wide may be
very non-optimal for N-1 (or N-2, etc.). As an example, assume a perfectly scheduled 4-wide sequence of instructions with the instructions labeled
with the group number, and letter A-D for the position in the group.
There is a dependency from A to A, B to B, etc., and a dependency from D
to A. Here's what the instruction groupings look like on a 4-way machine:

INST0_A
INST0_B
INST0_C
INST0_D
-------
INST1_A
INST1_B
INST1_C
INST1_D
-------
INST2_A

There will obviously be other dependencies (say, INST2_A depends on INST0_B) but they don't affect how this will be executed.
The ----- lines indicate group boundaries. All instructions in a group
execute in the same cycle. So the first 8 instruction take just 2 clocks
on a 4-wide.

If you run this sequence on a 3-wide, then the groupings will become:

INST0_A
INST0_B
INST0_C
-------
INST0_D
-------
INST1_A
INST1_B
INST1_C
-------
INST1_D
-------

What took 2 clocks on the 4-wide now takes 4 clocks on the 3-wide. And
a different arrangement would take just 3 clocks:

INST0_A
INST0_B
INST0_D
-------
INST1_A
INST0_C
INST1_B
-------
INST1_C
INST1_D

-------------------------------

A similar problem occurs when the 4-wide is optimally scheduled, but doesn't issue 4 instructions due to dependencies. These dependencies can hit at
bad times for 2-wide causing it to not be optimal. Here's a new 4-wide sequence where INST1_A depends on INST0_C and INST0_A, and INST2_* all
depends on INST1_A, with this pattern repeating in even/odd groups.

INST0_A
INST0_B
INST0_C
-------
INST1_A
-------
INST2_A
INST2_B
INST2_C
-------
INST3_A
-------

This sequence takes 4 clocks on a 4-wide machine.

When run on a 2-wide machine, these are the cycle counts:

INST0_A
INST0_B
-------
INST0_C
-------
INST1_A
-------
INST2_A
INST2_B
-------
INST2_C
-------
INST3_A
-------

This takes 6 clocks. But by moving INSTx_B, it could be faster:

INST0_A
INST0_C
-------
INST0_B
INST1_A
-------
INST2_A
INST2_C
-------
INST2_B
INST3_A
-------

Now it takes just 4 clocks. So an optimal 4-wide schedule can be shown to
not be very non-optimal on 3-wide or 2-wide systems. And this isn't taking into account other delays and resource limits (like number of loads and
stores supported per cycle).

It's an interesting problem as to how bad it can get. With resource
limits, I suspect it can be an integer multiple bad, but just using
register dependencies, I'm not sure how bad it can get. I just showed 50%,
but I'm not sure if 100% slower is possible.

Kent

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to All on Thu Jun 13 17:00:55 2024

Kent writes an interesting thesis on why OoO overtook IO wrt
scheduling.

One can add other scheduling effects::
a) change in latency on some FUs but not on others
b) change is register ports per instruction
c) branch prediction changes
d) cache timing wrt prefetch

Having gone through the transition (1-wide IO, 2-wide IO, 6-wide OoO)
the OoO machine was simply less complexity--or to say it a different
way--the complexity was more orderly (and more easily verified).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Thu Jun 13 12:52:07 2024

This is a late reply, but optimal static ordering for N-wide may be
very non-optimal for N-1 (or N-2, etc.). As an example, assume a perfectly

AFAICT Terje was talking about scheduling for OoO CPUs, and wasn't
talking about the possible worst case situations, but about how things
usually turn out in practice.

For statically-scheduled or in-order CPUs, it can be indeed more
difficult to generate code that will run (almost) optimally on a variety
of CPUs.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Stefan Monnier@21:1/5 to All on Thu Jun 13 13:24:47 2024

Having gone through the transition (1-wide IO, 2-wide IO, 6-wide OoO)
the OoO machine was simply less complexity--or to say it a different
way--the complexity was more orderly (and more easily verified).

I tend to think of it a bit like the transition from using overlays and segments to the use of on-demand paging.

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Stefan Monnier on Thu Jun 13 20:41:34 2024

Stefan Monnier wrote:

Having gone through the transition (1-wide IO, 2-wide IO, 6-wide OoO)
the OoO machine was simply less complexity--or to say it a different
way--the complexity was more orderly (and more easily verified).

I tend to think of it a bit like the transition from using overlays and segments to the use of on-demand paging.

Who (in their right minds) would go back from paging to overlays ???

{Although many people enjoy driving Model Ts ...}

Stefan

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Thu Jun 13 20:40:18 2024

BGB wrote:

On 6/13/2024 11:52 AM, Stefan Monnier wrote:

This is a late reply, but optimal static ordering for N-wide may be
very non-optimal for N-1 (or N-2, etc.). As an example, assume a
perfectly

AFAICT Terje was talking about scheduling for OoO CPUs, and wasn't
talking about the possible worst case situations, but about how things
usually turn out in practice.

For statically-scheduled or in-order CPUs, it can be indeed more
difficult to generate code that will run (almost) optimally on a
variety
of CPUs.

Yeah, you need to know the specifics of the pipeline for either optimal

machine code (in-order superscalar) or potentially to be able to run at

all (LIW / VLIW).

That said, on some OoO CPU's, such as when I was running a Piledriver
based core, it did seem as if things were scheduled to assume an
in-order CPU (such as putting other instructions between memory loads
and the instructions using the results, etc), it did perform better (seemingly implying there are limits to the OoO magic).

When doing both Mc 88120 and K9 we found lots of sequences if code
where
the scheduling to more orderly or narrower implementations were
impeding
performance on the GBOoO core.

Though, OTOH, a lot of the sorts of optimization tricks I found for the

Piledriver were ineffective on the Ryzen, albeit mostly because the
more
generic stuff caught up.

For example, I had an LZ compressor that was faster than LZ4 on that
CPU
(it was based around doing everything in terms of aligned 32-bit
dwords,
gaining speed at the cost of worse compression), but then when going
over to the Ryzen, LZ4 got faster...

It is the continuous nature of having to reschedule code every
generation
that lead to my wanting the compiler to just spit out correct code and
in the fewest number of instructions that lead to a lot of My 66000 architecture and microarchitectures.

Like, seemingly all my efforts in "aggressively optimizing" some things

became moot simply by upgrading my PC.

I want to compile once and then use forever (in a dynamic library).

....

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to All on Fri Jun 14 11:30:12 2024

MitchAlsup1 wrote:

Stefan Monnier wrote:

Having gone through the transition (1-wide IO, 2-wide IO, 6-wide OoO)
the OoO machine was simply less complexity--or to say it a different
way--the complexity was more orderly (and more easily verified).

I tend to think of it a bit like the transition from using overlays and
segments to the use of on-demand paging.

Who (in their right minds) would go back from paging to overlays ???

{Although many people enjoy driving Model Ts ...}

I have been in a lovingly restored 1914 (or 1913?) model T, but only as
a passenger. When I saw how difficult it was to start and drive it
(steering, gear changes etc) I was very glad the owner didn't suggest I
should try.

The year number was _very_ important to the owner, he had worked on it
for years, sourcing the correct vintage of every part where the year
model is documented. Finally the only "wrong" part was the wishbone
which he knew was from the other year, then he finally found out that a
farm about 3 hours away from the Twin Cities used such a wishbone as
their dinner bell. He unmounted the part from his car, drove all the way
out there, found that the "bell" was in fact from the other year, then
swapped them (with the farmer's permission of course!).

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to Kent Dickey on Fri Jun 14 11:22:02 2024

Kent Dickey wrote:

In article <v04tpb$pqus$1@dont-email.me>,
Terje Mathisen <terje.mathisen@tmsw.no> wrote:

MitchAlsup1 wrote:

BGB wrote:

On 4/20/2024 5:03 PM, MitchAlsup1 wrote:
Like, in-order superscalar isn't going to do crap if nearly every
instruction depends on every preceding instruction. Even pipelining
can't help much with this.

Pipelining CREATED this (back to back dependencies). No amount of
pipelining can eradicate RAW data dependencies.

The compiler can shuffle the instructions into an order to limit the
number of register dependencies and better fit the pipeline. But,
then, most of the "hard parts" are already done (so it doesn't take
much more for the compiler to flag which instructions can run in
parallel).

Compiler scheduling works for exactly 1 pipeline implementation and
is suboptimal for all others.

Well, yeah.

OTOH, if your (definitely not my!) compiler can schedule a 4-wide static
ordering of operations, then it will be very nearly optimal on 2-wide
and 3-wide as well. (The difference is typically in a bit more loop
setup and cleanup code than needed.)

Hand-optimizing Pentium asm code did teach me to "think like a cpu",
which is probably the only part of the experience which is still kind of
relevant. :-)

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

This is a late reply, but optimal static ordering for N-wide may be
very non-optimal for N-1 (or N-2, etc.). As an example, assume a perfectly scheduled 4-wide sequence of instructions with the instructions labeled
with the group number, and letter A-D for the position in the group.
There is a dependency from A to A, B to B, etc., and a dependency from D
to A. Here's what the instruction groupings look like on a 4-way machine:

INST0_A
INST0_B
INST0_C
INST0_D
-------
INST1_A
INST1_B
INST1_C
INST1_D
-------
INST2_A

There will obviously be other dependencies (say, INST2_A depends on INST0_B) but they don't affect how this will be executed.
The ----- lines indicate group boundaries. All instructions in a group execute in the same cycle. So the first 8 instruction take just 2 clocks
on a 4-wide.

If you run this sequence on a 3-wide, then the groupings will become:

INST0_A
INST0_B
INST0_C
-------
INST0_D
-------
INST1_A
INST1_B
INST1_C
-------
INST1_D
-------

OK, you did state that A1 depends on D0, but then showed a bit later
that neither A nor D depended on C, so you could use that as a filler.

INST0_A
INST0_B
INST0_D
-------
INST1_A
INST0_C
INST1_B
-------
INST1_C
INST1_D

Obviously you cannot follow this up with INST2_A, you would need INST2B
here and then A/C/D on the next cycle

INST2_B
-------
INST2_A
INST2_C
INST2_D

at which point the pattern could repeat itself.

Running this slightly modified ordering on a 4-wide would again fail,
but if I instead write it like this:

INST0_A
INST0_B
INST0_D
--------
INST0_C
INST1_A
INST1_B
--------
INST1_D
INST1_C
INST2_B
--------
INST2_A
INST2_C
INST2_D

then a re-grouping for the 4-wide would still have one instruction from
each ABCD group in each cycle and A would never stall waiting for a
previous D in the same cycle.

This is probably close to the patterns an OoO 3 or 4-wide would settle
down on after a bunch of iterations.

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Terje Mathisen on Fri Jun 14 16:57:05 2024

Terje Mathisen wrote:

MitchAlsup1 wrote:

Stefan Monnier wrote:

Having gone through the transition (1-wide IO, 2-wide IO, 6-wide OoO)
the OoO machine was simply less complexity--or to say it a different
way--the complexity was more orderly (and more easily verified).

I tend to think of it a bit like the transition from using overlays and
segments to the use of on-demand paging.

Who (in their right minds) would go back from paging to overlays ???

{Although many people enjoy driving Model Ts ...}

I have been in a lovingly restored 1914 (or 1913?) model T, but only as

a passenger. When I saw how difficult it was to start and drive it
(steering, gear changes etc) I was very glad the owner didn't suggest I

should try.

Well you on both of you. Oh, BTW, you adjust the advance by the sound
of
the exhaust note.

The year number was _very_ important to the owner, he had worked on it
for years, sourcing the correct vintage of every part where the year
model is documented. Finally the only "wrong" part was the wishbone
which he knew was from the other year, then he finally found out that a

farm about 3 hours away from the Twin Cities used such a wishbone as
their dinner bell. He unmounted the part from his car, drove all the
way

out there, found that the "bell" was in fact from the other year, then swapped them (with the farmer's permission of course!).

Terje

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Tue Jun 18 21:09:51 2024

BGB wrote:

On 6/13/2024 3:40 PM, MitchAlsup1 wrote:

BGB wrote:

On 6/13/2024 11:52 AM, Stefan Monnier wrote:

This is a late reply, but optimal static ordering for N-wide may be
very non-optimal for N-1 (or N-2, etc.). As an example, assume a
perfectly

AFAICT Terje was talking about scheduling for OoO CPUs, and wasn't
talking about the possible worst case situations, but about how things >>>> usually turn out in practice.

For statically-scheduled or in-order CPUs, it can be indeed more
difficult to generate code that will run (almost) optimally on a
variety
of CPUs.

Yeah, you need to know the specifics of the pipeline for either optimal

machine code (in-order superscalar) or potentially to be able to run at

all (LIW / VLIW).

That said, on some OoO CPU's, such as when I was running a Piledriver
based core, it did seem as if things were scheduled to assume an
in-order CPU (such as putting other instructions between memory loads
and the instructions using the results, etc), it did perform better
(seemingly implying there are limits to the OoO magic).

When doing both Mc 88120 and K9 we found lots of sequences if code
where
the scheduling to more orderly or narrower implementations were
impeding
performance on the GBOoO core.

In this case, scheduling as-if it were an in-order core was leading to
better performance than a more naive ordering (such as directly using
the results of previous instructions or memory loads, vs shuffling
other

instructions in between them).

Either way, seemed to be different behavior than seen on either the
Ryzen or on Intel Core based CPUs (where, seemingly, the CPU does not
care about the relative order).

Because it had no requirement of code scheduling, unlike 1st generation

RISCs, so the cores were designed to put up good performance scores
without any code scheduling.

Though, OTOH, a lot of the sorts of optimization tricks I found for the

Piledriver were ineffective on the Ryzen, albeit mostly because the
more
generic stuff caught up.

For example, I had an LZ compressor that was faster than LZ4 on that
CPU
(it was based around doing everything in terms of aligned 32-bit
dwords,
gaining speed at the cost of worse compression), but then when going
over to the Ryzen, LZ4 got faster...

It is the continuous nature of having to reschedule code every
generation
that lead to my wanting the compiler to just spit out correct code and
in the fewest number of instructions that lead to a lot of My 66000
architecture and microarchitectures.

Mostly works for x86-64 as well.

Though, I had noted that the optimization strategies that worked well
on
MSVC + Piledriver, continue to work effectively on my custom ISA /
core.

One of the things we found in Mc 88120 was that the compiler should
NEVER
be allowed to put unnecessary instructions in decode-execute slots that
were unused--and that almost invariable--the best code for the GBOoO
machine was almost invariably the one with the fewest instructions, and
if several sequences had equally few instructions, it basically did not
matter.

For example::

for( i = 0; i < max, i++ )
a[i] = b[i];

was invariably faster than::

for( ap = &a[0], bp = & b[0];, i = 0; i < max; i++ )
*ap++ = *bp++;

because the later has 3 ADDs in the loop wile the former has but 1.
Because of this, I altered my programming style and almost never end up
using ++ or -- anymore.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Wed Jun 19 16:11:20 2024

BGB wrote:

On 6/18/2024 4:09 PM, MitchAlsup1 wrote:

BGB wrote:

On 6/13/2024 3:40 PM, MitchAlsup1 wrote:

In this case, scheduling as-if it were an in-order core was leading to
better performance than a more naive ordering (such as directly using
the results of previous instructions or memory loads, vs shuffling
other

instructions in between them).

Either way, seemed to be different behavior than seen on either the
Ryzen or on Intel Core based CPUs (where, seemingly, the CPU does not
care about the relative order).

Because it had no requirement of code scheduling, unlike 1st generation

RISCs, so the cores were designed to put up good performance scores
without any code scheduling.

Yeah, but why was Bulldozer/Piledriver seemingly much more sensitive to

instruction scheduling issues than either its predecessors (such as the

Phenom II) and successors (Ryzen)?...

They "blew" the microarchitecture.

It was a 12-gate machine (down from 16-gates from Athlon). this puts
a "lot more stuff" on critical paths and some forwarding was not done, particularly change in size between produced result and consumed
operand.

Though, apparently "low IPC" was a noted issue with this processor
family (apparently trying to gain higher clock-speeds at the expense of

IPC; using a 20-stage pipeline, ...).

Though, less obvious how having a longer pipeline than either its predecessors or successors would effect instruction scheduling.

One of the things we found in Mc 88120 was that the compiler should
NEVER
be allowed to put unnecessary instructions in decode-execute slots that
were unused--and that almost invariable--the best code for the GBOoO
machine was almost invariably the one with the fewest instructions, and
if several sequences had equally few instructions, it basically did not
matter.

For example::

    for( i = 0; i < max, i++ )
         a[i] = b[i];

was invariably faster than::

    for( ap = &a[0], bp = & b[0];, i = 0; i < max; i++ )
         *ap++ = *bp++;

because the later has 3 ADDs in the loop wile the former has but 1.
Because of this, I altered my programming style and almost never end up
using ++ or -- anymore.

In this case, it would often be something more like:
maxn4=max&(~3);
for(i=0; i<maxn4; i+=4)
{
ap=a+i; bp=b+i;
t0=ap[0]; t1=ap[1];
t2=ap[2]; t3=ap[3];
bp[0]=t0; bp[1]=t1;
bp[2]=t2; bp[3]=t3;
}
if(max!=maxn4)
{
for(; i < max; i++ )
a[i] = b[i];
}

That is what VVM does, without you having to lift a finger.

If things are partially or fully unrolled, they often go faster.

And ALWAYS eat more code space.

Using a
large number of local variables seems to be effective (even in cases
where the number of local variables exceeds the number of CPU
registers).

Generally also using as few branches as possible.
Etc...

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to BGB on Wed Jun 19 18:43:45 2024

BGB wrote:

On 6/19/2024 11:11 AM, MitchAlsup1 wrote:

BGB wrote:

For example::

     for( i = 0; i < max, i++ )
          a[i] = b[i];

In this case, it would often be something more like:
   maxn4=max&(~3);
   for(i=0; i<maxn4; i+=4)
   {
     ap=a+i;    bp=b+i;
     t0=ap[0]; t1=ap[1];
     t2=ap[2]; t3=ap[3];
     bp[0]=t0; bp[1]=t1;
     bp[2]=t2; bp[3]=t3;
   }
   if(max!=maxn4)
   {
     for(; i < max; i++ )
       a[i] = b[i];
   }

That is what VVM does, without you having to lift a finger.

If things are partially or fully unrolled, they often go faster.

And ALWAYS eat more code space.

Granted, but it is faster in this case, though mostly due to being able

to sidestep some of the interlock penalties and reducing the amount of
cycles spent on the loop itself.

The loop does not have any dependencies, except on the loop index
variable
and a possible alias between a and b.

Now, if &b[1] happens to = &a[0], then your construction fails while VVM

succeeds--it just runs slower because there IS a dependency checked by
HW and enforced. In those situations where the dependency is
nonexistent,
then the loop vectorizes--and the programmer remains blissfuly unaware.

Say, since branching isn't free, more so if one does an increment
directly before checking the condition and branching, as is typically
the case in a "for()" loop.

Note: the ADD, CMP, BC of the LOOP instruction runs in a single cycle.
ADD is performed by a 2-input adder
ADD-CMP is performed by a 3-input adder
BC is performed by predicting a return to the top of the loop while
leaving the decoder decoding the instruction following LOOP.

Most of the time, and especially in wider machines, the instructions
of the loop remain in their instruction queues, and instruction FETCH-
DECODE is not performed while processing the LOOP. Each iteration has
the queue spit out its instructions to its FU as dependencies resolve.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Wed Jun 19 21:24:05 2024

MitchAlsup1 <mitchalsup@aol.com> schrieb:

One of the things we found in Mc 88120 was that the compiler should
NEVER
be allowed to put unnecessary instructions in decode-execute slots that
were unused--and that almost invariable--the best code for the GBOoO
machine was almost invariably the one with the fewest instructions, and
if several sequences had equally few instructions, it basically did not matter.

For example::

for( i = 0; i < max, i++ )
a[i] = b[i];

was invariably faster than::

for( ap = &a[0], bp = & b[0];, i = 0; i < max; i++ )
*ap++ = *bp++;

because the later has 3 ADDs in the loop wile the former has but 1.
Because of this, I altered my programming style and almost never end up
using ++ or -- anymore.

Interesting.

I looked at a variant of this,

void foo (int *a, int *b, int n)
{
int i;
for( i = 0; i < n; i++ )
a[i] = b[i] + 1;
}

void bar (int *a, int *b, int n)
{

int *ap, *bp;
int i;

for( ap = &a[0], bp = & b[0], i = 0; i < n; i++ )
*ap++ = (*bp++) + 1;
}

I would expect ivopt to optimize these to the same assembly.
Gcc does so, at least for AMD64 and POWER, and so does clang for
the usual architectures if translated with -O2 -fno-unroll-loops -fno-tree-vectorize.

But I think there are other reasons to chose the array version,
I think - clarity is one of them, the other being uses
of "restrict" on arguments.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to All on Wed Jun 19 21:22:54 2024

MitchAlsup1 wrote:

BGB wrote:

Yeah, but why was Bulldozer/Piledriver seemingly much more sensitive to

instruction scheduling issues than either its predecessors (such as the

Phenom II) and successors (Ryzen)?...

They "blew" the microarchitecture.

It was a 12-gate machine (down from 16-gates from Athlon). this puts a
"lot more stuff" on critical paths and some forwarding was not done, particularly change in size between produced result and consumed
operand.

So was the problem that it could not do back-to-back forwarding and need
to take a 1 cycle hiccup in some cases that turned out to be frequent?
Like say forwarding a byte to a long?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to All on Wed Jun 19 21:44:34 2024

MitchAlsup1 wrote:

One of the things we found in Mc 88120 was that the compiler should
NEVER
be allowed to put unnecessary instructions in decode-execute slots that
were unused--and that almost invariable--the best code for the GBOoO
machine was almost invariably the one with the fewest instructions, and
if several sequences had equally few instructions, it basically did not matter.

For example::

for( i = 0; i < max, i++ )
a[i] = b[i];

was invariably faster than::

for( ap = &a[0], bp = & b[0];, i = 0; i < max; i++ )
*ap++ = *bp++;

because the later has 3 ADDs in the loop wile the former has but 1.
Because of this, I altered my programming style and almost never end up
using ++ or -- anymore.

The 88000 had a scaled-indexed address mode on LD and ST.
Alpha didn't but had a scaled-indexed S8ADDQ (aka LEA) instruction.
ISA's that didn't used individual shifts and adds (like real RISC's do!)
so for them an optimizing compiler converting to the ++ form is optimal,
and might be folded into an auto-increment address mode.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to EricP on Thu Jun 20 02:44:21 2024

EricP wrote:

MitchAlsup1 wrote:

One of the things we found in Mc 88120 was that the compiler should
NEVER
be allowed to put unnecessary instructions in decode-execute slots that
were unused--and that almost invariable--the best code for the GBOoO
machine was almost invariably the one with the fewest instructions, and
if several sequences had equally few instructions, it basically did not
matter.

For example::

for( i = 0; i < max, i++ )
a[i] = b[i];

was invariably faster than::

for( ap = &a[0], bp = & b[0];, i = 0; i < max; i++ )
*ap++ = *bp++;

because the later has 3 ADDs in the loop wile the former has but 1.
Because of this, I altered my programming style and almost never end up
using ++ or -- anymore.

By the way:: The statement about Always applies to Mc 881x0's and a few
other RISC architectures with indexed address modes.

The 88000 had a scaled-indexed address mode on LD and ST.
Alpha didn't but had a scaled-indexed S8ADDQ (aka LEA) instruction.
ISA's that didn't used individual shifts and adds (like real RISC's
do!)
so for them an optimizing compiler converting to the ++ form is
optimal,
and might be folded into an auto-increment address mode.

No disagreement when other architectures made different choices getting

to where they got.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to EricP on Thu Jun 20 10:40:32 2024

EricP wrote:

MitchAlsup1 wrote:

BGB wrote:

Yeah, but why was Bulldozer/Piledriver seemingly much more sensitive to

instruction scheduling issues than either its predecessors (such as the

Phenom II) and successors (Ryzen)?...

They "blew" the microarchitecture.

It was a 12-gate machine (down from 16-gates from Athlon). this puts a
"lot more stuff" on critical paths and some forwarding was not done,
particularly change in size between produced result and consumed
operand.

So was the problem that it could not do back-to-back forwarding and need
to take a 1 cycle hiccup in some cases that turned out to be frequent?
Like say forwarding a byte to a long?

BTW in Jan 2023 Chips-and-cheese did a two part retrospective deep dive
on Bulldozer microarchitecture compared to Sandy Bridge and others.

It doesn't single out a culprit. It mentions the Bulldozer integer
scheduler as choosing between either a single oldest entry or based on
physical location, which sounds like a plain linear priority picker.
That might cause problems where it can cause saw-tooth performance as a
stalled dependency chain builds up waiting on one item, then releases.
That is why I used a circular priority picker, aka a round robin arbiter,
which makes the last selected slot the lowest circular priority
and thereby ensures that each scheduler slot is serviced evenly.

But it sounds like there were many interacting issues.

https://chipsandcheese.com/2023/01/22/bulldozer-amds-crash-modernization-frontend-and-execution-engine/

https://chipsandcheese.com/2023/01/24/bulldozer-amds-crash-modernization-caching-and-conclusion/

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to ThatWouldBeTelling@thevillage.com on Thu Jun 20 11:01:38 2024

On Thu, 20 Jun 2024 10:40:32 -0400, EricP
<ThatWouldBeTelling@thevillage.com> wrote:

BTW in Jan 2023 Chips-and-cheese did a two part retrospective deep dive
on Bulldozer microarchitecture compared to Sandy Bridge and others.

It doesn't single out a culprit. It mentions the Bulldozer integer
scheduler as choosing between either a single oldest entry or based on >physical location, which sounds like a plain linear priority picker.
That might cause problems where it can cause saw-tooth performance as a >stalled dependency chain builds up waiting on one item, then releases.
That is why I used a circular priority picker, aka a round robin arbiter, >which makes the last selected slot the lowest circular priority
and thereby ensures that each scheduler slot is serviced evenly.

But it sounds like there were many interacting issues.

https://chipsandcheese.com/2023/01/22/bulldozer-amds-crash-modernization-frontend-and-execution-engine/

https://chipsandcheese.com/2023/01/24/bulldozer-amds-crash-modernization-caching-and-conclusion/

That's interesting. And here I thought the problem with Bulldozer was well-known: it tried for efficiency by making the same mistake as the
Pentium 4, having fewer gate delays per cycle so it could have a
faster clock rate.

As those who post here realize, this isn't just for bragging rights:
it means a larger number of instructions can be executing in parallel
at different stages of operation on the pipeline. And the problem with
that, at least for the Pentium 4, was that thermal issues limited
further increases to the clock speed with so many transistors active
at once from the extra instructions.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Thu Jun 20 16:41:16 2024

MitchAlsup1 <mitchalsup@aol.com> schrieb:

Now, if &b[1] happens to = &a[0], then your construction fails while VVM succeeds--it just runs slower because there IS a dependency checked by
HW and enforced. In those situations where the dependency is
nonexistent,
then the loop vectorizes--and the programmer remains blissfuly unaware.

The performance loss can be significant, unfortunately, depending
on the ratio of the width of the data in quesiton to the width of
the SIMD which actually performs the operation. In the case of
8-bit data and 256-bit wide SIMD, this would be a factor of 32,
which could lead to a slowdown of a factor of... 25, maybe?
This would be enough to trigger bug reports, I can tell you from
experience :-)

One technique that could get around that would be loop reversal,
with a branch to the correct loop at runtime (or a predicate
chosing the right values for the loop constants).

An option to raise an exception when there is a slowdown due
to loops running the wrong direction could be helpful in this
context.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Thomas Koenig on Thu Jun 20 19:20:12 2024

Thomas Koenig wrote:

MitchAlsup1 <mitchalsup@aol.com> schrieb:

Now, if &b[1] happens to = &a[0], then your construction fails while
VVM
succeeds--it just runs slower because there IS a dependency checked by
HW and enforced. In those situations where the dependency is
nonexistent,
then the loop vectorizes--and the programmer remains blissfuly unaware.

The performance loss can be significant, unfortunately, depending
on the ratio of the width of the data in quesiton to the width of
the SIMD which actually performs the operation. In the case of
8-bit data and 256-bit wide SIMD, this would be a factor of 32,
which could lead to a slowdown of a factor of... 25, maybe?
This would be enough to trigger bug reports, I can tell you from
experience :-)

Between 2000 and 2006 I was enamored with SIMD as a proxy for
vectorization. Then, as illustrated above, I started to recognize
the dark alley SIMD was leading others down. In effect it was no
better than vectorization and made the compilers so much more
difficult (alias analysis),... and then there is the explosion
of OpCode space consumption,... All of which I wanted to avoid.
The R in RISC should stand for Reduced, not Ridiculous.

Those are simply some of the reasons VVM is "not like that".
VVM requires that the compiler solve exactly none of the
alias problems, the minimum length problems, and works when
typical CRAY-like vectors fail, and where SIMD fails. The
reason it works is the HW recognizes aliasing and slows down
instead of jumping off the cliff.

One technique that could get around that would be loop reversal,
with a branch to the correct loop at runtime (or a predicate
choosing the right values for the loop constants).

Costing code density, compiler complexity, assembly language
programming complexity, and is unnecessary.

An option to raise an exception when there is a slowdown due
to loops running the wrong direction could be helpful in this
context.

Making the slow down even slower/greater.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Fri Jun 21 18:10:57 2024

MitchAlsup1 <mitchalsup@aol.com> schrieb:

Thomas Koenig wrote:

MitchAlsup1 <mitchalsup@aol.com> schrieb:

Now, if &b[1] happens to = &a[0], then your construction fails while
VVM
succeeds--it just runs slower because there IS a dependency checked by
HW and enforced. In those situations where the dependency is
nonexistent,
then the loop vectorizes--and the programmer remains blissfuly unaware.

The performance loss can be significant, unfortunately, depending
on the ratio of the width of the data in quesiton to the width of
the SIMD which actually performs the operation. In the case of
8-bit data and 256-bit wide SIMD, this would be a factor of 32,
which could lead to a slowdown of a factor of... 25, maybe?
This would be enough to trigger bug reports, I can tell you from
experience :-)

Between 2000 and 2006 I was enamored with SIMD as a proxy for
vectorization. Then, as illustrated above, I started to recognize
the dark alley SIMD was leading others down. In effect it was no
better than vectorization and made the compilers so much more
difficult (alias analysis),... and then there is the explosion
of OpCode space consumption,... All of which I wanted to avoid.
The R in RISC should stand for Reduced, not Ridiculous.

Those are simply some of the reasons VVM is "not like that".
VVM requires that the compiler solve exactly none of the
alias problems, the minimum length problems, and works when
typical CRAY-like vectors fail, and where SIMD fails. The
reason it works is the HW recognizes aliasing and slows down
instead of jumping off the cliff.

One technique that could get around that would be loop reversal,
with a branch to the correct loop at runtime (or a predicate
choosing the right values for the loop constants).

Costing code density,

Managable, and for a large fractor in performance very much
acceptable.

compiler complexity,

Been there, done that. For a language like Fortran with its array
expressions, it's not too bad.

assembly language
programming complexity,

That should not be a large matter, compilers exist for a reason :-)

and is unnecessary.

In my experience, 50% speed is already enough for a bug report.

An option to raise an exception when there is a slowdown due
to loops running the wrong direction could be helpful in this
context.

Making the slow down even slower/greater.

This would be used for performance optimization only, something
like gfortran's -fcheck=array-temps.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to BGB on Mon Jul 29 06:49:16 2024

On Sun, 28 Apr 2024 14:00:28 -0500, BGB wrote:

Some stuff say builds should not take quite this long, but this is what
I am seeing...

On Windows, the filesystem can be quite a bottleneck.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From George Neuner@21:1/5 to BGB on Mon Jul 29 11:31:52 2024

On Sun, 28 Apr 2024 14:00:28 -0500, BGB <cr88192@gmail.com> wrote:

On 4/28/2024 4:43 AM, BGB wrote:

On 4/28/2024 4:05 AM, Thomas Koenig wrote:

BGB <cr88192@gmail.com> schrieb:

Still watching LLVM build (several hours later), kinda of an interesting >>>> meta aspect in its behaviors.

Don't build it in debug mode.

I was building it in MinSizeRel mode...

Also "-j 4".

Didn't want to go too much higher as this would likely bog down PC harder.

Some stuff say builds should not take quite this long, but this is what
I am seeing...

Maybe done already but you should check that the project is using
precompiled headers.

When used properly, precompiled headers can save enormous amounts of
build time. OTOH, if used poorly, they actually can add significantly
to already too long build times.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Terje Mathisen@21:1/5 to Lawrence D'Oliveiro on Mon Jul 29 21:23:34 2024

Lawrence D'Oliveiro wrote:

On Sun, 28 Apr 2024 14:00:28 -0500, BGB wrote:

Some stuff say builds should not take quite this long, but this is what
I am seeing...

On Windows, the filesystem can be quite a bottleneck.

Said bottleneck is almost always related to anti-virus programs, if you
are accessing a lot of files the latency of the scanner green light can
be pretty awful.

I know that rustup, the Rust installation tool, used to be around an
order of magnitude slower on Windows than Linux, with the same hardware platform. It turns out that the maximum throughput for AV+NTFS is about
the same as for Linux, so by making all the file create/write/close
operations async, they managed to get speed parity.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Terje Mathisen on Mon Jul 29 20:46:28 2024

On Mon, 29 Jul 2024 19:23:34 +0000, Terje Mathisen wrote:

Lawrence D'Oliveiro wrote:

On Sun, 28 Apr 2024 14:00:28 -0500, BGB wrote:

Some stuff say builds should not take quite this long, but this is what
I am seeing...

On Windows, the filesystem can be quite a bottleneck.

Said bottleneck is almost always related to anti-virus programs, if you
are accessing a lot of files the latency of the scanner green light can
be pretty awful.

People still run anti-virus software ?!? Gasp....no wonder we need
5GHz processors.......

I don't run them anymore...

I know that rustup, the Rust installation tool, used to be around an
order of magnitude slower on Windows than Linux, with the same hardware platform. It turns out that the maximum throughput for AV+NTFS is about
the same as for Linux, so by making all the file create/write/close operations async, they managed to get speed parity.

Terje

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to All on Tue Jul 30 01:41:30 2024

On Mon, 29 Jul 2024 20:46:28 +0000, MitchAlsup1 wrote:

I don't run [antivirus programs] anymore...

On a Linux system, there is (still) little or no need for them.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Stefan Monnier on Tue Jul 30 02:08:38 2024

On Thu, 13 Jun 2024 13:24:47 -0400, Stefan Monnier wrote:

I tend to think of it a bit like the transition from using overlays and segments to the use of on-demand paging.

Burroughs-style segmentation I think made a lot of sense, even if the
overhead of implementing it was high.

Intel was the one that gave segmentation a bad name.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to BGB on Tue Jul 30 01:40:50 2024

On Mon, 29 Jul 2024 12:31:27 -0500, BGB wrote:

Would be nice if, maybe they could add a white-list or something so that
it can stop checking files it has already looked at recently (and which
have not changed).

Any kind of exemptions from checking will sooner or later be exploited by
some piece of malware.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to All on Tue Jul 30 02:10:05 2024

On Thu, 13 Jun 2024 17:00:55 +0000, MitchAlsup1 wrote:

Having gone through the transition (1-wide IO, 2-wide IO, 6-wide OoO)
the OoO machine was simply less complexity--or to say it a different
way--the complexity was more orderly (and more easily verified).

Isn’t it true that a recent out-of-order processor can have something like 100 instructions in flight at the same time? (Probably more by now.)

That would be equivalent to a 100-wide in-order processor, would it not?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Lawrence D'Oliveiro on Tue Jul 30 12:04:59 2024

On Tue, 30 Jul 2024 02:10:05 -0000 (UTC)
Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

On Thu, 13 Jun 2024 17:00:55 +0000, MitchAlsup1 wrote:

Having gone through the transition (1-wide IO, 2-wide IO, 6-wide
OoO) the OoO machine was simply less complexity--or to say it a
different way--the complexity was more orderly (and more easily
verified).

Isn’t it true that a recent out-of-order processor can have something
like 100 instructions in flight at the same time? (Probably more by
now.)

That would be equivalent to a 100-wide in-order processor, would it
not?

It would not.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to mitchalsup@aol.com on Tue Jul 30 12:24:05 2024

On Mon, 29 Jul 2024 20:46:28 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:

On Mon, 29 Jul 2024 19:23:34 +0000, Terje Mathisen wrote:

Lawrence D'Oliveiro wrote:

On Sun, 28 Apr 2024 14:00:28 -0500, BGB wrote:

Some stuff say builds should not take quite this long, but this
is what I am seeing...

On Windows, the filesystem can be quite a bottleneck.

Said bottleneck is almost always related to anti-virus programs, if
you are accessing a lot of files the latency of the scanner green
light can be pretty awful.

People still run anti-virus software ?!? Gasp....no wonder we need
5GHz processors.......

I don't run them anymore...

In corporate environments people are forced to run virii software
mistakenly called "antivirus". It's not their own choice.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to mitchalsup@aol.com on Tue Jul 30 17:41:05 2024

mitchalsup@aol.com (MitchAlsup1) writes:

On Tue, 30 Jul 2024 9:24:05 +0000, Michael S wrote:

In corporate environments people are forced to run virii software
mistakenly called "antivirus". It's not their own choice.

Ahh, another advantage of retirement.

MS tends to break things.

Executable email content? WTF?
Invisible file extensions? WTF?
HTML in usenet messages? WTF?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Michael S on Tue Jul 30 17:13:14 2024

On Tue, 30 Jul 2024 9:24:05 +0000, Michael S wrote:

On Mon, 29 Jul 2024 20:46:28 +0000
mitchalsup@aol.com (MitchAlsup1) wrote:

On Mon, 29 Jul 2024 19:23:34 +0000, Terje Mathisen wrote:

Lawrence D'Oliveiro wrote:

On Sun, 28 Apr 2024 14:00:28 -0500, BGB wrote:

Some stuff say builds should not take quite this long, but this
is what I am seeing...

On Windows, the filesystem can be quite a bottleneck.

Said bottleneck is almost always related to anti-virus programs, if
you are accessing a lot of files the latency of the scanner green
light can be pretty awful.

People still run anti-virus software ?!? Gasp....no wonder we need
5GHz processors.......

I don't run them anymore...

In corporate environments people are forced to run virii software
mistakenly called "antivirus". It's not their own choice.

Ahh, another advantage of retirement.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to BGB on Tue Jul 30 23:51:02 2024

On Mon, 29 Jul 2024 22:40:39 -0500, BGB wrote:

If a file has not been modified, and was already confirmed good, you
don't really need to verify it again...

How do you tell whether it’s been modified or not, without actually
examining its entire contents?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to All on Tue Jul 30 23:56:13 2024

On Tue, 18 Jun 2024 21:09:51 +0000, MitchAlsup1 wrote:

Because of this, I altered my programming style and almost never end up
using ++ or -- anymore.

I never got into the habit of combining those with other operations in the
same expression. It always seemed unsound from the viewpoint of code comprehensibility to me.

Since I ignore the results from these operations, post- versus pre- increment/decrement makes no difference, so I normally use the pre- forms.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to Michael S on Tue Jul 30 23:51:39 2024

On Tue, 30 Jul 2024 12:04:59 +0300, Michael S wrote:

On Tue, 30 Jul 2024 02:10:05 -0000 (UTC) Lawrence D'Oliveiro
<ldo@nz.invalid> wrote:

That would be equivalent to a 100-wide in-order processor, would it
not?

It would not.

Why not?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Michael S@21:1/5 to Lawrence D'Oliveiro on Wed Jul 31 10:40:55 2024

On Tue, 30 Jul 2024 23:51:39 -0000 (UTC)
Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

On Tue, 30 Jul 2024 12:04:59 +0300, Michael S wrote:

On Tue, 30 Jul 2024 02:10:05 -0000 (UTC) Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

That would be equivalent to a 100-wide in-order processor, would it
not?

It would not.

Why not?

How many instructions in flight are there on 1-wide in-order MIPS
R3000? How many on 1-wide in-order MIPS R4000?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Scott Lurndal@21:1/5 to BGB on Wed Jul 31 13:06:14 2024

BGB <cr88192@gmail.com> writes:

On 7/30/2024 6:51 PM, Lawrence D'Oliveiro wrote:

On Mon, 29 Jul 2024 22:40:39 -0500, BGB wrote:

If a file has not been modified, and was already confirmed good, you
don't really need to verify it again...

How do you tell whether it’s been modified or not, without actually
examining its entire contents?

If one is already intercepting every filesystem call, it is possible to
keep track of whether a given file was opened for writing, deleted,
written to, ...

Good grief.

Just store a 'modified' timestamp in the filesystem metadata
like every filesystem has done for the past six decades.

If a file has not been opened for writing or written to or similar,
since the last time it was looked at, it is possible to safely infer
that its contents are still the same (assuming a conventional
filesystem, like FAT or NTFS or similar).

I wouldn't consider that a safe assumption.

Alternatively, one could check modification times (like what "make" or >similar does), but this is less provable (if there exists any way to
modify a file without updating its modification time).

Of course there is. It's not necessarily easy, but e.g. on unix/linux
one can (if suitably privileged) access the underlying device directly bypassing the filesystem.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Michael S on Wed Jul 31 14:48:58 2024

On Wed, 31 Jul 2024 7:40:55 +0000, Michael S wrote:

On Tue, 30 Jul 2024 23:51:39 -0000 (UTC)
Lawrence D'Oliveiro <ldo@nz.invalid> wrote:

On Tue, 30 Jul 2024 12:04:59 +0300, Michael S wrote:

On Tue, 30 Jul 2024 02:10:05 -0000 (UTC) Lawrence D'Oliveiro
<ldo@nz.invalid> wrote:

That would be equivalent to a 100-wide in-order processor, would it
not?

It would not.

Why not?

How many instructions in flight are there on 1-wide in-order MIPS
R3000?

4 int or Mem up to 6 FP

How many on 1-wide in-order MIPS R4000?

6 Int or Mem up to 9 FP

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to BGB on Wed Jul 31 23:18:27 2024

On Tue, 30 Jul 2024 20:15:21 -0500, BGB wrote:

I tend to always have Windows set to show file extensions ...

Surely that’s a function of the GUI, not the OS per se.

Oh wait, Windows dates from the 1990s, when it was the fashion to mash the
two together.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to All on Wed Jul 31 23:54:17 2024

On Wed, 31 Jul 2024 14:48:58 +0000, MitchAlsup1 wrote:

On Wed, 31 Jul 2024 7:40:55 +0000, Michael S wrote:

How many instructions in flight are there on 1-wide in-order MIPS
R3000?

4 int or Mem up to 6 FP

How many on 1-wide in-order MIPS R4000?

6 Int or Mem up to 9 FP

But that’s just “pipelining”, isn’t it?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Lawrence D'Oliveiro on Thu Aug 1 00:38:03 2024

On Wed, 31 Jul 2024 23:54:17 +0000, Lawrence D'Oliveiro wrote:

On Wed, 31 Jul 2024 14:48:58 +0000, MitchAlsup1 wrote:

On Wed, 31 Jul 2024 7:40:55 +0000, Michael S wrote:

How many instructions in flight are there on 1-wide in-order MIPS
R3000?

4 int or Mem up to 6 FP

How many on 1-wide in-order MIPS R4000?

6 Int or Mem up to 9 FP

But that’s just “pipelining”, isn’t it?

Yes, but both were 1-wide IO machines/pipelines

I am counting from the time the instruction leaves DECODE until
it finishes writing its result into the RF and can be discarded.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Lawrence D'Oliveiro@21:1/5 to BGB on Wed Jul 31 23:17:06 2024

On Tue, 30 Jul 2024 19:52:14 -0500, BGB wrote:

On 7/30/2024 6:51 PM, Lawrence D'Oliveiro wrote:

On Mon, 29 Jul 2024 22:40:39 -0500, BGB wrote:

If a file has not been modified, and was already confirmed good, you
don't really need to verify it again...

How do you tell whether it’s been modified or not, without actually
examining its entire contents?

If one is already intercepting every filesystem call, it is possible to
keep track of whether a given file was opened for writing, deleted,
written to, ...

Unless, of course, someone else is also intercepting those calls, and
inserting their own calls.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to mitchalsup@aol.com on Thu Aug 1 18:16:01 2024

MitchAlsup1 <mitchalsup@aol.com> schrieb:

On Mon, 29 Jul 2024 19:23:34 +0000, Terje Mathisen wrote:

Lawrence D'Oliveiro wrote:

On Sun, 28 Apr 2024 14:00:28 -0500, BGB wrote:

Some stuff say builds should not take quite this long, but this is what >>>> I am seeing...

On Windows, the filesystem can be quite a bottleneck.

Said bottleneck is almost always related to anti-virus programs, if you
are accessing a lot of files the latency of the scanner green light can
be pretty awful.

People still run anti-virus software ?!? Gasp....no wonder we need
5GHz processors.......

You might have noticed the recent Crowdstrike fiasco...

And yes, a it is absolutely astonishing that people are willing to
accept the performance degradation this brings.

I don't run them anymore...

Do you use Windows? If so, did you disable Microsoft Defender?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Thomas Koenig on Thu Aug 1 20:09:22 2024

On Thu, 1 Aug 2024 18:16:01 +0000, Thomas Koenig wrote:

MitchAlsup1 <mitchalsup@aol.com> schrieb:

On Mon, 29 Jul 2024 19:23:34 +0000, Terje Mathisen wrote:

Lawrence D'Oliveiro wrote:

On Sun, 28 Apr 2024 14:00:28 -0500, BGB wrote:

Some stuff say builds should not take quite this long, but this is what >>>>> I am seeing...

On Windows, the filesystem can be quite a bottleneck.

Said bottleneck is almost always related to anti-virus programs, if you
are accessing a lot of files the latency of the scanner green light can
be pretty awful.

People still run anti-virus software ?!? Gasp....no wonder we need
5GHz processors.......

You might have noticed the recent Crowdstrike fiasco...

And yes, a it is absolutely astonishing that people are willing to
accept the performance degradation this brings.

I don't run them anymore...

Do you use Windows?

Yes

If so, did you disable Microsoft Defender?

Don't know. Once I get windows configured like I want (and BTW W7 was
the
best version ever) I take the drive to a linux machine and DD it making recovery a 5 minute affair instead of a 4 hour affair. ...

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

Who's Online
Recent Visitors
- Volatile_Memory
  Wed Sep 17 07:20:57 2025
  from Des Moines, Iowa via SSH
- Volatile_Memory
  Wed Sep 17 07:17:26 2025
  from Des Moines, Iowa via SSH
- Bob Worm
  Tue Sep 16 21:01:27 2025
  from Wales, Uk via Telnet
- Bob Worm
  Tue Sep 16 15:15:42 2025
  from Wales, Uk via Telnet
- Gretchiie
  Tue Sep 16 05:20:21 2025
  from Derry, Nh via Telnet
- Ginger1
  Mon Sep 15 19:33:54 2025
  from London via SSH
- Bob Worm
  Mon Sep 15 15:42:34 2025
  from Wales, Uk via Telnet
- Gretchiie
  Mon Sep 15 05:16:29 2025
  from Derry, Nh via Telnet

System Info

Sysop:	Keyop
Location:	Huddersfield, West Yorkshire, UK
Users:	546
Nodes:	16 (2 / 14)
Uptime:	45:42:50
Calls:	10,394
Calls today:	2
Files:	14,066
Messages:	6,417,268

Stealing a Great Idea from the 6600

Who's Online

Recent Visitors

System Info