Forum: >>> Magnum BBS <<<

Concertina II: Full Circle

From John Savard@21:1/5 to All on Mon Jun 17 11:33:45 2024

I've noted earlier that I felt I had been going around in circles with
Concertina II, changing the instruction format back and forth, instead
of making progress to flesh it out.
Recently, I added a new instruction to facilitate looping.
But the trouble was that it took up too much opcode space.
One thing that occurred to me was that if I went back to an old method
of specifying instructions longer than 32 bits, using a 4-bit pSupp
field to point into the same reserved area in the block as used for
pseudo-immediates, that would suit this instruction very well.

The reason is that if that technique were used, then I could use the
header that's also an instruction to just squeeze in the three-bit
decode field, and so access to the Loop instruction would be easy as
befits its importance.

Then I went back, and looked up an older version of Concertina II
which had it. It had complicated block headers. But worse than that,
it had _four_ different versions of the complete instruction set!
Which version was used depended on the header. The idea, of course,
was that some headers required a pared-down version of the instruction
set so as to squeeze in more stuff.
It was also interesting to see how much further along I had gotten in
fleshing out that older version of the instruction set.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to John Savard on Mon Jun 17 20:20:03 2024

John Savard wrote:

I've noted earlier that I felt I had been going around in circles with Concertina II, changing the instruction format back and forth, instead
of making progress to flesh it out.

At least no one can say you give up easily.

Recently, I added a new instruction to facilitate looping.
But the trouble was that it took up too much opcode space.

Yes, indeed...

One thing that occurred to me was that if I went back to an old method
of specifying instructions longer than 32-bits: using a 4-bit pSupp
field to point into the same reserved area in the block as used for pseudo-immediates, that would suit this instruction very well.

The reason is that if that technique were used, then I could use the
header that's also an instruction to just squeeze in the three-bit
decode field, and so access to the Loop instruction would be easy as
befits its importance.

Then I went back, and looked up an older version of Concertina II
which had it. It had complicated block headers. But worse than that,
it had _four_ different versions of the complete instruction set!
Which version was used depended on the header. The idea, of course,
was that some headers required a pared-down version of the instruction
set so as to squeeze in more stuff.
It was also interesting to see how much further along I had gotten in fleshing out that older version of the instruction set.

As to looping, I faced the same dilemma and came to a different
conclusion::
You don't do it in 1 instruction; instead, you do it in a way where
your 2-instruction encoding executes one of the instructions only once.
I call this bookending the loop.

So, I have an instruction called VEC, which donates a register and
provides other guidance to the loop. And I have an instruction called
LOOP which performs the bottom-of-loop calculations. The register
donated by VEC is given the address of the top of the loop so that the
loop-terminating instruction is relieved of needing to supply it in the
form of a displacement.

VEC is executed once at the top of the loop, and provides guidance as
to which registers from within the loop are live-out of the loop.
This allows HW to avoid writing everything into RF and facilitates
running the loop across multiple lanes of function units.

LOOP, then, performs the ADD, a CMP, and a BC to the top of the loop.

I ended up with 3 kinds of LOOPs:
a) counted -- for( i = 0; i < max; i++ )
b) searching -- for( i = 0; a[i] > 13; i++ )
c) both -- for( i = 0; i < max && a[i]; i++ )

This covers the majority of loops where the looping condition is
encodable.
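
To make the three cases concrete, here is a minimal C sketch of each loop shape as a complete function. It only illustrates the source-level patterns; the function names are made up for this example, and nothing here shows how VEC or LOOP is actually encoded.

```c
#include <assert.h>
#include <stddef.h>

/* a) counted: for (i = 0; i < max; i++) */
long sum_counted(const int *a, size_t max)
{
    long sum = 0;
    for (size_t i = 0; i < max; i++)
        sum += a[i];
    return sum;
}

/* b) searching: for (i = 0; a[i] > limit; i++)
   The caller must guarantee that some element is <= limit,
   otherwise the search runs off the end of the array. */
size_t find_first_not_above(const int *a, int limit)
{
    assert(a != NULL);
    size_t i;
    for (i = 0; a[i] > limit; i++)
        ;
    return i;
}

/* c) both: for (i = 0; i < max && a[i]; i++) */
size_t bounded_length(const int *a, size_t max)
{
    size_t i;
    for (i = 0; i < max && a[i]; i++)
        ;
    return i;
}
```

In each function the bottom of the loop is the same three steps (add to the index, compare, conditionally branch back); only the condition being compared differs.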


From John Savard@21:1/5 to All on Mon Jun 17 15:57:53 2024

On Mon, 17 Jun 2024 20:20:03 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

As to looping, I faced the same dilemma and came to a different
conclusion::
You don't do it in 1 instruction; instead, you do it in a way where
your 2-instruction encoding executes one of the instructions only once.
I call this bookending the loop.

I considered something like that.

My problem was that encoding the parameters of the loop in one
instruction takes too much space. So the first thing I thought of was
to put some of them in the instruction that repeats the loop.

The problem was, though, that since the instruction that repeats the
loop points to the start of the loop in memory, it's a
memory-reference instruction, so there isn't much extra room left in
it.

However, there is a little room left, so I may indeed go back and
explore that possibility some more.

John Savard


From MitchAlsup1@21:1/5 to John Savard on Mon Jun 17 23:17:27 2024

John Savard wrote:

On Mon, 17 Jun 2024 20:20:03 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

As to looping, I faced the same dilemma and came to a different
conclusion::
You don't do it in 1 instruction; instead, you do it in a way where
your 2-instruction encoding executes one of the instructions only once.
I call this bookending the loop.

I considered something like that.

My problem was that encoding the parameters of the loop in one
instruction takes too much space. So the first thing I thought of was
to put some of them in the instruction that repeats the loop.

The problem was, though, that since the instruction that repeats the
loop points to the start of the loop in memory, it's a
memory-reference instruction, so there isn't much extra room left in
it.

No, it is not a memref--it is a return ! using the register from the
VEC instruction. You "return" to the top of the loop. There is no
reason to use IP+Disp, and the fact there is no register nor
displacement in LOOP enables it all to fit. In addition, when VEC
executes, IP is pointing at the top of the loop, requiring no
calculation whatsoever.

However, there is a little room left, so I may indeed go back and
explore that possibility some more.


From John Savard@21:1/5 to quadibloc@servername.invalid on Tue Jun 18 10:11:40 2024

On Tue, 18 Jun 2024 10:01:20 -0600, John Savard
<quadibloc@servername.invalid> wrote:

On Mon, 17 Jun 2024 23:17:27 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

No, it is not a memref--it is a return ! using the register from the
VEC instruction.

As should not surprise you, I was referring to the end-of-loop
instruction in my current Concertina II, not the one in your MY 66000.

I try to avoid stacks, and reserving extra registers, as much as I
can.

Also, this looping instruction is strictly a way to directly encode
the FORTRAN DO loop. It does not attempt any vectorization.

At one point, in the original Concertina, I did have a sort of
loop/vectorize instruction with a functionality that may be somewhat
similar to your VVM. I am definitely going to look at adding that to
Concertina II, as this will perhaps clarify the discussion.

John Savard


From John Savard@21:1/5 to All on Tue Jun 18 10:01:20 2024

On Mon, 17 Jun 2024 23:17:27 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

No, it is not a memref--it is a return ! using the register from the
VEC instruction.

As should not surprise you, I was referring to the end-of-loop
instruction in my current Concertina II, not the one in your MY 66000.

I try to avoid stacks, and reserving extra registers, as much as I
can.

John Savard


From MitchAlsup1@21:1/5 to John Savard on Tue Jun 18 16:54:04 2024

John Savard wrote:

On Tue, 18 Jun 2024 10:01:20 -0600, John Savard <quadibloc@servername.invalid> wrote:

On Mon, 17 Jun 2024 23:17:27 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

No, it is not a memref--it is a return ! using the register from the
VEC instruction.

As should not surprise you, I was referring to the end-of-loop
instruction in my current Concertina II, not the one in your MY 66000.

I try to avoid stacks, and reserving extra registers, as much as I
can.

Also, this looping instruction is strictly a way to directly encode
the FORTRAN DO loop. It does not attempt any vectorization.

The semantics of instructions in a loop are subtly altered such
that they can be vectorized and executed multi-lane style.

At one point, in the original Concertina, I did have a sort of
loop/vectorize instruction with a functionality that may be somewhat
similar to your VVM. I am definitely going to look at adding that to Concertina II, as this will perhaps clarify the discussion.


From Thomas Koenig@21:1/5 to John Savard on Tue Jun 18 16:17:33 2024

John Savard <quadibloc@servername.invalid> schrieb:

On Tue, 18 Jun 2024 10:01:20 -0600, John Savard
<quadibloc@servername.invalid> wrote:

On Mon, 17 Jun 2024 23:17:27 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

No, it is not a memref--it is a return ! using the register from the
VEC instruction.

As should not surprise you, I was referring to the end-of-loop
instruction in my current Concertina II, not the one in your MY 66000.

I try to avoid stacks, and reserving extra registers, as much as I
can.

Also, this looping instruction is strictly a way to directly encode
the FORTRAN DO loop. It does not attempt any vectorization.

Which one, the FORTRAN 66 one or the one since FORTRAN 77?


From MitchAlsup1@21:1/5 to John Savard on Tue Jun 18 16:52:22 2024

John Savard wrote:

On Mon, 17 Jun 2024 23:17:27 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

No, it is not a memref--it is a return ! using the register from the
VEC instruction.

As should not surprise you, I was referring to the end-of-loop
instruction in my current Concertina II, not the one in your MY 66000.

It may surprise you to know that I knew and know that you are talking
about Concer-tina-tanic.

I was merely trying to show you another way to get back to the top
of a loop--one that takes way fewer bits to encode.

I try to avoid stacks, and reserving extra registers, as much as I
can.

My LOOP has no stack.


From John Savard@21:1/5 to tkoenig@netcologne.de on Tue Jun 18 13:17:41 2024

On Tue, 18 Jun 2024 16:17:33 -0000 (UTC), Thomas Koenig
<tkoenig@netcologne.de> wrote:

John Savard <quadibloc@servername.invalid> schrieb:

Also, this looping instruction is strictly a way to directly encode
the FORTRAN DO loop. It does not attempt any vectorization.

Which one, the FORTRAN 66 one or the one since FORTRAN 77?

FORTRAN IV (or 66) indeed.

John Savard


From John Savard@21:1/5 to All on Tue Jun 18 13:38:23 2024

On Tue, 18 Jun 2024 16:54:04 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

The semantics of instructions in a loop are subtly altered such
that they can be vectorized and executed multi-lane style.

I've decided that I will not be able to use the one from the original Concertina, and will need to design a VVM-like instruction for
Concertina II from scratch.

Unlike yours, it won't be...subtle.

The action of the instruction which begins the loop will, I think, be
basically the same as yours. It will issue successive iterations of
the loop starting in consecutive cycles.

To do so, though, that instruction will contain a number of fields in
which to specify parameters:

(3 bits) An index register, which is initialized to zero at the start
of the loop, and "incremented" (the quote marks are, of course,
because it won't really be the same register on each iteration) for
subsequent iterations.
(3 bits) The power of two which is to serve as the increment.
(8 bits) A register mask, in which a 1 bit corresponds to a register
used for intermediate results within the loop. This will become a
forwarding node rather than a register; all other registers can only
be read, and serve as constant values only. The index register set up previously does not need to be indicated by this.
(2 bits) This indicates which of the four groups of 8 registers in a
bank of 32 registers the register mask applies to.
(1 bit) This indicates whether we're talking about the integer
registers or the floating-point ones.

In addition, in the long version of the instruction, there's a 16-bit
register mask for the short vector registers.
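
As a sanity check on the bit budget, the five fields listed above add up to 3 + 3 + 8 + 2 + 1 = 17 bits. The following C sketch packs and unpacks them. The bit positions, the struct, and every function name are assumptions made purely for illustration; the actual placement of these fields within the instruction word has not been specified here.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout, low bits first:
   [2:0]   index register
   [5:3]   log2 of the increment
   [13:6]  register mask (1 = forwarding node)
   [15:14] which group of 8 registers the mask covers
   [16]    0 = integer registers, 1 = floating-point registers */
typedef struct {
    unsigned index_reg;  /* 0..7 */
    unsigned log2_inc;   /* 0..7, increment is 1 << log2_inc */
    unsigned reg_mask;   /* 8 bits */
    unsigned group;      /* 0..3 */
    unsigned is_float;   /* 0 or 1 */
} loop_start_fields;

uint32_t pack_loop_start(loop_start_fields f)
{
    assert(f.index_reg < 8 && f.log2_inc < 8 && f.reg_mask < 256
           && f.group < 4 && f.is_float < 2);
    return (uint32_t)f.index_reg
         | (uint32_t)f.log2_inc << 3
         | (uint32_t)f.reg_mask << 6
         | (uint32_t)f.group << 14
         | (uint32_t)f.is_float << 16;
}

loop_start_fields unpack_loop_start(uint32_t w)
{
    loop_start_fields f;
    f.index_reg = w & 7u;
    f.log2_inc  = (w >> 3) & 7u;
    f.reg_mask  = (w >> 6) & 0xFFu;
    f.group     = (w >> 14) & 3u;
    f.is_float  = (w >> 16) & 1u;
    return f;
}

/* Register number within the 32-register bank for bit `bit` of the mask. */
unsigned mask_bit_to_register(unsigned group, unsigned bit)
{
    return group * 8u + bit;
}
```

With this layout, a mask bit 7 in group 3 names register 31 of the selected bank, and the whole parameter set fits in 17 bits.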

Because iterations are independent, one can't handle a stride in the
natural efficient manner of adding the stride value to a second
pointer register. This could be a common source of error, so I feel
the need to make some provision for this.
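
The difference can be shown in C terms. The sketch below is only an illustration under my own assumptions (the function names are hypothetical, and it models source-level behaviour, not the hardware): the first version carries a pointer from one iteration to the next, which is the serial dependence in question, while the second derives each address from the index alone.

```c
#include <assert.h>
#include <stddef.h>

/* Serial form: p is updated every iteration, so iteration i+1
   cannot compute its address until iteration i has finished. */
void copy_strided_pointer(int *dst, const int *src, size_t n, size_t stride)
{
    assert(dst != NULL && src != NULL);
    const int *p = src;
    for (size_t i = 0; i < n; i++) {
        dst[i] = *p;
        if (i + 1 < n)      /* do not step past the last element used */
            p += stride;
    }
}

/* Independent form: each address depends only on i, so iterations
   can start in any order or overlap freely. */
void copy_strided_index(int *dst, const int *src, size_t n, size_t stride)
{
    assert(dst != NULL && src != NULL);
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i * stride];
}
```

Both functions produce the same result; the second pays for its independence with a multiplication (or a shift, when the stride is a power of two) per iteration.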

One scheme I am considering would be to include one bit in the
instruction that begins a loop to indicate the loop contains a
preamble. The preambles execute serially, and when they conclude,
everything that follows is issued immediately, to execute in parallel
(but now with a multi-cycle offset) to previous iterations.

Upon reflection, this doesn't waste a huge amount of time, so it is
better to go with it than including fields for stride value and a
second counter register in the loop start instruction.

Since the preambles do execute serially, the "end preamble"
instruction would point to the loop start instruction. Instead of full memory-reference, though, it would just include a short value that is
a negative program-relative address.

Iterations that execute in parallel, though, don't "branch back"
anywhere, so the loop end instruction has no parameters. At least
something is like your VVM.

So this is how I take your VVM concept, and mess it up by making it unnecessarily complicated; basically, because I don't want to make an
ISA that requires implementations to be, so to speak, "intelligent".
(i.e. upon the first store into a register in the loop, categorize
that register as a node reference)

John Savard


From Thomas Koenig@21:1/5 to John Savard on Tue Jun 18 19:40:48 2024

John Savard <quadibloc@servername.invalid> schrieb:

On Tue, 18 Jun 2024 16:17:33 -0000 (UTC), Thomas Koenig
<tkoenig@netcologne.de> wrote:

John Savard <quadibloc@servername.invalid> schrieb:

Also, this looping instruction is strictly a way to directly encode
the FORTRAN DO loop. It does not attempt any vectorization.

Which one, the FORTRAN 66 one or the one since FORTRAN 77?

FORTRAN IV (or 66) indeed.

It was actually not defined in the standard, in practice it
was usually implemented by a test at the bottom of the loop,
and programs depended on that.

FORTRAN 77 fixed that, so now

DO 100 I=1,0

...
100 CONTINUE

is executed zero times.
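
For illustration, the two behaviours can be modelled in C. FORTRAN 77 defines the iteration count as MAX(INT((m2 - m1 + m3)/m3), 0) for DO I = m1, m2, m3. The second function is only a model of the common bottom-tested implementation described above, restricted to a positive step; the function names are invented for this sketch.

```c
#include <assert.h>

/* FORTRAN 77 iteration count: MAX(INT((m2 - m1 + m3) / m3), 0). */
long trip_count_f77(long m1, long m2, long m3)
{
    assert(m3 != 0);
    long n = (m2 - m1 + m3) / m3;  /* C division truncates toward zero, like INT */
    return n > 0 ? n : 0;
}

/* Bottom-tested loop: the body runs before the first test,
   so it always executes at least once. */
long trip_count_bottom_tested(long m1, long m2, long m3)
{
    assert(m3 > 0);
    long count = 0;
    long i = m1;
    do {
        count++;        /* stands in for the loop body */
        i += m3;
    } while (i <= m2);
    return count;
}
```

For DO 100 I=1,0 the first function gives 0 trips and the second gives 1; for loops whose upper bound is not below the lower bound, the two agree.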


From John Savard@21:1/5 to tkoenig@netcologne.de on Tue Jun 18 14:15:48 2024

On Tue, 18 Jun 2024 19:40:48 -0000 (UTC), Thomas Koenig
<tkoenig@netcologne.de> wrote:

John Savard <quadibloc@servername.invalid> schrieb:

On Tue, 18 Jun 2024 16:17:33 -0000 (UTC), Thomas Koenig
<tkoenig@netcologne.de> wrote:

John Savard <quadibloc@servername.invalid> schrieb:

Also, this looping instruction is strictly a way to directly encode
the FORTRAN DO loop. It does not attempt any vectorization.

Which one, the FORTRAN 66 one or the one since FORTRAN 77?

FORTRAN IV (or 66) indeed.

It was actually not defined in the standard, in practice it
was usually implemented by a test at the bottom of the loop,
and programs depended on that.

FORTRAN 77 fixed that, so now

DO 100 I=1,0

...
100 CONTINUE

is executed zero times.

Ah. I can't include that fix now, as I've changed things so that one
of the parameters is at the end of the loop, so the instruction that
heads the loop doesn't know if the "step" parameter is negative or
not.

The change has not yet been posted.

I thought you were asking about whether I included stuff like DO
WHILE. That would have to be done using old-fashioned conditional
branch instructions.

John Savard


From MitchAlsup1@21:1/5 to John Savard on Tue Jun 18 21:23:57 2024

John Savard wrote:

On Tue, 18 Jun 2024 16:54:04 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

The semantics of instructions in a loop are subtly altered such
that they can be vectorized and executed multi-lane style.

I've decided that I will not be able to use the one from the original Concertina, and will need to design a VVM-like instruction for
Concertina II from scratch.

Unlike yours, it won't be...subtle.

LOL

The action of the instruction which begins the loop will, I think, be
basically the same as yours. It will issue successive iterations of
the loop starting in consecutive cycles.

To do so, though, that instruction will contain a number of fields in
which to specify parameters:

(3 bits) An index register, which is initialized to zero at the start
of the loop, and "incremented" (the quote marks are, of course,
because it won't really be the same register on each iteration) for subsequent iterations.

This is in the LOOP at the end.

(3 bits) The power of two which is to serve as the increment.

The increment is in the LOOP at the end and can be any random value
and is not necessarily fixed from iteration to iteration.

(8 bits) A register mask, in which a 1 bit corresponds to a register
used for intermediate results within the loop. This will become a
forwarding node rather than a register; all other registers can only
be read, and serve as constant values only. The index register set up previously does not need to be indicated by this.

The inverse of this is in VEC at the top. VEC provides a bit vector of
registers the compiler wants as Live-Out of the loop. That is,
everything else is temporary. This list rarely annotates more than 2
live-outs.

(2 bits) This indicates which of the four groups of 8 registers in a
bank of 32 registers the register mask applies to.

I have no register restraints.

(1 bit) This indicates whether we're talking about the integer
registers or the floating-point ones.

Loops controlled by floating point indexes do not vectorize, however
the body of the loop can be any mix of int, logical, memory, or FP
instructions.

In addition, in the long version of the instruction, there's a 16-bit register mask for the short vector registers.

Because iterations are independent, one can't handle a stride in the
natural efficient manner of adding the stride value to a second
pointer register. This could be a common source of error, so I feel
the need to make some provision for this.

Are you using stride in the sense of::

    for( i = 0; i < max; i += 7 )
        a[i] = b[i];

??
It gives VVM no problem whatsoever, however multilane execution
is more difficult, but semantically, the results remain correct.

One scheme I am considering would be to include one bit in the
instruction that begins a loop to indicate the loop contains a
preamble. The preambles execute serially, and when they conclude,
everything that follows is issued immediately, to execute in parallel
(but now with a multi-cycle offset) to previous iterations.

I just have instructions before the VEC instruction.

Upon reflection, this doesn't waste a huge amount of time, so it is
better to go with it than including fields for stride value and a
second counter register in the loop start instruction.

Since the preambles do execute serially, the "end preamble"
instruction would point to the loop start instruction. Instead of full memory-reference, though, it would just include a short value that is
a negative program-relative address.

Iterations that execute in parallel, though, don't "branch back"
anywhere, so the loop end instruction has no parameters. At least
something is like your VVM.

That is why you want LOOP to execute under a different paradigm than
BC.

So this is how I take your VVM concept, and mess it up by making it unnecessarily complicated; basically, because I don't want to make an
ISA that requires implementations to be, so to speak, "intelligent".
(i.e. upon the first store into a register in the loop, categorize
that register as a node reference)

Do you have a night job as a stand up comedian ??


From John Savard@21:1/5 to All on Tue Jun 18 16:01:34 2024

On Tue, 18 Jun 2024 21:23:57 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

(1 bit) This indicates whether we're talking about the integer
registers or the floating-point ones.

Loops controlled by floating point indexes do not vectorize, however
the body of the loop can be any mix of int, logical, memory, or FP
instructions.

Oh no, my index is always an integer. This bit applies to the
"live-in" bits - if the loop performs floating-point computation, it's floating-point registers that I want to mark as forwarding nodes.

And so you indicate this explicitly in VVM as well. I tended to assume
only a limited number of registers would be needed to live in, plus I
have both floating and integer register files, hence the differences.

John Savard


From Stephen Fuld@21:1/5 to Thomas Koenig on Tue Jun 18 22:42:32 2024

Thomas Koenig wrote:

John Savard <quadibloc@servername.invalid> schrieb:

On Tue, 18 Jun 2024 16:17:33 -0000 (UTC), Thomas Koenig
<tkoenig@netcologne.de> wrote:

John Savard <quadibloc@servername.invalid> schrieb:

Also, this looping instruction is strictly a way to directly encode
the FORTRAN DO loop. It does not attempt any vectorization.

Which one, the FORTRAN 66 one or the one since FORTRAN 77?

FORTRAN IV (or 66) indeed.

It was actually not defined in the standard, in practice it
was usually implemented by a test at the bottom of the loop,
and programs depended on that.

FORTRAN 77 fixed that, so now

DO 100 I=1,0

...
100 CONTINUE

is executed zero times.

How does VVM handle that? It seems you must "waste" some time, not
executing the loop body until the first LOOP instruction tells you
whether to or not, or perhaps not actually updating the values the
first time through the loop. Neither seems optimal. :-(

--
- Stephen Fuld
(e-mail address disguised to prevent spam)


From MitchAlsup1@21:1/5 to John Savard on Tue Jun 18 23:57:54 2024

John Savard wrote:

On Tue, 18 Jun 2024 21:23:57 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

(1 bit) This indicates whether we're talking about the integer
registers or the floating-point ones.

Loops controlled by floating point indexes do not vectorize, however
the body of the loop can be any mix of int, logical, memory, or FP
instructions.

Oh no, my index is always an integer. This bit applies to the
"live-in" bits - if the loop performs floating-point computation, it's floating-point registers that I want to mark as forwarding nodes.

See, I do not have this distinction, there is but one file.

And so you indicate this explicitly in VVM as well. I tended to assume
only a limited number of registers would be needed to live in, plus I
have both floating and integer register files, hence the differences.

It ends up that the majority of register uses in a loop do not need to
be visible outside of the loop. This is almost the contrapositive of
annotating which registers are temporary in the loop. 90%+ of loops do
not even need the index register to be live outside of the loop.
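
Put in terms of bit masks, the relationship between the two annotations looks roughly like the C sketch below. It assumes a single 32-register file and a compiler that already knows which registers the loop body writes; the function names are hypothetical and this is not a description of the actual VEC encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Bit r of a mask stands for register r (32 registers assumed). */
uint32_t reg_bit(unsigned r)
{
    assert(r < 32);
    return (uint32_t)1 << r;
}

/* Registers written in the loop that are not marked live-out are
   temporaries: their values never need to reach the register file. */
uint32_t loop_temporaries(uint32_t written_in_loop, uint32_t live_out)
{
    return written_in_loop & ~live_out;
}
```

When only one or two registers are live-out, the live-out mask is the sparser of the two sets, which is one reason annotating it rather than the temporaries can be the cheaper choice.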


From MitchAlsup1@21:1/5 to Stephen Fuld on Tue Jun 18 23:53:32 2024

Stephen Fuld wrote:

Thomas Koenig wrote:

John Savard <quadibloc@servername.invalid> schrieb:

On Tue, 18 Jun 2024 16:17:33 -0000 (UTC), Thomas Koenig
<tkoenig@netcologne.de> wrote:

John Savard <quadibloc@servername.invalid> schrieb:

Also, this looping instruction is strictly a way to directly encode
the FORTRAN DO loop. It does not attempt any vectorization.

Which one, the FORTRAN 66 one or the one since FORTRAN 77?

FORTRAN IV (or 66) indeed.

It was actually not defined in the standard, in practice it
was usually implemented by a test at the bottom of the loop,
and programs depended on that.

FORTRAN 77 fixed that, so now

DO 100 I=1,0

...
100 CONTINUE

is executed zero times.

How does VVM handle that? It seems you must "waste" some time, not
executing the loop body until the first LOOP instruction tells you
whether to or not, or perhaps not actually updating the values the
first time through the loop. Neither seems optimal. :-(

There is a check at the top of the loop which branches around the
VEC--LOOP bookends--most common loops get this optimized away.
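
In C terms, the shape of the generated code is roughly the following. This is only a sketch under assumed names (guarded_copy is invented for the example); it models the control flow, not the actual VEC and LOOP instructions.

```c
#include <assert.h>
#include <stddef.h>

/* Copies src[first..last] to dst and returns how many times the body ran. */
size_t guarded_copy(int *dst, const int *src, long first, long last)
{
    size_t trips = 0;
    assert(dst != NULL && src != NULL);
    if (first > last)       /* guard: branch around the whole loop */
        return trips;
    long i = first;
    do {                    /* the body sits between the two bookends */
        dst[i] = src[i];
        trips++;
        i++;                /* bottom of loop: add, compare, branch */
    } while (i <= last);
    return trips;
}
```

When the bounds are compile-time constants, or the compiler can otherwise prove first <= last, the guard is dead code and can be dropped.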


From MitchAlsup1@21:1/5 to Stephen Fuld on Wed Jun 19 00:12:40 2024

Stephen Fuld wrote:

Thomas Koenig wrote:

John Savard <quadibloc@servername.invalid> schrieb:

On Tue, 18 Jun 2024 16:17:33 -0000 (UTC), Thomas Koenig
<tkoenig@netcologne.de> wrote:

John Savard <quadibloc@servername.invalid> schrieb:

Also, this looping instruction is strictly a way to directly encode
the FORTRAN DO loop. It does not attempt any vectorization.

Which one, the FORTRAN 66 one or the one since FORTRAN 77?

FORTRAN IV (or 66) indeed.

It was actually not defined in the standard, in practice it
was usually implemented by a test at the bottom of the loop,
and programs depended on that.

FORTRAN 77 fixed that, so now

DO 100 I=1,0

...
100 CONTINUE

is executed zero times.

How does VVM handle that? It seems you must "waste" some time, not
executing the loop body until the first LOOP instruction tells you
whether to or not, or perhaps not actually updating the values the
first time through the loop. Neither seems optimal. :-(

Compiler emits a check at the top of the loop and branches around
VEC-LOOP if the loop is not supposed to be run.


From MitchAlsup1@21:1/5 to Stephen Fuld on Wed Jun 19 00:36:22 2024

Stephen Fuld wrote:

Thomas Koenig wrote:

John Savard <quadibloc@servername.invalid> schrieb:

On Tue, 18 Jun 2024 16:17:33 -0000 (UTC), Thomas Koenig
<tkoenig@netcologne.de> wrote:

John Savard <quadibloc@servername.invalid> schrieb:

Also, this looping instruction is strictly a way to directly encode
the FORTRAN DO loop. It does not attempt any vectorization.

Which one, the FORTRAN 66 one or the one since FORTRAN 77?

FORTRAN IV (or 66) indeed.

It was actually not defined in the standard, in practice it
was usually implemented by a test at the bottom of the loop,
and programs depended on that.

FORTRAN 77 fixed that, so now

DO 100 I=1,0

...
100 CONTINUE

is executed zero times.

How does VVM handle that? It seems you must "waste" some time, not
executing the loop body until the first LOOP instruction tells you
whether to or not, or perhaps not actually updating the values the
first time through the loop. Neither seems optimal. :-(

The compiler emits code at the top of the loop to branch around the
VEC-LOOP bookends.


From MitchAlsup1@21:1/5 to John Savard on Wed Jun 19 00:28:24 2024

John Savard wrote:

On Tue, 18 Jun 2024 16:54:04 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

The semantics of instructions in a loop are subtly altered such
that they can be vectorized and executed multi-lane style.

I've decided that I will not be able to use the one from the original Concertina, and will need to design a VVM-like instruction for
Concertina II from scratch.

Unlike yours, it won't be...subtle.

The action of the instruction which begins the loop will, I think, be
basically the same as yours. It will issue successive iterations of
the loop starting in consecutive cycles.

To do so, though, that instruction will contain a number of fields in
which to specify parameters:

(3 bits) An index register, which is initialized to zero at the start
of the loop, and "incremented" (the quote marks are, of course,
because it won't really be the same register on each iteration) for subsequent iterations.

Ri is provided in the LOOP instruction

(3 bits) The power of two which is to serve as the increment.

There is no such need in VVM, increment is either a constant or a
register and is not restricted to powers of 2.

(8 bits) A register mask, in which a 1 bit corresponds to a register
used for intermediate results within the loop. This will become a
forwarding node rather than a register; all other registers can only
be read, and serve as constant values only. The index register set up previously does not need to be indicated by this.

The contrapositive of this is provided for in VEC.

(2 bits) This indicates which of the four groups of 8 registers in a
bank of 32 registers the register mask applies to.

I found no need.

(1 bit) This indicates whether we're talking about the integer
registers or the floating-point ones.

I have but 1 register file.

In addition, in the long version of the instruction, there's a 16-bit register mask for the short vector registers.

I also have no software-addressable vector registers.

Because iterations are independent, one can't handle a stride in the
natural efficient manner of adding the stride value to a second
pointer register. This could be a common source of error, so I feel
the need to make some provision for this.

    for( i = 0; i < max; i += 7 )

falls out for free. But also note::

    for( i = 0; i < max; i++ )
        a[i] = b[i];

is always faster than:

    for( i = 0; i < max; i++ )
        *ap++ = *bp++;

The top loop is 3 instructions, the bottom one is 5.

One scheme I am considering would be to include one bit in the
instruction that begins a loop to indicate the loop contains a
preamble. The preambles execute serially, and when they conclude,
everything that follows is issued immediately, to execute in parallel
(but now with a multi-cycle offset) to previous iterations.

VVM just has instructions before the VEC instruction to deal with this.

Upon reflection, this doesn't waste a huge amount of time, so it is
better to go with it than including fields for stride value and a
second counter register in the loop start instruction.

Since the preambles do execute serially, the "end preamble"
instruction would point to the loop start instruction. Instead of full memory-reference, though, it would just include a short value that is
a negative program-relative address.

Iterations that execute in parallel, though, don't "branch back"
anywhere, so the loop end instruction has no parameters. At least
something is like your VVM.

By considering the branch back to the top as a return, those loops
which were executed simultaneously just die instead of returning to the
top; only the MOD-N lane returns to the top.

So this is how I take your VVM concept, and mess it up by making it unnecessarily complicated; basically, because I don't want to make an
ISA that requires implementations to be, so to speak, "intelligent".
(i.e. upon the first store into a register in the loop, categorize
that register as a node reference)

LOL but have fun.


From John Savard@21:1/5 to All on Tue Jun 18 21:36:06 2024

On Tue, 18 Jun 2024 23:57:54 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

And so you indicate this explicitly in VVM as well. I tended to assume
only a limited number of registers would be needed to live in, plus I
have both floating and integer register files, hence the differences.

It ends up that the majority of register uses in a loop do not need to
be visible outside of the loop. This is almost the contrapositive of
annotating which registers are temporary in the loop. 90%+ of loops do
not even need the index register to be live outside of the loop.

You have convinced me here to learn from your wisdom: I will do two
things. One is to add a bit that decides whether my 1 bits (confined
to a single group of 8 registers) are live-in or live-out bits. The
other is to specify clearly to implementors that if a register is
specified as "live-in" but is never actually used in a loop, this must
not cause any problems.
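
A rough C sketch of how such a mask plus mode bit might be decoded. Everything here is an assumption made for illustration: the struct, the field names, the 8-bit mask width and the 32-register file are hypothetical and not taken from the Concertina II pages.

```c
#include <stdint.h>

/* Hypothetical loop-header fields; names and widths are assumptions. */
typedef struct {
    unsigned group;     /* which group of 8 registers the mask covers (0..3) */
    uint8_t  mask;      /* one bit per register inside that group */
    unsigned live_out;  /* 0: mask bits mean live-in, 1: they mean live-out */
} LoopLiveness;

/* Expand the per-group mask into a bitmap over an assumed 32 registers. */
uint32_t expand_mask(LoopLiveness h)
{
    return (uint32_t)h.mask << (8u * (h.group & 3u));
}

/* Nonzero if register reg (0..31) is marked by the header. */
int is_marked(LoopLiveness h, unsigned reg)
{
    return (int)((expand_mask(h) >> (reg & 31u)) & 1u);
}
```

In this model a register that is marked but never touched in the loop body is just a set bit that no instruction ever consults, which is one way to read the "must not cause any problems" requirement.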

John Savard

From Anton Ertl@21:1/5 to mitchalsup@aol.com on Wed Jun 19 07:58:08 2024

mitchalsup@aol.com (MitchAlsup1) writes:
[the instruction that ends the loop]

No, it is not a memref--it is a return! using the register from the
VEC instruction. You "return" to the top of the loop. There is no
reason to use IP+Disp, and the fact there is no register nor
displacement in LOOP enables it all to fit. In addition, when VEC
executes, IP is pointing at the top of the loop, requiring no
calculation whatsoever.

On a related note, about a year ago I started research on the
performance effect of (programming language) virtual-machine IP
updates in interpreters. The dependence chains of these IP updates
create a lower bound for the execution time of the program, and it
turns out that, if the interpreter is otherwise efficient enough, this
lower bound determines performance, and that we see speedups by up to
a factor of 3 (depending on benchmark and microarchitecture) by
optimizing these IP updates.

One of the optimizations we tried out was to break the dependence
chain by saving the IP on loop entry, and using that IP when starting
the next iteration; this eliminates the IP updates of one iteration
from the dependence chain of the next iteration.
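
To make the idea concrete, here is a toy bytecode interpreter in C that saves the virtual-machine IP on loop entry and reuses it at the loop end. This is only a sketch of the general technique, not the implementation used in the research mentioned above; the opcode names and the accumulator design are invented for the example.

```c
/* Toy accumulator VM; the opcodes are hypothetical. */
enum { OP_HALT, OP_ADD, OP_LOOP, OP_ENDLOOP };

long run(const int *code)
{
    const int *ip = code;
    const int *loop_top = code;   /* VM IP saved on loop entry */
    long acc = 0;
    int count = 0;

    for (;;) {
        switch (*ip++) {
        case OP_ADD:              /* acc += immediate operand */
            acc += *ip++;
            break;
        case OP_LOOP:             /* operand = iteration count */
            count = *ip++;
            loop_top = ip;        /* remember the first body instruction */
            break;
        case OP_ENDLOOP:
            if (--count > 0)
                ip = loop_top;    /* next iteration starts from the saved IP,
                                     not from a value derived from this
                                     iteration's IP updates */
            break;
        case OP_HALT:
            return acc;
        default:                  /* unknown opcode */
            return -1;
        }
    }
}
```

For instance, the program { OP_LOOP, 3, OP_ADD, 5, OP_ADD, 2, OP_ENDLOOP, OP_HALT } runs the body three times. The sketch supports one loop level only, and the body always executes at least once.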

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

From Terje Mathisen@21:1/5 to All on Wed Jun 19 13:26:30 2024

MitchAlsup1 wrote:

John Savard wrote:

On Tue, 18 Jun 2024 21:23:57 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

(1 bit) This indicates whether we're talking about the integer
registers or the floating-point ones.

Loops controlled by floating point indexes do not vectorize, however
the body of the loop can be any mix of int, logical, memory, or FP
instructions.

Oh no, my index is always an integer. This bit applies to the
"live-in" bits - if the loop performs floating-point computation, it's
floating-point registers that I want to mark as forwarding nodes.

See, I do not have this distinction, there is but one file.

And so you indicate this explicitly in VVM as well. I tended to assume
only a limited number of registers would be needed to live in, plus I
have both floating and integer register files, hence the differences.

It ends up that the majority of register uses in a loop do not need to
be visible outside of the loop. This is almost the contrapositive of
annotating which registers are temporary in the loop. 90%+ of loops do
not even need the index register to be live outside of the loop.

This is partly due to programming languages that apply lifetimes to
variables, so that an index register which is defined in the scaffolding
of the loop (i.e. for (i = 0; i < limit; i++) {}) is invisible as soon
as the loop terminates.

Without such a restriction, there are many times when it would be very
natural to inspect the index in order to determine if this was a normal
(counting) exit, or an early exit due to some internal test.

Personally, I have still not settled on my preferred way to handle cases
like this, but I possibly will do so after I retire.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From MitchAlsup1@21:1/5 to Terje Mathisen on Wed Jun 19 16:04:40 2024

Terje Mathisen wrote:

MitchAlsup1 wrote:

John Savard wrote:

On Tue, 18 Jun 2024 21:23:57 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

(1 bit) This indicates whether we're talking about the integer
registers or the floating-point ones.

Loops controlled by floating point indexes do not vectorize, however
the body of the loop can be any mix of int, logical, memory, or FP
instructions.

Oh no, my index is always an integer. This bit applies to the
"live-in" bits - if the loop performs floating-point computation, it's
floating-point registers that I want to mark as forwarding nodes.

See, I do not have this distinction, there is but one file.

And so you indicate this explicitly in VVM as well. I tended to assume
only a limited number of registers would be needed to live in, plus I
have both floating and integer register files, hence the differences.

It ends up that the majority of register uses in a loop do not need to
be visible outside of the loop. This is almost the contrapositive of
annotating which registers are temporary in the loop. 90%+ of loops do
not even need the index register to be live outside of the loop.

This is partly due to programming languages that apply lifetimes to
variables, so that an index register which is defined in the scaffolding
of the loop (i.e. for (i = 0; i < limit; i++) {}) is invisible as soon
as the loop terminates.

There are loops for which the last index and the last inbound data
reference want to remain visible--search loops for example. But in
general, the amount of data wanted outside of the loop is very small
indeed.

Without such a restriction, there are many times when it would be very
natural to inspect the index in order to determine if this was a normal
(counting) exit, or an early exit due to some internal test.

The most important thing is that the live-outs of the loop are few
while the loop-temps are many.

Personally, I have still not settled on my preferred way to handle
cases like this, but I possibly will do so after I retire.

Terje

From John Savard@21:1/5 to quadibloc@servername.invalid on Wed Jun 19 11:01:26 2024

On Tue, 18 Jun 2024 14:15:48 -0600, John Savard
<quadibloc@servername.invalid> wrote:

Ah. I can't include that fix now, as I've changed things so that one
of the parameters is at the end of the loop, so the instruction that
heads the loop doesn't know if the "step" parameter is negative or
not.

The change has not yet been posted.

I have now updated my loop instruction so that there's no need for a
Step instruction. There was one problem: the new Iterate instruction
takes more opcode space, and that took away some opcode space used for
headers. Fortunately, I had some available opcode space now among
operate instructions instead of memory-reference instructions that I
could use instead, so I moved them over.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to Terje Mathisen on Wed Jun 19 17:18:07 2024

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

MitchAlsup1 wrote:

John Savard wrote:

On Tue, 18 Jun 2024 21:23:57 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

(1 bit) This indicates whether we're talking about the integer
registers or the floating-point ones.

Loops controlled by floating point indexes do not vectorize, however
the body of the loop can be any mix of int, logical, memory, or FP
instructions.

Oh no, my index is always an integer. This bit applies to the
"live-in" bits - if the loop performs floating-point computation, it's
floating-point registers that I want to mark as forwarding nodes.

See, I do not have this distinction, there is but one file.

And so you indicate this explicitly in VVM as well. I tended to assume
only a limited number of registers would be needed to live in, plus I
have both floating and integer register files, hence the differences.

It ends up that the majority of register uses in a loop do not need to
be visible outside of the loop. This is almost the contrapositive of
annotating which registers are temporary in the loop. 90%+ of loops do
not even need the index register to be live outside of the loop.

This is partly due to programming languages that apply lifetimes to
variables, so that an index register which is defined in the scaffolding
of the loop (i.e. for (i = 0; i < limit; i++) {}) is invisible as soon
as the loop terminates.

This makes things more clear to anybody reading the code (and
unambiguous to the compiler). However, lifetime analysis has
also become very good, and if the value is not used afterwards,
I expect no difference in practice.

Without such a restriction, there are many times when it would be very
natural to inspect the index in order to determine if this was a normal
(counting) exit, or an early exit due to some internal test.

Hmm... do you mean for the programmer, or for the compiler?

From MitchAlsup1@21:1/5 to Thomas Koenig on Wed Jun 19 18:31:12 2024

Thomas Koenig wrote:

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

MitchAlsup1 wrote:

John Savard wrote:

On Tue, 18 Jun 2024 21:23:57 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

(1 bit) This indicates whether we're talking about the integer
registers or the floating-point ones.

Loops controlled by floating point indexes do not vectorize, however
the body of the loop can be any mix of int, logical, memory, or FP
instructions.

Oh no, my index is always an integer. This bit applies to the
"live-in" bits - if the loop performs floating-point computation, it's
floating-point registers that I want to mark as forwarding nodes.

See, I do not have this distinction, there is but one file.

And so you indicate this explicitly in VVM as well. I tended to assume
only a limited number of registers would be needed to live in, plus I
have both floating and integer register files, hence the differences.

It ends up that the majority of register uses in a loop do not need to
be visible outside of the loop. This is almost the contrapositive of
annotating which registers are temporary in the loop. 90%+ of loops do
not even need the index register to be live outside of the loop.

This is partly due to programming languages that apply lifetimes to
variables, so that an index register which is defined in the scaffolding
of the loop (i.e. for (i = 0; i < limit; i++) {}) is invisible as soon
as the loop terminates.

This makes things more clear to anybody reading the code (and
unambiguous to the compiler). However, lifetime analysis has
also become very good, and if the value is not used afterwards,
I expect no difference in practice.

When one writes::

for( uint64_t i = 0; i < max; i++ )

the lifetime of i is explicit--it terminates with the loop.

Without such a restriction, there are many times when it would be very
natural to inspect the index in order to determine if this was a normal
(counting) exit, or an early exit due to some internal test.

Hmm... do you mean for the programmer, or for the compiler?

From Terje Mathisen@21:1/5 to All on Wed Jun 19 22:49:40 2024

MitchAlsup1 wrote:

Thomas Koenig wrote:

Terje Mathisen <terje.mathisen@tmsw.no> schrieb:

This is partly due to programming languages that apply lifetimes to
variables, so that an index register which is defined in the scaffolding
of the loop (i.e. for (i = 0; i < limit; i++) {}) is invisible as soon
as the loop terminates.

This makes things more clear to anybody reading the code (and
unambiguous to the compiler). However, lifetime analysis has
also become very good, and if the value is not used afterwards,
I expect no difference in practice.

When one writes::

for( uint64_t i = 0; i < max; i++ )

the lifetime of i is explicit--it terminates with the loop.

Without such a restriction, there are many times when it would be very
natural to inspect the index in order to determine if this was a normal
(counting) exit, or an early exit due to some internal test.

Hmm... do you mean for the programmer, or for the compiler?

This is probably my asm background shining through:

All asm loops have the counting register available directly after loop
exit, until it is reused. When I want to do the same in C I just have to
define the variable before the loop starts, instead of inside the ().
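
A small C sketch of that pattern, using a search loop as the example. The function name find_first is made up for illustration.

```c
#include <stddef.h>

/* Returns the index of the first element equal to key, or n if none. */
size_t find_first(const int *a, size_t n, int key)
{
    size_t i;                 /* declared before the loop so it outlives it */

    for (i = 0; i < n; i++) {
        if (a[i] == key)
            break;            /* early exit: i < n at this point */
    }
    return i;                 /* i == n means the loop counted to the end */
}
```

The caller tells the two exits apart by comparing the result against n, the same way asm code would inspect the counting register after the loop.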

Terje
--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From Terje Mathisen@21:1/5 to All on Wed Jun 19 22:52:41 2024

MitchAlsup1 wrote:

Terje Mathisen wrote:

MitchAlsup1 wrote:

John Savard wrote:

On Tue, 18 Jun 2024 21:23:57 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

(1 bit) This indicates whether we're talking about the integer
registers or the floating-point ones.

Loops controlled by floating point indexes do not vectorize, however
the body of the loop can be any mix of int, logical, memory, or FP
instructions.

Oh no, my index is always an integer. This bit applies to the
"live-in" bits - if the loop performs floating-point computation, it's
floating-point registers that I want to mark as forwarding nodes.

See, I do not have this distinction, there is but one file.

And so you indicate this explicitly in VVM as well. I tended to assume
only a limited number of registers would be needed to live in, plus I
have both floating and integer register files, hence the differences.

It ends up that the majority of register uses in a loop do not need to
be visible outside of the loop. This is almost the contrapositive of
annotating which registers are temporary in the loop. 90%+ of loops
do not even need the index register to be live outside of the loop.

This is partly due to programming languages that apply lifetimes to
variables, so that an index register which is defined in the scaffolding
of the loop (i.e. for (i = 0; i < limit; i++) {}) is invisible as soon
as the loop terminates.

There are loops for which the last index and the last inbound data
reference want to remain visible--search loops for example. But in
general, the amount of data wanted outside of the loop is very small
indeed.

Right.

Without such a restriction, there are many times when it would be very
natural to inspect the index in order to determine if this was a normal
(counting) exit, or an early exit due to some internal test.

The most important thing is that the live-outs of the loop are few
while the loop-temps are many.

Also almost always true.

Terje

Personally, I have still not settled on my preferred way to handle
cases like this, but I possibly will do so after I retire.

Terje

--
- <Terje.Mathisen at tmsw.no>
"almost all programming can be viewed as an exercise in caching"

From John Savard@21:1/5 to All on Wed Jun 19 22:04:04 2024

On Tue, 18 Jun 2024 21:23:57 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

On Tue, 18 Jun 2024 16:54:04 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

The semantics of instructions in a loop are subtly altered such
that they can be vectorized and to execute multi-lane style.

I've decided that I will not be able to use the one from the original
Concertina, and will need to design a VVM-like instruction for
Concertina II from scratch.

Unlike yours, it won't be...subtle.

LOL

I wrote that before I learned you had explicit opt-out bits in your
VVM instruction.

Also, I've checked. There's nothing resembling VVM in my original
Concertina design, as I had mistakenly thought.

John Savard

From John Savard@21:1/5 to quadibloc@servername.invalid on Wed Jun 19 21:39:11 2024

On Tue, 18 Jun 2024 21:36:06 -0600, John Savard
<quadibloc@servername.invalid> wrote:

On Tue, 18 Jun 2024 23:57:54 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

And so you indicate this explicitly in VVM as well. I tended to assume
only a limited number of registers would be needed to live in, plus I
have both floating and integer register files, hence the differences.

It ends up that the majority of register uses in a loop do not need to
be visible outside of the loop. This is almost the contrapositive of
annotating which registers are temporary in the loop. 90%+ of loops do
not even need the index register to be live outside of the loop.

You have convinced me here to learn from your wisdom: I will do two
things. One is to add a bit that decides whether my 1 bits (confined
to a single group of 8 registers) are live-in or live-out bits. The
other is to specify clearly to implementors that if a register is
specified as "live-in" but is never actually used in a loop, this must
not cause any problems.

I have not yet added my attempt at an imitation of VVM to Concertina
II. However, I have now laid some important groundwork for it.

In my architecture, there are already Cray-style long vectors. They
are intended to be the principal and most efficient way of working
with vector quantities in the architecture. So if my VVM-alike was
disjoint from them, and could only interact with them through memory,
this would be an awkwardness in the ISA that needlessly constrains
performance.

So I've added operate instructions that allow operations where one
operand is in a normal register, and the other operand is in a
selected element of a vector register. The element is itself specified
by the contents of an integer register, for convenient use within
loops.

Thus, a VVM-alike loop, instead of going from some vectors in memory
to other vectors in memory, could go from some vector registers to
other vector registers. The vectors aren't virtual any more.
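
As a rough software model of what such instructions would do, here is a C sketch. The register counts, the vector length of 64, the names fadd_elem and store_elem, and the existence of a matching element-store operation are all assumptions made for illustration; none of them are taken from the actual Concertina II encoding.

```c
#include <stddef.h>

#define VLEN  64   /* assumed elements per vector register */
#define NVREG 8    /* assumed number of vector registers */
#define NREG  32   /* assumed number of scalar registers per file */

typedef struct {
    long   r[NREG];          /* integer registers */
    double f[NREG];          /* floating-point registers */
    double v[NVREG][VLEN];   /* vector registers */
} Machine;

/* f[fd] = f[fs] + v[vs][r[rx]]: the element is picked by an integer register. */
void fadd_elem(Machine *m, int fd, int fs, int vs, int rx)
{
    m->f[fd] = m->f[fs] + m->v[vs][(size_t)m->r[rx] % VLEN];
}

/* v[vd][r[rx]] = f[fs]: assumed counterpart that writes one element. */
void store_elem(Machine *m, int vd, int fs, int rx)
{
    m->v[vd][(size_t)m->r[rx] % VLEN] = m->f[fs];
}

/* Scalar-looking loop that goes from one vector register to another.
   Uses r1 as the index and f2 as scratch, so fs must not be 2. */
void add_scalar_to_vector(Machine *m, int vd, int vs, int fs, long n)
{
    for (m->r[1] = 0; m->r[1] < n; m->r[1]++) {
        fadd_elem(m, 2, fs, vs, 1);   /* f2 = f[fs] + v[vs][r1] */
        store_elem(m, vd, 2, 1);      /* v[vd][r1] = f2 */
    }
}
```

The loop body never touches memory in this model; both its source and its destination are vector registers, indexed through r1.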

John Savard

From John Savard@21:1/5 to quadibloc@servername.invalid on Sat Jun 22 23:04:28 2024

On Wed, 19 Jun 2024 21:39:11 -0600, John Savard
<quadibloc@servername.invalid> wrote:

So I've added operate instructions that allow operations where one
operand is in a normal register, and the other operand is in a
selected element of a vector register. The element is itself specified
by the contents of an integer register, for convenient use within
loops.

Thus, a VVM-alike loop, instead of going from some vectors in memory
to other vectors in memory, could go from some vector registers to
other vector registers. The vectors aren't virtual any more.

Because it seemed to me that any VVM-alike instruction I had would
have to have at least an alternate form longer than 32 bits, despite
my efforts to squeeze it in to much less space than you use... I felt
that I needed to go back to an earlier iteration of Concertina for a
method of making it easier to use long instructions in programs.

Doing that, though, required me to reserve some opcode space, and one
of the consequences is that the instructions referred to above had to
be moved to an alternate instruction set!

I haven't yet added the additional long instructions to the pages. If
I'm reserving that much opcode space (1/32nd of the total opcode
space) I'm thinking I should do something amazing with it, not
something ho-hum.

Meanwhile, though, I have added something "amazing" to the ISA for a
very tiny cost in opcode space. I've added an eleventh header type
which applies *four* prefix bits to every 16 bits in what's left of
the block after the header.

What does this do?

Well, it used to be I had 16-bit instructions occupying 1/4 of the
opcode space which included register-to-register instructions that
could involve only two registers from the same group of eight
registers.

Partly because I was told this was a very bad thing, and because I
needed to take that 1/4 of the opcode space back so I could have
load-store instructions that were not heavily restricted to squeeze
them into less space, I used prefix bits to change the 15-bit
instructions to 17-bit instructions that could use any two registers.

Well, the new header type adds the option to also, by using some
prefix bits, assign a 19-bit instruction to a 16-bit slot... and these
19-bit instructions add memory-reference instructions to the half-word
instructions.

So now, in addition to containing up to 8 ordinary 32-bit
instructions, a 256-bit block can contain up to 24 instructions
belonging to a mix of 17-bit and 19-bit instructions, short
instructions that now are a complete set, including load and store
memory reference instructions.

John Savard

From MitchAlsup1@21:1/5 to John Savard on Sun Jun 23 16:19:27 2024

John Savard wrote:

On Tue, 18 Jun 2024 21:36:06 -0600, John Savard
<quadibloc@servername.invalid> wrote:

On Tue, 18 Jun 2024 23:57:54 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

And so you indicate this explicitly in VVM as well. I tended to assume
only a limited number of registers would be needed to live in, plus I
have both floating and integer register files, hence the differences.

It ends up that the majority of register uses in a loop do not need to
be visible outside of the loop. This is almost the contrapositive of
annotating which registers are temporary in the loop. 90%+ of loops do
not even need the index register to be live outside of the loop.

You have convinced me here to learn from your wisdom: I will do two
things. One is to add a bit that decides whether my 1 bits (confined
to a single group of 8 registers) are live-in or live-out bits. The
other is to specify clearly to implementors that if a register is
specified as "live-in" but is never actually used in a loop, this must
not cause any problems.

I have not yet added my attempt at an imitation of VVM to Concertina
II. However, I have now laid some important groundwork for it.

In my architecture, there are already Cray-style long vectors. They
are intended to be the principal and most efficient way of working
with vector quantities in the architecture. So if my VVM-alike was
disjoint from them, and could only interact with them through memory,
this would be an awkwardness in the ISA that needlessly constrains
performance.

While the vectorizing HW certainly has CRAY-like vector flip-flops
they are not addressable by SW. The code within the VEC--LOOP
brackets reads as if scalar:: So, My 66000 consumes exactly 2
OpCodes to provide an entire vector instruction set--one that
works as well as possible across various implementations.

So I've added operate instructions that allow operations where one
operand is in a normal register, and the other operand is in a
selected element of a vector register. The element is itself specified
by the contents of an integer register, for convenient use within
loops.

Thus, a VVM-alike loop, instead of going from some vectors in memory
to other vectors in memory, could go from some vector registers to
other vector registers. The vectors aren't virtual any more.

A VVM Loop is just a bunch of normal instructions between 2 brackets
that can be executed as fast as dependencies allow and as many times
as the loop count and condition allow.

John Savard

From John Savard@21:1/5 to All on Sun Jun 23 15:26:15 2024

On Sun, 23 Jun 2024 16:19:27 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

While the vectorizing HW certainly has CRAY-like vector flip-flops
they are not addressable by SW. The code within the VEC--LOOP
brackets reads as if scalar:: So, My 66000 consumes exactly 2
OpCodes to provide an entire vector instruction set--one that
works as well as possible across various implementations.

Oh, yes, your VVM is wonderful.

My attempt at an imitation of VVM, at least, if not the real thing
that you have in your 66000, would be inferior in one important way to
Cray-style vector registers.

A virtual vector loop would take input vector values from memory, and
return results to memory. Yes, there are multiple operations within
the loop, but I am still assuming that the length and complexity of
such loops is constrained.

So if you have Cray-style vector registers, you have a place to store
intermediate results _between_ these loops that avoids referring to
memory.

In addition, one potentially catastrophic limitation is that, because
the meaning of register specifications in instructions is changed,
_there can't be any subroutine calls in such loops_. (Now that it's
typical for computers to have instructions that do log and trig
functions, this is slightly _less_ catastrophic, though.) Branches
within the loops and instruction predication, though, would still be
permitted.

John Savard

From MitchAlsup1@21:1/5 to John Savard on Sun Jun 23 23:46:23 2024

John Savard wrote:

On Sun, 23 Jun 2024 16:19:27 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

While the vectorizing HW certainly has CRAY-like vector flip-flops
they are not addressable by SW. The code within the VEC--LOOP
brackets reads as if scalar:: So, My 66000 consumes exactly 2
OpCodes to provide an entire vector instruction set--one that
works as well as possible across various implementations.

Oh, yes, your VVM is wonderful.

Well, at least it keeps the R in RISC meaning Reduced instead of
Ridiculous.

It also alters cache semantics to avoid a single vector from erasing
the whole data cache. If the data is not going to be used again,
then it is not put in the cache (both inbound and outbound.)

My attempt at an imitation of VVM, at least, if not the real thing
that you have in your 66000, would be inferior in one important way to
Cray-style vector registers.

A virtual vector loop would take input vector values from memory, and
return results to memory. Yes, there are multiple operations within
the loop, but I am still assuming that the length and complexity of
such loops is constrained.

So if you have Cray-style vector registers, you have a place to store
intermediate results _between_ these loops that avoids referring to
memory.

Vector reduction is about the only realistic limitation, even here,
CRAY-like vectors "have their own problems". Performing a Summation
over an array consumes memory in order but performs FADDs in modulo
order whereas VVM performs the FADDs in program order. IEEE went so
far as to specify augmented addition which greatly ameliorates the
addition order problems.
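
The ordering point can be seen in a few lines of C. This is only a sketch: the function names and the fixed lane limit are made up, and the "modulo order" function is a software stand-in for lane-wise accumulation, not a model of any particular vector machine. The accumulators are volatile so that every partial sum is rounded to float.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_LANES 8   /* arbitrary limit for this sketch */

/* Program order: one accumulator, elements added as they appear. */
float sum_in_order(const float *x, size_t n)
{
    volatile float s = 0.0f;   /* volatile: round to float after each add */

    for (size_t i = 0; i < n; i++)
        s = s + x[i];
    return s;
}

/* Modulo order: element i goes to accumulator i % lanes,
   and the lane totals are combined at the end. */
float sum_modulo(const float *x, size_t n, size_t lanes)
{
    volatile float acc[MAX_LANES] = {0};
    volatile float s = 0.0f;

    assert(lanes > 0 && lanes <= MAX_LANES);
    for (size_t i = 0; i < n; i++)
        acc[i % lanes] = acc[i % lanes] + x[i];
    for (size_t l = 0; l < lanes; l++)
        s = s + acc[l];
    return s;
}
```

With x = { 1e8f, 1.0f, -1e8f, 1.0f } and IEEE 754 single precision, sum_in_order gives 1 (the first 1.0f is absorbed by 1e8f), while sum_modulo with 2 lanes gives 2, because the large values cancel inside one lane before the small ones are added.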

These are loops from memory but to registers. About the only loops
that are from registers and to memory are memset()-like--which
easily vectorizes.

What VVM does not provide is the non-looping individual instructions.

In addition, one potentially catastrophic limitation is that, because
the meaning of register specifications in instructions is changed,
_there can't be any subroutine calls in such loops_. (Now that it's
typical for computers to have instructions that do log and trig
functions, this is slightly _less_ catastrophic, though.) Branches
within the loops and instruction predication, though, would still be
permitted.

First, most trig functions have become instructions, not subroutine
calls, so that issue is ameliorated.

But, yes, VVM <as of now> only vectorizes the inner most loop.

John Savard

From John Savard@21:1/5 to All on Sun Jun 23 19:50:44 2024

On Sun, 23 Jun 2024 23:46:23 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

But, yes, VVM <as of now> only vectorizes the inner most loop.

I don't regard _that_ as an issue or limitation, at least in itself.
But keeping code I don't expect to vectorize from using memory is
still a gain.

John Savard

From John Savard@21:1/5 to quadibloc@servername.invalid on Mon Jun 24 19:52:28 2024

On Sat, 22 Jun 2024 23:04:28 -0600, John Savard
<quadibloc@servername.invalid> wrote:

Because it seemed to me that any VVM-alike instruction I had would
have to have at least an alternate form longer than 32 bits, despite
my efforts to squeeze it in to much less space than you use... I felt
that I needed to go back to an earlier iteration of Concertina for a
method of making it easier to use long instructions in programs.

Doing that, though, required me to reserve some opcode space, and one
of the consequence is that the instructions referred to above had to
be moved to an alternate instruction set!

I decided that this was unacceptable, and that I did not need to
reserve so much space for an alternate way of encoding long
instructions.

Instead of changing how most long instructions are encoded, I've kept
this new way of encoding long instructions, with less opcode space
reserved for it, for a special use: long instructions that might need
to vary in length in a complicated fashion. Instructions that are
entirely in the instruction stream as variable-length instructions
can't be like that, but if the excess over 32 bits is accessed by a
supplementary pointer in the same reserved area as used for
pseudo-immediate values, then it doesn't matter if the length of the
instruction varies because various fields are included or omitted in a
complicated fashion.

So now long instructions of this type only need a small amount of
opcode space, as only a few special ones are encoded this way.
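
A rough C sketch of the decoding idea, under loudly stated assumptions: the 256-bit block is treated as sixteen 16-bit units, the 4-bit pointer is placed in the low bits of the 32-bit instruction word, and it indexes those 16-bit units. The real field position and pointer granularity are not given in this thread, and the function names are hypothetical.

```c
#include <stdint.h>

#define BLOCK_HALFWORDS 16   /* 256-bit block viewed as sixteen 16-bit units */

/* Assumed layout: bits 3..0 of the instruction word hold the pointer. */
static unsigned psupp_of(uint32_t insn)
{
    return insn & 0xFu;
}

/* Copy extra_halfwords 16-bit units starting at the pointed-to position.
   Returns 0 on success, -1 if the supplement would run past the block. */
int fetch_supplement(const uint16_t block[BLOCK_HALFWORDS], uint32_t insn,
                     unsigned extra_halfwords, uint16_t *out)
{
    unsigned start = psupp_of(insn);

    if (start + extra_halfwords > BLOCK_HALFWORDS)
        return -1;
    for (unsigned k = 0; k < extra_halfwords; k++)
        out[k] = block[start + k];
    return 0;
}
```

The point of the sketch is that the slot in the instruction stream stays 32 bits wide however long the supplement is, so the position of the next instruction does not depend on which optional fields are present.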

John Savard
