• Concertina II: Full Circle

    From John Savard@21:1/5 to All on Mon Jun 17 11:33:45 2024
    I've noted earlier that I felt I had been going around in circles with
    Concertina II, changing the instruction format back and forth, instead
    of making progress to flesh it out.
    Recently, I added a new instruction to facilitate looping.
    But the trouble was that it took up too much opcode space.
    One thing that occurred to me was that if I went back to an old method
    of specifying instructions longer than 32 bits, using a 4-bit pSupp
    field to point into the same reserved area in the block as used for
    pseudo-immediates, that would suit this instruction very well.

    The reason is that if that technique were used, then I could use the
    header that's also an instruction to just squeeze in the three-bit
    decode field, and so access to the Loop instruction would be easy as
    befits its importance.

    Then I went back, and looked up an older version of Concertina II
    which had it. It had complicated block headers. But worse than that,
    it had _four_ different versions of the complete instruction set!
    Which version was used depended on the header. The idea, of course,
    was that some headers required a pared-down version of the instruction set
    so as to squeeze in more stuff.
    It was also interesting to see how much further along I had gotten in
    fleshing out that older version of the instruction set.

    John Savard

    --- SoupGate-Win32 v1.05
    * Origin: fsxNet Usenet Gateway (21:1/5)
  • From MitchAlsup1@21:1/5 to John Savard on Mon Jun 17 20:20:03 2024
    John Savard wrote:

    I've noted earlier that I felt I had been going around in circles with Concertina II, changing the instruction format back and forth, instead
    of making progress to flesh it out.

    At least no one can say you give up easily.

    Recently, I added a new instruction to facilitate looping.
    But the trouble was that it took up too much opcode space.

    Yes, indeed...

    One thing that occurred to me was that if I went back to an old method
    of specifying instructions longer than 32 bits, using a 4-bit pSupp
    field to point into the same reserved area in the block as used for
    pseudo-immediates, that would suit this instruction very well.

    The reason is that if that technique were used, then I could use the
    header that's also an instruction to just squeeze in the three-bit
    decode field, and so access to the Loop instruction would be easy as
    befits its importance.

    Then I went back, and looked up an older version of Concertina II
    which had it. It had complicated block headers. But worse than that,
    it had _four_ different versions of the complete instruction set!
    Which version was used depended on the header. The idea, of course,
    was that some headers required a pared-down version of the instruction set
    so as to squeeze in more stuff.
    It was also interesting to see how much further along I had gotten in fleshing out that older version of the instruction set.

    As to looping, I faced the same dilemma and came to a different
    conclusion::
    You don't do it in 1 instruction; instead, you do it in a way where
    your 2-instruction encoding executes one of the instructions only
    once. I call this bookending the loop.

    So, I have an instruction called VEC, which donates a register and
    provides other guidance to the loop. And I have an instruction called
    LOOP which performs the bottom-of-loop calculations. The register
    donated by VEC is given the address of the top of the loop so that the
    loop-terminating instruction is relieved of needing to supply it in
    the form of a displacement.

    VEC is executed once at the top of the loop, and provides guidance as
    to which registers from within the loop are live-out of the loop.
    This allows HW to avoid writing everything into RF and facilitates
    running the loop across multiple lanes of function units.

    LOOP, then, performs the ADD, a CMP, and a BC to the top of the loop.

    I ended up with 3 kinds of LOOPs:
    a) counted -- for( i = 0; i < max; i++ )
    b) searching -- for( i = 0; a[i] > 13; i++ )
    c) both -- for( i = 0; i < max && a[i]; i++ )

    This covers the majority of loops where the looping condition is
    encodable.
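
For readers who think in C, the three loop shapes listed above can be written out as ordinary functions. This is only an illustrative sketch: the function names and the `threshold` parameter are made up here, and nothing in it reflects how VEC or LOOP are actually encoded.

```c
#include <assert.h>
#include <stddef.h>

/* a) counted: the trip count is known on entry. */
long sum_counted(const int *a, size_t max)
{
    assert(a != NULL);
    long sum = 0;
    for (size_t i = 0; i < max; i++)
        sum += a[i];
    return sum;
}

/* b) searching: runs until the data says stop.
   The caller must guarantee that some element is <= threshold. */
size_t find_first_not_above(const int *a, int threshold)
{
    assert(a != NULL);
    size_t i;
    for (i = 0; a[i] > threshold; i++)
        ;
    return i;
}

/* c) both: a count limit plus a data-dependent exit. */
size_t bounded_length(const int *a, size_t max)
{
    assert(a != NULL);
    size_t i;
    for (i = 0; i < max && a[i]; i++)
        ;
    return i;
}
```

In each case the exit test is a simple compare on the index, on a loaded value, or on both, which is presumably what makes the condition "encodable" in a single bottom-of-loop instruction.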

  • From John Savard@21:1/5 to All on Mon Jun 17 15:57:53 2024
    On Mon, 17 Jun 2024 20:20:03 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    As to looping, I faced the same dilemma and came to a different
    conclusion::
    You don't do it in 1 instruction; instead, you do it in a way where
    your 2-instruction encoding executes one of the instructions only
    once. I call this bookending the loop.

    I considered something like that.

    My problem was that encoding the parameters of the loop in one
    instruction takes too much space. So the first thing I thought of was
    to put some of them in the instruction that repeats the loop.

    The problem was, though, that since the instruction that repeats the
    loop points to the start of the loop in memory, it's a
    memory-reference instruction, so there isn't much extra room left in
    it.

    However, there is a little room left, so I may indeed go back and
    explore that possibility some more.

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Mon Jun 17 23:17:27 2024
    John Savard wrote:

    On Mon, 17 Jun 2024 20:20:03 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    As to looping, I faced the same dilemma and came to a different
    conclusion::
    You don't do it in 1 instruction; instead, you do it in a way where
    your 2-instruction encoding executes one of the instructions only
    once. I call this bookending the loop.

    I considered something like that.

    My problem was that encoding the parameters of the loop in one
    instruction takes too much space. So the first thing I thought of was
    to put some of them in the instruction that repeats the loop.

    The problem was, though, that since the instruction that repeats the
    loop points to the start of the loop in memory, it's a
    memory-reference instruction, so there isn't much extra room left in
    it.

    No, it is not a memref--it is a return ! using the register from the
    VEC instruction. You "return" to the top of the loop. There is no
    reason to use IP+Disp, and the fact there is no register nor
    displacement in LOOP enables it all to fit. In addition, when VEC
    executes, IP is pointing at the top of the loop, requiring no
    calculation whatsoever.
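
The "return to the top" idea can be illustrated with a toy interpreter. This is a sketch under loud assumptions: the opcode names, the `struct insn` layout, and the hidden `top_reg` state are all hypothetical and are not the My 66000 encoding. What it shows is that VEC runs once and records the address of the next instruction, and that LOOP needs no displacement because it jumps through that recorded value.

```c
#include <assert.h>
#include <stddef.h>

enum op { OP_VEC, OP_ACC, OP_LOOP, OP_HALT };

struct insn {
    enum op op;
    int a;   /* VEC: donated register; ACC: destination; LOOP: index register */
    int b;   /* ACC: source register; LOOP: increment */
    int c;   /* LOOP: register holding the limit */
};

/* Runs prog until OP_HALT and returns the number of instructions executed. */
int run(const struct insn *prog, long regs[8])
{
    size_t pc = 0;
    int top_reg = -1;   /* which register VEC donated (hypothetical hidden state) */
    int steps = 0;

    for (;;) {
        const struct insn *i = &prog[pc];
        steps++;
        switch (i->op) {
        case OP_VEC:
            /* The instruction after VEC is the top of the loop. */
            top_reg = i->a;
            regs[top_reg] = (long)(pc + 1);
            pc++;
            break;
        case OP_ACC:
            regs[i->a] += regs[i->b];
            pc++;
            break;
        case OP_LOOP:
            assert(top_reg >= 0);               /* LOOP must follow a VEC */
            regs[i->a] += i->b;                 /* ADD */
            if (regs[i->a] < regs[i->c])        /* CMP */
                pc = (size_t)regs[top_reg];     /* BC, done as a "return" */
            else
                pc++;
            break;
        case OP_HALT:
            return steps;
        }
    }
}
```

With a four-instruction program (VEC, one body instruction, LOOP, HALT) and a limit of 5, the body and LOOP execute five times each while VEC executes once, which is the bookending described above.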

    However, there is a little room left, so I may indeed go back and
    explore that possibility some more.

  • From John Savard@21:1/5 to quadibloc@servername.invalid on Tue Jun 18 10:11:40 2024
    On Tue, 18 Jun 2024 10:01:20 -0600, John Savard
    <quadibloc@servername.invalid> wrote:

    On Mon, 17 Jun 2024 23:17:27 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    No, it is not a memref--it is a return ! using the register from the
    VEC instruction.

    As should not surprise you, I was referring to the end-of-loop
    instruction in my current Concertina II, not the one in your MY 66000.

    I try to avoid stacks, and reserving extra registers, as much as I
    can.

    Also, this looping instruction is strictly a way to directly encode
    the FORTRAN DO loop. It does not attempt any vectorization.

    At one point, in the original Concertina, I did have a sort of
    loop/vectorize instruction with a functionality that may be somewhat
    similar to your VVM. I am definitely going to look at adding that to
    Concertina II, as this will perhaps clarify the discussion.

    John Savard

  • From John Savard@21:1/5 to All on Tue Jun 18 10:01:20 2024
    On Mon, 17 Jun 2024 23:17:27 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    No, it is not a memref--it is a return ! using the register from the
    VEC instruction.

    As should not surprise you, I was referring to the end-of-loop
    instruction in my current Concertina II, not the one in your MY 66000.

    I try to avoid stacks, and reserving extra registers, as much as I
    can.

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Tue Jun 18 16:54:04 2024
    John Savard wrote:

    On Tue, 18 Jun 2024 10:01:20 -0600, John Savard <quadibloc@servername.invalid> wrote:

    On Mon, 17 Jun 2024 23:17:27 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    No, it is not a memref--it is a return ! using the register from the
    VEC instruction.

    As should not surprise you, I was referring to the end-of-loop
    instruction in my current Concertina II, not the one in your MY 66000.

    I try to avoid stacks, and reserving extra registers, as much as I
    can.

    Also, this looping instruction is strictly a way to directly encode
    the FORTRAN DO loop. It does not attempt any vectorization.

    The semantics of instructions in a loop are subtly altered so
    that they can be vectorized and executed multi-lane style.

    At one point, in the original Concertina, I did have a sort of
    loop/vectorize instruction with a functionality that may be somewhat
    similar to your VVM. I am definitely going to look at adding that to Concertina II, as this will perhaps clarify the discussion.

  • From Thomas Koenig@21:1/5 to John Savard on Tue Jun 18 16:17:33 2024
    John Savard <quadibloc@servername.invalid> schrieb:
    On Tue, 18 Jun 2024 10:01:20 -0600, John Savard
    <quadibloc@servername.invalid> wrote:

    On Mon, 17 Jun 2024 23:17:27 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    No, it is not a memref--it is a return ! using the register from the
    VEC instruction.

    As should not surprise you, I was referring to the end-of-loop
    instruction in my current Concertina II, not the one in your MY 66000.

    I try to avoid stacks, and reserving extra registers, as much as I
    can.

    Also, this looping instruction is strictly a way to directly encode
    the FORTRAN DO loop. It does not attempt any vectorization.

    Which one, the FORTRAN 66 one or the one since FORTRAN 77?

  • From MitchAlsup1@21:1/5 to John Savard on Tue Jun 18 16:52:22 2024
    John Savard wrote:

    On Mon, 17 Jun 2024 23:17:27 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    No, it is not a memref--it is a return ! using the register from the
    VEC instruction.

    As should not surprise you, I was referring to the end-of-loop
    instruction in my current Concertina II, not the one in your MY 66000.

    It may surprise you to know that I knew and know that you are talking
    about Concer-tina-tanic.

    I was merely trying to show you another way to get back to the top
    of a loop--one that takes way fewer bits to encode.

    I try to avoid stacks, and reserving extra registers, as much as I
    can.

    My LOOP has no stack.

  • From John Savard@21:1/5 to tkoenig@netcologne.de on Tue Jun 18 13:17:41 2024
    On Tue, 18 Jun 2024 16:17:33 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:
    John Savard <quadibloc@servername.invalid> schrieb:

    Also, this looping instruction is strictly a way to directly encode
    the FORTRAN DO loop. It does not attempt any vectorization.

    Which one, the FORTRAN 66 one or the one since FORTRAN 77?

    FORTRAN IV (or 66) indeed.

    John Savard

  • From John Savard@21:1/5 to All on Tue Jun 18 13:38:23 2024
    On Tue, 18 Jun 2024 16:54:04 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    The semantics of instructions in a loop are subtly altered so
    that they can be vectorized and executed multi-lane style.

    I've decided that I will not be able to use the one from the original Concertina, and will need to design a VVM-like instruction for
    Concertina II from scratch.

    Unlike yours, it won't be...subtle.

    The action of the instruction which begins the loop will, I think, be
    basically the same as yours. It will issue successive iterations of
    the loop starting in consecutive cycles.

    To do so, though, that instruction will contain a number of fields in
    which to specify parameters:

    (3 bits) An index register, which is initialized to zero at the start
    of the loop, and "incremented" (the quote marks are, of course,
    because it won't really be the same register on each iteration) for
    subsequent iterations.
    (3 bits) The power of two which is to serve as the increment.
    (8 bits) A register mask, in which a 1 bit corresponds to a register
    used for intermediate results within the loop. This will become a
    forwarding node rather than a register; all other registers can only
    be read, and serve as constant values only. The index register set up previously does not need to be indicated by this.
    (2 bits) This indicates which of the four groups of 8 registers in a
    bank of 32 registers the register mask applies to.
    (1 bit) This indicates whether we're talking about the integer
    registers or the floating-point ones.

    In addition, in the long version of the instruction, there's a 16-bit
    register mask for the short vector registers.
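
The short-form fields listed above add up to 17 bits (3 + 3 + 8 + 2 + 1). As a rough sketch of what packing them could look like, the C below places them in the low bits of a 32-bit word. The bit positions, the struct, and the function names are assumptions made for illustration; the post does not say where the fields sit in the instruction.

```c
#include <assert.h>
#include <stdint.h>

struct loop_start {
    unsigned index_reg;  /* 3 bits: index register, starts at zero */
    unsigned log2_step;  /* 3 bits: increment is 1 << log2_step */
    unsigned reg_mask;   /* 8 bits: 1 = register becomes a forwarding node */
    unsigned group;      /* 2 bits: which group of 8 within the bank of 32 */
    unsigned is_float;   /* 1 bit: 0 = integer bank, 1 = floating-point bank */
};

/* Pack the five fields into bits 0..16 (hypothetical layout). */
uint32_t pack_loop_start(struct loop_start f)
{
    assert(f.index_reg < 8 && f.log2_step < 8 && f.reg_mask < 256
           && f.group < 4 && f.is_float < 2);
    return ((uint32_t)f.index_reg)
         | ((uint32_t)f.log2_step << 3)
         | ((uint32_t)f.reg_mask  << 6)
         | ((uint32_t)f.group     << 14)
         | ((uint32_t)f.is_float  << 16);
}

/* Inverse of pack_loop_start. */
struct loop_start unpack_loop_start(uint32_t w)
{
    struct loop_start f;
    f.index_reg = w & 0x7u;
    f.log2_step = (w >> 3) & 0x7u;
    f.reg_mask  = (w >> 6) & 0xFFu;
    f.group     = (w >> 14) & 0x3u;
    f.is_float  = (w >> 16) & 0x1u;
    return f;
}

/* Expand group + 8-bit mask into a 32-bit mask over the whole bank. */
uint32_t bank_mask(struct loop_start f)
{
    return (uint32_t)f.reg_mask << (8 * f.group);
}
```

Seventeen bits of parameters leave little room in a 32-bit instruction for an opcode and anything else, which fits the earlier complaint about opcode space.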

    Because iterations are independent, one can't handle a stride in the
    natural efficient manner of adding the stride value to a second
    pointer register. This could be a common source of error, so I feel
    the need to make some provision for this.

    One scheme I am considering would be to include one bit in the
    instruction that begins a loop to indicate the loop contains a
    preamble. The preambles execute serially, and when they conclude,
    everything that follows is issued immediately, to execute in parallel
    (but now with a multi-cycle offset) to previous iterations.

    Upon reflection, this doesn't waste a huge amount of time, so it is
    better to go with it than including fields for stride value and a
    second counter register in the loop start instruction.

    Since the preambles do execute serially, the "end preamble"
    instruction would point to the loop start instruction. Instead of full memory-reference, though, it would just include a short value that is
    a negative program-relative address.

    Iterations that execute in parallel, though, don't "branch back"
    anywhere, so the loop end instruction has no parameters. At least
    something is like your VVM.

    So this is how I take your VVM concept, and mess it up by making it unnecessarily complicated; basically, because I don't want to make an
    ISA that requires implementations to be, so to speak, "intelligent".
    (i.e. upon the first store into a register in the loop, categorize
    that register as a node reference)

    John Savard

  • From Thomas Koenig@21:1/5 to John Savard on Tue Jun 18 19:40:48 2024
    John Savard <quadibloc@servername.invalid> schrieb:
    On Tue, 18 Jun 2024 16:17:33 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:
    John Savard <quadibloc@servername.invalid> schrieb:

    Also, this looping instruction is strictly a way to directly encode
    the FORTRAN DO loop. It does not attempt any vectorization.

    Which one, the FORTRAN 66 one or the one since FORTRAN 77?

    FORTRAN IV (or 66) indeed.

    It was actually not defined in the standard, in practice it
    was usually implemented by a test at the bottom of the loop,
    and programs depended on that.

    FORTRAN 77 fixed that, so now

    DO 100 I=1,0

    ...
    100 CONTINUE

    is executed zero times.
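
The difference can be made concrete by counting trips. The sketch below uses the FORTRAN 77 iteration-count rule, MAX(INT((m2 - m1 + m3)/m3), 0), and compares it with a bottom-tested loop of the kind described above for FORTRAN 66 compilers. The function names are invented for this illustration, and the bottom-tested version is only a model of typical generated code, not something the FORTRAN 66 standard specified.

```c
#include <assert.h>

/* FORTRAN 77: iteration count is MAX(INT((last - first + step) / step), 0).
   C integer division truncates toward zero, matching INT. */
long trips_f77(long first, long last, long step)
{
    assert(step != 0);
    long n = (last - first + step) / step;
    return n > 0 ? n : 0;
}

/* Bottom-tested loop: the body runs once before the first test.
   Only positive steps are modelled here. */
long trips_bottom_tested(long first, long last, long step)
{
    assert(step > 0);
    long n = 0;
    long i = first;
    do {
        n++;            /* the loop body would go here */
        i += step;
    } while (i <= last);
    return n;
}
```

For `DO 100 I=1,0` the first function gives 0 and the second gives 1; for ranges that are not empty the two agree.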

  • From John Savard@21:1/5 to tkoenig@netcologne.de on Tue Jun 18 14:15:48 2024
    On Tue, 18 Jun 2024 19:40:48 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:

    John Savard <quadibloc@servername.invalid> schrieb:
    On Tue, 18 Jun 2024 16:17:33 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:
    John Savard <quadibloc@servername.invalid> schrieb:

    Also, this looping instruction is strictly a way to directly encode
    the FORTRAN DO loop. It does not attempt any vectorization.

    Which one, the FORTRAN 66 one or the one since FORTRAN 77?

    FORTRAN IV (or 66) indeed.

    It was actually not defined in the standard, in practice it
    was usually implemented by a test at the bottom of the loop,
    and programs depended on that.

    FORTRAN 77 fixed that, so now

    DO 100 I=1,0

    ...
    100 CONTINUE

    is executed zero times.

    Ah. I can't include that fix now, as I've changed things so that one
    of the parameters is at the end of the loop, so the instruction that
    heads the loop doesn't know if the "step" parameter is negative or
    not.

    The change has not yet been posted.

    I thought you were asking about whether I included stuff like DO
    WHILE. That would have to be done using old-fashioned conditional
    branch instructions.

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Tue Jun 18 21:23:57 2024
    John Savard wrote:

    On Tue, 18 Jun 2024 16:54:04 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    The semantics of instructions in a loop are subtly altered so
    that they can be vectorized and executed multi-lane style.

    I've decided that I will not be able to use the one from the original Concertina, and will need to design a VVM-like instruction for
    Concertina II from scratch.

    Unlike yours, it won't be...subtle.

    LOL

    The action of the instruction which begins the loop will, I think, be basically the same as yours. It will issue successive iterations of
    the loop starting in consecutive cycles.

    To do so, though, that instruction will contain a number of fields in
    which to specify parameters:

    (3 bits) An index register, which is initialized to zero at the start
    of the loop, and "incremented" (the quote marks are, of course,
    because it won't really be the same register on each iteration) for subsequent iterations.

    This is in the LOOP at the end.

    (3 bits) The power of two which is to serve as the increment.

    The increment is in the LOOP at the end and can be any random value
    and is not necessarily fixed from iteration to iteration.

    (8 bits) A register mask, in which a 1 bit corresponds to a register
    used for intermediate results within the loop. This will become a
    forwarding node rather than a register; all other registers can only
    be read, and serve as constant values only. The index register set up previously does not need to be indicated by this.

    The inverse of this is in VEC at the top. VEC provides a bit vector of
    registers the compiler wants as Live-Out of the loop. That is,
    everything else is temporary. This list rarely annotates more than 2
    live-outs.

    (2 bits) This indicates which of the four groups of 8 registers in a
    bank of 32 registers the register mask applies to.

    I have no register restraints.

    (1 bit) This indicates whether we're talking about the integer
    registers or the floating-point ones.

    Loops controlled by floating point indexes do not vectorize; however,
    the body of the loop can be any mix of int, logical, memory, or FP
    instructions.

    In addition, in the long version of the instruction, there's a 16-bit register mask for the short vector registers.

    Because iterations are independent, one can't handle a stride in the
    natural efficient manner of adding the stride value to a second
    pointer register. This could be a common source of error, so I feel
    the need to make some provision for this.

    Are you using stride in the sense of::

    for( i = 0; i < max; i +=7 )
        a[i] = b[i];

    ??
    It gives VVM no problem whatsoever. Multilane execution is more
    difficult, but semantically the results remain correct.

    One scheme I am considering would be to include one bit in the
    instruction that begins a loop to indicate the loop contains a
    preamble. The preambles execute serially, and when they conclude,
    everything that follows is issued immediately, to execute in parallel
    (but now with a multi-cycle offset) to previous iterations.

    I just have instructions before the VEC instruction.

    Upon reflection, this doesn't waste a huge amount of time, so it is
    better to go with it than including fields for stride value and a
    second counter register in the loop start instruction.

    Since the preambles do execute serially, the "end preamble"
    instruction would point to the loop start instruction. Instead of full memory-reference, though, it would just include a short value that is
    a negative program-relative address.

    Iterations that execute in parallel, though, don't "branch back"
    anywhere, so the loop end instruction has no parameters. At least
    something is like your VVM.

    That is why you want LOOP to execute under a different paradigm than
    BC.

    So this is how I take your VVM concept, and mess it up by making it unnecessarily complicated; basically, because I don't want to make an
    ISA that requires implementations to be, so to speak, "intelligent".
    (i.e. upon the first store into a register in the loop, categorize
    that register as a node reference)

    Do you have a night job as a stand up comedian ??

  • From John Savard@21:1/5 to All on Tue Jun 18 16:01:34 2024
    On Tue, 18 Jun 2024 21:23:57 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    John Savard wrote:

    (1 bit) This indicates whether we're talking about the integer
    registers or the floating-point ones.

    Loops controlled by floating point indexes do not vectorize; however,
    the body of the loop can be any mix of int, logical, memory, or FP
    instructions.

    Oh no, my index is always an integer. This bit applies to the
    "live-in" bits - if the loop performs floating-point computation, it's floating-point registers that I want to mark as forwarding nodes.

    And so you indicate this explicitly in VVM as well. I tended to assume
    only a limited number of registers would be needed to live in, plus I
    have both floating and integer register files, hence the differences.

    John Savard

  • From Stephen Fuld@21:1/5 to Thomas Koenig on Tue Jun 18 22:42:32 2024
    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:
    On Tue, 18 Jun 2024 16:17:33 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:
    John Savard <quadibloc@servername.invalid> schrieb:

    Also, this looping instruction is strictly a way to directly encode
    the FORTRAN DO loop. It does not attempt any vectorization.

    Which one, the FORTRAN 66 one or the one since FORTRAN 77?

    FORTRAN IV (or 66) indeed.

    It was actually not defined in the standard, in practice it
    was usually implemented by a test at the bottom of the loop,
    and programs depended on that.

    FORTRAN 77 fixed that, so now

    DO 100 I=1,0

    ...
    100 CONTINUE

    is executed zero times.


    How does VVM handle that? It seems you must "waste" some time, not
    executing the loop body until the first LOOP instruction tells you
    whether to or not, or perhaps not actually updating the values the
    first time through the loop. Neither seems optimal. :-(

    --
    - Stephen Fuld
    (e-mail address disguised to prevent spam)

  • From MitchAlsup1@21:1/5 to John Savard on Tue Jun 18 23:57:54 2024
    John Savard wrote:

    On Tue, 18 Jun 2024 21:23:57 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    John Savard wrote:

    (1 bit) This indicates whether we're talking about the integer
    registers or the floating-point ones.

    Loops controlled by floating point indexes do not vectorize; however,
    the body of the loop can be any mix of int, logical, memory, or FP
    instructions.

    Oh no, my index is always an integer. This bit applies to the
    "live-in" bits - if the loop performs floating-point computation, it's floating-point registers that I want to mark as forwarding nodes.

    See, I do not have this distinction, there is but one file.

    And so you indicate this explicitly in VVM as well. I tended to assume
    only a limited number of registers would be needed to live in, plus I
    have both floating and integer register files, hence the differences.

    It ends up that the majority of register uses in a loop do not need to
    be visible outside of the loop. This is almost the contrapositive of
    annotating which registers are temporary in the loop. 90%+ of loops do
    not even need the index register to be live outside of the loop.

  • From MitchAlsup1@21:1/5 to Stephen Fuld on Tue Jun 18 23:53:32 2024
    Stephen Fuld wrote:

    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:
    On Tue, 18 Jun 2024 16:17:33 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:
    John Savard <quadibloc@servername.invalid> schrieb:

    Also, this looping instruction is strictly a way to directly encode
    the FORTRAN DO loop. It does not attempt any vectorization.

    Which one, the FORTRAN 66 one or the one since FORTRAN 77?

    FORTRAN IV (or 66) indeed.

    It was actually not defined in the standard, in practice it
    was usually implemented by a test at the bottom of the loop,
    and programs depended on that.

    FORTRAN 77 fixed that, so now

    DO 100 I=1,0

    ...
    100 CONTINUE

    is executed zero times.


    How does VVM handle that? It seems you must "waste" some time, not
    executing the loop body until the first LOOP instruction tells you
    whether to or not, or perhaps not actually updating the values the
    first time through the loop. Neither seems optimal. :-(

    There is a check at the top of the loop which branches around the
    VEC--LOOP bookends--most common loops get this optimized away.
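
In C terms, the check at the top amounts to the familiar transformation of a top-tested loop into a guard plus a bottom-tested loop. The sketch below is a generic illustration of that transformation, with invented function names; it is not taken from any My 66000 compiler.

```c
#include <assert.h>
#include <stddef.h>

/* Source-level loop: runs zero times when max <= 0. */
long sum_for(const int *a, long max)
{
    long s = 0;
    for (long i = 0; i < max; i++)
        s += a[i];
    return s;
}

/* The same loop as a guard plus a bottom-tested body, the shape that
   suits a bottom-of-loop instruction doing add, compare, and branch. */
long sum_guarded(const int *a, long max)
{
    long s = 0;
    if (max <= 0)          /* guard: branch around the whole loop */
        return s;
    long i = 0;
    do {
        s += a[i];
        i++;
    } while (i < max);     /* bottom-of-loop test */
    return s;
}

/* With a constant, positive trip count no guard is needed at all. */
long sum_first_four(const int *a)
{
    assert(a != NULL);
    long s = 0;
    long i = 0;
    do {
        s += a[i];
        i++;
    } while (i < 4);
    return s;
}
```

When the trip count is a known positive constant, as in `sum_first_four`, the guard folds away at compile time, which is presumably the kind of case meant by "optimized away".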

  • From MitchAlsup1@21:1/5 to Stephen Fuld on Wed Jun 19 00:12:40 2024
    Stephen Fuld wrote:

    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:
    On Tue, 18 Jun 2024 16:17:33 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:
    John Savard <quadibloc@servername.invalid> schrieb:

    Also, this looping instruction is strictly a way to directly encode
    the FORTRAN DO loop. It does not attempt any vectorization.

    Which one, the FORTRAN 66 one or the one since FORTRAN 77?

    FORTRAN IV (or 66) indeed.

    It was actually not defined in the standard, in practice it
    was usually implemented by a test at the bottom of the loop,
    and programs depended on that.

    FORTRAN 77 fixed that, so now

    DO 100 I=1,0

    ...
    100 CONTINUE

    is executed zero times.


    How does VVM handle that? It seems you must "waste" some time, not
    executing the loop body until the first LOOP instruction tells you
    whether to or not, or perhaps not actually updating the values the
    first time through the loop. Neither seems optimal. :-(

    Compiler emits a check at the top of the loop and branches around
    VEC-LOOP if the loop is not supposed to be run.

  • From MitchAlsup1@21:1/5 to Stephen Fuld on Wed Jun 19 00:36:22 2024
    Stephen Fuld wrote:

    Thomas Koenig wrote:

    John Savard <quadibloc@servername.invalid> schrieb:
    On Tue, 18 Jun 2024 16:17:33 -0000 (UTC), Thomas Koenig
    <tkoenig@netcologne.de> wrote:
    John Savard <quadibloc@servername.invalid> schrieb:

    Also, this looping instruction is strictly a way to directly encode
    the FORTRAN DO loop. It does not attempt any vectorization.

    Which one, the FORTRAN 66 one or the one since FORTRAN 77?

    FORTRAN IV (or 66) indeed.

    It was actually not defined in the standard, in practice it
    was usually implemented by a test at the bottom of the loop,
    and programs depended on that.

    FORTRAN 77 fixed that, so now

    DO 100 I=1,0

    ...
    100 CONTINUE

    is executed zero times.


    How does VVM handle that? It seems you must "waste" some time, not
    executing the loop body until the first LOOP instruction tells you
    whether to or not, or perhaps not actually updating the values the
    first time through the loop. Neither seems optimal. :-(


    The compiler emits code at the top of the loop to branch around the
    VEC-LOOP bookends.

  • From MitchAlsup1@21:1/5 to John Savard on Wed Jun 19 00:28:24 2024
    John Savard wrote:

    On Tue, 18 Jun 2024 16:54:04 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    The semantics of instructions in a loop are subtly altered so
    that they can be vectorized and executed multi-lane style.

    I've decided that I will not be able to use the one from the original Concertina, and will need to design a VVM-like instruction for
    Concertina II from scratch.

    Unlike yours, it won't be...subtle.

    The action of the instruction which begins the loop will, I think, be basically the same as yours. It will issue successive iterations of
    the loop starting in consecutive cycles.

    To do so, though, that instruction will contain a number of fields in
    which to specify parameters:

    (3 bits) An index register, which is initialized to zero at the start
    of the loop, and "incremented" (the quote marks are, of course,
    because it won't really be the same register on each iteration) for subsequent iterations.

    Ri is provided in the LOOP instruction

    (3 bits) The power of two which is to serve as the increment.

    There is no such need in VVM, increment is either a constant or a
    register and is not restricted to powers of 2.

    (8 bits) A register mask, in which a 1 bit corresponds to a register
    used for intermediate results within the loop. This will become a
    forwarding node rather than a register; all other registers can only
    be read, and serve as constant values only. The index register set up previously does not need to be indicated by this.

    The contrapositive of this is provided for in VEC.

    (2 bits) This indicates which of the four groups of 8 registers in a
    bank of 32 registers the register mask applies to.

    I found no need.

    (1 bit) This indicates whether we're talking about the integer
    registers or the floating-point ones.

    I have but 1 register file.

    In addition, in the long version of the instruction, there's a 16-bit
    register mask for the short vector registers.

    I also have no software-addressable vector registers.
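
    As a rough sketch of the field list above (short form only), the
    parameters add up to 3 + 3 + 8 + 2 + 1 = 17 bits. The bit positions,
    the struct, and the function names below are assumptions made for
    illustration; the thread gives only the field widths, not the actual
    Concertina II encoding.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical layout, low bits first; the thread gives widths only. */
typedef struct {
    unsigned index_reg;   /* 3 bits: index register number             */
    unsigned step_log2;   /* 3 bits: increment is 1 << step_log2       */
    unsigned reg_mask;    /* 8 bits: 1 = register is a forwarding node */
    unsigned group;       /* 2 bits: which group of 8 out of 32        */
    unsigned is_float;    /* 1 bit : 0 = integer, 1 = floating-point   */
} loop_fields;

/* Pack the five fields into the low 17 bits of a word. */
uint32_t pack_loop_fields(loop_fields f)
{
    assert(f.index_reg < 8 && f.step_log2 < 8 && f.reg_mask < 256);
    assert(f.group < 4 && f.is_float < 2);
    return (uint32_t)f.index_reg
         | (uint32_t)f.step_log2 << 3
         | (uint32_t)f.reg_mask  << 6
         | (uint32_t)f.group     << 14
         | (uint32_t)f.is_float  << 16;
}

/* Expand the 8-bit mask plus group number into a 32-register mask. */
uint32_t forwarding_mask32(loop_fields f)
{
    assert(f.reg_mask < 256 && f.group < 4);
    return (uint32_t)f.reg_mask << (8 * f.group);
}
```

    With one group number per instruction, a single loop can mark
    forwarding nodes in only one group of eight registers, which is the
    restriction the 2-bit field implies.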

    Because iterations are independent, one can't handle a stride in the
    natural efficient manner of adding the stride value to a second
    pointer register. This could be a common source of error, so I feel
    the need to make some provision for this.

    for( i = 0; i < max; i += 7 )

    falls out for free. But also note::

    for( i = 0; i < max; i++ )
        a[i] = b[i];

    is always faster than:

    for( i = 0; i < max; i++ )
        *ap++ = *bp++;

    The top loop is 3 instructions, the bottom one is 5.
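
    For reference, here are the two loop forms as complete C functions;
    they produce the same result. The function names are made up for this
    sketch, and the instruction counts quoted above are the poster's
    figures for My 66000, not something this code measures. The point is
    only that the pointer form carries two extra values (ap and bp) that
    change on every iteration, while the indexed form has just i.

```c
#include <assert.h>
#include <stddef.h>

/* Indexed form: one induction variable, i, addresses both arrays. */
void copy_indexed(int *a, const int *b, size_t max)
{
    assert(max == 0 || (a != NULL && b != NULL));
    for (size_t i = 0; i < max; i++)
        a[i] = b[i];
}

/* Pointer form: ap and bp are two extra values updated every pass. */
void copy_pointers(int *a, const int *b, size_t max)
{
    assert(max == 0 || (a != NULL && b != NULL));
    int *ap = a;
    const int *bp = b;
    for (size_t i = 0; i < max; i++)
        *ap++ = *bp++;
}
```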

    One scheme I am considering would be to include one bit in the
    instruction that begins a loop to indicate the loop contains a
    preamble. The preambles execute serially, and when they conclude,
    everything that follows is issued immediately, to execute in parallel
    (but now with a multi-cycle offset) to previous iterations.

    VVM just has instructions before the VEC instruction to deal with this.

    Upon reflection, this doesn't waste a huge amount of time, so it is
    better to go with it than including fields for stride value and a
    second counter register in the loop start instruction.

    Since the preambles do execute serially, the "end preamble"
    instruction would point to the loop start instruction. Instead of full
    memory-reference, though, it would just include a short value that is
    a negative program-relative address.

    Iterations that execute in parallel, though, don't "branch back"
    anywhere, so the loop end instruction has no parameters. At least
    something is like your VVM.

    By considering the branch back to the top as a return, those loops
    which were executed simultaneously just die instead of returning to the
    top; only the MOD-N lane returns to the top.

    So this is how I take your VVM concept, and mess it up by making it
    unnecessarily complicated; basically, because I don't want to make an
    ISA that requires implementations to be, so to speak, "intelligent".
    (i.e. upon the first store into a register in the loop, categorize
    that register as a node reference)

    LOL but have fun.

    John Savard

  • From John Savard@21:1/5 to All on Tue Jun 18 21:36:06 2024
    On Tue, 18 Jun 2024 23:57:54 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    John Savard wrote:

    And so you indicate this explicitly in VVM as well. I tended to assume
    only a limited number of registers would be needed to live in, plus I
    have both floating and integer register files, hence the differences.

    It ends up that the majority of register uses in a loop do not need to
    be visible outside of the loop. This is almost the contrapositive of
    annotating which registers are temporary in the loop. 90%+ of loops do
    not even need the index register to be live outside of the loop.

    You have convinced me here to learn from your wisdom: I will do two
    things. One is to add a bit that decides whether my 1 bits (confined
    to a single group of 8 registers) are live-in or live-out bits. The
    other is to specify clearly to implementors that if a register is
    specified as "live-in" but is never actually used in a loop, this must
    not cause any problems.

    John Savard

  • From Anton Ertl@21:1/5 to mitchalsup@aol.com on Wed Jun 19 07:58:08 2024
    mitchalsup@aol.com (MitchAlsup1) writes:
    [the instruction that ends the loop]
    No, it is not a memref--it is a return, using the register from the
    VEC instruction. You "return" to the top of the loop. There is no
    reason to use IP+Disp, and the fact there is no register nor
    displacement in LOOP enables it all to fit. In addition, when VEC
    executes, IP is pointing at the top of the loop, requiring no
    calculation whatsoever.

    On a related note, about a year ago I started research on the
    performance effect of (programming language) virtual-machine IP
    updates in interpreters. The dependence chains of these IP updates
    create a lower bound for the execution time of the program, and it
    turns out that, if the interpreter is otherwise efficient enough, this
    lower bound determines performance, and that we see speedups by up to
    a factor of 3 (depending on benchmark and microarchitecture) by
    optimizing these IP updates.

    One of the optimizations we tried out was to break the dependence
    chain by saving the IP on loop entry, and using that IP when starting
    the next iteration; this eliminates the IP updates of one iteration
    from the dependence chain of the next iteration.

    - anton
    --
    'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
    Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

  • From Terje Mathisen@21:1/5 to All on Wed Jun 19 13:26:30 2024
    MitchAlsup1 wrote:
    John Savard wrote:

    On Tue, 18 Jun 2024 21:23:57 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    John Savard wrote:

    (1 bit) This indicates whether we're talking about the integer
    registers or the floating-point ones.

    Loops controlled by floating point indexes do not vectorize, however
    the body of the loop can be any mix of int, logical, memory, or FP
    instructions.

    Oh no, my index is always an integer. This bit applies to the
    "live-in" bits - if the loop performs floating-point computation, it's
    floating-point registers that I want to mark as forwarding nodes.

    See, I do not have this distinction, there is but one file.

    And so you indicate this explicitly in VVM as well. I tended to assume
    only a limited number of registers would be needed to live in, plus I
    have both floating and integer register files, hence the differences.

    It ends up that the majority of register uses in a loop do not need to
    be visible outside of the loop. This is almost the contrapositive of
    annotating which registers are temporary in the loop. 90%+ of loops do
    not even need the index register to be live outside of the loop.

    This is partly due to programming languages that apply lifetimes to
    variables, so that an index register which is defined in the scaffolding
    of the loop (i.e. for (i = 0; i < limit; i++) {}) is invisible as soon
    as the loop terminates.

    Without such a restriction, there are many times when it would be very
    natural to inspect the index in order to determine if this was a normal
    (counting) exit, or an early exit due to some internal test.

    Personally, I have still not settled on my preferred way to handle cases
    like this, but I possibly will do so after I retire.

    Terje

    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From MitchAlsup1@21:1/5 to Terje Mathisen on Wed Jun 19 16:04:40 2024
    Terje Mathisen wrote:

    MitchAlsup1 wrote:
    John Savard wrote:

    On Tue, 18 Jun 2024 21:23:57 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    John Savard wrote:

    (1 bit) This indicates whether we're talking about the integer
    registers or the floating-point ones.

    Loops controlled by floating point indexes do not vectorize, however
    the body of the loop can be any mix of int, logical, memory, or FP
    instructions.

    Oh no, my index is always an integer. This bit applies to the
    "live-in" bits - if the loop performs floating-point computation, it's
    floating-point registers that I want to mark as forwarding nodes.

    See, I do not have this distinction, there is but one file.

    And so you indicate this explicitly in VVM as well. I tended to assume
    only a limited number of registers would be needed to live in, plus I
    have both floating and integer register files, hence the differences.

    It ends up that the majority of register uses in a loop do not need to
    be visible outside of the loop. This is almost the contrapositive of
    annotating which registers are temporary in the loop. 90%+ of loops do
    not even need the index register to be live outside of the loop.

    This is partly due to programming languages that apply lifetimes to
    variables, so that an index register which is defined in the
    scaffolding of the loop (i.e. for (i = 0; i < limit; i++) {}) is
    invisible as soon as the loop terminates.

    There are loops for which the last index and the last inbound data
    reference want to remain visible--search loops for example. But in
    general, the amount of data wanted outside of the loop is very small
    indeed.

    Without such a restriction, there are many times when it would be very
    natural to inspect the index in order to determine if this was a normal
    (counting) exit, or an early exit due to some internal test.

    The most important thing is that the live-outs of the loop are few
    while the loop-temps are many.

    Personally, I have still not settled on my preferred way to handle
    cases like this, but I possibly will do so after I retire.

    Terje

  • From John Savard@21:1/5 to quadibloc@servername.invalid on Wed Jun 19 11:01:26 2024
    On Tue, 18 Jun 2024 14:15:48 -0600, John Savard
    <quadibloc@servername.invalid> wrote:

    Ah. I can't include that fix now, as I've changed things so that one
    of the parameters is at the end of the loop, so the instruction that
    heads the loop doesn't know if the "step" parameter is negative or
    not.

    The change has not yet been posted.

    I have now updated my loop instruction so that there's no need for a
    Step instruction. There was one problem: the new Iterate instruction
    takes more opcode space, and that took away some opcode space used for
    headers. Fortunately, I had some available opcode space now among
    operate instructions instead of memory-reference instructions that I
    could use instead, so I moved them over.

    John Savard

  • From Thomas Koenig@21:1/5 to Terje Mathisen on Wed Jun 19 17:18:07 2024
    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    MitchAlsup1 wrote:
    John Savard wrote:

    On Tue, 18 Jun 2024 21:23:57 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    John Savard wrote:

    (1 bit) This indicates whether we're talking about the integer
    registers or the floating-point ones.

    Loops controlled by floating point indexes do not vectorize, however
    the body of the loop can be any mix of int, logical, memory, or FP
    instructions.

    Oh no, my index is always an integer. This bit applies to the
    "live-in" bits - if the loop performs floating-point computation, it's
    floating-point registers that I want to mark as forwarding nodes.

    See, I do not have this distinction, there is but one file.

    And so you indicate this explicitly in VVM as well. I tended to assume
    only a limited number of registers would be needed to live in, plus I
    have both floating and integer register files, hence the differences.

    It ends up that the majority of register uses in a loop do not need to
    be visible outside of the loop. This is almost the contrapositive of
    annotating which registers are temporary in the loop. 90%+ of loops do
    not even need the index register to be live outside of the loop.

    This is partly due to programming languages that apply lifetimes to
    variables, so that an index register which is defined in the scaffolding
    of the loop (i.e. for (i = 0; i < limit; i++) {}) is invisible as soon
    as the loop terminates.

    This makes things more clear to anybody reading the code (and
    unambiguous to the compiler). However, lifetime analysis has
    also become very good, and if the value is not used afterwards,
    I expect no difference in practice.

    Without such a restriction, there are many times when it would be very
    natural to inspect the index in order to determine if this was a normal
    (counting) exit, or an early exit due to some internal test.

    Hmm... do you mean for the programmer, or for the compiler?

  • From MitchAlsup1@21:1/5 to Thomas Koenig on Wed Jun 19 18:31:12 2024
    Thomas Koenig wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    MitchAlsup1 wrote:
    John Savard wrote:

    On Tue, 18 Jun 2024 21:23:57 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    John Savard wrote:

    (1 bit) This indicates whether we're talking about the integer
    registers or the floating-point ones.

    Loops controlled by floating point indexes do not vectorize, however
    the body of the loop can be any mix of int, logical, memory, or FP
    instructions.

    Oh no, my index is always an integer. This bit applies to the
    "live-in" bits - if the loop performs floating-point computation, it's
    floating-point registers that I want to mark as forwarding nodes.

    See, I do not have this distinction, there is but one file.

    And so you indicate this explicitly in VVM as well. I tended to assume
    only a limited number of registers would be needed to live in, plus I
    have both floating and integer register files, hence the differences.

    It ends up that the majority of register uses in a loop do not need to
    be visible outside of the loop. This is almost the contrapositive of
    annotating which registers are temporary in the loop. 90%+ of loops do
    not even need the index register to be live outside of the loop.

    This is partly due to programming languages that apply lifetimes to
    variables, so that an index register which is defined in the
    scaffolding of the loop (i.e. for (i = 0; i < limit; i++) {}) is
    invisible as soon as the loop terminates.

    This makes things more clear to anybody reading the code (and
    unambiguous to the compiler). However, lifetime analysis has
    also become very good, and if the value is not used afterwards,
    I expect no difference in practice.

    When one writes::

    for( uint64_t i = 0; i < max; i++ )

    the lifetime of i is explicit--it terminates with the loop.

    Without such a restriction, there are many times when it would be very
    natural to inspect the index in order to determine if this was a normal
    (counting) exit, or an early exit due to some internal test.

    Hmm... do you mean for the programmer, or for the compiler?

  • From Terje Mathisen@21:1/5 to All on Wed Jun 19 22:49:40 2024
    MitchAlsup1 wrote:
    Thomas Koenig wrote:

    Terje Mathisen <terje.mathisen@tmsw.no> schrieb:
    This is partly due to programming languages that apply lifetimes to
    variables, so that an index register which is defined in the
    scaffolding of the loop (i.e. for (i = 0; i < limit; i++) {}) is
    invisible as soon as the loop terminates.

    This makes things more clear to anybody reading the code (and
    unambiguous to the compiler).  However, lifetime analysis has
    also become very good, and if the value is not used afterwards,
    I expect no difference in practice.

    When one writes::

        for( uint64_t i = 0; i < max; i++ )

    the lifetime of i is explicit--it terminates with the loop.

    Without such a restriction, there are many times when it would be
    very natural to inspect the index in order to determine if this was a
    normal (counting) exit, or an early exit due to some internal test.

    Hmm... do you mean for the programmer, or for the compiler?

    This is probably my asm background shining through:

    All asm loops have the counting register available directly after loop
    exit, until it is reused. When I want to do the same in C I just have to
    define the variable before the loop starts, instead of inside the ().
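
    A minimal C sketch of that idiom, assuming a simple linear search
    (the function name and the "return n when not found" convention are
    choices made for this example, not something from the thread): the
    index is declared before the loop, so its final value tells a normal
    counting exit apart from an early one.

```c
#include <assert.h>
#include <stddef.h>

/* Returns the index of the first element equal to key, or n if none. */
size_t find_first(const int *v, size_t n, int key)
{
    assert(n == 0 || v != NULL);
    size_t i;                 /* declared outside the for(): outlives the loop */
    for (i = 0; i < n; i++) {
        if (v[i] == key)
            break;            /* early exit: i < n afterwards */
    }
    return i;                 /* i == n means the loop ran to completion */
}
```

    In VVM terms this is one of the few loops where the index is a
    live-out; had i been declared inside the for(), the compiler would
    know without further analysis that it is dead at loop exit.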

    Terje
    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From Terje Mathisen@21:1/5 to All on Wed Jun 19 22:52:41 2024
    MitchAlsup1 wrote:
    Terje Mathisen wrote:

    MitchAlsup1 wrote:
    John Savard wrote:

    On Tue, 18 Jun 2024 21:23:57 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    John Savard wrote:

    (1 bit) This indicates whether we're talking about the integer
    registers or the floating-point ones.

    Loops controlled by floating point indexes do not vectorize, however
    the body of the loop can be any mix of int, logical, memory, or FP
    instructions.

    Oh no, my index is always an integer. This bit applies to the
    "live-in" bits - if the loop performs floating-point computation, it's
    floating-point registers that I want to mark as forwarding nodes.

    See, I do not have this distinction, there is but one file.

    And so you indicate this explicitly in VVM as well. I tended to assume
    only a limited number of registers would be needed to live in, plus I
    have both floating and integer register files, hence the differences.

    It ends up that the majority of register uses in a loop do not need to
    be visible outside of the loop. This is almost the contrapositive of
    annotating which registers are temporary in the loop. 90%+ of loops
    do not even need the index register to be live outside of the loop.

    This is partly due to programming languages that apply lifetimes to
    variables, so that an index register which is defined in the
    scaffolding of the loop (i.e. for (i = 0; i < limit; i++) {}) is
    invisible as soon as the loop terminates.

    There are loops for which the last index and the last inbound data
    reference want to remain visible--search loops for example. But in
    general, the amount of data wanted outside of the loop is very small
    indeed.

    Right.

    Without such a restriction, there are many times when it would be very
    natural to inspect the index in order to determine if this was a normal
    (counting) exit, or an early exit due to some internal test.

    The most important thing is that the live-outs of the loop are few
    while the loop-temps are many.

    Also almost always true.

    Terje

    Personally, I have still not settled on my preferred way to handle
    cases like this, but I possibly will do so after I retire.

    Terje


    --
    - <Terje.Mathisen at tmsw.no>
    "almost all programming can be viewed as an exercise in caching"

  • From John Savard@21:1/5 to All on Wed Jun 19 22:04:04 2024
    On Tue, 18 Jun 2024 21:23:57 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    John Savard wrote:

    On Tue, 18 Jun 2024 16:54:04 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    The semantics of instructions in a loop are subtly altered such
    that they can be vectorized and to execute multi-lane style.

    I've decided that I will not be able to use the one from the original
    Concertina, and will need to design a VVM-like instruction for
    Concertina II from scratch.

    Unlike yours, it won't be...subtle.

    LOL

    I wrote that before I learned you had explicit opt-out bits in your
    VVM instruction.

    Also, I've checked. There's nothing resembling VVM in my original
    Concertina design, as I had mistakenly thought.

    John Savard

  • From John Savard@21:1/5 to quadibloc@servername.invalid on Wed Jun 19 21:39:11 2024
    On Tue, 18 Jun 2024 21:36:06 -0600, John Savard
    <quadibloc@servername.invalid> wrote:

    On Tue, 18 Jun 2024 23:57:54 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    John Savard wrote:

    And so you indicate this explicitly in VVM as well. I tended to assume
    only a limited number of registers would be needed to live in, plus I
    have both floating and integer register files, hence the differences.

    It ends up that the majority of register uses in a loop do not need to
    be visible outside of the loop. This is almost the contrapositive of
    annotating which registers are temporary in the loop. 90%+ of loops do
    not even need the index register to be live outside of the loop.

    You have convinced me here to learn from your wisdom: I will do two
    things. One is to add a bit that decides whether my 1 bits (confined
    to a single group of 8 registers) are live-in or live-out bits. The
    other is to specify clearly to implementors that if a register is
    specified as "live-in" but is never actually used in a loop, this must
    not cause any problems.

    I have not yet added my attempt at an imitation of VVM to Concertina
    II. However, I have now laid some important groundwork for it.

    In my architecture, there are already Cray-style long vectors. They
    are intended to be the principal and most efficient way of working
    with vector quantities in the architecture. So if my VVM-alike was
    disjoint from them, and could only interact with them through memory,
    this would be an awkwardness in the ISA that needlessly constrains
    performance.

    So I've added operate instructions that allow operations where one
    operand is in a normal register, and the other operand is in a
    selected element of a vector register. The element is itself specified
    by the contents of an integer register, for convenient use within
    loops.

    Thus, a VVM-alike loop, instead of going from some vectors in memory
    to other vectors in memory, could go from some vector registers to
    other vector registers. The vectors aren't virtual any more.
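
    A small software model of the idea, as a sketch only: one operand
    comes from a scalar, the other from the element of a vector register
    selected by an integer index. The vector length of 64, the type and
    function names, and the choice to write the result back into a vector
    element are all assumptions for illustration; the post does not give
    the instruction's actual form.

```c
#include <assert.h>
#include <stddef.h>

#define VLEN 64   /* assumed vector register length; not from the thread */

typedef struct { double e[VLEN]; } vreg;

/* Model of "scalar op vector element": returns s + v[idx],
   where idx would come from an integer register. */
double add_scalar_velem(double s, const vreg *v, size_t idx)
{
    assert(idx < VLEN);
    return s + v->e[idx];
}

/* A loop that goes from vector registers to a vector register,
   one element per iteration, with the loop index selecting the element:
   y[i] = a * x[i] + y[i]. */
void vreg_axpy(vreg *y, const vreg *x, double a, size_t n)
{
    assert(n <= VLEN);
    for (size_t i = 0; i < n; i++)
        y->e[i] = add_scalar_velem(a * x->e[i], y, i);
}
```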

    John Savard

  • From John Savard@21:1/5 to quadibloc@servername.invalid on Sat Jun 22 23:04:28 2024
    On Wed, 19 Jun 2024 21:39:11 -0600, John Savard
    <quadibloc@servername.invalid> wrote:

    So I've added operate instructions that allow operations where one
    operand is in a normal register, and the other operand is in a
    selected element of a vector register. The element is itself specified
    by the contents of an integer register, for convenient use within
    loops.

    Thus, a VVM-alike loop, instead of going from some vectors in memory
    to other vectors in memory, could go from some vector registers to
    other vector registers. The vectors aren't virtual any more.

    Because it seemed to me that any VVM-alike instruction I had would
    have to have at least an alternate form longer than 32 bits, despite
    my efforts to squeeze it into much less space than you use... I felt
    that I needed to go back to an earlier iteration of Concertina for a
    method of making it easier to use long instructions in programs.

    Doing that, though, required me to reserve some opcode space, and one
    of the consequences is that the instructions referred to above had to
    be moved to an alternate instruction set!

    I haven't yet added the additional long instructions to the pages. If
    I'm reserving that much opcode space (1/32nd of the total opcode
    space) I'm thinking I should do something amazing with it, not
    something ho-hum.

    Meanwhile, though, I have added something "amazing" to the ISA for a
    very tiny cost in opcode space. I've added an eleventh header type
    which applies *four* prefix bits to every 16 bits in what's left of
    the block after the header.

    What does this do?

    Well, it used to be I had 16-bit instructions occupying 1/4 of the
    opcode space which included register-to-register instructions that
    could involve only two registers from the same group of eight
    registers.

    Partly because I was told this was a very bad thing, and because I
    needed to take that 1/4 of the opcode space back so I could have
    load-store instructions that were not heavily restricted to squeeze
    them into less space, I used prefix bits to change the 15-bit
    instructions to 17-bit instructions that could use any two registers.

    Well, the new header type adds the option to also, by using some
    prefix bits, assign a 19-bit instruction to a 16-bit slot... and these
    19-bit instructions add memory-reference instructions to the half-word
    instructions.

    So now, in addition to containing up to 8 ordinary 32-bit
    instructions, a 256-bit block can contain up to 24 instructions
    belonging to a mix of 17-bit and 19-bit instructions, short
    instructions that now are a complete set, including load and store
    memory reference instructions.

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Sun Jun 23 16:19:27 2024
    John Savard wrote:

    On Tue, 18 Jun 2024 21:36:06 -0600, John Savard
    <quadibloc@servername.invalid> wrote:

    On Tue, 18 Jun 2024 23:57:54 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:
    John Savard wrote:

    And so you indicate this explicitly in VVM as well. I tended to assume
    only a limited number of registers would be needed to live in, plus I
    have both floating and integer register files, hence the differences.

    It ends up that the majority of register uses in a loop do not need to
    be visible outside of the loop. This is almost the contrapositive of
    annotating which registers are temporary in the loop. 90%+ of loops do
    not even need the index register to be live outside of the loop.

    You have convinced me here to learn from your wisdom: I will do two
    things. One is to add a bit that decides whether my 1 bits (confined
    to a single group of 8 registers) are live-in or live-out bits. The
    other is to specify clearly to implementors that if a register is
    specified as "live-in" but is never actually used in a loop, this must
    not cause any problems.

    I have not yet added my attempt at an imitation of VVM to Concertina
    II. However, I have now laid some important groundwork for it.

    In my architecture, there are already Cray-style long vectors. They
    are intended to be the principal and most efficient way of working
    with vector quantities in the architecture. So if my VVM-alike was
    disjoint from them, and could only interact with them through memory,
    this would be an awkwardness in the ISA that needlessly constrains
    performance.

    While the vectorizing HW certainly has CRAY-like vector flip-flops
    they are not addressable by SW. The code within the VEC--LOOP
    brackets reads as if scalar:: So, My 66000 consumes exactly 2
    OpCodes to provide an entire vector instruction set--one that
    works as well as possible across various implementations.

    So I've added operate instructions that allow operations where one
    operand is in a normal register, and the other operand is in a
    selected element of a vector register. The element is itself specified
    by the contents of an integer register, for convenient use within
    loops.

    Thus, a VVM-alike loop, instead of going from some vectors in memory
    to other vectors in memory, could go from some vector registers to
    other vector registers. The vectors aren't virtual any more.

    A VVM Loop is just a bunch of normal instructions between 2 brackets
    that can be executed as fast as dependencies allow and as many times
    as the loop count and condition allow.

    John Savard

  • From John Savard@21:1/5 to All on Sun Jun 23 15:26:15 2024
    On Sun, 23 Jun 2024 16:19:27 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    While the vectorizing HW certainly has CRAY-like vector flip-flops
    they are not addressable by SW. The code within the VEC--LOOP
    brackets reads as if scalar:: So, My 66000 consumes exactly 2
    OpCodes to provide an entire vector instruction set--one that
    works as well as possible across various implementations.

    Oh, yes, your VVM is wonderful.

    My attempt at an imitation of VVM, at least, if not the real thing
    that you have in your 66000, would be inferior in one important way to
    Cray-style vector registers.

    A virtual vector loop would take input vector values from memory, and
    return results to memory. Yes, there are multiple operations within
    the loop, but I am still assuming that the length and complexity of
    such loops is constrained.

    So if you have Cray-style vector registers, you have a place to store
    intermediate results _between_ these loops that avoids referring to
    memory.

    In addition, one potentially catastrophic limitation is that, because
    the meaning of register specifications in instructions is changed,
    _there can't be any subroutine calls in such loops_. (Now that it's
    typical for computers to have instructions that do log and trig
    functions, this is slightly _less_ catastrophic, though.) Branches
    within the loops and instruction predication, though, would still be
    permitted.

    John Savard

  • From MitchAlsup1@21:1/5 to John Savard on Sun Jun 23 23:46:23 2024
    John Savard wrote:

    On Sun, 23 Jun 2024 16:19:27 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    While the vectorizing HW certainly has CRAY-like vector flip-flops
    they are not addressable by SW. The code within the VEC--LOOP
    brackets reads as if scalar:: So, My 66000 consumes exactly 2
    OpCodes to provide an entire vector instruction set--one that
    works as well as possible across various implementations.

    Oh, yes, your VVM is wonderful.

    Well, at least it keeps the R in RISC meaning Reduced instead of
    Ridiculous.

    It also alters cache semantics to avoid a single vector from erasing
    the whole data cache. If the data is not going to be used again,
    then it is not put in the cache (both inbound and outbound.)

    My attempt at an imitation of VVM, at least, if not the real thing
    that you have in your 66000, would be inferior in one important way to
    Cray-style vector registers.

    A virtual vector loop would take input vector values from memory, and
    return results to memory. Yes, there are multiple operations within
    the loop, but I am still assuming that the length and complexity of
    such loops is constrained.

    So if you have Cray-style vector registers, you have a place to store
    intermediate results _between_ these loops that avoids referring to
    memory.

    Vector reduction is about the only realistic limitation, even here,
    CRAY-like vectors "have their own problems". Performing a Summation
    over an array consumes memory in order but performs FADDs in modulo
    order whereas VVM performs the FADDs in program order. IEEE went so
    far as to specify augmented addition which greatly ameliorates the
    addition order problems.
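
    To make the ordering point concrete, here is a sketch in C under the
    assumption that "modulo order" means element i is accumulated into
    partial sum i mod N and the partial sums are combined at the end
    (the function names and the limit of 8 lanes are invented for the
    example). Because floating-point addition is not associative, the two
    orders can give different answers for the same data.

```c
#include <assert.h>
#include <stddef.h>

/* Program order: one accumulator, elements added first to last. */
double sum_program_order(const double *x, size_t n)
{
    assert(n == 0 || x != NULL);
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += x[i];
    return s;
}

/* Modulo order: element i goes to accumulator i % lanes,
   and the partial sums are combined at the end. */
double sum_modulo_order(const double *x, size_t n, size_t lanes)
{
    double part[8] = {0.0};
    assert(lanes >= 1 && lanes <= 8);
    for (size_t i = 0; i < n; i++)
        part[i % lanes] += x[i];
    double s = 0.0;
    for (size_t l = 0; l < lanes; l++)
        s += part[l];
    return s;
}
```

    For the input {1e16, 1.0, -1e16, 1.0} in IEEE 754 double precision,
    program order gives 1.0 (the first 1.0 is lost when added to 1e16),
    while modulo order with two lanes gives 2.0, since the large values
    cancel in one lane and the small ones add in the other.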

    These are loops from memory but to registers. About the only loops
    that are from registers and to memory are memset()-like--which
    easily vectorizes.

    What VVM does not provide is the non-looping individual instructions.

    In addition, one potentially catastrophic limitation is that, because
    the meaning of register specifications in instructions is changed,
    _there can't be any subroutine calls in such loops_. (Now that it's
    typical for computers to have instructions that do log and trig
    functions, this is slightly _less_ catastrophic, though.) Branches
    within the loops and instruction predication, though, would still be
    permitted.

    First, most trig functions have become instructions, not subroutine
    calls, so that issue is ameliorated.

    But, yes, VVM <as of now> only vectorizes the innermost loop.

    John Savard

  • From John Savard@21:1/5 to All on Sun Jun 23 19:50:44 2024
    On Sun, 23 Jun 2024 23:46:23 +0000, mitchalsup@aol.com (MitchAlsup1)
    wrote:

    But, yes, VVM <as of now> only vectorizes the innermost loop.

    I don't regard _that_ as an issue or limitation, at least in itself.
    But keeping code I don't expect to vectorize from using memory is
    still a gain.

    John Savard

  • From John Savard@21:1/5 to quadibloc@servername.invalid on Mon Jun 24 19:52:28 2024
    On Sat, 22 Jun 2024 23:04:28 -0600, John Savard
    <quadibloc@servername.invalid> wrote:

    Because it seemed to me that any VVM-alike instruction I had would
    have to have at least an alternate form longer than 32 bits, despite
    my efforts to squeeze it into much less space than you use... I felt
    that I needed to go back to an earlier iteration of Concertina for a
    method of making it easier to use long instructions in programs.

    Doing that, though, required me to reserve some opcode space, and one
    of the consequences is that the instructions referred to above had to
    be moved to an alternate instruction set!

    I decided that this was unacceptable, and that I did not need to
    reserve so much space for an alternate way of encoding long
    instructions.

    Instead of changing how most long instructions are encoded, I've kept
    this new way of encoding long instructions, with less opcode space
    reserved for it, for a special use: long instructions that might need
    to vary in length in a complicated fashion. Instructions that are
    entirely in the instruction stream as variable-length instructions
    can't be like that, but if the excess over 32 bits is accessed by a
    supplementary pointer in the same reserved area as used for
    pseudo-immediate values, then it doesn't matter if the length of the
    instruction varies because various fields are included or omitted in
    a complicated fashion.
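
    A rough sketch of the idea follows. Everything concrete in it is an
    assumption for illustration and not the actual Concertina II
    encoding: the block of eight 32-bit words, the pointer sitting in
    the low 4 bits of the base word and counting in words, and the names
    PSUPP_MASK, fetch_long and next_slot. The number of extra words is
    passed in directly; a real decoder would derive it from fields in
    the instruction.

```python
PSUPP_MASK = 0xF  # hypothetical: low 4 bits of the base word hold the pointer


def fetch_long(block, slot, ext_words):
    """Return (base word, extension words) for the instruction in the given slot.

    The extension lives in the reserved area of the same block, located by
    the pointer field, so it never occupies the sequential instruction stream.
    """
    base = block[slot]
    start = base & PSUPP_MASK
    if start + ext_words > len(block):
        raise ValueError("extension runs past the end of the block")
    return base, block[start:start + ext_words]


def next_slot(slot):
    """The following instruction is always one 32-bit slot later,
    however long the extension of the current one is."""
    return slot + 1


# Two base words followed by padding and a reserved area (words 5 to 7).
block = [0xA0000005, 0xB0000006, 0, 0, 0,
         0x11111111, 0x22222222, 0x33333333]

first = fetch_long(block, 0, 1)              # one extra word at word 5
second = fetch_long(block, next_slot(0), 2)  # two extra words at words 6 and 7
```

    The two instructions have different total lengths, yet each takes
    exactly one 32-bit slot in the instruction stream, which is why an
    irregular length causes no trouble in this scheme.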

    So now long instructions of this type only need a small amount of
    opcode space, as only a few special ones are encoded this way.

    John Savard
