Forum: >>> Magnum BBS <<<

Oops (Concertina II Going Around in Circles)

From John Savard@21:1/5 to All on Wed Apr 24 23:49:25 2024

I keep changing the basic design of Concertina II, instead of going
forward and completing the task of fleshing it out.

The reason for that... has been obvious all along. None of my attempts
have satisfied me. I had goals for the architecture, some of which
weren't being met by each iteration. So I kept going back and forth
between compromising one set of goals, or compromising another set of
goals.

If I could make up my mind on what was most important to me, perhaps I
could stop somewhere.

Looking back at the various iterations, I did see that two goals were
very important to me.

I wanted to be able to have 16-bit instructions, at least in pairs
within a 32-bit instruction slot, available without the overhead of a
block header, in the basic instruction set. For this, I need to
reserve 1/4 of the opcode space.

Also, I wanted to have the basic load-store memory-reference
instructions be able to use 16-bit displacements, have a three-bit
index register field and a three-bit base register field, and be able
to use all 32 registers in a normal register bank as destinations.
This takes 3/4 of the opcode space.

As 3/4 plus 1/4 is _not_ greater than 1, having both of these things
in a design simultaneously is not impossible.
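
To make the opcode-space arithmetic concrete, here is a rough Python sketch. The field widths come from the description above. The idea that the 3/4 share corresponds to 24 of the 32 values of a five-bit opcode field is an assumption made for illustration, not something stated in the post.

```python
from fractions import Fraction

WORD_BITS = 32
TOTAL = 2 ** WORD_BITS

# Field widths taken from the post: displacement, index, base, destination.
MEM_REF_FIELDS = {"disp": 16, "index": 3, "base": 3, "dest": 5}
operand_bits = sum(MEM_REF_FIELDS.values())   # 27 bits of operands
opcode_bits = WORD_BITS - operand_bits        # 5 bits left for the opcode

# Assumption: the 3/4 share means 24 of the 32 possible opcode values.
mem_ref_opcodes = 24
mem_ref_share = Fraction(mem_ref_opcodes * 2 ** operand_bits, TOTAL)

# A quarter of the 32-bit space leaves 30 free bits, enough for a pair
# of 15-bit short instructions in one 32-bit slot.
short_pair_share = Fraction(1, 4)
short_pair_free_bits = (TOTAL * short_pair_share).numerator.bit_length() - 1

print(opcode_bits, mem_ref_share, short_pair_free_bits)
print(mem_ref_share + short_pair_share)
```

Under these assumptions the two shares add up to exactly 1, which is why both goals fit but leave only scraps for everything else.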

And I've found some tiny scraps of opcode space left (in the 3/4 part;
flexible auto-increment with an odd index register, since only even
index registers are allowed in that mode) which are barely enough...

for two-address register to register operate instructions, _and_ for a
block header.

The block header, while rudimentary, would be enough to allow...

indicating some instruction slots as containing instructions from a
secondary instruction set, so as to allow things like three-address
operate instructions, multiple-register load and store instructions,

and also allowing pseudo-immediates...

and instructions longer than 32 bits.

I have two unused opcodes in the load/store memory reference
instructions, so I can use one of them for jump to subroutine (offset
in the index register field, return address register in the
destination register field) - and one for conditional jump. Since the
condition code can go in the destination register field, and it only
needs four bits, not five... I can also have a Load Address
instruction, with the limitation that only registers 0-7 and 24-31 can
be used as destinations (the ones used as index registers and the
usual base registers).

However, requiring the block header mechanism even for load and store
multiple registers, basic to subroutine calls, means that the basic
instruction set is... only _barely_ a complete one.

So this is unlikely to satisfy me for very long either.

One other possibility: stick with the current design - 1/4 of the
opcode space for 16-bit instructions and 1/4 of the opcode space for
instructions longer than 32 bits, so as to reduce their overhead and
possibly allow the mechanism to also be used for prefixing
instructions (not needed, though, if I decide to return to having
block headers in a less vestigial form)...

I would have to squeeze the "rest" of the instruction set a bit more
if I switched from aligned-only load and store instructions to using
only four base registers for them (the least painful of the
restrictions I've considered so far), but it should be doable.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to John Savard on Thu Apr 25 16:00:14 2024

John Savard wrote:

I keep changing the basic design of Concertina II, instead of going
forward and completing the task of fleshing it out.

The reason for that... has been obvious all along. None of my attempts
have satisfied me. I had goals for the architecture, some of which
weren't being met by each iteration. So I kept going back and forth
between compromising one set of goals, or compromising another set of
goals.

If I could make up my mind on what was most important to me, perhaps I
could stop somewhere.

Looking back at the various iterations, I did see that two goals were
very important to me.

I wanted to be able to have 16-bit instructions, at least in pairs
within a 32-bit instruction slot, available without the overhead of a
block header, in the basic instruction set. For this, I need to
reserve 1/4 of the opcode space.

Also, I wanted to have the basic load-store memory-reference
instructions be able to use 16-bit displacements, have a three-bit
index register field and a three-bit base register field, and be able
to use all 32 registers in a normal register bank as destinations.
This takes 3/4 of the opcode space.

As 3/4 plus 1/4 is _not_ greater than 1, having both of these things
in a design simultaneously is not impossible.

Not impossible, sure: but reserving so much for so little is gonna hurt.

And I've found some tiny scraps of opcode space left (in the 3/4 part;
flexible auto-increment with an odd index register, since only even
index registers are allowed in that mode) which are barely enough...

In my opinion, your first cut at an ISA encoding should not consume more
than ½ of the available encodings. Concer-tina-tanic is already full to
the brim and you are still just fleshing it out.

for two-address register to register operate instructions, _and_ for a
block header.

The block header, while rudimentary, would be enough to allow...

indicating some instruction slots as containing instructions from a
secondary instruction set, so as to allow things like three-address
operate instructions, multiple-register load and store instructions,

and also allowing pseudo-immediates...

and instructions longer than 32 bits.

I have two unused opcodes in the load/store memory reference
instructions, so I can use one of them for jump to subroutine (offset
in the index register field, return address register in the
destination register field) - and one for conditional jump. Since the
condition code can go in the destination register field, and it only
needs four bits, not five... I can also have a Load Address
instruction, with the limitation that only registers 0-7 and 24-31 can
be used as destinations (the ones used as index registers and the
usual base registers).

However, requiring the block header mechanism even for load and store
multiple registers, basic to subroutine calls, means that the basic
instruction set is... only _barely_ a complete one.

So this is unlikely to satisfy me for very long either.

Sigh....

One other possibility: stick with the current design - 1/4 of the
opcode space for 16-bit instructions and 1/4 of the opcode space for
instructions longer than 32 bits, so as to reduce their overhead and
possibly allow the mechanism to also be used for prefixing
instructions (not needed, though, if I decide to return to having
block headers in a less vestigial form)...

I would have to squeeze the "rest" of the instruction set a bit more
if I switched from aligned-only load and store instructions to using
only four base registers for them (the least painful of the
restrictions I've considered so far), but it should be doable.

John Savard

From John Savard@21:1/5 to All on Thu Apr 25 12:41:23 2024

On Thu, 25 Apr 2024 16:00:14 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

In my opinion, your first cut at an ISA encoding should not consume more
than ½ of the available encodings. Concer-tina-tanic is already full to
the brim and you are still just fleshing it out.

Basically, I think that the reasonable length that a computer
instruction should occupy is that which a similar instruction occupied
on the IBM System/360 - which, in its day, was not regarded highly for
its code density.

However, I have banks of 32 registers instead of 16, and 16-bit
displacements instead of 12 bits. Having only load and store
memory-reference instructions, of course, helps to make up for this.

That's why I can only use 8 of the 32 registers as base registers and
as index registers, too.
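
A small Python sketch of the bit budget behind this comparison. The System/360 RX field widths (4-bit R1, X2 and B2 fields, 12-bit displacement) are the published format; the "all fields five bits wide" case is a hypothetical added only to show why it does not fit.

```python
WORD_BITS = 32

def opcode_bits(dest, index, base, disp):
    """Bits left for the opcode once the operand fields are allocated."""
    return WORD_BITS - (dest + index + base + disp)

# IBM System/360 RX format: 4-bit R1, X2 and B2 fields, 12-bit displacement.
s360_rx = opcode_bits(dest=4, index=4, base=4, disp=12)

# Hypothetical: 5-bit register fields everywhere with a 16-bit displacement.
full_width = opcode_bits(dest=5, index=5, base=5, disp=16)

# The compromise described above: 3-bit index and base fields.
concertina = opcode_bits(dest=5, index=3, base=3, disp=16)

print(s360_rx, full_width, concertina)  # 8 1 5
```

With full five-bit base and index fields only one opcode bit would remain, so trimming those two fields to three bits each is what buys back a usable opcode field.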

For wanting the impossible, of course I basically deserve what I get.
If I _could_ manage to pull it off, of course, the result would be of
some practical use; an instruction set that's plain, clear, and simple
(at least when compared to monstrosities like Itanium and x86) and
which is parsimonious in its use of memory is of some value.

While I'm rearranging the deck chairs, maybe I'll come up with an
original idea.

John Savard

From John Savard@21:1/5 to quadibloc@servername.invalid on Thu Apr 25 14:50:08 2024

On Thu, 25 Apr 2024 12:41:23 -0600, John Savard
<quadibloc@servername.invalid> wrote:

While I'm rearranging the deck chairs, maybe I'll come up with an
original idea.

This latest proposal, which does differ from my previous attempts,
does have _one_ advantage.

In this case, as in the previous attempts, I will need to use the
block header to indicate that some 32-bit instruction slots contain
32-bit instructions in an "alternate" format.

When the memory-reference instructions in the main format were
compromised, that alternate format included uncompromised
memory-reference instructions. So the extended instruction set,
normal plus alternate, included the normal instructions twice.

Here, I avoid that. Of course, though, the main format includes a
severely compromised version of the register-to-register operate
instructions. The alternate format would include the full version of
those.

Same thing, right?

Well, not really - because the compromised version of
register-to-register operate instructions contains only *one*
instruction format. So there _is_ less duplication and waste, the
instruction decode unit isn't set up to decode both the full version
of the operate instructions and a second compromised format which is
equally complex, but just has one bit trimmed off everywhere.

John Savard

From John Savard@21:1/5 to quadibloc@servername.invalid on Sun May 5 00:57:44 2024

On Wed, 24 Apr 2024 23:49:25 -0600, John Savard
<quadibloc@servername.invalid> wrote:

So this is unlikely to satisfy me for very long either.

And, given that, I've thought long and hard about what really is
needed.

The main opcode space for 32-bit instructions is now divided as
follows:

3/4 for uncompromised memory-reference instructions.

3/16 for uncompromised register-to-register operate instructions.

1/16 for the header required for variable-length instructions.

The variable-length instructions will allow, with 32 bits of overhead
per block, arbitrary mixing of 17-bit short instructions (the extra
bit goes into the two-bit prefix field in the header) and 32-bit
instructions - and longer instructions.

00 and 01 indicate 17-bit instructions starting with 0 and 1
respectively.

10 indicates a 16-bit extent that contains the start of an instruction
32 bits long or longer.

11 indicates a 16-bit extent that is not the start of an instruction.
In addition to the remaining parts of an instruction, space reserved
for pseudo-immediates can be indicated by this.
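
The prefix scheme above can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `split_block`, the tuple output, and the rule that every 11 extent following a long instruction is a continuation of it are all mine. In the real design the opcode would presumably determine the length, and 11 can also mark reserved pseudo-immediate space. If the 1111 header code takes four bits of a 32-bit header, the remaining 28 bits would hold fourteen two-bit prefixes, but that layout is an inference rather than something stated here.

```python
def split_block(prefixes, extents):
    """Group 16-bit extents into instructions using their two-bit prefixes.

    Returns (kind, value) pairs; value holds 17 bits for a short
    instruction and 16*k bits for a long one built from k extents.
    """
    if len(prefixes) != len(extents):
        raise ValueError("one prefix per extent is required")
    out = []
    for prefix, extent in zip(prefixes, extents):
        if prefix in (0b00, 0b01):
            # The low prefix bit becomes the leading bit of the 17-bit instruction.
            out.append(("short17", (prefix & 1) << 16 | extent))
        elif prefix == 0b10:
            # Start of an instruction that is 32 bits long or longer.
            out.append(("long", extent))
        elif out and out[-1][0] == "long":
            # 0b11 after a long instruction: treated here as its continuation.
            kind, value = out[-1]
            out[-1] = (kind, value << 16 | extent)
        else:
            # 0b11 elsewhere: space not starting an instruction (e.g. reserved).
            out.append(("reserved", extent))
    return out

demo = split_block(
    [0b00, 0b01, 0b10, 0b11, 0b10, 0b11, 0b11],
    [0x1234, 0x0001, 0xAAAA, 0xBBBB, 0x1111, 0x2222, 0x3333],
)
for kind, value in demo:
    print(kind, hex(value))
```

The point of the sketch is that every extent can be classified from the header alone, without first decoding its neighbours.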

There will be three forms of header.

One just has a three-bit field indicating the number of 32-bit
instruction slots reserved for pseudo-immediates, in a restricted
register-to-register operate instruction squeezed into an odd bit of
leftover opcode space.

The other will provide VLIW functionality for code consisting only of
32-bit instructions: predication, and explicit indication of
parallelism.

The final one is 1111 that allows 17-bit instructions, 48, 64, 80, and
96 bit instructions, and their arbitrary mixing.

This has the advantage of providing all the functionality I'm looking
for - a large, extensible instruction set, compactness of code through
16-bit instructions that don't restrict which registers can be used,
and memory-reference instructions that make full use of a 32-bit
length being the only version of those instructions, instead of having
to include both a cut-down form and a full-form, the latter only
accessible with a header.

Finally, this seems to be something that I will be forced to admit
that further restructurings won't be able to improve upon - this will
be the best way to squeeze everything I want into the 8-bit byte and
the 32-bit word.

John Savard

From John Savard@21:1/5 to quadibloc@servername.invalid on Mon May 6 11:10:17 2024

On Sun, 05 May 2024 00:57:44 -0600, John Savard
<quadibloc@servername.invalid> wrote:

The main opcode space for 32-bit instructions is now divided as
follows:

3/4 for uncompromised memory-reference instructions.

3/16 for uncompromised register-to-register operate instructions.

1/16 for the header required for variable-length instructions.

This is not quite right.

3/4 for uncompromised basic memory-reference instructions.

1/8 for other memory-reference instructions.

1/16 for uncompromised register-to-register operate instructions.

1/16 for the header required for variable-length instructions.

John Savard

From John Savard@21:1/5 to quadibloc@servername.invalid on Mon May 6 11:06:44 2024

On Sun, 05 May 2024 00:57:44 -0600, John Savard
<quadibloc@servername.invalid> wrote:

The final one is 1111 that allows 17-bit instructions, 48, 64, 80, and
96 bit instructions, and their arbitrary mixing.

However, one thing I wanted to do was have the 48-bit and longer
instructions also available outside of the variable-length format.

Previously, I had done this by having a second format of long
instructions. I wanted to avoid that, this time.

I came up with an idea. Just as 1111 _after_ the header in
variable-length mode was used to indicate long instructions, for other
modes, let 1111 after the header indicate each of two instruction
slots in which three 18-bit units from variable-length are
encapsulated. So a 48-bit instruction, taking up 64 bits, could be
placed in any of the other modes.

But because that code conflicts with the header, these things aren't
first-class citizens! I tried freeing up 1110 as well, but that was
clearly not going to work acceptably. So I took other measures that
only partly addressed that issue but consumed far less opcode space.

John Savard

From MitchAlsup1@21:1/5 to John Savard on Mon May 6 19:45:09 2024

John Savard wrote:

On Sun, 05 May 2024 00:57:44 -0600, John Savard <quadibloc@servername.invalid> wrote:

The main opcode space for 32-bit instructions is now divided as
follows:

3/4 for uncompromised memory-reference instructions.

3/16 for uncompromised register-to-register operate instructions.

1/16 for the header required for variable-length instructions.

This is not quite right.

3/4 for uncompromised basic memory-reference instructions.

1/8 for other memory-reference instructions.

1/16 for uncompromised register-to-register operate instructions.

1/16 for the header required for variable-length instructions.

In comparison::

1/8 for [Rbase+@disp16]
1/8 for Rd = Rs1 OP imm16
1/64 for [Rbase,Ri<<scale,#disp]
1/64 for Rd = Rs1 OP Rs2
1/64 for Rd = 3OP( Rs1,Rs2,Rs3)
1/64 for Rd = 1OP( Rs1 )
1/64 for PRED
1/64 for <w:o>
1/8 for branching
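
For a quick check of the two allocations, the fractions quoted above can be summed in Python. The dictionary labels are copied from the lists; nothing beyond the quoted fractions is assumed.

```python
from fractions import Fraction as F

savard = {
    "basic memory-reference": F(3, 4),
    "other memory-reference": F(1, 8),
    "register-to-register operate": F(1, 16),
    "variable-length header": F(1, 16),
}

alsup = {
    "[Rbase+@disp16]": F(1, 8),
    "Rd = Rs1 OP imm16": F(1, 8),
    "[Rbase,Ri<<scale,#disp]": F(1, 64),
    "Rd = Rs1 OP Rs2": F(1, 64),
    "Rd = 3OP( Rs1,Rs2,Rs3)": F(1, 64),
    "Rd = 1OP( Rs1 )": F(1, 64),
    "PRED": F(1, 64),
    "<w:o>": F(1, 64),
    "branching": F(1, 8),
}

savard_used = sum(savard.values())
alsup_used = sum(alsup.values())
print(savard_used, alsup_used, 1 - alsup_used)  # 1 15/32 17/32
```

The first list fills the whole 32-bit opcode space, while the second adds up to 15/32, just under the one-half ceiling suggested earlier in the thread.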

John Savard

From John Savard@21:1/5 to quadibloc@servername.invalid on Tue May 7 02:21:54 2024

On Mon, 06 May 2024 11:06:44 -0600, John Savard
<quadibloc@servername.invalid> wrote:

But because that code conflicts with the header, these things aren't
first-class citizens! I tried freeing up 1110 as well, but that was
clearly not going to work acceptably. So I took other measures that
only partly addressed that issue but consumed far less opcode space.

Although I had limited long vector and short vector operate
instructions in the basic 32 bit instruction set, I didn't have long
vector and short vector load and store instructions of any kind. So I
needed to add them in some form in order for the basic 32 bit
instruction set to be complete.

However, if I were to include a 6-bit length field in the long vector
load and store instructions, once again I would have had to free up
1/16 of the opcode space. Instead of completely doing without the
ability to load and store any but full-length vectors, I eventually
was able to add a two-bit length register field to the long vector
load and store instructions.

So this new instruction set has survived another challenge.

John Savard

From John Savard@21:1/5 to quadibloc@servername.invalid on Tue May 7 21:49:10 2024

On Tue, 07 May 2024 02:21:54 -0600, John Savard
<quadibloc@servername.invalid> wrote:

On Mon, 06 May 2024 11:06:44 -0600, John Savard
<quadibloc@servername.invalid> wrote:

But because that code conflicts with the header, these things aren't
first-class citizens! I tried freeing up 1110 as well, but that was
clearly not going to work acceptably. So I took other measures that
only partly addressed that issue but consumed far less opcode space.

Although I had limited long vector and short vector operate
instructions in the basic 32 bit instruction set, I didn't have long
vector and short vector load and store instructions of any kind. So I
needed to add them in some form in order for the basic 32 bit
instruction set to be complete.

However, if I were to include a 6-bit length field in the long vector
load and store instructions, once again I would have had to free up
1/16 of the opcode space. Instead of completely doing without the
ability to load and store any but full-length vectors, I eventually
was able to include a two-bit length register field to the long vector
load and store instructions.

So this new instruction set has survived another challenge.

And I've finally gotten around, therefore, to updating my web site to
present this latest incarnation as Concertina II.

John Savard

From John Savard@21:1/5 to All on Wed May 8 01:46:41 2024

On Thu, 25 Apr 2024 16:00:14 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

In my opinion, your first cut at an ISA encoding should not consume more
than ½ of the available encodings. Concer-tina-tanic is already full to
the brim and you are still just fleshing it out.

This is a point I think I should address.

Why are my various iterations of Concertina II _all_, consistently,
"full to the brim"?

This is true if I compromise the basic load/store instructions, say by
limiting them to three base registers for 16-bit displacements, so I
can reserve 1/4 of the opcode space for paired 16-bit short
instructions - which was one of the most common combinations -

or if I reserve half the opcode space for two kinds of 16-bit short
instructions,

or if I don't compromise the basic load/store instructions, and only
allow 16-bit instructions with a special header.

These are the three basic variants of Concertina II that I have been
vacillating between. But whichever I choose, I use nearly all possible
opcode space, at least for basic 32-bit instructions.

That didn't worry me much for two reasons.

If I need an enormous amount of opcode space for some new kind of
instructions for some new feature...

I would still have _enormous_ amounts of opcode space available up in
the stratosphere of 64-bit instructions and longer. (In some
iterations, I did manage to use nearly all the 48-bit opcode space,
because I tried to squeeze a form of string and packed decimal
instructions there.)

But what if the new feature was so important that I needed to have
*short* instructions for the operations using that feature - 32-bit
long instructions?

Well, because of the block structure of Concertina II, which is
primarily present to support pseudo-immediates (my idea of how to
reconcile immediate values in instructions, which you've pointed out
are very valuable, with my Concertina II design goal of fully
independent and parallel decoding of every instruction in a block) and
secondarily to allow VLIW features...

I can always add one new type of header which specifies alternate
instructions with fairly low overhead... and then, at a modest cost,
even the most enormous new feature can have its own 32-bit
instructions!

John Savard

From John Savard@21:1/5 to quadibloc@servername.invalid on Wed May 8 02:01:07 2024

On Wed, 08 May 2024 01:46:41 -0600, John Savard
<quadibloc@servername.invalid> wrote:

Why are my various iterations of Concertina II _all_, consistently,
"full to the brim"?

I can always add one new type of header which specifies alternate
instructions with fairly low overhead... and then, at a modest cost,
even the most enormous new feature can have its own 32-bit
instructions!

That only answers a part of that question - why I feel I can _get
away_ with having an ISA that is "full to the brim". But why did I let
it get that way in the first place?

Well, the reason for that is actually quite simple. Because a major
design goal of Concertina II is to offer as much as possible of the
basic operations required of a computer in instructions of the
shortest possible length.

16-bit displacements are the norm in microprocessor instruction sets,
so I offer them. I offer base-index addressing - which microprocessors
usually don't - because I feel it's needed for using arrays. And I
have register banks of 32 registers because that's what today's RISC
machines do.

All of that means that the load and store instructions - particularly
when integer load and store also include load unsigned and insert -
take up 3/4 of all 32-bit instructions (approximately; one doesn't
need unsigned load and insert for the 64-bit integer type, because it
fills the register). And that's with using only 8 of the 32 registers
for the base register and the index register each.

Some parts of the instruction set do have slack. Two-address
register-to-register operate instructions have a large opcode field,
so there is some room for future expansion in parts of the instruction
set.

But, basically, it takes all the available bits to offer the level of
functionality I am trying to provide with the basic 32-bit instruction
set. Since that covers the traditional functionality of a CPU -
floating-point and integer types - nothing basic is missing.

John Savard

From John Savard@21:1/5 to quadibloc@servername.invalid on Wed May 8 09:17:34 2024

On Wed, 08 May 2024 01:46:41 -0600, John Savard
<quadibloc@servername.invalid> wrote:

I can always add one new type of header which specifies alternate
instructions with fairly low overhead... and then, at a modest cost,
even the most enormous new feature can have its own 32-bit
instructions!

And, naturally, after saying this, I had to go and prove it was
possible by revising the ISA to add one alternate set of 32-bit
instructions. Two more such sets are reserved for future expansion,
however.

John Savard

From MitchAlsup1@21:1/5 to John Savard on Wed May 8 21:46:37 2024

John Savard wrote:

On Thu, 25 Apr 2024 16:00:14 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

In my opinion, your first cut at an ISA encoding should not consume more
than ½ of the available encodings. Concer-tina-tanic is already full to
the brim and you are still just fleshing it out.

This is a point I think I should address.

Why are my various iterations of Concertina II _all_, consistently,
"full to the brim"?

This is true if I compromise the basic load/store instructions, say by limiting them to three base registers for 16-bit displacements, so I
can reserve 1/4 of the opcode space for paired 16-bit short
instructions - which was one of the most common combinations -

or if I reserve half the opcode space for two kinds of 16-bit short instructions,

or if I don't compromise the basic load/store instructions, and only
allow 16-bit instructions with a special header.

These are the three basic variants of Concertina II that I have been vacillating between. But whichever I choose, I use nearly all possible
opcode space, at least for basic 32-bit instructions.

This should hint that you are long down the dark alley.

That didn't worry me much for two reasons.

Perhaps you feel safe down the dark alley....

If I need an enormous amount of opcode space for some new kind of instructions for some new feature...

I would still have _enormous_ amounts of opcode space available up in
the stratosphere of 64-bit instructions and longer. (In some
iterations, I did manage to use nearly all the 48-bit opcode space,
because I tried to squeeze a form of string and packed decimal
instructions there.)

So, why do you need a header AT ALL !!

{Notice that I get a full functional ISA where the instruction specifier
is always 32-bits and I still have room for constants and for extensions.}

If your bail out position is:: "some instructions can be 64-bits" --
S T A R T with that as an assumption !!

But what if the new feature was so important that I needed to have
*short* instructions for the operations using that feature - 32-bit
long instructions?

G A S P ........why do I even try.....

Well, because of the block structure of Concertina II, which is
primarily present to support pseudo-immediates (my idea of how to
reconcile immediate values in instructions, which you've pointed out
are very valuable, with my Concertina II design goal of fully
independent and parallel decoding of every instruction in a block) and secondarily to allow VLIW features...

I can always add one new type of header which specifies alternate instructions with fairly low overhead... and then, at a modest cost,
even the most enormous new feature can have its own 32-bit
instructions!

John Savard

From MitchAlsup1@21:1/5 to John Savard on Wed May 8 21:59:54 2024

John Savard wrote:

On Wed, 08 May 2024 01:46:41 -0600, John Savard <quadibloc@servername.invalid> wrote:

Why are my various iterations of Concertina II _all_, consistently,
"full to the brim"?

I can always add one new type of header which specifies alternate
instructions with fairly low overhead... and then, at a modest cost,
even the most enormous new feature can have its own 32-bit
instructions!

That only answers a part of that question - why I feel I can _get
away_ with having an ISA that is "full to the brim". But why did I let
it get that way in the first place?

Well, the reason for that is actually quite simple. Because a major
design goal of Concertina II is to offer as much as possible of the
basic operations required of a computer in instructions of the
shortest possible length.

May I suggest that sacrificing 16-bit instructions may give you the room
whereby typical applications require less space without the 16-bit insts
than with them !?!

But this begs the question::

Would your implementations perform better by executing FEWER instructions
or executing MORE instructions at a faster rate ??? The tradeoffs are
complicated and subtle. In 1986±, Mark Horowitz stated that <Stanford>
MIPS executed 1.5× as many instructions as VAX 11/780 at 6× the frequency
to achieve a 4× performance advantage.
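
The MIPS figures are simple to check. A minimal Python sketch, assuming "frequency" here means the rate at which instructions complete, so that relative performance is just the rate ratio divided by the instruction-count ratio (the function name is illustrative):

```python
def relative_performance(instruction_ratio, rate_ratio):
    """Speed-up over a baseline machine.

    instruction_ratio: instructions executed relative to the baseline.
    rate_ratio: instruction completion rate relative to the baseline
                (an assumed reading of "frequency" in the text).
    """
    return rate_ratio / instruction_ratio

mips_vs_vax = relative_performance(instruction_ratio=1.5, rate_ratio=6.0)
print(mips_vs_vax)  # 4.0
```

Executing 50% more instructions is a good trade when each one completes six times as fast, which is the tension the question above is pointing at.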

My 66000, on the other hand is executing 1.1× as many instructions as
VAX 11/780 and has a 5% (1/20) per pipeline stage gate overhead compared
to RISC-V (maybe) for a 35% performance advantage over RISC-V.

I say (maybe) because the pipeline designs I see for RISC-V use a 2 cycle
latency LD pipeline with set associative caches. This puts a lot of gates
between AGEN and LD forwarding to fit in 2 cycles. My pipelines give this
loop 3 cycles.

16-bit displacements are the norm in microprocessor instruction sets,
so I offer them. I offer base-index addressing - which microprocessors usually don't - because I feel it's needed for using arrays. And I
have register banks of 32 registers because that's what today's RISC
machines do.

So, you are getting eaten alive by the extra bit of register specifier !!
which, then, is forcing you into extreme encoding positions--gotcha.

All of that means that the load and store instructions - particularly
when integer load and store also include load unsigned and insert -
take up 3/4 of all 32-bit instructions (approximately; one doesn't
need unsigned load and insert for the 64-bit integer type, because it
fills the register). And that's with using only 8 of the 32 registers
for the base register and the index register each.

Do not put into ISA that which compiler CANNOT use !!
Oh, wait, you have no ability to know what the compiler can use--either.

Some parts of the instruction set do have slack. Two-address register-to-register operate instructions have a large opcode field,
so there is some room for future expansion in parts of the instruction
set.

But, basically, it takes all the available bits to offer the level of functionality I am trying to provide with the basic 32-bit instruction
set. Since that covers the traditional functionality of a CPU - floating-point and integer types - nothing basic is missing.

Tisk.

John Savard

From John Savard@21:1/5 to All on Wed May 8 20:39:09 2024

On Wed, 8 May 2024 21:59:54 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

May I suggest that sacrificing 16-bit instructions may give you the room
whereby typical applications require less space without the 16-bit insts
than with them !?!

Your suggestions are always welcome, given your great breadth of
knowledge.

My latest "extreme encoding position" means that 16-bit instructions
are now relegated to a secondary instruction format that must be
indicated by a header. However, now they're 17 bits long instead of 15
bits long, so they can operate on any two registers in a 32-register
bank.

Having 14 instructions in a block instead of 8 instructions normally
lets me do more. I know that in your My 66000 architecture, the
instructions have extra functionality that lets you combine things
like negation and increment with an operation. While I certainly could
try to add such a feature to my architecture - in fact, I did try that
in one Concertina II iteration - I'm afraid that, not having your
knowledge, I wouldn't be able to do it in a way that resulted in any
significant savings in the number of instructions required for a
program.

And if I tried to add flexibility, I'd end up with an instruction set
that looked like that of the VAX 11/780, which is not a direction to
go in if performance is a concern.

From John Savard@21:1/5 to All on Wed May 8 20:48:20 2024

On Wed, 8 May 2024 21:46:37 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

So, why do you need a header AT ALL !!

Assuming I don't want to ever allow the circuits of my computer to try
decoding an instruction that turns out later to be data (unless the
programmer has made an error, in which case the penalty of the program
being aborted is no problem)...

and I want the computer to be able to decode all the instructions in a
block in parallel, as a way to improve performance,

then I need a block header to say 'here are the instructions to
decode' IF I don't want to be absolutely limited to all instructions
having the same length.

While I could still have a pair of 16-bit instructions in a 32-bit
word, without headers I couldn't have immediates (at least not of most
lengths), or other instructions longer than 32 bits.

And headers let me add instruction predication, which is also good, as
branches do cause difficulties which predication partly avoids.
(There's still a dependency on what is being predicated upon.)

The header facilitates fast decoding of a flexible instruction set,
and allows VLIW features allowing the ISA to be used for embedded
processors.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to John Savard on Thu May 9 03:05:55 2024

John Savard wrote:

On Wed, 8 May 2024 21:46:37 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

So, why do you need a header AT ALL !!

Assuming I don't want to ever allow the circuits of my computer to try
decoding an instruction that turns out later to be data (unless the
programmer has made an error, in which case the penalty of the program
being aborted is no problem)...

and I want the computer to be able to decode all the instructions in a
block in parallel, as a way to improve performance,

What makes you think My 66000 ISA cannot be decoded in parallel ??
Over the last year I have illustrated how up to 16 instructions,
each variable length from 1..5 words, can be decoded in parallel.

First you select a suitable buffer which presents instructions to be
decoded. A 6-wide machine will be using 16 words.

Each word (320-gates of flip flops) has a 40-gate size decoder,
and this size is used to select its successor.

The first instruction starts at IP, the next is selected from the
decode of the first instruction (4 gates of delay). Hereafter,
the selection of the second instruction selects instructions 3 and
4. Next we select instructions 5 through 8, then 9 through 16.
8 total gates of logic and several gates of fan-out buffering.

I happen to call this parse--instructions are parsed into individual
starting points, and up to 16 instructions are presented to 16
instruction decoders. Each of these decoders decodes the entire ISA.
{Although there are ways to route instructions to more specialized
decoders if desired.}
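
The selection steps described above behave like pointer doubling over per-word "next start" offsets. Here is a minimal Python sketch of that idea, assuming the per-word length decode is already available as a list; the names `parse_starts` and `lengths` are illustrative and not taken from the My 66000 documentation, and the loop stands in for what hardware would do with muxes in each round.

```python
def parse_starts(lengths, ip=0, max_insts=16):
    """Return the start offsets (in words) of up to max_insts instructions.

    lengths[i] is the length, in words, that an instruction would have
    if it started at word i; every word is decoded speculatively.
    """
    n = len(lengths)
    # succ[i]: start of the following instruction if one starts at word i.
    # The value n is a sentinel meaning "past the end of the buffer".
    succ = [min(i + lengths[i], n) for i in range(n)] + [n]

    starts = [ip]   # instruction 1 starts at IP
    jump = succ     # jump[i]: position reached after len(starts) instructions
    while len(starts) < max_insts:
        # Each round doubles the known starts: 1 -> 2 -> 4 -> 8 -> 16.
        starts = starts + [jump[s] for s in starts]
        jump = [jump[j] for j in jump]
    return [s for s in starts[:max_insts] if s < n]


print(parse_starts([2, 1, 1, 3, 1, 1, 1, 2]))  # -> [0, 2, 3, 6, 7]
```

Four rounds are enough for 16 instructions, which matches the 1, 2, 3-4, 5-8, 9-16 progression in the text.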

You are using a header, I am using logic.

By using logic, there is no waste of bits in the instruction encoding.

then I need a block header to say 'here are the instructions to
decode' IF I don't want to be absolutely limited to all instructions
having the same length.

Seems like a horrible plan going forward with your goals in mind.

While I could still have a pair of 16-bit instructions in a 32-bit
word, without headers I couldn't have immediates (at least not of most
lengths), or other instructions longer than 32 bits.

And headers let me add instruction predication, which is also good, as
branches do cause difficulties which predication partly avoids.
(There's still a dependency on what is being predicated upon.)

I added predication without any such need.

The header facilitates fast decoding of a flexible instruction set,
and allows VLIW features allowing the ISA to be used for embedded
processors.

The header allows subtracting 1 stage from the 12+ stage k-wide pipeline,
AND is causing all sorts of "other issues" to remain present.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to All on Wed May 8 21:17:00 2024

On Wed, 8 May 2024 21:46:37 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

But what if the new feature was so important that I needed to have
*short* instructions for the operations using that feature - 32-bit
long instructions?

G A S P ........why do I even try.....

I'm not sure why what I said _there_ was so shocking.

But, yes, I do freely admit that Concertina II is _not_ an ISA that
"makes sense" from your point of view... or, indeed, the point of view
of many other people who value simplicity and elegance in a computer
architecture.

Instead, right from the start, it gives the appearance of having
accumulated the kind of cruft that usually is acquired through decades
of maintaining backwards compatibility.

Still, I know that what I'm leading up to is shocking.

The ISA looks - at first glance - like a plain old 32-bit RISC
architecture. With a few little peculiarities... base-index
addressing, like the 360, but not like any RISC architecture, for
example.

And then people notice the headers.

Code is divided into 256-bit blocks, so that instructions can have
"pseudo-immediates"; these values can be stacked at the end of a block
so that they're all aligned, and they don't cause the instructions
themselves to vary in length, so decoding is simple.

Could that be regarded as tolerable?

And the headers also allow... explicit indication of when instructions
can execute in parallel, and instruction predication. Oh, so it's
VLIW, too?

And then they notice the killer. Perhaps they, too, will "gasp" in
shock.

There's also a header type that allows code where 16, 32, 48, 64...
bit instructions can be combined in any order, for traditional
CISC-like code with a variable instruction size. But there's a 12.5%
overhead penalty so that fast parallel decoding remains available.

But that header does something else.

It changes the instruction stream from being composed of 32-bit words
to one composed of 36-bit words, divided into 18-bit halfwords.

And if that isn't enough, the last two header types let you switch to
38-bit words composed of two 19-bit halfwords. *That's* what I do to
add a bunch of extra 32-bit instructions to the ISA, if some new
feature is so important that I don't want the instructions that deal
with it to have to be 48 bits long at least.

And, yes, I can indeed understand why you might gasp in horror at that
stage. But you said I was having problems running out of opcode space,
so I had to demonstrate that I could pull new opcode space out of thin
air, as it were, should I feel the need to do so.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to All on Wed May 8 23:03:24 2024

On Thu, 9 May 2024 03:05:55 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

What makes you think My 66000 ISA cannot be decoded in parallel ??
Over the last year I have illustrated how up to 16 instructions,
each variable length from 1..5 words, can be decoded in parallel.

You are using a header, I am using logic.

One of the things I'm doing is trying to make my ISA capable of
efficient implementations by implementors who aren't necessarily as
smart as you are; with headers, it's obvious how instructions can be
decoded in parallel.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to quadibloc@servername.invalid on Wed May 8 23:09:09 2024

On Wed, 08 May 2024 21:17:00 -0600, John Savard
<quadibloc@servername.invalid> wrote:

And if that isn't enough, the last two header types let you switch to
38-bit words composed of two 19-bit halfwords. *That's* what I do to
add a bunch of extra 32-bit instructions to the ISA, if some new
feature is so important that I don't want the instructions that deal
with it to have to be 48 bits long at least.

I decided to plan ahead, and expand the opcode space even further by
adding another header type.

Now, one has access to three alternate instruction sets, but instead
of those being fixed, the first two can be chosen from a pool of
sixteen... and the third from a set of 128 different possibilities.

Also, I've noted that each of those alternate instruction sets, while
billed as sets of 32-bit instructions, can actually have opcode space
reserved for longer instructions, just as is done in the primary
instruction set.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to quadibloc@servername.invalid on Thu May 9 07:21:33 2024

On Thu, 09 May 2024 07:16:58 -0600, John Savard
<quadibloc@servername.invalid> wrote:

On Wed, 08 May 2024 23:09:09 -0600, John Savard
<quadibloc@servername.invalid> wrote:

Now, one has access to three alternate instruction sets, but instead
of those being fixed, the first two can be chosen from a pool of
sixteen... and the third from a set of 128 different possibilities.

Of course, this sort of thing may leave you gasping in shock and
horror. But look at the bright side. While 128 is a somewhat large
number, it isn't astronomical; I haven't provided for an opcode space
so large that there isn't enough matter in the whole Universe to
print a programmer's manual for the architecture.

Now, _that_ would be genuinely impractical!

Of course, as these many additional sets of instructions get fleshed
out, were the ISA to be implemented, such an ISA would lend new
meaning to the term "dark silicon", since, having so many instructions
available, they could hardly all be in common use.

Indeed, the situation could even be described with the catchy book
title...

Fifty Shades of Dark Silicon

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to quadibloc@servername.invalid on Thu May 9 07:16:58 2024

On Wed, 08 May 2024 23:09:09 -0600, John Savard
<quadibloc@servername.invalid> wrote:

Now, one has access to three alternate instruction sets, but instead
of those being fixed, the first two can be chosen from a pool of
sixteen... and the third from a set of 128 different possibilities.

Of course, this sort of thing may leave you gasping in shock and
horror. But look at the bright side. While 128 is a somewhat large
number, it isn't astronomical; I haven't provided for an opcode space
so large that there isn't enough matter in the whole Universe to
print a programmer's manual for the architecture.

Now, _that_ would be genuinely impractical!

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to John Savard on Thu May 9 17:23:08 2024

John Savard <quadibloc@servername.invalid> schrieb:

On Thu, 9 May 2024 03:05:55 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

What makes you think My 66000 ISA cannot be decoded in parallel ??
Over the last year I have illustrated how up to 16 instructions,
each variable length from 1..5 words, can be decoded in parallel.

You are using a header, I am using logic.

One of the things I'm doing is trying to make my ISA capable of
efficient implementations by implementors who aren't necessarily as
smart as you are; with headers, it's obvious how instructions can be
decoded in parallel.

If you include a description of how to decode things in parallel
in the description of your ISA, as Mitch has done for his, then
implementers need not be particularly clever, they only need to
follow what you write.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to John Savard on Thu May 9 20:28:00 2024

In article <fajp3j12esafhpn3e27ntfq5f538jmb3q7@4ax.com>, quadibloc@servername.invalid (John Savard) wrote:

Of course, this sort of thing may leave you gasping in shock and
horror. But look at the bright side. While 128 is a somewhat large
number, it isn't astronomical; I haven't provided for an opcode
space so large that there isn't enough matter in the whole Universe to
print a programmer's manual for the architecture.

Now, _that_ would be genuinely impractical!

Of course, as these many additional sets of instructions get fleshed
out, were the ISA to be implemented

I think you've just added another couple of orders of magnitude to the
odds against that happening.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to John Savard on Thu May 9 15:46:39 2024

John Savard wrote:

On Wed, 8 May 2024 21:46:37 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

So, why do you need a header AT ALL !!

Assuming I don't want to ever allow the circuits of my computer to try
decoding an instruction that turns out later to be data (unless the
programmer has made an error, in which case the penalty of the program
being aborted is no problem)...

and I want the computer to be able to decode all the instructions in a
block in parallel, as a way to improve performance,

It is ok to *try* decoding a length from a token that might be an
instruction as long as you toss it away when you later find that it wasn't.

You use the tail of the first instruction to select the start of the second.
You use the tail of the first pair to select the start of the second pair.
You use the tail of the first quad to select the start of the second quad.

For example, if instructions can be 1..4 tokens long
then the next instruction comes from one of 4 following tokens,
the next instruction pair comes from one of 7 following instruction pairs,
the next instruction quad comes from one of 13 following instruction quads.

Decode0 Decode1 Decode2 Decode3 Decode4 Decode5...
| | | | | |
v v v v v v
Length0->[--------4:1 Select Mux----------][----------...
| | | | | |
v v | | | |
Inst0 Inst1 v v v v
Length1->[----------7:1 Select Mux---------------------]
| | | |
v v v v
Inst2 Inst3 [----------13:1 Select Mux-----------]
| | | |
v v v v
Inst4 Inst5 Inst6 Inst7

<---first pair---><--second pair--><--third pair---><---fourth pair--->
<-----------first quad------------><--------second quad--------------->

It's mostly done with wires and muxes, and a little glue logic.
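
The mux widths in the figure (4:1, 7:1, 13:1) follow from counting the offsets at which the next group can start. Here is a short Python check, assuming instruction lengths of 1..4 tokens as in the example; the function name `start_candidates` is illustrative only.

```python
def start_candidates(group_size, min_len=1, max_len=4):
    """Offsets, relative to the start of a group of `group_size`
    instructions, at which the following group can begin."""
    lo = group_size * min_len   # every instruction as short as possible
    hi = group_size * max_len   # every instruction as long as possible
    return range(lo, hi + 1)


for name, k in (("instruction", 1), ("pair", 2), ("quad", 4)):
    print(name, len(start_candidates(k)))
```

A single instruction can end after 1..4 tokens (4 choices), a pair after 2..8 tokens (7 choices), and a quad after 4..16 tokens (13 choices).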

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to All on Thu May 9 15:09:11 2024

On Thu, 9 May 2024 20:28 +0100 (BST), jgd@cix.co.uk (John Dallman)
wrote:

I think you've just added another couple of orders of magnitude to the
odds against that happening.

What, you don't think that an ISA that is capable of handling an
instruction set two orders of magnitude larger than ordinary
instruction sets wouldn't have a highly sought-after feature, at
least for some niches?

Instructions are multiples of 16 bits in length, like on a Motorola
68000 or an IBM System/360, not multiples of eight bits like on x86...
so headers provide a way to add just a few bits to instructions
instead of adding a whole 16 bits, when that isn't needed.

And after devising a mechanism to use _three_ extra opcode spaces in
the instruction set... I merely decided to be proactive, and give the
architecture room for further expansion, by generalizing it a tad
more, and allow an additional 123 opcode spaces, potentially of equal
or larger size. (Larger because an additional opcode space could have
the bigger than 32 bit instructions all start with 1 instead of 1111,
and thus have more opcode space because of having a larger proportion
of longer instructions.)
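
The parenthetical above can be illustrated with a little arithmetic. This is a rough Python sketch, assuming that longer-than-32-bit instructions are identified purely by a fixed bit prefix in their first word; the function name `opcode_split` is hypothetical, and the actual Concertina II encodings may divide the space differently.

```python
from fractions import Fraction


def opcode_split(long_prefix_bits):
    """Return (share for 32-bit instructions, share for longer ones)
    of the first-word encodings, given the prefix length in bits that
    marks an instruction as longer than 32 bits."""
    long_share = Fraction(1, 2 ** long_prefix_bits)
    return 1 - long_share, long_share


print(opcode_split(4))  # prefix 1111: 15/16 short, 1/16 long
print(opcode_split(1))  # prefix 1:    1/2 short,   1/2 long
```

Under this assumption, moving from a 1111 prefix to a 1 prefix gives the longer instructions eight times as many first-word encodings, at the cost of the 32-bit ones.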

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to EricP on Thu May 9 21:08:48 2024

EricP wrote:

It is ok to *try* decoding a length from a token that might be an
instruction as long as you toss it away when you later find that it wasn't.

You use the tail of the first instruction to select the start of the second.
You use the tail of the first pair to select the start of the second pair.
You use the tail of the first quad to select the start of the second quad.

For example, if instructions can be 1..4 tokens long
then the next instruction comes from one of 4 following tokens,
the next instruction pair comes from one of 7 following instruction pairs,
the next instruction quad comes from one of 13 following instruction quads.

Decode0 Decode1 Decode2 Decode3 Decode4 Decode5...
| | | | | |
v v v v v v
Length0->[--------4:1 Select Mux----------][----------...
| | | | | |
v v | | | |
Inst0 Inst1 v v v v
Length1->[----------7:1 Select Mux---------------------]
| | | |
v v v v
Inst2 Inst3 [----------13:1 Select Mux-----------]
| | | |
v v v v
Inst4 Inst5 Inst6 Inst7

<---first pair---><--second pair--><--third pair---><---fourth pair--->
<-----------first quad------------><--------second quad--------------->

Treeifying::

Decode0 Decode1 Decode2 Decode3 Decode4 Decode5...
| | | | | |
| | | Pinst3->[--------4:1 Select Mux-
| | | | | |
| | Pinst2->[--------4:1 Select Mux----------]
| | | | | |
| Pinst1->[--------4:1 Select Mux----------]
| Length1 | | | |
v v v v v v
Length0->[--------4:1 Select Mux----------]
| | | | | |
v v | | | |
Inst0 Inst1 v v v v
Length1->[----------2:1×4 Select Mux----------------]
| | | |
v v v v
Inst2 Inst3 [----------2:1×4 Select Mux-----------]
| | | |
v v v v
Inst4 Inst5 Inst6 Inst7

<---first pair---><--second pair--><--third pair---><---fourth pair--->
<-----------first quad------------><--------second quad--------------->

Where Pinsti is a purported instruction decode which may or may not
be selected as an instruction starting point. This gets rid of the
wide multiplexers at the cost of additional 4:1 multiplexers.

And thanks for taking the time to ASCII-art the figure.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From EricP@21:1/5 to All on Thu May 9 19:40:26 2024

MitchAlsup1 wrote:

EricP wrote:

It is ok to *try* decoding a length from a token that might be an
instruction as long as you toss it away when you later find that it
wasn't.

You use the tail of the first instruction to select the start of the
second.
You use the tail of the first pair to select the start of the second
pair.
You use the tail of the first quad to select the start of the second
quad.

For example, if instructions can be 1..4 tokens long
then the next instruction comes from one of 4 following tokens,
the next instruction pair comes from one of 7 following instruction
pairs,
the next instruction quad comes from one of 13 following instruction
quads.

Decode0 Decode1 Decode2 Decode3 Decode4 Decode5...
| | | | | |
v v v v v v
Length0->[--------4:1 Select Mux----------][----------...
| | | | | |
v v | | | |
Inst0 Inst1 v v v v
Length1->[----------7:1 Select Mux---------------------]
| | | |
v v v v
Inst2 Inst3 [----------13:1 Select Mux-----------]
| | | |
v v v v
Inst4 Inst5 Inst6 Inst7

<---first pair---><--second pair--><--third pair---><---fourth pair--->
<-----------first quad------------><--------second quad--------------->

Treeifying::

Decode0 Decode1 Decode2 Decode3 Decode4 Decode5...
| | | | | |
| | | Pinst3->[--------4:1 Select Mux-
| | | | | |
| | Pinst2->[--------4:1 Select Mux----------]
| | | | | |
| Pinst1->[--------4:1 Select Mux----------]
| Length1 | | | |
v v v v v v
Length0->[--------4:1 Select Mux----------]
| | | | | |
v v | | | |
Inst0 Inst1 v v v v
Length1->[----------2:1×4 Select Mux----------------]
| | | |
v v v v
Inst2 Inst3 [----------2:1×4 Select Mux-----------]
| | | |
v v v v
Inst4 Inst5 Inst6 Inst7

<---first pair---><--second pair--><--third pair---><---fourth pair--->
<-----------first quad------------><--------second quad--------------->

Where Pinsti is a purported instruction decode which may or may not
be selected as an instruction starting point. This gets rid of the
wide multiplexers at the cost of additional 4:1 multiplexers.

And thanks for taking the time to ASCII-art the figure.

I should have mentioned those muxes are replicated horizontally across
the input token buffer for each offset a pair or quad could start at.
In the above case, the input buffer has space for 8 instructions * 4 tokens.
The first token is offset 0, the first possible pair starts at offset 1,
the last possible pair starts at offset 28, so that's 28 sets of 4:1 muxes
* 4 tokens per instruction * bits-per-token (plus sundry housekeeping bits).

Also I used one-hot select muxes, that is the 4:1 mux has a 4-bit
one-hot select control and the 7:1 mux has a 7-bit select control,
as it is easier to shift a one-hot enable out to the next position,
and it eliminates the mux binary decoder and length adders for
figuring out where the next pair or quad starts from.

So those wide muxes are really just a layer of AND gates enabled by
one of the select control bits, and a 4 or 7 or 13 input OR.
There are no length adders inside the selection routing tree,
just at the end to sum up the total length of valid instruction bytes
so we know what to increment the fetch RIP by.
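
The one-hot scheme described above can be modelled in a few lines. This is a minimal Python sketch, assuming tokens are small integers and that each token's length decode is itself one-hot; the names `one_hot_mux`, `next_start` and `length_is` are illustrative and not from the original design.

```python
def one_hot_mux(inputs, select):
    """AND each input with its select bit, then OR the results."""
    assert sum(select) == 1, "select must be one-hot"
    out = 0
    for value, enable in zip(inputs, select):
        out |= value if enable else 0
    return out


def next_start(start, length_is):
    """Move a one-hot start marker past one instruction.

    start[i]        : 1 if the current instruction begins at token i.
    length_is[i][k] : 1 if token i decodes to a length of k + 1 tokens.
    The index i + k + 1 is fixed wiring, so no adder is modelled here.
    """
    n = len(start)
    nxt = [0] * n
    for i in range(n):
        for k, bit in enumerate(length_is[i]):
            j = i + k + 1
            if j < n:
                nxt[j] |= start[i] & bit
    return nxt


lengths = [2, 1, 3, 1, 1, 1]
length_is = [[int(k + 1 == n) for k in range(4)] for n in lengths]
print(next_start([1, 0, 0, 0, 0, 0], length_is))  # -> [0, 0, 1, 0, 0, 0]
```

Each output bit is an OR of ANDs of a start bit and a length bit, which is the "layer of AND gates and a wide OR" structure described in the text.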

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to John Savard on Fri May 10 00:19:00 2024

In article <ofeq3j9ni63e7tmccf2qbkb9t0naui44ei@4ax.com>, quadibloc@servername.invalid (John Savard) wrote:

On Thu, 9 May 2024 20:28 +0100 (BST), jgd@cix.co.uk (John Dallman)
wrote:

I think you've just added another couple of orders of magnitude to
the odds against that happening.

What, you don't think that an ISA that is capable of handling an
instruction set two orders of magnitude larger than ordinary
instruction sets wouldn't have a highly sought-after feature, at
least for some niches?

Not that justified the costs of implementing such a huge instruction set.
All the transistors that go into that are not going into performance
(caches, functional units, and OoO pool size) and are pushing up the size
of the minimal implementation.

Also, teaching development tools about vast instruction sets is likely to
demonstrate the RISC lesson again: compilers only use the simple parts.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to EricP on Fri May 10 01:09:10 2024

EricP wrote:

MitchAlsup1 wrote:

EricP wrote:

It is ok to *try* decoding a length from a token that might be an
instruction as long as you toss it away when you later find that it
wasn't.

You use the tail of the first instruction to select the start of the
second.
You use the tail of the first pair to select the start of the second
pair.
You use the tail of the first quad to select the start of the second
quad.

For example, if instructions can be 1..4 tokens long
then the next instruction comes from one of 4 following tokens,
the next instruction pair comes from one of 7 following instruction
pairs,
the next instruction quad comes from one of 13 following instruction
quads.

Decode0 Decode1 Decode2 Decode3 Decode4 Decode5...
| | | | | |
v v v v v v
Length0->[--------4:1 Select Mux----------][----------...
| | | | | |
v v | | | |
Inst0 Inst1 v v v v
Length1->[----------7:1 Select Mux---------------------]
| | | |
v v v v
Inst2 Inst3 [----------13:1 Select Mux-----------]
| | | |
v v v v
Inst4 Inst5 Inst6 Inst7

<---first pair---><--second pair--><--third pair---><---fourth pair--->
<-----------first quad------------><--------second quad--------------->

Treeifying::

Decode0 Decode1 Decode2 Decode3 Decode4 Decode5...
| | | | | |
| | | Pinst3->[--------4:1 Select Mux-
| | | | | |
| | Pinst2->[--------4:1 Select Mux----------]
| | | | | |
| Pinst1->[--------4:1 Select Mux----------]
| Length1 | | | |
v v v v v v
Length0->[--------4:1 Select Mux----------]
| | | | | |
v v | | | |
Inst0 Inst1 v v v v
Length1->[----------2:1×4 Select Mux----------------]
| | | |
v v v v
Inst2 Inst3 [----------2:1×4 Select Mux-----------]
| | | |
v v v v
Inst4 Inst5 Inst6 Inst7

<---first pair---><--second pair--><--third pair---><---fourth pair--->
<-----------first quad------------><--------second quad--------------->

Where Pinsti is a purported instruction decode which may or may not
be selected as an instruction starting point. This gets rid of the
wide multiplexers at the cost of additional 4:1 multiplexers.

And thanks for taking the time to ASCII-art the figure.

I should have mentioned those muxes are replicated horizontally across
the input token buffer for each offset a pair or quad could start at.
In the above case, the input buffer has space for 8 instructions * 4 tokens.
The first token is offset 0, the first possible pair starts at offset 1,
the last possible pair starts at offset 28, so that's 28 sets of 4:1 muxes
* 4 tokens per instruction * bits-per-token (plus sundry housekeeping bits).

Also I used one-hot select muxes,

To a logic designer, the difference between a 1-hot mux and a binary
mux is a binary to 1-hot decoder--the part actually doing the muxing
is identical.

Also note: to the logic designer, a Find-First circuit produces a
unary (1-hot) output and if you want binary output you put the 1-hot
through a 1-hot to binary encoder.
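
The relationship between the unary and binary forms can be shown with a small model. This is a minimal Python sketch, assuming bit vectors are lists of 0/1 values; the function names are illustrative, and the loops stand in for what would be combinational logic.

```python
def find_first(bits):
    """Return a one-hot list marking the first set bit (all zeros if none)."""
    out, seen = [], 0
    for b in bits:
        out.append(b & (1 - seen))  # pass b only if no earlier bit was set
        seen |= b
    return out


def one_hot_to_binary(one_hot):
    """Encode a one-hot list as the index of its set bit.
    ORing the indices is valid because at most one bit is set."""
    index = 0
    for i, b in enumerate(one_hot):
        if b:
            index |= i
    return index


def binary_to_one_hot(index, width):
    """Decode a binary index into a one-hot list of the given width."""
    return [int(i == index) for i in range(width)]


first = find_first([0, 0, 1, 1, 0])
print(first, one_hot_to_binary(first))  # -> [0, 0, 1, 0, 0] 2
```

The mux itself is the same in both cases; only the decoder or encoder in front of or behind it differs.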

that is the 4:1 mux has a 4-bit
one-hot select control and the 7:1 mux has a 7-bit select control,

A 4:1 mux is 1 gate of delay (and one logic inversion)
a 7:1 mux is 2 gates of delay (and two logic inversions)
A 13:1 mux is 3 gates of delay (and 3 logic inversions)

By treeifying the logic all muxes (as above) become 1 gate delay.

as it is easier to shift a one-hot enable out to the next position,

99% of selection logic anywhere in a pipeline is 1-hot.

and it eliminates the mux binary decoder and length adders for
figuring out where the next pair or quad starts from.

Exactly.

So those wide muxes are really just a layer of AND gates enabled by
one of the select control bits, and a 4 or 7 or 13 input OR.
There are no length adders inside the selection routing tree,
just at the end to sum up the total length of valid instruction bytes
so we know what to increment the fetch RIP by.

Basically, you let each word determine its output and you decode the
LOBs of IP to get your starting point.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to All on Thu May 9 21:02:51 2024

On Fri, 10 May 2024 00:19 +0100 (BST), jgd@cix.co.uk (John Dallman)
wrote:

Not that justified the costs of implementing such a huge instruction set.

Well, having a huge instruction set defined and implementing all of it
are two different things.

Look at x86, how MMX got replaced by SSE which got replaced by AVX.

So if one is going to include instructions that will later become
obsolete, and be replaced by other instructions, not re-using the same
opcodes helps with upwards compatibility.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to John Savard on Fri May 10 17:27:10 2024

John Savard wrote:

On Fri, 10 May 2024 00:19 +0100 (BST), jgd@cix.co.uk (John Dallman)
wrote:

Not that justified the costs of implementing such a huge instruction set.

Well, having a huge instruction set defined and implementing all of it
are two different things.

Look at x86, how MMX got replaced by SSE which got replaced by AVX.

So if one is going to include instructions that will later become
obsolete, and be replaced by other instructions, not re-using the same
opcodes helps with upwards compatibility.

Or skip to the end and only invent AVX while skipping the soon-to-be
redundant intermediate stages.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Dallman@21:1/5 to John Savard on Fri May 10 20:51:00 2024

In article <be3r3jhr1kf9n1cdsbik5ejsuso7c3pmmk@4ax.com>, quadibloc@servername.invalid (John Savard) wrote:

On Fri, 10 May 2024 00:19 +0100 (BST), jgd@cix.co.uk (John Dallman)
wrote:

Not that justified the costs of implementing such a huge
instruction set.

Well, having a huge instruction set defined and implementing all of
it are two different things.

Look at x86, how MMX got replaced by SSE which got replaced by AVX.

So if one is going to include instructions that will later become
obsolete, and be replaced by other instructions, not re-using the
same opcodes helps with upwards compatibility.

Intel did not re-use the opcodes. MMX, SSE, SSE2 and so on are all still
implemented and usable. Once a hardware feature has been used in software,
getting rid of it is hard. I'm still building x86-32 software for SSE2
because AVX[2] doesn't do anything useful for it.

John

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to All on Fri May 10 14:22:46 2024

On Fri, 10 May 2024 17:27:10 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

Or skip to the end and only invent AVX while skipping the soon-to-be
redundant intermediate stages.

Well, I went to 256-bit short vectors as a permanent part of the
architecture, with long vectors as the next step.

But what about crypto assist instructions, as another example?

However, I think I will adjust this feature. You complained I used up
too much of my opcode space, so I demonstrated that Concertina II had
the potential to have... a _lot_ of opcode space, even to ludicrous
lengths.

Now that I think I can finally wrap up Concertina II, having found how
to achieve its goals as best as possible, I can go on to Concertina
III... and, given your anguished pleas, I _will_ give up on block
structure for the next iteration.

In order to do that, though, it will have to be CISC, not RISC...
banks of 8 registers, sort of like Concertina I, but much less messy.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to John Savard on Fri May 10 21:06:58 2024

John Savard wrote:

On Fri, 10 May 2024 17:27:10 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

Or skip to the end and only invent AVX while skipping the soon-to-be
redundant intermediate stages.

Well, I went to 256-bit short vectors as a permanent part of the
architecture, with long vectors as the next step.

But what about crypto assist instructions, as another example?

If used often enough, sure, they make a lot of sense--just make whatever
you put in applicable to a myriad of crypto functions.

However, I think I will adjust this feature. You complained I used up
too much of my opcode space, so I demonstrated that Concertina II had
the potential to have... a _lot_ of opcode space, even to ludicrous
lengths.

Now that I think I can finally wrap up Concertina II, having found how
to achieve its goals as best as possible, I can go on to Concertina
III... and, given your anguished pleas, I _will_ give up on block
structure for the next iteration.

Would you like to read My 66000 ISA while taking a break between CT II and
CT III ??

In order to do that, though, it will have to be CISC, not RISC...
banks of 8 registers, sort of like Concertina I, but much less messy.

With MEM-OPs are you not already CISC ??

Plus, CISC<->RISC is not a proper metric; the two are merely points
along a complexity spectrum--one the RISC camp has used to oversell
their case.

My point is that there is a point between CISC and RISC where it takes
fewer instructions to execute a given workload and simultaneously you
have not screwed up the pipeline frequency so the ISA gains drop all
the way to the bottom line.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to All on Fri May 10 18:34:09 2024

On Fri, 10 May 2024 21:06:58 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

Now that I think I can finally wrap up Concertina II, having found how
to achieve its goals as best as possible, I can go on to Concertina
III... and, given your anguished pleas, I _will_ give up on block
structure for the next iteration.

Would you like to read My 66000 ISA while taking a break between CT II and
CT III ??

Oh, yes, indeed, although I don't promise to shamelessly steal all
your good ideas.

I am going to try to somehow squeeze immediates in while keeping
instruction length decoding relatively simple. That, I fear, is not
going to be easy for me, although I outlined a scheme before which I
feel is not simple enough.

In order to do that, though, it will have to be CISC, not RISC...
banks of 8 registers, sort of like Concertina I, but much less messy.

With MEM-OPs are you not already CISC ??

I should have been clearer, but to tell the truth would have taken
many words.

What I meant was that while Concertina II indeed is hardly RISC, it
still contains a near-RISC instruction set in the basic 32-bit
operations. Unlike typical RISC instruction sets, it has base plus
index addressing, though.

Then mem-ops are added in the first supplementary instruction set,
yes. Concertina II is intended to be "architecture-agnostic", being at
once sort of like RISC, but also VLIW and CISC.

What Concertina III would give up, to no longer be RISC at all, would
be register banks of 32 registers. Changing that to 8 registers
shortens certain fields, letting me switch to native variable-length
instructions without the need for any block header mechanism.
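
The field-width saving can be sketched in Python. The three-register-field instruction shape and the helper name reg_field_bits are assumptions made for illustration; they are not taken from the Concertina documents.

```python
def reg_field_bits(num_registers: int) -> int:
    """Bits needed to name one register in a bank of the given size."""
    return (num_registers - 1).bit_length()

# Hypothetical three-address operate instruction: dest, src1, src2.
FIELDS = 3

bits_32 = FIELDS * reg_field_bits(32)   # 5-bit register fields
bits_8 = FIELDS * reg_field_bits(8)     # 3-bit register fields
saved = bits_32 - bits_8
print(bits_32, bits_8, saved)
```

Under these assumptions the three register fields shrink from 15 bits to 9, which leaves 6 bits per instruction for other uses such as length encoding.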

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to quadibloc@servername.invalid on Fri May 10 21:05:39 2024

On Fri, 10 May 2024 18:34:09 -0600, John Savard
<quadibloc@servername.invalid> wrote:

On Fri, 10 May 2024 21:06:58 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

With MEM-OPs are you not already CISC ??

I should have been clearer, but to tell the truth would have taken
many words.

What I meant was that while Concertina II indeed is hardly RISC, it
still contains a near-RISC instruction set in the basic 32-bit
operations. Unlike typical RISC instruction sets, it has base plus
index addressing, though.

Then mem-ops are added in the first supplementary instruction set,
yes. Concertina II is intended to be "architecture-agnostic", being at
once sort of like RISC, but also VLIW and CISC.

What Concertina III would give up, to no longer be RISC at all, would
be register banks of 32 registers. Changing that to 8 registers
shortens certain fields, letting me switch to native variable-length
instructions without the need for any block header mechanism.

The headers divide the architecture into its code types.

If no headers are used, the available instruction set is basically a
RISC instruction set... with more than the usual amount of
instructions, and with base + index addressing.

Using type I headers adds immediates.

If one is producing VLIW-style code, one will use the type II header.

If one uses the type III header, then one has a CISC instruction set,
with different lengths of instructions, memory to register operate
instructions, string instructions, and so on. This is also true of the
type VI and VIII headers.

The type VII header combines VLIW with CISC; then one will likely
also use the encapsulation mechanism to place long instructions within
blocks with a type II header to use all the VLIW features.

The type III header extends the instruction set, without otherwise
departing from a RISC-style instruction set.

So a compiler, at least on any one code generation setting, would use
only a subset of the available headers - although one that can
generate both CISC and VLIW code as requested is certainly a
possibility.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to John Savard on Sat May 11 07:22:49 2024

John Savard <quadibloc@servername.invalid> schrieb:

On Fri, 10 May 2024 17:27:10 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

Or skip to the end and only invent AVX while skipping the soon-to-be
redundant intermediate stages.

Well, I went to 256-bit short vectors as a permanent part of the
architecture, with long vectors as the next step.

But what about crypto assist instructions, as another example?

You will probably want to look at AES for this. AES operates on
16-byte blocks, so having 128-bit registers is natural.

AES256 also needs 15 separate round keys, which should be kept in
registers if you are doing things on a CPU. Because you also
need registers for intermediate results and for loading/storing data,
32 128-bit registers would be a good fit. Look at POWER's
vcipher and vcipherlast as an example.
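
The count of 15 follows from the AES parameters in FIPS 197. A small Python sketch of the arithmetic; the helper name aes_round_keys and the REGISTERS constant are illustrative only.

```python
def aes_round_keys(key_bits: int) -> int:
    """Number of 128-bit round keys AES uses for a given key size."""
    nk = key_bits // 32      # key length in 32-bit words
    rounds = nk + 6          # Nr = Nk + 6 (10, 12 or 14 rounds)
    return rounds + 1        # initial AddRoundKey plus one key per round

REGISTERS = 32               # register count suggested in the post
keys = aes_round_keys(256)
print(keys, REGISTERS - keys)
```

With 32 registers, holding all AES-256 round keys would leave 17 registers for state, intermediate results and data being loaded or stored.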

These registers would also be a good fit for 128-bit IEEE floating
point, which only POWER at the moment supports in hardware, plus
those SIMD things that do not come in loops (aka SLP).

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to John Savard on Sat May 11 17:13:33 2024

John Savard wrote:

On Fri, 10 May 2024 21:06:58 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

Would you like to read My 66000 ISA while taking a break between CT II and
CT III ??

Oh, yes, indeed, although I don't promise to shamelessly steal all
your good ideas.

Send me an e-mail I can use as a return address.

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Anton Ertl@21:1/5 to John Dallman on Sat May 11 17:48:51 2024

jgd@cix.co.uk (John Dallman) writes:

Also, teaching development tools about vast instruction sets is likely to
demonstrate the RISC lesson again: compilers only use the simple parts.

That needs some elaboration. There are several potential reasons for
that:

1) The compiler writers found it too hard to use the complex
instructions or addressing modes. For some kinds of instructions that
is the case (e.g., for the AES instructions in Intel and AMD CPUs), but
at least these days such instructions are there for use in libraries
written in assembly language/with intrinsics.

2) Some instructions are slower than a sequence of simpler
instructions, so compilers will avoid them even if they would
otherwise use them. That has been reported by both the IBM 801
project about some S/370 instructions and by the Berkeley RISC project
about the VAX. I don't remember any reports about addressing modes
with that problem.

3) Some instructions or addressing modes can be selected by compilers
and are beneficial when they are used, but they are selected rarely
because they fit the needs of the compiled program rarely.

IIRC the RISC papers mentioned mainly 2), with a little bit of 3).

- anton
--
'Anyone trying for "industrial quality" ISA should avoid undefined behavior.'
Mitch Alsup, <c17fcd89-f024-40e7-a594-88a85ac10d20o@googlegroups.com>

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to All on Sat May 11 12:14:31 2024

On Sat, 11 May 2024 17:13:33 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

On Fri, 10 May 2024 21:06:58 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

Would you like to read My 66000 ISA while taking a break between CT II and
CT III ??

Oh, yes, indeed, although I don't promise to shamelessly steal all
your good ideas.

Send me an e-mail I can use as a return address.

All right; however, I still have the copy of your MY 66000
architecture description which you sent me in 2017.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to tkoenig@netcologne.de on Sat May 11 12:20:51 2024

On Sat, 11 May 2024 07:22:49 -0000 (UTC), Thomas Koenig
<tkoenig@netcologne.de> wrote:

You will probably want to look at AES for this. AES operates on
16-byte blocks, so having 128-bit registers is natural.

Oh, I've taken a very good look at AES, once upon a time...

http://www.quadibloc.com/crypto/co040401.htm

back when it was called Rijndael, in fact.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to All on Sat May 11 12:18:24 2024

On Fri, 10 May 2024 21:06:58 +0000, mitchalsup@aol.com (MitchAlsup1)
wrote:

John Savard wrote:

But what about crypto assist instructions, as another example?

If used often enough, sure, they make a lot of sense--just make whatever
you put in applicable to a myriad of crypto functions.

While that seems to be sound advice, often it's not possible to take
it.

Many crypto assist features work like this: there are instructions to
place keys in a secure area on the chip, and then instructions to
encrypt - or decrypt, depending on which secure area was chosen -
using one of those keys... of which a copy is no longer retained
anywhere in the computer.

The idea behind this is to keep the keys hidden from malicious
software. But if you don't have access to the key, actually performing
the encryption operation yourself is out of the question, so the
entire operation has to be done in a single operation.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From MitchAlsup1@21:1/5 to Anton Ertl on Sat May 11 19:16:46 2024

Anton Ertl wrote:

jgd@cix.co.uk (John Dallman) writes:

Also, teaching development tools about vast instruction sets is likely to
demonstrate the RISC lesson again: compilers only use the simple parts.

That needs some elaboration. There are several potential reasons for
that:

1) The compiler writers found it too hard to use the complex
instructions or addressing modes. For some kinds of instructions that
is the case (e.g., for the AES instructions in Intel and AMD CPUs), but
at least these days such instructions are there for use in libraries
written in assembly language/with intrinsics.

The 801 was correct on this::

The compiler must be developed at the same time as the ISA: if the ISA
has it and the compiler cannot use it, then why is it there? {Yes, there
are certain privileged instructions lacking this property.} Conversely,
if the compiler could almost use an instruction but does not, then adjust
the instruction specification so the compiler can !!

2) Some instructions are slower than a sequence of simpler
instructions, so compilers will avoid them even if they would
otherwise use them.

VAX CALL instructions did more work than was required; the hardware did
the work it was specified to perform as rapidly as the HW could perform
the specified task. It took 10 years to figure out that the CALL/RET
overhead was excessive and wasteful.

That has been reported by both the IBM 801
project about some S/370 instructions and by the Berkeley RISC project
about the VAX. I don't remember any reports about addressing modes
with that problem.

The problem with address modes is their serial decode, not with the ability
to craft any operand the instruction needs. The second problem with VAX-like
addressing modes is that they are overly expressive: all operands can be
constants, whereas a good compiler will never need more than 1 constant
per instruction (because otherwise some constant arithmetic could be
performed at compile (or link) time.)
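
The constant argument can be illustrated with a minimal constant-folding sketch in Python. The tuple representation of instructions and the names OPS and fold are hypothetical and not part of any real compiler.

```python
import operator

# Hypothetical two-operand operations; registers are strings, constants are ints.
OPS = {"ADD": operator.add, "SUB": operator.sub, "MUL": operator.mul}

def fold(op, a, b):
    """Fold an operation at compile time when both operands are constants.

    Returns the computed constant, or the instruction tuple unchanged when
    at least one operand is a register and the work must happen at run time.
    """
    if isinstance(a, int) and isinstance(b, int):
        return OPS[op](a, b)
    return (op, a, b)

print(fold("ADD", 4, 8))      # both constant: no instruction is emitted
print(fold("ADD", "R1", 8))   # one constant: instruction survives as-is
```

An instruction with two constant operands never reaches the encoder in this model, so an encoding that allows every operand to be a constant spends opcode space on cases the compiler does not generate.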

3) Some instructions or addressing modes can be selected by compilers
and are beneficial when they are used, but they are selected rarely
because they fit the needs of the compiled program rarely.

The following constructs are seen often enough that the memory reference
instructions should support them well::

*p           LD    Rd,[Rp]

next         LD    Rd,[Rp,#next]
array[i]     LD    Rd,[Rp,Ri<<scale,#array]    // RISC-V fails

p[i]         LD    Rd,[Rp,Ri<<scale]           // RISC-V fails
p[i].field   LD    Rd,[Rp,Ri<<scale,#field]    // RISC-V fails
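
As a rough model of what these forms compute, here is a Python sketch of the base + scaled index + displacement address calculation. The function names, the 64-bit wrap-around and the sample values are assumptions for illustration. The second function spells out the shift, add and load steps that base RISC-V (without the later address-generation extensions) needs for the same access.

```python
MASK64 = (1 << 64) - 1

def effective_address(base, index=0, scale=0, disp=0):
    """Compute base + (index << scale) + disp, wrapped to 64 bits."""
    return (base + (index << scale) + disp) & MASK64

def riscv_style_address(base, index, scale, disp):
    """Same address, built in the three steps a base RISC-V load needs."""
    t = (index << scale) & MASK64   # slli t, index, scale
    t = (t + base) & MASK64         # add  t, t, base
    return (t + disp) & MASK64      # ld   rd, disp(t)

# Hypothetical values: p = 0x1000, i = 3, 16-byte elements, field at offset 8.
p, i, scale, field = 0x1000, 3, 4, 8
print(hex(effective_address(p)))                    # *p
print(hex(effective_address(p, i, scale)))          # p[i]
print(hex(effective_address(p, i, scale, field)))   # p[i].field
```

Both functions yield the same address; the difference is that the first is one memory reference instruction and the second is three instructions.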

Given the above: local variables just use SP and the variable's offset on
the stack or frame; global variables resolved by the linker are accessed
with 32-bit displacements off the IP (instruction pointer); while external
variables are accessed through GOT as::

extern[i]    LD    Rp,[IP,,#GOT[i].address-.]
             LD    Rd,[Rp,#GOT[i].offset]

or an entry point called with::

extern f() CALX [IP,,#GOT[f].address-.]

{{CALX is effectively LD IP,[address] while storing the return address in R0}}
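
A toy Python model of the two GOT sequences, to show the data flow only. Memory is a dict of addresses, and every address, displacement and function name here is made up for illustration; this is not a description of how My 66000 hardware implements them.

```python
def load_extern(regs, memory, ip, got_disp, offset):
    """Two-step access to an external variable through the GOT."""
    regs["Rp"] = memory[ip + got_disp]        # LD Rp,[IP,,#GOT[i].address-.]
    regs["Rd"] = memory[regs["Rp"] + offset]  # LD Rd,[Rp,#GOT[i].offset]
    return regs["Rd"]

def calx(regs, memory, ip, got_disp, next_ip):
    """CALX: load IP from the GOT slot and keep the return address in R0."""
    target = memory[ip + got_disp]            # LD IP,[address]
    regs["R0"] = next_ip                      # return address
    regs["IP"] = target
    return target

# Hypothetical layout: GOT slots at 0x2000 (a variable) and 0x2008 (f).
memory = {0x2000: 0x5000, 0x5010: 42, 0x2008: 0x9000}
regs = {}
value = load_extern(regs, memory, ip=0x1000, got_disp=0x1000, offset=0x10)
target = calx(regs, memory, ip=0x1004, got_disp=0x1004, next_ip=0x100C)
print(value, hex(target), hex(regs["R0"]))
```

In this model the call is a single memory-indirect transfer, with no separate load of the target into a scratch register before jumping.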

CALX was invented/developed specifically to optimize external calling
of subroutines (and adjusted until it fit the requirements.) Thus, not
only should the compiler be in development with ISA, so should the linker !!

IIRC the RISC papers mentioned mainly 2), with a little bit of 3).

- anton

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to All on Sun May 12 03:57:02 2024

On Thu, 9 May 2024 20:28 +0100 (BST), jgd@cix.co.uk (John Dallman)
wrote:

In article <fajp3j12esafhpn3e27ntfq5f538jmb3q7@4ax.com>,
quadibloc@servername.invalid (John Savard) wrote:

Of course, this sort of thing may leave you gasping in shock and
horror. But look at the bright side. While 128 is a somewhat large
number, it isn't astronomical; I haven't provided for an opcode
space so large that there isn't enough matter in the whole Universe to
print a programmer's manual for the architecture.

Now, _that_ would be genuinely impractical!

Of course, as these many additional sets of instructions get fleshed
out, were the ISA to be implemented

I think you've just added another couple of orders of magnitude to the
odds against that happening.

I've decided to claw some of those orders of magnitude back, even if
it hardly matters (zero divided by 100 is still zero).

Now I've changed the applicable header format to provide only _eight_
additional alternate instruction sets, instead of almost 128 of them,
using the available bits instead for something much more important -
allowing the explicit indication of parallelism to be easily combined
with the use of all of the first four instruction sets, as well as one
of the eight new ones.
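
As a back-of-the-envelope check on how many header bits this frees, here is a Python sketch. It assumes the alternate instruction set is chosen by a plain binary selector field, which the post does not state; the name selector_bits is illustrative.

```python
def selector_bits(num_sets: int) -> int:
    """Bits needed for a plain binary field selecting one of num_sets."""
    return (num_sets - 1).bit_length()

old_bits = selector_bits(128)   # field wide enough for up to 128 sets
new_bits = selector_bits(8)     # field for eight sets
freed = old_bits - new_bits
print(old_bits, new_bits, freed)
```

Under that assumption, going from a 7-bit selector to a 3-bit one leaves 4 header bits for other purposes, such as the parallelism indication.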

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From Thomas Koenig@21:1/5 to John Savard on Sun May 12 11:45:27 2024

John Savard <quadibloc@servername.invalid> schrieb:

Now I've changed the applicable header format to provide only _eight_ additional alternate instruction sets,

Questions/remarks. Please feel free to answer/correct.

A header introduces a block, correct?

Once a block has been identified by a header, the format of all
instructions in that block is set. Correct?

The compiler (or assembler programmer) must then choose which
instructions go into which block, correct?

If a short instruction is followed by a long instruction, and
both are in a single block, what is the compiler to do?
I can only see either a) to choose a block for the longer
instruction or b) fill the rest of the block for short
instructions with (short) NOPs.

How is this supposed to save space?

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to quadibloc@servername.invalid on Sun May 12 12:20:33 2024

On Wed, 24 Apr 2024 23:49:25 -0600, John Savard
<quadibloc@servername.invalid> wrote:

I keep changing the basic design of Concertina II, instead of going
forward and completing the task of fleshing it out.

I have recently added another page to the current iteration, at

http://www.quadibloc.com/arch/cab0102.htm

which describes the formats of the instructions longer than 32 bits.

Since I've been going around in circles, all I had to do was go back
and grab my files for

http://www.quadibloc.com/arch/cw0102.htm

(not currently a valid URL, but it can be found, no doubt, on the
Wayback Machine)

and make slight changes. Well, at least at first. I've now
significantly fleshed out the 48-bit instructions, so as to make the
extended register banks of 128 registers first-class citizens.

As befits the "Concertina-tanic", I suppose.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)

From John Savard@21:1/5 to quadibloc@servername.invalid on Sun May 12 14:14:44 2024

On Sun, 12 May 2024 12:20:33 -0600, John Savard
<quadibloc@servername.invalid> wrote:

Since I've been going around in circles, all I had to do was go back
and grab my files for

http://www.quadibloc.com/arch/cw0102.htm

(not currently a valid URL, but it can be found, no doubt, on the
Wayback Machine)

I tried, just to check, and, no, it's not there.

John Savard

--- SoupGate-Win32 v1.05
* Origin: fsxNet Usenet Gateway (21:1/5)
