At the moment, only a link to Concertina III is present on my
main page; there is no content there yet.
I've gone to 15-bit displacements, in order to avoid compromising
addressing modes, while allowing 16-bit instructions without
switching to an alternate instruction set.
Possibly using only three base registers is also sufficiently
non-violent to the addressing modes that I should have done that
instead, so I will likely give consideration to that option in
the days ahead.
Packing and unpacking DFP numbers does not take a lot of logic, assuming
one of the common DPD packing methods.
The number of registers holding
DFP values could be doubled if they were unpacked and repacked for each operation.
DFP arithmetic has a high latency anyway; in Q+, for example,
the DFP unit unpacks, performs the operation, then repacks the DFP
number. So, registers only need to be 128-bit.
256 bits seems a little narrow for a vector register.
I have seen
several other architectures with vector registers supporting 16+ 32-bit values, or a length of 512 bits. This is also the width of a typical
cache line.
Having the base register implicitly encoded in the instruction is a way
to reduce the number of bits used to represent the base register.
There
seem to be a lot of different base register usages. Won't that make
the compiler more difficult to write?
Does array addressing mode have memory indirect addressing? It seems
like a complex mode to support.
Block headers are tricky to use. They need to follow the output of the instructions in the assembler so that the assembler has time to generate
the appropriate bits for the header. The entire instruction block needs
to be flushed at the end of a function.
So in Concertina II, I had added a new addressing mode which
simply uses the same feature that allows immediate values to
tack a 64-bit absolute address on to an instruction. (Since it
looks like a 64-bit number, the linking loader can relocate it.)
That fancy feature, though, was too much complication for this
stripped-down ISA.
Agreed. Would not be in favor of block-headers or block structuring.
Linear instruction formats are preferable, preferably in 32-bit chunks.
On 1/23/2024 6:06 AM, Robert Finch wrote:
IME, the main address modes are:
(Rm, Disp) // ~ 66% +/- 10%
(Rm, Ro*FixSc) // ~ 33% +/- 10%
Where: FixSc matches the element size.
Pretty much everything else falls into the noise.
RISC-V only has the former, but kinda shoots itself in the foot:
GCC is good at eliminating most SP-relative loads/stores,
which means the nominal percentage of indexed accesses is even higher...
As a result, the code is basically left doing excessive amounts of
shifts and adds, which (vs BJX2) effectively dethrone the memory
load/store ops for top-place.
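The cost of lacking a scaled-index mode can be seen with a trivial array loop; a minimal C sketch (the function name is made up for illustration):

```c
#include <stdint.h>

/* Sum an array of 32-bit elements. On an ISA with an (Rm, Ro*FixSc)
 * mode, each element access can be a single indexed load; with only
 * (Rm, Disp), the compiler must emit a shift (i*4) and an add to form
 * the address before every load, unless it can strength-reduce the
 * loop to pointer increments. */
int64_t sum_i32(const int32_t *arr, int n)
{
    int64_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += arr[i];  /* indexed: 1 load; base+disp only: shift+add+load */
    return acc;
}
```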
Likewise, the moment one exceeds 12 bits on much of anything, RISC-V
also shoots itself in the foot. Because, not only has one hit the limits
of the ALU and LD/ST ops, there are no cheap fallbacks for intermediate
range constants.
If my compiler, with its arguably poor optimizer and barely functional register allocation, is beating GCC for performance (when targeting
RISC-V), I don't really consider this a win for some of RISC-V's design choices.
And, if GCC in its great wisdom, is mostly loading constants from memory (having apparently offloaded most of them into the ".data" section),
this is also not a good sign.
Also, needing to use shift-pairs to sign and zero extend things is a bit
weak as well, ...
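For reference, the shift-pair idiom in question, sketched in C for a 16-bit field in a 64-bit register (relying on the usual arithmetic-right-shift behavior of mainstream compilers):

```c
#include <stdint.h>

/* Sign-extend the low 16 bits: shift left so the field's sign bit
 * lands at bit 63, then arithmetic-shift right. Two instructions on
 * an ISA without a dedicated sign-extend op. */
int64_t sext16(uint64_t x)
{
    return (int64_t)(x << 48) >> 48;
}

/* Zero-extend the low 16 bits: two logical shifts (or a single AND
 * when the mask constant is cheap to materialize). */
uint64_t zext16(uint64_t x)
{
    return (x << 48) >> 48;
}
```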
Also, as a random annoyance, RISC-V's instruction layout is very
difficult to decipher from a hexadecimal view. One basically needs to
dump it in binary to make it viable to mentally parse and lookup instructions, which sucks.
The absolute array addresses are
available without block structure.
On 1/23/2024 4:11 PM, Brian G. Lucas wrote:
On 1/23/24 16:10, MitchAlsup1 wrote:
When you benchmark against a strawman, cows get to eat.
Not a farm boy I'll bet. Cows eat hay, but not straw.
https://en.wikipedia.org/wiki/Nord_and_Bert_Couldn%27t_Make_Head_or_Tail_of_It
On 1/23/2024 4:10 PM, MitchAlsup1 wrote:
Likewise, the moment one exceeds 12 bits on much of anything, RISC-V
also shoots itself in the foot. Because, not only has one hit the
limits of the ALU and LD/ST ops, there are no cheap fallbacks for
intermediate range constants.
My 66000 has constants of all sizes for all instructions.
And, if GCC in its great wisdom, is mostly loading constants from
memory (having apparently offloaded most of them into the ".data"
section), this is also not a good sign.
Loading constants:
a) pollutes the data cache
b) wastes energy
c) wastes instructions
Yes.
But, I guess it does improve code density in this case... Because the constants are "somewhere else" and thus don't contribute to the size of '.text'; the program just puts a few kB worth of constants into '.data' instead...
Does make the code density slightly less impressive.
Granted, one can argue the same of prolog/epilog compression in my case:
Save some space on prolog/epilog by calling or branching to prior
versions (since the code to save and restore GPRs is fairly repetitive).
On 1/24/2024 2:23 PM, MitchAlsup1 wrote:
BGB wrote:
Granted, one can argue the same of prolog/epilog compression in my case:
Save some space on prolog/epilog by calling or branching to prior
versions (since the code to save and restore GPRs is fairly repetitive).
ENTER and EXIT eliminate the additional control transfers and can allow
FETCH of the return address to start before the restores are finished.
Possible, but branches are cheaper to implement in hardware, and would
have been implemented already...
Granted, it is a similar thing to the recent addition of a memcpy()
slide for intermediate-sized memcpy.
Where, if one expresses the slide in reverse order, copying any multiple
of N bytes can be expressed as a branch into the slide (with less
overhead than a loop).
But, I guess in theory, the memcpy slide could be implemented in plain C
with a switch.
uint64_t *dst, *src;
uint64_t li0, li1, li2, li3;
... copy final bytes ...
switch(sz>>5)
{
    ...
    case 2:
        li0=src[4]; li1=src[5];
        li2=src[6]; li3=src[7];
        dst[4]=li0; dst[5]=li1;
        dst[6]=li2; dst[7]=li3;
        /* fall through */
    case 1:
        li0=src[0]; li1=src[1];
        li2=src[2]; li3=src[3];
        dst[0]=li0; dst[1]=li1;
        dst[2]=li2; dst[3]=li3;
        /* fall through */
    case 0:
        break;
}
Like, in theory one could have a special hardware feature, but a plain software solution is reasonably effective.
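For completeness, a self-contained sketch of such a switch-based slide (the function name is made up; this version only handles sizes that are multiples of 32 bytes up to 64, where the full version would extend the case ladder upward):

```c
#include <stddef.h>
#include <stdint.h>

/* Copy sz bytes (sz a multiple of 32, 0..64 in this sketch) using a
 * fall-through switch: each case copies one 32-byte block and falls
 * through into the cases below it, so entering at case N copies N
 * blocks in straight-line code with no loop overhead. */
void memcpy_slide(void *dstp, const void *srcp, size_t sz)
{
    uint64_t *dst = dstp;
    const uint64_t *src = srcp;
    uint64_t li0, li1, li2, li3;

    switch (sz >> 5)
    {
    case 2:
        li0 = src[4]; li1 = src[5];
        li2 = src[6]; li3 = src[7];
        dst[4] = li0; dst[5] = li1;
        dst[6] = li2; dst[7] = li3;
        /* fall through */
    case 1:
        li0 = src[0]; li1 = src[1];
        li2 = src[2]; li3 = src[3];
        dst[0] = li0; dst[1] = li1;
        dst[2] = li2; dst[3] = li3;
        /* fall through */
    case 0:
        break;
    }
}
```

Note the blocks are copied highest-first, matching the reverse-order layout described above: the entry point selects how many blocks remain, and everything below it executes unconditionally.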
On 1/25/2024 11:26 AM, MitchAlsup1 wrote:
BGB wrote:
On 1/24/2024 2:23 PM, MitchAlsup1 wrote:
BGB wrote:
Granted, one can argue the same of prolog/epilog compression in my
case:
Save some space on prolog/epilog by calling or branching to prior
versions (since the code to save and restore GPRs is fairly
repetitive).
ENTER and EXIT eliminate the additional control transfers and can allow
FETCH of the return address to start before the restores are finished.
Possible, but branches are cheaper to implement in hardware, and would
have been implemented already...
Are you intentionally misreading what I wrote ??
?? I don't understand.
Epilogue is a sequence of loads leading to a jump to the return address.
Your ISA cannot jump to the return address while performing the loads
so FETCH does not get the return address and can't start fetching
instructions until the jump is performed.
You can put the load for the return address before the other loads.
Then, if the epilog is long enough (so that this load is no longer in
flight once it hits the final jump), the branch-predictor will lead it
to start loading the post-return instructions before the jump is reached.
This is likely a non-issue as I see it.
It is only really an issue if one demands that reloading the return
address be done as one of the final instructions in the epilog, and not
one of the first instructions.
Granted, one would have to do it as one of the final ops, if it were implemented as a slide, but it is not. There are "practical reasons" why
a slide would not be a workable strategy in this case.
So, generally, these parts of the prolog/epilog sequences are emitted
for every combination of saved/restored registers that had been encountered.
Though, granted, when used, it does mean that any such function effectively
needs two sets of stack-pointer adjustments:
One set for the save/restore area (in the reused part);
One set for the function itself (for its data and local/temporary variables
and similar).
Because the entire Epilogue is encapsulated in EXIT, My 66000 can LD
the return address from the stack and fetch the instructions at the
return address while still loading the preserved registers (that were
saved) so that the instructions are ready for execution by the time
the last LD is performed.
In addition, If one is performing an EXIT and fetch runs into a CALL;
it can fetch the Called address and if there is an ENTER instruction
there, it can cancel the remainder of EXIT and cancel some of ENTER
because the preserved registers are already on the stack where they are
supposed to be.
Doing these with STs and LDs cannot save those cycles.
I don't see why not, the branch-predictor can still do its thing
regardless of whether or not LD/ST ops were used.
I have indeed decided that using three base registers for the
basic load-store instructions is much preferable to shortening the
length of the displacement even by one bit.
On 2024-01-25 4:25 p.m., MitchAlsup1 wrote:
Whereas:
funct2:
ENTER R25,R0,stackArea2
...
funct1:
...
EXIT R21,R0,stackArea1
will have registers R0,R25..R30 in the same positions on the stack
guaranteed by ISA definition!!
I like the ENTER / EXIT instructions and safe-stack idea, and have incorporated them into Q+ as ENTER and LEAVE; EXIT makes me think of program exit(). They can improve code density. I gather that the stack
used for ENTER and EXIT is not the same stack as is available for the
rest of the app. This means managing two stack pointers, the regular
stack and the safe stack. Q+ could have the safe stack pointer as a
register that is not even accessible by the app and not part of the GPR
file.
For ENTER/LEAVE, Q+ has the number of registers to save specified as a four-bit number and saves only the saved registers, link register, and
frame pointer according to the ABI. So, “ENTER 3,64” will save s0 to s2, the frame pointer, and link register, and allocate 64 bytes plus the return
block on the stack. The return block contains the frame pointer, link register, and two slots that are zeroed out, intended for exception
handlers. The saved registers are limited to s0 to s9.
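As described above, the return block could be pictured as something like the following struct (the type and field names are just for illustration):

```c
#include <stdint.h>

/* Hypothetical layout of the Q+ return block pushed by ENTER:
 * frame pointer, link register, and two zeroed slots reserved for
 * exception handlers. "ENTER 3,64" would push this block, save
 * s0..s2, and then allocate a further 64 bytes of stack. */
typedef struct {
    uint64_t frame_ptr;  /* caller's frame pointer */
    uint64_t link_reg;   /* return address */
    uint64_t exc_slot0;  /* zeroed, for exception handlers */
    uint64_t exc_slot1;  /* zeroed, for exception handlers */
} qplus_return_block;
```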
Q+ also has a PUSHA / POPA instructions to push or pop all the
registers, meant for interrupt handlers. PUSH and POP instructions by themselves can push or pop up to five registers.
Some thought has been given towards modifying ENTER and LEAVE to support interrupt handlers, rather than have separate PUSHA / POPA instructions. ENTER 15,0 would save all the registers, and LEAVE 15,0 would restore
them all and return using an interrupt return.
On 2024-01-26 11:10 p.m., BGB wrote:
On 1/26/2024 10:58 AM, Robert Finch wrote:
<snip>
Admittedly, it can make sense for an ISA intended for higher-end
hardware, but not necessarily something intended to aim for similar
hardware costs to something like an in-order RISC-V core.
Once there is micro-code or a state machine to handle an instruction
with multiple micro-ops, it is not that costly to add other operations.
The Q+ micro-code costs something like < 1k LUTs. Many early micros used micro-code.
<snip>
Q+ uses a 128-bit system bus; the bus tag is not the same tag as used for
the cache. Q+ burst-loads the cache with 4 128-bit accesses for 512 bits,
and the 64B cache line is tagged with a single tag. The instruction /
data cache controller takes care of adjusting the bus size between the
cache and system.
I think I suggested this before, and the idea got shot down, but I
cannot find the post. It is mystery operations where the opcode comes
from a register value. I was thinking of adding an instruction modifier
to do this. The instruction modifier would supply the opcode bits for
the next instruction from a register value. This would only be applied
to specific classes of instructions. In particular register-register
operate instructions. Many of the register-register functions are not
decoded until execute time. The function code is simply copied to the execution unit. It does not have to run through the decode and rename
stage. I think this field could easily come from a register. Seems like
it would be easy to update the opcode while the instruction is sitting
in the reorder buffer.